I have fallen behind since day 6, but am attempting to make up some ground.

Day 7 proved to be pretty tough. I kept track of the “current” directory by appending all of the cd commands. I then removed the directory before the .., ie one/dir/.. -> one. I attempted to use a recursive regex for this, but unfortunatly BigQuery uses the re2 regex library which does not support recursive syntax. So instead I used a JavaScript UDF for that.

After finding the size of each directory, I then created a CTE with a row for each parent of a directory and the child’s size. I then grouped by each of those directories to sum the sizes of all of the children. There might be a better way to do that, but that’s the solution I arrived at.

Most of these solutions are ending up as a chain of transformations put into CTEs. 🤷

CREATE TEMP FUNCTION
  moveUpDir(directory STRING)
  RETURNS STRING
  LANGUAGE js AS r"""
  while (directory.includes("..")) {
    directory = directory.replace(/,\w+,\.\./g, '');
    directory = '/' + directory.replace(/\/(,\.\.)?/g, '');

  }
  return directory;
""";
WITH
  raw_input AS (
  SELECT
    """$ cd /
$ ls
dir a
14848514 b.txt
8504156 c.dat
dir d
$ cd a
$ ls
dir e
29116 f
2557 g
62596 h.lst
$ cd e
$ ls
584 i
$ cd ..
$ cd ..
$ cd d
$ ls
4060174 j
8033020 d.log
5626152 d.ext
7214296 k""" AS raw_text ),
  raw_rows AS (
  SELECT
    o,
    raw_row
  FROM (
    SELECT
      SPLIT(raw_text, '\n') raw_row_array
    FROM
      raw_input ),
    UNNEST(raw_row_array) raw_row
  WITH
  OFFSET
    o ),
  directory_size AS (
  SELECT
    *,
    REGEXP_EXTRACT(raw_row, r'^\d+') AS size,
    REGEXP_EXTRACT(raw_row, r'^\d+ (.*)') AS filename,
    STRING_AGG( REGEXP_EXTRACT(raw_row, r'^\$ cd (.*)'), ',' ) OVER (ORDER BY o) AS current_directory
  FROM
    raw_rows ),
  processed_dir AS (
  SELECT
    DISTINCT moveUpDir(current_directory) AS current_directory,
    raw_row,
    o,
    size,
    filename
  FROM
    directory_size ),
  cleaned_dir AS (
  SELECT
    DISTINCT CONCAT(current_directory, ',', filename) AS filename,
    size,
    current_directory
  FROM
    processed_dir
  WHERE
    filename IS NOT NULL ),
  total_sizes AS (
  SELECT
    SUM(COALESCE(CAST(size AS INT), 0)) AS total_size,
    current_directory
  FROM
    cleaned_dir
  GROUP BY
    current_directory ),
  all_dirs AS (
  SELECT
    total_size,
    SPLIT(current_directory, ',') dirs,
    current_directory
  FROM
    total_sizes ),
  unnest_dirs AS (
  SELECT
    *
  FROM
    all_dirs,
    UNNEST(dirs) d
  WITH
  OFFSET
    o ),
  agg_dirs AS (
  SELECT
    STRING_AGG(d, ',') OVER (PARTITION BY current_directory ORDER BY o) AS dir,
    total_size
  FROM
    unnest_dirs ),
  grouped_dirs AS (
  SELECT
    SUM(total_size) dir_size,
    dir
  FROM
    agg_dirs
  GROUP BY
    dir
  HAVING
    dir_size <= 100000 ),
  part_1 AS (
  SELECT
    SUM(dir_size)
  FROM
    grouped_dirs ),
  all_grouped_dirs AS (
  SELECT
    SUM(total_size) AS dir_size,
    dir
  FROM
    agg_dirs
  GROUP BY
    dir ),
  min_size AS (
  SELECT
    30000000 - (70000000 - MAX(dir_size)) AS size
  FROM
    all_grouped_dirs )
-- SELECT * FROM part_1;
SELECT
  dir_size
FROM
  all_grouped_dirs
JOIN
  min_size
ON
  min_size.size <= all_grouped_dirs.dir_size
ORDER BY
  dir_size
LIMIT
  1