I was reading a post about BigQuery cost optimization that stated the following:
Also remember you are charged for bytes processed in the first stage of query execution. Avoid creating a complex multistage query just to optimize for bytes processed in the intermediate stages, since there are no cost implications anyway (though you may achieve performance gains).
I was curious about this, because I have written some queries specifically to minimize reading data multiple times. I decided to write two simple queries to test out the documentation’s advice.
Query 1 - CTE
WITH a AS (
SELECT
invoice_and_item_number,
vendor_name,
zip_code
FROM `bigquery-public-data.iowa_liquor_sales.sales`
WHERE zip_code IN ('52722', '50314') )
SELECT
COUNT(invoice_and_item_number) AS total_sales,
'52722' AS zip
FROM a
WHERE zip_code = '52722'
UNION ALL
SELECT
COUNT(invoice_and_item_number) AS total_sales,
'50314' AS zip
FROM a
WHERE zip_code = '50314';
Query 2 - No CTE
SELECT
COUNT(invoice_and_item_number) AS total_sales,
'52722' AS zip
FROM `bigquery-public-data.iowa_liquor_sales.sales`
WHERE zip_code = '52722'
UNION ALL
SELECT
COUNT(invoice_and_item_number) AS total_sales,
'50314' AS zip
FROM `bigquery-public-data.iowa_liquor_sales.sales`
WHERE zip_code = '50314';
I expected query 1 to be cheaper, since it only reads the data once (in the CTE), while query 2 reads it twice. But the BigQuery UI says that both queries process the same amount of data (476.4 MiB). I decided to look into the query plans to see how the executions compare.
Query 1 execution plan
Let’s look at the execution plan for query 1. You can focus on S00 and S01, since the other stages are exactly the same between the two queries.
S00: Input
READ
$10:invoice_and_item_number, $11:zip_code
FROM bigquery-public-data.iowa_liquor_sales.sales
WHERE and(in($11, '52722', '50314'), equal($11, '50314'))
AGGREGATE
$80 := COUNT($10)
WRITE
$80
TO __stage00_output
S01: Input
READ
$1:invoice_and_item_number, $2:zip_code
FROM bigquery-public-data.iowa_liquor_sales.sales
WHERE and(in($2, '52722', '50314'), equal($2, '52722'))
AGGREGATE
$70 := COUNT($1)
WRITE
$70
TO __stage01_output
S02: Aggregate+
READ
$80
FROM __stage00_output
COMPUTE
$40 := '50314'
AGGREGATE
$60 := SUM_OF_COUNTS($80)
WRITE
$40, $60
TO __stage02_output
S03: Aggregate+
READ
$70
FROM __stage01_output
COMPUTE
$30 := '52722'
AGGREGATE
$50 := SUM_OF_COUNTS($70)
WRITE
$30, $50
TO __stage03_output
S04: Output
READ
$30, $50
FROM __stage03_output
READ
$40, $60
FROM __stage02_output
WRITE
$20, $21
TO __stage04_output
I was slightly surprised to see that there are two reads from the iowa_liquor_sales.sales table. I naively assumed that the data would be read only once, when it is selected in the CTE.
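Out of curiosity, here is one way to actually force a single scan of the base table: materialize the filtered rows into a temporary table with a multi-statement query. This is a minimal sketch I haven’t benchmarked; the name filtered_sales is my own, and the two follow-up SELECTs are still billed for the (much smaller) temp table bytes:

-- Scan the 476.4 MiB base table exactly once.
CREATE TEMP TABLE filtered_sales AS
SELECT
invoice_and_item_number,
zip_code
FROM `bigquery-public-data.iowa_liquor_sales.sales`
WHERE zip_code IN ('52722', '50314');

-- Both branches now read the temp table instead of the base table.
SELECT
COUNT(invoice_and_item_number) AS total_sales,
'52722' AS zip
FROM filtered_sales
WHERE zip_code = '52722'
UNION ALL
SELECT
COUNT(invoice_and_item_number) AS total_sales,
'50314' AS zip
FROM filtered_sales
WHERE zip_code = '50314';

For a query this simple, the GROUP BY rewrite below (query 3) is the better fix, but materializing can pay off when an expensive CTE is referenced many times.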
Query 2 execution plan
S00: Input
READ
$10:invoice_and_item_number, $11:zip_code
FROM bigquery-public-data.iowa_liquor_sales.sales
WHERE equal($11, '50314')
AGGREGATE
$80 := COUNT($10)
WRITE
$80
TO __stage00_output
S01: Input
READ
$1:invoice_and_item_number, $2:zip_code
FROM bigquery-public-data.iowa_liquor_sales.sales
WHERE equal($2, '52722')
AGGREGATE
$70 := COUNT($1)
WRITE
$70
TO __stage01_output
[... omitting the rest of the stages]
Comparison
As you can see, both plans read from the public dataset twice. The only difference between the two is in the WHERE clauses. The first query has redundant logic:
WHERE and(in($11, '52722', '50314'), equal($11, '50314'))
There is no need to check that the zip code is in ('52722', '50314') and also equal to '50314'. Compare that to the second query’s more straightforward WHERE clause:
WHERE equal($11, '50314')
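The two predicates are logically equivalent: any row satisfying zip_code = '50314' necessarily satisfies zip_code IN ('52722', '50314'). If you want to convince yourself, here is a quick sanity-check query of my own (not part of the comparison above); the two counts are necessarily identical:

SELECT
COUNTIF(zip_code IN ('52722', '50314') AND zip_code = '50314') AS redundant_predicate,
COUNTIF(zip_code = '50314') AS simple_predicate
FROM `bigquery-public-data.iowa_liquor_sales.sales`;

The redundant in(...) check changes nothing about the result; it is just extra work in the scan stage.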
Query 3
The first two queries prove a point: reading the same table twice does not cost more in bytes processed. However, a better version of the query would be:
SELECT
zip_code,
COUNT(invoice_and_item_number) AS total_sales
FROM `bigquery-public-data.iowa_liquor_sales.sales`
WHERE zip_code IN ('52722', '50314')
GROUP BY zip_code;
The amount of data processed is the same as the first two queries (476.4 MiB). However, the query plan is much simpler and only reads from the dataset once:
S00: Input
READ
$1:invoice_and_item_number, $2:zip_code
FROM bigquery-public-data.iowa_liquor_sales.sales
WHERE in($2, '52722', '50314')
AGGREGATE
GROUP BY $30 := $2
$20 := COUNT($1)
WRITE
$30, $20
TO __stage00_output
BY HASH($30)
S01: Output
READ
$30, $20
FROM __stage00_output
AGGREGATE
GROUP BY $40 := $30
$10 := SUM_OF_COUNTS($20)
WRITE
$40, $10
TO __stage01_output
Results
Query | Bytes Processed | Slot Time
---|---|---
Query 1 | 476.4 MiB | 8.219 sec
Query 2 | 476.4 MiB | 9.834 sec
Query 3 | 476.4 MiB | 4.761 sec
As the documentation says, we shouldn’t worry about the number of times the data is read from a cost perspective. However, there may be performance gains if we can avoid reading the data multiple times. The last query consumes roughly half the slot time of the first two. This is not a thorough comparison, but it does indicate that the third query is much faster.
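If you’d rather pull these numbers with SQL instead of from the UI, the jobs metadata views expose both. A minimal sketch, assuming the jobs ran in the US region (adjust the region qualifier otherwise):

SELECT
job_id,
total_bytes_processed,
total_slot_ms -- milliseconds; the UI shows slot time in seconds
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE job_type = 'QUERY'
ORDER BY creation_time DESC
LIMIT 3;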
Note
If you are using flat-rate billing, then you should care about slot time from a cost perspective. As the post mentions:
You might be tempted to think that you don’t have to worry about query optimizations with flat-rate at all. The reality is that it still impacts performance. The faster your query (job) executes, the more number of jobs you will be able to complete in the same amount of time with fixed slots. If you think about it, that’s cost optimization in itself!