Google Cloud BigQuery comes with all sorts of built-in analytics and AI capabilities. Because of its nature as a columnar data store, however, BigQuery SQL syntax can sometimes be non-intuitive to work within some regards. This includes removed nested structs in BigQuery.
Google recommended best practices for query performance and cost optimization are to denormalize your data, or in other words to take normalized data (i.e., data that is split across multiple tables) and combine it into a single table using specialized data types like arrays and key-value stores.
Query Syntax for Denormalized Data Structures
Querying denormalized data may be more performant than joining data at scale but querying the data isn’t that easy if you’re used to working with data in normalized tables. This goes over some query syntax for common and less-common denormalized data structures in BigQuery, using a mock GCP billing dataset that has the same structure as the BigQuery billing export.
Basic structs, or key-value fields, are straightforward enough, as you can simply use dot notation to select subfields:
— service is a STRUCT field and id is one of its keys
SELECT service.id AS service_id
FROM `gcp-sandbox-213315.gcp_billing_demo.gcp_billing_export`
LIMIT 1000
When working with arrays or repeated fields, things get a little bit more complicated. You can select individual components of an array using the array index, which can be useful if the order of elements is predetermined or doesn’t matter (you can see from the first two rows of the results below that in this case, order is not predetermined)
SELECT credits[OFFSET(0)] as credit_offset_0,
credits[OFFSET(1)] AS credit_offset_1
FROM `gcp-sandbox-213315.gcp_billing_demo.gcp_billing_export`
WHERE ARRAY_LENGTH(credits) > 1 limit 1000
You also have the option to flatten the data using what’s called a correlated cross join. This takes any repeated field, pivots it so that each element in the array is a new row, and then joins that new tabular data with the original table, creating a flattened schema with repeated rows for every element in the original repeated field.
SELECT billing.credits,
c.*
FROM `gcp-sandbox-213315.gcp_billing_demo.gcp_billing_export` billing, UNNEST(credits) c
WHERE ARRAY_LENGTH(credits) > 1 limit 1000
This is useful if you need to flatten your data and calculate aggregate values or metrics based on the data contained in an array – for example, if you need to calculate the total credit amount per credit type for GCP consumption that was invoiced in the month of December 2019, you would use the following query:
SELECT c.NAME AS credit_type,
SUM(c.amount) AS total_credit_amount
FROM `gcp-sandbox-213315.gcp_billing_demo.gcp_billing_export` billing,
unnest(credits) c
WHERE billing.invoice.month = ‘201912’
GROUP BY credit_type
ORDER BY credit_type
But what if you simply need the arrayed data to be pivoted so that it can be consumed by another system or exported into spreadsheet software like Excel or Google Sheets? Again, because of its nature as a columnar data store, BigQuery doesn’t easily support syntax that would be the equivalent of something like the Pandas library’s unstack method.
Repeated Struct Objects
As a specific example, let’s take two of the four repeated fields in the BigQuery billing export:
-
- <
li
-
- ” aria-level=”1″>
ProjectLabels
-
- <
li
-
- ” aria-level=”1″> Resource
Labels
If you’re not already familiar with the concept, project and resource labels allow you as an organization to apply business concepts and processes to GCP compute resources. Because the labels applied to your compute resources are propagated to your billing data, if you set up labels correctly and consistently, you can break down your GCP cost consumption by different environments, organizational departments, or initiatives. You can also initiate processes like chargebacks to individual departments.
Long story short, labels are or should be an important aspect of your GCP cost management strategy. This makes it all the more frustrating that both the project labels and resource labels in the BigQuery billing export are in fields of repeated STRUCT objects, or in other words, are written to the table as arrays of key-value pairs.
If you do a correlated cross join on the project labels in this instance, because the cost field is not contained within the project label array in the same way the credit costs are associated with specific credit types within the credit array, flattening the project label data causes the cost field to be repeated once per project label. Same goes if you do a correlated cross join on the resource labels.
SELECT project,
project_labels,
cost
FROM `gcp-sandbox-213315.gcp_billing_demo.gcp_billing_export` billing,
unnest(project.labels) project_labels
WHERE cost > 1
If you tried to group by and sum in this instance, your GCP costs would potentially be much higher than your actual costs, depending on how many labels you applied to each project. If your table has a unique identifier for each row, you could use that to join against a flattened array that filtered out all but one label, and could do so repeatedly to integrate each different project label. But the billing data export does not have unique row identifiers, so that’s not really an option here. So what can you do?
Nested Structs in BigQuery
All hope is not lost, thanks to this article from Lak Lakshmanan of Google Cloud, which explores some powerful querying patterns in BigQuery. This is helpful in solving the billing problem described above, which was the application of expression subqueries to unpack nested struct objects in BigQuery tables. You can see below an example of how you can use this querying syntax to unstack project label data, creating a tabular result that can be easily worked within BigQuery or other data analytics systems:
SELECT project.labels,
(
SELECT value
FROM Unnest(project.labels)
WHERE KEY=’creator’) AS project_creator,
(
SELECT value
FROM Unnest(project.labels)
WHERE KEY=’account’) AS project_account,
cost
FROM `gcp-sandbox-213315.gcp_billing_demo.gcp_billing_export`
WHERE array_length(project.labels) > 0
AND cost > 1
Getting Started with BigQuery
Learn more about BigQuery and data modernization today by contacting one of our cloud experts.