Give Schemas the Flexibility to Work with Semi-Structured Data

Relational databases, by nature, tend to have a hierarchical structure. Individual records in one table have multiple related records in another table, and those individual records may have multiple related records in yet another table, and so on. While these one-to-many relationships are to be expected, they can cause issues when we want to aggregate all related records up to report as a single row in Looker.

As with all things SQL, there are many ways we could approach solving this challenge. Among the choices available, Snowflake’s object_agg() is an elegant, efficient, and above all, flexible way to address this problem.

The Rigidness of a Typical Dataset

As an example, let’s say you have a table with all your products; we'll call it "products". Your products are highly customizable, and when a product is selected, options are presented and selected. To keep your ordering system as flexible as possible, you create a field which will allow you to expand those options. For example, say some of your products have color and size options, while others have dimension and weight options. While this satisfies your needs now, you may want to add different product options in the future. Rather than creating separate fields for each option type, you create a table of all options called "options", and a single field called "option_type". Now you can write any value you want into this field without having to completely change your schema.

You’ll need another table to record the option values selected. We’ll call this table "option_responses" and record the values for the option selected in the "option_response_value" field. We’ll link the table to our products table through the product_id. We’ll also link this table to our options table through the optionID. Our schema looks like this:

Now, every product knows about all the options on those products. The difficulty is there are likely multiple records in the option_responses table for every product. The option_responses table looks something like this:

Now, we want to generate a report of every product, the option, and the option value. We need to aggregate all those option_response_value records to the single product.

We’ll need to create a derived table in Looker so we can do the necessary gymnastics to get everything into one record. The question is how we should code our derived table.

Using CTEs

We could create separate common table expressions for each option_type, then bring everything together in a final select statement. Something like this:

This will work, and though verbose it does clearly show where the values are coming from. 

Using Case Statements

A considerably more compact method would be to simply use case statements wrapped in a max() function to grab the only values that should be in the option_responses_value field given the proper option_type, then aggregating on product_id. For example, we could write:

This also works, but it’s a bit of a hack. In addition, neither option is very flexible. If we want to look for different options, or if we add new options to a particular product, or simply just want to see what options are available for a particular product we would need to change our query by either adding new CTEs, or altering the SELECT statement of every query on this data. Even in fairly small implementations this can be a daunting proposal. 

Introducing Flexible Schemas with object_agg()

Enter object_agg(). The object_agg() function is tailor-made for this scenario. You pass the function two fields of type VARIANT and it returns an object of key:value pairs where the key is the first field you pass it and the value is the second field. The syntax would look something like this:

This will return:

Look at the beauty of that code! So succinct! So flexible! No matter how many option_types are associated with the product, object_agg() will return them and the related values in the option_responses table.

Taking it a step further with LookML

LookML is Looker’s abstraction layer over SQL. It enables you to write code that is reusable, collaborative, flexible, organized, and controlled. With your derived table created, you simply define the dimensions you want to pull out of that table. Your LookML code then accounts for the flexibility. You simply create the dimensions which reference the option_type/s you’re looking for.

For example:

If you add, delete, or rename option_types, you only need to change the LookML. Your derived table remains unchanged.

This is only one practical scenario where object_agg() can come to the rescue. I encourage you to find your own use case and write about it so we can all learn from it. I stand on the shoulders of greatness and did not discover this on my own. A client showed it to me. For confidentiality reasons I won’t name them, and the example above is not their actual use case. They know who they are though and this is their shout-out. Thank you for showing me the coolest thing I’ve seen all year.

Jeff Springer is an Analytics Consulting Team Lead at DAS42.