At Dutchie, we've grown from handling a small number of transactions per day to a large number of transactions per minute. Through this growth, we've changed how we build analytics for our users and our company. This article is about a common concept and a simple tool that allowed us to scale our customer analytics quickly and cleanly.
As a startup, our first pass at analytics was very ad hoc. We had a simple page for sales analytics that was calculated on the fly; in its final incarnation it was a gnarly MongoDB aggregation. Many hours were spent squeezing out the last bit of performance, and then we moved on.
We've been moving away from MongoDB for our relational data, which is most of our application. The rest of this journey involves order data summarized from MongoDB into hourly-level tables in PostgreSQL.
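For context, here is a rough sketch of what those tables look like. This is illustrative only; the column names match the example model later in this article, but the real tables are wider.

-- Illustrative sketch of the hourly rollup tables; everything
-- beyond the column names used later is an assumption.
create table dw_hourly_orders (
  hour_in_utc             timestamp not null,  -- hour bucket, in UTC
  dispensary_identifier   text      not null,
  order_type              text      not null,
  is_medical              boolean   not null,
  coupon_identifier       text,                -- null when no coupon was used
  num_orders              integer   not null,
  amount_in_home_currency numeric   not null
);

create table dw_dispensaries (
  dispensary_identifier text primary key,
  timezone_name         text    not null,      -- e.g. 'America/Los_Angeles'
  is_test               boolean not null
);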
We rolled those hourly tables up to daily tables using materialized views. Then Dutchie's business exploded in 2020, and those materialized views took longer and longer to refresh. Significant time was spent optimizing them, but ultimately there was no way forward.
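That earlier approach looked roughly like the sketch below (simplified, not our exact view). The key property is that PostgreSQL recomputes the whole view on every refresh:

-- Simplified sketch of the old approach: a daily materialized view.
create materialized view dw_daily_orders as
select
  date_trunc('day', hour_in_utc)::date as day_in_utc,
  dispensary_identifier,
  sum(num_orders) as num_orders,
  sum(amount_in_home_currency) as amount_in_home_currency
from dw_hourly_orders
group by 1, 2;

-- Run on a schedule; the cost grows with the full history, not with
-- the amount of new data.
refresh materialized view dw_daily_orders;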
There were a number of options for resolving this issue. Building on top of a data warehouse like Snowflake was one. Snowflake provides a very strong SaaS data warehouse that can be treated like your own data store through its builder interface, and we are building internal BI functionality on it. But we are early in learning about Snowflake and had concerns about the amount of work required to get data into Snowflake, processed, and then queried while still meeting our data freshness requirements. So we chose not to use Snowflake for this problem at this time.
The issue with our existing functionality wasn't that the design was bad. Using rollups to provide performant analytics is a common idea that we would employ in other systems; using PostgreSQL materialized views to implement it was the problem. So we turned to dbt (data build tool). dbt is the simplest of tools for the structured management of derived data (e.g. rollups). It provides the ability to define, relate, and schedule these transformations. The transformations are, by design, table to table within the same database. The models (definitions) are SQL statements slightly enhanced with macros, the relationships between them are expressed by macros within those SQL statements, and the schedule is driven by job definitions. dbt does exactly what it says within a fairly narrow range of functionality. You can run dbt on your own infrastructure or on dbt Cloud.
An example dbt model might look like this:
{{
  config(
    materialized='incremental',
    unique_key='pmk'
  )
}}

-- pmk is a synthetic primary key: one row per day / dispensary /
-- order type / medical flag / coupon. dbt uses it to upsert rows
-- on incremental runs.
select
  concat(
    date_trunc('day'::text, timezone('UTC'::text, ho.hour_in_utc), d.timezone_name),
    '-', ho.dispensary_identifier,
    '-', ho.order_type,
    '-', ho.is_medical,
    '-', coalesce(ho.coupon_identifier, 'na')
  ) as pmk,
  -- the UTC hour truncated to a day in the dispensary's local timezone
  date_trunc('day'::text, timezone('UTC'::text, ho.hour_in_utc), d.timezone_name)::date as day_in_tz,
  ho.dispensary_identifier,
  ho.order_type,
  ho.is_medical,
  ho.coupon_identifier,
  sum(ho.num_orders) as num_orders,
  sum(ho.amount_in_home_currency) as amount_in_home_currency
from dw_hourly_orders ho
join dw_dispensaries d on d.dispensary_identifier = ho.dispensary_identifier
where not is_test
{% if is_incremental() %}
  -- on an incremental run, only reprocess the most recent day of source data
  and hour_in_utc > (select now() - interval '1 day')
{% endif %}
group by pmk, day_in_tz, ho.dispensary_identifier, ho.order_type, ho.is_medical, ho.coupon_identifier
This builds a rollup of our hourly orders data grouped by day, dispensary, order type, whether the order is medical or recreational, and whether a specific coupon was used. It reduces the data from the orders table by an order of magnitude, which significantly speeds up our ability to deliver daily order data to our customers.
When run incrementally, this model looks at only the past day of data when refreshing the rollup. The macros both declare that the model is incremental and let the model define what incremental means for it.
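Operationally, the same model supports both an incremental refresh and a full rebuild; the model name below is hypothetical:

# incremental run: only the last day of source data is reprocessed
dbt run --select daily_orders

# full rebuild: recompute the whole table from scratch
dbt run --select daily_orders --full-refresh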
One of the core issues with our materialized view refresh times was that PostgreSQL recomputes the entire view on every refresh, even when almost none of the source data has changed. I'm not certain when this started breaking down, but by the point over 99% of the source data was unchanging, the refresh had become too slow. PostgreSQL has no incremental update for materialized views; dbt, on the other hand, lets a model be built for and run incrementally. This allows us not only to develop more specific rollups faster but also to have them run faster. That increase in engineering and runtime efficiency is hard to beat.
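For a PostgreSQL target with a unique_key, an incremental run boils down to roughly the following (a paraphrase of dbt's delete+insert strategy; the exact SQL dbt generates differs in its details):

-- 1. materialize the model's select, with the is_incremental() filter
--    applied, into a temporary table
create temporary table daily_orders__tmp as
select ...;  -- the model SQL, limited to the last day of data

-- 2. delete target rows that are about to be replaced, matched on the unique key
delete from daily_orders
where pmk in (select pmk from daily_orders__tmp);

-- 3. insert the freshly computed rows
insert into daily_orders
select * from daily_orders__tmp;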
Our growth pattern at Dutchie means that we will be putting much more time and effort into our internal and external analytics tooling. The standardization of how we build rollups is a small but important part of that process. There is a lot more to do to build out data infrastructure at Dutchie. In addition to traditional business and customer analytics, we will be adding support for modern analytics and machine learning.