It’s no secret that companies of all sizes are collecting more and more data with each passing day. Snowflake’s 162% net revenue retention, and the fact that more data has been created and captured in the last 2 years than in every prior year combined, give us concrete data points on the sheer scale of data growth. Data is quickly becoming a business’s best asset and everyone wants to become data driven, but can they? Most of this data goes unused. Over the last couple of years the tooling around managing data and data pipelines has become very strong (I’ve gone over this in my posts here and here). This has resulted in an explosion of use cases for data across BI, machine learning and data science. However, as the amount of data companies consume increases, so does the complexity of their data pipelines. This has given rise to an entirely new problem for data teams to manage - monitoring the quality of their data and data pipelines.
The concept of monitoring key business processes has existed for quite some time. Network teams use SolarWinds, Riverbed or Cisco. Application teams use New Relic, AppDynamics or SignalFx. Infrastructure teams use Datadog, Dynatrace or Elastic. And recently we’ve seen all of these monitoring “point solutions” merge together to offer a single observability platform. Companies like Datadog now have solutions for network, application and infrastructure monitoring, and have also layered in logging. The complete monitoring package is bundled up into an observability suite. But what do data teams have to monitor their data pipelines and the quality of their data? Nothing! I believe this is an incredibly nascent market that will explode in the coming years and I’ll get into why in this post.
Historically, data teams haven’t needed anything to monitor their pipelines for two main reasons. First, the pipelines themselves simply weren’t complex enough. Data originated from one place, was moved to one database, and it was very easy to find anomalies or errors with simple unit testing-like functionality (this is an overly simplified take). The number of “steps” or “hops” data took along a typical pipeline was so small that monitoring could be managed manually. Second, data wasn’t (yet) mission critical. It wasn’t being used as frequently to make important strategic decisions, and it wasn’t used “in production.” As we’ve seen data used more frequently in production, managing the quality of the data has become mission critical. Let’s use an example of a modern lending platform or insurance business. Both the lending and insurance businesses in this example use complex machine learning models to underwrite loans and insurance policies. Data from many different business segments (internal claims data, external claims data, customer success department, onboarding team, marketing data, etc.) is all piped into the underwriting models to train them. But what happens if that data is somehow “broken” at some point between when it’s created and when it’s fed into the model? Now you’re training models on incorrect or incomplete data! These businesses may start underwriting unprofitable policies. And the hardest part of all this? You may not even notice you had a data quality issue until months later when issues start showing up in your P&L. By then it’s way too late. In summary - as we’ve seen data pipelines increase in complexity, and the mission criticality of data increase, the need for a data monitoring (or data observability) platform has dramatically increased.
Before moving on, I want to give a quick example that I’m sure will resonate with a wide segment of the audience reading this post. I’m quite confident that any reader here at some point has prepared some sort of dashboard for management or an exec team. Maybe you work in customer success and you’re preparing a churn report / dashboard. Maybe you work in marketing and are preparing a report / dashboard summarizing the efficacy of a recent campaign. Or you’re in product preparing a report on product usage metrics. You get the point. 95% of the time the outputs of the charts you’re preparing are normal. But every so often something changes DRAMATICALLY. Something just looks “off.” You’re then left scrambling to figure out if something fundamental in the business really changed, or if there’s a data quality issue somewhere “upstream” in your pipeline and the results in your dashboard are artificially skewed. The main point I want to make - answering that question can be incredibly difficult and time-consuming without a modern data observability platform. It takes forever to swim upstream to every hop of the data pipeline and talk to every data steward at each hop to ultimately figure out where the error originated. Sometimes it’s as simple as “the marketing team changed the input field on the website form” and that wasn’t reflected in how the data was stored. Sometimes jobs fail to run and you’re just missing data. Sometimes data is transformed incorrectly. There are many ways data can go “wrong.”
There are many different vectors on which to evaluate your data quality. Instead of writing my own list, I will summarize the thoughts of an expert in this space. Barr Moses is the CEO of Monte Carlo, a modern data observability / reliability company, and she recently authored an incredible blog post on the different pillars of data quality. She’s a very active writer discussing key challenges (and solutions) around data quality. Highly suggest following her blog and Twitter! In her blog post (and again I’m paraphrasing / quoting her work) she described the different pillars of data quality as:
Freshness: This is one of the most common data quality problems - Is your data up to date? If you expect a table to be updated at certain intervals, or you expect a job (join, aggregation, update, etc.) to be run at certain intervals and for some reason it doesn’t, you have a data freshness issue.
Distribution: This describes the “shape” of your data at a field level (i.e. looking at how frequently a certain type of value occurs). The simplest example is looking at null values in a data set. If you typically have 10% of values represented as null, but all of a sudden you have 20% of your values as null, you probably have a data distribution issue.
Volume: This one is self-explanatory. Comparing your data volumes to historical levels is an easy way to spot deviations.
Schema: Data schema is the blueprint for how data is organized / constructed. Barr describes some typical data schema issues: “Fields are added or removed, changed, etc. tables are removed or not loaded properly, etc. So auditing or having a strong audit of your schema is a good way to think about the health of your data as part of this Data Observability framework.”
Lineage: Data lineage in many ways is the map of how data moves throughout your organization. It ultimately helps users figure out how any of the first 4 pillars affect downstream users (or spot which upstream system caused the initial issue). Lineage will also include metadata management.
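To make the pillars a bit more concrete, here is a minimal Python sketch of what a check for each one might look like. It is purely illustrative and is not how Monte Carlo or any other vendor implements these checks. All function names, thresholds, and the dictionary-based representations of schema and lineage are assumptions made up for this example.

```python
from datetime import datetime, timedelta

def check_freshness(last_updated, now, max_age):
    """Freshness: True if the table was updated within the expected interval."""
    return now - last_updated <= max_age

def null_rate(values):
    """Distribution: fraction of values in a field that are null (None)."""
    if not values:
        return 0.0
    return sum(v is None for v in values) / len(values)

def check_distribution(values, expected_null_rate, tolerance):
    """True if the observed null rate is within `tolerance` of the usual rate."""
    return abs(null_rate(values) - expected_null_rate) <= tolerance

def check_volume(row_count, historical_counts, max_relative_change):
    """Volume: True if today's row count is close to the historical average.

    Assumes `historical_counts` is non-empty and its average is non-zero.
    """
    baseline = sum(historical_counts) / len(historical_counts)
    return abs(row_count - baseline) / baseline <= max_relative_change

def schema_diff(expected, actual):
    """Schema: report fields that were added, removed, or changed type.

    Both arguments are hypothetical {field_name: type_name} dictionaries.
    """
    shared = set(expected) & set(actual)
    return {
        "added": sorted(set(actual) - set(expected)),
        "removed": sorted(set(expected) - set(actual)),
        "changed": sorted(f for f in shared if expected[f] != actual[f]),
    }

def upstream_of(table, lineage):
    """Lineage: every table that feeds `table`, directly or indirectly.

    `lineage` is a hypothetical {table: [direct parent tables]} dictionary.
    """
    seen = set()
    stack = [table]
    while stack:
        for parent in lineage.get(stack.pop(), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen
```

In the dashboard example from earlier, something like `upstream_of("churn_dashboard", lineage)` would give you the list of tables to inspect when a chart looks “off,” and the other four checks could then be run against each of them. A real platform would derive expected values and the lineage graph automatically rather than having someone pass them in by hand.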
These are all incredibly important pillars of data observability. In the next couple of years I think we’ll see an explosion of different solutions attacking this problem as it becomes more acute for data teams across the world (of all sizes). At the end of the day, if you can’t trust your data, how can you use it? I believe the best solutions will use machine learning to do anomaly detection on data quality issues, and tie in alerting and root cause analysis functionalities to quickly remedy issues that arise. Rules-based frameworks simply won’t scale.
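As a rough illustration of the difference between a hand-written rule and a threshold learned from the data itself, here is a small Python sketch. To be clear, it uses a simple z-score over recent history, which is a basic statistical baseline and not machine learning; it stands in for the far more sophisticated models a real product would use. The function name and the default threshold of 3 standard deviations are assumptions for this example.

```python
from statistics import mean, stdev

def is_anomalous(history, latest, threshold=3.0):
    """Flag `latest` if it sits more than `threshold` standard deviations
    away from the mean of `history` (e.g. daily row counts or null rates).

    The acceptable range comes from the metric's own history, so nobody
    has to hand-tune a separate rule for every table and field.
    """
    if len(history) < 2:
        return False  # not enough history to estimate spread
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # History is perfectly flat, so any change at all is a deviation.
        return latest != mu
    return abs(latest - mu) / sigma > threshold
```

The same function could be pointed at any metric from the pillars above (row counts, null rates, minutes since last update), which hints at why a learned baseline is easier to scale than maintaining thousands of manual rules.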
In addition to Monte Carlo there are a number of other startups in this space. Bigeye (formerly Toro Data), Soda Data, Datakin, Datafold, Atlan and Superconductive (behind open source project Great Expectations) are all amazing businesses attacking the core issue of data reliability / observability. Fishtown Analytics also offers a data reliability solution. I’m sure there are names that I’ve left off, and I can’t wait to see this space develop in the coming years!