Modern Data Quality Requires a Rethink
Why are we still talking about data quality? The Twilight Zone? Déjà vu? Every vendor promises a newfangled approach that they claim will cure all the ills of the past. But, we know better.
But, do we?
What if the space of data quality is going through a rethink, driven by the tremendous demand for accurate data from AI and ML use cases? After all, what good do sophisticated models, like ChatGPT, do if they are trained on faulty data? That faulty data may not even be visible to the naked eye: the data itself may be correct, but it arrived in the system a little too late, or it is biased.
What if we have approached the topic of data quality incorrectly and hence have been unable to overcome poor data quality? Treating data quality as a technical problem and not a business problem may have been the biggest limiting factor in making progress. Finding technical defects, such as duplicate data, missing values, out-of-order sequences, and drift from expected historical patterns is no doubt critical, but it is just the first step. A more demanding and crucial step is to measure business quality, which checks whether the data is contextually correct.
Modern data quality is a top-down effort driven by business KPIs and strategic imperatives.
As business teams expand their use of data for new use cases, the stakes are much higher when the data quality lags. Businesses are in a race to leverage data assets faster and don’t want to be slowed down by pesky data quality roadblocks.
Business quality is not optional. Organizations embarking on digital transformations need to reset how they approach data quality in order to become more data-driven, and to use data as a competitive advantage.
This research explores modernization of the data quality space.
The New Rule Book
In 2022, an organization with 1,000 employees has over 150 SaaS applications. Most of these applications store data relevant to their own needs. However, in order to perform cross-organizational analysis, this data needs to be aggregated, enriched, and integrated. This vastly increases the scope of data quality initiatives compared to the past, when all the data came from a handful of internal ERP or CRM applications that stored data in a structured manner. New AI and ML use cases often use synthetic data, which in turn relies on good-quality real-life data.
If we spent the past decade amassing more data, the current decade is more concerned with ensuring we have the right data. Gartner estimates the cost of poor data quality at an average of $15M per organization. This is the decade where new ways of delivering data, such as data mesh, data products, data sharing, and marketplaces, are starting to become mainstream.
Take the example of an orders table in a retail application. Sales taxes in the US differ widely across states, counties, and cities, and they change frequently. Your data quality subsystem should flag an order when it infers that incorrect taxes may have been applied. The sooner an organization can catch and rectify such problems, the lower the cost.
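A minimal sketch of such a business-quality check is shown below, assuming hypothetical column names and a hand-maintained rate lookup; a real system would infer or source jurisdiction rates rather than hard-code them.

```python
# Hypothetical check: flag orders whose applied sales tax implies a rate
# that deviates from the expected rate for the jurisdiction.
# Column names and rates are illustrative only.

EXPECTED_TAX_RATES = {            # (state, county) -> expected combined rate
    ("CA", "Los Angeles"): 0.0950,
    ("NY", "New York"): 0.08875,
}
TOLERANCE = 0.0005                # allow for rounding differences

def flag_suspect_orders(orders):
    """Yield (order_id, implied_rate, expected_rate) for suspect orders."""
    for order in orders:
        expected = EXPECTED_TAX_RATES.get((order["state"], order["county"]))
        if expected is None or order["subtotal"] == 0:
            continue                      # no reference rate, nothing to infer
        implied = order["tax_amount"] / order["subtotal"]
        if abs(implied - expected) > TOLERANCE:
            yield order["order_id"], implied, expected
```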
The title of this section is ironic, because so much of traditional data quality is rule-based. Yes, the rethink requires that we shift from static, predefined rules to discovering the rules that are hidden inside the data. These rules are inferred from patterns that exist in the data, and ML algorithms use them to predict the reliability of new incoming data. When the inferred rules are combined with existing rules, a much richer data quality system emerges.
We have realized the limitations of creating rules and policies in the dynamic and fast-changing world of data. The new frontier is to understand the “behavior” of data using sophisticated ML models, dynamically detect anomalies, and recommend remediation steps. An example of a discovered rule is based on the volume of data that typically enters a system. This volume increases at a steady rate as the business grows, and that rate can be predicted using ML techniques. If there is a sudden, unexplained drift from the expected range, the data quality product should alert the stakeholders. The faster this is done, the more the blast radius of the damage can be contained.
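For illustration, here is a minimal sketch of such a volume check. It substitutes a simple rolling mean and standard deviation band for a full ML model; `daily_row_counts` and `notify_stakeholders` are assumed names, not part of any particular product.

```python
# Learn an expected-volume band from recent history and flag drift.
import statistics

def volume_is_anomalous(daily_row_counts, todays_count, window=30, sigmas=3.0):
    """Return True if today's volume falls outside the band learned from history."""
    history = daily_row_counts[-window:]
    mean = statistics.mean(history)
    std = statistics.stdev(history)
    return not (mean - sigmas * std <= todays_count <= mean + sigmas * std)

# Hypothetical usage:
# if volume_is_anomalous(history, todays_count):
#     notify_stakeholders("Ingested row count drifted outside the expected range")
```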
The figure below shows the fresh approach to handling data quality.
The four pillars of modern data quality are:
Top-down Business KPI
Perhaps the IT teams would have benefited if the term data quality had never been coined and “business quality” had been the goal instead. In that case, the raison d’être of ensuring data is correct would have been to ensure that business outcomes were being met. In this scenario, the focus shifts from data’s infrastructure to its context.
But, what exactly is “context?”
It is the application of business use to the data. For example, the definition of a “customer” can vary between business units. For sales, it is the buyer; for marketing, it is the influencer; and for finance, it is the person who pays the bills. So, the context changes depending upon who is dealing with the data. Data quality needs to stay in lockstep with the context. As another example, country code 1 and the region “US and Canada” may appear to be analogous, but they are not.
Different teams can use the same columns in a table for vastly different purposes. As a result, the definition of data quality varies. Hence, data quality needs to be applied at the level of the business context.
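As a sketch of what context-level quality could look like, the snippet below applies a different fitness rule to the same customer record depending on which business unit is consuming it. The field names and rules are hypothetical.

```python
# Hypothetical context-aware quality rules: the same record, judged
# differently by sales, marketing, and finance.

CONTEXT_RULES = {
    "sales":     lambda c: c.get("buyer_id") is not None,
    "marketing": lambda c: c.get("influencer_segment") in {"A", "B", "C"},
    "finance":   lambda c: c.get("billing_account") is not None
                           and c.get("payment_terms") in {"NET30", "NET60"},
}

def is_fit_for_purpose(customer: dict, context: str) -> bool:
    """Apply the quality rule that matters for the given business context."""
    return CONTEXT_RULES[context](customer)
```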
Product thinking
The concepts evoked by the data mesh principles are compelling. They evolve our thinking so older approaches that might not have worked in practice actually can work today. The biggest change is how we think about data: as a product that must be managed with users and their desired outcomes in mind.
Organizations are applying product management practices to make their data assets consumable. The goal of a “data product” is to encourage higher utilization of “trusted data” by making its consumption and analysis easier by a diverse set of consumers. This in turn increases an organization’s ability to rapidly extract intelligence and insights from their data assets in a low-friction manner.
Similarly, data quality should be approached with the same product management discipline. Data producers should publish a “data contract” listing the level of data quality promised to consumers. By treating data quality as a first-class citizen, producers learn how the data is being used and the implications of its quality.
A data product’s data quality SLA is designed to ensure that consumers know about parameters like the freshness of the data.
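A data contract can be as simple as a small, versioned specification published with the data product. The sketch below is one hypothetical shape for such a contract; the fields and thresholds are illustrative, not a standard.

```python
# Hypothetical data contract published by a data product's producer.
from dataclasses import dataclass

@dataclass(frozen=True)
class DataContract:
    dataset: str
    freshness_minutes: int     # max age of the newest record
    completeness_pct: float    # min share of non-null values in key columns
    uniqueness_keys: tuple     # columns that must be unique
    on_breach: str             # e.g. "alert-owner" or "block-release"

orders_contract = DataContract(
    dataset="retail.orders",
    freshness_minutes=60,
    completeness_pct=99.5,
    uniqueness_keys=("order_id",),
    on_breach="alert-owner",
)
```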
Data Observability
Frequently, the data consumer is the first person to detect anomalies, such as the CFO discovering errors on a dashboard. At this point, all hell breaks loose and the IT team goes into a reactive fire-fighting mode trying to detect where in the complex architecture the error manifested.
Data observability fills the gap by constantly monitoring data pipelines and using advanced ML techniques to quickly identify anomalies, or even proactively predict them so that issues can be remediated before they reach downstream systems.
Data quality issues can happen at any place in the pipeline. However, if the problem is caught sooner, then the cost to remediate is lower. Hence, adopt the philosophy of ‘shift left.’ A data observability product augments data quality through:
- Data discovery: extracts metadata from data sources and from all the components of the data pipeline, such as transformation engines and reports or dashboards.
- Monitoring and profiling: for data in motion and at rest. What about data in use?
- Predictive anomaly detection: uses built-in ML models to predict expected behavior and flag deviations before they reach downstream systems.
- Alerting and notification: notifies stakeholders as soon as an issue is detected (a minimal sketch of such a check follows this list).
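The sketch below shows what one such monitoring check might look like: verify the freshness of a dataset against its SLA and alert on a breach. `get_max_loaded_at` and `send_alert` are assumed stand-ins for whatever metadata source and notification channel an observability product uses.

```python
# Hypothetical freshness monitor with alerting.
from datetime import datetime, timedelta, timezone

FRESHNESS_SLA = timedelta(minutes=60)

def check_freshness(dataset: str) -> None:
    """Alert if the dataset's newest record is older than the SLA allows."""
    last_loaded = get_max_loaded_at(dataset)            # assumed helper
    age = datetime.now(timezone.utc) - last_loaded
    if age > FRESHNESS_SLA:
        send_alert(                                      # assumed helper
            f"{dataset} is stale: last load {age} ago (SLA {FRESHNESS_SLA})"
        )
```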
Data quality is a foundational part of data observability. The figure below shows the overall scope of data observability.
Overall Data Governance
The data quality subsystem is inextricably linked to overall metadata management.
On one hand, the data catalog stores defined or inferred rules, and, on the other hand, DataOps practices generate metadata that further refines the data quality rules. Data quality and DataOps ensure that the data pipelines are continuously tested with the right rules and context in an automated manner and alerts are raised when anomalies are inferred.
In fact, data quality and DataOps are just two of the many use cases of metadata. Modern data quality is integrated with these other use cases as the figure below shows.
A comprehensive metadata platform that coalesces data quality with the other aspects of data governance improves collaboration between business users, such as data consumers, and the producers and maintainers of data products. They share the same context and metrics.
This tight integration helps in adopting the shift left approach to data quality. Continuous testing, orchestration and automation help reduce error rates and speed up delivery of data products. This approach is needed to improve trust and confidence in the data teams.
This integration is the stepping stone for enterprise adoption of modern data delivery approaches of data products, data mesh, and data sharing options, like exchanges and marketplaces.
Benefits of the Modern Data Quality Approach
The goal of a data quality program is to build trust in data. However, trust is an expansive and often ill-defined term that can include many topics related to controlling and managing data. Trusted data is possible when all the components of the metadata management platform work as a single unit. For example, without accurate data, it is very difficult to ensure that data security and privacy programs will work as envisaged.
This should be a primary goal of chief data officers (CDOs).
But so many organizations have failed to deliver on multiple data governance attempts that the term is now banned. However, the reality is that global compliance requirements are only increasing, and irrespective of what we call the data governance program, it is imperative that business quality be addressed.
The benefits of the modern data quality approach are:
- Accountability
In the decentralized data delivery world of data mesh and data products, the modern approach allows business teams to take charge of data quality. After all, the domain owners are the subject matter experts and know their data the best.
Business users augment the technical aspects of data quality by adding the context needed to meet critical KPIs. Data quality then becomes a committed SLA in the packaged data products, and it constantly evolves as the data changes. Hence, data products get new versions. The data consumer no longer has to second-guess whether to trust the data.
- Speed of delivery
‘Data quality latency’ is the time between the arrival of new data and the completion of data quality checks and remediation on it. Modern tools should be able to keep this latency to a minimum by checking data continuously as it arrives, rather than waiting for periodic batch runs.
More data is now generated in external sources, such as SaaS products, than in internal systems, and it comes in multiple formats and often arrives as real-time streams. Past techniques of landing the data in a single target location and performing data quality checks as a batch operation are no longer sufficient. The old static approach treated data quality as a standalone effort on data at rest that ran only at fixed intervals.
The modern ‘continuous quality’ approach is proactive and dynamic. It is in sync with DataOps principles, which include orchestration, automation, and CI/CD. This approach allows data teams to deliver data products faster. It permits organizations that were used to doing one release per quarter to accelerate and deliver many releases a week.
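One way to picture continuous quality is as ordinary tests that run on every pipeline execution, in the same CI/CD flow that ships the data product. The pytest-style sketch below assumes a hypothetical `load_new_batch` helper.

```python
# Hypothetical data tests executed by the pipeline's CI/CD on every run.

def test_no_duplicate_order_ids():
    batch = load_new_batch("retail.orders")              # assumed helper
    order_ids = [row["order_id"] for row in batch]
    assert len(order_ids) == len(set(order_ids)), "duplicate order_id values"

def test_tax_amount_is_present():
    batch = load_new_batch("retail.orders")
    missing = [row for row in batch if row.get("tax_amount") is None]
    assert not missing, f"{len(missing)} orders are missing tax_amount"
```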
- Higher Productivity
One reason traditional approaches to data quality are unsuccessful is the enormous amount of effort and time needed to achieve the ultimate goal. Precious staff are bogged down manually fixing data quality problems in downstream systems. Often, the time-consuming reconciliation takes place in Microsoft Excel spreadsheets. This treats the symptoms, not the problem.
The modern approach of identifying and remediating the problems close to their origin saves time and cost. Through various automation capabilities offered by DataOps and through integration with the other aspects of data governance, this approach leads to higher productivity of the data teams.
Once data quality issues are addressed systematically, data teams can redirect the time previously spent on manual reconciliation toward delivering new data products and insights.
- Cost
As data volumes keep increasing, continuous quality requires a system that scales automatically. This is typically where cloud-based solutions help. However, even in the cloud, there are two ways to run data quality checks: one is via an agent that constantly monitors data in motion, and the other is to push the checks down to data at rest in the cloud data warehouse using its pushdown features. Each option serves unique use cases and comes with architecture and cost trade-offs.
In the former approach, data quality issues are detected before the data lands in a target analytical system. This is useful for anomaly detection on streaming data. However, it requires a processing engine, such as an Apache Spark cluster.
In the latter case, data first lands in an analytical system, such as Snowflake, and the data quality product then generates SQL queries that run right inside the storage engine. This option minimizes data movement and hence may be more secure. It can also take advantage of the analytical system’s auto-scaling features.
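As a rough sketch of the pushdown option, the quality tool can generate a SQL probe and run it inside the warehouse, so only a small metric leaves the engine. The Snowflake-flavored SQL and the `run_in_warehouse` helper below are illustrative assumptions, not a specific product’s API.

```python
# Hypothetical pushdown check: compute a null rate inside the warehouse.

def null_rate_sql(table: str, column: str) -> str:
    """Build a Snowflake-style query returning the null rate of a column."""
    return (
        f"SELECT COUNT_IF({column} IS NULL) / NULLIF(COUNT(*), 0) AS null_rate "
        f"FROM {table}"
    )

sql = null_rate_sql("ANALYTICS.RETAIL.ORDERS", "TAX_AMOUNT")
null_rate = run_in_warehouse(sql)        # assumed wrapper around a connector
if null_rate is not None and null_rate > 0.005:
    raise ValueError(f"TAX_AMOUNT null rate {null_rate:.2%} exceeds threshold")
```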
Architects should analyze the total costs of each option to assess the appropriate architecture.
Summary
In a world where so much emphasis has been placed on the speed and agility of analytics, data quality has suffered. However, the modern approach to data quality is once again making it a first-order problem, without which modern analytics are rendered incomplete. The focus is shifting from solely checking technical dimensions, like completeness, uniqueness, and integrity, to reliability, trust, and contextual accuracy.
The fresh approach also addresses the speed of delivering data quality within new concepts, like data mesh and data products. Modern data quality platforms increase data utility and directly contribute to companies’ strategic initiatives, such as operational excellence, competitive advantage, higher revenue, and an enhanced reputation.