Is Data Observability Critical to Successful Data Analytics?
Every CDO’s top priority is to show value from their data assets. However, the exponential growth of data volumes, consumers, and disparate use cases has put organizations under additional pressure to quickly tap data and deliver actionable insights. Data teams are also under pressure to improve data culture so that organizations can make more data-driven decisions. To meet these imperatives, data quality, reliability, and efficiency must be of the highest order. Organizations that excel in these areas outperform their competitors.
However, how can one ensure high data quality, reliability, and efficiency?
The answer lies in data observability.
Borrowing from the principles of application and infrastructure observability, data observability provides transparency into the health of data assets from creation to consumption throughout data pipelines. This transparency helps teams proactively detect anomalies, remediate issues faster, and ensure data is available for various data intelligence initiatives. In short, data observability is the key to improving trust in data.
Introduction
Application and infrastructure teams have readily adopted DevOps techniques over the past several years, including observability: monitoring code and network packets for anomalies and proactively notifying the DevOps team when they occur. Data teams, however, fell behind. Data is much more dynamic than apps and infrastructure. It changes rapidly and can be unpredictable. Its volume has been growing rapidly, and increasingly it originates from external sources, such as SaaS products. If an application crashes, a container management system like Kubernetes can quickly auto-recover. However, if a database crashes, it can leave data inconsistent or corrupt.
Hence, the application of observability concepts to data has taken until now to mature and become mainstream. As organizations increase their focus on analytics to drive business value, renewed importance is being attached to data, its provenance, and the trust associated with it.
Business users who want to leverage a wider range of data assets to derive intelligence and attain competitive advantage are pushing data teams to guarantee that corporate data meets high standards. Just as applications and infrastructure adhere to strict service level agreements (SLAs), data assets are now expected to provide the same level of guarantee.
Data observability provides transparency into real-time aspects of the data pipeline, including quality, resource usage, and operational metrics. It captures the unique fingerprints and patterns of data and overlays them with metadata. In other words, it leverages metadata, as figure 1 depicts.
Each use case of metadata serves a different purpose. For example, data catalogs help map technical metadata to business attributes, but they do not know whether the underlying data values are correct. Metadata is also used to ensure authorized access to sensitive data and compliance with the relevant privacy regulations.
The newest use case of metadata is data observability.
Data observability diagnoses the internal state of the entire data value chain by observing its outputs with the goal of proactively rectifying issues.
The data observability space has witnessed torrid growth in the last few years, fueled by overzealous venture capitalists eager to fund companies. This Cambrian explosion of vendors has led to a segmentation of the capabilities offered. Many products provide only a subset of capabilities and define data observability too narrowly. Hence, the next section defines the scope of a comprehensive data observability offering.
Data observability scope and personas supported
Vendors have defined data observability inconsistently in the market, leading to confusion among buyers. For some organizations, its only purpose is to improve data quality; for others, it is to make data pipelines reliable. Yet another emerging use case is to provide insights into costs, an area known as FinOps. These three macro use cases address the needs of the personas responsible for data, infrastructure, and business KPIs.
Figure 2 shows the scope of a comprehensive data observability solution.
A comprehensive data observability solution should address all three areas: quality, pipeline, and operations.
Data Quality (data)
Data quality has been a perennial problem for as long as data has been collected. Most organizations have struggled to improve data quality, leading to a lack of trust in corporate data sets and expensive remediation steps. Some of the issues with typical data quality initiatives include:
- Errors are detected too late in the journey. Often, data consumers are the ones to point out bad-quality data. The later a problem is discovered, the higher the remediation cost.
- Remediation is often manual. It is common for consumers to embark on a reconciliation effort in a tool like MS Excel. This accentuates the problem of multiple sources of truth.
- Data quality is treated as a technical problem instead of a business issue.
In modern data environments, use cases like ML-based predictive analysis are rendered useless if the data’s quality is poor. These environments process vast amounts of multi-structured data, sometimes streaming and arriving at varying intervals. Traditional data quality approaches were not designed to handle these new requirements and hence are ineffective. Data observability provides a more modern approach and differs from traditional approaches in several ways:
- It ‘shifts left’ the monitoring and detection of data issues to the moment data enters the pipeline.
- It uses ML to automatically assess data quality, define rules based on time-series statistical analysis of historical data, and triage anomalies. Data observability products identify patterns and seasonality, as the sketch after this list illustrates.
- It kicks off a customizable incident response workflow to remediate problems faster. This includes notifications and integration with other metadata management systems. It is helpful to deploy solutions that can remediate data quality issues in addition to observing them.
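To make the statistically derived rules mentioned above concrete, here is a minimal sketch of how a quality threshold might be learned from historical metrics rather than hand-written; the metric, its history, and the 3-sigma bound are illustrative assumptions, not the method of any specific product.

```python
# Minimal sketch: derive a data quality threshold from historical statistics
# instead of a hand-written rule. Metric name and history are hypothetical.
from statistics import mean, stdev

# Daily null rate observed for a column over recent runs (hypothetical values).
history = [0.010, 0.012, 0.009, 0.011, 0.013, 0.010, 0.011, 0.012, 0.010, 0.009]

baseline = mean(history)
upper_bound = baseline + 3 * stdev(history)  # 3-sigma limit, a common starting point

def null_rate_ok(todays_rate: float) -> bool:
    """Flag today's run if the null rate drifts beyond the learned bound."""
    return todays_rate <= upper_bound

print(f"learned threshold: {upper_bound:.4f}")
print("today ok?", null_rate_ok(0.035))  # an unusually high null rate -> anomaly
```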
Finally, the data quality process has been elevated to the business layer. The process should operate at the semantic layer, not at the technical metadata level. Through a bi-directional integration with data catalogs, decentralized business unit data teams can use the associated business glossary and the rule templates that a centralized governance team has created. At the semantic layer, these products should detect anomalies not just in one value, but also in relation to other values. For example, it is not enough to check whether a date conforms to the standard format; its value must also be semantically correct in relation to other fields.
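As a concrete illustration of such a multi-attribute check, the sketch below validates two dates that are individually well-formed yet inconsistent with each other; the field names, record, and ship-date rule are hypothetical.

```python
# Minimal sketch of a semantic (multi-attribute) check: each date is valid on
# its own, but invalid in relation to the other field. Names are hypothetical.
from datetime import date

record = {"order_date": date(2024, 3, 10), "ship_date": date(2024, 3, 8)}

def format_valid(value) -> bool:
    """Technical check: the value is a proper date at all."""
    return isinstance(value, date)

def semantically_valid(rec) -> bool:
    """Business check: an order cannot ship before it was placed."""
    return rec["ship_date"] >= rec["order_date"]

print(all(format_valid(v) for v in record.values()))  # True: formats are fine
print(semantically_valid(record))                      # False: semantic violation
```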
Data Pipeline (infrastructure)
Data’s journey from its production to consumption is called a data pipeline. During this journey, the participating systems generate vast amounts of logs. As organizations adopt approaches like microservices architecture, the number of logs increases tremendously.
Imagine a bank where each retail branch sends end-of-day transactions to the head office. Every day, thousands of such pipelines run before the cumulative data is aggregated at the corporate office. It is quite possible that some pipelines will fail and lead to inaccurate corporate reporting. Data observability monitors the ecosystem and detects errors in transmission.
Data observability helps reduce mean time to detect (MTTD) and mean time to resolve (MTTR) for data quality and pipeline problems. Modern tools use no-code / low-code user interfaces that let business users participate in ensuring data reliability meets their thresholds.
In fact, data observability is the glue between data producers and consumers. It builds the lineage of data movement with deep visualization, ranging from high-level concepts down to the columns and the code used to transform data, such as SQL statements.
The difference between data quality and data pipeline observability is that the former inspects data packets to detect drift, while the latter inspects metadata and logs to detect anomalies.
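One common way to inspect the data itself for drift is a distribution comparison such as the Population Stability Index (PSI); the sketch below is a minimal illustration using assumed bucket shares and thresholds, not the approach of any particular product.

```python
# Minimal sketch of detecting distribution drift in the data itself (as opposed
# to scanning logs/metadata), using the Population Stability Index (PSI).
# Bucket boundaries and sample distributions are hypothetical.
import math

def psi(expected: list[float], actual: list[float]) -> float:
    """PSI over pre-bucketed value distributions (shares summing to 1.0)."""
    eps = 1e-6  # avoid division by zero / log of zero for empty buckets
    return sum(
        (a - e) * math.log((a + eps) / (e + eps))
        for e, a in zip(expected, actual)
    )

baseline_shares = [0.25, 0.35, 0.25, 0.15]   # last week's value distribution
current_shares = [0.10, 0.20, 0.30, 0.40]    # today's distribution

score = psi(baseline_shares, current_shares)
print(f"PSI = {score:.3f}")  # a common rule of thumb: > 0.25 signals significant drift
```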
BizOps/FinOps (business)
BizOps and FinOps represent the newest expansion of data observability’s scope and show its continuing maturity. Data observability can be used to measure developer productivity and includes checks across various areas:
- Data quality: Measures such as the percentage of data that is complete and accurate, the number of data errors identified and corrected, and the percentage of data that is compliant with relevant regulations and standards.
- Data utilization: Measures such as the number of data-driven projects, the context for data use (e.g. analytics, model training, corporate reporting), releases or initiatives completed, the percentage of data assets that are being actively used, and the business value generated through the use of data.
- Data security and compliance: Measures such as the number of data security breaches or incidents, the percentage of data that is protected by appropriate security measures, and the percentage of data that is compliant with relevant regulations and standards.
- Data governance: Measures such as the effectiveness of the organization’s data governance framework, the percentage of data assets that are properly classified and managed, and the number of data-related policies and standards implemented.
- Data strategy: Measures such as the effectiveness of the organization’s data strategy in supporting business goals and objectives, the percentage of data-related goals and objectives that are being met, the level of buy-in, and support for the data strategy within the organization.
- Identifying opportunities for data-driven innovations: By monitoring and understanding the data within the organization, the CDO can identify opportunities to use data to drive innovation and business value. For example, the CDO may identify patterns in customer data that could be used to develop new products or services.
- Cost: Measures include identifying wasteful resource consumption and optimizing infrastructure, as well as trend analysis of data costs as organizations ingest ever-increasing data volumes and expand the use cases for exploring data. This analysis helps produce more accurate forecasts and budget allocations, especially to handle seasonality and other patterns.
The role of a CDO is to lead digital transformations and act as a change agent, not to operate at the transactional level of data. As mentioned earlier, the CDO’s biggest business priority is to enable a pervasive data culture within the organization to derive the most business value from data. At the tactical level, they are responsible for data governance.
However, it is common knowledge that the tenure of CDOs is currently limited to about two years.
Why is that? There are two reasons. First, the role is not as well understood as other C-level roles. Second, the CDO does not “own” all the data: some of it sits with the business analytics teams, some with the InfoSec team, and some with the infrastructure teams. Data observability can be the missing link that provides cross-functional insight into the business aspects of data usage.
Data observability plays a key role in meeting the CDO’s KPIs when the metrics derived from the health of data and infrastructure are rolled up to show data and analytics optimization. These metrics should support an organization’s business imperatives, such as reducing cost, increasing revenue, and reducing risk.
Data Observability Capabilities
Data observability is part of the DataOps category that manages agile development, testing of data products, and operationalization of data management.
Getting data from diverse data producers to data consumers to meet business needs is a complicated and time-consuming task that often traverses many products. The incoming data elements are enriched, correlated, and integrated so that the consumption-ready data products are meaningful, timely and trustworthy.
Figure 3 shows the key components of data observability.
Monitor
Continuous monitoring of the data and its accompanying metadata is the most fundamental capability. A data observability product should have connectors to the subsystems of the stack in order to profile data characteristics, calculate statistics, and detect patterns. This information is then stored in a persistence layer and used for the subsequent steps.
A common question is: what data should organizations monitor? The best practice is not to boil the ocean but instead to identify critical data elements (CDEs) based on your strategic business priorities. These priorities vary across organizations. Some may be interested in identifying new sales opportunities, while others may be more concerned with meeting compliance regulations.
Another common question is which data sources an organization should monitor. To work around architectural issues, organizations often create multiple copies of data to serve different consumers. As a result, it is imperative to get stakeholder engagement to identify the correct data sources. Once there is agreement, monitor the in-scope application logs, traces, and data packets. The scope of monitoring includes the following (a minimal profiling sketch follows the list):
- Data and schema drift
- Volume of data
- Data quality dimensions, such as completeness, missing values, duplicates, and uniqueness. DAMA International’s Netherlands team has identified 60 dimensions that comply with the ISO 704 standard.
- Resource usage and configurations
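For illustration, a minimal profiling pass over one batch might compute metrics like the ones below; the sample rows, column names, and chosen statistics are assumptions, and real products compute far richer profiles across many connectors.

```python
# Minimal profiling sketch for one monitoring run: row volume, completeness,
# duplicates, and a schema snapshot for later drift comparison.
# The sample rows and column names are hypothetical.
rows = [
    {"id": 1, "email": "a@example.com", "country": "US"},
    {"id": 2, "email": None,            "country": "DE"},
    {"id": 2, "email": "b@example.com", "country": "DE"},  # duplicate id
]

profile = {
    "row_count": len(rows),
    # Compare against the previous run's snapshot to detect schema drift.
    "schema": sorted(rows[0].keys()),
    "null_rate": {
        col: sum(r[col] is None for r in rows) / len(rows) for col in rows[0]
    },
    "duplicate_ids": len(rows) - len({r["id"] for r in rows}),
}
print(profile)
```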
Data monitoring is not a new concept. What is new is that, unlike the static, point-in-time approaches of the past, modern tools are proactive and continuous, and they span various systems not only to flag anomalies but also to analyze their root cause. Data monitoring can be an expensive operation, so modern tools may use cheaper cloud spot instances.
Analyze
Monitoring is such a foundational need that some basic data observability products provide monitoring with visualization as their only capability. However, most new products provide rich statistical analysis of data movement. The profiled data is compared to the baseline, and hidden patterns are detected.
Similarly, analysis of data and metadata helps detect whether a pipeline has failed or is taking longer than expected. By analyzing the volume of data, inferences can be made about undetected failures and drift. If these anomalies are not handled in a timely manner, they can cause downstream operations to fail or be rendered inaccurate.
Increases in data pipeline latency can be caused by under-provisioned resources when the volume of data grows. In such cases, the overall analytics SLAs may be impacted. A comprehensive data observability product should analyze the traffic and recommend the right resource types. Conversely, it should detect when resources are over-provisioned and wasting money.
Data observability products proactively and dynamically detect drift from expected outcomes. They perform time-series analysis of incremental data over multiple periods, such as daily, weekly, monthly, and quarterly. In this process, they continuously retrain their ML models.
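A simple way to respect seasonality in such analysis is to compare today's metric with the same weekday in prior weeks rather than with yesterday; the sketch below assumes hypothetical row counts and a 3-sigma tolerance and is only meant to illustrate the idea.

```python
# Minimal sketch of a seasonality-aware check: compare today's row count to the
# same weekday in prior weeks rather than to yesterday. Counts are hypothetical.
from statistics import mean, stdev

# Row counts for the last four Mondays (hypothetical); weekend traffic differs
# sharply, so comparing Monday to Sunday would raise false alarms.
same_weekday_history = [1_020_000, 980_000, 1_050_000, 1_010_000]

baseline = mean(same_weekday_history)
tolerance = 3 * stdev(same_weekday_history)

todays_count = 640_000
if abs(todays_count - baseline) > tolerance:
    print(f"volume anomaly: {todays_count} vs expected ~{baseline:.0f}")
```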
Alert
Data observability products proactively inform impacted teams of inferred anomalies. They help resolve issues faster, leading to higher uptime of the pipelines.
One of the biggest problems with observability tools is “alert fatigue”. It happens when the system constantly generates more alerts than the team can consume. As a result, many highly critical alerts are lost in an ocean of notifications and go unattended.
Data observability products handle notifications intelligently to reduce alert fatigue. For example, they may aggregate alerts by category or route them according to customized rules.
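As one simple illustration of such aggregation, raw anomalies can be grouped by dataset and severity before anyone is paged; the alert payloads and grouping keys below are hypothetical.

```python
# Minimal sketch of reducing alert fatigue by grouping raw anomalies into one
# notification per dataset and severity. Alert payloads are hypothetical.
from collections import defaultdict

raw_alerts = [
    {"dataset": "orders",    "severity": "high", "check": "null_rate"},
    {"dataset": "orders",    "severity": "high", "check": "row_count"},
    {"dataset": "orders",    "severity": "low",  "check": "freshness"},
    {"dataset": "customers", "severity": "high", "check": "schema_drift"},
]

grouped = defaultdict(list)
for alert in raw_alerts:
    grouped[(alert["dataset"], alert["severity"])].append(alert["check"])

# Four raw alerts collapse into three notifications; only "high" ones page anyone.
for (dataset, severity), checks in grouped.items():
    print(f"[{severity}] {dataset}: {', '.join(checks)}")
```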
Incident Management
Traditionally, when errors are detected, they are fixed in downstream systems. In other words, we remediate the symptoms, not the root cause. Fixing data problems in downstream applications creates technical debt. This approach is not sustainable, but it is common because IT teams cannot detect the source of errors in a complex pipeline.
Data observability detects problems closer to their origin and assesses their impact on downstream applications (“impact analysis”). It uses the pipeline lineage to visually inspect hot spots. It then kicks off an incident management workflow to quickly remediate the problems.
The incident management process allows collaboration across various business units. It should also automate the necessary steps. However, some data quality remediation happens in the source systems and cannot always be automated. Here, the data observability product integrates with the necessary solutions in the stack to ensure quick resolution.
Some problems should be remediated automatically, with no need for human intervention. For example, if Spark clusters are under-provisioned, the product should not only give recommendations but also kick off a process to auto-tune the configuration parameters within the cost threshold.
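As a rough illustration of such guarded automation, the sketch below scales out a hypothetical Spark job only while a backlog heuristic is exceeded and the hourly budget allows it; the heuristic, prices, and thresholds are invented for this example and do not reflect any product’s tuning logic.

```python
# Minimal sketch of a remediation heuristic: recommend more executors for an
# under-provisioned job, but cap the change at a cost threshold. All thresholds,
# prices, and job metrics are hypothetical.
def recommend_executors(current: int, avg_pending_tasks: int,
                        cost_per_executor_hour: float,
                        budget_per_hour: float) -> int:
    """Scale out while pending work is high, never exceeding the hourly budget."""
    proposed = current
    while avg_pending_tasks / max(proposed, 1) > 50:      # crude backlog heuristic
        if (proposed + 1) * cost_per_executor_hour > budget_per_hour:
            break                                         # stay within cost threshold
        proposed += 1
    return proposed

print(recommend_executors(current=4, avg_pending_tasks=600,
                          cost_per_executor_hour=1.2, budget_per_hour=12.0))
```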
Feedback
We have seen how to detect and remediate anomalies in data, pipelines, configuration, and business operations expediently. The last step is to ensure that data observability is a continuous process that evolves with the system and is used to maintain SLAs.
Operational feedback, such as latency and missing data, is easy to calculate and analyze. But business feedback is what will ensure data observability adoption: for example, the ability to perform multi-attribute data quality checks, or the ability to deploy data products more frequently because of higher transparency. This feedback will ensure that data observability products remain actively deployed and will open the door to even more use cases in the future, as the next section explains.
Use Cases and Benefits
The recent spurt of interest in data observability arises from the unfortunate reality that a vast majority of data in organizations is unreliable. This prevents new projects from being launched in a cost-effective manner. Even when data is leveraged, executives may lack confidence in its veracity.
This paper has mentioned many of the use cases, such as the ability to deploy data products (often as part of a data mesh initiative) at scale. This requires multiple pipelines to run, A/B testing to happen, and a rapid release cycle. One way of democratizing data is through data sharing, which extends governance and privacy to the data consumers. Data observability can provide transparency into how data is being consumed and whether policies are being applied consistently. Overall, these products support digital transformation and cost optimization initiatives.
Why should organizations invest in this space? The key benefits of data observability include:
- Improved decision-making: Data observability can help organizations access and use data more effectively to inform decision-making. By providing visibility into data sources, data quality, and data usage, data observability can help organizations make more informed and accurate decisions.
- Enhanced data-driven innovation: Data observability can help organizations identify patterns and trends in their data that could drive innovation and business value. By providing visibility into how data is being used, data observability can help organizations identify opportunities to use data to drive new products, services, or business models.
- Improved efficiency: Data observability can help organizations identify and address inefficiencies in their infrastructure and processes. By providing visibility into how data is flowing within the organization and how it is being used, data observability can help organizations optimize their use of data and meet FinOps guidelines.
- Enhanced data security and compliance: Data observability can help organizations identify and address potential security risks or vulnerabilities within their data systems. By providing visibility into data access and usage, data observability can help organizations protect their data assets and ensure compliance with data privacy regulations.
- Improved customer experience: Data observability can help organizations use data to better understand and serve their customers. By providing visibility into customer data and usage patterns, data observability can help organizations identify opportunities to improve the customer experience.
- Higher reliability: Data observability proactively identifies and fixes issues, which leads to higher availability and reliability of data pipelines.
Summary
The modern data stack has shifted data teams’ focus to speed and performance, sometimes at the cost of accuracy. However, the ultimate goal of any technical project should be to drive business impact through accurate and reliable decision-making. The operative words here are accurate and reliable. This is the singular focus of data observability. It helps teams efficiently and proactively catch anomalies and defects in the data supply chain so they can be remediated faster.
As the complexity of system architectures increases, the need for data observability grows concomitantly. Current do-it-yourself practices will be replaced by enterprise-ready products that offer a vast number of connectors and machine learning-powered analysis of metrics. Modern data observability tools should span the entire data supply chain to deliver a comprehensive view.
In fact, data observability is rapidly becoming such a fundamental and integral part of the pipeline that its capabilities may be embedded in the pipeline subsystems in the future. Once this discipline is fully incorporated, data teams will achieve the same level of sophistication that application and infrastructure teams have today.