Unveiling the Crystal Ball: 2024 Data and AI Trends

Sanjeev Mohan
25 min read · Dec 22, 2023


As we wrap up 2023, we can all agree that the world irreversibly changed with the introduction of ChatGPT. This momentum shows no signs of slowing either, as the mainstreaming of AI continues to advance with an unabated fury. However, how we react to these changing times requires a leap of faith. AI can be both potentially transformational and elegantly inaccurate, all at the same time! But our future is not just AI, because we still need to continue to up our game on data management.

It is a rite of passage for software providers, industry savants, and analysts to proffer their predictions for the upcoming year. However, with so much news flow, it becomes tricky for an enterprise leader to consolidate and draw a complete picture of what lies ahead. The problem with predictions is that one major new breakthrough can render them inadequate. In December 2022, ChatGPT was only days old and, naturally, most predictions failed to capture its gigantic impact on the 2023 priorities.

In this document, instead of predictions, Rajesh Parikh and Sanjeev Mohan explore the top technologies that are poised to capture our attention in 2024, with a particular emphasis on those that could emerge in the coming year. Readers should use it as a guide to identify priorities and prepare their organizations to pick the right bets.

Figure 1 shows a summary of the 2024 trends which are classified as rising, stable, and declining:

  • Rising: The emergence of new solutions and the growth of current ones open doors for innovative AI-centered platforms and applications that will move us from copilots to autonomous agents.
  • Stable: Existing initiatives from recent years that will likely continue with a stable momentum in 2024, as there continues to be a fundamental need for those solutions.
  • Declining: These trends were strong during 2023 but are likely to lose momentum because they are difficult to use in practice, offer a fragmented approach, or prove too idealistic in real-world deployments.
Figure 1: Summary of top data and AI trends for 2024 categorized as rising, stable and declining

Just like ChatGPT was unleashed in the last few days of 2022, the last few days of 2023 have witnessed the surprising release of multimodal LLMs and embedded small language models on edge devices. This pace of innovation is unprecedented even in the dynamic IT space that we are all accustomed to. Let’s begin with the rising trends.

Rising Trends

The 2024 rising trends are about plumbing and activation, especially for AI with a focus on data quality, platform architecture, and governance. Autonomous agents and task assistants serving multiple information roles can potentially automate needed activity partly or entirely. In addition, tools for generating high-quality datasets can feed constantly improving models at various stages of the AI model development life-cycle. We have identified four rising trends for 2024:

  1. Intelligent Data Platform
  2. AI Agents
  3. Personalized AI Stack
  4. AI Governance

Intelligent Data Platform

Data platforms today are largely a “system of record” stack that brings together data from a variety of enterprise databases and applications in a common repository. The primary use case for this stack today is reporting and analytics, and in very few cases data-driven automation. What could be better than infusing intelligence within the data platform to accelerate adoption of AI data products and applications throughout the enterprise?

We define an intelligent data platform as one where large language model (LLM) infrastructure is part of the core data platform. This intelligence layer can be used to infuse intelligence into two kinds of applications:

  1. Core data applications: These applications include AI-driven data operations, semantic search and discovery agents, AI-aided ingestion tools, AI-aided data preparation and transformation, and conversational AI agents for data analysis. The degree of automation of such applications only improves as the agent reasons and learns from its mistakes.
  2. Intelligent applications: Intelligent AI agents, which we identify as the second rising trend in this document.

Figure 2 shows the diagram of an intelligent data platform along with AI agents and applications.

Figure 2: Components of an intelligent data platform

An intelligent data platform is the next evolution of the current warehouse/lake-centric data platform environment. Along with the push to simplify the consumption interface, intelligent applications will drive the next decade of productivity. In 2024, enterprises need to take a hard look at their current data platform architecture and address the challenges related to data silos, data quality and duplication, and fragmentation of stack components. High-quality, curated data and metadata are key to success with generative AI initiatives. An intelligent data platform, along with associated data applications, is poised to provide the foundational data and modeling layer infrastructure for AI use case enablement.

AI Agents

The term “AI agent” became a buzzword during the latter half of 2023. An AI agent is a program or system that can perceive its environment, reason, break down a given task into a set of steps, make decisions, and take actions to achieve specific goals autonomously, much as humans do.

The holy grail of language understanding meant that humans could converse, direct, and engage with AI programs via a natural language interface. But could AI programs do more than aid with and answer questions about information tasks, such as searching, extracting, or generating code and images?

Can AI agents expand the frontier of task automation into work that today needs far more human intervention, and into cognitive tasks requiring high-level thinking, reasoning, and problem-solving? For example, could they perform tasks such as market analysis, risk assessment, and investment portfolio optimization, or take on complex tasks that have so far resisted automation because of complexity or cost?

There is certainly an economic incentive to test the ability of today’s AI agents and technologies to take on tasks that significantly improve business productivity and the human-machine interface.

Earlier research attempts worked on math-related activity, chain- and graph-of-thought prompting, and LLM-based multi-step reasoning frameworks to demonstrate an ability to automate a complex task. These early results were far from what is needed to build a fully autonomous information agent application, but they showed the potential of what is possible.

Figure 3 shows an architecture that provides a generalized paradigm combining advances in reasoning and acting, along with earlier work such as chain-of-thought prompting, to solve various language reasoning and decision-making tasks. Coupling reasoning and acting with the language model enables these programs to perform decision-making tasks. This paradigm is called “ReAct”.

Figure 3: Schematic block diagram of an AI agent architecture.

AI agents could assist in automating information tasks such as data analysis, BI dashboard development, process optimization, data entry, scheduling, or basic customer support. They can also automate entire workflows, such as supply chain optimization and inventory management. The steps that AI agents take in Figure 3, described below, enable users to dynamically carry out a reasoning task by creating a thought/plan, adjusting that plan while acting, and interacting with external systems to incorporate additional information into the reasoning (a minimal sketch of such a loop follows the list).

  1. The first step in the flow is to select a task and prompt the LLM to break down a question into a set of thoughts (sub-prompts).
  2. Steps 2, 3, and 4 enable the LLM to further decompose these thoughts and to think through and reason about the sub-thoughts.
  3. Steps 5 through 8 enable the LLM to carry out external interactions, such as extracting the information needed to complete the thought/task.
  4. The free-form integration of thoughts and actions is used to achieve different tasks such as decomposing questions, extracting information, performing commonsense/arithmetic reasoning, guiding search formulation, and synthesizing the final answer.
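The sketch below illustrates this reason-act-observe loop in a few lines of Python. It is a toy illustration, not a reference ReAct implementation: the llm() and search_tool() functions are canned stand-ins for a real LLM API and an external tool, and the prompt format is invented for brevity.

```python
# Minimal ReAct-style loop: alternate LLM "thoughts/actions" with tool calls
# ("observations") until the model emits a final answer. llm() and search_tool()
# are toy stand-ins; a real agent would call an LLM API and real tools.

def llm(prompt: str) -> str:
    """Toy stand-in for an LLM call, returning a canned thought/action or answer."""
    if "Observation:" in prompt:
        return "Final Answer: revenue grew 12% year over year"
    return "Thought: I need last year's revenue.\nAction: search[annual revenue]"

def search_tool(query: str) -> str:
    """Toy stand-in for an external tool (search, SQL, API call, and so on)."""
    return "FY23 revenue was $1.12B versus $1.00B in FY22."

def react_agent(task: str, max_steps: int = 5) -> str:
    transcript = f"Task: {task}"
    for _ in range(max_steps):
        step = llm(transcript)                      # steps 1-4: decompose and reason
        transcript += "\n" + step
        if step.startswith("Final Answer:"):        # stop once the plan is complete
            return step.removeprefix("Final Answer:").strip()
        if "Action: search[" in step:               # steps 5-8: act on the environment
            query = step.split("Action: search[", 1)[1].rstrip("]")
            transcript += f"\nObservation: {search_tool(query)}"  # feed the result back
    return "No answer within the step budget"

print(react_agent("How fast did revenue grow last year?"))
```

Real agent frameworks add tool selection, memory, and error recovery on top of this loop, but the basic pattern of interleaving reasoning with external actions is the same.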

AI information agents are a trend we believe will play out over multiple years; however, given their promise, we expect 2024 to be the year when significant progress is made, both in agent infrastructure and tooling and in early adoption. It is worth pointing out that much of what we understand about the ability of current AI architectures to take on more complex tasks is still largely potential, and there are quite a few unresolved issues.

Despite this, enterprises should aim for a practical approach to building agent applications and expect the gaps in current AI technology’s ability to take on increasingly complex automation to shrink with every passing year. They must also assess, use case by use case, the degree of automation possible in the next 12 months. An evolutionary path to such projects is likely to yield far better success.

Personalized AI Stack

Our third rising trend pertains to personalizing or customizing the models and/or their responses through three approaches:

  • Fine-tuning models with more contextual data.
  • Improving data sets used to train or fine-tune models including synthetic data.
  • Using vector search to supply models with relevant data.

Fine-Tuning Models

While foundation models like OpenAI’s GPT-4 present an opportunity for enterprises to prototype the potential of generative AI use cases, they don’t sufficiently address concerns like privacy and security of corporate data, openness vis-à-vis the data used to train such models, the ability to fine-tune them for specific requirements, achieving the desired accuracy for a given task, and the overall cost-value proposition.

To move beyond prototypes and achieve better results, we are likely to see the rise of custom or task-specific Small Language Models (SLMs), especially in niche and vertical applications. These models will leverage base/pre-trained foundation models as a starting point, either training SLMs from them or fine-tuning them with domain/enterprise data.

Figure 4 shows the lifecycle of fine-tuning a model.

Figure 4: Process of fine-tuning large language models

Streamlining the development of custom SLMs, managing the lifecycle of such models, and taking them from experimentation to deployment seamlessly continue to be challenges:

  1. Base LLM selection: The availability of multiple well-regarded options, combined with a lack of detailed supporting evaluations, can make choosing a base model confusing and daunting.
  2. Reference datasets: Reference datasets are needed during instruction and RLHF fine-tuning, as well as during model evaluation and testing. Creating and sourcing reference datasets continues to be laborious, often subjective, and largely human-dependent. Availability of original task/domain-specific or synthetic data can significantly accelerate model development and shrink development time.
  3. Fine-tuning the model: A key training step required to align and adapt the instruction fine-tuned model to real-world performance expectations is applying human feedback. This step helps the model reduce hallucination, bias, and toxicity, and improve safety. Parameter-Efficient Fine-Tuning (PEFT) and RLHF are popular techniques that help fine-tune a base LLM with task/domain-specific context (a minimal PEFT sketch follows this list). While the available techniques have improved significantly, producing the high-quality task/domain-specific prompt-response pairs and reference datasets that capture the required human feedback remains manual, laborious, and prone to variability due to the creative nature of language response validation.
  4. Testing and evaluating the model: Evaluating fine-tuned models that work with the intricacies of natural language depends on the creativity of the task and on response evaluation, which is often manual and subjective. While various metrics and techniques are now available, they are often insufficient on their own to justify a model. Techniques such as using responses from a reference or superior model to generate evaluation datasets help improve the efficiency of the evaluation stage. Models must also be tested for safety, bias, and toxicity.
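To make the PEFT step more concrete, below is a minimal sketch of a LoRA setup using Hugging Face’s transformers and peft libraries. The base model name, target modules, and hyperparameters are illustrative placeholders, not recommendations.

```python
# A minimal LoRA (PEFT) setup with Hugging Face transformers + peft.
# The base model name, target modules, and hyperparameters are placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, TaskType, get_peft_model

base_model_name = "your-org/base-7b-model"  # placeholder: any causal-LM checkpoint

tokenizer = AutoTokenizer.from_pretrained(base_model_name)
model = AutoModelForCausalLM.from_pretrained(base_model_name)

# LoRA trains small low-rank adapter matrices instead of all model weights,
# which is what makes task/domain-specific fine-tuning affordable.
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                                   # rank of the adapter matrices
    lora_alpha=16,                         # adapter scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt
)

peft_model = get_peft_model(model, lora_config)
peft_model.print_trainable_parameters()    # typically well under 1% of the weights

# From here, train peft_model on curated prompt-response pairs (for example with
# the transformers Trainer or TRL), evaluate against reference datasets, and then
# merge or serve the adapter alongside the base model.
```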

The success of the tasks required to fine-tune models depends on the still-immature field of AI governance. A rising trend covered below, AI governance is needed to provide explainability of models, build trust, and meet regulatory compliance. It is also used to monitor prompt responses in real time for degradation of performance, responsible use, cost, and product reliability issues.

The rise of tooling for AI model development, lifecycle management, deployment, and monitoring that addresses the above challenges and simplifies model development and lifecycle management is key to success with SLMs and task-specific models.

Note that task-specific AI models are still experimental, and there are quite a few unresolved issues. As a result, a significant number of such experiments may fail. Despite that, this is a theme which will see a rise in investments across the ecosystem in 2024.

High Quality Data Ecosystem

While models with trillions of parameters, like OpenAI’s GPT-4, broaden their knowledge base, recent experiments have shown that a much smaller model trained on better data may be able to outperform the very large LLMs that OpenAI now calls “frontier models.”

The availability of high-quality datasets for both general and custom models that are free of privacy and copyright issues continues to be a big concern. Most LLM pre-training is based on internet web-scraped datasets, books, and a few experimental datasets that originate in academia or research. While there are a few datasets that can be sourced for the fine-tuning stage, the choice of readily available datasets shrinks further depending on the task/domain.

Oftentimes, there just isn’t enough data to train a model at all. Take the example of fraud. Organizations are, presumably, not rife with rampant fraud and hence have limited visibility into fraud scenarios, yet they need to train a model on a wide range of fraud possibilities.

Synthetic data can be defined as data that is not directly obtained from any real-world source but is artificially created to mimic the properties and characteristics of real-world data. Synthetic datasets could be the answer to making high-quality data available to improve LLM research and development speed in many use cases.

One of the main advantages of synthetic data is that it protects end users’ privacy, avoids copyright issues, and enables enterprises to meet privacy requirements for the original source. It also avoids inadvertent disclosure of information while allowing model research and development to continue. Synthetic data is important to meet the ever-increasing data needs of large language models. With the right solution, the high-quality data needed for large language models can be produced cost effectively, sustaining momentum in AI research, model development, and evaluation. There are also ideas about generating synthetic datasets using the frontier models themselves. Nonetheless, it is clear that creating and using synthetic datasets has the potential to satisfy the need for more data from ever-hungry models.
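As a toy illustration of the fraud example above, the sketch below fabricates transaction records that mimic the shape of real data and deliberately over-represents the rare fraud class. The field names and distributions are invented; real synthetic data pipelines use far more sophisticated generators, including the frontier models themselves.

```python
# A toy, rule-based synthetic data generator: fabricate transaction records that
# mimic the shape of real data and deliberately over-represent the rare fraud
# class. Field names and distributions are invented for illustration only.
import json
import random

def synthetic_transaction(fraud: bool) -> dict:
    amount = random.lognormvariate(3.5, 1.0)               # typical purchase sizes
    if fraud:
        amount *= random.uniform(5, 20)                     # exaggerated amounts
    return {
        "amount": round(amount, 2),
        "hour": random.randint(1, 4) if fraud else random.randint(8, 21),
        "foreign_ip": fraud and random.random() < 0.8,
        "label": "fraud" if fraud else "legit",
    }

# Every fifth record is fraudulent, far more than reality, so a model trained on
# this data sees enough positive examples.
dataset = [synthetic_transaction(fraud=(i % 5 == 0)) for i in range(10_000)]
print(json.dumps(dataset[:2], indent=2))
```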

Given the nature of the task, there are serious incentives for such an ecosystem or service line to play a role in solving the need for high-quality datasets. Various startups and service providers today are exclusively engaged in providing annotated image and text data for generalized training needs. However, there is potential to expand those services to include a long tail of domain/task-specific datasets. This trend will likely see momentum in 2024.

Integrated Vector Databases

Picking a vector database is challenging. There are a variety of factors at play, including scalability, latency, cost, and queries per second. The primary use case for traditional databases is querying on keywords, whereas vector databases search using context. Most enterprise applications will likely need both capabilities. The natural choice, therefore, is to introduce vector database capability inside the traditional DBMS.

Most future enterprise AI applications will need to work with both structured and unstructured data. Managing multiple databases leads to inefficiency, management overhead, and potential race conditions that cause data inconsistencies between OLAP data and the vector indices in vector databases.

An integrated vector database is therefore best suited for applications that need strong querying capabilities along with semantic search. For example, such a database can not only embed an organization’s financial reports but also index those embeddings and store them in the same database, while offering semantic/similarity search capabilities.

Seeing this new workload opportunity, many DBMS and lakehouse players are incorporating vector embedding and search functionality into their existing offerings. Integrated databases/lakehouses with semantic search functionality will likely gain further traction in 2024 as enterprises build and deploy LLM use cases.

The most common technique to build AI applications is retrieval augmented generation (RAG), which combines LLMs and private business data to deliver responses to natural language questions. RAG integrates a flow where vectorized data is first searched for similarity before the LLM completion API is invoked, leading to higher response accuracy.
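The sketch below walks through that flow end to end with toy components: embed() is a bag-of-words stand-in for a real embedding model, the in-memory list stands in for a vector database, and generate() is a placeholder for the LLM completion call.

```python
# A minimal RAG flow: embed documents, retrieve the most similar ones for a
# question, and include them in the LLM prompt. embed() and generate() are toy
# placeholders for a real embedding model and LLM API.
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; a real system would call an embedding model."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def generate(prompt: str) -> str:
    """Placeholder for an LLM completion call."""
    return f"[LLM answer grounded in]\n{prompt}"

documents = [
    "Q3 revenue grew 12% driven by the subscription business.",
    "Headcount remained flat quarter over quarter.",
]
index = [(doc, embed(doc)) for doc in documents]          # offline: vectorize and store

def answer(question: str, k: int = 1) -> str:
    q_vec = embed(question)
    top = sorted(index, key=lambda d: cosine(q_vec, d[1]), reverse=True)[:k]  # similarity search
    context = "\n".join(doc for doc, _ in top)
    return generate(f"Context:\n{context}\n\nQuestion: {question}")

print(answer("How fast did revenue grow in Q3?"))
```

In an integrated vector database, the indexing and similarity search shown here happen inside the same DBMS that holds the source tables, rather than in a separate system.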

We see two trends affecting the RAG use case. One has to do with the increasing LLM context size that can take the input data directly without the need to be routed through a database. This lowers the need to perform the extra (and complex) step of RAG. However, this does not reduce the need for vector databases as they pre-filter prompts to LLMs which makes AI apps cost-effective and performant. They can also cache prompts and their responses, which avoids unnecessary and costly API calls to LLMs for duplicate queries. This curated data can be used in the future to fine-tune the organization’s SLMs.

The second trend has to do with the rise of multimodal models which may make the RAG process more complicated. But, we expect in 2024, the RAG functionality will improve through the use of third-party products like LangChain and LlamaIndex.

AI Governance

Executives are asking their leaders to fast track AI projects, as they are keen to extract unprecedented insights from all their data assets — structured and unstructured. However, IT leaders know that applying AI to the underlying data infrastructure is anything but straightforward. They know that AI applications’ success is predicated on ensuring data quality, security, privacy, and governance. Hence, the need for AI governance. But what is it exactly?

AI governance, like its cousin, data governance, requires a common definition. In fact, AI governance should work hand-in-hand with data governance.

Compared to traditional AI, which was in the realm of a few data scientists, generative AI has a much broader range of users. In addition, gen AI has introduced new concepts of vector search, RAG, and prompt engineering. So, modern AI governance must cater to the needs of multiple personas, such as model owners and validators, audit teams, data engineers, data scientists, MLOps engineers, compliance, privacy and data security teams, etc.

At the highest level, AI governance needs to be applied across two areas:

  • Model training or fine-tuning: Governance tasks include identification of the right data sources, their fidelity, data drift, model weights, and evaluation results. The ability to compare model metrics between versions can further help reveal trends in model performance. In particular, training costs per iteration for different models on CPUs and GPUs are an important consideration for AI governance. Currently, very few vendors are involved in foundation model training due to the very high resource requirements. Far more teams are fine-tuning, as those costs have come down in recent times. As costs fall further, we may see more organizations or departments training their own models.
  • Model usage/inference: Governance tasks need to ensure safe business usage. They include identification of risks and risk mitigation, explainability of models, and the costs and performance of using AI models to achieve business use case goals.

Figure 5 shows the building blocks of the AI governance program.

Figure 5: Building blocks of an AI governance framework

The AI governance program is made up of four building blocks:

Model Discovery

Models are proliferating at a rapid pace, reflecting the dynamic and ever-expanding nature of the field. By the end of 2023, Hugging Face was nearing half a million models. The problem is that when these models show up in an AI framework like Google Cloud’s Vertex AI Model Garden or Amazon Bedrock, savvy developers will start using some of them, with or without approval from risk management and compliance teams. To overcome this, many organizations have started to adopt model catalogs.

Here, the catalog’s purpose is to discover which models are in use, their version numbers, and approval status. It also documents the model’s owner, its purpose, and usage. For the approved models, the catalog will show what data sets were used to train the models, and how the models were evaluated and their fairness scoring. The risk scorecards capture the model’s vulnerabilities and its impacts, and should be reviewed regularly to ensure risks are within thresholds.
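As an illustration, a single catalog entry might look something like the following sketch; the field names are hypothetical and not drawn from any particular catalog product.

```python
# A hypothetical model catalog entry, sketched as a Python dict. Field names are
# illustrative; a real catalog would store this alongside the data catalog.
model_card = {
    "name": "support-ticket-summarizer",
    "version": "1.3.0",
    "owner": "customer-analytics-team",
    "purpose": "Summarize inbound support tickets for triage",
    "approval_status": "approved",            # approved / pending / rejected
    "base_model": "placeholder-base-7b",      # checkpoint the model was fine-tuned from
    "training_datasets": ["tickets_2022_2023_curated"],
    "evaluation": {"rouge_l": 0.41, "fairness_review": "passed 2023-11-10"},
    "risk_scorecard": {
        "prompt_injection": "medium",
        "pii_leakage": "low",
        "last_reviewed": "2023-12-01",
    },
}
```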

Ideally, a model catalog should be an extension of the data catalog so there isn’t a fragmentation of data and AI governance.

Model Consumption

In model consumption, the focus of AI governance is on mapping the business use cases to the approved models and identifying data security risks. This section of AI governance handles concerns with the unsafe use of corporate data, prompt injection, and data loss.

It is also responsible for tracking the entire model lifecycle lineage with steps like approvals from the legal, CISO, CDO, auditors, etc. all the way to model retirement. With controls in place, it speeds model deployments into production.

The governance tools should not only identify risks in areas like bias, toxicity, drift, and IP infringement, but also document risk mitigation strategies. The AI governance tool should also help provide explainability of the models.

Continuous Monitoring

Once the approved models are deployed, there needs to be a mechanism to track how they perform at scale and to automatically scan responses for hallucinations and other unsafe content. One of the biggest issues with AI models is that their non-deterministic responses can lead to hallucinations. Hence, monitoring for accuracy and relevance is essential. As more AI models are put into production in 2024, tracking their performance and cost will be critical.

The risk areas mentioned above need to be monitored constantly for unexplained changes and anomalies. Upon detection of aberrations, alerts and notifications should be raised intelligently, without causing ‘alert fatigue’.

Although data security and privacy tasks run through every section of AI governance, monitoring users, their entitlements, and related security policies is an important component.

Risk Management

Model scorecards, inference/usage monitoring dataset and dashboards, along with workflow automation are critical to maintain the health of AI applications and to initiate timely remedial actions to respond to any degradation in expected performance. Automated workflows can help create data and model inference KPIs and trigger alerts as required to ensure that the model owner can initiate remedial actions.

The tool should provide an incident management capability to document the steps taken to resolve incidents. It may further integrate with ticketing systems like Jira and ServiceNow. Finally, workflows should allow assessments against relevant AI regulations and frameworks, like the NIST AI Risk Management Framework.

AI governance is a foundational piece of success with any AI initiative. We expect a major focus on AI governance in 2024 from multiple vendors, from traditional data catalog companies to large platform providers, such as IBM with watsonx.governance. Databricks’ Unity Catalog already converges the data catalog with AI model metadata.

Further accelerating this focus are several new regulations and standards that were released in the final days of 2023. From the EU AI Act to the ISO 42001 to OpenAI’s Preparedness Framework, they are all designed to promote the responsible use of AI. For example, the OpenAI framework has four goals — “track, evaluate, forecast and protect” against model risks. Anthropic also published its Responsible Scaling Policy.

Stable

An increase in data, be it real-world or synthetic, puts more pressure on the underlying data infrastructure to continue its strides toward becoming more efficient and interoperable. The trends in this section help simplify the data and AI stack, making it easier and cheaper to manage, and reduce the burden on organizations that would otherwise spend an inordinate amount of time on technology rather than business imperatives.

Unified Data Plane

Over the years, data and analytics platforms have moved through various mini-waves, most recently from cloud warehouses to today’s modern data stack. Across these waves, organizations tried to solve for scalability and versatility of features; however, with every passing wave, the stack became more fragmented. TCO continued to climb while the value derived from the stack kept falling.

The primary driver continued to be business intelligence and reporting applications. While some agile enterprises did have ML applications riding on top of the base stack, more often these were just another parallel stack. Consequently, the fragmented data stack has become a key disabler of progress on many AI applications and the ability for businesses to derive value from their data stack investments.

Poor modeling, data and metadata duplications, low data/metadata quality and consistency, and lack of imagination of applications beyond business intelligence are other drawbacks that have led to higher TCO and lower business value.

To support the rising trend of intelligent data platforms, the need for a common storage subsystem for all kinds of data, structured and unstructured, becomes ever more critical. The unification of data warehouses and data lakes, aka lakehouse, started in earnest a few years ago, but now has gained steady momentum.

Figure 6 shows the importance of a unified storage that allows different types of workloads to act on different types of data.

Figure 6: Unified data plane reduces the need for separate governance and storage for different workloads.

Data warehouses have historically provided better data modeling and management advantages, characterized by a better user experience for business intelligence analytics. Data lakes, on the other hand, provide the flexibility needed for data engineering and advanced analytics, but require stronger engineering skills to work with low-cost raw data files in formats like Parquet and CSV. Now, with the growth of table formats like Apache Iceberg, Delta Lake, and Apache Hudi, data lakes are starting to resemble data warehouses. Data warehouses can access files in a data lake using a common, open format with ACID semantics. In addition, initiatives like OneTable from Onehouse and UniForm from Databricks are attempting to make the open table formats interoperable.
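A minimal PySpark sketch of this convergence is shown below. It assumes a Spark session already configured with an Iceberg catalog (here named lake); the catalog, namespace, and table names are placeholders.

```python
# A minimal PySpark sketch of writing and querying an open table format.
# Assumes a Spark session configured with an Iceberg catalog named "lake";
# catalog, namespace, and table names are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lakehouse-sketch").getOrCreate()

orders = spark.createDataFrame(
    [(1, "2024-01-05", 120.0), (2, "2024-01-06", 75.5)],
    ["order_id", "order_date", "amount"],
)

# Write the data as an Iceberg table: open files on cheap storage,
# but with ACID semantics and schema evolution like a warehouse table.
orders.writeTo("lake.sales.orders").using("iceberg").createOrReplace()

# Any engine that speaks the table format (warehouse or lake) can now query it.
spark.sql(
    "SELECT order_date, sum(amount) FROM lake.sales.orders GROUP BY order_date"
).show()
```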

Unified storage reduces the need to move data across different platforms, which also reduces the overhead of extra security and governance. The question then arises: does it incur higher query latency? Not necessarily. Snowflake demonstrated at its 2023 Snowflake Summit that its Iceberg tables have nearly the same latency as its native tables.

Another burgeoning trend in the unification has been the convergence of batch and streaming data. It used to be that analytical stores were batch-oriented but most vendors now consume streaming data in near real time and make it available for event stream processing (ESP). ESP use cases range from analytics to enriching streaming data with historical data, to anomaly detection.
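The sketch below shows this convergence in Spark Structured Streaming, where the same DataFrame API that handles batch tables also treats a Kafka topic as an unbounded table. The broker address and topic name are placeholders, and the Kafka connector package must be available on the classpath.

```python
# A sketch of batch/stream convergence in Spark Structured Streaming: the same
# DataFrame API used for batch tables also reads a Kafka topic as an unbounded
# table. Broker address and topic name are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, window

spark = SparkSession.builder.appName("esp-sketch").getOrCreate()

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")   # placeholder broker
    .option("subscribe", "payments")                     # placeholder topic
    .load()
    .selectExpr("CAST(value AS STRING) AS payload", "timestamp")
)

# Near-real-time aggregation, e.g. feeding anomaly detection downstream.
counts = events.groupBy(window(col("timestamp"), "1 minute")).count()

query = counts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()
```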

This trend will continue as it allows a more diverse range of use cases to deliver greater business value faster and at lower cost.

Cross Cloud

The cross cloud trend, which SiliconAngle’s theCUBE calls “supercloud” and Gartner calls “intercloud,” differs from the older concept of hybrid or multi-cloud, which simply required a product to be supported by multiple cloud providers. “Cloud” is now considered an operating model, not a destination. This operating model can run anywhere from the edge to private/public data centers. This paradigm shift is important as we move AI to where data resides, rather than the other way around.

Since cross cloud seamlessly operates across various cloud environments, it gives users the independence to run workloads anywhere, which can help in negotiation of pricing and optimization. Additionally, some vendors offer the ability to run the same workloads on a cluster that has nodes across different cloud providers.

Why would one need this capability? For reasons like cost optimization, higher resilience, multi-vendor cloud strategy, and meeting data residency needs. Users expect cross cloud to give them the control to move their workloads to the most optimized location and attain vendor neutrality. A true cross cloud will lead to operational standardization across applications, security, access, and management.

In the era of AI, cross cloud becomes even more crucial, as model training and inference locations may differ widely for price optimization and accessibility/availability reasons. Model training should take place where the cost benefits are greatest, which may be on-premises or in the best-suited cloud. Model inference may happen on edge devices or on cloud options that provide the best price-performance per user query.

Converged Metadata Plane

The absence of a common metadata standard presents a significant challenge in data management and interoperability. Without a universally adopted framework for organizing and describing data, interpreting and utilizing metadata becomes fragmented and complex. This lack of standardization also hinders automation of data analysis, transfer, and aggregation, as users are required to decipher and reconcile disparate metadata formats. Additionally, the retroactive integration of metadata into existing systems is often costly and time-consuming, as it is not initially prioritized in system development. Finally, the absence of a common metadata standard impedes the seamless exchange and comprehension of data across different platforms and disciplines.

Initiatives like OpenMetadata and OpenLineage have been attempting to converge metadata but are still in their early stages. The outcome of this siloed scenario is that each product in these categories performs its own data discovery, but with a different lens. Yet this is wasted effort, especially when an organization has multiple products deployed for different personas.

Figure 7 shows how various data governance use cases need to exchange metadata to avoid running into metadata silos.

Figure 7: A unified metadata plane converges various data governance use cases.

In the absence of a common metadata plane, data observability has grown into the single pane of glass for monitoring data quality and pipeline reliability. Recent economic headwinds have led to two new use cases, DataFinOps and DataBizOps, which help with financial governance and with measuring the productivity of data products, the next stable trend we cover. If 2023 was characterized by cost optimization, 2024 will be the year of growth, but with guardrails and efficiency.

We also see the unification of metadata maintaining its momentum and even security vendors are now starting to participate.

DataOps and Data Products

Software engineering has achieved an efficient software development lifecycle by closely integrating software development and IT operations teams through DevOps. Now, the same DevOps principles are being applied to data and AI projects in a space known as DataOps.

DataOps consists of automation, CI/CD, continuous testing, orchestration, version management, and data observability across the lifecycle of data outcomes. It helps deliver projects faster, with more reliability and better cost efficiency. Like DevOps’ “shift left” concept, DataOps moves activities like testing, security, and quality assurance earlier in the development process.
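A minimal example of such a shift-left check is sketched below: a data quality gate of the kind a DataOps CI pipeline might run before publishing a dataset. The column names and rules are illustrative, and the sample data deliberately violates them to show the gate firing.

```python
# A minimal "shift-left" data quality gate of the kind a DataOps CI pipeline
# might run before publishing a dataset. Column names and rules are illustrative.
import pandas as pd

def validate_orders(df: pd.DataFrame) -> list[str]:
    """Return a list of failed expectations; an empty list means the data passes."""
    failures = []
    if df["order_id"].duplicated().any():
        failures.append("order_id must be unique")
    if (df["amount"] < 0).any():
        failures.append("amount must be non-negative")
    if df["order_date"].isna().any():
        failures.append("order_date must not be null")
    return failures

# Sample data that intentionally breaks every rule above.
sample = pd.DataFrame({
    "order_id": [1, 2, 2],
    "amount": [10.0, -5.0, 7.5],
    "order_date": ["2024-01-05", None, "2024-01-06"],
})

failures = validate_orders(sample)
if failures:
    raise SystemExit(f"Data quality gate failed: {failures}")  # blocks the pipeline
```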

Maturing DataOps practices are helping deliver data products better and faster, which in turn makes data more accessible, understandable, trusted, and reusable. They also bring product management philosophy into the data teams and provide a single point of accountability. Akin to microservices, data products solve specific business problems and provide an unprecedented opportunity to measure data teams’ productivity. This has been an elusive goal thus far for many CDOs. Also, data products encapsulate data access governance policies and help ensure data security and privacy.

From a technical perspective, a data product may be an existing customer-facing asset like a dashboard, a table/view, or an ML model, but with additional emphasis on non-functional requirements such as trust, reliability, availability, observability, traceability, security, and serviceability.

Some organizations are monetizing data products and making them available through data exchanges and sharing platforms. The attributes of data products are clearly defined and communicated through another new concept called a data contract. A data contract could be as simple as a page in a data catalog with a description, or it may be defined programmatically in JSON or YAML for downstream systems to consume. A data contract is a capability, not a standalone product.
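Below is a hypothetical data contract sketched as a Python dictionary that could be serialized to JSON or YAML for downstream systems; every field name is illustrative.

```python
# A hypothetical data contract, sketched as a Python dict that could be emitted
# as JSON or YAML for downstream systems. All field names are illustrative.
import json

data_contract = {
    "product": "daily_orders",
    "owner": "sales-data-team",
    "schema": [
        {"name": "order_id", "type": "bigint", "nullable": False},
        {"name": "order_date", "type": "date", "nullable": False},
        {"name": "amount", "type": "decimal(12,2)", "nullable": False},
    ],
    "freshness_sla": "updated by 06:00 UTC daily",
    "quality_checks": ["order_id unique", "amount >= 0"],
    "access": {"classification": "internal", "pii": False},
}

print(json.dumps(data_contract, indent=2))  # what a catalog or consumer would ingest
```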

In 2024, we expect to see more advanced analytics and machine learning data products, as well as RAG (including multimodal) data products, gain momentum.

The adoption of data products will increase as conversational interfaces start to mature. Due to their trust and accountability attributes, data products provide an excellent layer of abstraction for future “chat with data” initiatives. By leveraging data products, unknown risks associated with new attack vectors, like prompt/LLM injection on raw data, can be minimized.

The data products term became well known thanks to the data mesh concept which is covered next as the first declining trend.

Declining

Trends decline as a natural part of the life cycle. They decline when customer preferences or needs change. In this document we list two trends that we feel are losing momentum.

Data Mesh

For organizations challenged by inaccessibility of data, data mesh was a breath of fresh air when it first appeared in 2019. It allowed users to see opportunities to alleviate overburdened centralized IT teams and unlock insights faster. Data Mesh, as envisaged by Zhamak Dehghani, was a collection of four principles, none of which were new, but were neatly packaged in the concept. In fact, some of the principles, like the domain driven approach, were already delivering positive results in the software engineering space.

Then, why is data mesh a declining trend?

Right at the outset, data mesh was branded as a socio-technical concept and it purposely avoided any guidance on implementation details. As a result, organizations started cherry-picking components of data mesh and declaring victory. This is often labeled as “data mesh washing.”

On its own, data mesh has its merits, but the problem lies in semantics. There has been a constant debate about which is better — a centralized data team or decentralized. The reality is that business teams are not interested in debating terminologies that the IT teams like to discuss, but are focused on delivering insightful data outcomes. Experiences from many recent implementations of data mesh tell us that the best approach is a hybrid of centralized and decentralized. Certain functions like infrastructure, management, and governance need to be centralized, while application development and analytics need to be decentralized. In fact, it is this realization that has led to continued growth of unified data and AI platforms, which we list as a stable trend.

One of the biggest successes to come out of the data mesh movement is data products. We have continued to see its slow and steady success and hence it was covered earlier in this report as a stable trend.

Modern Data Stack

Yet another good concept that became a victim of its own success is the modern data stack (MDS). The idea of having a few best-of-breed systems to deliver deep specialization is undoubtedly important. But when specialization turns into micro-specialization and the number of products in each category mushrooms uncontrollably, the modern data stack becomes nothing more than a fragmented data stack. As a result, it became a modern version of the Hadoop zoo sprawl.

Like the centralized versus decentralized debate, the bundled versus unbundled debate does not serve the business needs. The reality is that at some point the cost and overhead of integrating multiple products reaches a tipping point and no longer delivers an adequate ROI on the data infrastructure investments.

Another reason we consider MDS a declining trend is its lack of interoperability. Each tool in the MDS, for instance, may do its own data discovery and, due to the lack of a common metadata plane, is unable to communicate with other tools. This is why many MDS tools boast about having a vast number of connectors. While they look good on paper, these connectors have to be manually developed and maintained, which further increases the overhead of running an MDS.

An Intelligent Data Platform (Rising Trend) provides the next logical transition to enterprise data platform architecture.

Conclusion

Taking a broad brush stroke, the rising trends are the ones that accelerate meaningful adoption of AI, while the stable trends ensure that we keep improving the underlying data substrate. The declining trends are the ones that do not serve the urgency with which business teams crave faster insights from all their data, with the least friction and cost.

The purpose of this research is to focus on technology solutions rather than on the organizational impacts. We do expect that to achieve the suggested trends, new roles will be created and existing roles will need to unlearn older approaches and adopt new paradigms. This is not something to fear as it is a natural progression. We witnessed the same changes when cloud and mobile computing became pervasive. It is a journey we will all be taking together.

Lastly, and not without a twist, we haven’t discussed the evolution of the “base LLM architecture” itself. Any breakthrough in LLM architecture, such as cost-effective adaptive knowledge injection or a significant improvement in understanding and reasoning capabilities, has the potential to become an overarching trend in itself. It could well be a wild-card entry that we will all be talking about as we move through 2024, but only time will tell.


Sanjeev Mohan

Sanjeev researches the space of data and analytics. Most recently he was a research vice president at Gartner. He is now a principal with SanjMo.