The Cambrian Explosion of Data: Navigating the Dynamic World of Data Stores

The data storage layer serves as the foundation of any robust data and AI initiative. Traditionally, the phrase “data store” almost exclusively referred to a database management system (DBMS), a monolithic system optimized for structured data. However, the rapid evolution of data volumes, varieties, and velocities, along with the demands of cloud deployments and AI, have profoundly expanded this definition.

Today, the term “data store” encompasses a much broader spectrum of technologies. It includes distributed file systems (such as the Hadoop Distributed File System, or HDFS), designed for massive, sequential reads and writes often associated with batch processing for big data analytics. Furthermore, object stores, sometimes referred to as “blob stores” (like Amazon S3, Azure Blob Storage, or Google Cloud Storage), have emerged as versatile, highly scalable, and cost-effective solutions for storing vast quantities of unstructured and semi-structured data. This includes everything from documents, images, and video to AI model artifacts and the raw input data for large language models and deep learning.

In fact, the use of object stores is no longer limited to unstructured data. The advent of open table formats has enabled them to be used for storing structured data as well, driving a fundamental move toward open standards. This shift reflects a core principle: the optimal storage solution is no longer monolithic, but rather dictated by the specific data characteristics, access patterns, and the varied requirements of AI-centric workloads.

In the early 2000s, NoSQL data stores emerged as a direct response to the limitations of traditional DBMS, specifically their requirement for rigid, upfront schemas and their challenges in scaling horizontally. For the price of “eventual consistency” — a trade-off where data might not be immediately consistent across all replicas but eventually converges — NoSQL databases delivered unprecedented scalability and high availability. These attributes were critical for the burgeoning demands of globally distributed web and mobile applications, which prioritized continuous uptime and the ability to handle massive, rapidly changing datasets over strict transactional guarantees.

This new breed of data stores also diversified access methods, moving beyond the sole reliance on SQL to embrace more flexible approaches like JSON APIs and other native data structures. Although this category initially resisted SQL as a query language, it eventually adopted SQL as an additional access interface. In this blog, we will refer to the NoSQL category as non-relational.

The current wave of Generative AI, especially agents, is significantly disrupting the data store space. These workloads require native capabilities for storing, indexing, and efficiently searching vector embeddings, which are crucial for techniques like Retrieval Augmented Generation (RAG) and semantic search. This has made the ability to query data using natural language a credible alternative to traditional SQL. Open standards like Anthropic's Model Context Protocol (MCP) are helping standardize how AI applications interact with the underlying data stores.

As non-relational databases have matured, they have begun offering capabilities traditionally provided by relational DBMS (RDBMS), such as strong consistency. RDBMS vendors have, in turn, been incorporating native JSON storage and the ability to scale horizontally across distributed systems. This convergence of RDBMS and NoSQL features has produced multi-model databases, which handle different data models within the same product. As a result, business teams can now focus more on their use cases rather than on how to model the underlying technical substrate.

These blurring distinctions make creating a data store taxonomy more of a Sisyphean endeavor. In this piece, we will describe the different classes of data stores for transactional data. The topic of analytical data stores will be explored in a subsequent blog.

Data Types

Data has become more complex and comes in more flavors. The main ones include:

Transactions (structured operational data)

This represents business operations, such as invoices, banking activities, customer orders, and supply chain movements, originating from transactional systems like Point-of-Sale (POS), Enterprise Resource Planning (ERP), and Customer Relationship Management (CRM) platforms.

This data is typically characterized by a well-defined, highly normalized structure, demands an exceptionally high level of consistency, and traditionally resides in Relational Database Management Systems (RDBMS). Its primary use is for Online Transaction Processing (OLTP), ensuring the integrity and speed of individual business events. For AI, transactional data provides accurate, real-time ground truth for building predictive models (e.g., fraud detection, inventory forecasting) and understanding core business performance.

Interactions (semi-structured data)

With the meteoric rise of the internet, social media, mobile applications, and collaborative platforms, contextual data from social feeds, customer reviews, chats, and communication graphs became increasingly prevalent. This data is typically semi-structured, combining elements of defined schema with free-form content. For example, a Twitter feed contains structured metadata (user ID, timestamp) alongside the free-form text of the tweet itself.

This type of data often tolerates “eventual consistency,” meaning that immediate propagation to all users globally is not a strict requirement, prioritizing availability and scale. For AI, interaction data is invaluable for sentiment analysis, natural language processing (NLP), customer 360 views, personalization engines, and conversational AI, offering rich insights into human behavior and preferences.

Observations (unstructured data)

Observational data emanates from continuous measurements, events, and metrics generated by machines, applications, and sensors. Common examples include website clickstreams, application logs, and the burgeoning volume of Internet of Things (IoT) data. The proliferation of IoT devices, from smart home sensors to industrial machinery and autonomous vehicles, has significantly elevated the status of observational data.

Frequently, this event-based data is time-series in nature, generally requiring low-latency ingestion and high-throughput processing to capture transient states. Beyond structured metrics, observational data is often accompanied by rich unstructured data, such as video feeds, audio recordings, and sensor imagery. For AI, observational data is foundational for real-time anomaly detection, predictive maintenance, operational intelligence, computer vision applications, and complex event processing for scenarios like smart city management or industrial automation.

Organizations are generally diligent about collecting, processing, and analyzing transactional data but they are often less so with interactions and observations. Much of this rich, contextual, and event-driven unstructured data, despite its immense potential for training sophisticated machine learning models, never gets fully analyzed. Numerous studies indicate that 50% to over 90% of all corporate data falls into the category of unprocessed, unanalyzed, and effectively “dark data.” This represents a colossal untapped resource for competitive advantage through Gen AI.

Each of these distinct data types possesses specific characteristics that dictate its optimal storage, processing, and analytical requirements. Broadly, the two major uses of data are operational and analytical, each with its own processing paradigm:

  • Operational Processing (OLTP): This encompasses systems designed for high-volume, low-latency inserts, updates, and deletes of individual records. Relational databases traditionally excel at handling these business transactions, ensuring ACID properties (Atomicity, Consistency, Isolation, and Durability). Increasingly, other operational data, such as interactions and observations, is finding a home in non-relational databases, including key-value stores, document databases, and time-series databases, which offer specialized performance for their respective data models.
  • Analytical Processing (OLAP): OLAP systems are optimized to deliver insights, complex aggregations, and reports over large volumes of historical or aggregated data. Traditionally updated in batch mode using ETL (Extract, Transform, Load) or ELT (Extract, Load, Transform) processes, modern analytical systems are now evolving to consume continuous streaming data and serve low-latency analytical use cases critical for real-time dashboards and feature generation for AI.

Table 1 compares the key differences between transactional (OLTP) and analytical (OLAP) data stores. Note that although data stores have traditionally supported either transactional or analytical processing, some support both; Gartner calls them HTAP and Forrester calls them Translytical.

Table 1. Transactional vs. Analytical workloads

Creating a definitive taxonomy of data stores in this rapidly evolving landscape is akin to drawing a line in shifting sands, especially as multi-model data stores become more prevalent, supporting diverse data types and access patterns within a single platform. Therefore, this blog categorizes data stores by their primary or original capabilities, acknowledging that many products have significantly evolved to support additional storage engines, data models, and specialized functionalities critical for the AI era.

Data Store Taxonomy

The first computerized databases arrived in the 1960s, with IBM's IMS developed to support the Apollo space program's Saturn V moon rocket. We call this category Pre-relational data stores, as it predates the relational category. Many organizations still use these databases, such as IBM's IMS; they were designed to run on mainframes, with COBOL frequently used to develop applications. These early DBMSs typically used hierarchical or network data models, originally designed for an era dominated by sequential storage devices such as magnetic tape.

Mainframes may harken back to legacy days, but 90% of credit card transactions still run through them. However, the pre-relational databases now represent a very small percentage (under 4% in 2020) of the overall DBMS market and hence are not covered in this blog.

Figure 1 shows a further refinement of RDBMS and NoSQL databases as well as the new category of vector databases. Some products may span many of these categories in a single product. For example, Microsoft's Azure Cosmos DB is a multi-model database, whereas AWS firmly believes in providing multiple independent "best fit" databases that fall into many of the categories in the figure below. Which is the better choice for your team depends on your overall ecosystem, integration needs, current investments, and skill sets.

Figure 1. The “Cambrian explosion” of data stores. Since the early 2000s, the data store landscape has diversified dramatically beyond traditional RDBMS, a trend that shows no signs of abating.

A quick note: while this guide mentions several specific products to illustrate key concepts, the database ecosystem is vast and constantly evolving. The examples I’ve included are representative, not a comprehensive list. For the most up-to-date information, always refer to the latest product documentation.

Relational DBMS

The RDBMS is a workhorse that has been a vital player in the industry since its origins in E.F. Codd's relational model, which is grounded in set theory. This model defines a tuple of information as a row, or record, with fields, or columns. A collection of tuples is a table. Based on this foundational theory, IBM created the declarative Structured Query Language (SQL).

A declarative language allows users to “declare” what information they want in a simple syntax without having to specify how the data is organized or retrieved from the storage layer. The complex execution details are left to a query optimizer. This is the opposite of the imperative approach, which was more common in pre-relational databases. The imperative approach requires coding specific steps and extensive knowledge of both the query language and the physical implementation of the data store.
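
To make the declarative idea concrete, here is a minimal sketch using Python's built-in sqlite3 module; the table and data are invented for illustration. The SELECT states only what is wanted, and the engine's query optimizer decides how to execute it.

```python
import sqlite3

# Hypothetical table and data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, total REAL)")
conn.execute("INSERT INTO orders (customer, total) VALUES (?, ?)", ("Acme", 120.50))
conn.commit()

# Declarative: we state WHAT we want; the optimizer decides HOW to scan,
# filter, and sort the underlying storage.
for row in conn.execute(
    "SELECT customer, total FROM orders WHERE total > 100 ORDER BY total DESC"
):
    print(row)
```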

Most mission-critical databases tended to be relational. They provide unmatched maturity and a robust set of capabilities that other database types are only now beginning to incorporate. Their strong consistency guarantee is a critical requirement for demanding use cases, such as financial transactions. While strong consistency is no longer unique to RDBMS, it remains a defining feature. It is far simpler for an application to operate with the certainty that any data returned is the latest committed version, rather than having to build additional logic to manage the complexities of eventual consistency. As a result, many non-relational databases have started implementing strong consistency to better serve enterprise workloads.

Transactional

One of the first RDBMSs was the University of California, Berkeley's Ingres (short for INteractive Graphics REtrieval System), which used a query language called QUEL. While Ingres was a pioneer, its language was quickly supplanted by SQL. Ingres later evolved into Postgres (post-Ingres), which is today known as PostgreSQL. Early commercial RDBMSs, such as Oracle and Sybase, were part of the broader effort to commercialize Codd's theory, inspired by IBM's System R project, which began in 1973.

With the advent of the internet, mobile, and cloud computing, the NoSQL movement took hold in the late 2000s. However, the RDBMS category still constitutes the bulk of the entire DBMS market. The category continues to modernize, adding advanced capabilities to support newer use cases, such as handling real-time streaming data and incorporating semantic search of documents using vector embeddings.

To explore this category, transactional RDBMSs are further sub-classified into the following types:

1. Traditional

This is still the most mature and full-featured option, consisting of established players like Oracle, SQL Server, and IBM Db2. Other notable databases include Sybase (now SAP Adaptive Server Enterprise (ASE)) and Informix (now owned by IBM but maintained by HCL Technologies).

Originally developed for on-premises deployments, they often came with high license costs and vendor lock-in. These limitations led to the emergence of open-source and cloud-native alternatives.

2. Open-Source

MySQL and PostgreSQL loosened the grip of proprietary options by providing credible alternatives at no license cost. Both databases continue to be used extensively. For example, MySQL underlies nearly every WordPress site. MySQL is owned by Oracle, which led its creator to fork the project and create MariaDB. MariaDB was developed to be wire-compatible with MySQL but has since taken its own course, implementing client-compatibility at the cost of feature parity with MySQL’s latest versions.

A more recent open-source entrant, Neon, is a serverless, multi-cloud PostgreSQL offering that separates storage and compute for flexible scaling and cost efficiency. Neon was acquired by Databricks in May 2025.

3. Cloud Native

Every major cloud service provider (CSP) offers a managed PaaS database, also known as Database as a Service (DBaaS), that uses established proprietary and open-source RDBMSs. For example, AWS provides its Relational Database Service (RDS) for Oracle, SQL Server, MySQL, PostgreSQL, DB2, and MariaDB. AWS also offers its proprietary, fully managed RDBMS called Amazon Aurora, which is compatible with either MySQL or PostgreSQL, and provides performance optimizations and enhanced features like serverless operation, multi-master write capabilities, and automatic backups. Similarly, Google Cloud offers fully managed AlloyDB for PostgreSQL.

In addition to these CSP offerings, specialized vendors like Crunchy Data provide a managed service for PostgreSQL in the cloud, offering enhanced features and support. Crunchy was acquired by Snowflake in June 2025.

4. In-Memory

Very low-latency workloads benefit from in-memory databases, which keep working data in main memory for extreme speed, often paired with fast persistent storage such as NVMe for durability. Examples include Oracle's TimesTen and VoltDB, which evolved from Michael Stonebraker's research project called H-Store.

5. Distributed (NewSQL)

The marriage of distributed computing with the mature relational database model led to NewSQL databases. These systems are designed to provide strong consistency at extreme scale, even when a database is spread across geographically distributed nodes. A major challenge for this approach is ensuring the correct timestamps of concurrent transactions without relying on the unreliable system clocks of underlying servers.

Google’s Cloud Spanner solved this issue through its proprietary TrueTime technology, which provides a globally consistent clock. Cloud Spanner has since inspired many other NewSQL databases, such as YugabyteDB and CockroachDB, both of which offer PostgreSQL compatibility. Other databases, like TiDB from PingCAP, also provide distributed SQL capabilities with MySQL compatibility.

Meanwhile, traditional databases have also adopted these distributed features. For instance, Azure SQL Database provides a distributed SQL capability through its Hyperscale tier and Azure Cosmos DB for PostgreSQL uses the open-source Citus extension. Oracle and MySQL also have distributed features and extensions.

Traditional databases are designed to scale by adding more resources, such as CPU and memory, to a single server; this is called vertical scaling or scaling up. Distributed databases instead scale by adding more compute nodes, a process known as horizontal scaling or scaling out.

Analytical

While transactional databases are optimized for fast writes and high concurrency, analytical databases are designed for bulk, read-heavy queries at lower concurrency. Analytical data stores are built for high throughput on large datasets, but generally not for fast single-row ingestion or high-concurrency transactional queries.

These databases are also fundamentally different under the hood. When a query is executed in a transactional RDBMS, entire rows are read, and only the columns requested in the SELECT clause are returned. Analytical databases instead use a column-oriented storage engine, which groups and compresses related column values together. This translates to much faster bulk reads, because only the requested columns are read from disk. Even so, analytical queries can take minutes to hours to run due to the larger data volumes and more complex joins across multiple tables.
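
A toy sketch in Python (not any product's actual storage engine) of why columnar layouts help analytical scans: an aggregate over one column touches a single contiguous list instead of every full row.

```python
# Row store: each record kept together (good for OLTP point reads and writes).
rows = [
    {"order_id": 1, "region": "EU", "amount": 120.5},
    {"order_id": 2, "region": "US", "amount": 80.0},
    {"order_id": 3, "region": "EU", "amount": 42.0},
]

# Column store: each column kept together (good for scans and aggregates).
columns = {
    "order_id": [1, 2, 3],
    "region": ["EU", "US", "EU"],
    "amount": [120.5, 80.0, 42.0],
}

# A query like "SELECT SUM(amount)" only needs the 'amount' column
# in the columnar layout, but must touch every row object in the row layout.
print(sum(columns["amount"]))
print(sum(r["amount"] for r in rows))
```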

Like row stores, column stores are also further classified into the following subcategories:

1. Traditional

These are the most widely deployed analytical databases that came into existence prior to the 2000s. They were developed in an era when hardware resources were scarce and expensive, and as a result, they are optimized to run very large workloads that may include joins of dozens of tables. Examples include Teradata, Netezza, Vertica, Greenplum, SAP IQ, and Actian.

2. Cloud-Native Data Warehouse

This category tends to be fully managed, serverless, and often offered as a SaaS product. Some products, such as Amazon Redshift, Azure Synapse, and Oracle Autonomous Data Warehouse, are designed to run on a single cloud platform. In contrast, multi-cloud platforms like Snowflake, Databricks, and Cloudera can run on multiple clouds, and Google BigQuery has also added the ability to run on AWS and Azure using its BigQuery Omni query engine.

3. Cloud-Native Lakehouse

Data lakehouses with a structured open table layer and support for unstructured data are making significant inroads into the data warehouse space. This topic is a major trend in its own right and is best covered in a separate blog. Key examples include Databricks’ Delta Lake.

4. Open-Source

The open-source space has seen an explosion of new, highly specialized analytical databases. Examples include DuckDB, an in-process analytical database ideal for edge and desktop analytics, and ClickHouse, a column-oriented DBMS optimized for real-time analytics and high-concurrency queries. Apache Pinot, Apache Druid, and Apache Doris are real-time analytics databases designed for high-concurrency, low-latency queries on streaming data.

Low-latency, ad-hoc analytical workloads are also handled by query engines that emulate data warehouses, such as Presto and Trino. These are often used for federated queries across multiple data sources.

Many transactional row-oriented databases also provide column-oriented analytical capabilities. This category is called Hybrid RDBMS.

Hybrid

To reduce the proliferation of separate databases for operational and analytical purposes, hybrid transaction/analytical processing (HTAP) systems were developed. This convergence is also known by other names, such as Translytical (coined by Forrester) and Hybrid Operational Analytical Processing (HOAP). In 2019, Gartner retired the HTAP moniker, and replaced it with Augmented Transaction Database, which also includes operational AI.

These systems are designed to handle both fast transactional writes and complex analytical queries on the same dataset in near-real time. They often achieve this by using a combination of technologies, such as a column store optimized for analytics and a traditional row store for transactions. Unlike separate databases that require ETL processes to synchronize data, HTAP systems allow for direct, low-latency analytics on live operational data. Vendors have chosen different paths, as depicted in Table 2.

Table 2. The diversity of models to deliver hybrid capabilities varies considerably.

To avoid redundant data pipelines, hybrid databases simplify architectures by enabling both transactional and analytical workloads within a single system. These systems eliminate the need for traditional Extract, Transform, Load (ETL) processes between an OLTP and an OLAP database.

Oracle and SAP HANA achieve this through a single, multi-model engine that intelligently handles both types of workloads. Other databases, such as MariaDB, use a multi-engine architecture, with a separate engine for OLTP and OLAP. To the user, this is transparent, as a router or query planner identifies the nature of each query and directs it to the appropriate engine. To prevent bottlenecks and resource contention, workload isolation techniques are also deployed.

Despite their advantages, hybrid stores can experience some latency when ingesting and synchronizing data from the transactional side for analytical replication. This latency, or “data delta,” is typically measured in seconds to minutes, depending on the system and the workload.

Non-relational Data Store

The non-relational category refers to data stores not based on relational set theory. Products in this category ingest and store data without being encumbered by a predefined, rigid schema. Hence, they are often considered schema-flexible and use a “schema-on-read” approach, where the data model is defined only when the data is queried.

This category includes a profusion of data models, such as key-value, document, graph, and more. Non-relational databases support multiple access patterns, primarily using APIs native to their data model. Although SQL was not originally a priority, it is the most common query language, so many non-relational databases have since added support for SQL-like APIs to improve their usability and appeal.

The approach of deploying multiple data stores, where each type is the best fit for a specific workload, is called “polyglot persistence.” This is a direct counterpoint to the traditional RDBMS approach, where a single, general-purpose data model was used for all applications.

While a “one size fits all” strategy for choosing a data store is never ideal, having too many specialized data stores can increase operational and integration overhead. Some data stores mitigate this by offering multi-model databases that provide support for multiple data models within the same product.

Figure 2 shows the main categories of non-relational data stores.

Figure 2. Non-relational is an expansive category and is made of many different types of data models. The predominant ones and their key characteristics are shown in this diagram.

Vector databases are the latest category in this space and are designed for serving AI applications. However, vector databases don't fit neatly into the traditional categories of relational or non-relational, transactional or analytical. To learn more about vector databases, please see this blog.

Key value

The key-value store is the most fundamental non-relational data model, serving as a building block for other categories, most notably document databases. One can think of it as a table with two columns: a unique primary key and its associated value. The value can be almost anything, from a simple string or integer to an array, a file name, a URL, a complex JSON object, or an image.

Values in these stores are typically treated as opaque blobs, which means they are not indexed for search. This design choice enables incredibly low-latency storage and retrieval. To achieve this, key-value stores are designed for high-concurrency and high-throughput loads, leveraging memory or solid-state drives (SSDs). Data access is enabled through simple operations like GET, SET, and DELETE. A key advantage is that keys can be partitioned across multiple nodes, which allows the database to scale horizontally.
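
As a hedged illustration of the simple GET/SET/DELETE access pattern, here is a minimal sketch using the redis-py client; the host, port, key names, and TTL are illustrative and assume a locally running Redis.

```python
import json
import redis  # assumes the redis-py client and a local Redis server

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# SET: the value is opaque to the store (here, a serialized JSON session blob)
# with a TTL of 1800 seconds.
r.set("session:42", json.dumps({"user_id": 7, "cart": ["sku-1", "sku-9"]}), ex=1800)

# GET and DELETE by key.
session = json.loads(r.get("session:42"))
r.delete("session:42")
print(session["cart"])
```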

Common uses for key-value stores include session management, caching data, product recommendations, and leaderboards in online gaming. This category is well represented by open-source projects, such as Redis, RocksDB, Memcached, and Aerospike, as well as commercial offerings such as Amazon DynamoDB. Older databases that are no longer commercially available include FoundationDB, Riak, and Oracle Berkeley DB.

Document Database

The persistent “impedance mismatch” between the structured, tabular nature of relational databases and the object-oriented paradigms favored by application developers created a significant friction point. This enduring challenge led to the widespread adoption of workarounds, such as Object-Relational Mapping (ORM) tools like Hibernate and SQLAlchemy, which attempted to bridge the divide.

Earlier in this evolutionary path, object-oriented and XML databases emerged as potential solutions. Object-oriented databases promised a more natural fit for application data models, while XML databases offered flexible data structuring. However, both proved too complex to query and manage, which ultimately prevented their widespread adoption. XML turned out to be too verbose. Fundamentally, neither adequately addressed developers’ escalating needs for intuitive, performant, and flexible data persistence.

When new databases based on the JSON model appeared, developers found them to be a natural fit for applications. In a document database, data is stored in a JSON or BSON (Binary JSON) format, which maps directly to an application's object model. Developers enthusiastically added these databases to their toolkits, with MongoDB being one of the first to ride this wave in 2007, followed by Couchbase (formed in 2011 by the merger of the key-value store Membase and CouchOne, the company behind the document database CouchDB). The fully managed IBM Cloudant database is built on Apache CouchDB.

But why the name “document database”? Think of a document as a paper document that contains all relevant information in a single place — except it doesn’t give you a paper cut! A document is a self-describing, semi-structured collection of key-value pairs. It is analogous to a row in a relational table, except documents can have different attributes or complex structures with nested arrays and objects.

Documents contain what appear to be denormalized and pre-joined tables, which can lead to very fast reads and writes. However, multiple concurrent writes to the same document can cause latency issues and require document-level locking. This issue is typically alleviated by distributing documents across multiple nodes and joining them using application logic. Each document is identified by a unique ID, which is typically indexed by a B+ tree for fast retrieval.

In practice, data ends up in multiple documents, with a group of documents called a collection. A collection is analogous to a table in a relational database, though with a much more flexible schema.
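
A minimal sketch of the document model using the PyMongo client; the connection string, database, collection, and fields are invented for illustration. The nested order lines live inside one self-describing document rather than in joined tables.

```python
from pymongo import MongoClient  # assumes a reachable MongoDB instance

client = MongoClient("mongodb://localhost:27017")
orders = client["shop"]["orders"]  # database "shop", collection "orders"

# A self-describing document: nested arrays and objects instead of joined tables.
orders.insert_one({
    "order_id": 1001,
    "customer": {"name": "Acme Corp", "tier": "gold"},
    "lines": [
        {"sku": "sku-1", "qty": 2, "price": 19.99},
        {"sku": "sku-9", "qty": 1, "price": 120.50},
    ],
})

# Query on a nested field; the schema can vary from document to document.
doc = orders.find_one({"customer.tier": "gold"})
print(doc["order_id"])
```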

Originally, document databases were developed to prioritize high availability over strong consistency, a choice that can be explained by the CAP theorem. Traditional RDBMSs were generally considered CP (Consistent and Partition-tolerant), meaning that in the event of a network partition, they would sacrifice availability to ensure data consistency. If a consistent view of the data could not be provided, the system would not return a result.

On the other hand, early document databases were generally AP (Available and Partition-tolerant). This meant they would provide the most available answer, even if the data on some nodes was not the latest version. When network partitions were resolved, the data would eventually converge, a concept known as “eventual consistency.”

Recently, modern document databases have evolved, adding support for ACID compliance for single documents and even multi-document transactions. This allows them to meet the strong consistency requirements of many enterprise workloads. However, document databases still do not have good support for modeling complex many-to-many relationships, which remains a strength of relational databases.

Wide column database

The topic of column-oriented databases was mentioned earlier in this blog when discussing relational analytical databases. This is distinct from wide-column databases, which fall into the non-relational category. Like document databases, they are also considered schema-flexible.

In a wide-column database, each row consists of a primary key, and the “value” is a dynamic collection of columns. This allows each row to have a different set of columns, making them excellent for storing variable information, such as user preferences.

The wide-column data model originated from the Google Bigtable paper, which inspired many other databases, including the open-source Apache HBase and Apache Cassandra.

These databases are designed for high-concurrency and high-throughput workloads, which makes them popular for time-series data from IoT devices, fraud detection, and log management. They are highly scalable and support geo-distributed replication. However, they can have slow update operations, and like many NoSQL databases, they require careful data modeling upfront that aligns with the intended query patterns. High management overhead and deep technical skills are needed to ensure that geo-replicated writes maintain the minimum required quorum in case some nodes are not available.
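
As a hedged sketch of the wide-column model, here is what a table definition and insert might look like in Cassandra's CQL via the Python driver; the keyspace, table, and columns are illustrative. The partition key distributes rows across nodes, and rows need not populate every column.

```python
from cassandra.cluster import Cluster  # assumes a local Cassandra node

session = Cluster(["127.0.0.1"]).connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS iot
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")
session.execute("""
    CREATE TABLE IF NOT EXISTS iot.readings (
        device_id text,          -- partition key: distributes rows across nodes
        reading_time timestamp,  -- clustering key: orders values within a partition
        temperature double,
        humidity double,
        PRIMARY KEY (device_id, reading_time)
    )
""")

# Rows do not need to populate every column (humidity is omitted here).
session.execute(
    "INSERT INTO iot.readings (device_id, reading_time, temperature) "
    "VALUES (%s, toTimestamp(now()), %s)",
    ("sensor-7", 21.4),
)
```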

Geospatial

Geospatial data (often shortened to “spatial data”) is information that directly pertains to, or describes, specific geographic locations on Earth. It’s a fundamental component for applications ranging from navigation and urban planning to environmental monitoring and targeted advertising. There are a few common paradigms used to structure this data:

Vector Data:

A vector data model represents real-world features using discrete geometric primitives. These consist of:

  • Points: Individual geographic locations, typically defined by a single pair of coordinates, such as latitude and longitude (e.g., the exact location of a specific tree, a street light, or a building entrance).
  • Lines (or Polylines): Sequences of connected points that represent linear features like roads, rivers, utility lines, or transit routes.
  • Polygons: Enclosed sequences of lines that define areas, such as property boundaries, lakes, administrative regions, or even the footprint of a building.

The process of assigning geographic coordinates to human-readable addresses and locations (e.g., city names, street addresses) is called geocoding. Through geocoding, a specific street address might be represented as a point, an entire street as a line, and a neighborhood or city block as a polygon.

This vector data is stored in various formats. The Esri Shapefile (.shp) remains a very common and widely recognized open standard, though it’s important to note that a “shapefile” is actually a collection of several files (e.g., .shp, .shx, .dbf) that together define the spatial data and its attributes. Another ubiquitous and developer-friendly format, especially for web applications, is GeoJSON, which is used to store points, lines, and polygons, complete with their coordinate information and associated properties, as a simple JSON object. Other common storage options include KML and various spatially-enabled database formats (like PostGIS).
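
A small sketch of working with vector geometries in Python using the shapely library; the coordinates and the "delivery zone" scenario are made up. It builds a polygon from a GeoJSON-style mapping and performs a basic point-in-polygon (geofencing-style) check.

```python
from shapely.geometry import Point, shape  # assumes the shapely library

# A GeoJSON-style polygon (e.g., a hypothetical delivery zone).
zone_geojson = {
    "type": "Polygon",
    "coordinates": [[
        [-122.42, 37.77], [-122.40, 37.77], [-122.40, 37.79],
        [-122.42, 37.79], [-122.42, 37.77],
    ]],
}

zone = shape(zone_geojson)       # build a geometry from the GeoJSON mapping
address = Point(-122.41, 37.78)  # a point as (longitude, latitude)

# Basic geofencing check: is the point inside the polygon?
print(zone.contains(address))  # True
```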

Vector data in this context geometrically represents real-world features (points, lines, polygons). This is distinct from vector embeddings in AI, which are numerical representations of semantic meaning.

Raster Data:

In contrast to vector data, spatial imagery and continuous surfaces are presented in a grid of pixels called a ‘raster.’ Each pixel in a raster image contains a value representing a specific attribute of that location, such as elevation, temperature, land cover, or satellite imagery. This model is particularly useful for representing phenomena that vary continuously across space or for displaying aerial photographs and satellite images.

There are niche, specialized spatial data stores, but mainstream relational and non-relational databases from MySQL to Google BigQuery can store spatial data. Today, almost every key database supports some spatial data type, such as GEOGRAPHY.

Why is this information important to data and analytics professionals?

Geographic information systems (GIS) are used by mapping software, such as Google Maps and Apple Maps, and rideshare apps, like Uber or Lyft. With the introduction of high-speed 5G networks and data from sources like connected cars, data stores must handle unprecedented amounts of spatial data. This data also comes from satellites and drones as light detection and ranging (LiDAR) data. A format called Indoor Mapping Data Format from Apple is used for mapping indoor locations, such as airports, shopping malls, and so on. LiDAR technology has even reached end-user devices, such as the Apple iPhone.

Transportation companies (airlines, trucking, railways, etc.) use GIS data to pinpoint their moving assets by “geofencing” streaming spatial data into polygons. Geofencing is also used by businesses to trigger notifications, coupons and security alerts when users enter the virtual perimeter. This data can be joined with time series data to analyze rush hour traffic patterns. Streaming spatial and time series data points can determine the average velocities of vehicles in cities’ rush hour traffic. Naturally, the importance of time series data has prompted new data stores optimized to handle it.

Time series database

Time series databases (TSDBs) are optimized to ingest and store sequences of data in which time is a key attribute, making it possible to track, monitor, and analyze how events change over time. Examples of temporal data range from stock and currency trades to clickstreams from websites to measurement data emanating from sensors and devices. With the rise of IoT and 5G, this is one of the fastest growing categories of DBMS. Time series databases are adept at analyzing historical data, identifying trends, and predicting future events.

Purpose-built time series databases (TSDB) are optimized for analyzing temporal use cases, but relational, document, and wide column databases are also frequently used for these use cases. TSDBs are built to chronologically ingest high cardinality data that arrives at high speeds and in huge volumes. This necessitates an efficient ingestion mechanism as well as an efficient storage engine. Time-centric data is usually added to the database in an append-only fashion. Upserts are not very common unless a late-arriving data packet needs to correct a previous time interval’s aggregation with new values. These characteristics make time-series data different from regular rows of data with a timestamp column.

Real-time data from IoT devices, sensors, stock trades, clickstreams, logs, and CPU metrics all have one thing in common: they are indexed by timestamp.

Some of the TSDB use cases are:

  • Infrastructure management / DevOps. Data center infrastructure data, such as event logs, CPU usage, network throughput, and application performance metrics, is commonly ingested into a TSDB over a message bus, such as Kinesis or Kafka, or via collection agents, such as Telegraf, and visualized using Grafana to improve resource utilization, forecast capacity, and provide security alerts. By analyzing resource consumption trends and metrics, TSDBs can support root cause analysis of application bottlenecks.
  • Operational Excellence / Anomaly Detection. Internet of Things (IoT) devices and sensors, such as assembly lines, wearable devices, telco equipment, and medical devices, use protocols such as OPC-UA and MQTT to ingest data into a TSDB. This data can be used to monitor systems, predict failures, recommend preventative maintenance, and plan upgrades. In healthcare, TSDBs are used to track clinical research data and monitor patients in real time.
  • Financial / Trading Analysis. Stock symbols, currency exchanges, and trade settlements have used time series databases for decades to enable high-frequency trading analytics, portfolio optimization, and real-time risk assessment.
  • Business analytics. Application events, such as clickstream data, are used to understand trends and patterns in customer journey analytics as well as guide pricing strategies. Supply chain vendors use TSDBs to improve industrial productivity, including warehouse fulfillment, inventory planning, asset tracking, manufacturing, and logistics. Retailers use TSDBs to optimize inventory, forecast sales, and understand customer behavior. Grafana is commonly used to build dashboards on top of time series databases.

Time series data is often collected at regular intervals, such as electricity usage from smart meters, but it can also be irregular, spiking when certain events or conditions occur (e.g., material changes in the financial market cause a burst in trading volume). Queries need to be in near real-time, and most users are only interested in the most recent data. So, most TSDBs create time-based partitions to query data. Only the most relevant portions of data are loaded into memory. The analytics engine supports "windowing" options, such as tumbling, sliding, and session windows, and functions pertaining to interpolation, smoothing, and approximation to identify trends, behaviors, and patterns over a fixed or arbitrary time interval.

Older data is usually aggregated, a process known as ‘downsampling’. For example, granular real-time measurement data from smart meters is collected every 5 to 15 minutes into a TSDB, then analyzed for anomalies and trends. Aggregation occurs on an hourly basis, and the resulting data persists in a data warehouse for historical analysis.
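
A minimal downsampling sketch using pandas; the smart-meter readings are synthetic. It rolls 15-minute readings up to hourly totals and applies a simple sliding window, in the spirit of the windowing options described above.

```python
import pandas as pd

# Synthetic 15-minute smart-meter readings.
readings = pd.DataFrame(
    {"kwh": [0.4, 0.5, 0.7, 0.6, 0.9, 1.1, 0.8, 0.7]},
    index=pd.date_range("2025-01-01 00:00", periods=8, freq="15min"),
)

# Downsampling: aggregate 15-minute readings into hourly totals.
hourly = readings.resample("1h").sum()

# A simple sliding window (4 readings = 1 hour) for trend smoothing.
rolling = readings["kwh"].rolling(window=4).mean()

print(hourly)
print(rolling.dropna())
```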

But most generic data stores are not designed to handle the extreme velocity and volume expected from 5G device growth. Any optimized system must address the characteristics peculiar to time series data:

  • Delayed communication. Some remote data sources may not have guaranteed network access. As a result, data must persist on an "edge gateway," which is synchronized when the connection with the primary server is reestablished. Such conditions apply to oil wells in remote regions and ships far out at sea. As long as transmitting data over expensive satellite networks remains cost-prohibitive, delayed communication is a reality engineers must plan for (at least until low Earth orbit constellations like SpaceX's Starlink and Amazon's own effort deliver the high-speed networks they promise).
  • Poor data quality. Sensors degrade over time, which can corrupt data collection. If this “data drift” goes unhandled, it can render efforts to do proactive and predictive maintenance useless. Sensors need to be recalibrated by analyzing historical data or doing a linear interpolation of existing data.
  • Out of order data. Devices may be in different time zones, and hence the storage may require unification around a common time zone, such as UTC. In cases where there are thousands or millions of devices in different geographical regions, it is likely data won’t arrive in a chronological manner. This challenges the accuracy of any time window-based data analysis.
  • Technological challenges. Even for TSDBs, the sheer volume and speed at which data arrives (e.g., millions of IoT sensor readings per second) can overwhelm systems not adequately provisioned or optimized. Bursty data patterns can also lead to backlogs. Managing indexes and metadata for high cardinality can slow down writes and consume vast amounts of memory and storage.

Well-known time-series databases include InfluxDB, TigerData (formerly Timescale, maker of TimescaleDB, which has since expanded into a broader PostgreSQL offering), Prometheus, and CrateDB. KX Systems' KDB+ is also a well-known, high-performance database specifically used for managing large financial datasets, with native integration into market data sources.

Search Data Stores

Search data stores solve a unique problem: how to search and analyze unstructured data such as documents, logs, emails, and website content. Unlike traditional databases that rely on primary keys, these data stores index content using an inverted index (also known as a reverse index). This specialized data structure is at the heart of their performance.

An inverted index works by ingesting documents, breaking down their content, and creating a mapping from words to the documents where they are found. This process involves a series of steps:

  • Tokenization: A document’s text is converted into a stream of tokens, which are individual words or phrases. This process includes removing whitespaces and punctuation, converting text to lowercase, and eliminating common “stop words” like “a,” “the,” and “is.”
  • Stemming: Words are reduced to their root form (e.g., “big,” “bigger,” and “biggest” all become “big”).
  • Indexing: Each token becomes a key whose value points to the documents where it is found. The index may also store metadata like the word’s position and frequency within each document.
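
A toy inverted index in Python illustrating the tokenize-and-index steps above (real engines such as Lucene add stemming, positional data, scoring, and compression on top of this idea); the documents and stop-word list are made up.

```python
import re
from collections import defaultdict

STOP_WORDS = {"a", "the", "is", "and", "of", "to"}

def tokenize(text: str) -> list[str]:
    # Lowercase, strip punctuation/whitespace, drop stop words.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

docs = {
    1: "The quick brown fox",
    2: "A quick guide to search engines",
}

# Inverted index: token -> set of document IDs containing it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for token in tokenize(text):
        index[token].add(doc_id)

print(sorted(index["quick"]))  # [1, 2]
```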

This architecture is highly optimized for search. The index is sharded into multiple partitions to speed up queries, with each shard having a replica for redundancy and high availability.

Search data stores are used for a variety of critical workloads:

  • Enterprise Search: This is the original use case. Search databases provide powerful full-text search, semantic search, and document retrieval.
  • Observability: They are widely used for ingesting massive volumes of system and application logs to provide metrics, trace analytics, and anomaly detection. The volume of data for log analytics often dwarfs that used for enterprise search.
  • Security Analytics: These databases enable real-time analysis of security logs to detect threats and suspicious behavior.
  • AI and Machine Learning: They can serve as a vector database for Retrieval Augmented Generation (RAG), powering semantic search and other AI-driven applications.

While some other databases, like Splunk, InfluxDB, and even some relational databases, offer text search, dedicated search databases are uniquely optimized for this function. They provide an HTTP web interface with REST APIs for both search and CRUD (Create, Read, Update, Delete) operations. These APIs allow for a wide range of analytical and visualization tools to connect via ODBC or JDBC drivers.

The open-source space is dominated by two main players: Apache Solr and Elasticsearch. Both are built on top of the open-source, Java-based Apache Lucene library, which serves as their core search engine, and both exchange documents and query results as JSON.

These systems are defined by two key features: universal indexing and a distributed, scalable multi-tenant architecture. They go beyond simple keyword indexing by providing sophisticated relevance scoring. This process uses statistical measures like TF-IDF (term frequency–inverse document frequency) and also factors in external signals such as user ratings, recency, and links to make search results more relevant and personalized.
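
To make the TF-IDF idea concrete, here is a hedged sketch using scikit-learn's TfidfVectorizer on an invented three-document corpus; terms concentrated in fewer documents receive higher inverse-document-frequency weights, which is one ingredient of relevance scoring.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented corpus for illustration only.
corpus = [
    "database failover runbook",
    "kubernetes failover and scaling guide",
    "quarterly sales report",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(corpus)  # rows = documents, columns = terms

# Terms appearing in fewer documents (e.g., "sales") carry a higher IDF weight
# than terms spread across the corpus (e.g., "failover").
print(dict(zip(vectorizer.get_feature_names_out(), vectorizer.idf_.round(2))))
```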

The relationship between these open-source projects is noteworthy. In 2021, Elastic.co changed the licensing of Elasticsearch to the Server Side Public License (SSPL), which led to a “hard fork” of the project (based on version 7.10.2). This led to the creation of OpenSearch, a community-driven, open-source search and analytics suite, which remains under the Apache 2.0 license.

While Apache Solr has stayed completely open source, Elasticsearch has a commercial offering from Elastic.co (Elastic Cloud). Both can be self-managed, hosted, and managed by a cloud provider (e.g., Amazon OpenSearch Service). As of August 2024, Elasticsearch and Kibana, part of the Elastic Stack, are now again open source under the AGPLv3 (GNU Affero General Public License).

Graph

Unlike RDBMS where relations are defined via foreign keys in a single direction, graphs are modeled to represent relationships between entities in a schema-optional and bi-directional manner. In RDBMSs, relationships (i.e. joins) are resolved at query time while in graphs they are built into the model during graph creation. Graph traversal languages do not need to do ‘joins.’ Data and metadata co-exist, and schema evolutions are easily supported. Graph connections are flexible and can be dynamically altered.

Use cases involving complex relationships benefit from graphs, especially when multiple many-to-many joins can degrade query performance of an RDBMS. While graph databases are well suited for short writes of complex structured data, RDBMS are generally used for use cases that involve large aggregations.

Let’s look closer at how graph databases store information. Discrete objects, such as a person or a place or an event, are stored in nodes or vertices, which are connected through edges. The edges between nodes represent these relationships and can be dynamically added (or removed) unlike a statically defined relational table structure. Although graph databases may have indexes for key items, like people, to speed up queries, nodes also have pointers to all nodes they are connected to and hence, don’t need indexes for hopping through the paths. This concept is known as “index-free adjacency.”
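
A toy Python sketch of index-free adjacency: each node carries direct references to its neighbors, so a two-hop "friends of friends" traversal just follows pointers rather than performing per-hop index or join lookups. The graph itself is invented.

```python
# Adjacency lists: each node points directly at its neighbors.
graph = {
    "alice": ["bob", "carol"],
    "bob":   ["dave"],
    "carol": ["dave", "erin"],
    "dave":  [],
    "erin":  [],
}

def friends_of_friends(start: str) -> set[str]:
    direct = set(graph[start])
    # Second hop: simply follow each neighbor's own adjacency list.
    return {fof for friend in direct for fof in graph[friend]} - direct - {start}

print(friends_of_friends("alice"))  # {'dave', 'erin'}
```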

Graph databases support complex use cases, such as anti-money laundering using path analysis, social networking, ranking, and recommendations, fraud detection, clinical trials, and risk assessment by tracking sensitive data. These databases are increasingly being used for the data science workloads that require graph algorithms:

  • Social graphs. Platforms, such as Facebook and LinkedIn, use graphs to build social networks of friends, business acquaintances, and so on.
  • Transaction graphs. These graphs link customers with their orders, payments, product catalogs, and other transactional history. They can be used to visually show how money is exchanged between accounts (real and synthetic ones created through stolen identities).
  • Knowledge graphs. Knowledge graphs are used to build semantic layers. In 2001, Sir Tim Berners-Lee advocated his vision for the next version of the world wide web, which he called the Semantic Web.

Graph databases have built a mature ecosystem of data ingestion, debugging, profiling, and data analysis options. However, there are many different types of graph databases:

  • Triple stores or W3C RDF

The graph data model is expressed in “triples” consisting of subject, predicate, and object. This standardized representation of triples is formally defined by the W3C (World Wide Web Consortium) as the Resource Description Framework (RDF), providing a foundational framework for expressing interconnected information in a machine-readable format. It comprises:

  • The subject and object represent resources (entities, concepts, or things). These are typically identified by URIs (Uniform Resource Identifiers) or can be literals (data values like strings or numbers for objects). They act as "nodes" or "vertices" in the graph and are nouns.
  • The predicate (also known as a property or relation) defines the specific type of relationship or attribute that connects the subject to the object. It serves as the "edge" between the two nodes and acts as a verb.

For example, consider the sentence “Larry likes Lanai.” In an RDF triple: “Larry” would be the subject, “likes” would be the predicate, and “Lanai” would be the object.
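
The same example expressed as triples with the rdflib library; the namespace URI and the extra age literal are invented for illustration.

```python
from rdflib import Graph, Literal, Namespace

EX = Namespace("http://example.org/")  # hypothetical namespace
g = Graph()

# "Larry likes Lanai" as a subject-predicate-object triple.
g.add((EX.Larry, EX.likes, EX.Lanai))
g.add((EX.Larry, EX.age, Literal(42)))  # literals hold plain data values

# SPARQL over the in-memory graph: what does Larry like?
query = """
    SELECT ?o WHERE { <http://example.org/Larry> <http://example.org/likes> ?o }
"""
for row in g.query(query):
    print(row.o)
```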

One of the original use cases for graph databases, particularly knowledge graphs, has been the deployment of ontology. An ontology is a formal, explicit specification of concepts within a domain and the relationships between them. Ontologies significantly enhance traditional schemas by providing explicit definitions of classes, properties, and the constraints governing their connections. A well-defined ontology acts as a blueprint, encoding the precise structure of a graph database. The primary W3C standard for modeling ontologies is the Web Ontology Language (OWL).

When ontologies are applied to a specific domain, they collectively define the semantics (the meaning and relationships) for that data. Wikipedia and schema.org widely utilize RDF to express complex relationships in a machine-readable, semantic layer.

  • Labeled Property Graphs (LPG).

In an LPG, nodes can have one or more labels, which act as tags or types, categorizing the node (e.g., a node representing a company might have the label "Supplier"). Each node also possesses properties, which are key-value pairs (or even nested data structures) storing metadata about that node, such as contact name, address, phone, or region. Similarly, edges can carry their own properties to store metadata about the relationship itself (e.g., a purchase date property on an edge labeled "buys"), and each edge has a direction.

While RDF graph models are often considered edge-centric (or triple-centric), the Labeled Property Graph (LPG) model is property-centric. Often simply called ‘property graphs,’ this model is frequently perceived as more intuitive and flexible in its common query languages compared to some RDF serialization formats.

LPGs streamline upfront modeling relative to RDFs. They are more agile when addressing modern use cases such as knowledge graphs and generative AI.

  • RDF*

Pronounced “RDF star,” RDF* is a significant extension to the core RDF model. Its primary innovation is the ability to attach properties directly to RDF statements (triples). This capability is analogous to how relationships (edges) in Labeled Property Graphs (LPGs) can carry their own metadata.

By allowing metadata to be directly associated with triples, RDF* significantly enhances RDF’s expressiveness. This also enables crucial use cases such as establishing data provenance, where the origin and lineage of data are meticulously documented. For example, one could attach properties like ‘date added’ or ‘source’ directly to a specific triple, ensuring the context behind the data is formally captured within the graph.

Apache TinkerPop is an open-source graph computing framework that provides a comprehensive set of APIs and a robust reference implementation for general-purpose graph traversal and analytics. The framework includes TinkerGraph, an in-memory graph database that serves as a lightweight reference for development and testing.

Many prominent commercial and open-source graph database vendors leverage or integrate with Apache TinkerPop to build their products. This widespread adoption fosters significant interoperability, enabling users to more easily move data and queries between different TinkerPop-compliant graph systems.

Graph data stores persist data efficiently, but querying graph databases, processing their data, and performing analytics is not as straightforward as using SQL in an RDBMS. The next section takes a closer look at how the industry is responding to improve the graph database user experience.

Graph database querying

RDF graphs use the SPARQL query language, which is designed for exploration based on semantic relationships. The semantic structure is further explained by any OWL-powered ontology present. Well-known standards in this category include the Financial Industry Business Ontology (FIBO) and the healthcare interoperability standards from Health Level Seven (HL7), including the Fast Healthcare Interoperability Resources (FHIR). The goal of these open standards is interoperability between related applications offered by different vendors.

In contrast to RDF graphs, the property graph space has a standard query language called Graph Query Language (GQL), which is an ISO standard. Prior to the existence of GQL, Neo4j created its proprietary language, Cypher. Several technologies, like Amazon Neptune, Redis, and SAP HANA, have adopted its open-source version, openCypher. Alternatively, Apache TinkerPop has developed Gremlin, which is supported by IBM Db2, DataStax Enterprise (built on Apache Cassandra), Amazon Neptune, and Azure Cosmos DB. From this effort, a specification has emerged to run SPARQL on top of property graphs. Furthermore, Gremlin can also run on top of Cypher.
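
A minimal Cypher sketch using the Neo4j Python driver; the connection URI, credentials, labels, and properties are illustrative. The MATCH pattern traverses the relationship directly instead of expressing a join.

```python
from neo4j import GraphDatabase  # assumes a reachable Neo4j instance

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # Create two nodes connected by a LIKES relationship.
    session.run("CREATE (:Person {name: 'Larry'})-[:LIKES]->(:Place {name: 'Lanai'})")

    # Pattern matching follows the relationship directly, no join clause needed.
    result = session.run(
        "MATCH (p:Person)-[:LIKES]->(dest:Place) "
        "RETURN p.name AS person, dest.name AS place"
    )
    for record in result:
        print(record["person"], "likes", record["place"])

driver.close()
```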

Although SQL is most at home in relational databases, its popularity has led several vendors, such as Stardog, Cambridge Semantics (now part of Altair), and Ontotext (which merged with the Semantic Web Company and is now called Graphwise), to add support for it. This support enables common data visualization and BI tools, like Tableau, to connect to graph databases. All that's needed is a translation layer that converts ANSI SQL into the graph database's native query language; the open-source Apache Calcite framework is frequently used to provide this service.

Query capabilities are also being extended to ML notebooks, such as Apache Zeppelin and Jupyter Notebooks, to enable graph-based data science. Geospatial support is also being added to graph databases. Finally, hardware acceleration from Intel, Cray (now part of HPE), Katana Graph, and Graphcore is being added to use multiple cores and AI to speed up graph traversal.

A number of independent graph visualization tools interface with graph databases, such as Tom Sawyer and Linkurious. There are also open-source visualization tools, such as Gremlin visualizer, Gephi, and Keylines. Graph visualizations can also be embedded in applications using well-known libraries, such as D3.js and Plotly. Finally, graph databases support API access, generally via REST APIs.

Graph Algorithms

Less than 5% of total enterprise database workloads involve graph databases, but their adoption is growing in sectors like fraud detection, cybersecurity, recommendation systems, supply chain, and knowledge graphs. Graph algorithms analyze nodes’ degree of importance, how they are clustered, the paths, and distances between them to help reduce hallucinations in generative AI apps, find patterns, single points of failure, and relationships that are not easy to detect otherwise. Some well-known algorithms are:

  • Pathfinding

Path analysis is used by transportation companies (e.g., rail networks, airlines) to determine the most optimal routes based on cost or distance. Telecommunication companies utilize it to find the least costly routing for phone calls. The Minimum Spanning Tree connects all nodes at the lowest total cost and is often used as a heuristic for the classic 'Traveling Salesman Problem,' where a salesman must visit each city on his or her itinerary exactly once. Examples include Shortest Path, Minimum Weight Spanning Tree, Random Walk, and Breadth- and Depth-First Search.

  • Centrality

These algorithms identify the most important or influential nodes within a graph. Examples include Degree Centrality, Closeness Centrality, Eigenvector Centrality, Harmonic Centrality, and PageRank. Google originally delivered search results using the PageRank algorithm, named after co-founder Larry Page. Developed while Page and Sergey Brin were Ph.D. students at Stanford, the algorithm was patented by Stanford University and later licensed to Google. While the patent has since expired, PageRank remains a Google trademark.

  • Community Detection

These algorithms help to identify how graphs can be partitioned and grouped into cohesive communities or clusters of similar nodes. They are widely used in applications like recommending products based on common buying patterns. Examples include Clustering Coefficient, Label Propagation, Louvain Modularity, and Connected Components (Union Find).

  • Heuristic Link Prediction

These algorithms predict probable, yet unobserved, relationships between nodes based on their surrounding graph structure. They can also be used to estimate missing data. Examples include Same Community, Total or Common Neighbors, Resource Allocation, and Adamic-Adar.

  • Similarity

These algorithms quantify how similar nodes are based on comparisons of their features or connections. Examples include Euclidean Distance, Approximate K-Nearest Neighbors (KNN), Cosine Similarity, Jaccard Node Similarity, Overlap, and Pearson Similarity.

  • Embeddings and ML models

These algorithms transform the complex structure of a graph (nodes and their connections) into low-dimensional numerical vectors, reducing dimensionality for data science workloads. Graph databases like Neo4j can now perform supervised machine learning directly on these embeddings, with training and inference happening inside the database itself. Enterprise applications are increasingly combining graphs with deep neural networks to deliver advanced use cases, such as digital twins in industrial IoT. Node2Vec is an earlier embedding technique, FastRP is a faster alternative, and GraphSAGE learns a node's embedding from its local neighborhood.
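To ground a few of these families, here is a minimal sketch using the open-source NetworkX library on a toy graph; the node names and edge weights are arbitrary, and a graph database would run equivalent algorithms at much larger scale.

```python
# Illustrative runs of several algorithm families on a small NetworkX graph.
import networkx as nx
from networkx.algorithms import community

G = nx.Graph()
G.add_weighted_edges_from([
    ("A", "B", 1.0), ("B", "C", 2.0), ("A", "C", 2.5),
    ("C", "D", 1.0), ("D", "E", 1.5), ("E", "F", 1.0), ("D", "F", 2.0),
])

# Pathfinding: cheapest route between two nodes.
print(nx.shortest_path(G, "A", "F", weight="weight"))

# Centrality: PageRank scores indicating relative node importance.
print(nx.pagerank(G))

# Community detection: groups of densely connected nodes.
print(list(community.label_propagation_communities(G)))

# Similarity / link prediction: Jaccard similarity of node neighborhoods.
print(list(nx.jaccard_coefficient(G, [("A", "D")])))
```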

Graph-based retrieval augmented generation (GraphRAG) has been shown to improve the accuracy of RAG workloads. It combines several of the algorithms mentioned above (a simplified sketch follows the list), such as:

  • Pathfinding to find relevant connections
  • Centrality to prioritize important entities
  • Similarity or Graph Embeddings to find semantically related nodes/subgraphs
  • Community Detection to identify relevant knowledge clusters
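The sketch below shows, in a deliberately simplified and hypothetical form, how such steps might compose over a toy knowledge graph using NetworkX; real GraphRAG systems operate over far larger graphs, use embeddings for entity linking, and pass the retrieved context to an LLM.

```python
# A toy GraphRAG-style retrieval: connect question entities, rank nodes,
# expand to their communities, and assemble a text context for an LLM.
# The graph content and the extracted entities are invented for illustration.
import networkx as nx
from networkx.algorithms import community

kg = nx.Graph()
kg.add_edges_from([
    ("Aspirin", "Headache"), ("Aspirin", "Blood thinning"),
    ("Ibuprofen", "Headache"), ("Ibuprofen", "Inflammation"),
    ("Warfarin", "Blood thinning"), ("Warfarin", "Atrial fibrillation"),
])

question_entities = ["Aspirin", "Warfarin"]   # hypothetical entity-linking output

# 1. Pathfinding: connect the question entities through the graph.
path = nx.shortest_path(kg, question_entities[0], question_entities[1])

# 2. Centrality: order the path so the most connected facts come first.
rank = nx.pagerank(kg)
ordered = sorted(path, key=lambda n: rank[n], reverse=True)

# 3. Community detection: pull in the cluster around the retrieved nodes.
context_nodes = set(ordered)
for cluster in community.label_propagation_communities(kg):
    if context_nodes & set(cluster):
        context_nodes |= set(cluster)

# Assemble a textual context to pass to the LLM alongside the user question.
context = "; ".join(f"{u} is related to {v}" for u, v in kg.edges(context_nodes))
print(context)
```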

Graph Processing

While graph databases have historically excelled at operational use cases involving traversing highly connected data, recent enhancements now enable a single graph query to efficiently serve both operational (OLTP) and analytical (OLAP) workloads.

Consider a SPARQL query designed to retrieve the PageRank of a specific physician and then compare it against the highest-ranked physician in the same city. The first part of this query, retrieving the specific physician's PageRank, is operational (OLTP-like). The second part, identifying the highest-ranked physician across all physicians in that city, requires an analytical aggregation over a larger dataset, demonstrating the convergence of these distinct workload types within a single query.
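A hedged sketch of such a query is shown below using the rdflib Python library and a tiny in-memory RDF graph. It assumes the PageRank scores have already been computed and stored on each physician as an ex:pageRank property; the prefix, resource names, and values are invented.

```python
# Point lookup (OLTP-like) plus per-city aggregation (OLAP-like) in one SPARQL
# query, executed with rdflib over an illustrative in-memory graph.
from rdflib import Graph

data = """
@prefix ex: <http://example.org/> .
ex:drSmith a ex:Physician ; ex:city "Austin" ; ex:pageRank 0.42 .
ex:drJones a ex:Physician ; ex:city "Austin" ; ex:pageRank 0.87 .
ex:drPatel a ex:Physician ; ex:city "Dallas" ; ex:pageRank 0.65 .
"""

g = Graph()
g.parse(data=data, format="turtle")

QUERY = """
PREFIX ex: <http://example.org/>
SELECT ?myRank ?topRank WHERE {
  ex:drSmith ex:pageRank ?myRank ;     # operational: point lookup for one node
             ex:city     ?city .
  {                                    # analytical: aggregate over the whole city
    SELECT ?city (MAX(?rank) AS ?topRank) WHERE {
      ?doc a ex:Physician ; ex:city ?city ; ex:pageRank ?rank .
    } GROUP BY ?city
  }
}
"""

for row in g.query(QUERY):
    print(f"Dr. Smith: {row.myRank}, best in the same city: {row.topRank}")
```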

Pregel is a scalable graph processing framework developed by Google. It was designed to efficiently process large graphs through iterative computations, addressing the limitations of more general-purpose frameworks like MapReduce for graph-specific algorithms. It enables analytics on graph data distributed across a large number of computing nodes. It is particularly well-suited for iterative graph algorithms such as PageRank, Shortest Path, and Connected Components, where computations need to be performed over many cycles across the graph.
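The vertex-centric, superstep-based model that Pregel popularized can be sketched in a few lines of plain Python. The example below runs a simplified PageRank (with no dangling-node handling) on a hard-coded adjacency list; a real Pregel system would partition the vertices across many workers and exchange messages over the network.

```python
# A single-process sketch of Pregel-style computation: in each superstep every
# vertex sends messages along its out-edges, then updates its own value from
# the messages it received. Here the value is a simplified PageRank score.
graph = {                       # adjacency list: vertex -> outgoing neighbors
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}

DAMPING, SUPERSTEPS = 0.85, 20
rank = {v: 1.0 / len(graph) for v in graph}

for _ in range(SUPERSTEPS):
    inbox = {v: [] for v in graph}
    # Message phase: each vertex splits its rank across its out-edges.
    for v, neighbors in graph.items():
        for n in neighbors:
            inbox[n].append(rank[v] / len(neighbors))
    # Compute phase: each vertex recomputes its value from incoming messages.
    rank = {
        v: (1 - DAMPING) / len(graph) + DAMPING * sum(inbox[v])
        for v in graph
    }

print(rank)
```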

Ingesting data into graph databases requires specialized approaches, reflecting their unique structure. While simple methods like loading from CSV files are common for initial datasets, more advanced options are crucial for real-time streams, large volumes, and complex integrations:

  • Graph Database APIs (e.g., Apache TinkerPop): Many graph databases expose APIs (often based on frameworks like Apache TinkerPop) that allow programmatic insertion of nodes and edges. This provides fine-grained control for developers building custom ingestion pipelines (see the sketch after this list).
  • Real-time Streaming Connectors (e.g., Kafka Connect): For dynamic, real-time data flows, direct ingestion of streaming data is essential. Specialized connectors, often built on frameworks like Kafka Connect APIs, enable continuous data pipelines.
  • Vendor-Specific Utilities: Cloud-native graph databases and some standalone solutions provide optimized utilities for bulk loading.
  • Data Integration Frameworks: General-purpose data integration and orchestration tools also support graph database ingestion. Apache Hop is an example of a visual data pipeline tool that can be used to model and execute complex data flows into graph databases. Its metadata-driven approach and flexible architecture make it a viable option for various ingestion scenarios, including those for highly connected graph data.
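Picking up the first option above, here is a hypothetical ingestion sketch using Apache TinkerPop's gremlinpython driver against a Gremlin Server endpoint. The URL, labels, and property names are illustrative, and the target system must speak the Gremlin protocol.

```python
# Programmatic insertion of vertices and an edge via gremlinpython.
from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection

conn = DriverRemoteConnection("ws://localhost:8182/gremlin", "g")
g = traversal().withRemote(conn)

# Insert two nodes (vertices) and a relationship (edge) between them.
g.addV("customer").property("name", "Alice").as_("a") \
 .addV("product").property("sku", "B-123").as_("p") \
 .addE("purchased").from_("a").to("p") \
 .iterate()

# Verify: what did Alice purchase?
print(g.V().has("customer", "name", "Alice").out("purchased").values("sku").toList())

conn.close()
```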

Beyond the ingestion mechanism, effectively loading data into a graph database profoundly depends on its unique data modeling paradigm. Unlike relational or traditional NoSQL databases, creating models for graph databases means meticulously defining nodes and their relationships (edges), which is a fundamentally different process.

A particular challenge in graph modeling is the phenomenon of supernodes. These nodes possess an extraordinarily high number of edges, creating a disproportionate concentration of connections. A classic example is a social media celebrity with millions of followers, whose profile node would be a supernode. Such nodes can become "hotspots" that severely degrade query performance during graph traversals or lookups. To mitigate these issues, graph databases employ and recommend several optimization strategies:

  • Indexing: Implementing specialized indexes, such as B-Tree indexes, on edge properties or relationship types can significantly improve the lookup performance for edges connected to supernodes.
  • Refactoring: A more advanced technique involves refactoring the supernode itself. This can mean converting some of the edges into properties on the supernode, or introducing intermediate nodes to distribute the connections, effectively “shredding” the dense connections to alleviate the hotspot.
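As an illustration of the refactoring approach, the hypothetical Cypher below, submitted through the Neo4j Python driver, re-routes a celebrity's dense FOLLOWS edges through per-year bucket nodes. The labels, property names, and credentials are invented, it assumes each FOLLOWS relationship carries a since date property, and a production job would batch the updates rather than run them in a single transaction.

```python
# Hypothetical supernode refactoring: introduce intermediate bucket nodes so
# that followers connect to a per-year bucket instead of directly to the
# celebrity node.
from neo4j import GraphDatabase

REFACTOR = """
MATCH (f:User)-[r:FOLLOWS]->(c:Celebrity {id: $celeb_id})
WITH f, r, c, r.since.year AS year
MERGE (b:FollowerBucket {celebId: $celeb_id, year: year})
MERGE (b)-[:BUCKET_OF]->(c)
MERGE (f)-[:FOLLOWS_VIA]->(b)
DELETE r
"""

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
with driver.session() as session:
    session.run(REFACTOR, celeb_id="celebrity-42")
driver.close()
```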

Miscellaneous Database Types

Several other specialized database types cater to unique requirements or leverage distinct architectural approaches. Some of these have long-standing roots, while others represent emerging trends driven by new technological capabilities and evolving data demands.

  • Embedded Databases

These are lightweight database engines designed to run directly within an application’s process, rather than as a separate server. They typically manage their data in local files and handle resource allocation (like memory and I/O) directly within the host application’s context.

SQLite is the quintessential example, renowned for its incredible ubiquity. With over a trillion deployments, it powers web browsers, mobile phones, desktop applications, embedded systems, and IoT devices. Its simplicity, compact footprint, and transactional guarantees have made it arguably the most widely deployed database engine globally. Other examples include H2, Apache Derby, and RocksDB.
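The snippet below shows what "embedded" means in practice: SQLite runs inside the application process via Python's standard sqlite3 module, with no separate server to install or manage. The table and data are, of course, illustrative.

```python
# SQLite as an embedded database: the engine runs in-process and stores data
# in a single local file (or, as here, entirely in memory).
import sqlite3

conn = sqlite3.connect(":memory:")     # pass a file path for durable storage
conn.execute("CREATE TABLE sensors (id INTEGER PRIMARY KEY, name TEXT, temp REAL)")
conn.executemany(
    "INSERT INTO sensors (name, temp) VALUES (?, ?)",
    [("boiler", 81.5), ("intake", 23.1)],
)
conn.commit()

for row in conn.execute("SELECT name, temp FROM sensors WHERE temp > 50"):
    print(row)                         # ('boiler', 81.5)

conn.close()
```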

etcd is an open-source, strongly consistent, distributed key-value store that uses the Raft consensus protocol to ensure data integrity across a cluster. Kubernetes uses etcd as its primary data store to keep track of its cluster’s state and configuration and for service discovery. Written in Go, etcd is also used by other large-scale systems, such as Uber’s metrics database, M3.

  • Hardware-Accelerated Databases

These databases are specifically optimized to leverage specialized hardware, most notably Graphics Processing Units (GPUs), to accelerate data processing. GPUs excel at massively parallel computations, making them exceptionally well-suited for certain analytical workloads that involve complex mathematical operations across large datasets. This category also includes databases that leverage Field-Programmable Gate Arrays (FPGAs) or other custom silicon for high-speed, specialized tasks.

Some well-known examples of hardware-accelerated databases are Kinetica, OmniSci (now known as HEAVY.AI), SQreamDB, and the PostgreSQL-compatible BrytlytDB.

Appliances are pre-configured hardware and software installations designed to run large, complex workloads. Netezza (later acquired by IBM), founded in 1999, ushered in the era of database appliances with its PostgreSQL-compatible data warehouse appliance. Other notable examples include Oracle Exadata and Yellowbrick Data.

  • In-Memory Databases (IMDBs) and In-Memory Data Grids (IMDGs)

This category focuses on achieving ultra-low latency and high throughput by primarily storing and processing data in RAM rather than on persistent disk storage. These systems are not a distinct data model in themselves; rather, they represent a performance optimization applied across various database types. They are typically deployed as a caching layer or a high-speed operational store that sits between the application and the persistent database.

In-Memory Databases (IMDBs) are full-fledged database systems where the primary working set of data resides in the computer’s main memory. The emphasis is on maximizing read/write speed for applications requiring immediate data access, such as high-frequency trading, real-time recommendation engines, or fraud detection. Data persistence is typically handled by asynchronously writing to disk or replicating to standby nodes to ensure durability.

In-Memory Data Grids (IMDGs) are distributed systems that manage data in memory across a cluster of servers, providing high scalability and fault tolerance. While they can store data persistently, their primary role is real-time data distribution and processing for applications that require an extremely low-latency, high-concurrency data fabric.

Well-known examples in this category include Apache Ignite and Hazelcast, which are both open-source. While Apache Ignite is a community-driven project, it has a commercial offering from GridGain.
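The cache-aside pattern commonly used with these systems can be sketched in a few lines. In the hypothetical example below, a plain Python dict stands in for the in-memory store's client and a function stands in for the persistent database; a real deployment would use a product client (for example, Hazelcast's or Ignite's) and handle invalidation more carefully.

```python
# Cache-aside: check the in-memory store first, fall back to the persistent
# database on a miss, and populate the cache for subsequent reads.
import time

cache = {}                        # stand-in for an in-memory store client
CACHE_TTL_SECONDS = 60

def load_from_database(customer_id: str) -> dict:
    # Stand-in for a query against the persistent system of record.
    return {"id": customer_id, "name": "Alice", "tier": "gold"}

def get_customer(customer_id: str) -> dict:
    entry = cache.get(customer_id)
    if entry and time.time() - entry["cached_at"] < CACHE_TTL_SECONDS:
        return entry["value"]                    # cache hit: served from RAM
    value = load_from_database(customer_id)      # cache miss: go to the database
    cache[customer_id] = {"value": value, "cached_at": time.time()}
    return value

print(get_customer("c-42"))   # first call misses and loads from the database
print(get_customer("c-42"))   # second call is served from the in-memory cache
```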

  • Ledger and Blockchain Databases

These databases provide highly secure, verifiable, and immutable records of data, ideal for audit trails, compliance, and transactions where data integrity and non-repudiation are paramount.

Ledger Databases are typically centralized database services that employ cryptographic techniques to create an immutable, append-only record of all data changes. Every revision is cryptographically chained to the previous one, forming a tamper-evident history. This is often achieved using Merkle trees, where each new record generates a cryptographic hash that contributes to a verifiable “digest” of the entire ledger. When a record is written, the database can provide a cryptographically verifiable proof that guarantees its inclusion and immutability. These are often accessed via standard web service APIs (e.g., REST).
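The core idea of cryptographic chaining can be sketched in a few lines of Python. This toy example hashes each record together with the previous record's hash, so tampering with any earlier entry breaks every hash that follows; production ledger databases layer Merkle trees and signed digests on top of this mechanism.

```python
# A toy tamper-evident ledger: each entry's hash covers its payload and the
# previous entry's hash, forming a verifiable chain.
import hashlib
import json

def record_hash(previous_hash: str, payload: dict) -> str:
    body = json.dumps({"prev": previous_hash, "data": payload}, sort_keys=True)
    return hashlib.sha256(body.encode()).hexdigest()

ledger = []
prev = "0" * 64                                      # genesis value
for payload in [{"acct": "A", "amount": 100}, {"acct": "A", "amount": -30}]:
    entry = {"data": payload, "prev": prev, "hash": record_hash(prev, payload)}
    ledger.append(entry)
    prev = entry["hash"]

# Verification: recompute the chain and compare against the stored hashes.
intact = all(
    e["hash"] == record_hash(e["prev"], e["data"])
    and (i == 0 or e["prev"] == ledger[i - 1]["hash"])
    for i, e in enumerate(ledger)
)
print("ledger intact:", intact)
```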

Blockchain Databases (Distributed Ledgers) are a subset of ledger databases that operate on a decentralized, distributed network of nodes. Data is organized into cryptographically linked blocks, and consensus mechanisms ensure data integrity and agreement across all participants. While commonly associated with cryptocurrencies, permissioned blockchain solutions (e.g., Hyperledger Fabric, R3 Corda) are being explored for use cases like supply chain traceability, digital identity management, and cross-organizational financial transactions where a shared, trusted, immutable record is needed among multiple parties.

Some traditional databases are now integrating ledger capabilities. Oracle Database’s Blockchain Tables, for example, allow users to create immutable tables that leverage blockchain technology for verifiable data integrity, which can still be queried using standard SQL, but their content cannot be modified once committed.

This trend of integrating specialized functionalities, such as immutable ledgers, directly into existing database platforms exemplifies a broader industry movement. This convergence of capabilities serves as a perfect transition to explore the topic of multi-model databases.

Multi-model

The increasing diversity of data types and access patterns has traditionally led organizations to deploy multiple specialized databases: for example, a relational database for structured data, a document store for semi-structured content, and a graph database for relationships. However, managing and integrating these disparate systems incurs significant operational overhead and complexity. In response, a major trend in the database industry is the move toward multi-model databases, which aim to reduce this overhead by supporting multiple data models within a single, unified system.

Converged Databases

One approach to multi-model functionality involves enhancing a core database engine to natively support additional data models. PostgreSQL is a prime example of a relational database that has evolved to incorporate robust support for semi-structured data, particularly through its JSON and JSONB data types.

The JSON data type stores JSON documents as plain text. While simple, querying requires re-parsing the text for each operation. In contrast, the JSONB data type (Binary JSON) parses the document into a decomposed, binary format upon ingestion. This optimized internal representation removes whitespace and allows for efficient indexing (e.g., using GIN indexes) to enable rapid key-value lookups and full-text searches within the JSON documents.

PostgreSQL further extends its JSONB capabilities with a rich set of JSON-specific operators and functions, allowing for powerful manipulation and querying of nested JSON structures directly within SQL queries, stored procedures, triggers, and functions. This enables developers to combine the flexibility of a schema-less JSON model with the ACID properties and strong consistency of a relational database. Complex XML or JSON documents can also be “shredded” — a process where parts of the semi-structured document are extracted and normalized into traditional relational tables, allowing for seamless integration with existing relational schemas and operations.
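A short sketch of these JSONB capabilities is shown below using the psycopg2 driver; the connection parameters, table, and document shape are illustrative.

```python
# PostgreSQL JSONB in practice: store documents, index them with GIN, and
# query them with JSON operators alongside regular SQL.
import psycopg2
from psycopg2.extras import Json

conn = psycopg2.connect("dbname=appdb user=app password=secret host=localhost")
cur = conn.cursor()

cur.execute("CREATE TABLE IF NOT EXISTS events (id serial PRIMARY KEY, doc jsonb)")
cur.execute("CREATE INDEX IF NOT EXISTS events_doc_gin ON events USING GIN (doc)")

cur.execute(
    "INSERT INTO events (doc) VALUES (%s)",
    [Json({"type": "click", "user": "alice", "meta": {"page": "/pricing"}})],
)

# @> (containment) can use the GIN index; ->> extracts a field as text.
cur.execute(
    "SELECT doc->>'user', doc->'meta'->>'page' FROM events WHERE doc @> %s",
    [Json({"type": "click"})],
)
print(cur.fetchall())

conn.commit()
cur.close()
conn.close()
```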

API Compatibility-led

A second, distinct approach to multi-model functionality is provided at the access layer through API compatibility. In this scenario, the underlying storage engine might be a single, highly optimized distributed system, but it abstracts this complexity by exposing multiple, well-known APIs. Developers can then interact with the data store using the API native to their preferred data model, without needing to understand the underlying storage mechanism.

Azure Cosmos DB is a leading example of this paradigm. It is a globally distributed, multi-model database service that supports several APIs, making it appear as different database types to the developer. Its APIs include MongoDB, Apache Cassandra, and Gremlin.
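The appeal of this approach is that existing drivers and code continue to work. In the hypothetical sketch below, standard PyMongo code targets either a local MongoDB instance or Cosmos DB's MongoDB API, with only the connection string changing; the credentials shown are placeholders.

```python
# The same PyMongo code can point at MongoDB or at Azure Cosmos DB's
# MongoDB-compatible API; only the connection string differs.
from pymongo import MongoClient

# e.g., "mongodb://localhost:27017" for MongoDB, or a Cosmos DB-provided string
# such as "mongodb://<account>:<key>@<account>.mongo.cosmos.azure.com:10255/?ssl=true"
client = MongoClient("mongodb://localhost:27017")

orders = client["shop"]["orders"]
orders.insert_one({"order_id": 1001, "customer": "alice", "total": 42.50})
print(orders.find_one({"customer": "alice"}))
```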

ArangoDB is a notable example that was built from the ground up as a multi-model database. It natively supports document, key-value, and graph data models, and can query across them using a single declarative query language called AQL (ArangoDB Query Language).
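A hedged sketch of that cross-model querying is shown below using the python-arango driver: a single AQL statement filters documents and then traverses a named graph. The database, collection, and graph names are invented.

```python
# One AQL query mixing the document model (FILTER on attributes) with the
# graph model (a 1..2-hop traversal over a named graph).
from arango import ArangoClient

client = ArangoClient(hosts="http://localhost:8529")
db = client.db("shop", username="root", password="secret")

AQL = """
FOR c IN customers
  FILTER c.tier == "gold"
  FOR v, e IN 1..2 OUTBOUND c GRAPH "purchases"
    RETURN { customer: c.name, reached: v.name }
"""

for row in db.aql.execute(AQL):
    print(row)
```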

This API-driven approach offers tremendous flexibility for developers, allowing them to choose the best data model API for a particular application component while benefiting from a unified management plane, global distribution, and consistent guarantees (e.g., availability, latency, throughput) provided by the single underlying platform. This strategy streamlines operations and simplifies architectural choices for diverse application needs.

Zero-ETL

While a multi-model database is a single system that handles multiple data models, zero-ETL represents a distinct architectural approach. It is a data integration pattern that automatically synchronizes data between two separate, specialized systems. For instance, data can be replicated from a transactional database (OLTP) to an analytical data warehouse (OLAP), or from a key-value database (e.g., Amazon DynamoDB) to a vector-enabled search database (e.g., Amazon OpenSearch).

The key is that the two databases remain distinct. Data is continuously replicated from the source to the destination without the need for traditional, manually built ETL pipelines. This allows developers to query data from the best-fit database for their workload while ensuring the data remains fresh with minimal latency.

Summary

The options for DBMS have grown tremendously, with the DB-Engines website listing 424 products as of August 2025. While industry experts have long predicted consolidation, the reality is that many new databases are introduced each year. Although RDBMS remain the most common category, non-relational databases are growing at a much faster rate.

This blog focuses primarily on DBMS categories that serve operational use cases and workloads. The world of analytical data stores, including data warehouses, data lakes, lakehouses, and open table formats like Apache Iceberg, Delta Lake, and Apache Hudi, will be covered in a subsequent blog.


Written by Sanjeev Mohan

Sanjeev researches the space of data, analytics & AI. Most recently he was a research vice president at Gartner. He is now a principal at advisory firm, SanjMo.
