How PostgreSQL is Dominating AI and Multicloud

Sanjeev Mohan
10 min read · Oct 19, 2024


PostgreSQL, once considered a niche database used primarily for academic and research purposes, has experienced a remarkable surge in popularity and adoption in recent years. It has rapidly evolved from a purely operational relational database management system (RDBMS) into a top contender for both on-premises and cloud-based enterprise workloads.

PostgreSQL’s reputation as a versatile, open-source database continues to grow. Its adaptability, robust features, and strong community have fueled a transformation over the years that has steadily broadened its appeal. Notable advancements, such as enhanced performance, new data types, and seamless multi-cloud integration, are reshaping the way organizations deploy and manage their data. The recent rise of cloud-native PostgreSQL services like AlloyDB demonstrates its ability not only to compete with but also to outshine legacy databases and other open-source alternatives.

This blog examines the evolution of PostgreSQL and the key factors driving its growth, including new capabilities such as vector search, varied cloud deployment options, performance optimizations, and automation and security innovations.

Brief history of PostgreSQL

Launched in 1986 as part of a research project led by Michael Stonebraker at the University of California, Berkeley, PostgreSQL has its roots in a time when relational databases were still a relatively novel concept. It was originally developed as the successor to Ingres, an earlier relational database system also created by Stonebraker. Over time, PostgreSQL has steadily grown in popularity due to its flexibility, extensibility, and adherence to ACID (Atomicity, Consistency, Isolation, Durability) principles.

While PostgreSQL was initially viewed as an academic alternative to larger commercial databases such as Oracle and Microsoft SQL Server, it has since grown into a powerhouse in the open-source community. Even the hyperscalers, AWS, Google Cloud, and Microsoft Azure, which have their own proprietary DBMS offerings, have adopted PostgreSQL for its scalability, native SQL support, and powerful query engine.

PostgreSQL is governed by the PostgreSQL Global Development Group, a community of individual developers and companies that contribute to the project. It is released under the PostgreSQL License, a permissive open-source license that allows free use and modification. The community has attracted over 600 contributors and over 50,000 commits in its 38 years of development. In 2022 alone, PostgreSQL saw over 10,000 commits.

PostgreSQL follows a consistent release cycle, with major updates launched annually, usually around September. Each major release includes new features, performance improvements, security enhancements, and bug fixes. Minor versions, released quarterly, focus mainly on patching bugs and vulnerabilities and do not introduce new features. The latest major version is PostgreSQL 17, released in September 2024.

PostgreSQL Compatible Landscape

As an open-source database, PostgreSQL offers a high degree of flexibility, customization, and control. This has made it attractive to organizations seeking to avoid vendor lock-in and reduce costs. The vendor community has responded with rich offerings, each with its own unique value proposition. These offerings range from straightforward ports of the open-source version to highly optimized variants that maintain compatibility with the original PostgreSQL protocol. The options fall into three main categories:

  • Port: These products offer preconfigured managed services, with tuned database environments and additional proprietary management features, built for the core PostgreSQL codebase. A standard port is the fastest option to upgrade and adopt the newest open-source base release and the easiest to migrate from one vendor to another. Examples include Crunchy Data and Neon.
  • Optimized: These products diverge from the main codebase by introducing proprietary yet compelling functionality without sacrificing high levels of compatibility. These products provide customers with full application portability between open source PostgreSQL and proprietary offerings, but may have different, or often better, performance characteristics and add-on features within the database engine. While vendors may contribute to the open-source community, some capabilities remain proprietary long term, and they typically take a bit longer to adopt the latest open-source versions than simple ports. One common enhancement is the creation of shared storage systems, enabling PostgreSQL to support the separation of compute and storage. Examples include AWS Aurora and Google Cloud AlloyDB.
  • Wire-Compatible: These products emulate PostgreSQL’s wire protocol at the API level, but their implementation is proprietary. These products are often built to take advantage of cloud-native infrastructure, and can offer customers unique capabilities like strong cross-region consistency, or horizontal scaling of reads and writes, that are difficult to achieve with simple ports or optimized databases. Examples include CockroachDB, Google Cloud Spanner and Yugabyte.

The table below shows the cloud hyperscalers’ PostgreSQL offerings across the three categories:

Several technical reports have detailed PostgreSQL’s standard offerings. In this blog, we explore how this venerable DBMS has undergone significant transformations to become cloud-native. These transformations are observed across several key categories:

  • Architecture: Historically, PostgreSQL followed a monolithic software stack. However, modern optimized iterations have decoupled the storage and compute layers, optimizing for cloud environments. Some PostgreSQL offerings have added serverless capabilities, allowing automatic scaling and reduced operational overhead. Others have added automation around maintenance tasks such as autovacuum, along with automatic memory management.
  • Deployment: While the original database was designed for on-premises use, PostgreSQL today supports hybrid, multi-cloud options. This flexibility ensures that enterprises can deploy PostgreSQL in private, public, or hybrid cloud environments with ease, offering seamless movement across infrastructure.
  • Hybrid Search: Initially, PostgreSQL was built to support traditional SQL-based keyword searches. However, with the rise of AI, PostgreSQL now includes vector search functionality, empowering the database to meet the demands of machine learning and other AI workloads.
  • Data Types: PostgreSQL’s original focus was on relational data, but it has since evolved to support a broader range of data types. Native JSON support allows handling of semi-structured data, and thanks to its extensions framework, PostgreSQL now boasts some of the best geospatial capabilities in the industry.
  • Non-functional Enhancements: Over the years, PostgreSQL has seen continuous advancements in terms of performance, scalability, reliability, security, and usability. These improvements have cemented PostgreSQL as one of the most trusted and efficient databases, capable of handling the complex needs of modern applications.

In the next section, we’ll dive into each of these categories in greater detail, using vendor products like AlloyDB to demonstrate real-world applications and advancements in the PostgreSQL ecosystem.

Architecture

Over its nearly four-decade history, PostgreSQL has transformed from a monolithic architecture into a modular, cloud-friendly design through the separation of storage and compute layers. As a result, a read-heavy workload can be given more compute nodes to manage the query load without needing extra storage, which offers significant advantages in elasticity, resilience, and cost-efficiency. AlloyDB’s storage service runs on Borg, Google’s cluster management system and the precursor to Kubernetes (now part of the Cloud Native Computing Foundation, CNCF), giving the storage layer built-in compute to scale I/O and essentially eliminate I/O bottlenecks.

The shared storage architecture allows all compute nodes to access the same data from a single source of truth, reducing redundancy: multiple read replicas can access the same data without duplicating storage. It also simplifies backups and point-in-time recovery by automating tasks that would otherwise require manual intervention, reducing the risk of human error.

At the core of AlloyDB is Colossus, Google’s distributed storage system. Colossus is designed to handle large datasets with high availability, low latency, and strong consistency. It can be deployed across multiple regions, providing low-latency access to data while adhering to data residency restrictions. It is a key component of many Google services, including BigQuery, Bigtable, Firestore, Google Search, Gmail, and Google Drive.

Both Colossus and Borg are exabyte-scale, horizontally scalable systems designed for extreme availability and reliability. Colossus also incorporates data compression techniques that help AlloyDB manage storage more efficiently and improve query performance.

Although AlloyDB is not fully serverless, its architecture provides many of the same benefits. For example, scaling compute resources with the workload and separating storage from compute mimic the elasticity of serverless environments while giving teams more control over resources.

Deployment

Data has gravity, and organizations are often reluctant to move it from one location to another without a compelling reason. Moving large amounts of data exposes organizations to added overhead costs, security breaches, transformation challenges, and significant latency. In fact, the current trend is to bring analytics and AI to where data already resides. Hence the growing emphasis on hybrid multi-cloud deployments that minimize risk, avoid vendor lock-in, and optimize the use of existing infrastructure.

AlloyDB Omni, the downloadable edition of Google’s AlloyDB for PostgreSQL, is designed to support hybrid and multi-cloud deployments, making it possible to run AlloyDB workloads in any environment, whether Google Cloud, AWS, Azure, or on-premises infrastructure. This cloud-agnostic approach gives enterprises the flexibility to choose the infrastructure that best suits their needs, whether they want to leverage the scalability of public clouds or maintain control with private or hybrid environments. The database can reside in the cloud provider and region where data is generated or consumed, which benefits latency-sensitive applications, and AlloyDB Omni can replicate data across multiple regions to ensure high availability and disaster recovery.

The AlloyDB control plane is managed by Google Cloud (and, in the future, by Aiven), as it is deeply integrated into Google’s infrastructure. It acts as the central layer that orchestrates the management, scaling, and coordination of AlloyDB’s compute and storage resources.

AlloyDB Omni uses Kubernetes to manage and orchestrate its workloads, so it can run in any Kubernetes environment, in the cloud or on-premises. Since Kubernetes is widely supported across the major cloud providers (Google Kubernetes Engine, AWS EKS, Azure AKS), AlloyDB Omni runs seamlessly in multi-cloud and hybrid-cloud environments, scaling and managing containerized AlloyDB services efficiently across platforms.

AlloyDB Omni is fully compatible with PostgreSQL, meaning workloads can be migrated from traditional PostgreSQL databases to AlloyDB without requiring changes in the application layer.

Hybrid Search

Organizations require powerful querying capabilities that combine traditional relational database operations with more advanced search features, including keyword search, full-text search, and more recently, vector search for AI workloads.

Traditional keyword search, the simplest form of search in databases, queries specific columns or fields for exact or partial matches, allowing users to find data based on specific keywords or phrases; it is most useful for structured data such as product names and customer information. Full-text search lets users find words and phrases within documents for tasks such as document retrieval, content analysis, and information retrieval. Vector search performs similarity-based lookups on high-dimensional data to find related items or concepts, powering applications such as recommendation engines, image and video search, and natural language processing.
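
As a minimal sketch of the first two styles, here is how keyword and full-text search look in plain PostgreSQL (the articles table and its columns are hypothetical, used only for illustration):

    -- Hypothetical table for illustration
    CREATE TABLE articles (
        id    bigserial PRIMARY KEY,
        title text,
        body  text
    );

    -- Keyword search: exact or partial match against a column
    SELECT id, title
    FROM articles
    WHERE title ILIKE '%postgresql%';

    -- Full-text search: language-aware matching over document text,
    -- accelerated by a GIN index on the tsvector expression
    CREATE INDEX articles_body_fts
        ON articles USING gin (to_tsvector('english', body));

    SELECT id, title
    FROM articles
    WHERE to_tsvector('english', body) @@ to_tsquery('english', 'cloud & database');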

The pgvector extension and ScaNN (Scalable Approximate Nearest Neighbors) are the key components that give AlloyDB its vector search capabilities. The former is an open-source PostgreSQL extension that supports vector storage and similarity search directly within the database: embeddings are stored in a native data type, and the extension provides operations such as finding nearest neighbors (similar items) using distance metrics like cosine similarity or Euclidean distance. ScaNN is a highly efficient approximate nearest neighbor (ANN) search algorithm developed by Google and used by services like YouTube and Search. It is designed to scale and accelerate the process of finding similar vectors in large datasets while maintaining high accuracy.

By default, pgvector uses Hierarchical Navigable Small World (HNSW), a graph-based approach to ANN search that excels at providing quick search times, especially in high-dimensional spaces. AlloyDB, meanwhile, introduced the ScaNN index to further improve speed, accuracy, and memory efficiency.
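
A minimal pgvector sketch, assuming the extension is installed and using a toy three-dimensional embedding (real embeddings typically have hundreds of dimensions; the documents table is hypothetical):

    -- Enable pgvector and store embeddings alongside relational data
    CREATE EXTENSION IF NOT EXISTS vector;

    CREATE TABLE documents (
        id        bigserial PRIMARY KEY,
        content   text,
        embedding vector(3)   -- toy dimension; real models emit e.g. 768 values
    );

    -- HNSW index using cosine distance
    CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops);

    -- Top-5 nearest neighbors to a query embedding (<=> is cosine distance)
    SELECT id, content
    FROM documents
    ORDER BY embedding <=> '[0.1, 0.2, 0.3]'
    LIMIT 5;

    -- On AlloyDB, the ScaNN index can be created as an alternative index
    -- method over the same column (see the AlloyDB documentation for syntax).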

Data Types

PostgreSQL has always been recognized for its rich set of native data types, but recent additions and enhancements have bolstered its ability to handle more complex workloads. For example, with the addition of JSON_TABLE in PostgreSQL 17, it has largely achieved parity with specialized NoSQL document databases such as MongoDB. The JSONB data type stores JSON in a binary format, allowing fast indexing, querying, and storage of semi-structured data alongside relational data.
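
As a brief illustration of JSONB and the new JSON_TABLE function (the orders table and its JSON fields are assumptions for the example):

    -- JSONB document stored alongside relational columns
    CREATE TABLE orders (
        id      bigserial PRIMARY KEY,
        placed  timestamptz DEFAULT now(),
        payload jsonb
    );

    -- A GIN index accelerates containment queries on the document
    CREATE INDEX orders_payload_gin ON orders USING gin (payload);

    SELECT id, payload->>'customer' AS customer
    FROM orders
    WHERE payload @> '{"status": "shipped"}';

    -- PostgreSQL 17: JSON_TABLE flattens nested JSON into relational rows
    SELECT o.id, jt.sku, jt.qty
    FROM orders o,
         JSON_TABLE(o.payload, '$.items[*]'
                    COLUMNS (sku text PATH '$.sku',
                             qty int  PATH '$.qty')) AS jt;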

As noted earlier, PostgreSQL’s geospatial strength comes largely from the PostGIS extension, which has long provided comprehensive support for geospatial data. PostgreSQL also natively supports arrays and range types, which have been extended to support more complex queries and data analysis; these data types allow developers to store and manipulate multi-dimensional data within the database.
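
A short sketch of range and array types, using a hypothetical bookings table (btree_gist is needed to combine scalar equality with range overlap in the exclusion constraint):

    CREATE EXTENSION IF NOT EXISTS btree_gist;

    CREATE TABLE bookings (
        room_id int,
        tags    text[],
        during  tstzrange,
        EXCLUDE USING gist (room_id WITH =, during WITH &&)  -- no double bookings
    );

    -- Range overlap: reservations that intersect a given hour
    SELECT room_id
    FROM bookings
    WHERE during && tstzrange('2024-10-19 09:00+00', '2024-10-19 10:00+00');

    -- Array containment: rows whose tags include 'vip'
    SELECT room_id FROM bookings WHERE tags @> ARRAY['vip'];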

In the previous section on Hybrid Search, we discussed the addition of vector data types through the pgvector extension. Other extensions transform PostgreSQL into a distributed database or a time-series database.

PostgreSQL’s adaptability ensures it remains a versatile solution for companies that need to handle complex and non-traditional data.

Non-functional Enhancements

Non-functional improvements refer to enhancements in areas such as performance, reliability, and security rather than new features.

As applications scale, database performance can become a bottleneck. PostgreSQL has therefore seen ongoing enhancements in performance, scalability, and indexing, especially for large-scale applications. For example, partition pruning and improved indexing significantly speed up queries on large datasets. The ability to parallelize queries, allowing multiple tasks to be processed simultaneously for faster data retrieval, is especially useful for analytics-heavy workloads. For ordered queries, PostgreSQL now performs incremental sorting, reducing overall processing time and optimizing resource utilization.
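
To make partition pruning concrete, here is a minimal sketch with declarative range partitioning (the measurements table is illustrative):

    CREATE TABLE measurements (
        ts        timestamptz NOT NULL,
        device_id int,
        reading   numeric
    ) PARTITION BY RANGE (ts);

    CREATE TABLE measurements_2024_09 PARTITION OF measurements
        FOR VALUES FROM ('2024-09-01') TO ('2024-10-01');
    CREATE TABLE measurements_2024_10 PARTITION OF measurements
        FOR VALUES FROM ('2024-10-01') TO ('2024-11-01');

    -- The planner prunes partitions that cannot match the WHERE clause;
    -- EXPLAIN should show only measurements_2024_10 being scanned
    EXPLAIN
    SELECT avg(reading)
    FROM measurements
    WHERE ts >= '2024-10-15' AND ts < '2024-10-16';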

Low-latency analytics result in improved user experiences in data-heavy applications, and reduced costs through more efficient resource usage.

As more data moves into the cloud and across multiple environments, the risk of breaches grows and regulations become increasingly stringent, so security remains a top priority for database administrators. PostgreSQL continues to enhance its security features, ensuring that data remains protected in an increasingly distributed environment.

PostgreSQL now supports enhanced encryption mechanisms for securing data both at rest and in transit. In addition, its row-level security capabilities allow fine-grained control over data access, ensuring that users see only the data relevant to their roles. Advanced role management and support for external authentication mechanisms (such as LDAP and Kerberos) provide greater control over database access. Together, these capabilities make it easier to manage sensitive data, enabling regulatory compliance and reducing the risk of data breaches.
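
Row-level security is easy to see in a small example; the accounts table and app_users role below are hypothetical, and the policy assumes the owner column stores database role names:

    CREATE TABLE accounts (
        id      bigserial PRIMARY KEY,
        owner   text NOT NULL,   -- assumed to hold a database role name
        balance numeric
    );

    ALTER TABLE accounts ENABLE ROW LEVEL SECURITY;

    -- Rows are visible only when the owner column matches the connected role
    CREATE POLICY owner_only ON accounts
        USING (owner = current_user);

    CREATE ROLE app_users;                   -- hypothetical application role
    GRANT SELECT ON accounts TO app_users;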

As PostgreSQL is widely used for mission-critical applications, downtime or unavailability can have severe consequences, making reliability a critical attribute. The open-source PostgreSQL community has developed extensions that address areas like replication, failover, point-in-time recovery (PITR), backup, and disaster recovery.

Google Cloud has introduced a fleet management solution, called Database Center, that spans all of its PostgreSQL-compatible offerings: Cloud SQL, AlloyDB, and Spanner. Not only does it provide visibility into the operational health of the databases, but it also uses AI to make recommendations that optimize performance, security, and reliability.

Conclusion

PostgreSQL’s recent advancements, particularly in multi-cloud compatibility, performance, and extensibility, make it a top choice for modern enterprises. As businesses increasingly shift toward distributed cloud infrastructure, open-source PostgreSQL’s adaptability ensures that it remains a leading database solution. PostgreSQL continues to evolve and meet the needs of today’s most demanding applications.

By adopting PostgreSQL, organizations can future-proof their database strategy, ensuring flexibility, security, and performance in an increasingly multi-cloud world.


Written by Sanjeev Mohan

Sanjeev researches the space of data and analytics. Most recently he was a research vice president at Gartner. He is now a principal with SanjMo.
