Databricks Data & AI 2024 Summit: A Spark of Innovation
Two weeks of back-to-back Snowflake Summit and Databricks Data and AI Summit (DAIS) left the community with more questions than ever. It seems we are at a pivotal crossroads in how future analytical architectures will emerge. Some are predicting the end of data warehouses altogether, given the rise of open table formats. Others feel that the vibrant ecosystem of independent software providers will be subsumed by the large ecosystem players offering integrated solutions.
In this blog, I capture my key learnings from DAIS 2024, which followed the Snowflake Summit. Both conferences took place at the Moscone Center in San Francisco in the first two weeks of June. Incidentally, next year will be an exact repeat in both location and timing.
As far as the content is concerned, some folks half-joked that I could take my Snowflake Summit blog, replace Cortex with Mosaic and Polaris with Unity, and my work would be done! While both vendors are definitely headed in the same direction, they couldn't be more different in their approaches.
Snowflake wants to be an intelligent data apps cloud while Databricks wants to be the “open” and versatile platform for all data roles. Even their taglines reflect their intent. Snowflake calls itself the AI Data Cloud and Databricks calls itself the Data Intelligence Platform.
Make no mistake, both are moving up the stack as the primary compute engines with co-pilots and agents becoming the primary applications. Both are also moving down the stack to leverage GPUs and add foundational capabilities that compete with many of their ecosystem partners. In the end, customers should benefit from these developments.
Figure 1 is the overall framework for organizing our discussion of the key announcements from DAIS 24. The pace of development far exceeded what we saw at DAIS 2023. As always, this document does not mention the availability status (private preview, public preview, or generally available) of the new announcements. Please check Databricks’ documentation for the latest status.
We start our analysis with overall news that is not even in the above figure.
Overall — 100% Serverless
One long-standing criticism of Databricks has been the higher effort needed to use it compared to Snowflake. Databricks addresses this by making all of its services serverless, and most future developments will be released only on the serverless platform. Customers now have three choices:
- Dedicated: This is the traditional option where Databricks clusters are deployed in the customers’ accounts. The data plane resides in the customer account and the control plane in Databricks’ account. It is important to note that customers have separate bills for Databricks usage and the underlying cloud provider.
- Serverless: This mode doesn't need any dedicated clusters. The data still resides in the customer's account, while compute clusters are provisioned in Databricks' account only when needed. This allows the company to roll out new versions to all its users at the same time. The Databricks bill now bundles the cloud provider usage. All 12K Databricks customers will be migrated in phases starting July 1, 2024. Cluster startup time is now down to just 5 seconds.
- SaaS: This is a brand-new offering launched at DAIS 24, where both the data and control planes reside in Databricks' account. It is a very small percentage of Databricks' business so far and has been released to ease the onboarding of new users looking for zero overhead.
Delta Lake 4.0
The bedrock of Databricks' data announcements centers on three themes: cost, quality, and security. Delta Lake is where data is stored, and Unity Catalog is how it is discovered and governed.
Databricks is known for making billion-dollar acquisitions just before its annual conferences. Last year it was MosaicML, and this year it was Tabular, founded by the co-creators of the Apache Iceberg table format. In retrospect, the $100M Arcion acquisition now seems paltry in comparison, considering its role in LakeFlow (more on this later).
Databricks is now a major player in both major table formats: its native Linux Foundation Delta Lake and Apache Iceberg. UniForm is the translation layer that generates Iceberg metadata for Delta tables with minimal overhead (we were told 2-4%). UniForm's Iceberg support includes streaming merges, including deletes, which the original Iceberg format doesn't support.
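To make this concrete, here is a minimal sketch of enabling UniForm when creating a Delta table from a Databricks notebook (where `spark` is predefined). The catalog, schema, and column names are illustrative, and the table properties follow Databricks' published UniForm guidance, so treat this as a sketch rather than a definitive recipe.

```python
# Minimal sketch (assumed names): create a Delta table with UniForm enabled so
# that Iceberg clients can read it. Verify the property names against your
# runtime's documentation before relying on them.
spark.sql("""
  CREATE TABLE IF NOT EXISTS main.sales.orders (
    order_id BIGINT,
    amount   DECIMAL(10, 2)
  )
  TBLPROPERTIES (
    'delta.enableIcebergCompatV2' = 'true',
    'delta.universalFormat.enabledFormats' = 'iceberg'
  )
""")
```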
Onehouse offers lake management functionality on the Apache Hudi table format and originally built another translation layer, OneTable, with Microsoft and Google. It has since been renamed Apache XTable (incubating) and provides interoperability with Iceberg and Delta Lake.
Microsoft Fabric uses XTable to create Hudi and Iceberg metadata every time data is written in its native Delta format, and to create Delta metadata whenever Snowflake writes into Iceberg tables. Unlike UniForm, XTable supports bi-directional translation between all three table formats and is governed neutrally by the Apache Software Foundation, independent of the three table format projects.
Liquid clustering is an automated workload optimization feature that reorganizes and compacts data files, rather than relying on static partitioning, to get optimal query performance from "hot" data.
Open variant is a new data type (in preview) for JSON in both Delta Lake 4.0 and Spark 4.0. Prior to this, JSON was stored as a string, which is very slow to query. The new data type stores flexible schemas and nested semi-structured data more efficiently and provides 8x better performance for TPC-DS and 20x for TPC-H. When schemas evolve, open variant eliminates the need for table rewrites.
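To illustrate how the two features above fit together, here is a hedged sketch under assumed table and field names: liquid clustering replaces static partitioning, and the VARIANT column ingests JSON without a fixed schema.

```python
# Illustrative sketch (table and field names assumed): a Delta table that uses
# liquid clustering instead of static partitioning and stores raw JSON events
# in a VARIANT column. Run in a Databricks notebook where `spark` is predefined.
spark.sql("""
  CREATE TABLE IF NOT EXISTS main.web.events (
    event_id   BIGINT,
    event_time TIMESTAMP,
    payload    VARIANT
  )
  CLUSTER BY (event_time)
""")

# Parse a JSON string into the variant column, then query a nested field directly.
spark.sql("""
  INSERT INTO main.web.events
  SELECT 1, current_timestamp(), PARSE_JSON('{"device": {"os": "iOS", "version": 17}}')
""")
spark.sql("SELECT payload:device.os AS os FROM main.web.events").show()
```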
To use Delta Lake 4.0, users first have to upgrade to Spark 4.0. A significant portion of the Day Two keynote by Databricks co-founder Reynold Xin was spent on the new features in Spark 4.0. A key one is that Python becomes a first-class citizen like Scala. Several Delta Lake 4.0 preview features can be seen here.
Unity Catalog
Matei Zaharia, another one of the seven co-founders, clicked the GitHub button during the Day Two keynote and made Unity Catalog's GitHub repo public. Databricks made a big deal of how its catalog is now open source, unlike Snowflake's Polaris Catalog, which has a 90-day window before it becomes open source.
Unity Catalog is not equivalent to Polaris but rather to the combination of Snowflake's Polaris and Horizon catalogs. Unity Catalog is actually more full-featured, as it includes metadata not just for data assets like tables and files, but also for AI models. It is built on OpenAPI specifications with support for the Apache Hive metastore API and the Apache Iceberg REST catalog API.
Databricks has made the back-end open-source but not the UI. We hope that they will open-source all of Unity Catalog in the near future.
Key Unity Catalog announcements include:
- Metric store
- Data sharing with Lakehouse Federation
- Attribute based access control (ABAC)
- Data quality via Lakehouse Monitoring
Unity Catalog introduced a metric store, which is used by various Databricks products covered below, like AI/BI, and by third-party BI tools like Tableau and Power BI. The metrics are also compatible with independent metric stores, like AtScale, Cube (cube.dev), and dbt.
Lakehouse Federation allows users to run read-only queries on external data sources by creating "foreign catalogs" in Unity Catalog. The data sources include Snowflake, Google BigQuery, Amazon Redshift, and others, with more integrations like AWS Glue coming in the future. A foreign catalog mimics the source schemas and pushes Databricks queries down to the respective native engines. The Linux Foundation Delta Sharing protocol is used for this zero-ETL virtualization approach; the same protocol is also used to share Databricks data, including foreign catalogs, with downstream external systems.
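As a rough sketch of what creating a foreign catalog looks like, the snippet below registers a Snowflake connection and exposes one of its databases through Unity Catalog. The hostnames, credentials, and object names are placeholders, not a definitive recipe.

```python
# Hedged sketch of Lakehouse Federation (all hostnames, credentials, and object
# names are placeholders): register a connection to Snowflake, expose it as a
# read-only foreign catalog, and let Unity Catalog push the query down.
spark.sql("""
  CREATE CONNECTION IF NOT EXISTS snowflake_conn TYPE snowflake
  OPTIONS (
    host 'myaccount.snowflakecomputing.com',
    port '443',
    sfWarehouse 'COMPUTE_WH',
    user 'federation_user',
    password secret('federation-scope', 'snowflake-password')
  )
""")

spark.sql("""
  CREATE FOREIGN CATALOG IF NOT EXISTS snowflake_sales
  USING CONNECTION snowflake_conn
  OPTIONS (database 'SALES_DB')
""")

# Queries against the foreign catalog are pushed down to the external engine.
spark.sql("SELECT COUNT(*) AS order_count FROM snowflake_sales.public.orders").show()
```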
Attribute-based access control (ABAC) applies to all assets managed in Unity Catalog, whether native to Databricks or federated from external sources. Many other products offer similar policy enforcement capabilities, but Databricks also offers a UI to define unified policies. It also integrates with third-party ABAC tools, like Immuta and Satori.
Lakehouse Monitoring profiles data and AI assets to diagnose and enforce data quality. It automatically identifies trends and anomalies and visualizes key metrics, like data volumes, percent nulls, model drift, accuracy, F1 score, precision, and recall. It can inform users when models need retraining.
Data Engineering
One of the primary use cases of Databricks is data engineering. It launched a new service called LakeFlow that addresses the entire end-to-end journey using a low/no-code approach integrated with Unity Catalog. LakeFlow comprises:
- LakeFlow Connect: Arcion-based ingestion connectors for databases, SaaS apps like Salesforce, Workday, and SAP, and storage locations like SharePoint.
- LakeFlow Pipelines: uses Delta Live Tables (DLT) technology to perform data transformations in either SQL or Python (see the sketch after this list).
- LakeFlow Jobs: the final component, used to orchestrate jobs and monitor their health automatically. It can be integrated with external alerting tools like PagerDuty.
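Below is a minimal sketch of what a LakeFlow Pipelines (DLT) definition looks like in Python. The storage path, column names, and table names are assumptions for illustration.

```python
# Minimal Delta Live Tables sketch (paths and table names assumed): an
# incrementally ingested streaming table with a data-quality expectation,
# feeding a downstream aggregate table.
import dlt
from pyspark.sql import functions as F

@dlt.table(comment="Raw orders incrementally ingested with Auto Loader")
@dlt.expect_or_drop("valid_order", "order_id IS NOT NULL")
def raw_orders():
    # Auto Loader picks up only newly arrived files on each pipeline update.
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("/Volumes/main/sales/landing/orders/")
    )

@dlt.table(comment="Daily revenue derived from the raw orders")
def daily_revenue():
    return (
        dlt.read("raw_orders")
        .groupBy(F.to_date("order_ts").alias("order_date"))
        .agg(F.sum("amount").alias("revenue"))
    )
```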
The goal is to help customers eliminate data silos in an integrated manner. Previously, organizations needed tools like Fivetran and dbt for ingestion and transformation respectively, plus an external catalog. LakeFlow enables incremental processing all the way from incremental reads of source data using change data capture (CDC) to intermediate streaming tables and materialized views in DLT, thus modernizing ETL processing.
Generative AI and ML
A big theme for Databricks is 'Compound AI Systems', a composable approach to driving greater GenAI app quality by using multiple components instead of a single monolithic model. Retrieval-augmented generation (RAG) is an example, where the GenAI app supplements the model with proprietary data to provide better context and more accurate answers. Enterprises can also use their data to fine-tune or pre-train their own models.
Databricks mentioned that it has helped deliver 200,000 custom models. All of Databricks' generative AI offerings are now branded under Mosaic AI.
Mosaic AI Model Training requires just three inputs to fine-tune a model (a hedged sketch follows the list):
- Pick from the list of base open-source models like Llama 3 or proprietary models like DBRX
- Select the type of task — code completion, instruction fine-tuning, or continued pre-training
- Specify the location of your dataset
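For a sense of how little code this requires, here is a hypothetical sketch mapping to the three inputs above. The module path, function name, and argument names are assumptions made for illustration, not the confirmed API; consult the Mosaic AI Model Training documentation for the exact signature.

```python
# Hypothetical sketch of the three inputs; names below are assumptions.
from databricks.model_training import foundation_model as fm  # assumed import

run = fm.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",      # 1. base model to fine-tune
    task_type="INSTRUCTION_FINETUNE",                 # 2. type of task
    train_data_path="main.training.support_tickets",  # 3. location of your dataset
    register_to="main.models",                        # where the tuned model lands in Unity Catalog
)
print(run)
```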
Shutterstock announced a text-to-image model trained on its own dataset using Mosaic AI Model Training.
Databricks also wants to accelerate AI experimentation and help build agents and RAG applications in an environment that provides optionality. It announced the public preview of its Mosaic AI Agent Framework, comprising:
- Agent SDK: build end-to-end agentic applications
- Agent Evaluation: define high-quality answers to evaluate models, then use LLM judges or invite human judges; even people external to a Databricks account can assess, review, and label responses. It is like doing root-cause analysis for AI by observing each stage of the AI lifecycle. In addition, users can also use open-source MLflow 2.14 with its new MLflow Tracing (see the sketch after this list).
- Agent serving: deploy agents as real-time API endpoints. It also enables function calling.
- Tools Catalog: helps clients create a registry of SQL, Python, or remote functions and model endpoints.
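To show what the MLflow Tracing piece looks like in code, here is a small sketch against the open-source MLflow 2.14 API; the agent logic itself is a placeholder.

```python
# Sketch of MLflow 2.14 Tracing applied to a toy RAG-style agent. The retrieval
# and generation logic is a stand-in so the trace structure stays visible; a real
# agent would call Mosaic AI Vector Search and a model-serving endpoint instead.
import mlflow

@mlflow.trace(span_type="RETRIEVER")
def retrieve(question: str) -> list:
    return ["Databricks announced LakeFlow at DAIS 2024."]

@mlflow.trace(span_type="LLM")
def generate(question: str, context: list) -> str:
    return f"Answer to '{question}' based on {len(context)} retrieved document(s)."

@mlflow.trace(name="rag_agent")
def answer(question: str) -> str:
    docs = retrieve(question)
    return generate(question, docs)

# Each call emits a trace whose spans can be inspected in the MLflow UI or
# handed to Agent Evaluation's LLM and human judges.
print(answer("What is LakeFlow?"))
```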
Mosaic AI Vector Search went GA.
Security
Mosaic AI Gateway builds on Model Serving and enables rate limiting, permissions, and credential management for model APIs, whether external or internal. Users can query foundation model APIs through a unified interface and audit model usage, including the data sent and returned.
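As an example of that unified interface, here is a hedged sketch using the MLflow Deployments client; the endpoint name is an assumption, so substitute one that exists in your workspace.

```python
# Hedged sketch of querying a foundation-model endpoint through the unified
# interface, using the MLflow Deployments client.
from mlflow.deployments import get_deploy_client

client = get_deploy_client("databricks")
response = client.predict(
    endpoint="databricks-meta-llama-3-70b-instruct",  # assumed pay-per-token endpoint
    inputs={
        "messages": [{"role": "user", "content": "Summarize the DAIS 2024 themes."}],
        "max_tokens": 200,
    },
)
print(response)
```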
Mosaic AI Guardrails adds endpoint-level or request-level safety filtering to prevent unsafe responses. It helps prevent sensitive data leakage.
Unity Catalog LakeGuard provides data governance on Spark workloads in SQL, Python and Scala by isolating users’ code on shared compute clusters.
Data Clean Rooms allow businesses to collaborate on data without exposing any sensitive data. They are most commonly used in media and entertainment, where there are restrictions on sharing cookies.
Convergence of AI and BI / Data Warehouse
Databricks entered the business intelligence (BI) space in 2020 when it acquired Redash. In 2024, it upped the ante with a drag-and-drop dashboarding product called AI/BI and a natural-language assistant called Genie. AI/BI gathers usage history and metrics from Unity Catalog and related assets like notebooks and dashboards.
Genie is the semantic layer. It continuously learns from users' questions and provides "certified answers for query patterns specified by the data teams." In other words, Databricks is cognizant of the hallucination and privacy concerns around results returned by LLMs.
AI/BI changes how analysts generate reports, as it incorporates dynamic semantic definitions and prompting techniques such as few-shot examples. Instead of relying solely on pre-built examples and semantic layers, AI/BI encourages users to supply certified questions and answers through the chat interface, such as examples of working SQL statements.
Databricks SQL is the end-to-end data warehouse with ingest, transform, query, visualize, and serve capabilities. It sits on top of Unity Catalog and Delta Lake. Its compute engine is called Photon, and Databricks announced a partnership with NVIDIA to leverage GPUs inside Photon and increase parallelism.
Conclusion
This document is filled with a humongous number of announcements, yet the future of data and AI feels murkier than ever before. Competition in the data and analytics space is coming from unexpected directions. As I write this blog, OpenAI is acquiring Rockset, a real-time streaming analytical database. In other words, OpenAI is on a journey from models into analytics, and possibly into data. This is exactly the opposite of Databricks' journey from data to analytics to AI. Eventually, both of these companies (and a host of other data infrastructure players) will converge. Data is king, and whoever has the best-managed data will win the AI race.
There are various market and economic factors affecting how we currently think about data analytics infrastructure. Customers are seeking unified platforms for their data and AI workloads. Vendors, on the other hand, are looking to differentiate via intelligent assistants and agents.
From the economic point of view, the cost of developing software is rapidly trending toward zero. This is driven by foundation models that are adept at writing code and are close to being free; Google's software engineering teams used LLMs to write 50% of their new code in the past year. This incentivizes builders to create larger platforms at much lower price points with better pricing models. First, more software will be written as the cost of experimentation and the skills needed go down. Second, vendors providing data infrastructure will benefit from higher compute usage. So the beneficiaries will be large, integrated data infrastructure companies, at the cost of specialized software companies.
Another aspect that is helping large players (the hyperscalers, Databricks, Snowflake, etc.) is that organizations are now leveraging both structured and unstructured data in a cohesive manner. While AI typically utilizes unstructured data, structured data is needed to provide context and reduce hallucinations. This is leading to what we call the Intelligent Data Platform, a unified storage and metadata offering. And Databricks is pivoting into this space rapidly.
As always, I want to thank you for taking the time to hear my thoughts. You can watch the commentary on the Databricks and Snowflake summit announcements in the latest episode of the 'It Depends' podcast.