Databricks Data + AI Summit (DAIS) 2023 Announcements

Sanjeev Mohan
Jul 5, 2023


DAIS was held June 26–29, 2023 at the Moscone Center in San Francisco. The author split his time between DAIS and the overlapping Snowflake Summit in Las Vegas. This document captures the key highlights of DAIS 2023.

Key takeaways:

  • For the past few years, Databricks focused on bolstering its data warehouse capabilities. With the resurgence of AI, however, it is back in its sweet spot: accelerating the developer experience for data engineers and data scientists.
  • It brought an end to the table format wars by building compatibility between Apache Hudi, Apache Iceberg, and its very own Linux Foundation Delta Lake.
  • Unity Catalog came of age. It is now the linchpin in Databricks' strategy to support unification of data and AI, federation, security, and governance.
  • Open source is paying off. Not only is the Databricks solution built on MLflow, Apache Parquet, and Delta Lake, but it is also expanding its open source reach into data sharing via Delta Sharing.
  • Databricks' focus on using AI to extract all kinds of metadata from data sources serves technical and operational metadata well. But curation of attributes by business users to add context is still needed. LakehouseIQ and partner semantic layers can fill that gap.

Data Lake

Unification was the biggest theme; after all, it is in the name of the conference. Unification shows up across the lakehouse:

  • Unity Catalog
  • Delta 3.0 — UniForm, Delta Kernel, Liquid Clustering
  • Lakehouse Federation
  • Hive Metastore (HMS) interface

Unity Catalog was introduced only a year ago, and it is already becoming a central part of Databricks' roadmap. Most data catalogs, as the name suggests, curate and store metadata for data. Unity Catalog, however, unifies metadata for tables with unstructured data and ML models. It logs and monitors all requests and responses to Delta tables and creates end-to-end lineage. Unity Catalog has also been extended to enable vector search, as described later.
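
As a rough illustration, here is a minimal Python sketch of Unity Catalog's three-level namespace (catalog.schema.object) as used from a Databricks notebook; the catalog, schema, and table names are hypothetical.

```python
# A minimal sketch of Unity Catalog's three-level namespace, run from
# a Databricks notebook where a `spark` session already exists.
# All names (demo, sales, orders) are illustrative.
spark.sql("CREATE CATALOG IF NOT EXISTS demo")
spark.sql("CREATE SCHEMA IF NOT EXISTS demo.sales")
spark.sql("""
    CREATE TABLE IF NOT EXISTS demo.sales.orders (
        order_id BIGINT,
        amount   DOUBLE
    )
""")
# Tables, files, and ML models all hang off this same namespace,
# which is how the catalog unifies data and AI assets.
```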

Unity Catalog acts as a central metadata store to secure and govern access. It applies policies for table-level and row/column-level access. Today, Unity Catalog does not let you define common data access governance policies, but Databricks plans to add those capabilities in the future.
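
To make the access model concrete, here is a hedged sketch of Unity Catalog grants; the `analysts` group and all object names are placeholders.

```python
# Unity Catalog expresses access control as ANSI-style GRANTs on
# securables. The `analysts` account group is hypothetical.
spark.sql("GRANT USE CATALOG ON CATALOG demo TO `analysts`")
spark.sql("GRANT USE SCHEMA ON SCHEMA demo.sales TO `analysts`")
spark.sql("GRANT SELECT ON TABLE demo.sales.orders TO `analysts`")
```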

UniForm, or unified format, was one of the more impactful announcements as it provides compatibility between Apache Hudi, Apache Iceberg, and Delta Lake. Behind the scenes, Databricks maintains three copies of the metadata, one per format. This allows Delta tables to be read as if they were Hudi or Iceberg formatted. This feature is a key part of the open source Linux Foundation Delta Lake 3.0.
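
A hedged sketch of what enabling UniForm looks like, based on the Delta Lake 3.0 preview; the table property name may change as the feature matures.

```python
# Create a Delta table whose metadata is also maintained in Iceberg
# format, so Iceberg readers can consume it. The property name
# follows the Delta Lake 3.0 / UniForm preview and may change.
spark.sql("""
    CREATE TABLE demo.sales.events (id BIGINT, ts TIMESTAMP)
    TBLPROPERTIES (
        'delta.universalFormat.enabledFormats' = 'iceberg'
    )
""")
```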

Other Delta Lake 3.0 announcements include Delta Kernel, a simplified connector development kit that shields connectors from version changes, and Liquid Clustering, which makes partitioning more cost efficient and lowers latency for read and write operations.
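
For Liquid Clustering, the announced syntax replaces hand-tuned PARTITIONED BY clauses with a simple CLUSTER BY; a minimal sketch, with illustrative names:

```python
# Liquid Clustering: declare clustering keys and let Databricks
# maintain the data layout, instead of static Hive-style partitions.
spark.sql("""
    CREATE TABLE demo.sales.clicks (user_id BIGINT, ts TIMESTAMP)
    CLUSTER BY (user_id)
""")
```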

Lakehouse Federation allows users to discover and query data in external systems, like Snowflake, Amazon Redshift, etc., from Databricks without moving data. It uses caching and query optimization to reduce latency when accessing data across different systems. Other supported sources include MySQL, PostgreSQL, Google BigQuery, Azure SQL Database, and Synapse.
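
A hedged sketch of the Lakehouse Federation flow, assuming a Snowflake source; the connection options, secret scope, and object names are all placeholders.

```python
# 1) Register a connection to the external system.
spark.sql("""
    CREATE CONNECTION snow_conn TYPE snowflake
    OPTIONS (
        host 'acme.snowflakecomputing.com',
        user 'svc_user',
        password secret('demo_scope', 'snow_pw')
    )
""")
# 2) Mount it as a foreign catalog that can be queried in place.
spark.sql("""
    CREATE FOREIGN CATALOG snow_cat
    USING CONNECTION snow_conn
    OPTIONS (database 'SALES')
""")
# 3) Query the external table without moving the data.
spark.sql("SELECT * FROM snow_cat.public.orders LIMIT 10").show()
```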

The Hive Metastore (HMS) interface provides compatibility with all the engines that rely on HMS, such as Amazon EMR, Apache Spark, Amazon Athena, Presto, and Trino.

Generative AI and ML

The motto of the conference was generative AI. 'Nuf said…

Lakehouse AI includes capabilities that allow users to build generative AI applications, manage the entire AI lifecycle, and monitor and govern the process. Its mission is to speed up getting models from experimentation into production.

Databricks' strategy is to support all three approaches to LLMs: using foundation models from the Databricks Marketplace, fine-tuning existing models, and training custom models.

New announcements include:

  • Vector Search and integration with AutoML and Model Serving
  • Linux Foundation MLflow 2.5 (releasing in July 2023)
  • MosaicML acquisition
  • English SDK for Apache Spark

Vector Search was also a key announcement for MongoDB and Snowflake in June 2023. Databricks' approach is unique in that Unity Catalog is central to its strategy. The catalog automatically creates vector embeddings from unstructured files. It is integrated with Databricks Model Serving, which optimizes cost and performance.
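
A hedged sketch of querying a vector index with the Databricks Vector Search Python client; the feature was in preview at announcement time, so the exact API surface, endpoint, and index names shown here are assumptions.

```python
# Preview-era Vector Search client; endpoint and index names are
# placeholders for objects created and governed via Unity Catalog.
from databricks.vector_search.client import VectorSearchClient

client = VectorSearchClient()
index = client.get_index(
    endpoint_name="vs_endpoint",
    index_name="demo.sales.docs_index",
)
# Retrieve the five passages most similar to a natural language query.
results = index.similarity_search(
    query_text="How do refunds work?",
    columns=["doc_id", "chunk"],
    num_results=5,
)
```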

Databricks AutoML offers a low-code, secure approach to fine-tuning LLMs on enterprise data. The resulting model can be published in the Databricks Marketplace.

MLflow 2.5 adds AI Gateway, Prompt Tools, and Monitoring. AI Gateway is interesting as it not only handles access control and rate limiting but also caches predictions to serve repeated prompts.
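
A minimal sketch of the gateway client as it appeared around MLflow 2.5; the gateway URI and the "chat-gpt4" route are hypothetical and would be configured by an admin with provider keys and rate limits.

```python
# Query a configured AI Gateway route. The gateway centralizes
# credentials, rate limits, and (per the announcement) caching of
# repeated prompts. The route and URI below are placeholders.
from mlflow.gateway import set_gateway_uri, query

set_gateway_uri("http://localhost:5000")
response = query(
    route="chat-gpt4",
    data={"messages": [{"role": "user", "content": "Summarize DAIS 2023."}]},
)
print(response)
```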

The MosaicML acquisition gives Databricks the ability to address the third leg of the LLM strategy: training custom models. How big this space will be is anyone's guess at this point, but Databricks made a big bet with the $1.3B purchase. At the keynote, the co-founder of MosaicML claimed that its MPT-7B is the most downloaded LLM yet. In June 2023, the company launched MPT-30B and drastically reduced the cost of training to under $800K.

The English SDK for Apache Spark has an inherent understanding of tables and DataFrames. This allows users to simply ask questions in English, and the generative AI engine compiles them into PySpark and SQL code. Databricks wants this assistant to no longer be a copilot but to become the pilot.
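
A short sketch using the open-source pyspark-ai package behind this SDK; it assumes a Databricks notebook (or an existing `spark` session) and an LLM backend such as an OpenAI API key, and the table name is hypothetical.

```python
# English SDK for Apache Spark (pip install pyspark-ai).
from pyspark_ai import SparkAI

spark_ai = SparkAI()   # defaults to an OpenAI-backed LLM
spark_ai.activate()    # adds the `ai` accessor to DataFrames

df = spark.table("demo.sales.orders")  # hypothetical table
# Ask in English; the SDK compiles the request into PySpark/SQL.
monthly = df.ai.transform("total amount by month, highest first")
monthly.show()
```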

Data Warehouse and Streaming

It’s not all about serving the needs of data scientists and engineers; Databricks is also helping business analysts search and analyze enterprise data. Announcements in this space include:

  • LakehouseIQ
  • Data warehouse engine
  • Core enhancements

LakehouseIQ is a knowledge engine that uses generative AI to scour various data sources and infer data’s context, usage patterns, and organizational structures. This metadata is stored in Unity Catalog, and natural language can be used to query and understand data. During the keynote, Larry Feinsmith, Managing Director at JPMorgan Chase, called LakehouseIQ the “future of Databricks.”

The data warehouse engine has been rewritten from scratch for the lakehouse. It introduces many new features, like Indexless Index, which uses AI to predict indexes. AI is also used to determine data layout and clustering, which has resulted in lower storage costs.

Core enhancements include materialized views on streaming data using Delta Live Tables and “volumes” to store unstructured data. Databricks SQL continues to be the analytical tool for business analysts.
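
As a hedged illustration of the streaming materialized view pattern, here is a minimal Delta Live Tables sketch; it runs inside a DLT pipeline, and the source and table names are hypothetical.

```python
# A continuously maintained aggregate over a streaming source,
# declared with the Delta Live Tables Python API.
import dlt
from pyspark.sql import functions as F

@dlt.table(comment="Hourly revenue, kept up to date from a stream")
def orders_by_hour():
    return (
        dlt.read_stream("raw_orders")          # upstream streaming table
        .groupBy(F.window("ts", "1 hour"))
        .agg(F.sum("amount").alias("revenue"))
    )
```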

Marketplace

Lakehouse Apps didn’t get as much limelight, but it is akin to Snowflake Summit 2023’s biggest announcement, Snowpark Container Services. Lakehouse Apps allow third-party products to run in containers inside a user’s Databricks instance. This approach removes the risk of sending data across different products running in their own security environments.

Why is this important? Both Databricks and Snowflake provide application providers with a secure distribution channel and help solve some of their go-to-market steps. In addition, when third-party apps are made available through a native and secure container service, they can overcome many procurement hurdles. Databricks took it a notch further: consumers don’t need a Databricks account to access the applications.

Delta Sharing allows sharing live data sets across platforms and clouds without the need for replication. It is an open protocol and does not have any dependency on Databricks. A few companies are already using it, like Oracle Cloud Infrastructure (OCI), Dell, and Twilio. Cloudflare R2, a distributed object storage solution, uses Delta Sharing with zero egress cost.
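
Because the protocol is open, a recipient only needs the small delta-sharing client and a profile file from the provider; a minimal sketch with placeholder share coordinates:

```python
# Open-source Delta Sharing client (pip install delta-sharing).
import delta_sharing

# Credentials file issued by the data provider.
profile = "config.share"
table_url = profile + "#sales_share.public.orders"  # share.schema.table

# Read the live shared table into pandas without replicating it.
df = delta_sharing.load_as_pandas(table_url)
print(df.head())
```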

The Marketplace scope has vastly expanded. It includes data, AI/ML models, applications, notebooks, schemas, documentation, queries, and volumes. Databricks is now adding monetization capabilities. New data providers, like the London Stock Exchange Group, IQVIA, LiveRamp, and many others, have been added.

Written by Sanjeev Mohan

Sanjeev researches the space of data and analytics. Most recently he was a research vice president at Gartner. He is now a principal with SanjMo.
