Snowflake Summit 2024: The Rise of AI Data Cloud

Sanjeev Mohan
Jun 11, 2024


The pendulum swings relentlessly. A common theme emerged in my conversations with many customers: they are once again asking for an integrated and simplified data platform, without having to stitch together multiple disparate software products. Call it a hangover from the drunkenness of the Modern Data Stack. The integrated data platform is also core to enterprise architectures that aim to deliver intelligent data applications beyond traditional analytics.

However, there is a twist to this — interoperability. To better prepare for pendulum swings, users want their platform to adopt open standards so that there is some semblance of future-proofing.

Snowflake Summit 2024 reflected this sentiment, showcasing the company's deep commitment to data and AI. The key highlights are:

  1. Snowflake is expanding its appeal beyond the core data analyst persona. In the last few years, it added app development capabilities. Now, it is adding more data science, operational, and governance features through conversational interfaces.
  2. Snowflake is not only going wider, but also adding capabilities vertically. This includes more foundational capabilities that are starting to encroach on its ecosystem. However, the ecosystem is so large that we can expect it to keep growing. Some vendors, such as those in DataOps, are delighted that increased focus from Snowflake is elevating their category.
  3. While Snowflake is expanding platform capabilities beyond the core data platform for analysts, there is a gap between market perception and its messaging and positioning. The market still perceives Snowflake as behind in data science and higher in cost, although it has made positive strides on both fronts.

In this document, we examine the key announcements and their impact on customers and the industry. As with the 2023 Snowflake Summit blog, this document does not specify the latest availability status (private preview, public preview, or general availability) of the new announcements.

Data Management

Snowflake’s north star is to handle both structured and unstructured data, with discovery and governance, at the most optimal cost-performance. Unstructured data is handled through Document AI, which provides a workflow to parse PDFs. This technology is based on the August 2022 acquisition of Poland-based Applica and has been enhanced by Snowflake’s Arctic LLMs. In fact, it uses Arctic TILT, a mere 0.8B-parameter model with very high benchmark scores. More on the Arctic LLMs later in the document.

Managed Iceberg Tables went GA at the Summit. They are optimized to deliver performance comparable to native tables. However, this year Snowflake also announced the ability to use any analytical compute engine that supports the Iceberg format.
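To make this concrete, here is a minimal sketch of creating a Snowflake-managed Iceberg table through Snowpark for Python. The connection parameters and the external volume name (my_iceberg_vol) are hypothetical; an external volume pointing at your object store must be configured beforehand.

```python
# Minimal sketch: create a Snowflake-managed Iceberg table via Snowpark.
# All connection parameters and the external volume name are hypothetical.
from snowflake.snowpark import Session

session = Session.builder.configs({
    "account": "my_account",
    "user": "my_user",
    "password": "...",
    "warehouse": "my_wh",
    "database": "my_db",
    "schema": "PUBLIC",
}).create()

# CATALOG = 'SNOWFLAKE' makes Snowflake the Iceberg catalog, while the data
# itself lands in open Parquet/Iceberg format on the external volume.
session.sql("""
    CREATE ICEBERG TABLE IF NOT EXISTS customer_events (
        id INT,
        event STRING,
        ts TIMESTAMP
    )
    CATALOG = 'SNOWFLAKE'
    EXTERNAL_VOLUME = 'my_iceberg_vol'
    BASE_LOCATION = 'customer_events/'
""").collect()
```

Because the files sit on the customer’s own object store, any Iceberg-aware engine can read them, which is exactly the interoperability story that Polaris (below) completes.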

Two open table formats predominate: Iceberg and Delta. A bevy of database management systems supports Iceberg, including Snowflake, Cloudera, IBM watsonx.data, and ClickHouse, as do object-store-based fully managed data services such as Salesforce Data Cloud, Fivetran, and Confluent. Some products, like Google’s BigQuery and Amazon Redshift, support multiple formats. The Delta format is mainly used by Databricks, which has created a translation layer called UniForm to interoperate between the two formats. Microsoft Fabric is based on OneLake, which uses Delta as its native table format. Under the hood, Microsoft Fabric uses Onehouse’s Apache XTable translation layer to support both Delta and Iceberg. To complete the story, Onehouse also supports the Hudi table format. This video provides an in-depth look at the various lakehouse table formats and their interoperability.

Before we go any further, let’s address the huge spanner that Databricks threw into the works when it announced its acquisition of Tabular, founded by the original creators of the Iceberg table format during their time at Netflix. Databricks paid an eye-watering sum of between $1B and $2B for a company with 40 employees and $37M in funding. The announcement was timed to the minute, landing just as the Snowflake Summit keynote got started. We will leave any commentary on this acquisition out of this blog and instead focus on Snowflake’s Polaris and Horizon catalogs, which are needed to achieve Iceberg interoperability.

Polaris Catalog

Polaris Catalog is built on the Iceberg REST API and will be open-sourced within 90 days. It tracks technical metadata, such as table names, columns, partitions, and bucket paths on object stores like Amazon S3 or Google Cloud Storage. With this metadata, any supported compute engine can act on the underlying data in a read/write manner. These engines include Snowflake, Spark, Flink, Trino/Starburst, Presto, and Dremio.
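Because Polaris speaks the standard Iceberg REST protocol, connecting an external engine is largely catalog configuration. Below is a hedged PySpark sketch; the endpoint URI and credential are hypothetical placeholders, and the exact connection details will depend on how Polaris is hosted.

```python
# Hedged sketch: point Spark at an Iceberg REST catalog such as Polaris.
# The URI and credential below are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("polaris-demo")
    .config("spark.jars.packages",
            "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.2")
    .config("spark.sql.catalog.polaris",
            "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.polaris.catalog-impl",
            "org.apache.iceberg.rest.RESTCatalog")
    .config("spark.sql.catalog.polaris.uri",
            "https://<host>/api/catalog")           # hypothetical endpoint
    .config("spark.sql.catalog.polaris.credential",
            "<client_id>:<client_secret>")          # hypothetical credential
    .getOrCreate()
)

# Spark can now read (and write) the same tables Snowflake manages.
spark.sql("SELECT count(*) FROM polaris.my_namespace.customer_events").show()
```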

Polaris Catalog deployment options range from Snowflake-hosted (to be in public preview soon) to self-hosted as Docker containers managed by Kubernetes.

Polaris Catalog supports coarse-grained role-based access control. Fine-grained access control, comprising row-level and column-level security, is done in the Horizon Catalog, which is covered next. One way of thinking about Polaris Catalog is as a mechanism to ensure multi-engine concurrency control and query planning.

Horizon

Horizon is an overarching solution with built-in governance, discovery, and access for content internal to an organization, as well as content sourced from third parties. It has a unified set of compliance, security, privacy, interoperability, and access capabilities. Horizon includes Snowflake’s internal technical catalog, business catalog, and Trust Center (see below). Horizon was released in November 2023 for all data and application assets. It will extend Polaris Catalog by adding column masking policies, row access policies, object tagging, and data sharing capabilities.
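As a concrete illustration of those fine-grained controls, here is a minimal sketch of a column masking policy; the table and role names are hypothetical.

```python
# Minimal sketch: a column masking policy, one of the fine-grained controls
# Horizon layers on top of Polaris. Table and role names are hypothetical.
# `session` is the Snowpark Session created in the earlier Iceberg sketch.
session.sql("""
    CREATE OR REPLACE MASKING POLICY mask_email AS (val STRING)
    RETURNS STRING ->
        CASE
            WHEN CURRENT_ROLE() IN ('PII_ANALYST') THEN val
            ELSE '***MASKED***'
        END
""").collect()

# Attach the policy to a column; non-privileged roles now see masked values.
session.sql("""
    ALTER TABLE customers MODIFY COLUMN email
    SET MASKING POLICY mask_email
""").collect()
```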

Horizon has all the features of a full-blown data catalog, such as a business glossary and lineage, and is being expanded to become a registry for AI models. Its object descriptions can then be used to develop a semantic layer, which is available as a YAML file and is used by Cortex Analyst (see below).

Cost Management

Snowflake’s slick new cost management user interface has enhanced capabilities such as budgets, which let users set spending limits and notifications (a hedged sketch follows the list below). It has three goals:

  1. Cost transparency shows a spend overview at the account level and across teams.
  2. Cost control through allocation at the account level and, in the future, at the query level.
  3. Cost optimization through rule-based heuristics leading to recommendations.
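Budgets are exposed through SQL. The sketch below shows the general shape of creating a custom budget and capping its monthly spend; the budget name and limit are hypothetical, and the exact syntax may vary by release.

```python
# Hedged sketch: create a custom budget and set a monthly credit limit.
# The budget name and limit are hypothetical; syntax may vary by release.
# `session` is the Snowpark Session created in the earlier Iceberg sketch.
session.sql("CREATE SNOWFLAKE.CORE.BUDGET marketing_budget()").collect()

# Cap the budget at 500 credits per month; notifications fire as spend
# approaches the limit.
session.sql("CALL marketing_budget!SET_SPENDING_LIMIT(500)").collect()
```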

Cost is such a key topic in this ecosystem that several players offer solutions to help reduce costs, like Capital One Software’s Slingshot.

Generative AI

Snowflake is no longer The Data Cloud. It is now The AI Data Cloud, and its fully managed AI service is called Cortex. Snowflake claims that 750 of its customers are now using Cortex. This service has several functions, as depicted in Figure 1.

Figure 1: Snowflake’s Generative AI vision (PU: public preview / PR: private preview)

Let’s look at each of these categories, starting with its family of large language models (LLMs), called Arctic.

Snowflake Arctic was open-sourced under Apache 2.0 in May 2024; Snowflake shares with the community the model weights and cookbooks on how the models were trained, but not the actual datasets. Arctic models use a Mixture of Experts (MoE) architecture to keep training costs in check. Today, the model handles only text and code, but it will be multi-modal in the future.
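Calling Arctic from SQL is a one-liner through the Cortex COMPLETE function. A quick sketch follows; the prompt is purely illustrative.

```python
# Quick sketch: invoke the Arctic LLM through the Cortex COMPLETE function.
# `session` is the Snowpark Session created in the earlier Iceberg sketch.
row = session.sql("""
    SELECT SNOWFLAKE.CORTEX.COMPLETE(
        'snowflake-arctic',
        'Summarize the trade-offs of a Mixture of Experts architecture.'
    ) AS answer
""").collect()[0]
print(row["ANSWER"])
```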

Snowflake also uses models from Meta, Mistral, Google, and Reka. At the keynote, Jensen Huang, CEO of Nvidia, called in remotely from the Computex show. Nvidia’s NeMo Retriever is a framework and set of microservices for developing copilots, chatbots, and assistants that use retrieval-augmented generation (RAG) to provide semantic search on corporate data. At the Summit, Snowflake and Nvidia announced that NeMo Retriever now works on Snowflake.

Cortex Analyst

Its purpose is to let business users chat with data using AI, as opposed to a data analyst copilot that turns natural language text into SQL queries. Cortex Analyst is a REST API that allows applications to upload documents, parse unstructured data to create a semantic layer (as a YAML file), and ask questions in natural language.

Under the hood, it uses Meta’s Llama 3 and Mistral Large models, allowing apps to be built on top of corporate data through its REST API.
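A hedged sketch of what a Cortex Analyst call looks like from an application: the request posts a natural language question plus a pointer to the semantic model YAML on a stage. The account URL, bearer token, and stage path are hypothetical placeholders.

```python
# Hedged sketch: ask Cortex Analyst a question over its REST API.
# The account URL, token, and semantic model path are hypothetical.
import requests

token = "<jwt-or-oauth-token>"  # obtained separately, e.g., via key-pair auth

resp = requests.post(
    "https://<account>.snowflakecomputing.com/api/v2/cortex/analyst/message",
    headers={"Authorization": f"Bearer {token}"},
    json={
        "messages": [{
            "role": "user",
            "content": [{
                "type": "text",
                "text": "What was revenue by region last quarter?",
            }],
        }],
        # The semantic layer is supplied as a YAML file on a stage.
        "semantic_model_file": "@my_db.my_schema.my_stage/revenue.yaml",
    },
)
print(resp.json())  # the response typically includes generated SQL and an explanation
```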

Cortex Search

Cortex Search extends the Neeva acquisition from May 2023 and infuses it with the Snowflake Arctic embedding model to perform hybrid text and vector search without users having to write a single line of RAG code. Cortex Search abstracts the entire workflow of ingestion, chunking, vector embedding, retrieval, ranking, and generation. It automates continuous refreshes as new relevant documents are added to the knowledge source. The Cortex Search interface is Snowsight (Worksheets and Notebooks), and it provides connectors to third-party data sources.
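Setting up a service is a single DDL statement, and the TARGET_LAG setting is what drives the continuous refreshes mentioned above. A minimal sketch, with hypothetical table, column, and warehouse names:

```python
# Minimal sketch: define a Cortex Search service over a text column.
# Table, column, and warehouse names are hypothetical.
# `session` is the Snowpark Session created in the earlier Iceberg sketch.
session.sql("""
    CREATE OR REPLACE CORTEX SEARCH SERVICE support_search
        ON transcript_text
        ATTRIBUTES region
        WAREHOUSE = my_wh
        TARGET_LAG = '1 hour'
        AS (
            SELECT transcript_text, region, ticket_id
            FROM support_transcripts
        )
""").collect()
```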

Cortex Search can synthesize information spread across multiple documents and use reasoning to generate answers to business questions. Its semantic layer is automatically generated, unlike Cortex Analyst’s, which comes from a semantic YAML file.

In the future, Snowflake will allow users to bring third-party (or fine-tuned) embedding models. The search encompasses worksheets, dashboards, Streamlit apps, marketplaces, and other assets.

AI & ML Studio

This is a no-code development tool that gives citizen data scientists a natural language interface to invoke SQL and Python APIs. No-code tools are typically aimed at less-technical people, like business executives, who want to leverage data. By moving into this space, Snowflake is moving up the stack, giving citizen roles data science capabilities.

Studio is the entry point into multiple Cortex capabilities. Some of the no-code use cases include classification, forecasting, regression analysis, anomaly detection, and model fine-tuning, which is covered next. New features being added include testing and evaluating LLMs to find the best fit for users’ specific needs.

Cortex Fine-Tuning

This component performs serverless fine-tuning of models through a wizard-like user interface. In a demo, Snowflake showed how it used the output from a Mistral Large model to fine-tune a smaller, cheaper, and more efficient model. It uses intermediary relational tables for persistence of temporary data. This is another example of why it is critical to bring AI to the data, and not the other way around.

Accessible through AI & ML Studio, this feature is currently available for Meta and Mistral AI models. Fine-tuned models can be managed using Snowflake Model Registry.
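The same capability is also reachable from SQL through the Cortex FINETUNE function. A hedged sketch follows; the model name and training tables are hypothetical, and the training query is assumed to return prompt/completion pairs.

```python
# Hedged sketch: kick off a serverless fine-tuning job from SQL.
# Model name and training/validation tables are hypothetical.
# `session` is the Snowpark Session created in the earlier Iceberg sketch.
session.sql("""
    SELECT SNOWFLAKE.CORTEX.FINETUNE(
        'CREATE',
        'support_assistant',                                  -- new model name
        'mistral-7b',                                         -- base model
        'SELECT prompt, completion FROM training_examples',   -- training data
        'SELECT prompt, completion FROM validation_examples'  -- validation data
    )
""").collect()
```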

Application Development

Snowflake supports three modalities of apps: connected apps, managed apps, and native apps.

  • Connected apps: the code lives outside Snowflake, but customers manage their own data.
  • Managed apps: the application provider manages the code and data.
  • Native apps: fully integrated options where the code and the data both exist inside Snowflake. The apps run either natively in the database or as Docker containers in a managed Kubernetes environment, called Snowpark Container Services (SPCS).

Snowflake is making a major push towards native apps and has over 200 already available. A native app takes advantage of all the built-in governance features.

Snowflake now has an integrated notebook that can be accessed in Streamlit, Snowpark, or Cortex AI. Snowflake Notebook on SPCS can run on a CPU or a GPU. Package management in the notebook is handled through Anaconda.

SPCS can be extended to run external apps. For example, it can mount an AWS EBS volume and run PostgreSQL or Redis.
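A hedged sketch of what that looks like: an off-the-shelf PostgreSQL container with a block-storage volume, deployed as an SPCS service. The compute pool, image repository path, and volume sizing are hypothetical, and the spec shape is illustrative rather than definitive.

```python
# Hedged sketch: run an off-the-shelf PostgreSQL container in SPCS with a
# block-storage volume. Pool, image path, and sizes are hypothetical.
# `session` is the Snowpark Session created in the earlier Iceberg sketch.
session.sql("""
    CREATE SERVICE pg_service
        IN COMPUTE POOL my_cpu_pool
        FROM SPECIFICATION $$
        spec:
          containers:
          - name: postgres
            image: /my_db/my_schema/my_repo/postgres:16
            volumeMounts:
            - name: pgdata
              mountPath: /var/lib/postgresql/data
          volumes:
          - name: pgdata
            source: block   # persistent block storage for the database files
            size: 10Gi
        $$
""").collect()
```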

Along with building apps, Snowflake can also be used to distribute apps in a cloud-native manner. Some of the new announcements in this area include:

  • SnowGit provides GitOps capabilities.
  • DevOps using the Snowflake CLI for CI/CD capabilities. Snowflake is introducing a declarative approach to database change management (see the sketch after this list).
  • Snowflake Trail provides observability into data quality, pipelines, and applications in Snowpark and SPCS. It diagnoses and debugs errors with metrics, logs, and traces based on the OpenTelemetry standards, so users can work in either Snowsight or any OTel-compatible alerting platform, such as Datadog or Grafana.
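The declarative change management mentioned above centers on CREATE OR ALTER, which converges an object to the stated definition instead of requiring hand-written ALTER scripts; a CI/CD pipeline driven by the Snowflake CLI can simply re-apply the definition. A minimal sketch with a hypothetical table:

```python
# Minimal sketch: declarative change management with CREATE OR ALTER.
# Re-running this statement converges the table to this definition.
# `session` is the Snowpark Session created in the earlier Iceberg sketch.
session.sql("""
    CREATE OR ALTER TABLE orders (
        order_id   INT,
        status     STRING,
        created_at TIMESTAMP
    )
""").collect()
```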

Two years ago, Snowflake announced its transactional capabilities via UniStore, which is expected to be generally available in late 2024.

Data Quality, Operations, Security & Governance

Snowflake’s Analyst Relations team kicked off the analyst program by debunking the news of threat actors breaching Snowflake and stealing customer data. This was another case of unfavorable news timed to coincide with the start of the Summit. While details of the event are still emerging, it is nevertheless a reminder of the importance of multi-factor authentication (MFA).

Trust Center is Snowflake’s UI for discovering vulnerabilities and resolving security risks. It regularly scans data and apps against the Center for Internet Security (CIS) benchmark to proactively mitigate security risks. As mentioned above, it is part of the Horizon solution.

Data quality is so critical to Snowflake that it has added native monitoring capabilities inside the Horizon Catalog. Users can define data quality rules and use SQL to set alerts, troubleshoot, or visualize trends.
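These rules are built on data metric functions attached to tables. A hedged sketch using one of the built-in metrics follows; the table, column, and schedule are hypothetical.

```python
# Hedged sketch: attach a built-in data metric function to a table and
# schedule it. Table, column, and schedule are hypothetical.
# `session` is the Snowpark Session created in the earlier Iceberg sketch.
session.sql("""
    ALTER TABLE orders SET DATA_METRIC_SCHEDULE = '60 MINUTE'
""").collect()

# Count NULLs in the status column every hour; the results can drive
# alerts or be visualized as trends.
session.sql("""
    ALTER TABLE orders ADD DATA METRIC FUNCTION
        SNOWFLAKE.CORE.NULL_COUNT ON (status)
""").collect()
```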

The Data Classification API gives users the ability to specify tags at the schema level, which can be used for better governance.
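A hedged sketch of the classification flow: SYSTEM$CLASSIFY scans a table and, with auto_tag enabled, applies system tags such as semantic and privacy categories. The fully qualified table name is hypothetical.

```python
# Hedged sketch: classify a table and auto-apply system tags.
# The fully qualified table name is hypothetical.
# `session` is the Snowpark Session created in the earlier Iceberg sketch.
session.sql("""
    CALL SYSTEM$CLASSIFY('my_db.my_schema.customers', {'auto_tag': true})
""").collect()
```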

Data Clean Rooms, based on the Samooha native app acquisition, are now generally available. Snowflake’s Media & Entertainment clients use them to share first-party data in a governed manner, given all the regulations that apply to cookies.

Partner Ecosystem

One of Snowflake’s greatest assets is its vibrant partner ecosystem. Its marketplace has over 2500 products. However, as Snowflake adds more capabilities to its platform, some vendors’ value proposition becomes questionable.

For example, Snowflake has added native support for notebooks and enhanced its cost management capabilities. This directly overlaps with the many vendors who rely solely on the Snowflake platform for their narrow offerings. However, other vendors are delighted that Snowflake has jumped into the fray, since they have more feature-rich offerings that run across many other platforms.

Many of Snowflake’s early enhancements are in the form of tables and views. Vendors use the underlying views to offer a more curated user experience.

Many partners showcased their offerings at the Summit. Andrew Ng, co-founder of Landing AI, spoke at the developer conference on how his company is building vision AI agents that use LLM services like Cortex. Their use cases range from medical imaging to drone footage analysis and generating appropriate actions.

Guy Adams, co-founder of DataOps.live, showed how his company enables Cortex-powered Native Apps to be built, tested, and deployed on Snowflake in less than 10 minutes. In fact, Snowflake itself is one of its customers, having built its internal Snowflake Solution Central on DataOps.live.

Machine Learning

I would be remiss if I didn’t mention the vast set of products from Snowflake that cater to data scientists and ML engineers. There is so much emphasis on generative AI these days that it is easy to overlook any new developments in the machine learning space.

Snowflake is helping enterprises build and deploy ML solutions by providing a full set of capabilities for end-to-end machine learning on a single platform right next to secure and governed data as shown in the figure below.

Figure 2: Snowflake ML Capabilities

Business analysts can use pre-built ML functions directly with SQL on their data, or through Studio, a no-code user interface. This democratizes ML and lets non-developers build and use ML solutions.

For custom ML development, there is a full range of tools and APIs for end-to-end ML development and deployment. This includes the Snowpark ML Modeling API, the new Snowflake Feature Store for creating, managing, and serving ML features, and the Snowflake Model Registry, which makes model management and inference easy in Snowflake.

These capabilities can be used with the new Snowflake Notebooks. And everything runs on Snowflake’s compute options: the warehouse runtime or a container runtime.
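To make the custom path concrete, here is a hedged sketch that trains a model with the sklearn-style Snowpark ML Modeling API and logs it to the Model Registry; the table and column names are hypothetical.

```python
# Hedged sketch: train with the Snowpark ML Modeling API (sklearn-style),
# then log the model to the Snowflake Model Registry.
# Table and column names are hypothetical; `session` is the Snowpark
# Session created in the earlier Iceberg sketch.
from snowflake.ml.modeling.xgboost import XGBRegressor
from snowflake.ml.registry import Registry

train_df = session.table("sales_training")

model = XGBRegressor(
    input_cols=["price", "promo_flag"],
    label_cols=["units_sold"],
    output_cols=["predicted_units"],
)
model.fit(train_df)  # training executes inside Snowflake, next to the data

# Register the fitted model so it can be versioned and served for inference.
reg = Registry(session=session)
reg.log_model(model, model_name="demand_forecaster", version_name="v1")
```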

Closing Thoughts

Snowflake is the beneficiary of the massive pivot to AI because most of the action, irrespective of what the investor community thinks, is in managing data. At the Summit, attendees spoke of data and AI interchangeably because they are so inextricably connected. The vision of an Intelligent Data Platform laid out in my 2024 Trends in Data and AI report is coming to fruition. There is no AI strategy without a data tragedy, er, I mean, strategy.

Snowflake has benefited from its simplified and integrated platform as well as a compelling developer ecosystem. These days, large companies like Snowflake and the hyperscalers are innovating at the speed of a startup. Interestingly, for companies in the data and AI space, the technology moat is constantly shifting. Data catalogs were an opportunity to lock in users and make them sticky, until they became open-source. The only moat now seems to be capital (funds and people skills) and distribution. Snowflake has plenty of both.

At the beginning of this report, I mentioned that customers are seeking integrated stacks with interoperability. I will amend that to say customers are seeking integrated stacks, with interoperability, and no vendor lock-in. This is like having your cake, eating it too, and not putting on weight.

As always, I want to thank you for taking the time to hear my thoughts. You can watch the commentary on the Databricks and Snowflake summit announcements in the latest episode of the ‘It Depends’ podcast.


Sanjeev Mohan

Sanjeev researches the space of data and analytics. Most recently he was a research vice president at Gartner. He is now a principal with SanjMo.