Snowflake Summit 2023 Announcements
This document showcases new developments announced at the summit. Some are generally available (GA) and many are in private or public preview; because statuses change constantly, this note doesn't track them. Please refer to Snowflake's website for the current status of each feature. The figure below buckets the announcements into categories chosen by the author to make sense of these developments; the grouping does not reflect how Snowflake may view them.
Developer Platform
One of the most exciting developments is Snowpark Container Services, which dramatically expands Snowflake's reach. By supporting Docker containers and hiding the complexities of Kubernetes, Snowflake can now run almost any job, function, or service in users' own accounts: third-party LLMs, a Hex notebook, a C++ application, or even a full database such as Pinecone. The service also supports GPUs.
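A minimal sketch of the flow: create a compute pool, then launch a service from a container specification staged in Snowflake. All names here (the pool, stage, and spec file) are illustrative, and the exact parameters and instance families were still in preview at the time:

```sql
-- Sketch only: create a small compute pool for the service
CREATE COMPUTE POOL my_pool
  MIN_NODES = 1
  MAX_NODES = 1
  INSTANCE_FAMILY = CPU_X64_XS;

-- Launch a containerized service from a spec file uploaded to a stage
CREATE SERVICE my_llm_service
  IN COMPUTE POOL my_pool
  FROM @my_stage
  SPECIFICATION_FILE = 'service_spec.yaml';
```

The spec file references the Docker image to run; Snowflake handles the Kubernetes-level orchestration underneath.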
Streamlit is getting a lot of emphasis as a faster, easier way (compared with React) to build app user interfaces, leading to a better developer experience. It is an open-source, Python-based framework compatible with major libraries like scikit-learn, PyTorch, and pandas, and it has Git integration for branching, merging, and version control.
A huge emphasis is on native apps. In a 'native app', the app and the data both reside inside Snowflake's data cloud, which protects the data and secures the developer's intellectual property. Developers use Streamlit and the Snowflake Marketplace to build, test, and distribute these apps.
Snowflake launched the 'Native App Framework on AWS.' The framework eases development and governance by closing security loopholes and mitigating threats. Once an app is built, developers can publish it in the Marketplace and set up custom billing (see below) for monetization. More than 30 apps are already live, e.g., Matillion, Capital One's Slingshot for FinOps, and LiveRamp's Identity Resolution.
Of course, no conference would be complete without large language models (LLMs) and generative AI. LLMs can even be used to search for native apps.
Generative AI and ML
Snowflake is leveraging two of its recent acquisitions, Applica and Neeva, to provide a new generative AI experience. The Applica acquisition has led to Document AI, an LLM that extracts contextual entities from unstructured data and lets users query that data in natural language.
The extracted, now-structured data is persisted in Snowflake and vectorized. Not only can this data be queried in natural language, it can also be used to retrain the LLM on private enterprise data. While most vendors are pursuing prompt engineering, Snowflake is following the retraining path.
Neeva's technology powers the search engine.
Snowflake now offers LLMs in four ways:
- Native embedded LLMs like the Document AI or text to code LLM.
- LLM-powered Streamlit apps, such as Reka AI. Snowflake says it now has 6,000 LLM-powered Streamlit apps.
- LLMs running inside Snowpark Container Services, such as models from Hugging Face.
- External LLMs accessed via APIs, such as OpenAI, Cohere, Anthropic, etc.
Snowflake acknowledges that training models in its ecosystem used to be hard, which pushed users to external systems like AWS SageMaker. It now provides full MLOps capabilities, including a Model Registry where models can be stored, version-controlled, and deployed. Snowflake is also adding a feature store compatible with the open-source Feast, and is building a LangChain integration.
Single Platform
Last year, Snowflake added support for Iceberg tables. This year it brings them under its security, governance, and query-optimizer umbrella. Iceberg tables now match the query latency of tables in Snowflake's native format.
In addition, support becomes bidirectional: you can define a table in Snowflake as an Iceberg table and access it from external Iceberg-compatible engines. Snowflake also syncs its catalog with external ones.
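Defining a Snowflake-managed Iceberg table might look roughly like this. The syntax was in preview at the time, and the external volume and table names are illustrative assumptions:

```sql
-- Sketch: a Snowflake-managed Iceberg table whose data files live on
-- an external volume (cloud object storage configured separately)
CREATE ICEBERG TABLE sensor_events (
  device_id STRING,
  reading   DOUBLE,
  ts        TIMESTAMP_NTZ
)
  CATALOG = 'SNOWFLAKE'
  EXTERNAL_VOLUME = 'my_iceberg_volume'
  BASE_LOCATION = 'sensor_events/';
```

Because the catalog is Snowflake's, the table is governed and optimized like a native table, while external engines can still read the Iceberg files.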
Streaming use cases are important to Snowflake's roadmap. Snowpipe Streaming, via the Kafka connector, now writes rows directly to tables without first staging files on object stores. Sensor data, for example, lands as rows in a table, and Snowsight shows the row count as it grows.
A new construct called 'dynamic tables' creates a pipeline over incremental data with a configurable lag (e.g., one minute). Dynamic tables resemble materialized views, but they relax strict consistency in exchange for that lag and support richer queries, including window functions. Streaming data in dynamic tables can be joined with historical data, such as maintenance logs.
A native text-to-code LLM can even build dynamic tables without the user writing any SQL.
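A dynamic table is declared with a query plus a target lag; Snowflake then keeps the result incrementally refreshed. A minimal sketch, assuming a hypothetical source table `raw_sensor_readings` and a warehouse named `transform_wh`:

```sql
-- Sketch: an incrementally maintained aggregation with ~1 minute freshness
CREATE OR REPLACE DYNAMIC TABLE device_daily_stats
  TARGET_LAG = '1 minute'
  WAREHOUSE = transform_wh
AS
  SELECT device_id,
         DATE_TRUNC('day', ts) AS day,
         AVG(reading)          AS avg_reading,
         COUNT(*)              AS n_readings
  FROM raw_sensor_readings
  GROUP BY device_id, DATE_TRUNC('day', ts);
```

Unlike a scheduled task, there is no orchestration code: the declared `TARGET_LAG` tells Snowflake how stale the table is allowed to get, and the refresh cadence follows from that.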
FinOps
Snowflake is addressing the criticism of its high cost through several initiatives designed to make costs predictable and transparent.
- Snowflake Performance Index (SPI) — using ML functions, it analyzes query durations for stable workloads and automatically optimizes them. This has led to 15% improvement on customers’ usage costs.
- Budgets — new SQL statements create a budget and add workloads to monitor. Users are alerted when spending exceeds the budget.
- Marketplace Capacity Drawdown Program — Native apps can be bought through Snowflake capacity commitments.
- Custom event billing — usage-based-only billing is so yesterday. App developers offering their products in the Marketplace can now set up billing on events such as the number of rows inserted or updated in a month, or the number of rows scanned.
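The Budgets item above is driven by SQL. The interface was in preview at the time, so treat the exact statements below as an assumption; the budget name, limit, and monitored table are all illustrative:

```sql
-- Sketch: create a budget object, cap monthly spend, and attach a resource
CREATE SNOWFLAKE.CORE.BUDGET my_budget();

-- Set a monthly spending limit (in credits)
CALL my_budget!SET_SPENDING_LIMIT(1000);

-- Add a table's serverless workloads to the budget's monitoring scope
CALL my_budget!ADD_RESOURCE(
  SYSTEM$REFERENCE('TABLE', 'my_db.my_schema.orders', 'SESSION', 'applybudget')
);
```

Once resources are attached, alerts fire when projected spend exceeds the limit.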
Core
Snowflake has invested heavily in building native data quality capabilities into its platform. Users can define quality-check metrics to profile data and gather statistics on column value distributions, null values, and so on. These metrics are written to time-series tables, which helps establish thresholds and detect anomalies against regular patterns.
SQL Delight is a series of fun enhancements (if you find these things exciting) to SQL statements. For example, instead of spelling out a long list of columns to select, you can now select * and 'exclude' the columns you don't want. You can also group by 'all'. Other changes include improvements to rounding and min/max functions.
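The two enhancements called out above look like this in practice (table and column names are illustrative):

```sql
-- EXCLUDE: select every column except the ones you name
SELECT * EXCLUDE (internal_id, loaded_at)
FROM orders;

-- GROUP BY ALL: group by every non-aggregated column in the SELECT list,
-- here equivalent to GROUP BY region, product
SELECT region, product, SUM(amount) AS total
FROM orders
GROUP BY ALL;
```

Both are small conveniences, but they remove a common source of copy-paste errors in wide tables and long GROUP BY lists.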
Performance enhancements come from reducing full table scans by searching on substrings, such as error codes inside logs. Warehouses can also now be sized according to demand patterns to optimize utilization.
ML-powered functions, like time-series forecasting or anomaly detection, can be called directly from SQL. This democratizes time-series analysis without needing ML expertise.
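A sketch of the SQL-callable forecasting flow: train a forecast object on a table of timestamped values, then ask it for future points. The interface was in preview at the time, and the model, table, and column names here are assumptions:

```sql
-- Sketch: train a forecasting model directly from SQL
CREATE SNOWFLAKE.ML.FORECAST sales_model(
  INPUT_DATA        => SYSTEM$REFERENCE('TABLE', 'daily_sales'),
  TIMESTAMP_COLNAME => 'day',
  TARGET_COLNAME    => 'revenue'
);

-- Generate a 14-day forecast with prediction intervals
CALL sales_model!FORECAST(FORECASTING_PERIODS => 14);
```

No Python, notebooks, or external ML tooling is involved, which is the point: an analyst who knows SQL can produce a forecast.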
AutoSQL is a copilot that uses LLMs to generate SQL code.