Data & AI Summit 2022: Databricks Commits to Open Source Innovation

The News: The annual Data & AI Summit in San Francisco this week, organized by Databricks, is one of the largest gatherings of the data and AI community, featuring experts from across the data and AI ecosystem. For more information and a full overview of announcements made at the Summit, see the Databricks blog.

Data & AI Summit Recap 2022: Databricks Continues Commitment to Open Source Powered Innovation

Analyst Take: The Data & AI Summit, organized by Databricks, offered this year as both an in-person event and a virtual one, featured insightful presentations, a wealth of announcements and new capabilities information, and training opportunities.

Databricks is in a fast-growing segment of the market and competes directly with some of the hottest companies in technology, most notably Snowflake. While this is good for the growth prospects of the business, it also means the company must focus on continued growth and innovation at a relentless pace. To put this in perspective, the last time the company held an in-person gathering, in 2019, it had just over 1,000 staffers. Today that number stands at 4,334 employees. Adding over 3,000 staff during the pandemic alone is certainly growth.

Against this backdrop of rapid growth and innovation, the keynote on Tuesday was the usual drop of announcement after announcement, with the packed auditorium barely having time to digest the last announcement before the company announced yet another set of new capabilities. So, let’s dive into those announcements.

Databricks Project Lightspeed

Streaming data is becoming a more critical part of the computing landscape. Streaming is the basis for making quick decisions on the vast quantities of incoming data that systems generate. Processing streaming data has also proved to be technically challenging and requires different approaches from event-driven applications and batch processing.

Structured Streaming was introduced in Apache Spark 2.0 and is a scalable and fault-tolerant stream processing engine built on the Spark SQL engine. Structured Streaming has been the mainstay for several years and is widely adopted across organizations globally, processing more than 1 PB of data per day on the Databricks platform alone according to the company.

That’s where Databricks Project Lightspeed comes in. As the adoption of streaming has increased, new requirements have emerged, and Project Lightspeed is designed to meet them. Some highlights include:

Improving the latency and ensuring predictability
Enhancing functionality for processing data with new operators and APIs
Improving ecosystem support for connectors
Simplifying deployment, operations, monitoring, and troubleshooting

Open Source – Delta Lake 2.0

The model of for-profit businesses building on the foundation of open source projects is well proven, with Red Hat and SUSE, amongst many others in the industry, showing strong growth. Databricks is another poster child for this model, and the company doubled down on this approach this week with the announcement of its focus on the Delta Lake 2.0 release.

Delta Lake has been a Linux Foundation project since October 2019 and is the open storage layer that brings reliability and performance to data lakes via the “lakehouse architecture.” Since being introduced by Databricks in 2020, Lakehouses have become a widely deployed solution among data engineers, analysts, and data scientists who want the flexibility to run different workloads on the same data with minimal complexity and no duplication. The popularity of Lakehouse is evidenced by over 7 million downloads per month.

Michael Armbrust, a distinguished engineer at Databricks and a co-founder of the Delta Lake project, made the announcement on Tuesday and demonstrated how the new features will dramatically improve performance and manageability compared to previous versions and other storage formats. Databricks leads contributions to the code base and continuously contributes new features to the project, and Armbrust is personally a major contributor.

While more than 70 organizations are collaborating and contributing, one could safely argue that Databricks is leading the charge. I see this type of open source innovation as not only a strong differentiator for the company, but also beneficial for customers looking to avoid vendor lock-in.

Data Clean Rooms

As new business models emerge and organizations look to collaborate with partners, suppliers, and even competitors, the need to share datasets is increasing. Against the backdrop of these emerging use cases, it was encouraging to see the company announce data cleanrooms for the Lakehouse. These cleanrooms are designed to allow businesses to easily collaborate with partners on any cloud in a way that is highly focused on privacy. Participants in the data cleanrooms can share and join their existing data and run complex workloads on it in any of the common languages, including Python, R, SQL, Java, and Scala, while also maintaining data privacy.

The design point for a data cleanroom is to provide a secure, governed, and privacy-safe collaboration space where multiple participants can join their first-party data and perform analysis on it without the risk of exposing their data to other participants. Participants have full control of their data and can decide which participants can perform what analysis on it without exposing any sensitive data, such as Personally Identifiable Information (PII). While this technology is still nascent, this kind of structured, privacy-focused collaboration is proving increasingly important and sought after. This provides a strong underpinning for the growth of Databricks Lakehouse technology, and I believe it will provide further justification for new clients looking to adopt the technology.

Databricks Marketplace

One area that was particularly interesting, and was also the subject of much discussion in the Q&A session with Databricks’ CEO Ali Ghodsi, was the announcement of the newly launched Databricks Marketplace.

The Databricks Marketplace is powered by the company’s Delta Sharing functionality. It is an open marketplace for exchanging datasets, notebooks, dashboards, and machine learning models, and it allows consumers to access data products without having to be on the Databricks platform. The power of this open approach is that it allows data providers to broaden their addressable market without forcing data consumers into vendor lock-in.

One example of a use case would be for organizations to add a weather forecasting dataset from a source like AccuWeather without having to have a formal relationship with the dataset provider.

Providers can now commercialize new offerings and shorten sales cycles by providing value-added services on top of their data. While Ghodsi was skeptical of thoughts around monetization of this type of marketplace, I concur with his viewpoint that this is a significant way for Databricks to provide value and further adoption of the underlying Marketplace technology. As for monetization, we shall see what develops on that front.

The Databricks Unity Catalog

Databricks’ newly announced Unity Catalog is designed to provide a unified governance solution for data assets on the Lakehouse. The solution will soon be generally available on AWS, and the Azure version is currently in public preview. Unity Catalog is designed to automatically track data lineage across queries executed in any language. Data lineage is captured down to the table and column level, while key assets such as notebooks, dashboards, and jobs are also tracked.

This approach to lineage opens up new use cases. One example outlined by the company is the ability for consumers to assess the impact that changes to tables will have on their data. Another is auto-generated documentation that they can use to better understand data in the lakehouse.
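To make the impact-assessment idea concrete, here is a minimal sketch of how column-level lineage can answer the question "what breaks if I change this column?" It is not Unity Catalog's API or implementation. The `LINEAGE` mapping, the table and column names, and the `downstream_columns` helper are all hypothetical, and the lineage graph is hand-written rather than captured from queries.

```python
from collections import deque

# Hypothetical column-level lineage: each key feeds the columns listed as its value.
# All table and column names are invented for illustration.
LINEAGE = {
    "raw.orders.amount": ["clean.orders.amount_usd"],
    "clean.orders.amount_usd": [
        "gold.revenue.daily_total",
        "gold.customers.lifetime_value",
    ],
    "raw.orders.customer_id": ["gold.customers.lifetime_value"],
}


def downstream_columns(column, lineage):
    """Return every column that directly or indirectly depends on `column`."""
    seen = set()
    queue = deque([column])
    while queue:
        current = queue.popleft()
        for child in lineage.get(current, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return sorted(seen)


# Which columns would a change to raw.orders.amount affect?
print(downstream_columns("raw.orders.amount", LINEAGE))
```

A breadth-first walk over the graph is enough here because the question is only which columns are reachable, not how they are derived.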

One feature that stood out was that Unity Catalog also includes an enhanced built-in search capability. End users can easily search across metadata fields including table names, column names, and comments to find the data they need for their analysis. This search capability automatically leverages the governance model inherent in the Unity Catalog approach. Users will only see search results for data they have access to, which serves as a productivity boost for the user and also provides further control for data administrators looking to ensure that sensitive data is protected.
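The interaction between search and governance can be illustrated with a small toy model. This is a sketch under stated assumptions, not Unity Catalog's actual API. The `Table` class, the `CATALOG` contents, the user names, and the `search` function are hypothetical, and real permission models are considerably richer than a set of reader names.

```python
from dataclasses import dataclass, field


@dataclass
class Table:
    # Hypothetical catalog entry; `readers` stands in for a real permission model.
    name: str
    columns: list
    comment: str = ""
    readers: set = field(default_factory=set)


CATALOG = [
    Table("sales.orders", ["order_id", "amount"], "Customer orders", {"analyst", "admin"}),
    Table("hr.salaries", ["employee_id", "amount"], "Payroll data", {"admin"}),
]


def search(term, user, catalog):
    """Match `term` against table names, column names, and comments,
    returning only tables that `user` is allowed to read."""
    term = term.lower()
    hits = []
    for table in catalog:
        if user not in table.readers:
            continue  # the access check runs before any metadata is matched
        haystack = [table.name, table.comment, *table.columns]
        if any(term in text.lower() for text in haystack):
            hits.append(table.name)
    return hits


print(search("amount", "analyst", CATALOG))  # ['sales.orders']
print(search("amount", "admin", CATALOG))    # ['sales.orders', 'hr.salaries']
```

Both tables contain an `amount` column, but the analyst never learns that `hr.salaries` exists, which is the property the paragraph above describes.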

Looking Ahead

Demand for AI and data analytics is running hot in the industry right now despite tough economic conditions. That’s because better insight into and analysis of data is a recession-proofing step that smart, strategic leaders are embracing. That bodes well for vendors like Databricks, focused on powering the quest for accessing better insights faster, making them well-positioned to see strong growth in the months ahead.

Databricks’ model of making money by renting analytics, AI, and other cloud-based software designed to help companies mine insights from business data is a solid approach and one that’s getting traction within the industry. The fact that these services are based on open source Apache Spark, a real-time data-analytics technology, is good for the company and also addresses the vendor lock-in issue.

As I mentioned earlier, Databricks is experiencing rapid growth. It’s clear the company is focused on innovation in myriad ways under the leadership of CEO Ali Ghodsi and is adroitly looking to differentiate itself in an increasingly complicated landscape for data and AI solutions.

As multiple companies look to capture the industry buzz around Lakehouse approaches to data management and applying AI to vast quantities of data, the team at Databricks is well placed to capitalize on the tailwinds if the company continues its commitment to open source powered innovation. I look forward to seeing what’s ahead.

Disclosure: Futurum Research is a research and advisory firm that engages or has engaged in research, analysis, and advisory services with many technology companies, including those mentioned in this article. The author does not hold any equity positions with any company mentioned in this article.

Analysis and opinions expressed herein are specific to the analyst individually and data and other information that might have been provided for validation, not those of Futurum Research as a whole.

The original version of this article was first published on Futurum Research.

Steven Dickens

Steven Dickens is Vice President of Sales and Business Development and Senior Analyst at Futurum Research. Operating at the crossroads of technology and disruption, Steven engages with the world’s largest technology brands, exploring new operating models and how they drive innovation and competitive edge for the enterprise. With experience in Open Source, Mission Critical Infrastructure, Cryptocurrencies, Blockchain, and FinTech innovation, Dickens makes the connections between the C-Suite executives, end users, and tech practitioners that are required for companies to drive maximum advantage from their technology deployments. Steven is an alumnus of industry titans such as HPE and IBM and has led multi-hundred million dollar sales teams that operate on the global stage. Steven was a founding board member, former Chairperson, and now Board Advisor for the Open Mainframe Project, a Linux Foundation Project promoting Open Source on the mainframe. Steven Dickens is a Birmingham, UK native, and his speaking engagements take him around the world each year as he shares his insights on the role of technology and how it can transform our lives going forward.
