Several contributions to popular data and AI open source projects including Delta Lake, MLflow, and Apache Spark were announced
● All Delta Lake enhancements contributed to Linux Foundation with release of Delta Lake 2.0
● MLflow 2.0 with ML Pipelines accelerates time-to-production for machine learning projects
● Spark Connect introduced to allow Apache Spark™ to run on any device
● Project Lightspeed revealed for next generation Spark Streaming
Databricks, the data and AI company and pioneer of the data lakehouse paradigm, today announced several contributions to popular data and AI open source projects including Delta Lake, MLflow, and Apache Spark.
At the Data + AI Summit, the largest gathering of the open source data and AI community, Databricks announced that the company will contribute all features and enhancements it has made to Delta Lake to the Linux Foundation and open source all Delta Lake APIs as part of the Delta Lake 2.0 release. In addition, the company announced MLflow 2.0, which includes MLflow Pipelines, a new feature to accelerate and simplify ML model deployments. Finally, the company introduced Spark Connect to enable the use of Spark on virtually any device, and Project Lightspeed, a next-generation Spark Structured Streaming engine for data streaming on the lakehouse.
“From the beginning, Databricks has been committed to open standards and the open source community. We have created, contributed to, fostered the growth of, and donated some of the most impactful innovations in modern open source technology,” said Ali Ghodsi, Co-Founder and CEO of Databricks. “Open data lakehouses are quickly becoming the standard for how the most innovative companies handle their data and AI. Delta Lake, MLflow and Spark are all core to this architectural transformation, and we’re proud to do our part in accelerating their innovation and adoption.”
Delta Lake 2.0 Brings the Lakehouse to Everyone
Delta Lake 2.0 will bring unmatched query performance to all Delta Lake users and enable everyone to build a highly performant data lakehouse on open standards. With this contribution, Databricks customers and the open source community will benefit from the full functionality and enhanced performance of Delta Lake 2.0. The Delta Lake 2.0 Release Candidate is now available, and the full release is expected later this year. The breadth of the Delta Lake ecosystem makes it flexible and powerful in a wide range of use cases. Fueling this is a vibrant community of over 6,400 members, with developers from more than 70 contributing organizations.
“Databricks provides Akamai with a table storage format that is open and battle-tested for demanding workloads such as ours. The lakehouse powers interactive analytics at scale so that our customers can have near real-time analysis of security events within our Edge platform,” said Aryeh Sivan, VP Engineering at Akamai. “We are very excited about the rapid innovation that Databricks, along with the rapidly growing community, is bringing to Delta Lake. We are also looking forward to collaborating with other developers on the project to move the data community to greater heights.”
“The Delta Lake project is seeing phenomenal activity and growth trends indicating the developer community wants to be a part of the project. Contributor strength has increased by 60% during the last year and the growth in total commits is up 95% and the average lines of code per commit is up 900%. We are seeing this upward velocity from contributing organizations like Uber Technologies, Walmart and CloudBees, Inc., among others,” said Executive Director of the Linux Foundation, Jim Zemlin.
MLflow 2.0 Introduces MLflow Pipelines to Templatize and Automate MLOps
As one of the most successful open source machine learning (ML) projects, MLflow set the standard for ML platforms. The release of MLflow 2.0 introduces MLflow Pipelines to the platform, substantially decreasing time to production and improving execution at scale through standardization. MLflow Pipelines offers data scientists pre-defined, production-ready templates based on the model type they’re building to allow them to reliably bootstrap and accelerate model development without requiring intervention from production engineers.
Next Generation Streaming Engine and Spark Whenever and Wherever
As the leading unified engine for large-scale data analytics, Spark scales seamlessly to handle data sets of all sizes. However, the lack of remote connectivity and the burden of applications developed and run on the driver node make it hard to meet the requirements of modern data applications. To tackle this, Databricks introduced Spark Connect, a client and server interface for Apache Spark based on the DataFrame API that will decouple the client and server for better stability and allow for built-in remote connectivity. With Spark Connect, users will be able to access Spark from any device.
In collaboration with the Spark community, Databricks also announced Project Lightspeed, the next generation of the Spark streaming engine. As the diversity of applications moving into streaming data has increased, new requirements have emerged to support data streaming, one of the most in-demand workloads for the lakehouse. Spark Structured Streaming has been widely adopted since the early days of streaming because of its ease of use, performance, large ecosystem, and developer communities. With that in mind, Databricks will collaborate with the community and encourage participation in Project Lightspeed to improve performance, expand ecosystem support for connectors, enhance functionality for processing data with new operators and APIs, and simplify deployment, operations, monitoring and troubleshooting.