Log4j2 Vulnerability (CVE-2021-44228) Research and Assessment
This blog relates to an ongoing investigation. We will update it with any significant developments, including detection rules to help people investigate potential exposure due to CVE-2021-44228 both within their own usage on Databricks and elsewhere. Should our investigation conclude that customers may have been impacted, we will individually notify those customers proactively by email...
Scala at Scale at Databricks
With hundreds of developers and millions of lines of code, Databricks is one of the largest Scala shops around. This post will be a broad tour of Scala at Databricks, from its inception to usage, style, tooling and challenges. We will cover topics ranging from cloud infrastructure and bespoke language tooling to the human processes...
The Foundation of Your Lakehouse Starts With Delta Lake
It’s been an exciting few years for the Delta Lake project. The release of Delta Lake 1.0, announced by Michael Armbrust at the Data+AI Summit in May 2021, represents a great milestone for the open source community, and we’re just getting started! To better streamline community involvement and ask, we recently published Delta...
Turning 2 Trillion Data Points of Traffic Intelligence into Critical Business Insights
This is a guest-authored post by Stephanie Mak, Senior Data Engineer, formerly at Intelematics. This blog post shares my experience of contributing to the open source community with Bricklayer, which I started during my time at Intelematics. Bricklayer is a utility for data engineers whose job is to farm jobs, build map layers...
Introducing Apache Spark™ 3.2
We are excited to announce the availability of Apache Spark™ 3.2 on Databricks as part of Databricks Runtime 10.0. We want to thank the Apache Spark community for their valuable contributions to the Spark 3.2 release. The number of monthly Maven downloads of Spark has rapidly increased to 20 million. The year-over-year growth rate represents...
Native Support of Session Window in Spark Structured Streaming
Apache Spark™ Structured Streaming allows users to do aggregations on windows over event-time. Before Apache Spark™ 3.2, Spark supported tumbling windows and sliding windows. In the upcoming Apache Spark 3.2, we add “session windows” as a new supported type of window, which works for both streaming and batch queries. What is a "session window"? Tumbling...
Efficient Point in Polygon Joins via PySpark and BNG Geospatial Indexing
This is a collaborative post by Ordnance Survey, Microsoft and Databricks. We thank Charis Doidge, Senior Data Engineer, and Steve Kingston, Senior Data Scientist, Ordnance Survey, and Linda Sheard, Cloud Solution Architect for Advanced Analytics and AI at Microsoft, for their contributions. This blog presents a collaboration between Ordnance Survey (OS), Databricks and Microsoft...
Pandas API on Upcoming Apache Spark™ 3.2
We’re thrilled to announce that the pandas API will be part of the upcoming Apache Spark™ 3.2 release. pandas is a powerful, flexible library and has grown rapidly to become one of the standard data science libraries. Now pandas users will be able to leverage the pandas API on their existing Spark clusters. A few...
Shiny and Environments for R Notebooks
At Databricks, we want the Lakehouse ecosystem widely accessible to all data practitioners, and R is a great interface language for this purpose because of its rich ecosystem of open source packages and broad use as a computing language for many non-computing scientific disciplines. The product team at Databricks actively engages with R users to...
How We Built Databricks on Google Kubernetes Engine (GKE)
Our release of Databricks on Google Cloud Platform (GCP) was a major milestone toward a unified data, analytics and AI platform that is truly multi-cloud. Databricks on GCP, a jointly developed service that allows you to store all of your data on a simple, open lakehouse platform, is based on standard containers running on top of...