Implementing MLOps on Databricks using Databricks notebooks and Azure DevOps, Part 2
This is the second part of a two-part series of blog posts that show an end-to-end, notebook-based MLOps framework on Databricks. In the first post, we presented a complete CI/CD framework on Databricks with notebooks. The approach is based on the Azure DevOps ecosystem for the Continuous Integration (CI) part and...
Enabling Computer Vision Applications With the Data Lakehouse
The potential for computer vision applications to transform retail and manufacturing operations, as explored in the blog Tackle Unseen Quality, Operations and Safety Challenges with Lakehouse enabled Computer Vision, cannot be overstated. That said, numerous technical challenges prevent organizations from realizing this potential. In this first introductory installment of our multi-part technical series on...
Are GPUs Really Expensive? Benchmarking GPUs for Inference on Databricks Clusters
It is no secret that GPUs are critical for artificial intelligence and deep learning applications, since their highly efficient architectures make them ideal for compute-intensive use cases. However, almost everyone who has used them is also aware that they tend to be expensive! In this article, we hope to show that while the per-hour...
Moneyball 2.0: Real-time Decision Making With MLB’s Statcast Data
The Oakland Athletics baseball team in 2002 used data analysis and quantitative modeling to identify undervalued players and create a competitive lineup on a limited budget. The book Moneyball, written by Michael Lewis, highlighted the A’s ‘02 season and gave an inside glimpse into how unique the team’s strategic data modeling was, for its time....
GPU-accelerated Sentiment Analysis Using Pytorch and Huggingface on Databricks
Sentiment analysis is commonly used to analyze the sentiment present within a body of text, which could be a review, an email or a tweet. Deep learning-based techniques are among the most popular ways to perform such an analysis. However, these techniques tend to be very computationally intensive and often require the use...
Introducing Apache Spark™ 3.2
We are excited to announce the availability of Apache Spark™ 3.2 on Databricks as part of Databricks Runtime 10.0. We want to thank the Apache Spark community for their valuable contributions to the Spark 3.2 release. The number of monthly Maven downloads of Spark has rapidly increased to 20 million. The year-over-year growth rate represents...
MLflow for Bayesian Experiment Tracking
This post is the third in a series on Bayesian inference ([1], [2]). Here we will illustrate how to use managed MLflow on Databricks to perform and track Bayesian experiments using the Python package PyMC3. This results in systematic and reproducible ML experimentation pipelines that can be shared across data science teams due to...
Pandas API on Upcoming Apache Spark™ 3.2
We’re thrilled to announce that the pandas API will be part of the upcoming Apache Spark™ 3.2 release. pandas is a powerful, flexible library and has grown rapidly to become one of the standard data science libraries. Now pandas users will be able to leverage the pandas API on their existing Spark clusters. A few...
Catalog and Discover Your Databricks Notebooks Faster
This is a collaborative post from Databricks and Elsevier. We thank Darin McBeath, Director Disruptive Technologies -- Elsevier, for his contributions. As a global leader in information and analytics, Elsevier helps researchers and healthcare professionals advance science and improve health outcomes for the benefit of society. It has supported the work of its research...
Managing Model Ensembles With MLflow
In machine learning, an ensemble is a collection of diverse models that provide more predictive power together than any single model would on its own. The outputs of multiple learning algorithms are combined through a process of averaging or voting, potentially resulting in a better prediction for a given set of inputs. However, there are...