Real-time Point-of-Sale Analytics With a Data Lakehouse
Disruptions in the supply chain – from reduced product supply to diminished warehouse capacity – coupled with rapidly shifting consumer expectations for seamless experiences have created a new normal for retail. In this blog, we'll address the need for real-time data in retail and how to overcome the challenges of real-time streaming of point-of-sale data at...
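For a flavor of what such a pipeline can look like, here is a minimal Structured Streaming sketch that lands point-of-sale events in a Delta table; the paths and schema are illustrative assumptions, not taken from the post.

```python
# A minimal Structured Streaming sketch: land point-of-sale events in Delta.
# The paths and schema below are illustrative, not from the post.
from pyspark.sql import SparkSession
from pyspark.sql.types import (StructType, StructField, StringType,
                               IntegerType, TimestampType)

spark = SparkSession.builder.appName("pos-streaming").getOrCreate()

pos_schema = StructType([
    StructField("store_id", StringType()),
    StructField("item_id", StringType()),
    StructField("quantity", IntegerType()),
    StructField("sale_ts", TimestampType()),
])

(spark.readStream
    .schema(pos_schema)
    .json("/mnt/pos/landing")                              # hypothetical landing zone
    .writeStream
    .format("delta")
    .option("checkpointLocation", "/mnt/pos/_chk/sales")   # required for recovery
    .start("/mnt/pos/silver/sales"))                       # hypothetical Delta target
```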
Announcing Databricks Autologging for Automated ML Experiment Tracking
Machine learning teams require the ability to reproduce and explain their results, whether for regulatory, debugging or other purposes. This means every production model must have a record of its lineage and performance characteristics. While some ML practitioners diligently version their source code, hyperparameters and performance metrics, others find it cumbersome or distracting from their rapid...
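As a sketch of what autologging takes off your plate, the snippet below enables MLflow autologging (the capability Databricks Autologging builds on) around a scikit-learn fit; the dataset and estimator are illustrative.

```python
# A sketch of autologging with MLflow: one call captures hyperparameters,
# metrics and the fitted model, with no explicit log_* statements.
import mlflow
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor

mlflow.autolog()  # Databricks Autologging builds on this MLflow capability

X, y = load_diabetes(return_X_y=True)
with mlflow.start_run():
    # fit() triggers logging of params (n_estimators, max_depth, ...) and metrics
    RandomForestRegressor(n_estimators=100, max_depth=6).fit(X, y)
```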
Improving On-Shelf Availability for Items With AI Out of Stock Modeling
This post was written in collaboration with Databricks partner Tredence. We thank Rich Williams, Vice President Data Engineering, and Morgan Seybert, Chief Business Officer, of Tredence for their contributions. Retailers are missing out on nearly $1 trillion in global sales because they don’t have on hand what customers want to buy in their stores. Adding...
Solution Accelerator: Multi-touch Attribution
Behind the growth of every consumer-facing product is the acquisition and retention of an engaged user base. When it comes to customer acquisition, the goal is to attract high-quality users as cost effectively as possible. With marketing dollars dispersed across a wide array of touchpoints (campaigns, channels, and creatives), measuring effectiveness is...
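To make the attribution idea concrete, here is a toy sketch of a linear (even-split) heuristic, one of the simpler multi-touch methods: each conversion's credit is divided equally across the touchpoints in that user's journey. The journeys and channel names are invented for illustration.

```python
# Toy sketch of linear (even-split) multi-touch attribution.
from collections import defaultdict

journeys = [
    ["social", "email", "search"],   # one converting user's ordered touches
    ["display", "search"],
]

def linear_attribution(journeys):
    credit = defaultdict(float)
    for touches in journeys:
        for channel in touches:
            credit[channel] += 1.0 / len(touches)  # split one conversion evenly
    return dict(credit)

print(linear_attribution(journeys))
# {'social': 0.33, 'email': 0.33, 'search': 0.83, 'display': 0.5} (approx.)
```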
How We Built Databricks on Google Kubernetes Engine (GKE)
Our release of Databricks on Google Cloud Platform (GCP) was a major milestone toward a unified data, analytics and AI platform that is truly multi-cloud. Databricks on GCP, a jointly developed service that allows you to store all of your data on a simple, open lakehouse platform, is based on standard containers running on top of...
An Experimentation Pipeline for Extracting Topics From Text Data Using PySpark
This post is part of a series on topic modeling. Topic modeling is the process of extracting topics from a set of text documents. This is useful for understanding or summarizing large collections of text documents. A document can be a line of text, a paragraph or a chapter in a book. The...
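A minimal sketch of such a pipeline, using PySpark's built-in Tokenizer, CountVectorizer, and LDA stages; the toy documents are illustrative.

```python
# A minimal PySpark topic-extraction pipeline: tokenize, vectorize, fit LDA.
from pyspark.sql import SparkSession
from pyspark.ml.feature import Tokenizer, CountVectorizer
from pyspark.ml.clustering import LDA

spark = SparkSession.builder.appName("topics").getOrCreate()
docs = spark.createDataFrame(
    [(0, "stream pos data to delta tables"),
     (1, "train and track ml models with mlflow")],
    ["id", "text"],
)

words = Tokenizer(inputCol="text", outputCol="words").transform(docs)
cv = CountVectorizer(inputCol="words", outputCol="features").fit(words)
featurized = cv.transform(words)

lda = LDA(k=2, maxIter=10).fit(featurized)      # extract k topics
lda.describeTopics(3).show()                    # top term indices per topic
```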
The Delta Between ML Today and Efficient ML Tomorrow
Delta Lake and MLflow both come up frequently in conversation but often as two entirely separate products. This blog will focus on the synergies between Delta Lake and MLflow for machine learning use cases and explain how you can leverage Delta Lake to deliver strong ML results based on solid data foundations. If you are...
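One concrete synergy, sketched below: pin training data to a Delta table version via time travel and record that version in MLflow, so the run's inputs stay reproducible. The path and version number are illustrative, and `spark` is assumed to be the ambient Databricks session.

```python
# Sketch: reproducible training inputs via Delta time travel + MLflow params.
import mlflow

data_path, data_version = "/mnt/delta/features", 12   # illustrative values

train_df = (spark.read.format("delta")
            .option("versionAsOf", data_version)      # Delta Lake time travel
            .load(data_path))

with mlflow.start_run():
    mlflow.log_param("data_path", data_path)
    mlflow.log_param("data_version", data_version)
    # ... train on train_df and log the model against this exact snapshot ...
```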
Monitoring ML Models With Model Assertions
This is a guest post from the Stanford University Computer Science Department. We thank Daniel Kang, Deepti Raghavan and Peter Bailis of Stanford University for their contributions. Machine learning (ML) models are increasingly used in a wide range of business applications. Organizations deploy hundreds of ML models to predict customer churn, optimal pricing, fraud...
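To illustrate the core idea (this is a toy sketch, not the Stanford library's API): a model assertion is a predicate over model outputs that flags likely errors, such as an object detector whose box count jumps implausibly between adjacent video frames.

```python
# Illustrative sketch of a model assertion over consecutive predictions.

def flag_violations(preds, assertion):
    """Indices i where (preds[i], preds[i+1]) violate the assertion."""
    return [i for i in range(len(preds) - 1)
            if not assertion(preds[i], preds[i + 1])]

def stable_box_count(prev, curr, tolerance=2):
    # A detector's box count shouldn't jump wildly between adjacent frames.
    return abs(prev["num_boxes"] - curr["num_boxes"]) <= tolerance

preds = [{"num_boxes": 3}, {"num_boxes": 3}, {"num_boxes": 9}]  # toy outputs
print(flag_violations(preds, stable_box_count))  # [1]: frames 1 -> 2 violate
```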
AML Solutions at Scale Using Databricks Lakehouse Platform
Anti-Money Laundering (AML) compliance has undoubtedly been one of the top agenda items for regulators providing oversight of financial institutions across the globe. As money laundering has evolved and become more sophisticated over the decades, so have the regulatory requirements designed to counter modern money laundering and terrorist financing schemes. The Bank Secrecy Act of 1970 provided...
Feature Engineering at Scale
Feature engineering is one of the most important and time-consuming steps of the machine learning process. Data scientists and analysts often find themselves spending a lot of time experimenting with different combinations of features to improve their models and to generate BI reports that drive business insights. The larger, more complex datasets with which data...
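As a small taste of what this looks like in PySpark, the sketch below derives a few customer-level features once and persists them for reuse by both model training and BI reporting. Table and column names are hypothetical, and `spark` is assumed to be the ambient Databricks session.

```python
# Sketch: derive reusable customer-level features and persist them to Delta.
from pyspark.sql import functions as F

orders = spark.table("sales.orders")   # hypothetical source table

features = (orders
    .groupBy("customer_id")
    .agg(F.count("*").alias("order_count"),
         F.avg("order_total").alias("avg_order_value"),
         F.max("order_ts").alias("last_order_ts")))

features.write.format("delta").mode("overwrite").saveAsTable("ml.customer_features")
```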