Building a Geospatial Lakehouse, Part 1
An open secret of geospatial data is that it contains priceless information on behavior, mobility, business activities, natural resources, points of interest and more. Geospatial data can turn into critically valuable insights and create significant competitive advantages for any organization. Look no further than Google, Amazon, Facebook to see the necessity for adding a dimension...
Ray on Databricks
Ray is an open-source project first developed at RISELab that makes it simple to scale any compute-intensive Python workload. With a rich set of libraries and integrations built on a flexible distributed execution framework, Ray brings new use cases and simplifies the development of custom distributed Python functions that would normally be complicated to create....
10 Powerful Features to Simplify Semi-structured Data Management in the Databricks Lakehouse
Hassle Free Data IngestionDiscover how Databricks simplifies semi-structured data ingestion into Delta Lake with detailed use cases, a demo, and live Q&A. WATCH NOW Ingesting and querying JSON with semi-structured data can be tedious and time-consuming, but Auto Loader and Delta Lake make it easy. JSON data is very flexible, which makes it powerful, but...
GPU-accelerated Sentiment Analysis Using Pytorch and Huggingface on Databricks
Sentiment analysis is commonly used to analyze the sentiment present within a body of text, which could range from a review, an email or a tweet. Deep learning-based techniques are one of the most popular ways to perform such an analysis. However, these techniques tend to be very computationally intensive and often require the use...
Introducing Apache Spark™ 3.2
We are excited to announce the availability of Apache Spark™ 3.2 on Databricks as part of Databricks Runtime 10.0. We want to thank the Apache Spark community for their valuable contributions to the Spark 3.2 release. The number of monthly maven downloads of Spark has rapidly increased to 20 million. The year-over-year growth rate represents...
Creating an IP Lookup Table of Activities in a SIEM Architecture
When working with cyber security data, one thing is for sure: there is no shortage of available data sources. If anything, there are too many data sources with overlapping data. Your traditional SIEM (security information and event management) tool is not really fit to handle the complex transformation and stitching of multiple data sources in...
Developing Databricks’ Runbot CI Solution
Runbot is a bespoke continuous integration (CI) solution developed specifically for Databricks' needs. Originally developed in 2019, Runbot incrementally replaces our aging Jenkins infrastructure with something more performant, scalable, and user friendly for both users and maintainers of the service. This blog post will explore the motivations behind developing Runbot, the core design decisions that...
How YipitData Extracts Insights From Alternative Data Using Delta Lake
This is a guest post from YipitData. We thank Anup Segu, Data Engineering Tech Lead, and Bobby Muldoon: Director of Data Engineering, at YipitData for their contributions. Choosing the right storage format for any data lake is an important responsibility for data administrators. Tradeoffs between storage costs, performance, migration cost, and compatibility are top...
Real-time Point-of-Sale Analytics With a Data Lakehouse
Disruptions in the supply chain – from reduced product supply and diminished warehouse capacity – coupled with rapidly shifting consumer expectations for seamless consumer demands in the new normal. In this blog, we'll address the need for real-time data in retail, and how to overcome the challenges of moving real-time streaming of point-of-sale data at...
How Incremental ETL Makes Life Simpler With Data Lakes
Incremental ETL (Extract, Transform and Load) in a conventional data warehouse has become commonplace with CDC (change data capture) sources, but scale, cost, accounting for state and the lack of machine learning access make it less than ideal. In contrast, incremental ETL in a data lake hasn’t been possible due to factors such as the...