Databricks runs on AWS and integrates with the major AWS services you already use, such as S3, EC2, Redshift, and more. In this demo, we’ll show you how Databricks integrates with each of these services simply and seamlessly to enable you to build a lakehouse architecture.
The Databricks Lakehouse platform sits at the heart of the AWS ecosystem and integrates easily with popular data and AI services like Kinesis streams, S3 buckets, Glue, Athena, Redshift, QuickSight, and much more. In this demo, we’ll show you how Databricks integrates with each of these services in a simple, seamless way.
When we start up a Spark cluster on Databricks, we can configure it to use the Glue Catalog, and also attach an IAM instance profile that controls the cluster’s access to S3 buckets and other AWS resources.
One of the first things we do when working with Databricks on AWS is set up a Spark cluster in our Virtual Private Cloud, which can autoscale up and down to control cloud costs as data workloads change. Databricks Spark clusters use EC2 instances on the backend, and you can configure them to use the AWS Glue data catalog. You can also set up AWS instance profiles on your cluster to control and manage access to S3 buckets and other resources.
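As a rough illustration of the cluster setup just described, here is a minimal sketch that builds a cluster definition in the shape accepted by the Databricks Clusters API. The cluster name, runtime version, instance type, and instance profile ARN are placeholder assumptions, not values from the demo.

```python
import json


def build_cluster_spec(instance_profile_arn, min_workers=2, max_workers=8):
    """Build a cluster definition with autoscaling, Glue, and an instance profile."""
    if min_workers > max_workers:
        raise ValueError("min_workers must not exceed max_workers")
    return {
        "cluster_name": "lakehouse-demo",      # hypothetical name
        "spark_version": "13.3.x-scala2.12",   # placeholder runtime version
        "node_type_id": "i3.xlarge",           # placeholder EC2 instance type
        # Let Databricks add and remove workers as the workload changes.
        "autoscale": {"min_workers": min_workers, "max_workers": max_workers},
        # The instance profile governs what the cluster can reach in AWS (e.g. S3).
        "aws_attributes": {"instance_profile_arn": instance_profile_arn},
        # Use the AWS Glue data catalog as the cluster's metastore.
        "spark_conf": {
            "spark.databricks.hive.metastore.glueCatalog.enabled": "true"
        },
    }


# Placeholder ARN for illustration only.
spec = build_cluster_spec(
    "arn:aws:iam::123456789012:instance-profile/demo-profile"
)
print(json.dumps(spec, indent=2))
```

The resulting JSON could be submitted through the cluster creation UI's JSON editor or the REST API; the exact fields your workspace needs may differ.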
Now that our autoscaling Spark cluster is up and running, let’s start by ingesting real-time data from a Kinesis stream in just a few lines of code, using Spark Structured Streaming and the built-in Databricks–Kinesis connector. First we’ll view some of the raw data from our streaming DataFrame. Next, we can save it in Delta Lake format to a Delta Lake Bronze table stored in S3, using the code you see here. Delta Lake is the foundation of a lakehouse architecture, providing ACID transactions on cloud object storage, as well as tables that unify batch and streaming data processing to simplify your data architecture.
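The demo notebook's code is not reproduced in this transcript, so the following is only a sketch of what the ingestion step might look like in PySpark on a Databricks cluster. The stream name, region, and S3 paths are hypothetical, and `spark` is assumed to be the session that Databricks notebooks provide.

```python
def kinesis_options(stream_name, region, initial_position="latest"):
    """Build reader options for the Databricks Kinesis source."""
    allowed = {"latest", "trim_horizon", "earliest"}
    if initial_position not in allowed:
        raise ValueError(f"initial_position must be one of {sorted(allowed)}")
    return {
        "streamName": stream_name,
        "region": region,
        "initialPosition": initial_position,
    }


def read_kinesis_stream(spark, options):
    """Return a streaming DataFrame with the Kinesis payload decoded to text."""
    raw = spark.readStream.format("kinesis").options(**options).load()
    # The Kinesis source delivers the payload as binary in the `data` column.
    return raw.selectExpr(
        "CAST(data AS STRING) AS body",
        "approximateArrivalTimestamp",
    )


def write_bronze(stream_df, table_path, checkpoint_path):
    """Append the stream to a Delta Lake Bronze table stored at table_path."""
    return (
        stream_df.writeStream.format("delta")
        .outputMode("append")
        .option("checkpointLocation", checkpoint_path)
        .start(table_path)
    )


# Hypothetical stream name and region.
opts = kinesis_options("demo-stream", "us-west-2")
```

In a notebook, usage would be along the lines of `write_bronze(read_kinesis_stream(spark, opts), "s3://my-bucket/bronze/events", "s3://my-bucket/checkpoints/events")`, with both paths replaced by your own.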
We can view the table we just created from within Databricks by running the SHOW TABLES command in a notebook, or by clicking the Data tab and navigating to the database where the tables are stored. Since we set up our cluster to integrate with the AWS Glue data catalog, we can also view these Delta Lake tables directly in the Glue Console. When we search for them, you can see that all of the tables we viewed in Databricks are now present in Glue.
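For reference, listing the tables from a notebook can be sketched as below. The database name is a hypothetical placeholder, and `spark` is assumed to be the notebook's Spark session.

```python
import re

# Accept only simple identifiers so the name is safe to place in a SQL string.
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def show_tables_sql(database):
    """Return the SHOW TABLES statement for a database."""
    if not _IDENTIFIER.match(database):
        raise ValueError(f"unexpected database name: {database!r}")
    return f"SHOW TABLES IN {database}"


def list_tables(spark, database):
    """Run SHOW TABLES and return just the table names."""
    rows = spark.sql(show_tables_sql(database)).collect()
    return [row["tableName"] for row in rows]


# Hypothetical database name.
statement = show_tables_sql("demo_db")
print(statement)
```

With the Glue data catalog enabled on the cluster, the names returned by `list_tables(spark, "demo_db")` should match what appears in the Glue Console.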
Databricks also makes it easy to work with data stored in your Redshift data warehouse. Here, we’re writing some sample data to Redshift using the built-in Databricks Redshift connector. We can also read from Redshift using the same connector. Alternatively, you can use a Postgres connector or the Redshift Data API from Databricks to do the same. Or we can jump into the Redshift console and query the table we just created from Databricks.
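The write and read steps with the Redshift connector might look roughly like the sketch below. The JDBC URL, table name, S3 staging directory, and IAM role are placeholders, and the short `redshift` format name is an assumption that holds on recent Databricks runtimes; older runtimes may need `com.databricks.spark.redshift` instead.

```python
def redshift_options(jdbc_url, table, temp_dir, iam_role_arn):
    """Build options for the Databricks Redshift connector."""
    # The connector stages data in S3 on its way to and from Redshift.
    if not temp_dir.startswith(("s3://", "s3a://", "s3n://")):
        raise ValueError("temp_dir must be an S3 location")
    return {
        "url": jdbc_url,
        "dbtable": table,
        "tempdir": temp_dir,
        "aws_iam_role": iam_role_arn,
    }


def write_to_redshift(df, options, mode="append"):
    """Write a DataFrame to the Redshift table named in options."""
    df.write.format("redshift").options(**options).mode(mode).save()


def read_from_redshift(spark, options):
    """Read the Redshift table named in options into a DataFrame."""
    return spark.read.format("redshift").options(**options).load()


# All values below are placeholders for illustration.
rs_opts = redshift_options(
    "jdbc:redshift://example-host:5439/dev?user=demo&password=secret",
    "public.sample_data",
    "s3a://my-bucket/redshift-temp/",
    "arn:aws:iam::123456789012:role/demo-redshift-role",
)
```

In practice, credentials would come from a secret store rather than being embedded in the URL.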
Finally, we can also connect Databricks with Amazon QuickSight to explore our data visually and create attractive dashboards and reports.
As we’ve seen, Databricks provides a simple, open, and collaborative Lakehouse platform that deeply integrates with all of your AWS services. Download the notebooks used in this demo on the Databricks Demo Hub, by clicking the link in the description below. Or visit databricks.com/try to get started with a free trial of Databricks on AWS today.