Implementing More Effective FAIR Scientific Data Management With a Lakehouse
Data powers scientific discovery and innovation. But data is only as good as its data management strategy, the key factor in ensuring data quality, accessibility, and reproducibility of results – all requirements of reliable scientific evidence. As large datasets have become more and more important and accessible to scientists across disciplines, the problems of big...
Security Best Practices for AWS on Databricks
The Databricks Lakehouse Platform is the world’s first lakehouse architecture -- an open, unified platform to enable all of your analytics workloads. A lakehouse enables true cross-functional collaboration across data teams of data engineers, data scientists, ML engineers, analysts and more. In this article, we will share a list of cloud security features and capabilities...
Custom DNS With AWS PrivateLink for Databricks Workspaces
This post was written in collaboration with Amazon Web Services (AWS). We thank co-authors Ranjit Kalidasan, senior solutions architect, and Pratik Mankad, partner solutions architect, of AWS for their contributions. Last week, we were excited to announce the release of AWS PrivateLink for Databricks Workspaces, now in public preview, which enables new patterns and...
Allow Simple Cluster Creation with Full Admin Control Using Cluster Policies
What is a Databricks cluster policy? A Databricks cluster policy is a template that restricts the way users interact with cluster configuration. Today, any user with cluster creation permissions is able to launch an Apache Spark™ cluster with any configuration. This leads to a few issues: Administrators are forced to choose between control and flexibility....
Data Quality Monitoring on Streaming Data Using Spark Streaming and Delta Lake
Try this notebook to reproduce the steps outlined below. In the era of accelerating everything, streaming data is no longer an outlier; instead, it is becoming the norm. We no longer hear customers ask, "can I stream this data?" so much as "how fast can I stream this data?", and the pervasiveness of technologies...