Earlier this summer I presented at Spark and AI Summit about some of the work we have been doing to build the Real-time Data Platform. At a high level, what I have branded the “Real-time Data Platform” is really: Apache Kafka, Apache Airflow, Structured Streaming with Apache Spark, and a smattering of microservices to help shuffle data around, all sitting on top of Delta Lake, which acts as an incredibly versatile and useful storage layer for the platform.

In the presentation, I outline how we tie together Kafka, Databricks, and Delta Lake.
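To give a flavor of what that tie-in looks like in practice, here is a minimal sketch of a Structured Streaming job that reads from Kafka and appends into a Delta table. The broker addresses, the `events` topic, the event schema, and the `/delta/...` paths are all placeholders for illustration, not the actual configuration from the talk.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

object KafkaToDelta {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("kafka-to-delta")
      .getOrCreate()

    // Hypothetical schema for the JSON payloads arriving on the topic.
    val eventSchema = new StructType()
      .add("user_id", LongType)
      .add("event_type", StringType)
      .add("occurred_at", TimestampType)

    // Read the raw Kafka stream; brokers and topic name are placeholders.
    val raw = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker-1:9092,broker-2:9092")
      .option("subscribe", "events")
      .option("startingOffsets", "latest")
      .load()

    // Kafka values arrive as bytes; parse them into typed columns.
    val events = raw
      .select(from_json(col("value").cast("string"), eventSchema).as("event"))
      .select("event.*")

    // Continuously append into a Delta table, with a checkpoint location so
    // progress is tracked reliably across restarts.
    val query = events.writeStream
      .format("delta")
      .outputMode("append")
      .option("checkpointLocation", "/delta/_checkpoints/events")
      .start("/delta/events")

    query.awaitTermination()
  }
}
```

The same pattern runs happily as a Databricks job, with Delta Lake providing the transactional storage layer underneath the stream.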

The presentation also complements some of our earlier blog posts on the topic.

I am incredibly proud of the work the Platform Engineering organization has done at Scribd to make real-time data a reality. I also cannot recommend Kafka + Spark + Delta Lake highly enough for those with similar requirements.