Earlier this summer I was able to present at Spark + AI Summit about some of the work we have been doing to build the Real-time Data Platform. At a high level, what I have branded the "Real-time Data Platform" is really: Apache Kafka, Apache Airflow, Structured Streaming with Apache Spark, and a smattering of microservices to help shuffle data around, all sitting on top of Delta Lake, which acts as an incredibly versatile storage layer for the platform.
In the presentation I outline how we tie together Kafka, Databricks, and Delta Lake.
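To make that flow a little more concrete, here is a minimal sketch of the kind of pipeline involved: a Structured Streaming job that reads from a Kafka topic and continuously appends into a Delta Lake table. The broker address, topic name, and storage paths below are illustrative placeholders, not our actual configuration.

```scala
import org.apache.spark.sql.SparkSession

object KafkaToDelta {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("kafka-to-delta")
      .getOrCreate()

    // Read a stream of records from a Kafka topic.
    val events = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "kafka:9092") // placeholder broker
      .option("subscribe", "events")                   // placeholder topic
      .load()
      .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)", "timestamp")

    // Append the stream into a Delta Lake table, with a checkpoint
    // so the job can restart without reprocessing or losing data.
    events.writeStream
      .format("delta")
      .option("checkpointLocation", "/delta/_checkpoints/events")
      .outputMode("append")
      .start("/delta/events")
      .awaitTermination()
  }
}
```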
The presentation also complements some of our blog posts:
- Streaming data in and out of Delta Lake
- Streaming development work with Kafka
- Ingesting production logs with Rust
- Migrating Kafka to the cloud
I am incredibly proud of the work the Platform Engineering organization has done at Scribd to make real-time data a reality. I also cannot recommend Kafka + Spark + Delta Lake highly enough for those with similar requirements.