Scribd stores billions of records in Delta Lake, but writing or reading that data had been constrained to a single tech stack; all of that changed with the creation of delta-rs. Historically, using Delta Lake required applications to be implemented with, or accompanied by, Apache Spark. Many of our batch and streaming data processing applications are Spark-based, but that’s not everything that exists! In mid-2020 it became clear that Delta Lake would be a powerful tool in areas adjacent to the domain that Spark occupies. From my perspective, I figured that we would soon need to bring data into and out of Delta Lake in dozens of different ways. Some discussions and prototyping led to the creation of “delta-rs”, a Delta Lake client written in Rust that can be easily embedded in other languages such as Python, Ruby, NodeJS, and more.
As it turns out, the Delta Lake protocol is not that complicated. At an extremely high level, Delta Lake is a JSON-based transaction log coupled with Apache Parquet files stored on disk or in object storage. This means the core implementation of Delta in Rust is similarly quite simple. Take the following example from our integration tests, which “opens” a table, reads its transaction log, and provides a list of the Parquet files contained within:
let table = deltalake::open_table("./tests/data/delta-0.2.0")
    .await
    .unwrap();
assert_eq!(
    table.get_files(),
    vec![
        "part-00000-cb6b150b-30b8-4662-ad28-ff32ddab96d2-c000.snappy.parquet",
        "part-00000-7c2deba3-1994-4fb8-bc07-d46c948aa415-c000.snappy.parquet",
        "part-00001-c373a5bd-85f0-4758-815e-7eb62007a15c-c000.snappy.parquet",
    ]
);
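To make that description of the protocol concrete, here is a minimal Python sketch (not part of delta-rs, and assuming the same local test table as above) that walks the _delta_log directory with nothing but the standard library; each commit is a newline-delimited JSON file whose “add” actions reference the Parquet files that make up the table:

import json
from pathlib import Path

# Assumes the same local test table used in the Rust example above.
log_dir = Path('./tests/data/delta-0.2.0/_delta_log')
for commit in sorted(log_dir.glob('*.json')):
    print(f'commit {commit.name}')
    for line in commit.read_text().splitlines():
        action = json.loads(line)
        if 'add' in action:
            print(f"  add    {action['add']['path']}")
        elif 'remove' in action:
            print(f"  remove {action['remove']['path']}")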
Our primary motivation for delta-rs was to create something that could accommodate high-throughput writes to Delta Lake and be embedded in languages like Python and Ruby, so that users of those platforms could perform light queries and read operations.
The first notable writer-based application being co-developed with delta-rs is kafka-delta-ingest. The project aims to provide a highly efficient daemon for ingesting Kafka-originating data into Delta tables. In Scribd’s stack, it will effectively bridge JSON flowing through Apache Kafka topics into pre-defined Delta tables, translating each JSON message into a single row in the table.
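As a purely conceptual sketch of that mapping (the real daemon is written in Rust and also handles batching, checkpointing, and schema concerns), a single JSON message parses into a single set of column values; the message contents and field names below are hypothetical:

import json

# A hypothetical Kafka message value; kafka-delta-ingest receives bytes
# like these from a topic and appends one row per message.
message = b'{"user_id": 42, "event": "read_page", "timestamp": "2021-05-01T12:00:00Z"}'

# One JSON message ...
record = json.loads(message)

# ... becomes one row: the object's fields line up with the columns of a
# pre-defined Delta table.
print(list(record.keys()))    # column names
print(list(record.values()))  # a single row of values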
From the reader standpoint, the Python interface built on top of delta-rs, contributed largely by Florian Valeye, makes working with Delta Lake even simpler; for most architectures you only need to run pip install deltalake:
from deltalake import DeltaTable
from pprint import pprint

if __name__ == '__main__':
    # Load the Delta Table
    dt = DeltaTable('s3://delta/golden/data-reader-primitives')
    print(f'Table version: {dt.version()}')

    # List out all the files contained in the table
    for f in dt.files():
        print(f' - {f}')

    # Create a Pandas dataframe to execute queries against the table
    df = dt.to_pyarrow_table().to_pandas()
    pprint(df.query('as_int % 2 == 1'))
I cannot stress enough how much potential the above Python snippet has for machine learning and other Python-based applications at Scribd. For a number of internal applications, developers have been launching Spark clusters for the sole purpose of reading some data from Delta Lake in order to start their model training workloads in Python. With the maturation of the Python deltalake package, there is now a fast and easy way to load Delta Lake tables into basic Python applications.
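As a rough sketch of what that looks like in practice, assuming a hypothetical feature table with columns f1, f2, and label, the “read from Delta Lake and start training” step collapses into a few lines with no Spark cluster involved:

from deltalake import DeltaTable
from sklearn.linear_model import LogisticRegression

# Hypothetical table path and column names, purely for illustration.
dt = DeltaTable('s3://delta/ml/training-features')
df = dt.to_pyarrow_table().to_pandas()

# Hand the features straight to a Python training workload; no Spark
# cluster is needed just to read the data.
model = LogisticRegression()
model.fit(df[['f1', 'f2']], df['label'])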
From my perspective, it’s only the beginning with delta-rs. Delta Lake is a deceptively simple technology with tremendous potential across the data platform. I will be sharing more about delta-rs at Data and AI Summit on May 27th at 12:10 PDT. I hope you’ll join my session with your questions about delta-rs and where we’re taking it!