As data sets grow and the needs of the business change, ingesting, transforming, and combining data becomes an area of focus unto itself. Data is cheap; understanding it is expensive. Data Engineering helps build that understanding.
The Data Engineering team at Scribd delivers text, analytics, and behavioral datasets almost “as a product” for internal customers, helping them build upon and understand the datasets that enable business and product operations. Our customers include the Product team, Business Analytics, Finance, and effectively the entire Engineering organization.
The history of growth for Scribd has been incremental and organic. For our data, this has meant steady growth, which can be deceptive. Organizations with rapid growth quickly see where their data pipelines break down; more organic growth allows pipelines to continue happily operating until they reach a tipping point, where the amount of data exceeds the capacity or design of the pipeline.
To me the exciting things about Data Engineering are the size of the datasets and the incredible potential our work has to impact the future of Scribd.
In order to do that, we have to change and grow.
Scaling Data Engineering
At the beginning of 2019 we did not have a Data Engineering team; the need was not yet understood. When we started talking about the ambitions for the Data Science teams, the plans for the Business Analytics team, and what product initiatives we wanted to accomplish in 2020, the need became abundantly clear: we need Data Engineering to build the foundation for data usage across the company.
We have already hired some talented Data Engineers, but we need more talented people to help bring us into the cloud and enable new ideas we haven’t even had yet! Scaling a team is one thing, but that is not the only way in which we need to scale.
Scaling our Approach
We are currently moving out of an on-premise managed data center into AWS. There’s plenty of excitement in the teams that build Scribd.com and the services behind it. For the Data Science, Machine Learning, and Data Engineering teams at Scribd, AWS represents incredible potential. We have long been limited in the questions we can ask of our data by the fixed footprint of data center-based infrastructure. As our datasets move into S3 and our compute workloads move into Databricks, we’re already starting to identify new and interesting ways to ingest and examine the data in order to make Scribd more useful for our readers.
A cloud-native data platform can simplify much of our work, but during the “zero-downtime” migration we are facing a number of interesting challenges:
- Running workloads between multiple data centers, while sharing data between them, requires more sophistication in how Data Engineering manages our catalog, which itself will soon grow into multiple catalogs.
- Wasteful workloads are less noticeable in an on-premise environment. A job which inadvertently generates numerous small files, or a Spark job which performs excessive shuffle reads, becomes more problematic in a usage-based pricing model. Time starts to matter more (see the compaction sketch after this list).
- Technology skew between on-premise and the cloud forces a little more forethought. We can run the same versions of Spark in both places but our on-premise and cloud-based vendors are necessarily different. Using both temporarily increases our management complexity.
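To make the small-files point concrete, here is a minimal sketch of the standard read-repartition-rewrite compaction pattern in PySpark. The bucket paths and target file count are hypothetical illustrations, not our actual jobs; the point is that object stores and usage-based pricing reward fewer, larger files.

```python
# Minimal small-file compaction sketch (hypothetical paths and numbers).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("compact-small-files").getOrCreate()

# Read a table that has accumulated thousands of tiny files.
df = spark.read.parquet("s3://example-bucket/warehouse/events/")

# Rewrite into a bounded number of larger files. Fewer, bigger objects
# mean fewer S3 requests and less task-scheduling overhead for every
# downstream Spark job that reads this table.
(df.repartition(64)
   .write
   .mode("overwrite")
   .parquet("s3://example-bucket/warehouse/events_compacted/"))
```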
The size of our datasets makes this project much more fun to work on. Migrating a data warehouse of a few hundred gigabytes without any downtime to its customers wouldn’t be that hard. Doing the same migration with multiple petabytes of data across thousands of discrete tables, processed by hundreds of automated jobs is a very different ballgame.
Moving into a cloud-native environment also enables a number of new approaches and opportunities which by themselves will help us scale our data engineering practices. A unified data platform for business analytics, data science, and machine learning in the cloud can take advantage of completely different instance types for more optimized workloads. Easily spinning up and tearing down environments for exploration of new and different tools enables teams to use the tool that fits, as opposed to the tool that everybody else uses. This shift, coupled with the broader changes to our architecture as we move to the cloud, will also open the door to more stream processing of data rather than massive periodic batches.
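To illustrate the batch-versus-stream contrast, here is a minimal Spark Structured Streaming sketch. The source path, schema, and windowing are assumptions for illustration, not our actual pipelines; the idea is that the same DataFrame logic runs continuously as new data arrives instead of reprocessing everything on a nightly schedule.

```python
# Minimal Structured Streaming sketch (hypothetical paths and schema).
from pyspark.sql import SparkSession
from pyspark.sql.functions import window, col

spark = SparkSession.builder.appName("streaming-sketch").getOrCreate()

# Incrementally pick up new JSON files as they land, rather than
# re-reading the entire dataset in a periodic batch job.
events = (spark.readStream
          .schema("user_id STRING, doc_id STRING, ts TIMESTAMP")
          .json("s3://example-bucket/raw/reading-events/"))

# Aggregate reading activity in 15-minute windows as data flows in.
activity = (events
            .groupBy(window(col("ts"), "15 minutes"), col("doc_id"))
            .count())

# Write results to an in-memory table for interactive inspection;
# a real pipeline would target a durable sink with checkpointing.
query = (activity.writeStream
         .outputMode("complete")
         .format("memory")
         .queryName("reading_activity")
         .start())
```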
“The cloud” is not without its shortfalls, but across Data Engineering we’re already seeing tremendous improvements from our cloud-centric approach.
Scaling our Quality
Supporting numerous parts of the business means that Data Engineering has to increase the quality of the datasets our customers consume. Some datasets have grown in utility over time: where they were once “nice to have,” they are now critical to certain business functions. This mandates that the pipelines which produce those datasets be treated just like the production software deployed to scribd.com.
I often think of the quote by Charles Babbage:
On two occasions I have been asked,
‘Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?’
I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question.
Data quality is a concern that anybody in the Data Engineering space is familiar with. For Scribd, I think of “quality” along two axes:
- Integrity: is each record within this dataset formed the way the customer expects, and does it adhere to a predefined schema? (A sketch of this kind of check follows this list.)
- Lineage: is the pipeline that produces this dataset clear, monitored, and functioning properly, so that downstream jobs continue to receive good inputs? Lineage also means knowing when a pipeline contains personally-identifiable or other sensitive information which must be handled with extra care to safeguard our readers’ privacy.
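As a sketch of what an integrity check can look like, the PySpark example below validates raw records against a predefined schema and quarantines anything malformed. The schema, paths, and 1% error budget are hypothetical; the pattern, not the specifics, is the point.

```python
# Minimal schema-integrity check sketch (hypothetical schema and paths).
from pyspark.sql import SparkSession
from pyspark.sql.types import (StructType, StructField, StringType,
                               TimestampType)

spark = SparkSession.builder.appName("integrity-check").getOrCreate()

# The contract a customer expects each record to satisfy.
expected = StructType([
    StructField("user_id", StringType(), nullable=False),
    StructField("doc_id", StringType(), nullable=False),
    StructField("ts", TimestampType(), nullable=False),
    # Spark routes any record that fails to parse into this column.
    StructField("_corrupt", StringType(), nullable=True),
])

df = (spark.read
      .schema(expected)
      .option("mode", "PERMISSIVE")
      .option("columnNameOfCorruptRecord", "_corrupt")
      .json("s3://example-bucket/raw/reading-events/"))

# Spark disallows queries that reference only the corrupt-record column
# on raw files, so materialize the parse results first.
df.cache()

bad = df.filter(df["_corrupt"].isNotNull())
good = df.filter(df["_corrupt"].isNull()).drop("_corrupt")

# Quarantine malformed rows and fail loudly past a small error budget.
bad_count, total = bad.count(), df.count()
if total > 0 and bad_count / total > 0.01:
    raise ValueError(f"{bad_count}/{total} records failed schema validation")
```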
Unfortunately, data quality is an area where I think we need to make substantial improvements. Data was at one time treated as a by-product of production systems. Now it is rightfully recognized as business-critical, and our practices must rise to meet the challenge.
AWS does not offer us any silver bullet to help scale our quality. Fortunately, Scribd leadership recognizes the importance of both data and Data Engineering, so I’m confident we will be able to finish 2020 in a much better position.
When Scribd first started, “big data” was just coming into vogue. As the tools and practices available for working with data have changed, so too has Scribd. Our datasets are larger than they may appear from the outside: analytics from billions of requests each year, combined with hundreds of millions of text documents, are challenging to manage. These hefty datasets are also a challenge to make available, insightful, and of high quality. Data by itself tells us nothing, but well-managed data pipelines that allow us to identify characteristics of text documents, or content which is interesting to read, are incredibly valuable to Scribd. Data Engineering helps us understand our data, which helps Scribd build products which deliver great reads to the world.