As the data space has matured, data engineering has emerged as a separate but related role that works in concert with data scientists.
Ian Buss, principal solutions architect at Cloudera, notes that data scientists focus on finding new insights from a data set, while data engineers are concerned with the production readiness of that data and all that comes with it: formats, scaling, resilience, security, and more.
Ryan Blue, a senior software engineer at Netflix and a member of the company’s data platform team, says roles on data teams are becoming more specific because certain functions require unique skill sets. “For a long time, data scientists included cleaning up the data as part of their work,” Blue says. “Once you try to scale up an organization, the person who is building the algorithm is not the person who should be cleaning the data or building the tools. In a modern big data system, someone needs to understand how to lay that data out for the data scientists to take advantage of it.”
What do data engineers focus on?
Data engineers primarily focus on the following areas.
Build and maintain the organization’s data pipeline systems
Data pipelines encompass the journey and processes that data undergoes within a company. Data engineers are responsible for creating those pipelines. Jesse Anderson explains how data engineers and pipelines intersect in his article “Data engineers vs. data scientists”:
Creating a data pipeline may sound easy or trivial, but at big data scale, this means bringing together 10-30 different big data technologies. More importantly, a data engineer is the one who understands and chooses the right tools for the job. A data engineer is the one who understands the various technologies and frameworks in-depth, and how to combine them to create solutions to enable a company’s business processes with data pipelines.
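To make the idea of a pipeline concrete, here is a minimal sketch in Python that chains three stages: ingestion, processing, and storage. It is only an illustration under stated assumptions. The event format, the stage names (`extract`, `transform`, `load`), and the in-memory `Counter` used as a stand-in for a real store are all hypothetical, and a production pipeline would swap each stage for one of the dedicated technologies Anderson describes.

```python
import json
from collections import Counter

# Hypothetical raw events, standing in for an ingestion source such as a log file or queue.
RAW_EVENTS = [
    '{"user": "a", "action": "click"}',
    '{"user": "b", "action": "view"}',
    'not json',
    '{"user": "a", "action": "click"}',
]


def extract(lines):
    """Ingestion stage: yield raw records one at a time."""
    for line in lines:
        yield line


def transform(records):
    """Processing stage: parse JSON and skip records that fail to parse."""
    for record in records:
        try:
            yield json.loads(record)
        except json.JSONDecodeError:
            continue


def load(events):
    """Storage stage: count (user, action) pairs in memory (a stand-in for a real store)."""
    counts = Counter()
    for event in events:
        counts[(event["user"], event["action"])] += 1
    return counts


def run_pipeline(lines):
    """Compose the stages; generators keep only one record in flight at a time."""
    return load(transform(extract(lines)))


counts = run_pipeline(RAW_EVENTS)
```

Because each stage consumes only an iterable and produces another, a stage can be replaced or tested on its own without touching the others.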
Clean and wrangle data into a usable state
Data engineers make sure the data the organization is using is clean, reliable, and prepped for whatever use cases may present themselves. They wrangle data into a state that data scientists can then run queries against.
What does wrangling involve? Data Wrangling with Python authors Katharine Jarmul and Jacqueline Kazil explain the process in their book:
Data wrangling is about taking a messy or unrefined source of data and turning it into something useful. You begin by seeking out raw data sources and determining their value: How good are they as data sets? How relevant are they to your goal? Is there a better source? Once you’ve parsed and cleaned the data so that the data sets are usable, you can utilize tools and methods (like Python scripts) to help you analyze them and present your findings in a report. This allows you to take data no one would bother looking at and make it both clear and actionable.
Data wrangling is a significant problem when working with big data, especially if you haven’t been trained to do it or you don’t have the right tools to clean and validate data effectively and efficiently, says Blue. A good data engineer can anticipate the questions a data scientist is trying to answer and make their life easier by creating a usable data product, Blue adds.
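As a small illustration of what cleaning and validating can look like in a Python script, the sketch below normalizes a messy CSV export using only the standard library. The column names, the sample rows, and the rules (reject rows with an unparseable ID or date, label a missing country as "UNKNOWN", keep a missing spend as `None`) are assumptions made for this example, not rules taken from the book or from Blue.

```python
import csv
import io
from datetime import datetime

# Hypothetical raw export with stray whitespace, mixed casing, and missing values.
RAW = """user_id,country,signup_date,spend
1, us ,2017-01-05,10.50
2,US,2017-01-06,
3,,2017-01-07,7
x,DE,not-a-date,3.25
"""


def clean_row(row):
    """Return a cleaned dict, or None if the row fails validation."""
    try:
        user_id = int(row["user_id"])
        signup = datetime.strptime(row["signup_date"].strip(), "%Y-%m-%d").date()
    except ValueError:
        return None  # unparseable ID or date: reject the row
    country = row["country"].strip().upper() or "UNKNOWN"
    spend_text = row["spend"].strip()
    spend = float(spend_text) if spend_text else None  # keep missing values explicit
    return {"user_id": user_id, "country": country, "signup_date": signup, "spend": spend}


def wrangle(text):
    """Clean every row and report how many were rejected."""
    cleaned, rejected = [], 0
    for row in csv.DictReader(io.StringIO(text)):
        result = clean_row(row)
        if result is None:
            rejected += 1
        else:
            cleaned.append(result)
    return cleaned, rejected


cleaned, rejected = wrangle(RAW)
```

Counting rejected rows rather than silently dropping them is a deliberate choice here: it gives whoever consumes the data a signal about how trustworthy the source is.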
What skills do data engineers need?
Those “10-30 different big data technologies” Anderson references in “Data engineers vs. data scientists” can fall under numerous areas, such as file formats, ingestion engines, stream processing, batch processing, batch SQL, data storage, cluster management, transaction databases, web frameworks, data visualizations, and machine learning. And that’s just the tip of the iceberg.
Buss says data engineers should have the following skills and knowledge:
- They need to know Linux and they should be comfortable using the command line.
- They should have experience programming in at least Python or Scala/Java.
- They need to know SQL.
- They need some understanding of distributed systems in general and how they are different from traditional storage and processing systems.
- They need a deep understanding of the ecosystem, including ingestion (e.g., Kafka, Kinesis), processing frameworks (e.g., Spark, Flink), and storage engines (e.g., S3, HDFS, HBase, Kudu). They should know the strengths and weaknesses of each tool and what it’s best used for.
- They need to know how to access and process data.
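To give a sense of how the Python and SQL skills in this list combine when accessing and processing data, here is a minimal sketch that uses Python's built-in `sqlite3` module. The `events` table, its columns, and the sample rows are hypothetical, and an in-memory SQLite database stands in for whatever storage system an organization actually runs.

```python
import sqlite3

# In-memory database as a stand-in for a real data store.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, country TEXT, spend REAL)")

# Parameterized inserts avoid building SQL strings by hand.
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [(1, "US", 10.5), (2, "US", 4.5), (3, "DE", 7.0)],
)

# Aggregate per country: how many users and how much total spend.
rows = conn.execute(
    """
    SELECT country, COUNT(*) AS users, SUM(spend) AS total
    FROM events
    GROUP BY country
    ORDER BY country
    """
).fetchall()

conn.close()
```

The same query pattern, grouping and aggregating, carries over to distributed SQL engines, although the performance characteristics differ considerably at scale.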
A holistic understanding of data is also important. “We need [data engineers] to know how the entire big data operation works and want [them] to look for ways to make it better,” says Blue. Sometimes, he adds, that can mean thinking and acting like an engineer and sometimes that can mean thinking more like a traditional product manager.
Right people in the right roles
Data engineering and data science are different jobs, and they require employees with unique skills and experience to fill those roles. By understanding this distinction, companies can ensure they get the most out of their big data efforts.
Anderson explains why the division of work is important in “Data engineers vs. data scientists”:
I’ve seen companies task their data scientists with things you’d have a data engineer do. The data scientists were running at 20-30% efficiency. The data scientist doesn’t know things that a data engineer knows off the top of their head. Creating a data pipeline isn’t an easy task—it takes advanced programming skills, big data framework understanding, and systems creation. These aren’t skills that an average data scientist has. A data scientist can acquire these skills; however, the return on investment (ROI) on this time spent will rarely pay off. Don’t misunderstand me: a data scientist does need programming and big data skills, just not at the levels that a data engineer needs them.
There is also the issue of data scientists being relative amateurs in this data pipeline creation. A data scientist will make mistakes and wrong choices that a data engineer would (should) not. A data scientist often doesn’t know or understand the right tool for a job. Everything will get collapsed to using a single tool (usually the wrong one) for every task. The reality is that many different tools are needed for different jobs. A qualified data engineer will know these, and data scientists will often not know them.
Ready to dive deeper into data engineering? Check out these recommended resources from O’Reilly’s editors.
Data engineers vs. data scientists — Jesse Anderson explains why data engineers and data scientists are not interchangeable.
Data Wrangling with Python — Katharine Jarmul and Jacqueline Kazil’s hands-on guide covers how to acquire, clean, analyze, and present data efficiently.
Expert Data Wrangling with R — Garrett Grolemund shows you how to streamline your code—and your thinking—by introducing a set of principles and R packages that make data wrangling faster and easier.
Building Data Pipelines with Python — Katharine Jarmul explains how to build data pipelines and automate workflows.