Chapter 1. Introducing data engineering design patterns
Design pattern is a well-established term in the software engineering space, but it has only recently begun gaining traction in the data engineering world. Consequently, I owe you a few words of introduction and explanation of what design patterns are in the context of data engineering.
What are design patterns?
You may be surprised how many times you rely on patterns in daily life. Let’s take the everyday example of cooking and one of my favorite desserts, flan; if you like creamy desserts and haven’t tried flan yet, I highly recommend it! When you want to prepare a flan, you need to get all the ingredients and follow a list of preparation steps. As an outcome, you get a tasty dessert.
Why am I giving this cooking example as the introduction to a technical book about design patterns? Preparing a recipe is a great representation of what a design pattern should be: a predefined and customizable template that solves a problem. How does the flan example fit this definition?
- Ingredients and the list of preparation steps are the predefined template. They give you instructions but remain customizable, as you might decide to use brown sugar instead of white, for example.
- There can be one or many uses. The flan can be a dessert you share with family at teatime, or a product you sell to make your living. This is the contextualization of a design pattern. Patterns always respond to a specific problem, which in this example is either the pleasure of sharing or business.
- You can decide to prepare this delicious dessert once or many times, if it happens to become your new favorite. For each new preparation you won’t reinvent the wheel; chances are you’ll rely on the same successful recipe you tried before. That’s the reusability of the pattern.
- But you must also be aware that preparing and eating flan has implications for your life and health. If you prepare it every day, you may have less time for sports and, as a result, some health issues in the long run. These are the consequences of a pattern.
- Finally, the recipe saves you time because it has been tested by many other people before you. Additionally, it introduces a common vocabulary that will make your life easier when discussing with other people. Finding a recipe for flan is easier than for a caramel custard, even though you’ll find results for both terms.
Now, how does all this relate to data engineering? Again, let’s use an example. You need to process a semi-structured dataset in a continuously running job. From time to time you might encounter a record with a completely invalid format that will throw an exception and stop your job. But you don’t want your whole job to fail because of a single malformed record. This is our contextualization.
To solve this processing issue, you’ll apply a set of best practices to your data processing logic, such as wrapping the risky transformation with a try-catch block to capture bad records and write them to another destination for analysis. That’s the predefined template. These are rules you can adapt to your specific use case. For example, you could decide not to send the bad records to another database and instead simply count their occurrences.
It turns out the example of handling erroneous records without breaking the pipeline has a specific name: Dead-Lettering. Now, if you encounter the same problem again but in a slightly different context, maybe while working on an ELT pipeline and performing the transformations directly in a data warehouse, you can apply the same logic. That’s the reusability of the pattern. Dead-Letter is one of the Error management patterns detailed in Chapter 3.
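To make the template concrete, here is a minimal sketch in Python. The JSON parsing, the transform callback, and the in-memory dead-letter list are illustrative assumptions; in a real pipeline the bad records would typically land in a dedicated table or topic instead.

```python
import json

def process(records, transform):
    """Apply transform to each record; route malformed ones to a dead-letter list."""
    results, dead_letters = [], []
    for raw in records:
        try:
            record = json.loads(raw)  # the risky step: the record may be malformed
            results.append(transform(record))
        except (json.JSONDecodeError, KeyError) as error:
            # Instead of failing the whole job, keep the bad record for later analysis.
            dead_letters.append({"record": raw, "error": str(error)})
    return results, dead_letters

good, bad = process(['{"user": "a"}', 'not json'], lambda r: r["user"])
print(good)  # ['a']
print(bad)   # one dead-lettered record, with the error that caused it
```

Notice that the job completes even though one record was unparseable; the failure is recorded rather than propagated.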
However, you shouldn’t follow the Dead-Letter pattern blindly. As for eating a flan every day, implementing the pattern has some consequences you should be aware of. Here, you add an extra logic that adds some extra complexity to the code base. You must be ready to accept them.
Finally, a data engineering design pattern represents a holistic picture of a solution to a given problem. It saves time and also introduces a common language that can greatly simplify discussions with your teammates or with data engineers you have just met.
Yet another set of design patterns?
If you write software, you’ve heard about the Gang of Four’s design patterns1 and maybe even consider them one of the pillars of clean code. And now you’re probably asking yourself: aren’t they enough for data engineering projects? Unfortunately, no.
Software design patterns are recipes you can use to keep a code base easily maintainable. Since the patterns are a standardized way to represent a given concept, they’re quickly understandable by any new person on the project.
For example, a pattern to avoid allocating unnecessary objects is the Singleton. A newcomer aware of design patterns can quickly identify it and understand its purpose in the code.
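As an illustration, here is a minimal Singleton sketch in Python; the ConnectionPool name is only a hypothetical example of an object you would not want to allocate twice.

```python
class ConnectionPool:
    """A minimal Singleton sketch: only one pool instance is ever created."""
    _instance = None

    def __new__(cls):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
        return cls._instance

pool_a = ConnectionPool()
pool_b = ConnectionPool()
assert pool_a is pool_b  # both names point to the same, single instance
```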
Writing maintainable code does indeed apply to data engineering projects, but it’s not enough. Besides the pure software aspects, you need to think about the data ones, such as the aforementioned failure management, backfilling, idempotency, or data correctness.
Common data engineering patterns
The failed-record management from the previous section is only one example of the many design patterns you’ll find in this book. Their organization follows a typical data flow where everything starts with the Data ingestion patterns. Once you have data coming in, you can start applying transformation logic and face the many challenges related to it. You might have errors to deal with. You might need to rerun an already completed job2. Both issues are handled with the Error management and Idempotency patterns.
After defining the strategies to deal with issues that could degrade data quality and propagate poor data to downstream consumers, you’ll see how to leverage Data value patterns to generate valuable datasets for your business domains.
Unfortunately, those datasets are rarely the final parts of a data engineering system. Often, other pipelines transform or combine them later to provide even more data value. That’s why in the following chapter you’ll discover Data flow patterns, which also mark the end of the patterns for data processing.
However, a data engineer’s responsibility doesn’t stop here. There are other important aspects to consider, starting with Data security patterns. In that chapter you’ll see how to handle data privacy requirements and how to reduce the risk of intrusion.
Next, in the Data storage patterns chapter, you’ll see how to leverage the storage layer for optimized data organization. But again, storing data is not enough, and it’s rarely the last step in a data engineer’s scope. That’s the reason why you’ll also see Data quality and Data observability patterns. They should help guarantee that the stored data can be trusted.
Case study used in the book
The design patterns in this book are not tied to one specific business domain. However, understanding them without any business context would be hard, especially for less experienced readers. For that reason, you’ll see each pattern introduced in the context of our case study project, a blog data analytics platform.
Our project follows common data practices and is divided into the layers presented in Figure 1-1, which highlights the three most important parts of the project:
- Online and offline data ingestion components. The online part applies to the data generated by users interacting with the blogs hosted on our platform. The offline part, marked here as “data provider”, applies to static external or internal datasets, such as reference datasets, produced on a more regular schedule than the visit events, for example once an hour.
- The real-time layer, where you can find streaming jobs processing event data from a streaming broker. The jobs here may be one of two types. The first is a business-facing job that generates data for the stakeholders, such as a real-time session aggregation. The second type is a technical job that often acts as an enabler for other business use cases. An example here would be data synchronization to the at-rest data storage for ad hoc querying.
- The third layer follows a dataset organization that is common nowadays, based on the Medallion architecture3 principle, where a dataset may live in one of three areas: Bronze, Silver, and Gold. Each of them applies to a different data maturity level. The Bronze area stores data in its raw format, unaltered and possibly with serious data quality issues. The Silver layer is responsible for the cleansed and enriched datasets. Finally, the Gold area exposes data in the format expected by the final users, such as data marts or reference datasets.
Why is this three-area storage layout interesting in the context of this book? Each layer represents a different data maturity level, exactly like the design patterns presented here. The ones impacting business value will mostly expose data in the Gold area, while the others will remain behind, in the Bronze or Silver layers. The Problem statement sections of the patterns may reference these areas to help you better understand the issue at hand.
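To fix the idea, here is a minimal sketch of how a dataset might move through the three areas. The visit-event schema, the cleansing rules, and the aggregation are hypothetical illustrations, not part of the case study itself.

```python
import pandas as pd

# Bronze: raw visit events, stored as ingested; quality issues are expected here.
bronze = pd.DataFrame([
    {"user": "a", "page": "/post/1", "duration": 12},
    {"user": None, "page": "/post/1", "duration": -5},  # a malformed event
])

# Silver: the cleansed version of the Bronze dataset.
silver = bronze.dropna(subset=["user"]).query("duration >= 0")

# Gold: a business-facing aggregate, e.g. total time spent per page.
gold = silver.groupby("page", as_index=False)["duration"].sum()
print(gold)
```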
The diagram doesn’t present any implementation details on purpose. Focusing on them could shift your attention to the technology instead of the universal, pattern-based solutions that are the main topic of this book. But that doesn’t mean you won’t see any technical details in the next chapters. On the contrary! Each pattern has a dedicated Examples section where you will see different implementations of the presented pattern.
Summary
After this first chapter you should understand not only that flan is a great creamy dessert but also that its recipe is a great analogy for the data engineering design patterns you will discover in the next 11 chapters. I know, it’s a lot, but with a cup of coffee or tea and your favorite dessert (why not flan!), it’ll be an exciting learning journey!
1 Design Patterns: Elements of Reusable Object-Oriented Software (Addison-Wesley Professional) is colloquially known as the “Gang of Four” book because of its four authors, who share 23 standard software engineering patterns.
2 This is known as reprocessing. To avoid confusion, from now on we’re going to refer to any task processing past data as backfilling, whether or not the data has already been processed. Technically there is a small difference between reprocessing and backfilling, which you can learn about in the Glossary available on GitHub.
3 You can learn more about the Medallion architecture in Chapter 4 of “Delta Lake: The Definitive Guide”, available at https://www.oreilly.com/library/view/delta-lake-the/9781098151935/.