97 Things Every Data Engineer Should Know

Book description

Take advantage of today's sky-high demand for data engineers. With this in-depth book, current and aspiring engineers will learn powerful real-world best practices for managing data big and small. Contributors from notable companies including Twitter, Google, Stitch Fix, Microsoft, Capital One, and LinkedIn share their experiences and lessons learned for overcoming a variety of specific and often nagging challenges.

Edited by Tobias Macey, host of the popular Data Engineering Podcast, this book presents 97 concise and useful tips for cleaning, prepping, wrangling, storing, processing, and ingesting data. Data engineers, data architects, data team managers, data scientists, machine learning engineers, and software engineers will greatly benefit from the wisdom and experience of their peers.

Topics include:

  • The Importance of Data Lineage - Julien Le Dem
  • Data Security for Data Engineers - Katharine Jarmul
  • The Two Types of Data Engineering and Data Engineers - Jesse Anderson
  • Six Dimensions for Picking an Analytical Data Warehouse - Gleb Mezhanskiy
  • The End of ETL as We Know It - Paul Singman
  • Building a Career as a Data Engineer - Vijay Kiran
  • Modern Metadata for the Modern Data Stack - Prukalpa Sankar
  • Your Data Tests Failed! Now What? - Sam Bail

Table of contents

  1. Preface
    1. O’Reilly Online Learning
    2. How to Contact Us
    3. Acknowledgments
  2. 1. A (Book) Case for Eventual Consistency
    1. Denise Koessler Gosnell, PhD
  3. 2. A/B and How to Be
    1. Sonia Mehta
  4. 3. About the Storage Layer
    1. Julien Le Dem
  5. 4. Analytics as the Secret Glue for Microservice Architectures
    1. Elias Nema
  6. 5. Automate Your Infrastructure
    1. Christiano Anderson
  7. 6. Automate Your Pipeline Tests
    1. Tom White
      1. Build an End-to-End Test of the Whole Pipeline
      2. Use a Small Amount of Representative Data
      3. Prefer Textual Data Formats over Binary
      4. Ensure That Tests Can Be Run Locally
      5. Make Tests Deterministic
      6. Make It Easy to Add More Tests
  8. 7. Be Intentional About the Batching Model in Your Data Pipelines
    1. Raghotham Murthy
      1. Data Time Window Batching Model
      2. Arrival Time Window Batching Model
      3. ATW and DTW Batching in the Same Pipeline
  9. 8. Beware of Silver-Bullet Syndrome
    1. Thomas Nield
  10. 9. Building a Career as a Data Engineer
    1. Vijay Kiran
  11. 10. Business Dashboards for Data Pipelines
    1. Valliappa (Lak) Lakshmanan
  12. 11. Caution: Data Science Projects Can Turn into the Emperor’s New Clothes
    1. Shweta Katre
  13. 12. Change Data Capture
    1. Raghotham Murthy
  14. 13. Column Names as Contracts
    1. Emily Riederer
  15. 14. Consensual, Privacy-Aware Data Collection
    1. Katharine Jarmul
      1. Attach Consent Metadata
      2. Track Data Provenance
      3. Drop or Encrypt Sensitive Fields
  16. 15. Cultivate Good Working Relationships with Data Consumers
    1. Ido Shlomo
      1. Don’t Let Consumers Solve Engineering Problems
      2. Adapt Your Expectations
      3. Understand Consumers’ Jobs
  17. 16. Data Engineering != Spark
    1. Jesse Anderson
      1. Batch and Real-Time Systems
      2. Computation Component
      3. Storage Component
      4. NoSQL Databases
      5. Messaging Component
  18. 17. Data Engineering for Autonomy and Rapid Innovation
    1. Jeff Magnusson
      1. Implement Reusable Patterns in the ETL Framework
      2. Choose a Framework and Tool Set Accessible Within the Organization
      3. Move the Logic to the Edges of the Pipelines
      4. Create and Support Staging Tables
      5. Bake Data-Flow Logic into Tooling and Infrastructure
  19. 18. Data Engineering from a Data Scientist’s Perspective
    1. Bill Franks
      1. Database Administration, ETL, and Such
      2. Why the Need for Data Engineers?
      3. What’s the Future?
  20. 19. Data Pipeline Design Patterns for Reusability and Extensibility
    1. Mukul Sood
  21. 20. Data Quality for Data Engineers
    1. Katharine Jarmul
  22. 21. Data Security for Data Engineers
    1. Katharine Jarmul
      1. Learn About Security
      2. Monitor, Log, and Test Access
      3. Encrypt Data
      4. Automate Security Tests
      5. Ask for Help
  23. 22. Data Validation Is More Than Summary Statistics
    1. Emily Riederer
  24. 23. Data Warehouses Are the Past, Present, and Future
    1. James Densmore
  25. 24. Defining and Managing Messages in Log-Centric Architectures
    1. Boris Lublinsky
  26. 25. Demystify the Source and Illuminate the Data Pipeline
    1. Meghan Kwartler
  27. 26. Develop Communities, Not Just Code
    1. Emily Riederer
  28. 27. Effective Data Engineering in the Cloud World
    1. Dipti Borkar
      1. Disaggregated Data Stack
      2. Orchestrate, Orchestrate, Orchestrate
      3. Copying Data Creates Problems
      4. S3 Compatibility
      5. SQL and Structured Data Are Still In
  29. 28. Embrace the Data Lake Architecture
    1. Vinoth Chandar
      1. Common Pitfalls
      2. Data Lakes
      3. Advantages
      4. Implementation
  30. 29. Embracing Data Silos
    1. Bin Fan and Amelia Wong
      1. Why Data Silos Exist
      2. Embracing Data Silos
  31. 30. Engineering Reproducible Data Science Projects
    1. Dr. Tianhui Michael Li
  32. 31. Five Best Practices for Stable Data Processing
    1. Christian Lauer
      1. Prevent Errors
      2. Set Fair Processing Times
      3. Use Data-Quality Measurement Jobs
      4. Ensure Transaction Security
      5. Consider Dependency on Other Systems
      6. Conclusion
  33. 32. Focus on Maintainability and Break Up Those ETL Tasks
    1. Chris Moradi
  34. 33. Friends Don’t Let Friends Do Dual-Writes
    1. Gunnar Morling
  35. 34. Fundamental Knowledge
    1. Pedro Marcelino
  36. 35. Getting the “Structured” Back into SQL
    1. Elias Nema
  37. 36. Give Data Products a Frontend with Latent Documentation
    1. Emily Riederer
  38. 37. How Data Pipelines Evolve
    1. Chris Heinzmann
  39. 38. How to Build Your Data Platform like a Product
    1. Barr Moses and Atul Gupte
      1. Align Your Product’s Goals with the Goals of the Business
      2. Gain Feedback and Buy-in from the Right Stakeholders
      3. Prioritize Long-Term Growth and Sustainability over Short-Term Gains
      4. Sign Off on Baseline Metrics for Your Data and How You Measure It
  40. 39. How to Prevent a Data Mutiny
    1. Sean Knapp
  41. 40. Know the Value per Byte of Your Data
    1. Dhruba Borthakur
  42. 41. Know Your Latencies
    1. Dhruba Borthakur
  43. 42. Learn to Use a NoSQL Database, but Not like an RDBMS
    1. Kirk Kirkconnell
  44. 43. Let the Robots Enforce the Rules
    1. Anthony Burdi
  45. 44. Listen to Your Users—but Not Too Much
    1. Amanda Tomlinson
  46. 45. Low-Cost Sensors and the Quality of Data
    1. Dr. Shivanand Prabhoolall Guness
  47. 46. Maintain Your Mechanical Sympathy
    1. Tobias Macey
  48. 47. Metadata ≥ Data
    1. Jonathan Seidman
  49. 48. Metadata Services as a Core Component of the Data Platform
    1. Lohit VijayaRenu
      1. Discoverability
      2. Security Control
      3. Schema Management
      4. Application Interface and Service Guarantee
  50. 49. Mind the Gap: Your Data Lake Provides No ACID Guarantees
    1. Einat Orr
  51. 50. Modern Metadata for the Modern Data Stack
    1. Prukalpa Sankar
      1. Data Assets > Tables
      2. Complete Data Visibility, Not Piecemeal Solutions
      3. Built for Metadata That Itself Is Big Data
      4. Embedded Collaboration at Its Heart
  52. 51. Most Data Problems Are Not Big Data Problems
    1. Thomas Nield
  53. 52. Moving from Software Engineering to Data Engineering
    1. John Salinas
  54. 53. Observability for Data Engineers
    1. Barr Moses
      1. How Good Data Turns Bad
      2. Introducing Data Observability
  55. 54. Perfect Is the Enemy of Good
    1. Bob Haffner
  56. 55. Pipe Dreams
    1. Scott Haines
  57. 56. Preventing the Data Lake Abyss
    1. Scott Haines
      1. Establishing Data Contracts
      2. From Generic Data Lake to Data Structure Store
  58. 57. Prioritizing User Experience in Messaging Systems
    1. Jowanza Joseph
  59. 58. Privacy Is Your Problem
    1. Stephen Bailey, PhD
  60. 59. QA and All Its Sexiness
    1. Sonia Mehta
  61. 60. Seven Things Data Engineers Need to Watch Out for in ML Projects
    1. Dr. Sandeep Uttamchandani
  62. 61. Six Dimensions for Picking an Analytical Data Warehouse
    1. Gleb Mezhanskiy
      1. Scalability
      2. Price Elasticity
      3. Interoperability
      4. Querying and Transformation Features
      5. Speed
      6. Zero Maintenance
  63. 62. Small Files in a Big Data World
    1. Adi Polak
      1. What Are Small Files, and Why Are They a Problem?
      2. Why Does It Happen?
      3. Detect and Mitigate
      4. Conclusion
      5. References
  64. 63. Streaming Is Different from Batch
    1. Dean Wampler, PhD
  65. 64. Tardy Data
    1. Ariel Shaqed
  66. 65. Tech Should Take a Back Seat for Data Project Success
    1. Andrew Stevenson
  67. 66. Ten Must-Ask Questions for Data-Engineering Projects
    1. Haidar Hadi
      1. Question 1: What Are the Touch Points?
      2. Question 2: What Are the Granularities?
      3. Question 3: What Are the Input and Output Schemas?
      4. Question 4: What Is the Algorithm?
      5. Question 5: Do You Need Backfill Data?
      6. Question 6: When Is the Project Due Date?
      7. Question 7: Why Was That Due Date Set?
      8. Question 8: Which Hosting Environment?
      9. Question 9: What Is the SLA?
      10. Question 10: Who Will Be Taking Over This Project?
  68. 67. The Data Pipeline Is Not About Speed
    1. Rustem Feyzkhanov
  69. 68. The Dos and Don’ts of Data Engineering
    1. Christopher Bergh
      1. Don’t Be a Hero
      2. Don’t Rely on Hope
      3. Don’t Rely on Caution
      4. Do DataOps
  70. 69. The End of ETL as We Know It
    1. Paul Singman
      1. Replacing ETL with Intentional Data Transfer
      2. Agreeing on a Data Model Contract
      3. Removing Data Processing Latencies
      4. Taking the First Steps
  71. 70. The Haiku Approach to Writing Software
    1. Mitch Seymour
      1. Understand the Constraints Up Front
      2. Start Strong Since Early Decisions Can Impact the Final Product
      3. Keep It as Simple as Possible
      4. Engage the Creative Side of Your Brain
  72. 71. The Hidden Cost of Data Input/Output
    1. Lohit VijayaRenu
      1. Data Compression
      2. Data Format
      3. Data Serialization
  73. 72. The Holy War Between Proprietary and Open Source Is a Lie
    1. Paige Roberts
  74. 73. The Implications of the CAP Theorem
    1. Paul Doran
  75. 74. The Importance of Data Lineage
    1. Julien Le Dem
  76. 75. The Many Meanings of Missingness
    1. Emily Riederer
  77. 76. The Six Words That Will Destroy Your Career
    1. Bartosz Mikulski
  78. 77. The Three Invaluable Benefits of Open Source for Testing Data Quality
    1. Tom Baeyens
  79. 78. The Three Rs of Data Engineering
    1. Tobias Macey
      1. Reliability
      2. Reproducibility
      3. Repeatability
      4. Conclusion
  80. 79. The Two Types of Data Engineering and Data Engineers
    1. Jesse Anderson
      1. Types of Data Engineering
      2. Types of Data Engineers
      3. Why These Differences Matter to You
  81. 80. The Yin and Yang of Big Data Scalability
    1. Paul Brebner
  82. 81. Threading and Concurrency in Data Processing
    1. Matthew Housley, PhD
      1. Operating System Threading
      2. Threading Overhead
      3. Solving the C10K Problem
      4. Scaling Is Not a Magic Bullet
      5. Further Reading
  83. 82. Three Important Distributed Programming Concepts
    1. Adi Polak
      1. MapReduce Algorithm
      2. Distributed Shared Memory Model
      3. Message Passing/Actors Model
      4. Conclusions
  84. 83. Time (Semantics) Won’t Wait
    1. Marta Paes Moreira and Fabian Hueske
  85. 84. Tools Don’t Matter, Patterns and Practices Do
    1. Bas Geerdink
  86. 85. Total Opportunity Cost of Ownership
    1. Joe Reis
  87. 86. Understanding the Ways Different Data Domains Solve Problems
    1. Matthew Seal
  88. 87. What Is a Data Engineer? Clue: We’re Data Science Enablers
    1. Lewis Gavin
      1. AI and Machine Learning Models Require Data
      2. Clean Data == Better Model
      3. Finally Building a Model
      4. A Model Is Useful Only If Someone Will Use It
      5. So What Am I Getting At?
  89. 88. What Is a Data Mesh, and How Not to Mesh It Up
    1. Barr Moses and Lior Gavish
      1. Why Use a Data Mesh?
      2. The Final Link: Observability
  90. 89. What Is Big Data?
    1. Ami Levin
  91. 90. What to Do When You Don’t Get Any Credit
    1. Jesse Anderson
  92. 91. When Our Data Science Team Didn’t Produce Value
    1. Joel Nantais
  93. 92. When to Avoid the Naive Approach
    1. Nimrod Parasol
  94. 93. When to Be Cautious About Sharing Data
    1. Thomas Nield
  95. 94. When to Talk and When to Listen
    1. Steven Finkelstein
  96. 95. Why Data Science Teams Need Generalists, Not Specialists
    1. Eric Colson
  97. 96. With Great Data Comes Great Responsibility
    1. Lohit VijayaRenu
      1. Put Yourself in the User’s Shoes
      2. Ensure Ethical Use of User Information
      3. Watch Your Data Footprint
  98. 97. Your Data Tests Failed! Now What?
    1. Sam Bail, PhD
      1. System Response
      2. Logging and Alerting
      3. Alert Response
      4. Stakeholder Communication
      5. Root Cause Identification
      6. Issue Resolution
  99. Contributors
  100. Index

Product information

  • Title: 97 Things Every Data Engineer Should Know
  • Author(s): Tobias Macey
  • Release date: June 2021
  • Publisher(s): O'Reilly Media, Inc.
  • ISBN: 9781492062417