Chapter 1. Analytics Engineering
The historical development of analytics includes significant milestones and technologies that have shaped the field into what it is today. It began with the advent of data warehousing in the 1980s, which created the foundational framework for organizing and analyzing business data. Bill Inmon, a computer scientist who published extensively on the subject throughout the 1980s and 1990s, is widely regarded as having provided the first solid theoretical foundation for data warehousing.
A subsequent wave of development occurred when Ralph Kimball, another leading contributor to data warehousing and business intelligence (BI), published his influential work, The Data Warehouse Toolkit, in 1996. Kimball’s work laid the foundation for dimensional modeling, marking another crucial milestone in the evolution of analytics. Together, the contributions of Inmon and Kimball, spanning the late 20th century, played pivotal roles in shaping the landscape of data warehousing and analytics.
In the early 2000s, the emergence of tech giants like Google and Amazon created the need for more advanced solutions for processing massive amounts of data, leading to the publication of the Google File System paper and the release of Apache Hadoop. This marked the era of Big Data Engineering, in which professionals used the Hadoop framework to process large amounts of data.
The rise of public cloud providers like Amazon Web Services (AWS) revolutionized the way software and data applications were developed and deployed. One of the pioneering offerings from AWS was Amazon Redshift, introduced in 2012. It represented an interesting blend of online analytical processing (OLAP) and traditional database technologies. In its early days, Redshift required database administrators to manage tasks like vacuuming and scaling to maintain optimal performance.
Over time, cloud native technologies have continued to evolve, and Redshift itself has undergone significant enhancements. While retaining its core strengths, newer versions of Redshift, along with cloud native platforms like Google BigQuery and Snowflake, have streamlined many of these administrative tasks, offering advanced data processing capabilities to enterprises of all sizes. This evolution highlights the ongoing innovation within the cloud data processing ecosystem.
The modern data stack, consisting of tools like Apache Airflow, data build tool (dbt), and Looker, further transformed data workflows. With these advances, the term "Big Data engineer" became obsolete, making way for the broader and more inclusive role of the data engineer. This shift was recognized in the influential articles of Maxime Beauchemin (creator of Apache Superset and Airflow and one of the first data engineers at Facebook and Airbnb), particularly in his article "The Rise of the Data Engineer", which highlighted the growing importance of data engineering in the industry. All of these rapid developments in the data field have led to significant changes in the role of data professionals: with the advent of modern data tools, tasks that were once simple and routine are becoming strategic.
Today’s data engineers have a multifaceted role that encompasses data modeling, quality assurance, security, data management, architectural design, and orchestration. They are increasingly adopting software engineering practices and concepts, such as functional data engineering and declarative programming, to enhance their workflows. While Python and structured query language (SQL) stand out as indispensable languages for data engineers, it’s important to note that the choice of programming languages can vary widely in this field. Engineers may leverage other languages such as Java (commonly used for managing Apache Spark and Beam), Scala (also prevalent in the Spark and Beam ecosystem), Go, and more, depending on the specific needs and preferences of their projects. The combination of languages like Java and SQL is also common among data engineers at large organizations.
Organizations are increasingly moving toward decentralized data teams, self-service platforms, and alternative data storage options. As data engineers are forced to adapt to all these market changes, we often see some taking on a more technical role, focusing on platform enablement. Other data engineers work closer to the business, designing, implementing, and maintaining systems that turn raw data into high-value information as they adapt to this accelerated industry that is bringing new tools to market every day and spawning the fantastic world of analytics engineering.
In this chapter, we provide an introduction to the field of analytics engineering and its role in the data-driven decision-making process. We discuss the importance of analytics engineering in today’s data-driven world and the primary roles of an analytics engineer. In addition, we will explore how the analytics engineering lifecycle is used to manage the analytics process and how it ensures the quality and accuracy of the data and insights generated. We will also address the current trends and technologies shaping the field of analytics engineering, from history to the present, touching on emerging concepts like data mesh, and discussing the fundamental choices between extract, load, and transform (ELT) and extract, transform, and load (ETL) strategies as well as the many data modeling techniques being adopted around the world.
Databases and Their Impact on Analytics Engineering
For a long time now, data has increasingly become the focus of interest for companies that want to stay one step ahead of the competition, improve their internal processes, or merely understand the behavior of their customers. With new tools, new ways of working, and new areas of knowledge such as data science and BI, it’s becoming increasingly difficult to fully survey and understand the data landscape these days.
The natural progress of technology has produced an abundance of data analysis, visualization, and storage tools, each offering unique features and capabilities. However, the accelerated deployment of these tools has resulted in a fragmented landscape, requiring individuals and organizations to stay up-to-date with the latest technological developments while making prudent choices about how to use them. At times, this abundance creates confusion and demands a continuous cycle of learning and adaptation.
The evolution of work practices is accompanied by a diversification of tools. Dynamic and Agile methodologies have replaced traditional approaches to data management and analysis. Iterative practices and cross-functional collaboration introduce flexibility and speed to data projects, but they also pose a challenge in harmonizing workflows across diverse teams and roles. Effective communication and alignment are crucial as diverse facets of the data process converge, creating a need for a comprehensive understanding of these novel work practices.
Specialized areas such as data science and BI have increased the complexity of the data field as well. Data scientists apply advanced statistical and machine learning techniques to detect complex patterns, whereas BI experts extract valuable information from raw data to produce practical insights. Such specialized areas introduce refined techniques that require regular skill development and learning. A successful adoption of these practices necessitates a dedicated commitment to education and a flexible approach to skill acquisition.
As data spreads across the digital domain, it arrives in unprecedented volumes, varieties, and velocities. This flood of data, along with the complexity of present-day data sources such as Internet of Things (IoT) devices and unstructured text, makes data management even more demanding. The intricacies of integrating and transforming data and of assessing its accuracy become more apparent, emphasizing the need for robust methods that guarantee reliable and precise insights.
The multifaceted nature of the data world compounds this complexity. Because the field draws on skills from several domains, including computer science, statistics, and domain-specific expertise, a cooperative and communicative strategy is necessary. This multidisciplinary interaction underscores the importance of effective teamwork and knowledge sharing.
But that has not always been the case. For decades, spreadsheets were the standard technology for storing, managing, and analyzing data at all levels, both for business operational management and for the analytics needed to understand it. However, as businesses have become more complex, so has the need for data-related decision making. The first major response to that need came in the form of a revolution: databases. A database can be defined as an organized collection of structured information or data, usually stored electronically in a computer system. This data can be in the form of text, numbers, images, or other types of digital information. Data is stored in a way that facilitates access and retrieval using a set of predefined rules and structures called a schema.
Databases are an essential part of analytics because they provide a way to efficiently store, organize, and retrieve large amounts of data, allowing analysts to easily access the data they need to perform complex analyses to gain insights that would otherwise be difficult or impossible to obtain. In addition, databases can be configured to ensure data integrity, which guarantees that the data being analyzed is accurate and consistent and thus makes the analysis more reliable and trustworthy.
One of the most common ways to use databases for analytics is the data warehousing technique, that is, to construct and use a data warehouse. A data warehouse is a large, centralized data store designed to simplify data use. The data in a data warehouse is typically extracted from a variety of sources, such as transactional systems, external data feeds, and other databases. The data is then cleansed, transformed, and integrated into a single, consistent data model that typically follows a dimensional modeling technique such as the star schema or Data Vault.
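To make the dimensional modeling idea more concrete, here is a minimal sketch of a star schema in generic SQL, assuming a hypothetical retail scenario; the table and column names are illustrative rather than taken from any specific system.

-- Dimension tables hold descriptive attributes
CREATE TABLE dim_customer (
    customer_key  INT PRIMARY KEY,
    customer_name VARCHAR(100),
    country       VARCHAR(50)
);

CREATE TABLE dim_date (
    date_key       INT PRIMARY KEY,  -- e.g., 20240131
    calendar_date  DATE,
    calendar_year  INT,
    calendar_month INT
);

-- The fact table holds measures and references the dimensions,
-- forming the "star" that analytical queries are built around
CREATE TABLE fact_sales (
    sale_id      BIGINT PRIMARY KEY,
    customer_key INT REFERENCES dim_customer (customer_key),
    date_key     INT REFERENCES dim_date (date_key),
    quantity     INT,
    revenue      DECIMAL(12, 2)
);

Because every fact row must reference valid dimension members, a schema like this also illustrates how a database can enforce the data integrity discussed earlier.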
Another important use of databases in analytics is the process of data mining. Data mining uses statistical and machine learning techniques to uncover patterns and relationships in large datasets. In this way, trends can be identified, future behavior can be predicted, and other types of predictions can be made.
Database technologies have thus played a crucial role in the emergence of data science by providing a way to efficiently store, organize, and retrieve large amounts of data, enabling data scientists to work with large datasets and focus on what matters: gaining knowledge from data.
The use of SQL and other programming languages, such as Python or Scala, that allow interaction with databases has enabled data scientists to perform complex data queries and manipulations. Also, the use of data visualization tools such as Tableau and Microsoft Power BI, which easily integrate with database engines, has made it easier for data scientists to present their findings in a clear and intuitive way.
With the advent of Big Data and the growing demand to store and process vast datasets, various database technologies have emerged to meet diverse needs. For instance, data analysts often rely on databases for a wide range of applications, including data warehousing, data mining, and integration with BI tools like Tableau.
However, it’s important to delve deeper into these use cases to understand the need for analytics engineering. When connecting BI tools directly to operational databases (online transaction processing [OLTP] replicas), performance and scalability can be limited. This approach may work well for smaller datasets and simple queries, but as data volumes grow and the complexity of analytics increases, it can lead to performance bottlenecks and suboptimal query response times.
This is where analytics engineering comes into play. Analytics engineers are experts in optimizing data workflows, transforming and aggregating data to ensure it's in the right format for analytical tasks. They design and maintain the data pipelines that extract, transform, and load data from various sources into optimized data warehouses or data lakes. By doing so, they help organizations overcome the limitations of direct OLTP connections, enabling faster and more efficient data analysis with tools like Tableau. In essence, analytics engineering bridges the gap between raw data and actionable insights, ensuring that data analysts and scientists can work with large, complex datasets effectively.
Cloud Computing and Its Impact on Analytics Engineering
In recent decades, the world has faced a series of complicated challenges with significant technical implications. Economic downturns have driven innovations in financial technologies and risk management systems. Geopolitical tensions have required advances in cybersecurity to protect critical infrastructure and sensitive data. Global health crises have underscored the importance of advanced data analytics and predictive modeling for disease surveillance and management. In addition, the urgent need to combat climate change has driven the development of cutting-edge renewable energy technologies and sustainable engineering solutions to meet climate goals.
Amid these challenges, the pursuit of profit and growth remains a key driver for businesses worldwide. However, the value of human labor time has taken on a new dimension, leading to significant changes in the way businesses operate and how cloud computing accommodates them. This change is reflected in the increasing adoption of managed and serverless offerings that reduce reliance on full-time support staff such as database administrators.
As companies adapt to this changing landscape, innovation, differentiation, and sustainability of business models and strategies have become essential considerations for companies seeking to succeed in a rapidly changing world. The information technology and systems industry found in this context a good opportunity to grow its capabilities in helping organizations overcome this world of uncertainty and pressure. The rationalization of operating models has become urgent, requiring a re-evaluation of data centers and pricing structures. In addition, product and service offerings must focus primarily on ease of use, lower latency, improved security, a broader range of real-time tools, more integration, more intelligence, less code, and a faster time to market.
Organizations have recognized the importance of investing in innovative tools, driving digital transformation, and adopting a data-centric approach to decision making to achieve greater agility and competitive advantage. To achieve these goals, many are focusing on leveraging well-curated data from internal and external sources. This carefully structured data can provide valuable insights into business performance.
In the industry, the practice of creating, visualizing, and analyzing interconnected business data in an accessible format is commonly referred to as data analytics. Historically, it has also been known as business intelligence, and the two terms are closely related. While BI is a subset of analytics and focuses on business-oriented decision making, data analytics encompasses a broader spectrum that includes product analytics, operational analytics, and several other specialized areas. Both BI and data analytics play pivotal roles in helping organizations gain a competitive edge through data-driven insights.
Although data analytics offers numerous benefits for improving and reshaping business strategies and monitoring performance, it requires significant financial investment in servers, software licenses, and specialized staff such as data engineers, data scientists, and data visualization specialists. In times of economic crisis, the high up-front and operational costs associated with IT hardware, software, and specialists can be perceived as impractical and unattractive.
As a result, on-premises solutions, where the infrastructure for data analytics is set up and managed on a company’s own premises, often lose their appeal. This is especially true for newcomers to analytics who are unfamiliar with the concept. On-premises solutions typically require significant investment in hardware, software, and ongoing maintenance. They are also less flexible and scalable compared to cloud-based data analytics solutions. This shift in preferences is clearing the way for new cloud-based data analytics solutions that meet similar business needs as traditional data analytics. However, instead of relying on on-premises servers and software, cloud-based solutions leverage cloud computing services to accelerate deployment and minimize infrastructure costs.
The increasing adoption of cloud computing in various industries has led software vendors such as Microsoft, Google, and Amazon to develop advanced tools for data analysis and data warehousing. These tools are designed to operate in the cloud computing paradigm and leverage shared network resources to enable greater accessibility and streamlined deployment. A vivid example of this trend is Microsoft’s comprehensive data analytics platform, Microsoft Fabric.
In parallel, dbt from dbt Labs, which we discuss in more detail later in this book, stands out as a versatile hybrid product. dbt, like Hadoop, is an open source solution that gives users the flexibility to deploy it according to their specific needs, whether in the cloud or on premises. In its cloud version, dbt integrates seamlessly with leading cloud platforms, including Microsoft Azure, Google Cloud Platform (GCP), and AWS. This open source nature gives organizations the ability to customize their deployment to their unique requirements and infrastructure preferences.
While cloud-based data analytics solutions and platforms are a global trend and a central concept of the modern data platform, it’s important to recognize that cloud computing solutions bring both benefits and risks that shouldn’t be overlooked. These risks include potential security issues, the physical location of servers, and the costs associated with moving away from a particular provider.
Nonetheless, cloud technologies are currently changing the way organizations deploy and construct information systems and technology solutions, and data analytics is no exception. That's why it's essential to recognize that moving to the cloud will soon no longer be an option but a necessity, and why it's important to understand the benefits of consuming analytics solutions as services. Otherwise, providing timely information to decision makers with on-premises solutions that lack flexibility and scalability will become increasingly challenging.
However, although cloud technologies bring several benefits, such as economies of scale and flexibility, they also bring information security issues. The concentration of data in cloud infrastructures makes them attractive targets for unauthorized attacks. To succeed in the cloud in the data context, organizations must understand and mitigate the risks associated with cloud computing. Key risks include data privacy, loss of control, incomplete or insecure deletion of data, unauthorized internal access, data availability, and complex costing.
Data privacy is a significant concern because it's challenging to verify that vendors are handling data in compliance with laws and standards, even though public audit reports from vendors can help build trust. In nonintegrated scenarios, data security risks multiply as data flows among various systems and data centers, increasing the risk of interception and of synchronization issues. Another important risk is vendor dependency, which occurs when responsibility for data management rests solely with one service provider, limiting the ability to migrate to other solutions. This kind of dependency ends up limiting an organization's control over decision making and authority over data. While these are just a few known risks, we can already see that organizations need to get a handle on them to effectively reap the benefits of cloud-based data analytics solutions. This requires careful consideration, adherence to security standards and best practices, and ongoing cost control to measure the return on investment.
If all risks are correctly addressed and mitigated in a proper data strategy that outlines how an organization will manage its information assets, including the cloud strategy, technology, processes, people, and rules involved, an organization can gain a substantial competitive advantage when compared to one that doesn’t have a data strategy. By focusing on cloud computing and leveraging a cloud data platform, organizations can transform raw data into meaningful insights, accelerating the process of building a solid data foundation. This enables efficient sourcing, structuring, and analysis of relevant data, and it even supports the adoption of AI technologies while driving value in less time and at a lower cost than traditional methods.
Interestingly, the relationship between a cloud data platform, analytics, and AI is symbiotic. Implementing a cloud data platform accelerates the adoption of an analytics-driven architecture and enables the full operationalization of AI initiatives. It empowers organizations to use all relevant data, gain enterprise-wide insights, and unlock new business opportunities. By eliminating the need to manage multiple tools, organizations can focus on data modernization, accelerate insight discovery, and benefit from existing technology partnerships, thereby advancing their AI journey.
This is why it’s fair to say that cloud computing has been a core component of both modern data platforms and the cloud-based analytics and AI platforms that continuously grow in volume every day and thus contribute to the disruption of this industry.
The Data Analytics Lifecycle
The data analytics lifecycle is a series of steps to transform raw data into valuable and easily consumable data products. These can range from well-managed datasets to dashboards, reports, APIs, or even web applications. In other words, it describes how data is created, collected, processed, used, and analyzed to achieve a specific product or business goal.
The increasing complexity in organizational dynamics directly impacts how data is handled. Numerous people must use the same data but with different goals. While a top executive might need to know just a few top-level key performance indicators to track business performance, a middle manager might need a more granular report to support daily decisions.
This highlights the need for a governed and standardized approach to creating and maintaining data products based on the same data foundation. Given the many decisions an organization must make regarding its data governance, technologies, and management processes, following a structured approach is fundamental to documenting and continuously updating an organization’s data strategy.
The data analytics lifecycle is, therefore, an essential framework for understanding and mapping the phases and processes involved in creating and maintaining an analytics solution (Figure 1-1). It is an essential concept in data science and analytics and provides a structured approach to managing the various tasks and activities required to create an effective analytics solution.
The data analytics lifecycle typically includes the following stages:
- Problem definition: The first phase of the analytics cycle is about understanding the problem that needs to be solved. This includes identifying the business objectives, the available data, and the resources needed to solve the problem.
- Data modeling: After the business requirements are identified and an assessment of data sources is completed, you can begin modeling your data according to the modeling technique that best meets your needs. You can choose a diamond strategy, a star schema, a Data Vault, or even a fully denormalized technique. All these concepts will be discussed in Chapter 2.
- Data ingestion and transformation: The next phase is to ingest and prepare the data coming from the source systems to match the models created. Depending on the overall information architecture, you can opt for a schema-on-write strategy, where you put more effort into transforming the raw data directly into your models, or a schema-on-read strategy, where you ingest and store the data with minimal transformations and move heavy transformations to the downstream layers of your data platform (a small sketch follows this list).
- Data storage and structuring: Once the data pipelines are designed and potentially implemented, you need to decide on the file formats to use, whether simple Apache Parquet or more advanced formats like Delta Lake or Apache Iceberg, as well as the partitioning strategies and storage components, whether a cloud-based object store like Amazon Simple Storage Service (S3) or a data warehouse platform like Redshift, BigQuery, or Snowflake.
- Data visualization and analysis: Once the data is available, the next step is to explore it, visualize it, or create dashboards that directly support decision making or enable business process monitoring. This phase is very business oriented and should be carried out in close coordination with business stakeholders.
- Data quality monitoring, testing, and documentation: Although illustrated as the final phase of the analytics lifecycle, data quality should be an end-to-end concern, ensured by design across the whole flow. It involves implementing quality controls so that stakeholders can trust the exposed data models, documenting all transformations and semantic meanings, and ensuring proper testing along the pipelines as the data continues to flow.
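To illustrate the schema-on-read idea mentioned in the data ingestion and transformation stage, the following is a minimal sketch using Snowflake-style SQL, in which raw JSON events are landed with almost no structure and only shaped into typed columns downstream; the table and field names are hypothetical.

-- Land the raw events with minimal transformation (schema-on-read)
CREATE TABLE raw_orders (
    ingested_at TIMESTAMP,
    payload     VARIANT  -- semi-structured JSON stored as-is
);

-- Apply structure downstream, when the data is modeled for consumption
CREATE TABLE stg_orders AS
SELECT
    payload:order_id::STRING       AS order_id,
    payload:customer.id::STRING    AS customer_id,
    payload:amount::NUMBER(12, 2)  AS revenue,
    ingested_at
FROM raw_orders;

With a schema-on-write strategy, by contrast, the parsing and typing would happen in the pipeline before the data ever reaches the warehouse.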
Note
With dbt, several of these components are deployed more easily and efficiently because it allows us to build them in parallel and across the lifecycle. Documentation, testing, and quality become common tasks performed in parallel. This will be extensively elaborated in Chapter 4.
The analytics lifecycle is a key concept that enables organizations to approach data engineering, science, and analytics processes in a structured and consistent manner. By following a structured process, organizations can ensure they are solving the right problem, using the right data, and building data products that are accurate and reliable, ultimately leading to better decision making and better business results.
The New Role of Analytics Engineer
As mentioned in previous sections, data scientists and analysts can now easily access the data they need to perform complex analyses and gain insights that would otherwise be difficult or impossible to obtain. However, as the amount of data stored and analyzed continues to grow, it is becoming increasingly important for organizations to have data specialists to help them manage that data and provide the infrastructure needed.
The recently created branch of specialized data engineers, called analytics engineers, plays an integral role in developing and maintaining databases and data pipelines, allowing data scientists and analysts to focus on more advanced analytics tasks. Analytics engineers are responsible for designing, building, and maintaining the data architecture that enables organizations to turn data into valuable insights and make data-driven decisions.
In addition, the move from traditional ETL processes with enforced schema-on-write to an ELT, schema-on-read approach means that data now lands in the data repositories before it has been transformed. This creates an opportunity for highly technical analysts who both know the business very well and have the technical skills to model the raw data into clean, well-defined datasets: analytics engineers. Finding these skills in the world of traditional data warehouses and the ETL paradigm would have required specialists with both software engineering and data analytics backgrounds, who are much harder to find.
The analytics engineer acts as a bridge between data platform engineers, focused on building the technical infrastructure to enable data platforms, and data analysts, focused on converting data into insightful data products. Their job is to create well-tested, up-to-date, and documented datasets that the rest of the organization can use to answer their own questions. They are technically savvy enough to apply software development best practices such as version control and continuous integration and continuous deployment (CI/CD) but also need to be able to communicate effectively with stakeholders.
We can draw an analogy to civil engineering: data platform engineers are the foundation of an analytics project, responsible for ensuring that the infrastructure is robust, including plumbing, electrical systems, and the structural foundation. They lay the groundwork for everything to come.
Analytics engineers can be likened to architects. They take the solid foundation created by data engineers and design structures that align with the business model, constructing everything from exceptional dashboards to valuable data models. They bridge the gap between the technical infrastructure and the business objectives.
Data analysts, in this analogy, serve as interior designers. They step inside the constructed buildings, not only ensuring that the content is aligned with users but also making it user-friendly and tailored to the specific needs of data consumers. Together, these roles collaborate to create a holistic and functional analytics environment.
Looking at the data analytics lifecycle, data platform engineers build platforms and ingest raw data into enterprise-wide data stores. On the other hand, analytics engineers take the raw data and transform it to match the analytical data models the business needs to support decision making.
Responsibilities of an Analytics Engineer
The role of an analytics engineer is becoming increasingly important as both the volume and complexity of data, as well as its diverse applications, continue to grow. The work spans everything from designing and implementing data storage and retrieval systems to creating and maintaining data pipelines and developing and deploying machine learning models. In this dynamic landscape, analytics engineers play a vital role in harnessing the growing data resources and maximizing their value across a wide range of applications.
Based on the latest role trends, one of the main responsibilities is to design and implement efficient data storage and retrieval systems. This includes working with databases and data warehousing technologies to design data models and structures that can handle large and complex datasets. Another immediate responsibility is creating and maintaining data pipelines that extract data from various sources, transform it, and load it into a central repository for analysis.
For most analytics engineers, the development and use of machine learning models is somewhat less observable but still happening. This includes working with data scientists to understand their requirements, selecting and implementing the appropriate algorithms, and ensuring that the models are properly trained and deployed with the correct set of training and testing data. When this is not the case, analytics engineers collaborate on building the proper data pipelines to continuously feed data scientists with proper training and testing data.
In addition, analytics engineers are responsible for monitoring and maintaining the performance of machine learning models, both by helping to structure offline evaluation and by combining model-specific metrics with business metrics for online monitoring.
An analytics engineer is typically proficient in programming languages and tools such as Python, R, SQL, and Spark to implement data pipelines, data models, and machine learning models. They should also be familiar with cloud computing platforms like AWS, GCP, or Azure to deploy and scale their solutions.
The responsibilities of analytics engineers, as observed across several companies, can include the following:
- Design and implement data storage and retrieval systems, such as databases and data warehouses, that can handle large and complex datasets. Create and maintain data pipelines to extract, transform, and load data from various sources into a central repository for analysis.
- Ensure data is accurate, complete, consistent, and accessible by performing data quality checks, tracking data flows, and implementing data security measures.
- Leverage cloud computing platforms such as AWS, GCP, or Azure to deploy and scale analytics solutions, and ensure the scalability, security, and cost optimization of the data infrastructure.
- Optimize the performance of data storage and retrieval systems, data pipelines, and machine learning models to ensure they can handle the volume and complexity of data.
- Use programming languages and tools such as Python, R, SQL, and Spark to implement data pipelines, data models, and machine learning models.
- Collaborate with data scientists to understand their requirements, select and implement appropriate algorithms, and ensure machine learning models are properly trained and deployed. Monitor and maintain the performance of machine learning models, and troubleshoot and optimize as needed.
- Keep up-to-date with the latest technologies and trends in data engineering, machine learning, and analytics, and continually seek opportunities to improve the organization's data infrastructure and analytics capabilities.
The role of an analytics engineer is broad and requires a combination of technical skills, problem-solving ability, and an understanding of business needs. Analytics engineers must be comfortable with both the technical and business aspects of data science and should be able to bridge the gap between data scientists and IT.
Enabling Analytics in a Data Mesh
A data mesh is a modern framework outlining an organization’s data strategy. It enables business domain teams to take ownership of their data and the services that provide access to it instead of relying only on a central data team. It decomposes a monolithic data architecture into a set of independent, autonomous data services, enabling finer scaling, more autonomy, and better data management. It provides more flexibility in handling different types of data and enables a culture of experimentation, innovation, and collaboration. With a data mesh, enterprises should be able to move faster and respond more quickly to changing business needs.
The emergence of data mesh methodology as an architectural pattern has revolutionized the way analysts interact with data infrastructure. By decomposing a monolithic data architecture into a series of independent, autonomous data services that can be developed, deployed, and operated independently, teams can address the challenges of scalability, manageability, and autonomy of the data architecture in a more granular and effortless way.
With this novel approach, teams can scale their data infrastructure more granularly, reducing the risk of data silos and duplication. Each business domain team also has more autonomy, allowing them to choose the best tools and technologies for their specific needs but leverage centrally offered services to manage the whole data lifecycle. This permits them to move faster, be more agile, and respond quickly to changing business needs. In addition, a data mesh approach provides more flexibility in handling different types of data, such as structured, semi-structured, and unstructured data. It also enables better data governance practices by breaking down the monolithic data architecture and enabling clear mapping of data services.
An analytics engineer can deliver value in a data mesh organization by focusing on building and maintaining independent, autonomous data services that support the needs of multiple teams and applications, such as shared data models, well-governed and documented to ensure effortless data discoverability, accessibility, and security.
Another meaningful aspect of working on a data mesh is ensuring data governance and security, which can include implementing data policies and procedures, such as data access controls, data lineage, and data quality checks, to ensure that data is secure and of high quality. In addition, analytics engineers should work with data owners and stakeholders to understand and comply with all regulatory requirements for data storage and management.
Working in a data mesh requires a different mindset than in traditional monolithic data architectures. Analytics engineers must move away from the notion that data is a centralized resource and consider it as distributed autonomous services that various teams can use.
Data Products
Another concept we have been using that is important to define is the data product. Data products are applications that provide access to data-driven insights that support, or even automate, business decision-making processes. Internally, they may contain components for retrieving, transforming, analyzing, and interpreting data. Another important aspect is that data products should expose their data in such a way that it can be accessed and used by other internal or external applications or services.
Some examples of data products are as follows:
- A REST API that allows users to query a specific business data model
- A data pipeline that ingests and processes data from various sources
- A data lake that stores and manages large amounts of structured and unstructured data
- A data visualization tool that helps users understand and communicate data insights
Data products can also consist of microservices. These are small, independent, and focused services that can be developed, deployed, and scaled independently. They can be accessed via an API and reused across the enterprise.
dbt as a Data Mesh Enabler
dbt is an open source tool that helps data engineers, analytics engineers, and data analysts build a data mesh by providing a way to create, test, and manage data services. It allows teams to define, test, and build data models and create a clear and well-defined interface for these models so that other teams and applications can easily use them.
The dbt features that support the creation of a data mesh include the following:
- Data modeling capabilities: Teams can define their data models using a simple and familiar SQL-based syntax, which makes it easy for data engineers and data analysts to define and test data models together.
- Data testing capabilities: dbt provides a testing framework that allows teams to test their data models and ensure that they are accurate and reliable. This helps identify errors early in the development process and ensures that data services are of high quality (a small test sketch follows this list).
- Data documentation: dbt enables data models and services to be documented so that they can be easily understood and used by other teams and applications.
- Data tracking capabilities: Teams can trace the origin of data models, which makes it easy to understand how data is used and where it came from.
- Data governance capabilities: dbt makes it possible to enforce data governance policies such as data access controls, data lineage, and data quality checks, which help ensure that data is secure and of high quality.
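As a small illustration of the testing capabilities, a singular test in dbt is simply a SQL file placed in the project's tests directory that selects the rows violating an assumption; the test passes when the query returns no rows. The sketch below assumes a hypothetical orders model with a revenue column.

-- tests/assert_no_negative_revenue.sql
-- The test fails if any order has negative revenue
select *
from {{ ref('orders') }}
where revenue < 0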
While the primary focus of analytics engineering is on designing and implementing data models, it’s important to note that data tracking and governance capabilities can significantly enhance the effectiveness of analytics engineering processes. These capabilities can be particularly valuable in scenarios where data models need to trace the origin of data and adhere to stringent data governance policies. Adoption of such practices and governance models, including a data mesh, may vary depending on the specific needs and complexity of the data environment. Many successful dbt deployments start with simpler single-star schema data models and may explore advanced concepts like data mesh as their data needs evolve over time.
The Heart of Analytics Engineering
Data transformation converts data from one format or structure to another to make it more useful or suitable for a particular application or purpose. This process is necessary because it enables organizations to transform raw, unstructured data into valuable insights that can inform business decisions, improve operations, and drive growth.
Data transformation is a critical step in the analytics lifecycle, and it is important that organizations have the tools and technology to perform this task efficiently and effectively. Some examples of data transformation include cleaning and preparing data, aggregating and summarizing data, and enriching data with additional information. The use of dbt is widespread for data transformation because it allows organizations to perform complex data transformation tasks quickly and easily, and it can be integrated with other tools, such as Airflow, for end-to-end data pipeline management.
dbt is the gemba for analysts and enterprise stakeholders. The value to businesses and stakeholders comes when data is transformed and delivered in an easy-to-use form.
Tip
Gemba is a Japanese term meaning “the real place.” In the corporate world, gemba refers to the place where value is created.
In an ETL strategy, data transformation is typically performed before the data is loaded into a target system, such as a data warehouse or data lake. Data is extracted from various sources, transformed to match the structure and format of the target system, and then loaded into the target system. This process ensures that the data is consistent and usable across systems and applications.
In contrast, an ELT strategy represents a newer and more flexible approach to data processing. In this strategy, data is first extracted and loaded into a target system before undergoing transformation. ELT offers several advantages, including increased flexibility and the ability to support a wider range of data applications than the traditional ETL paradigm. One significant benefit is its versatility in accommodating various data transformations and real-time insights directly within the target system. This flexibility empowers organizations to derive actionable insights from their data more rapidly and adapt to changing analytical needs.
However, it’s important to acknowledge that ELT can come with higher storage and ingestion costs, given the storage of raw or minimally transformed data. Many businesses find these costs justifiable because of the substantial value—in particular, the flexibility it brings to their operations. Therefore, ELT has gained popularity, especially with the emergence of cloud-based data warehousing solutions and the improved data transformation and processing capabilities they offer.
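To make the contrast concrete, the following sketch shows the ELT pattern in generic warehouse SQL: the raw extract is loaded first, and the cleaning logic runs afterward as a query inside the target system. The table names are illustrative, and the exact load command varies by platform.

-- Extract and load: the raw extract lands in the warehouse as-is,
-- typically via a warehouse-specific COPY/LOAD command or an ingestion tool

-- Transform: the cleaning happens afterward, inside the warehouse
CREATE TABLE customers AS
SELECT
    customer_id,
    TRIM(UPPER(customer_name))    AS customer_name,
    COALESCE(country, 'UNKNOWN')  AS country
FROM raw_customers
WHERE customer_id IS NOT NULL;

In an ETL flow, the same cleaning logic would run in an external tool or script before the load, and only the final customers table would ever reach the warehouse.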
Regardless of the strategy used, without proper data cleaning, transformation, and standardization, data may end up inaccurate, incomplete, or difficult to use, resulting in poor decision making.
The Legacy Processes
Traditionally, legacy ETL processes were complex and time-consuming, and they required specialized skills to develop, implement, and maintain. They also typically required significant manual coding and data manipulation, making them error-prone and difficult to scale.
In addition, these processes were often inflexible and could not be adapted to changing business needs or new data sources. With the growing volume, variety, and velocity of data, legacy ETL processes have become increasingly inadequate and are being replaced by more modern and flexible approaches such as ELT.
In the past, ETL was usually performed using custom scripts or specialized visual-based ETL tools. These scripts or tools extracted data from various sources, such as flat files or databases, performed the necessary transformations on the data, and then loaded the data into a target system, such as a data warehouse.
An example of a legacy ETL process would be using a combination of SQL scripts and programming languages such as Java or C# to extract data from a relational database, transforming the data using the programming language, and then loading the transformed data into a data warehouse. Another example is using specialized ETL tools such as Oracle Data Integrator or IBM InfoSphere DataStage to extract, transform, and load data across systems. These legacy ETL processes can be complex, challenging to maintain and scale, and often require a dedicated team of developers.
Using SQL and Stored Procedures for ETL/ELT
In the past, specific data platforms used stored procedures in a relational database management system (RDBMS) such as SQL Server or Oracle for ETL purposes. Stored procedures are prepared SQL code that you can store in your database engine so that the code can be used repeatedly. Depending on whether it is a data inflow or outflow, the scripts are executed either in the source or the target database.
Suppose you want to create a simple stored procedure to extract from a table, transform the data, and load it into another table, as shown in Example 1-1.
Example 1-1. SQL procedure to extract data
CREATE PROCEDURE etl_example
AS
BEGIN
    -- Extract data from the source table
    SELECT *
    INTO #temp_table
    FROM source_table;

    -- Transform data
    UPDATE #temp_table
    SET column1 = UPPER(column1),
        column2 = column2 * 2;

    -- Load data into the target table
    INSERT INTO target_table
    SELECT *
    FROM #temp_table;
END
This stored procedure first uses the SELECT INTO statement to extract all data from the source table and store it in a temporary table (#temp_table). Then it uses the UPDATE statement to change the values of column1 to uppercase and double the value of column2. Finally, the stored procedure uses the INSERT INTO statement to load the data from the #temp_table into the target_table.
Note
Don't be afraid if you aren't familiar with the SQL syntax. Chapter 3 is fully dedicated to giving you the foundations for working with it.
It is important to note that this is an elementary example and that actual ETL processes are often much more complex and involve many more steps, such as data validation, handling null values and errors, and logging process results.
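As a rough sketch of what those extra steps can look like, the following variation of Example 1-1 adds a basic validation check, null handling, and logging, using SQL Server-style T-SQL; the etl_log table and the error number are hypothetical.

CREATE PROCEDURE etl_example_with_logging
AS
BEGIN
    BEGIN TRY
        -- Validate: stop early if there is nothing to process
        IF NOT EXISTS (SELECT 1 FROM source_table)
            THROW 50001, 'source_table is empty', 1;

        -- Extract
        SELECT * INTO #temp_table FROM source_table;

        -- Transform, handling NULL values explicitly
        UPDATE #temp_table
        SET column1 = UPPER(COALESCE(column1, 'UNKNOWN')),
            column2 = COALESCE(column2, 0) * 2;

        -- Load
        INSERT INTO target_table SELECT * FROM #temp_table;

        -- Log the successful run
        INSERT INTO etl_log (run_at, status, message)
        VALUES (GETDATE(), 'SUCCESS', 'etl_example_with_logging completed');
    END TRY
    BEGIN CATCH
        -- Log the failure together with the error message
        INSERT INTO etl_log (run_at, status, message)
        VALUES (GETDATE(), 'FAILURE', ERROR_MESSAGE());
    END CATCH
END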
Although it is possible to use stored procedures for ETL processes, it is essential to note that using them may have some implications, such as the need for specialized knowledge and expertise to write and maintain those procedures and the lack of flexibility and scalability. In addition, using stored procedures for ETL can make it challenging to integrate with other systems and technologies and troubleshoot problems that arise during the ETL process.
Using ETL Tools
As previously mentioned, ETL tools are software applications that accelerate the process of building ingestion and transformation pipelines by providing a visual interface, a software development kit (SDK), or a programming library with prepackaged code and artifacts that can be used for extracting, transforming, and loading data from various sources into a target, such as a data warehouse or data lake. They are generally used in many organizations to automate the process of transferring data from various systems and databases to a central data warehouse or data lake, where it can be analyzed.
Airflow is a popular open source platform for managing and scheduling data pipelines. Developed by Airbnb, it has gained popularity in recent years because of its flexibility and scalability. Airflow allows users to define, schedule, and monitor data pipelines using Python code, making them easy for data engineers and scientists to create.
Example 1-2 presents a simple Airflow DAG. A directed acyclic graph (DAG) is a directed graph with no directed cycles.
Example 1-2. An Airflow DAG
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import datetime, timedelta

default_args = {
    'owner': 'me',
    'start_date': datetime(2022, 1, 1),
    'depends_on_past': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

dag = DAG(
    'simple_dag',
    default_args=default_args,
    schedule_interval=timedelta(hours=1),
)

task1 = BashOperator(
    task_id='print_date',
    bash_command='date',
    dag=dag,
)

task2 = BashOperator(
    task_id='sleep',
    bash_command='sleep 5',
    retries=3,
    dag=dag,
)

task1 >> task2
This code defines a DAG named simple_dag that runs every hour. It has two tasks, print_date and sleep. The first task executes the date command, which prints the current date and time. The second task executes the sleep 5 command, which makes the task sleep for five seconds. The second task has its number of retries set to 3, so if it fails, it will retry three times before giving up. The two tasks are connected by the >> operator, which means that task2 depends on task1 and will be executed only after task1 completes successfully.
Airflow is a productive tool for scheduling and managing ETL pipelines, but it has some limitations. First, Airflow can be very complex to set up and manage, especially for large or complicated pipelines. Second, it is not explicitly designed for data transformation and may require additional tools or custom code to perform certain types of data manipulation.
Note
dbt can address these Airflow limitations by providing a set of best practices and conventions for data transformation and a simple, straightforward interface for performing and managing data transformation. It can also be integrated with Airflow to supply a complete ETL/ELT solution that is easy to set up and manage while providing high flexibility and control over data pipelines.
The dbt Revolution
dbt is an open source command-line tool that is becoming increasingly popular in the data analytics industry because it simplifies and streamlines the process of data transformation and modeling. On the other hand, Airflow is a powerful open source platform for programmatically creating, scheduling, and monitoring workflows. When dbt is integrated with Airflow, the data pipeline can be more efficiently managed and automated. Airflow can be used to schedule dbt runs, and dbt can be used to perform the data transformation tasks in the pipeline.
This integration enables teams to manage the entire data pipeline from data extraction to loading into a data warehouse, ensuring that data is always up-to-date and accurate. The integration makes it easier to automate data pipeline tasks, schedule and monitor the pipeline, and troubleshoot issues as they arise.
To illustrate the simplicity of building a simple dbt model, imagine that you want to build a model that calculates a company’s total revenue by adding the revenue for each order. The model could be defined using a dbt model file specifying the calculation’s SQL code and any required dependencies or parameters. Example 1-3 presents what the model file could look like.
Example 1-3. A dbt model
{{ config(materialized='table') }}

select
    sum(orders.revenue) as total_revenue
from {{ ref('orders') }} as orders
One of the main advantages of dbt is that analytics engineers can write reusable, maintainable, and testable code for data transformations in a simple high-level language that eliminates the complexity of writing SQL. This facilitates team collaboration on data projects and reduces the risk of errors in the data pipeline.
Another benefit of dbt is that it enables more efficient data pipeline management. By integrating with orchestration tools like Airflow and others such as Dagster or Prefect, as well as dbt Labs’ own dbt Cloud product, dbt empowers teams to effectively plan, schedule, and monitor data pipelines. This ensures that data remains consistently up-to-date and accurate. The synergy between dbt and orchestration tools like Airflow allows for seamless data refresh and the deployment of new logic, akin to CI/CD practices in software engineering. This integration ensures that as new data becomes available or transformations are updated, the data pipeline can be orchestrated and executed efficiently to deliver reliable and timely insights.
Overall, dbt is becoming widespread for organizations looking to improve their data analytics capabilities and streamline their data pipelines. Although it is still a relatively new technology, it is being used by many companies and is considered a valuable tool for data professionals. Chapter 4 will provide a more in-depth view of dbt and its capabilities and features.
Summary
In recent decades, the field of data management has undergone profound transformations, transitioning from structured methods of data storage and access, such as SQL-based stored procedures, to more flexible and scalable workflows. These modern workflows are supported by powerful tools like Airflow and dbt. Airflow facilitates dynamic orchestration, while dbt elevates analytics code to the level of production-grade software, introducing innovative approaches to data testing and transformation.
Amid this dynamic environment, new roles have emerged, with the analytics engineer standing at the intersection of data engineering and data analytics, ensuring the delivery of robust insights. Despite the evolution of tools and roles, the intrinsic value of data remains unchanged. However, data management is evolving into a discipline that focuses not only on data itself but also on the professionals who wield it.
Even with these advancements, the core challenges persist: acquiring critical data, maintaining the highest data-quality standards, storing data efficiently, and meeting stakeholder expectations in data delivery. At the heart of the data value chain lies the revitalization of data modeling. Efficient data modeling goes beyond data gathering; it structures and organizes data to reflect real-world relationships and hierarchies. Chapter 2 will delve into data modeling and its pivotal role in analytics engineering.
Throughout this chapter, we have explored the evolution of data management, the emergence of the analytics engineer role, and concepts like data mesh and the distinction between ELT and ETL strategies. This diverse set of topics aims to provide a comprehensive overview of the data landscape.