Chapter 4. Generate Data Observations

As explained in Chapter 3, data observability combines technology and people to gather information about the state of a system from the data perspective, and the expectations of that state. It then uses the information to make the system more adaptable or resilient.

This chapter explains how to apply the data observability practice. I will start with “data observability at the source,” a method to introduce collection strategies in your day-to-day data work, and I’ll show you how to minimize its impact on efficiency. Then, the chapter elaborates on implementing expectations that will subscribe to the software delivery lifecycle, such as continuous integration and continuous deployment (CI/CD).

As with any emerging practice and technology, to increase data observability adoption, you need to lower the barrier to entry; that way, people have less reason to argue against the change. However, people are also part of the solution, as their involvement in the process is crucial to identify their expectations and codify the rules. To this end, you’ll learn several ways to reduce the effort required to generate observations and understand how to introduce them at the right moment of the development lifecycle.

At the Source

Chapter 2 explained the sources and types of information that help the observer. But how do you generate and collect the information from these sources?

It starts with data observability at the source. The term “source” refers to the applications responsible for reading, transforming, and writing data. These applications can either be the root cause of issues or the means to solve them, as explained in Chapter 3. Additionally, applications are within our control as engineers and organizations, unlike the data itself.

Hence, the method of data observability at the source turns on the ability for applications to generate data observations following the model covered in Chapter 2; in other words, the applications are made data observable.

Pure data and analytics observations—such as data sources, schemas, lineages, and metrics—correlate to reading, transforming, and writing activities. The strategies described in this section address these activities by explaining what to consider when running them.

Generating Data Observations at the Source

Producing data observations at the source begins by generating additional information capturing the behavior of specific activities with data: reading, transforming, and writing. For example, developers can add lines of logs that contain the information needed to generate visibility about what their application does. The instructions use the channels—that is, the logs, metrics, and traces or lineages—covered in Chapter 2 to convey the observations that can be centralized in a logging system.

In the next section, you learn how to create data observations in JSON format that can be collected (published) to local files, a local service, a distant (web) service, or similar destinations. For example, following the data observability core model in Chapter 2, the data source and schema entity of a Postgres table would look like Example 4-1.

Example 4-1. Example of data observations encoded as JSON

{
  "id": "f1813697-339f-5576-a7ce-6eff6eb63249",
  "name": "gold.crm.customer",
  "location": "main-pg:5432/gold/crm/table",
  "format": "postgres"
}
{
  "id": "f1813697-339f-5576-a7ce-6eff6eb63249",
  "data_source_ref": {"by_id": "e21ce0a8-a01e-5225-8a30-5dd809c0952e"},
  "fields": [
  { "name": "lastname", "type": "string", "nullable": true }, 
  { "name": "id", "type": "int", "nullable": false }
  ]
}

This ability to collect data observations in the same model on a centralized platform is key to generating the value of data observability at scale, such as across applications, teams, and departments. That’s why using the data observability core model is important to aggregating data observations with ease, especially along lineages.

Let’s see how to generate data observations in Python using a low-level API, which will be used to introduce higher-level abstractions (covered in the next chapters).

Low-Level API in Python

This strategy of using a low-level API requires a lot of your time and involvement because you are responsible for creating every observation explicitly. However, this strategy also gives you the greatest flexibility, because it doesn’t involve any higher-level abstractions.

On the flip side, the support of data observability at this level, especially during exploration and maintenance, requires the developer to be consistent and always think about what they may want to observe in production (for example, any senior developer should be producing as many lines for logs and checks as for business logic).¹ During development, developers must then create visibility about the modifications brought to the logic or behavior of the application with the generation of the associated observations. Examples of such observations include a connection to a new table, the creation of a new file, or the alteration of a structure with a new field.

In the following sections, you will go through a complete example of data applications written in Python that generate data observations alongside their data usage by doing the following:

Understanding applications without data observability capability
Adding instructions to generate data observations and their purpose
Gaining insights into the pros and cons of using this strategy

Description of the Data Pipeline

Throughout the remainder of the chapter, I will use a data pipeline written in Python that we will make data observable. The code of the pipeline on GitHub allows you to run the examples in this chapter. It uses the pandas library to manipulate CSV files and is composed of two applications, ingestion and reporting, as presented in Figure 4-1.

Both applications (ingestion and reporting) use Python and pandas and share data from the “Daily Stock Prices” data source to create two downstream reports (BuzzFeed stock and AppTech).

The ingestion application reads the CSV files for the daily stock prices that the stock market team provides each month. The team partitioned the files by year and month, then merged the prices into monthly views stored as a separate file, as shown in ingestion in Example 4-2.

Example 4-2. Data ingestion application

import pandas as pd

AppTech = pd.read_csv( 
   "data/AppTech.csv",
   parse_dates=["Date"], 
   dtype={"Symbol": "category"}, 
)
Buzzfeed = pd.read_csv(
   "data/Buzzfeed.csv",
   parse_dates=["Date"],
   dtype={"Symbol": "category"},
)

monthly_assets = pd.concat([AppTech, Buzzfeed]) \ 
   .astype({"Symbol": "category"})
monthly_assets.to_csv(
   "data/monthly_assets.csv", index=False
)

The ingestion application processes two CSV files for the following stocks:

BuzzFeed
AppTech

The application reads the files from the “../data” folder, for reference; a row in these files looks like this:

`1`	`Date`	`Symbol`	`Open`	`High`	`Low`	`Close`	`AdjClose`	`Volume`
`2`	`2022-01-03`	`APCX`	`14.725`	`15.200`	`14.25`	`14.25`	`14.25`	`1600`

During exploration, our data engineer determined that the files have two fields that need to be preprocessed. The Date field must be parsed as a date, and the Symbol field must be treated as a category.

The application reads and processes the daily stocks before making them available in memory. Then, the application merges them into a single variable, which in this case is a new pandas DataFrame, containing all categories in Symbol.

The script writes the result into the same folder as a new file named monthly_assets.csv. This way, other analysts or data engineers can start there if they need all daily stock prices for further analysis.

After this first script runs and the file is available, the remaining pipeline scripts can run to generate the BuzzFeed and AppTech stock reports. This example has only one script, the reporting Python file, shown in Example 4-3.

Example 4-3. Data reporting application

import pandas as pd

all_assets = pd.read_csv("data/monthly_assets.csv", parse_dates=['Date']) 

apptech = all_assets[all_assets['Symbol'] == 'APCX'] 
buzzfeed = all_assets[all_assets['Symbol'] == 'BZFD']

buzzfeed['Intraday_Delta'] = buzzfeed['Adj Close'] - buzzfeed['Open'] 
apptech['Intraday_Delta'] = apptech['Adj Close'] - apptech['Open']

kept_values = ['Open', 'Adj Close', 'Intraday_Delta']

buzzfeed[kept_values].to_csv("data/report_buzzfeed.csv", index=False) 
apptech[kept_values].to_csv("data/report_appTech.csv", index=False)

This application performs the following actions:

Reads the data source created by the previous application

Filters out the data for the two reports it creates: BuzzFeed and AppTech.

Computes the Intraday_Delta scores, which are the daily stock evolutions to report.

Reports the Open, Adj Close, and the computed score Intraday_Delta values in the dedicated files for each stock.

To execute the pipeline appropriately, it is important to respect the dependencies between the reporting and ingestion applications. That is, you must run the ingestion application successfully before you run the reporting application. Otherwise, the pipeline will fail. A solution to achieve this is to use an orchestrator that runs ingestion before reporting, such as Dagster and Airflow. In fact, the orchestrator is another application to configure by explicitly hardcoding the dependencies between applications at the data level. However, the applications themselves are still ignorant of their downstream dependencies.

The flip side of hardcoding the dependencies in an orchestrator is that it is a new asset to maintain (e.g., the accuracy of the explicit dependencies). Also, additional constraints are imposed when creating a pipeline because teams must stay independent and therefore can’t simply update an existing pipeline to meet their own needs. Thus, an extension must be created at a higher level by adding a new, separated DAG, which could disconnect the explicit dependencies.

Returning to our pipeline, let’s discuss the functional dependencies between the applications; that is, ingestion must run successfully before reporting can be run. But what does successfully mean?

Definition of the Status of the Data Pipeline

To determine whether an execution of a data pipeline is successful, I will analyze the opposite question: What kind of failures can occur in the ingestion?

Explicit failure: This causes the application to crash. This type of failure is easy for the orchestrator to handle if you use one, because it’s a flag not to trigger the next application: the reporting application, in our example.
Silent failure: In this case, the application finishes, without, for example, an error code or log. Because it doesn’t run as expected, you must consider the notion of expectations introduced in Chapter 2.

An observer of the ingestion application would face explicit failures like these:

File not found errors: Occur when any of the data files, such as Buzzfeed.csv, aren’t available in the folder because it was renamed or changed to lowercase, or the file wasn’t created before running the ingestion application.
Type errors (TypeError): Occur if some values aren’t coercible to match the types provided to the read_csv function, such as when the symbol should be a category.
Hardcoded name errors: Occur when any of the fields explicitly used in the code to access the values, such as the column name Date in Example 4-3 and Symbol in this case, aren’t present or the name has changed.
Filesystem errors: Happen when the files aren’t readable or the folder isn’t writable for the user who runs the application.
Memory errors: Happen when the files increase in size to the point that the memory allocated to the application is no longer sufficient.
System errors: Triggered if the disk doesn’t have any more space to write the aggregated result.

But from the perspective of the engineer who observes it, the following examples indicate silent failures:

The Date column isn’t parsable as a date because it’s malformed, the pattern changed, or the time zone isn’t consistent. In this case, the column isn’t a datetime anymore but an object.
The Date column contains values but not for the current month. All values are past or future dates.
The Date column contains values in the future because the generator might have run later and generated information about the future to compare to the month in which it’s processing. This situation can generate duplicates later, or inconsistent values for the same dates, and can likely fail some aggregations.
The Symbol category changes for various reasons, such as because of a typo or a capitalization change, length, or reference. This category is used across the files and is written in the output file as a category.

Also, the reporting application might regard the ingestion as failed because it provoked the following failures in reporting:

The monthly aggregated file wasn’t written when the reporting started, making it unavailable when the reporting tool is scheduled to run at a given interval.
Any of the fields used to filter in the arithmetic aren’t available or their name changed.
The same errors occur with read/write access, size, and space.

Also, reporting can fail silently due to the silent failures of the ingestion, plus:

The APCX and ENFA symbols aren’t available in the provided file.
Adj Close or Open are missing values, which generates garbage in the Intraday_Delta output. This problem can also be an explicit failure, in which case the values become pandas’ NA.
The aggregated file doesn’t contain information about the current month, but the dates are in the past.

Because any of these failures can occur, you must have visibility when they happen and—even better—anticipate them earlier to prevent their propagation (see “Fail Fast and Fail Safe”) from the ingestion application.

The explicit failures should already have been made visible, as a development practice, to catch those errors explicitly (try…except in Python). However, for an observer to identify and discover silent failures, they need the application to produce the appropriate observations.

Data Observations for the Data Pipeline

In this section, I will provide an overview of the data observations the data pipeline must generate. For this, let’s have a quick glance at Figure 4-2, which shows how a low-level API implements the model presented in Chapter 2. It is interesting to note that they share a similar structure and even some entities (labeled); in the next paragraphs, I will detail each portion individually to highlight these facts.

The observations for ingestion and reporting and their similarities

In this diagram, you’ll notice labeled entities as uppercase letters A, B, C, and D in circles. The “A” data sources highlight the observations generated by the ingestion application about the data it generates, and by the reporting as it is reading it, hence making explicit the implicit dependency.

In fact, both applications generate several similar observations, which represent all dependencies that tie them together. In Figure 4-2, the following similar observations are also highlighted:

The “B” entities observe the server the data was retrieved from.
The “C” entities observe the user executing commands.
The “D” entities observe the schema of the data generated by the ingestion, as read by the reporting.

Let’s dig into what we need to add to the application’s code to generate the observations shown in Figure 4-2. Because the code is in Python, we’ll use the logging module to print the observations that are encoded in JSON.

Generate Contextual Data Observations

In this section, I will cover the code needed to generate observations about the execution context of the ingestion application shown in Figure 4-3 (note that reporting can reuse the same code).

Data observations about the execution context of the ingestion application

Insert the code in Example 4-4 at the beginning of the file to generate the observations for the ingestion application.

Example 4-4. Generate data observations about the ingestion application

app_user = getpass.getuser() 
repo = git.Repo(os.getcwd(), search_parent_directories=True) 
code_repo = repo.remote().url
commit = repo.head.commit
code_version = commit.hexsha
code_author = commit.author.name
application_name = os.path.basename(os.path.realpath(__file__)) 
application_start_time = datetime.datetime.now().isoformat()

Retrieve user name information.

Retrieve git information.

Retrieve information about the application execution.

The extra instructions in Example 4-5 create variables for the observations, but nothing is done yet with them. To log the information as mentioned previously, we use a JSON representation of the information model, encoded as in Example 4-5.

Example 4-5. Modeling data observations about the ingestion runtime

application_observations = { 
   "name": application_name,
   "code": {
       "repo": code_repo,
       "version": code_version,
       "author": code_author
   },
   "execution": {
       "start": application_start_time,
       "user": app_user
   }
}

This code creates the JSON that’s formed with all the observations created so far. However, this section is about using a low-level API for data observability. As we move on, we will encounter a similar pattern, giving us the opportunity to create functions to simplify the code and share it across the ingestion and reporting applications, or any other Python applications in the future.

To create an API, we create a model that mimics the observation core model in JSON, turns each entity into a class, and converts relations into references (see Example 4-6 ).

Example 4-6. Modeling the application data observations using dedicated classes

    class Application: 
       name: str
    
       def __init__(self, name: str, repository: ApplicationRepository) -> None:
           self.name = name
           self.repository = repository
    
       def to_json(self):
           return {"name": self.name, "repository": self.repository.to_json()}
    
    class ApplicationRepository: 
       location: str
    
       def __init__(self, location: str) -> None:
           self.location = location
    
       def to_json(self):
           return {"location": self.location}
    
    app_repo = ApplicationRepository(code_repo)
    app = Application(application_name, app_repo)

That means the application entity has to have an Application class with a property name that can hold the name of the file as an application_name variable, and a reference to an ApplicationRepository instance. This ApplicationRepository entity will be encoded as an ApplicationRepository class that has a property location set as the git remote location. This structure will help build the model and generate a JSON representation more easily that will be reusable and able to lead to standardization.

An additional benefit of encoding concepts into API classes is that it gives them the responsibility to propose helpers to extract associated observations, as in Example 4-7 .

Example 4-7. Leverage classes to define helpers related to modeled entities

class ApplicationRepository:
   location: str

   # [...]

   @staticmethod
   def fetch_git_location(): 
       import git
       code_repo = git.Repo(os.getcwd(), search_parent_directories=True).remote().url
       return code_repo


class Application:
   name: str

   # [...]
   @staticmethod
   def fetch_file_name(): 
       import os
       application_name = os.path.basename(os.path.realpath(__file__))
       return application_name


app_repo = ApplicationRepository(ApplicationRepository.fetch_git_location())
app = Application(Application.fetch_file_name(), app_repo)

This strategy might be a straightforward way to implement the model. However, we prefer another approach that weakens the links between entities. In Example 4-8, all information will be logged in one JSON with entities spread into the tree of information with Application at its root. This encoding forces us to create all observations before logging the root one, which is the Application instance. The Application constructor would become something like Example 4-8.

Example 4-8. Bloated constructor for Application without separation of concerns

class Application:
    name: str
    def __init__(self, name: str, version: ApplicationVersion, 
                 repo: ApplicationRepository, execution: ApplicationExecution, 
                 server: Server, author: User) -> None:
        pass

To avoid this complexity and constraint, a better way is to reverse the dependencies between the entities. Instead of having the Application contain its ApplicationVersion or ApplicationRepository, we’ll create the Application alone and then add a weak reference to it from within the ApplicationVersion and ApplicationRepository. Example 4-9 shows how this model would look.

Example 4-9. Reversing the dependencies between entities and introducing `id`

class ApplicationRepository:
   location: str
   application: Application
   id: str

   def __init__(self, location: str, application: Application) -> None:
       self.location = location
       self.application = application
       id_content = ",".join([self.location, self.application.id])
       self.id = md5(content.encode("utf-8")).hexdigest() 

   def to_json(self):
       return { "id": self.id, 
                "location": self.location, 
                "application": self.application.id }

   @staticmethod
   def fetch_git_location():
       import git
       code_repo = git.Repo(os.getcwd(), 
                            search_parent_directories=True).remote().url
       return code_repo

With this model, we can log the observations individually—the two calls to logging.info—reducing the amount of information to retain. Because we need to recompose the entity relationships, we introduce the id variable to reduce the amount of information to log and observations to link. Using the logs, id can reconstruct the links in the model, like the dependency between the ApplicationRepository and the Application, by their ids, because they have already been logged.

In this example, the application generated id locally, resulting in a poor design that will make it inconsistent across several executions. To circumvent this problem, we must define a functional id that can identify the entities across executions, deployments, and applications. This notion is known as the primary key in modeling. You can use a primary key as an input in a hashing algorithm, for example, that will generate the id in a deterministic fashion, in this case using hashlib.

Example 4-9 illustrates how to use the primary key to generate the id consistently, for example, by using md5 . We’ll use this strategy throughout this chapter to generate entities.

Generate Data-Related Observations

Let’s now cover the observations of the main data observability components: data source, schema, and data metrics. Figure 4-4 shows the observations we must log.

Entities data source, schema, and data metrics of ingestion.

The observations to generate are related to the data source that’s read and written. Then, the transformation (lineage) is addressed. In the ingestion code, many sources are read using the pandas read_csv function.

The observation for the data source will be mainly the file path provided as an argument to the read_csv function. Another observation would be the data source format provided by the fact that read_csv is used without any specific parser attributes, such as sep, and thus the data is a “real” CSV.

Also, read_csv isn’t given any information about the header. Therefore, the first non-blank line is expected to be the column names, which will help create an observation about the data the ingestion application accesses.

The type of columns is another piece of information that can be observable. pandas infers most of the numeric types. However, strings, categories, and dates are kept as objects. The read_csv function provides two hints: the Date is date information, and the Symbol is categorical data, or a string in this case.

Describe Function

You have likely used the describe function many times when exploring and developing your application. This function is essential for knowing what you must do with the data to achieve your goal. This exploration, which is based on a set of assumptions about data, leads to understanding the data. Therefore, logging the result of the describe function gives you the information to understand the data later if the assumptions about it hold true in the future.

For example, in Example 4-1, computing and logging as JSON, the BuzzFeed’s describe could be logging.info(json.dumps(Buzzfeed.describe().to_dict())), which logs the following JSON line:

 
{"Open": {"count": 20.0, "mean": 4.4985, "std": 0.503014857999663, 
"min": 3.72, "25%": 4.077500000000001, "50%": 4.43, "75%": 4.9525, 
"max": 5.28}, "High": {"count": 20.0, "mean": 4.6921, 
"std": 0.5150902731305302, "min": 3.98, "25%": 4.30675, "50%": 4.66, 
"75%": 5.0425, "max": 5.57}, "Low": {"count": 20.0, "mean": 4.31515, 
"std": 0.5218698441283289, "min": 3.622, "25%": 3.96, "50%": 4.27, 
"75%": 4.595, "max": 5.24}, "Close": {"count": 20.0, "mean": 4.4345, 
"std": 0.506894310586796, "min": 3.73, "25%": 4.0275, "50%": 4.39, 
"75%": 4.71, "max": 5.29}, "Adj Close": {"count": 20.0, "mean": 4.4345, 
"std": 0.506894310586796, "min": 3.73, "25%": 4.0275, "50%": 4.39, 
"75%": 4.71, "max": 5.29}, "Volume": {"count": 20.0, "mean": 193095.0, 
"std": 104746.62901245884, "min": 79300.0, "25%": 104350.0, 
"50%": 168350.0, "75%": 234975.0, "max": 386500.0}}

With the values, pandas creates a DataFrame, a sort of table we can use to catch the information listed so far. Moreover, the DataFrame also allows us to compute additional intrinsic information about the values themselves, such as descriptive statistics. We use the describe function to compute these basic statistics on numerical values and other types like categorical or date values using the argument: include='all'.

Finally, the ingestion application writes the monthly assets CSV file by using the to_csv function on the DataFrame that holds in memory the values to be written. This function gives almost the same level of information as the read_csv function. So, we can derive the file path, columns, and types.

Let’s see how we can model these observations in a reusable API. The first class is DataSource, which is relatively straightforward to modelize, as shown in Example 4-10. We mainly want to know the format and where the data source the application is reading or writing is located . Therefore, it will model the path to the file.

Example 4-10. DataSource entity class

class DataSource:
   location: str
   format: str
   id: str

   def __init__(self, location: str, format: str = None) -> None: 
       self.location = location
       self.format = format
       id_content = ",".join([self.location, self.application.id])
       self.id = md5(content.encode("utf-8")).hexdigest() 

   def to_json(self):
       return {"id": self.id, "location": self.location, "format": self.format}

Also, id is generated based on the provided path, which is relative (starts with ./), and should be expanded to use the absolute path. Further, if we use a database table, such as a method other than read_csv, we’d need to add other helpers to handle connection strings.

Next, Example 4-11 can modelize the schema that the application manipulates in the data source.

Example 4-11. Schema entity class

class Schema:
   fields: list[tuple[str, str]]
   data_source: DataSource
   id: str

   def __init__(self, fields: list[tuple[str, str]], data_source: DataSource) -> None:
       self.fields = fields
       self.data_source = data_source
       linearized_fields = ",".join(list(map(lambda x: x[0] + "-" + x[1], 
                                    sorted(self.fields))))
       id_content = ",".join([linearized_fields, self.data_source.id])
       self.id = hashlib.md5(id_content.encode("utf-8")).hexdigest()

   def to_json(self):
       from functools import reduce
       jfields = reduce(lambda x, y: dict(**x, **y), 
                        map(lambda f: {f[0]: f[1]}, 
                        self.fields))
       return {"id": self.id, "fields": jfields, "data_source": self.data_source.id}

   @staticmethod
   def extract_fields_from_dataframe(df: pd.DataFrame): 
       fs = list(zip(df.columns.values.tolist(), map(lambda x: str(x), 
             df.dtypes.values.tolist())))
       return fs

The Schema class has the fields property, which models the columns of the CSV file. Here, we can extract it from the metadata of the pandas DataFrame: columns and dtypes. We chose it to represent the fields as a list of pairs (field name, field type) .

The last class for this part is DataMetrics, which will model the metrics of the files that the ingestion application reads and writes. However, it’s an incomplete version of the class because it encodes only the relationship with Schema. It must be extended to ensure it provides visibility into the data metrics for a specific usage. This purpose requires a lineage that comes later (see Example 4-15).

The current class looks like Example 4-12.

Example 4-12. DataMetrics entity class

class DataMetrics:
   schema: Schema
   metrics: list[tuple[str, float]]
   id: str

   def __init__(self, metrics: list[tuple[str, float]], schema: Schema) -> None:
       self.metrics = metrics
       self.schema = schema
       self.id = hashlib.md5(",".join([self.schema.id]).encode("utf-8")).hexdigest()

   def to_json(self):
       from functools import reduce
       jfields = reduce(lambda x, y: dict(**x, **y), 
                        map(lambda f: {f[0]: f[1]}, 
                        self.metrics))
       return {"id": self.id, "metrics": jfields, "schema": self.schema.id}

   @staticmethod
   def extract_metrics_from_dataframe(df: pd.DataFrame): 
       d = df.describe(include='all', datetime_is_numeric=True) 
       import math
       import numbers
       metrics = {}
       filterf = lambda x: isinstance(x[1], numbers.Number) and not math.isnan(x[1])
       mapperf = lambda x: (field + "." + x[0], x[1])
       for field in d.columns[1:]:
           msd = dict(filter(filterf, map(mapperf, d[field].to_dict().items())))
           metrics.update(msd)
       # metrics looks like: 
       # {"Symbol.count": 20, "Symbol.unique": 1, "Symbol.freq": 20, 
       #   "Open.count": 20.0, "Open.mean": 3.315075, "Open.min": 1.68, 
       #   "Open.25%": 1.88425"Open.75%": 2.37, "Open.max": 14.725, 
       #   "Open.std": 3.7648500643766463, ...}
       return list(metrics.items())

The metrics were chosen in this simplified form to represent only the numerical descriptive statistics as a list of pairs (metric name, numerical value). These observations will convert directly from the result of the describe function called on the DataFrame that holds the data values.

Now that we have defined all entities as classes, we can update our code (see Example 4-13) to ensure it generates the observations appropriately.

Example 4-13. Ingestion application including data observations entities

app = Application(Application.fetch_file_name())
app_repo = ApplicationRepository(ApplicationRepository.fetch_git_location(), app)

AppTech = pd.read_csv(
   "data/AppTech.csv",
   parse_dates=["Date"],
   dtype={"Symbol": "category"},
)
AppTech_DS = DataSource("data/AppTech", "csv")
AppTech_SC = Schema(Schema.extract_fields_from_dataframe(AppTech), AppTech_DS)
AppTech_M = DataMetrics(DataMetrics.extract_metrics_from_dataframe(AppTech), 
             AppTech_SC)

Buzzfeed = pd.read_csv(
   "data/Buzzfeed.csv",
   parse_dates=["Date"],
   dtype={"Symbol": "category"},
)
Buzzfeed_DS = DataSource("data/Buzzfeed", "csv")
Buzzfeed_SC = Schema(Schema.extract_fields_from_dataframe(Buzzfeed), Buzzfeed_DS)
Buzzfeed_M = DataMetrics(DataMetrics.extract_metrics_from_dataframe(Buzzfeed), 
                          Buzzfeed_SC)

monthly_assets = pd.concat([AppTech, Buzzfeed]) \
   .astype({"Symbol": "category"})
monthly_assets.to_csv(
   "data/monthly_assets.csv", index=False
)
monthly_assets_DS = DataSource("data/monthly_assets", "csv")
monthly_assets_SC = Schema(Schema.extract_fields_from_dataframe(monthly_assets), 
                           monthly_assets_DS)
monthly_assets_M = DataMetrics(
                    DataMetrics.extract_metrics_from_dataframe(monthly_assets),
                                monthly_assets_SC)

Code Evolution

This version of the code requires tremendous changes and an extra layer of instructions. Reviewing this code to understand what it does helps us understand what we need to observe and how. Examples 4-14 and 4-21 show simplified versions with a function removing duplicate instructions.

Also, the next sections introduce many strategies to reduce the number of changes and work to the point where data observation generation is almost automated and invisible.

We now have code that generates the observations we need to observe the data behavior when running the ingestion application. However, plenty of code is almost identical. This function in Example 4-14 can be used to remove most of the noise.

Example 4-14. Wrapping duplicated code in higher level function

def observations_for_df(df_name: str, df_format: str, df: pd.DataFrame) -> None:
   ds = DataSource(df_name, df_format)
   sc = Schema(Schema.extract_fields_from_dataframe(df), ds)
   ms = DataMetrics(DataMetrics.extract_metrics_from_dataframe(df), sc)

In this section, we work with files, but what if you have to deal with tables instead, using SQL to read, transform, and write the data? If SQL is used directly to perform these operations, it contains the same operations as a Python code would have. That is, it reads tables, transforms them, and eventually writes them; for example, an insert of a sub-select. However, SQL doesn’t provide many capabilities to generate information in-the-middle, such as metadata extraction or metrics. Another application, which the database server should be responsible for, runs the SQL.

Remember SQL holds a lot of information that your application can leverage. It contains the names of the tables, columns, types, and transformations performed; hence, parsing the queries gives similar observations, as presented in Example 4-14. Actually, I recommend this extraction strategy for SQL queries because an SQL query is likely the result of several iterations and experiments until it does exactly what it is meant to do (like any code, in fact). Therefore, the SQL query implements all assumptions made in each iteration—for example, including the following:

IS NOT NULL represents the assumption that columns can be null and it is fine to filter them out (assuming that their number is not impacting the final logic).
cast("Amount" as "INT64") represents the assumption that the Amount is always castable as an integer.

An SQL Query Is a Program

SQL is procedural logic. Like Python logic, it is encoded in a domain specific language (DSL) and, like pandas, as a character string sent and processed by the database server. In fact, it is very similar to a Python code, being a character string sent and processed by the Python executable.

After adding the ingestion capability to generate data and application observations, we can work on another area that will provide even more information—observations about the analytical part—when the application is deployed.

Generate Lineage-Related Data Observations

To generate observations about the interaction at the data level between the applications in the pipeline, down to the column level, I will cover in this section how to generate the lineage entity as shown in Figure 4-5.

Entities representing the lineage executing by the ingestion application

As discussed in Chapter 2, the part of the model that connects the lineage—the data flow and transformations, in this case—glues the observations from the main component of the data observability area to the application component. It’s in the analytics observability area, at the intersection of all three areas.

To encode this part, we must generate observations about which data sources (inputs) are used to produce other data sources (outputs). For the ingestion, the stock CSV files help generate the monthly_assets CSV file. Going one level deeper, ingestion concatenates all values from all columns and then writes the result in a file. The lineage at the field level is direct, meaning the input and output column names are identical and an output takes its value only from this column for each input data source.

Example 4-15 shows how to include the generation of the lineage—OutputDataLineage —in the reporting application with a dedicated class defining the dependencies of an output with its inputs.

Example 4-15. Class to model lineage observations

class OutputDataLineage: 
   schema: Schema
   input_schemas: list[tuple[Schema, dict]]
   id: str

   def __init__(self, schema: Schema, 
                input_schemas_mapping: list[tuple[Schema, dict]]) -> None:
       self.schema = schema
       self.input_schemas_mapping = input_schemas_mapping
       self.id = hashlib.md5(",".join([self.schema.id]).encode("utf-8") \
                  + self.linearize().encode("utf-8")).hexdigest()

   def to_json(self):
       return {"id": self.id, "schema": self.schema.id, 
              "input_schemas_mapping": self.input_schemas_mapping}

Example 4-16 presents a static function part of the OutputDataLineage that will map each input with the output at the field level, connecting them in this way: output_field -> list of input fields for each input . The generate_direct_mapping helper binds the data sources using an overly simple heuristic that maps output data sources when field names match.

This strategy fails in most real-life use cases, especially with aggregations, which require more conscientious tracking to manage all connections.

You can avoid this situation by using one of the strategies discussed in the next chapter.

Example 4-16. Helper method to generate a lineage easily

   @staticmethod
   def generate_direct_mapping(output_schema: Schema, input_schemas: list[Schema]): 
       input_schemas_mapping = []
       output_schema_field_names = [f[0] for f in output_schema.fields]
       for schema in input_schemas:
           mapping = {}
           for field in schema.fields:
               if field[0] in output_schema_field_names:
                   mapping[field[0]] = [field[0]]
           if len(mapping):
               input_schemas_mapping.append((schema, mapping))
       return input_schemas_mapping

   define linearize(self):
       [...]
       return linearized

Then, I define in Example 4-17 the lineage execution, ApplicationExecution, which makes the connection with the contextual (application-related) observations.

Example 4-17. Class to model the execution of a lineage

class DataLineageExecution:
   lineage: OutputDataLineage
   application_execution: ApplicationExecution 
   start_time: str
   id: str

   def __init__(self, lineage: OutputDataLineage, 
                application_execution: ApplicationExecution) -> None:
       self.lineage = lineage
       self.application_execution = application_execution
       self.start_time = datetime.datetime.now().isoformat()
       self.id = hashlib.md5(
           ",".join([self.lineage.id, self.application_execution.id, 
                    self.start_time]).encode("utf-8")).hexdigest()

   def to_json(self):
       return {"id": self.id, "lineage": self.lineage.id, 
              "application_execution": self.application_execution.id,
               "start_time": self.start_time}

Also, the data observability core model introduced in Chapter 2 linked the data metrics entity to the execution of the lineage to give visibility about the data when it’s used or modified. Therefore, we adapt the DataMetrics model to represent that link by adding the lineage_execution property , as shown in Example 4-18.

Example 4-18. Adding the lineage execution to the metric for context

class DataMetrics:
   schema: Schema
   lineage_execution: DataLineageExecution 
   metrics: list[tuple[str, float]]
   id: str

   def __init__(self, metrics: list[tuple[str, float]], schema: Schema,
                lineage_execution: DataLineageExecution) -> None:
       self.metrics = metrics
       self.schema = schema
       self.lineage_execution = lineage_execution
       id_content = ",".join([self.schema.id, self.lineage_execution.id]
       self.id = hashlib.md5(id_content).encode("utf-8")).hexdigest()

   def to_json(self):
       from functools import reduce
       jfields = reduce(lambda x, y: dict(**x, **y), 
                        map(lambda f: {f[0]: f[1]}, self.metrics))
       return {"id": self.id, "metrics": jfields, "schema": self.schema.id,
               "lineage_execution": self.lineage_execution.id}

   @staticmethod
   def extract_metrics_from_df(df: pd.DataFrame):
       d = df.describe(include='all', datetime_is_numeric=True)
       import math
       import numbers
       metrics = {}
       filterf = lambda x: isinstance(x[1], numbers.Number) and not math.isnan(x[1])
       mapperf = lambda x: (field + "." + x[0], x[1])
       for field in d.columns[1:]:
           msd = dict(filter(filterf, map(mapperf, d[field].to_dict().items())))
           metrics.update(msd)
       return list(metrics.items())

Now all the pieces are ready to generate the observations shown in Figure 4-5. The final ingestion script can be seen in the GitHub repository.

Wrap-Up: The Data-Observable Data Pipeline

Before moving on with the analyses of how the observations help with the explicit and silent failures introduced at the beginning of this section, we reuse what we’ve done so far for the ingestion application to make the reporting application data observable. See Example 4-19.

Example 4-19. Reporting application with verbose data observations generated from code

import ApplicationRepository.fetch_git_location
import ApplicationVersion.fetch_git_version
app = Application(Application.fetch_file_name())
app_repo = ApplicationRepository(fetch_git_location(), app)
git_user = User(ApplicationVersion.fetch_git_author())
app_version = ApplicationVersion(fetch_git_version(), git_user, app_repo)
current_user = User("Emanuele Lucchini")
app_exe = ApplicationExecution(app_version, current_user)

all_assets = pd.read_csv("data/monthly_assets.csv", parse_dates=['Date'])

apptech = all_assets[all_assets['Symbol'] == 'APCX']
buzzfeed = all_assets[all_assets['Symbol'] == 'BZFD']

buzzfeed['Intraday_Delta'] = buzzfeed['Adj Close'] - buzzfeed['Open']
apptech['Intraday_Delta'] = apptech['Adj Close'] - apptech['Open']

kept_values = ['Open', 'Adj Close', 'Intraday_Delta']

buzzfeed[kept_values].to_csv("data/report_buzzfeed.csv", index=False)
apptech[kept_values].to_csv("data/report_appTech.csv", index=False)

all_assets_ds = DataSource("data/monthly_assets.csv", "csv")
all_assets_sc = Schema(Schema.extract_fields_from_dataframe(all_assets), 
                 all_assets_ds)
buzzfeed_ds = DataSource("data/report_buzzfeed.csv", "csv")
buzzfeed_sc = Schema(Schema.extract_fields_from_dataframe(buzzfeed), buzzfeed_ds)
apptech_ds = DataSource("data/report_appTech.csv", "csv")
apptech_sc = Schema(Schema.extract_fields_from_dataframe(apptech), apptech_ds)
# First lineage
lineage_buzzfeed = OutputDataLineage(buzzfeed_sc,
                                    OutputDataLineage.generate_direct_mapping
                                    (buzzfeed_sc, [all_assets_sc]))
lineage_buzzfeed_exe = DataLineageExecution(lineage_buzzfeed, app_exe)
all_assets_ms_1 = DataMetrics(DataMetrics.extract_metrics_from_df(all_assets), 
                    all_assets_sc, lineage_buzzfeed_exe)
buzzfeed_ms = DataMetrics(DataMetrics.extract_metrics_from_df(buzzfeed), buzzfeed_sc, 
               lineage_buzzfeed_exe)
# Second lineage
lineage_apptech = OutputDataLineage(apptech_sc,
                                    OutputDataLineage.generate_direct_mapping
                                    (apptech_sc, [all_assets_sc]))
lineage_apptech_exe = DataLineageExecution(lineage_apptech, app_exe)
all_assets_ms_2 = DataMetrics(DataMetrics.extract_metrics_from_df(all_assets), 
                    all_assets_sc, lineage_apptech_exe)
apptech_ms = DataMetrics(DataMetrics.extract_metrics_from_df(apptech), apptech_sc, 
              lineage_apptech_exe)

By adding the observations this way, we keep the modifications similar to what we did in the ingestion application. This approach enables us to build habits and abstractions, such as a framework, that will reduce the number of changes needed—almost an implicit law in development.

In Example 4-19, notice that the observations generated for the inputs shifted to the end . We made this implementation choice for the simplicity of the example. An upside is the additional computations are done at the end without impacting the business flow. A downside is that, if something fails in between, no observations about the data sources and their schema are sent. Of course, it is possible to avoid this situation with some adaptations to the code. Also, for ‌this introduction to a low-level API, we have to add some boilerplate to generate the right information, which may sound like noise, proportionally to the business code. However, keep in mind that the script initially had no logs. In general, logs are added sporadically to generate some information about the script’s own behavior, which is what we’ve done, but for the data.

Also, keep these points in mind:

We reused the OutputDataLineage.generate_direct_mapping helper as is to create lineage mapping between the outputs and input. However, it won’t work, because we aggregated Adj Close and Open from the monthly_assets.csv file into the new Intraday_Delta column. Because the fields don’t have the same name, the “direct” heuristic won’t pick up this dependency.
A warning message is displayed about reporting the same observations a second time for the monthly_assets. We did this because we encoded lineage per output. We now have two lineages, each producing output for the report_buzzfeed.csv and report_AppTech.csv files. Because the output reuses the same data as the input (filtered), we must report what the input looks like for each output to avoid them appearing as duplicates. As an alternative, we could reuse the observations or adapt the model to address this duplication. You might consider the following options instead:
- If we change our strategy to read every time the data is accessed instead of loading it in memory, then if the data changes between the two writing operations, the observations will no longer be identical. If one output has issues, we prefer to synchronize the observations of the input with this lineage. The probability of this situation increases when you consider each writing operation can take hours, not seconds.
- The reporting application generates all outputs at every run, but refactoring later can change this and make it parameterable. For example, only one output might be created, such as BuzzFeed. Therefore, each reporting dataset is generated by independent runs. In this case, the observations already represent this appropriately, so we don’t need to adjust the logic. In other words, sending the observations of a given input as often as it is used to create an output represents reality, rather than trying to optimize to reduce data that might look like duplicates.

Let’s address the first point and ensure that the lineage can represent the real connections between the data sources. To do this in a simplified manner, we will update the helper function introduced in Example 4-19 with extra information. Example 4-20 shows the new version of this helper which now includes an argument to provide nondirect mapping for each input. Later sections present strategies to handle this common use case in a much easier, efficient, maintainable, and accurate way, such as by using monkey patching.

Example 4-20. Generated field level lineage based on matching field names

@staticmethod
def generate_direct_mapping(output_schema: Schema, 
                            input_schemas: list[tuple[Schema, dict]]):
   input_schemas_mapping = []
   output_schema_field_names = [f[0] for f in output_schema.fields]
   for (schema, extra_mapping) in input_schemas:
       mapping = {}
       for field in schema.fields:
           if field[0] in output_schema_field_names:
               mapping[field[0]] = [field[0]]
       mapping.update(extra_mapping) 
       if len(mapping):
           input_schemas_mapping.append((schema, mapping))
   return input_schemas_mapping

Example 4-21 shows the final observations part of the reporting application.

Example 4-21. Reporting application made data observable with clean code

# First lineage
intraday_delta_mapping = {"Intraday_Delta": ['Adj Close', 'Open']}
a = (all_assets_sc, intraday_delta_mapping)
lineage_buzzfeed = OutputDataLineage(buzzfeed_sc,
                                    OutputDataLineage.generate_direct_mapping(
                                    buzzfeed_sc, [(all_assets_sc, 
                                    intraday_delta_mapping)]))
lineage_buzzfeed_exe = DataLineageExecution(lineage_buzzfeed, app_exe)
all_assets_ms_1 = DataMetrics(DataMetrics.extract_metrics_from_df(all_assets), 
                   all_assets_sc, lineage_buzzfeed_exe)
buzzfeed_ms = DataMetrics(DataMetrics.extract_metrics_from_df(buzzfeed), 
               buzzfeed_sc, lineage_buzzfeed_exe)
# Second lineage
lineage_apptech = OutputDataLineage(apptech_sc,
                                   OutputDataLineage.generate_direct_mapping(
                                    apptech_sc, [(all_assets_sc, 
                                                 intraday_delta_mapping)]))
lineage_apptech_exe = DataLineageExecution(lineage_apptech, app_exe)
all_assets_ms_2 = DataMetrics(DataMetrics.extract_metrics_from_df(all_assets), 
                   all_assets_sc, lineage_apptech_exe)
apptech_ms = DataMetrics(DataMetrics.extract_metrics_from_df(apptech), apptech_sc, 
              lineage_apptech_exe)

Using Data Observations to Address Failures of the Data Pipeline

Now, we can deploy, run, and monitor our pipeline by using the observations they generate each time they’re run. The low-level API required quite a bit of extra effort and conviction to achieve. Nevertheless, we are satisfied with the results. Every minute spent on those tasks will give us 100 times the benefit in the form of avoidance of revenue loss—following the total quality cost rule, 1-10-100—when issues happen in production.

Let’s look at the issues that can arise that were mentioned in this section, starting with the ingestion application failures:

Input files not found

The DataSource observations are sent every run. After they’ve been read, the failing runs are, therefore, not sending any of them. Even for someone without any knowledge about the application logic, it’s clear the files used until now are missing.

Type errors at read

The Schema observations are sent and contain the field names associated with their type. Therefore, the expected type is clear for the observer without going to the application or the files for the previous months.

Errors due to missing fields

The same observations as for “Type errors at read” help the observer quickly identify which fields were present in the previous runs that are missing in the current one.

Filesystem errors

The exception thrown by the pandas library normally provides the path that resulted in an error. The remaining information available for the observer to identify the issue is which server this path used. The IP provided in the server observations associated with the DataSource gives instant visibility about which server was related to that path.

Memory error

This issue happens mostly when the data increases suddenly. Many cases can be considered, but they are mostly handled intuitively by the observer using DataMetrics observations that contain the number of rows, the schema with the number of fields, or the number of DataSources. Regardless, it might require the observations to be sent earlier than at the end of the script, such as in the following two cases:

One of the input files has a larger size than before. The file is detected easily because no DataMetrics are available for it.
The output has grown a lot because all files have grown. The size difference is detected because the DataMetrics for the outputs are missing. Also, the DataMetrics for the inputs show an increase in the number of rows.

Filesystem space errors

These errors most likely happen when writing the output, considering that we’re addressing data observability cases here. Therefore, the same information as for “Memory error” gives the observer instant visibility about why the available space is no longer sufficient and which files couldn’t be written.

The Date is not parsable as a date

In this case, the schema has the type observation for the date field change from date to something else, such as str or object.

Date column doesn’t contain values for the current year/month

The DataMetrics observations include the minimum and maximum time stamp, which gives instant visibility into the difference between the execution time and the available data. For example, let’s consider the maximum time stamp is two days in the past to the time when the data source is read, then the data can be considered too old if the acceptable period is only one day.

Date column contains values in the future

This one is easy, as the same kind of observations as for the previous failure due to missing values for the current month will give you this visibility.

Symbol categories changed

If we consider only numeric DataMetrics, we can quickly identify this case by using the number of categories that would grow in the output file. One or some of the files wouldn’t be consistent anymore, as they would refer to different categories.

Then, we must consider under what circumstances the reporting application might regard the ingestion application as failed, and where applicable, what the reporting application an observer could use. These situations include the following:

The monthly aggregated file is not available

The DataSource observation or DataMetrics hasn’t been sent by the ingestion application.

Aggregation uses missing fields, such as Close

The schema of the monthly data sent by the ingestion application is missing those fields too.

Errors with read/write access, size, and space

The same solutions apply to the reporting observer as the ingestion observer. There is no bias of information across teams or team members.

APCX and ENFA symbols

The number of categories reported by the ingestion observer has changed, giving hints of what is happening, in some cases. However, we can extend DataMetrics to also report non-numerical observations and report the categories.

Missing values in Adj Close or Open leading to abnormal numbers

The DataMetrics “number of nulls” covers this case because when the number of nulls is greater than zero, the computation of the Intraday_Delta will return NAs.

Wrong date in the monthly assets

The same solutions used by the ingestion observer related to date failures apply here. For example, the application could use the minimum and maximum value for the Date column compared to the currently reported month.

We’re now ready to handle various situations wherein the data is the source of a problem. Without this knowledge, these situations would require long, high-stress meetings to understand them and exhausting hours or days spent debugging that could turn into months, because we can’t access the data in production.

The issues discussed so far are ones we know can happen. However, many other issues are likely to propagate throughout the pipeline that we know little about yet—the unknown-unknowns. For example, the values of one of the stocks are incorrect for a previous month due to errors introduced into the data during the CSV export. This problem will happen for at least one of your applications, and others like it.

Because of these unknown-unknowns, data observations mustn’t be restricted to cover only predefined cases, but the applications must report as many observations as possible—with possible constraints on computation resources and time—to generate visibility about the anticipated or unmet situation. In this example, the distribution of the monthly stock values would be useful later for comparison with other months, and they could provide hints about whether the values are equal or similar.

The advantage of using low-level logging is having full flexibility over what you can generate as visible. Examples are custom metrics and key performance indicators (KPIs).

All jobs aren’t equal; every company, project, and application has its own specific details. You’re likely to control specific metrics, whether they are linked to consumed data or produced data. Such metrics, for a table, could be the sum of the number of items multiplied by the cost per unit minus the amount received from a web service, count(items) * cost_per_unit. The result must always be greater than zero. This easily can be added to the source code, but it has to be added by the engineer, as this constitutes specific metrics associated with the business logic (and the semantics of the columns).

Another reason for customizing observations is KPIs—the numbers stakeholders request that are important to the underlying business. KPIs are often reported periodically or computed on demand and used at random or fixed intervals. However, their meaning is so strong that stakeholders have high expectations for them, with little to no time to wait for corrections. Therefore, you must also create visibility about how KPIs evolve over time, because, if a stakeholder has doubts about it, time starts ticking the moment they ask you about its correctness. To increase responsiveness, you must generally know how the KPIs evolve, identify how they change before they do, and understand why they change based on their historical values and lineage.

As you might have anticipated during the reporting application update, defining the API—the model, encoding, and functions—isn’t a task for each application. Rather, you must standardize and reuse it across applications. Standardization reduces the amount of work per application. More importantly, it enables the observations to be homogeneous, independently of the application, to simplify the observers’ work in matching the behaviors of other applications that participate in the pipeline they are analyzing.

Standardization is also helpful for reusing entities across applications, such as the assets_monthly DataSource, which is the ingestion application’s output and the reporting application’s input. With the homogenous representation of the observations, you can consolidate the status of the entire pipeline by reusing the entities across applications.

Part of the architecture to support data observability must include creating an external system that aggregates the observations to build a global view, systematically. By having observers that rely on this aggregated view and act upon it, the system can evolve to perform some of the actions currently being done by the observers, which is where machine learning comes into play.

Conclusion

This chapter has provided a comprehensive exploration of data observability at the source and its significance in enhancing data quality and operational excellence. We delved into the concept of generating data observations within the code of data applications, highlighting the importance of incorporating observation generation code across various components of the application. These components include the application itself, the data being utilized, their relationships, and the content they hold.

Additionally, we discussed the creation of a data observability Python API at the low level, which offers a powerful toolset for developers to seamlessly integrate data observation capabilities into their applications. Through this API, practitioners gain the ability to generate data observations, track data flows, and ensure the reliability and accuracy of their data.

To reinforce these concepts, we presented a fully working example that showcased the transformation of a non-data-observable data pipeline written in Python into a robust and data-observability-driven pipeline. By leveraging the dedicated data observability Python API, we demonstrated how data observations can be generated, captured, and utilized to enhance visibility, identify issues, and drive continuous improvement.

As we move forward, the principles and strategies explored in this chapter serve as a foundation for incorporating data observability into the very fabric of data applications. By adopting these practices, organizations can ensure that their data pipelines are robust, reliable, and capable of providing valuable insights with a high level of trust.

Despite the hyper-customizability and flexibility of low-level logging, its adoption can be hindered by the initial effort required. This argument also applies to the adoption of tests. Therefore, it is crucial to simplify the usage complexity at this level. Additionally, we need to explore alternative approaches that complement low-level logging while promoting the widespread adoption of data observability across teams and individuals. The upcoming sections will delve into this subject, commencing with an exploration of event-based systems.

¹ This is what led me to start thinking about the data-observability-driven development method.

Get Fundamentals of Data Observability now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.

Start your free trial