Chapter 1. Modernizing Your Data Platform: An Introductory Overview
Data is a valuable asset that can help your company make better decisions, identify new opportunities, and improve operations. In 2013, Google undertook a strategic project to increase employee retention by improving manager quality. Even something as loosey-goosey as manager skill could be studied in a data-driven manner. Google was able to improve management favorability from 83% to 88% by analyzing 10K performance reviews, identifying common behaviors of high-performing managers, and creating training programs. Another example of a strategic data project was carried out at Amazon. The ecommerce giant implemented a recommendation system based on customer behavior that drove 35% of purchases in 2017. The Warriors, a San Francisco basketball team, are yet another example; they enacted an analytics program that helped catapult them to the top of their league. All these—employee retention, product recommendations, improving win rates—are examples of business goals that were achieved by modern data analytics.
To become a data-driven company, you need to build an ecosystem for data analytics, processing, and insights. This is because there are many different types of applications (websites, dashboards, mobile apps, ML models, distributed devices, etc.) that create and consume data. There are also many different departments within your company (finance, sales, marketing, operations, logistics, etc.) that need data-driven insights. Because the entire company is your customer base, building a data platform is more than just an IT project.
This chapter introduces data platforms, their requirements, and why traditional data architectures prove insufficient. It also discusses technology trends in data analytics and AI, and how to build data platforms for the future using the public cloud. This chapter is a general overview of the core topics covered in more detail in the rest of the book.
The Data Lifecycle
The purpose of a data platform is to support the steps that organizations need to carry out to move from raw data to insightful information. It is helpful to understand the steps of the data lifecycle (collect, store, process, visualize, activate) because they can be mapped almost as-is to a data architecture to create a unified analytics platform.
The Journey to Wisdom
Data helps companies to develop smarter products, reach more customers, and increase their return on investment (ROI). Data can also be leveraged to measure customer satisfaction, profitability, and cost. But the data by itself is not enough. Data is raw material that needs to pass through a series of stages before it can be used to generate insights and knowledge. This sequence of stages is what we call a data lifecycle. There are many definitions available in the literature, but from a general point of view, we can identify five main stages in modern data platform architecture:
1. Collect: Data has to be acquired and injected into the target systems (e.g., manual data entry, batch loading, streaming ingestion, etc.).
2. Store: Data needs to be persisted in a durable fashion with the ability to easily access it in the future (e.g., file storage system, database).
3. Process/transform: Data has to be manipulated to make it useful for subsequent steps (e.g., cleansing, wrangling, transforming).
4. Analyze/visualize: Data needs to be studied to derive business insights via manual elaboration (e.g., queries, slice and dice) or automatic processing (e.g., enrichment using ML application programming interfaces—APIs).
5. Activate: Surfacing the data insights in a form and place where decisions can be made (e.g., notifications that act as a trigger for specific manual actions, automatic job executions when specific conditions are met, ML models that send feedback to devices).
Each of these stages feeds into the next, similar to the flow of water through a set of pipes.
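To make the flow concrete, here is a minimal sketch of the five stages chained together as plain Python functions. Everything in it (the in-memory CSV feed, the file name, the revenue threshold) is hypothetical; a real platform would substitute managed services for each step.

```python
# A minimal sketch of the five lifecycle stages as composable steps.
# Function names, the CSV feed, and the threshold are illustrative only.
import csv

def collect() -> str:
    # Ingest raw data from a source system (an in-memory CSV stands in for a feed).
    return "order_id,amount\n1,120.50\n2,89.99\n3,430.00\n"

def store(raw: str) -> str:
    # Persist the raw data unchanged so it can be reprocessed later.
    with open("raw_orders.csv", "w") as f:
        f.write(raw)
    return "raw_orders.csv"

def process(path: str) -> list[dict]:
    # Transform raw records into a cleaned, typed structure.
    with open(path) as f:
        return [{"order_id": int(r["order_id"]), "amount": float(r["amount"])}
                for r in csv.DictReader(f)]

def analyze(rows: list[dict]) -> float:
    # Derive an insight: total revenue in this batch.
    return sum(r["amount"] for r in rows)

def activate(total: float) -> None:
    # Act on the insight, e.g., trigger an alert when a threshold is crossed.
    if total > 500:
        print(f"ALERT: batch revenue {total:.2f} exceeded threshold")

activate(analyze(process(store(collect()))))
```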
Water Pipes Analogy
To understand the data lifecycle better, think of it as a simplified water pipe system. The water starts at an aqueduct and is then transferred and transformed through a series of pipes until it reaches a group of houses. The data lifecycle is similar, with data being collected, stored, processed/transformed, and analyzed before it is used to make decisions (see Figure 1-1).
You can see some similarities between the plumbing world and the data world. Plumbing engineers are like data engineers, who design and build the systems that make data usable. People who analyze water samples are like data analysts and data scientists, who analyze data to find insights. Of course, this is just a simplification. There are many other roles in a company that use data, like executives, developers, business users, and security administrators. But this analogy can help you remember the main concepts.
In the canonical data lifecycle, shown in Figure 1-2, data engineers collect and store data in an analytics store. The stored data is then processed using a variety of tools. If the tools involve programming, the processing is typically done by data engineers. If the tools are declarative, the processing is typically done by data analysts. The processed data is then analyzed by business users and data scientists. Business users use the insights to make decisions, such as launching marketing campaigns or issuing refunds. Data scientists use the data to train ML models, which can be used to automate tasks or make predictions.
The real world may differ from the preceding idealized description of how a modern data platform architecture and roles should work. The stages may be combined (e.g., storage and processing) or reordered (e.g., processing before storage, as in ETL [extract-transform-load], rather than storage before processing, as in ELT [extract-load-transform]). However, there are trade-offs to such variations. For example, combining storage and processing into a single stage leads to coupling that results in wasted resources (if data sizes grow, you’ll need to scale both storage and compute) and scalability issues (if your infrastructure can’t handle the extra load, you’ll be stuck).
Now that we have defined the data lifecycle and summarized the various stages of the data journey from raw data collection to activation, let us go through each of the five stages of the data lifecycle in turn.
Collect
The first step in the design process is ingestion. Ingestion is the process of transferring data from a source, which could be anywhere (on premises, on devices, in another cloud, etc.), to a target system where it can be stored for further analysis. This is the first opportunity to consider the 3Vs of big data:
- Volume: What is the size of the data? When dealing with big data, this usually means terabytes (TB) or petabytes (PB) of data.
- Velocity: What is the speed of the data coming in? Generally this is megabytes per second (MB/s) or terabytes per day (TB/day). This is often termed the throughput.
- Variety: What is the format of the data? Tables, flat files, images, sound, text, etc.
Identify the data type (structured, semistructured, unstructured), format, and generation frequency (continuously or at specific intervals) of the data to be collected. Based on the velocity of the data and the capability of the data platform to handle the resulting volume and variety, choose between batch ingestion, streaming ingestion, or a hybrid of the two.
As different parts of the organization may be interested in different data sources, design this stage to be as flexible as possible. There are several commercial and open source solutions that can be used, each specialized for a specific data type/approach mentioned earlier. Your data platform will need to be comprehensive and support the full range of volume, velocity, and variety required for all the data that needs to be ingested into the platform. You could have simple tools that transfer files between File Transfer Protocol (FTP) servers on regular intervals, or you could have complex systems, even geographically distributed, that collect data from IoT devices in real time.
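As a rough illustration of the batch-versus-streaming choice, the following sketch contrasts the two ingestion styles using only the standard library. The landing directory and the event generator are hypothetical stand-ins for an FTP drop zone and a message queue.

```python
# A minimal sketch contrasting batch and streaming ingestion.
# The landing directory and the event generator are hypothetical stand-ins
# for an FTP drop zone and a message queue, respectively.
import json
import pathlib
import time
from typing import Iterator

LANDING_DIR = pathlib.Path("landing_zone")

def batch_ingest() -> list[dict]:
    """Run on a schedule (e.g., nightly): pick up whole files at rest."""
    records = []
    for path in sorted(LANDING_DIR.glob("*.jsonl")):
        with path.open() as f:
            records.extend(json.loads(line) for line in f)
    return records

def event_source() -> Iterator[dict]:
    """Stand-in for a streaming source such as a message topic or IoT gateway."""
    for i in range(5):
        yield {"device_id": i, "temperature": 20.0 + i}
        time.sleep(0.1)  # events arrive continuously, not as a daily dump

def streaming_ingest() -> None:
    """Process each event as it arrives, keeping end-to-end latency low."""
    for event in event_source():
        print("ingested event:", event)

if __name__ == "__main__":
    streaming_ingest()
```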
Store
In this step, store the raw data you collected in the previous step. You don’t change the data at all; you just store it. This is important because you might want to reprocess the data in a different way later, and you need to have the original data to do that.
Data comes in many different forms and sizes. The way you store it will depend on your technical and commercial needs. Some common options include object storage systems, relational database management systems (RDBMSs), data warehouses (DWHs), and data lakes. Your choice will be driven to some extent by whether the underlying hardware, software, and artifacts are able to cope with the scalability, cost, availability, durability, and openness requirements imposed by your desired use cases.
Scalability
Scalability is the ability to grow and manage increased demands in a capable manner. There are two main ways to achieve scalability:
- Vertical scalability: This involves adding extra expansion units to the same node to increase the storage system’s capacity.
- Horizontal scalability: This involves adding one or more additional nodes instead of adding new expansion units to a single node. This type of distributed storage is more complex to manage, but it can achieve improved performance and efficiency.
It is extremely important that the underlying system can cope with the volume and velocity required by modern solutions, which operate in an environment where data is exploding and its nature is transitioning from batch to real time. We live in a world where most people continuously generate and request information through their smart devices, and organizations need to provide their users (both internal and external) with solutions that respond in real time.
Performance versus cost
Identify the different types of data you need to manage, and create a hierarchy based on the business importance of the data, how often it will be accessed, and what kind of latency the users of the data will expect.
Store the most important and most frequently accessed data (hot data) in a high-performance storage system such as a data warehouse’s native storage. Store less important data (cold data) in a less expensive storage system such as cloud storage (which itself has several tiers). If you need even higher performance, such as for interactive use cases, you can use caching techniques to load a meaningful portion of your hot data into a volatile storage tier.
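As one possible way to automate this tiering, the sketch below sets an object lifecycle rule with boto3 so that objects under a prefix move to a colder storage class after 90 days. The bucket name, prefix, and thresholds are made up; equivalent lifecycle features exist on the other clouds.

```python
# A sketch of tiering colder data to cheaper storage with an S3 lifecycle rule.
# Bucket name, prefix, and day thresholds are illustrative; adjust them to your
# own access patterns and your cloud provider's storage classes.
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="analytics-raw-data",  # hypothetical bucket
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-down-cold-events",
                "Filter": {"Prefix": "events/"},
                "Status": "Enabled",
                # After 90 days the data is rarely queried: move it to a colder tier.
                "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
                # After ~7 years, delete it entirely (e.g., per retention policy).
                "Expiration": {"Days": 2555},
            }
        ]
    },
)
```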
High availability
High availability means having the ability to be operational and deliver access to the data when requested. It is usually achieved via hardware redundancy to cope with possible physical failures or outages. In the cloud, this means storing the data in at least three availability zones. Zones may not be physically separated (i.e., they may be on the same “campus”) but will tend to have different power sources, etc. Availability is usually quantified as system uptime, and modern systems usually come with four 9s (99.99%) or more.
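To put the “nines” in perspective, this small calculation shows the yearly downtime budget implied by an availability target:

```python
# Downtime budget implied by an availability target ("number of nines").
MINUTES_PER_YEAR = 365.25 * 24 * 60

def downtime_minutes_per_year(nines: int) -> float:
    availability = 1 - 10 ** (-nines)   # e.g., 4 nines -> 0.9999
    return MINUTES_PER_YEAR * (1 - availability)

for n in (3, 4, 5):
    print(f"{n} nines: about {downtime_minutes_per_year(n):.1f} minutes of downtime per year")
# 3 nines: ~526 minutes; 4 nines: ~52.6 minutes; 5 nines: ~5.3 minutes
```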
Durability
Durability is the ability to store data for a long-term period without suffering data degradation, corruption, or outright loss. This is usually achieved through storing multiple copies of the data in physically separate locations. Such data redundancy is implemented in the cloud by storing the data in at least two regions (e.g., in both London and Frankfurt). This is extremely important when dealing with data restore operations in the face of natural disasters: if the underlying storage system has a high durability (modern systems usually come with 11 9s), then all of the data can be restored with no issues unless a cataclysmic event takes down even the physically separated data centers.
Openness
As far as possible, use formats that are not proprietary and that do not generate lock-in. Ideally, it should be possible to query data with a choice of processing engines without generating copies of the data or having to move it from one system to another. That said, it is acceptable to use systems that use a proprietary or native storage format as long as they provide an easy export capability.
As with most technology decisions, openness is a trade-off, and the ROI of a proprietary technology may be high enough that you are willing to pay the price of lock-in. After all, one of the reasons to go to the cloud is to reduce operational costs—these cost advantages tend to be higher in fully managed/serverless systems than on managed open source systems. For example, if your data use case requires transactions, Databricks (which uses a quasi-open storage format based on Parquet called Delta Lake) might involve lower operating costs than Amazon EMR or Google Dataproc (which will store data in standard Parquet on S3 or Google Cloud Storage [GCS] respectively)—the ACID (Atomicity, Consistency, Isolation, Durability) transactions that Databricks provides in Delta Lake will be expensive to implement and maintain on EMR or Dataproc. If you ever need to migrate away from Databricks, export the data into standard Parquet. Openness, per se, is not a reason to reject technology that is a better fit.
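For instance, an export along the lines just described might look like the following sketch, assuming a Spark session that is already configured with the Delta Lake connector; the table paths and the partition column are hypothetical.

```python
# A sketch of exporting a Delta table to standard Parquet to avoid lock-in.
# Assumes a SparkSession already configured with the Delta Lake connector;
# paths and the partition column are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-export").getOrCreate()

df = spark.read.format("delta").load("s3://lakehouse/sales/transactions")
(df.write
   .mode("overwrite")
   .partitionBy("transaction_date")     # keep a useful physical layout
   .parquet("s3://open-format-export/sales/transactions"))
```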
Process/Transform
Here’s where the magic happens: raw data is transformed into useful information for further analysis. This is the stage where data engineers build data pipelines to make data accessible to a wider audience of nontechnical users in a meaningful way. This stage consists of activities that prepare data for analysis and use. Data integration involves combining data from multiple sources into a single view. Data cleansing may be needed to remove duplicates and errors from data. More generally, data wrangling, munging, and transformation are carried out to organize the data into a standard format.
There are several frameworks that can be used, each with its own capabilities that depend on the storage method you selected in the previous step. In general, engines that allow you to query and transform your data using pure SQL commands (e.g., AWS Athena, Google BigQuery, Azure DWH, and Snowflake) are the most efficient, cost effective,1 and easy to use. However, the capabilities they offer are limited in comparison to engines based on modern programming languages, usually Java, Scala, or Python (e.g., Apache Spark, Apache Flink, or Apache Beam running on Amazon EMR, Google Cloud Dataproc/Dataflow, Azure HDInsight, and Databricks). Code-based data processing engines allow you not only to implement more complex transformations and ML in batch and in real time but also to leverage other important features such as proper unit and integration tests.
Another consideration in choosing an appropriate engine is that SQL skills are typically much more prevalent in an organization than programming skills. The more of a data culture you want to build within your organization, the more you should lean toward SQL for data processing. This is particularly important if the processing steps (such as data cleansing or transformation) require domain knowledge.
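To illustrate the trade-off, the sketch below expresses the same simple cleansing step first as SQL and then as equivalent DataFrame code. The table and column names are invented; the point is only that the SQL version is more widely readable, while the code version is easier to unit test and compose.

```python
# The same cleansing/transformation step expressed two ways.
# Table and column names are made up for illustration.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("cleanse").getOrCreate()

# Declarative SQL: accessible to anyone who knows SQL.
deduped_sql = spark.sql("""
    SELECT customer_id,
           LOWER(TRIM(email)) AS email,
           MAX(last_seen)     AS last_seen
    FROM raw.customers
    WHERE email IS NOT NULL
    GROUP BY customer_id, LOWER(TRIM(email))
""")

# Equivalent DataFrame code: more verbose, but unit-testable and composable.
raw = spark.table("raw.customers")
deduped_df = (raw
    .where(F.col("email").isNotNull())
    .withColumn("email", F.lower(F.trim("email")))
    .groupBy("customer_id", "email")
    .agg(F.max("last_seen").alias("last_seen")))
```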
This stage may also employ data virtualization solutions that abstract multiple data sources, and related logic to manage them, to make information directly available to the final users for analysis. We will not discuss virtualization further in this book, as it tends to be a stopgap solution en route to building a fully flexible platform. For more information about data virtualization, we suggest Chapter 10 of the book The Self-Service Data Roadmap by Sandeep Uttamchandani (O’Reilly).
Analyze/Visualize
Once you arrive at this stage, the data starts finally to have value in and of itself—you can consider it information. Users can leverage a multitude of tools to dive into the content of the data to extract useful insights, identify current trends, and predict new outcomes. At this stage, visualization tools and techniques that allow users to represent information and data in a graphical way (e.g., charts, graphs, maps, heat maps, etc.) play an important role because they provide an easy way to discover and evaluate trends, outliers, patterns, and behavior.
Visualization and analysis of data can be performed by several types of users. On one hand are people who are interested in understanding business data and want to leverage graphical tools to perform common analyses like slice-and-dice, roll-ups, and what-if analysis. On the other hand, there could be more advanced users (“power users”) who want to leverage the power of a query language like SQL to execute more fine-grained and tailored analysis. In addition, there might be data scientists who can leverage ML techniques to implement new ways to extract meaningful insights from the data, discover patterns and correlations, improve customer understanding and targeting, and ultimately increase a business’s revenue, growth, and market position.
Activate
This is the step where end users are able to make decisions based on data analysis and ML predictions, thus enabling data-driven decision making. Based on the insights extracted or predicted from the available information, it is time to take action.
The actions that can be carried out fall into three categories:
- Automatic actions: Automated systems can use the results of a recommendation system to provide customized recommendations to customers. This can help the business’s top line by increasing sales.
- SaaS integrations: Actions can be performed by integrating with third-party services. For instance, a company might implement a marketing campaign to try to reduce customer churn. They could analyze data and implement a propensity model to identify customers who are likely to respond positively to a new commercial offer. The list of customer email addresses can then be sent automatically to a marketing tool to activate the campaign.
- Alerting: You can create applications that monitor data in real time and send out personalized messages when certain conditions are met. For instance, the pricing team may receive proactive notifications when the traffic to an item listing page exceeds a certain threshold, allowing them to check whether the item is priced correctly.
The technology stack for these three scenarios is different. For automatic actions, the “training” of the ML model is carried out periodically, usually by scheduling an end-to-end ML pipeline (this will be covered in Chapter 11). The predictions themselves are achieved by invoking the ML model deployed as a web service using tools like AWS SageMaker, Google Cloud Vertex AI, or Azure Machine Learning. SaaS integrations are often carried out in the context of function-specific workflow tools that allow a human to control what information is retrieved, how it is transformed, and the way it is activated. In addition, using large language models (LLMs) and their generative capabilities (we will dig more into those concepts in Chapter 10) can help automate repetitive tasks by closely integrating with core systems. Alerts are implemented through orchestration tools such as Apache Airflow, event systems such as Google Eventarc, or serverless functions such as AWS Lambda.
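As a rough sketch of the alerting path, the following function could run as a serverless function or a scheduled job: it checks a traffic metric against a threshold and posts a message to a chat webhook. The threshold, item ID, and webhook URL are hypothetical.

```python
# A sketch of the alerting path: a small function (deployable as a serverless
# function or scheduled task) that checks a metric and notifies the pricing team.
# The threshold, item ID, and webhook URL are hypothetical.
import json
import urllib.request

TRAFFIC_THRESHOLD = 10_000
WEBHOOK_URL = "https://chat.example.com/hooks/pricing-team"

def check_listing_traffic(item_id: str, page_views_last_hour: int) -> None:
    if page_views_last_hour <= TRAFFIC_THRESHOLD:
        return
    message = {
        "text": f"Item {item_id} had {page_views_last_hour} views in the last hour. "
                f"Please verify that it is priced correctly."
    }
    req = urllib.request.Request(
        WEBHOOK_URL,
        data=json.dumps(message).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)  # POST because a body is provided

check_listing_traffic("SKU-42", page_views_last_hour=18_500)
```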
In this section, we have seen the activities that a modern data platform needs to support. Next, let’s examine traditional approaches in implementing analytics and AI platforms to have a better understanding of how technology evolved and why the cloud approach can make a big difference.
Limitations of Traditional Approaches
Traditionally, organizations’ data ecosystems consist of independent solutions that are used to provide different data services. Unfortunately, such task-specific data stores, which can grow quite large, often lead to the creation of silos within an organization. The resulting siloed systems operate as independent solutions that do not work together efficiently. Siloed data is silenced data—it’s data from which insights are difficult to derive. To broaden and unify enterprise intelligence, securely sharing data across business units is critical.
If the majority of solutions are custom built, it becomes difficult to handle scalability, business continuity, and disaster recovery (DR). If each part of the organization chooses a different environment to build their solution in, the complexity becomes overwhelming. In such a scenario, it is difficult to ensure privacy or to audit changes to data.
One solution is to develop a unified data platform and, more precisely, a cloud data platform (please note that unified does not necessarily imply centralized, as will be discussed shortly). The purpose of the data platform is to allow analytics and ML to be carried out over all of an organization’s data in a consistent, scalable, and reliable way. When doing that, you should leverage, to the maximum extent possible, managed services so that the organization can focus on business needs instead of operating infrastructure. Infrastructure operations and maintenance should be delegated totally to the underlying cloud platform. In this book, we will cover the core decisions that you need to make when developing a unified platform to consolidate data across business units in a scalable and reliable environment.
Antipattern: Breaking Down Silos Through ETL
It is challenging for organizations to have a unified view of their data because they tend to have a multitude of solutions for managing it. Organizations typically solve this problem by using data movement tools. ETL applications allow data to be transformed and transferred between different systems to create a single source of truth. However, relying on ETL is problematic, and there are better solutions available in modern platforms.
Often, an ETL tool is created to extract the most recent transactions from a transactional database on a regular basis and store them in an analytics store for access by dashboards. This approach is then standardized: ETL tools are created for every database table that is required for analytics, so that analytics can be performed without having to go to the source system each time (see Figure 1-3).
The central analytics store that captures all the data across the organization is referred to as either a DWH or a data lake depending on the technology being used. A high-level distinction between the two approaches is based on the way the data is stored within the system: if the analytics store supports SQL and contains governed, quality-controlled data, it is referred to as a DWH. If instead it supports tools from the Apache ecosystem (such as Apache Spark) and contains raw data, it is referred to as a data lake. Terminology for referring to in-between analytics stores (such as governed raw data or ungoverned quality-controlled data) varies from organization to organization—some organizations call them data lakes and others call them DWH. As you will see later in the book, this confusing vocabulary is not a problem because data lake (Chapter 5) and DWH (Chapter 6) approaches are converging into what is known as data lakehouse (Chapter 7).
There are a few drawbacks to relying on data movement tools to try building a consistent view of the data:
- Data quality: ETL tools are often written by consumers of the data, who tend to not understand it as well as the owners of the data. This means that, very often, the data that is extracted is not the right data.
- Latency: ETL tools introduce latency. For example, if the ETL tool to extract recent transactions runs once an hour and takes 15 minutes to run, the data in the analytics store could be stale by up to 75 minutes. This problem can be addressed by streaming ETL, where events are processed as they happen.
- Bottleneck: ETL tools typically involve programming skills. Therefore, organizations set up bespoke data engineering teams to write the code for ETL. As the diversity of data within an organization increases, an ever-increasing number of ETL tools need to be written. The data engineering team becomes a bottleneck on the ability of the organization to take advantage of data.
- Maintenance: ETL tools need to be routinely run and troubleshot by system administrators. The underlying infrastructure system needs to be continually updated to cope with increased compute and storage capacity and to guarantee reliability.
- Change management: Changes in the schema of the input table require the extract code of the ETL tool to be changed. This either makes changes hard to do or results in the ETL tool being broken by upstream changes.
- Data gaps: It is very likely that many errors have to be escalated to the owners of the data, the creators of the ETL tool, or the users of the data. This adds to maintenance overhead, and very often to tool downtime. There are quite frequently large gaps in the data record because of this.
- Governance: As ETL processes proliferate, it becomes increasingly likely that the same processing is carried out by different processes, leading to multiple sources of the same information. It’s common for the processes to diverge over time to meet different needs, leading to inconsistent data being used for different decisions.
- Efficiency and environmental impact: The underlying infrastructure that supports these types of transformations is a concern, as it typically operates 24/7, incurring significant costs and increasing carbon footprint impact.
The first point in the preceding list (data quality) is often overlooked, but it tends to be the most important over time. Often you need to preprocess the data before it can be “trusted” to be made available in production. Data coming from upstream systems is generally considered to be raw, and it may contain noise or even bad information if it is not properly cleaned and transformed. For example, ecommerce web logs may need to be transformed before use, such as by extracting product codes from URLs or filtering out false transactions made by bots. Data processing tools must be built specifically for the task at hand. There is no global data quality solution or common framework for dealing with quality issues.
While this situation is reasonable when considering one data source at a time, the total collection (see Figure 1-4) leads to chaos.
The proliferation of storage systems, together with tailor-made data management solutions developed to satisfy the desires of different downstream applications, brings about a situation where analytics leaders and chief information officers (CIOs) face the following challenges:
- Their DWH/data lake is unable to keep up with the ever-growing business needs.
- Increasingly, digital initiatives (and competition with digital natives) have transformed the business into one where massive data volumes are flooding the system.
- The need to create separate data lakes, DWHs, and special storage for different data science tasks ends up creating multiple data silos.
- Data access needs to be limited or restricted due to performance, security, and governance challenges.
- Renewing licenses and paying for expensive support resources become challenging.
It is evident that this approach cannot be scaled to meet the new business requirements, not only because of the technological complexity but also because of the security and governance requirements that this model entails.
Antipattern: Centralization of Control
To try to address the problem of having siloed, spread, and distributed data managed via task-specific data processing solutions, some organizations have tried to centralize everything in a single, monolithic platform under the control of the IT department. As shown in Figure 1-5, the underlying technology solution doesn’t change—instead, the problems are made more tractable by assigning them to a single organization to solve.
Such centralized control by a unique department comes with its own challenges and trade-offs. All business units (BUs)—IT itself, data analytics, and business users—struggle when IT controls all data systems:
- IT: The challenge that IT departments face is the diverse set of technologies involved in these data silos. IT departments rarely have all the skills necessary to manage all of these systems. The data sits across multiple storage systems on premises and across clouds, making it costly to manage DWHs, data lakes, and data marts. It is also not always clear how to define security, governance, auditing, etc., across different sources. Moreover, centralization introduces a scalability problem in granting access to the data: the amount of work IT needs to carry out increases linearly with the number of source and target systems involved, because each new system increases the number of data access requests from stakeholders and business users.
- Analytics: One of the main problems hindering effective analytics processes is not having access to the right data. When multiple systems exist, moving data to and from one monolithic data system becomes costly, resulting in unnecessary ETL tasks. In addition, the preprepared and readily available data might not come from the most recent sources, or there might be other versions of the data that provide more depth and broader information, such as more columns or more granular records. It is impossible to give your analytics team free rein whereby everyone can access all data, due to data governance and operational issues. Organizations often end up limiting data access at the expense of analytic agility.
- Business: Getting access to data and analytics that your business can trust is difficult. There are issues around limiting the data you give the business so you can ensure the highest quality. The alternative approach is to open up access to all the data the business users need, even if that means sacrificing quality. The challenge then becomes a balancing act between the quality of the data and the amount of trusted data made available. It is often the case that IT does not have enough qualified business representatives to drive priorities and requirements. This can quickly become a bottleneck slowing down the innovation process within the organization.
Despite these challenges, many organizations have adopted this approach over the years, in some cases creating frustration and tension for business users who were delayed in getting access to the data they needed to fulfill their tasks. Frustrated business units often cope through another antipattern—that is, shadow IT—where entire departments develop and deploy useful solutions to work around such limitations but end up making the problem of siloed data worse.
A technical approach called data fabric is sometimes employed. This still relies on centralization, but instead of physically moving data, the data fabric is a virtual layer to provide unified data access. The problem is that such standardization can be a heavy burden and introduce delays for organization-wide access to data. The data fabric is, however, a viable approach for SaaS products trying to access customers’ proprietary data—integration specialists provide the necessary translation from customers’ schema to the schema expected by the SaaS tool.
Antipattern: Data Marts and Hadoop
The challenges around a siloed centrally managed system created huge tension and overhead for IT. To resolve this, some businesses adopted two other antipatterns: data marts and ungoverned data lakes.
In the first approach, data was extracted to on-premises relational and analytical databases. However, despite being called data warehouses, these products were, in practice, data marts (a subset of enterprise data suited to specific workloads) due to scalability constraints. Data marts allow business users to design and deploy their own business data into structured data models (e.g., in retail, healthcare, banking, insurance, etc.). This enables them to easily get information about the current and the historical business (e.g., the amount of revenue of the last quarter, the number of users who played your last published game in the last week, the correlation between the time spent on the help center of your website and the number of tickets received in the last six months, etc.). For many decades, organizations have been developing data mart solutions using a variety of technologies (e.g., Oracle, Teradata, Vertica) and implementing multiple applications on top of them. However, these on-premises technologies are severely limited in terms of capacity. IT teams and data stakeholders face the challenges of scaling infrastructure (vertically), finding critical talent, reducing costs, and ultimately meeting the growing expectation of delivering valuable insights. Moreover, these solutions tended to be costly because as data sizes grew, you needed to get a system with more compute to process it.
Due to scalability and cost issues, big data solutions based on the Apache Hadoop ecosystem were created. Hadoop introduced distributed data processing (horizontal scaling) using low-cost commodity servers, enabling use cases that were previously only possible with high-end (and very costly) specialized hardware. Every application running on top of Hadoop was designed to tolerate node failures, making it a cost-effective alternative to some traditional DWH workloads. This led to the development of a new concept called data lake, which quickly became a core pillar of data management alongside the DWH.
The idea was that while core operational technology divisions carried on with their routine tasks, all data was exported for analytics into a centralized data lake. The intent was for the data lake to serve as the central repository for analytics workloads and for business users. Data lakes have evolved from being mere storage facilities for raw data to platforms that enable advanced analytics and data science on large volumes of data. This enabled self-service analytics across the organization, but it required an extensive working knowledge of advanced Hadoop and engineering processes to access the data. The Hadoop Open Source Software (Hadoop OSS) ecosystem grew in terms of data systems and processing frameworks (HBase, Hive, Spark, Pig, Presto, SparkML, and more) in parallel to the exponential growth in organizations’ data, but this led to additional complexity and cost of maintenance. Moreover, data lakes became an ungoverned mess of data that few potential users of the data could understand. The combination of a skills gap and data quality issues meant that enterprises struggled to get good ROI out of data lakes on premises.
Now that you have seen several antipatterns, let’s focus on how you could design a data platform that provides a unified view of the data across its entire lifecycle.
Creating a Unified Analytics Platform
Data mart and data lake technologies enabled IT to build the first iteration of a data platform to break down data silos and to enable the organization to derive insights from all their data assets. The data platform enabled data analysts, data engineers, data scientists, business users, architects, and security engineers to derive better real-time insights and predict how their business will evolve over time.
Cloud Instead of On-Premises
DWH and data lakes are at the core of modern data platforms. DWHs support structured data and SQL, whereas data lakes support raw data and programming frameworks in the Apache ecosystem.
However, running DWH and data lakes in an on-premises environment has some inherent challenges, such as scaling and operational costs. This has led organizations to reconsider their approach and to start considering the cloud (especially the public version of it) as the preferred environment for such a platform. Why? Because it allowed them to:
- Reduce cost by taking advantage of new pricing models (pay-per-use model)
- Speed up innovation by taking advantage of best-of-breed technologies
- Scale on-premises resources using a “bursting” approach
- Plan for business continuity and disaster recovery by storing data in multiple zones and regions
- Manage disaster recovery automatically using fully managed services
When users are no longer constrained by the capacity of their infrastructure, organizations are able to democratize data across their organization and unlock insights. The cloud supports organizations in their modernization efforts, as it minimizes the toil and friction by offloading the administrative, low-value tasks. A cloud data platform promises an environment where you no longer have to compromise and can build a comprehensive data ecosystem that covers the end-to-end data management and data processing stages from data collection to serving. And you can use your cloud data platform to store vast amounts of data in varying formats without compromising on latency.
Cloud data platforms promise:
- Centralized governance and access management
- Increased productivity and reduced operational costs
- Greater data sharing across the organization
- Extended access by different personas
- Reduced latency of accessing data
In the public cloud environment, the lines between DWH and data lake technologies are blurring because cloud infrastructure (specifically, the separation of compute and storage) enables a convergence that was impossible in the on-premises environment. Today it is possible to apply SQL to data held in a data lake, and it’s possible to run what is traditionally a Hadoop technology (e.g., Spark) against data stored in a DWH. In this section we will give you an introduction to how this convergence works and how it can be the basis for brand-new approaches that can revolutionize the way organizations are looking at the data; you’ll get more details in Chapters 5 through 7.
Drawbacks of Data Marts and Data Lakes
Over the past 40 years, IT departments built domain-specific DWHs, called data marts, to support data analysts. They have come to realize that such data marts are difficult to manage and can become very costly. Legacy systems that worked well in the past (such as on-premises Teradata and Netezza appliances) have proven to be difficult to scale, to be very expensive, and to pose a number of challenges related to data freshness. Additionally, they cannot easily provide modern capabilities such as access to AI/ML or real-time features without adding that functionality after the fact.
Data mart users are frequently analysts who are embedded in a specific business unit. They may have ideas about additional datasets, analysis, data processing, and business intelligence functionality that would be very beneficial to their work. However, in a traditional company, they frequently do not have direct access to data owners, nor can they easily influence the technical decision makers who decide on datasets and tools. Additionally, because they do not have access to raw data, they are unable to test hypotheses or gain a deeper understanding of the underlying data.
Data lakes are not as simple or cost-effective as they may seem. While they can be scaled easily in theory, organizations often face challenges in planning and provisioning sufficient storage, especially if they produce highly variable amounts of data. Additionally, provisioning computational capacity for peak periods can be expensive, leading to competition for scarce resources between different business units.
On-premises data lakes can be fragile and require time-consuming maintenance. Engineers who could be developing new features are often relegated to maintaining data clusters and scheduling jobs for business units. The total cost of ownership is often higher than expected. In short, many businesses find that their data lakes do not create value and that the ROI is negative.
With data lakes, governance is not easily solved, especially when different parts of the organization use different security models. Then, the data lakes become siloed and segmented, making it difficult to share data and models across teams.
Data lake users typically are closer to the raw data sources and need programming skills to use data lake tools and capabilities, even if it is just to explore the data. In traditional organizations, these users tend to focus on the data itself and are frequently held at arm’s length from the rest of the business. On the other hand, business users do not have the programming skills to derive insights from data in a data lake. This disconnect means that business units miss out on the opportunity to gain insights that would drive their business objectives forward to higher revenues, lower costs, lower risk, and new opportunities.
Convergence of DWHs and Data Lakes
Given these trade-offs, many companies end up with a mixed approach, where a data lake is set up to graduate some data into a DWH or a DWH has a side data lake for additional testing and analysis. However, with multiple teams fabricating their own data architectures to suit their individual needs, data sharing and fidelity gets even more complicated for a central IT team.
Instead of having separate teams with separate goals—where one explores the business and another knows the business—you can unite these functions and their data systems to create a virtuous cycle where a deeper understanding of the business drives exploration and that exploration drives a greater understanding of the business.
Starting from this principle, the data industry has begun shifting toward two new approaches, lakehouse and data mesh, which work well together because they help solve two separate challenges within an organization:
- Lakehouse allows users with different skill sets (data analysts and data engineers) to access the data using different technologies.
- Data mesh allows an enterprise to create a unified data platform without centralizing all the data in IT—this way, different business units can own their own data but allow other business units to access it in an efficient, scalable way.
As an added benefit, this architecture combination also brings in more rigorous data governance, something that data lakes typically lack. Data mesh empowers people to avoid being bottlenecked by one team and thus enables the entire data stack. It breaks silos into smaller organizational units in an architecture that provides access to data in a federated way.
Lakehouse
Data lakehouse architecture is a combination of the key benefits of data lakes and data warehouses (see Figure 1-6). It offers a low-cost storage format that is accessible by various processing engines, such as the SQL engines of data warehouses, while also providing powerful management and optimization features.
Databricks is a proponent of the lakehouse architecture because it was founded on Spark and needs to support business users who are not programmers. As a result, data in Databricks is stored in a data lake, but business users can use SQL to access it. However, the lakehouse architecture is not limited to Databricks.
DWHs running in cloud solutions like Google Cloud BigQuery, Snowflake, or Azure Synapse allow you to create a lakehouse architecture based around columnar storage that is optimized for SQL analytics: you can treat the DWH like a data lake by allowing Spark jobs running on parallel Hadoop environments to leverage the data stored in the underlying storage system, rather than requiring a separate ETL process or storage layer.
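The following sketch illustrates the converged pattern from the data lake side: open-format Parquet files in object storage are registered as a view and queried with plain SQL, with no copy into a separate warehouse store. Paths and column names are illustrative.

```python
# A sketch of the lakehouse idea: the same open-format files in object storage
# are queried with SQL, with no copy into a proprietary warehouse store.
# Paths and column names are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lakehouse-sql").getOrCreate()

# Data engineers land Parquet files in the lake...
orders = spark.read.parquet("gs://company-lake/curated/orders/")
orders.createOrReplaceTempView("orders")

# ...and analysts query them with plain SQL.
top_customers = spark.sql("""
    SELECT customer_id, SUM(amount) AS total_spend
    FROM orders
    GROUP BY customer_id
    ORDER BY total_spend DESC
    LIMIT 10
""")
top_customers.show()
```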
The lakehouse pattern offers several advantages over the traditional approaches:
- Decoupling of storage and compute that enables:
  - Inexpensive, virtually unlimited, and seamlessly scalable storage
  - Stateless, resilient compute
  - ACID-compliant storage operations
  - A logical database storage model, rather than physical
- Data governance (e.g., data access restriction and schema evolution)
- Support for data analysis via the native integration with business intelligence tools
- Native support of the typical multiversion approach of a data lake (i.e., bronze, silver, and gold)
- Data storage and management via open formats like Apache Parquet and Iceberg
- Support for different data types in structured or unstructured format
- Streaming capabilities with the ability to handle real-time analysis of the data
- Enablement of a diverse set of applications varying from business intelligence to ML
A lakehouse, however, is inevitably a technological compromise. The use of standard formats in cloud storage limits the storage optimizations and query concurrency that DWHs have spent years perfecting. Therefore, the SQL supported by lakehouse technologies is not as efficient as that of a native DWH (i.e., it will take more resources and cost more). Also, the SQL support tends to be limited, with features such as geospatial queries, ML, and data manipulation not available or incredibly inefficient. Similarly, the Spark support provided by DWHs is limited and tends to be not as performant as the native Spark support provided by a data lake vendor.
The lakehouse approach enables organizations to implement the core pillars of an incredibly varied data platform that can support any kind of workload. But what about the organizations on top of it? How can users leverage the best of the platform to execute their tasks? In this scenario there is a new operating model that is taking shape, and it is data mesh.
Data mesh
Data mesh is a decentralized operating model of tech, people, and process to solve the most common challenge in analytics—the desire for centralization of control in an environment where ownership of data is necessarily distributed, as shown in Figure 1-7. Another way of looking at data mesh is that it introduces a way of seeing data as a self-contained product rather than a product of ETL pipelines.
Distributed teams in this approach own the data production and serve internal/external consumers through well-defined data schema. As a whole, data mesh is built on a long history of innovation from across DWHs and data lakes, combined with the scalability, pay-for-consumption models, self-service APIs, and close integration associated with DWH technologies in the public cloud.
With this approach, you can effectively create an on-demand data solution. A data mesh decentralizes data ownership among domain data owners, each of whom are held accountable for providing their data as a product in a standard way (see Figure 1-8). A data mesh also enables communication between various parts of the organization to distribute datasets across different locations.
In a data mesh, the responsibility for generating value from data is federated to the people who understand it best; in other words, the people who created the data or brought it into the organization must also be responsible for creating consumable data assets as products from the data they create. In many organizations, establishing a “single source of truth” or “authoritative data source” is tricky due to the repeated extraction and transformation of data across the organization without clear ownership responsibilities over the newly created data. In the data mesh, the authoritative data source is the data product published by the source domain, with a clearly assigned data owner and steward who is responsible for that data.
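As a concrete (and entirely hypothetical) illustration, a domain team might describe its data product with an explicit, versioned contract such as the following; in practice this often lives in a data catalog or as a YAML data contract in source control.

```python
# A sketch of a data product descriptor a domain team might publish in a data
# mesh. The fields and values are hypothetical; real implementations often use
# a data catalog or a "data contract" file checked into source control.
customer_orders_product = {
    "name": "customer_orders",
    "domain": "ecommerce",                      # owning business unit
    "owner": "orders-team@example.com",         # accountable data owner/steward
    "description": "Cleaned, deduplicated customer orders, updated hourly.",
    "schema": [
        {"name": "order_id",    "type": "STRING",    "required": True},
        {"name": "customer_id", "type": "STRING",    "required": True},
        {"name": "amount",      "type": "NUMERIC",   "required": True},
        {"name": "order_ts",    "type": "TIMESTAMP", "required": True},
    ],
    "sla": {"freshness": "1h", "availability": "99.9%"},
    "access": {"read": ["marketing", "finance"], "pii": False},
    "version": "2.3.0",
}
```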
A data mesh is an organizational principle about ownership and accountability for data. Most commonly, a data mesh is implemented using a lakehouse, with each business unit having a separate cloud account. Having access to this unified view from a technology perspective (lakehouse) and from an organizational perspective (data mesh) means that people and systems get data delivered to them in a way that makes the most sense for their needs. In some cases this kind of architecture has to span multiple environments, resulting in very complex architectures. Let’s see how companies can manage this challenge.
Note
For more information about data mesh, we recommend you read Zhamak Dehghani’s book Data Mesh: Delivering Data-Driven Value at Scale (O’Reilly).
Hybrid Cloud
When designing a cloud data platform, it might be that one single environment isn’t enough to manage a workload end to end. This could be because of regulatory constraints (i.e., you cannot move your data into an environment outside the organization’s boundaries), because of cost (e.g., the organization made infrastructure investments that have not yet reached end of life), or because you need a specific technology that is not available in the cloud. In this case a possible approach is adopting a hybrid pattern, in which applications run in a combination of various environments. The most common example of the hybrid pattern is combining a private computing environment, like an on-premises data center, with a public cloud computing environment. In this section we will explain how this approach can work in an enterprise.
Reasons Why Hybrid Is Necessary
Hybrid cloud approaches are widespread because almost no large enterprise today relies entirely on the public cloud. Many organizations have invested millions of dollars and thousands of hours into on-premises infrastructure over the past few decades. Almost all organizations are running a few traditional architectures and business-critical applications that they may not be able to move over to public cloud. They may also have sensitive data they can’t store in a public cloud due to regulatory or organizational constraints.
Allowing workloads to transition between public and private cloud environments provides a higher level of flexibility and additional options for data deployment. There are several reasons that drive hybrid (i.e., architecture spanning across on-premises, public cloud, and edge) and multicloud (i.e., architecture spanning across multiple public cloud vendors like AWS, Microsoft Azure, and Google Cloud Platform [GCP], for example) adoption.
Here are some key business reasons for choosing hybrid and/or multicloud:
- Data residency regulations: Some organizations may never fully migrate to the public cloud, perhaps because they are in finance or healthcare and need to follow strict industry regulations on where data is stored. This is also the case with workloads in countries that have a data residency requirement but no public cloud presence.
- Legacy investments: Some customers want to protect legacy workloads like SAP, Oracle, or Informatica on premises but want to take advantage of public cloud innovations such as Databricks and Snowflake.
- Transition: Large enterprises often require a multiyear journey to modernize into cloud native applications and architectures. They will have to embrace hybrid architectures as an intermediate state for years.
- Burst to cloud: There are customers who are primarily on premises and have no desire to migrate to the public cloud. However, they have challenges meeting business service-level agreements (SLAs) due to ad hoc large batch jobs, spiky traffic during busy periods, or large-scale ML training jobs. They want to take advantage of scalable capacity or custom hardware in public clouds and avoid the cost of scaling up on-premises infrastructure. Solutions like MotherDuck, which adopt a “local-first” computing approach, are becoming popular.
- Best of breed: Some organizations choose different public cloud providers for different tasks in an intentional strategy to choose the technologies that best serve their needs. For example, Uber uses AWS to serve their web applications, but it uses Cloud Spanner on Google Cloud for its fulfillment platform. Twitter runs its news feed on AWS, but it runs its data platform on Google Cloud.
Now that you understand the reasons why you might choose a hybrid solution, let’s have a look at the main challenges you will face when using this pattern; these challenges are why hybrid ought to be treated as an exception, and the goal should be to be cloud native.
Challenges of Hybrid Cloud
There are several challenges that enterprises face when implementing hybrid or multicloud architectures:
- Governance: It is difficult to apply consistent governance policies across multiple environments. For example, compliance and security policies are usually dealt with differently on premises and in the public cloud. Often, parts of the data are duplicated across on premises and cloud. Imagine your organization is running a financial report—how would you guarantee that the data used is the most recently updated copy if multiple copies exist across platforms?
- Access control: User access controls and policies differ between on-premises and public cloud environments. Cloud providers have their own user access controls (called identity and access management, or IAM) for the services provided, whereas on-premises environments use technologies such as Lightweight Directory Access Protocol (LDAP) or Kerberos. How do you keep them synchronized or have a single control plane across distinct environments?
- Workload interoperability: When workloads span multiple systems, inconsistent runtime environments are inevitable and need to be managed.
- Data movement: If both on-premises and cloud applications require access to some data, the two datasets must be in sync. It is costly to move data between multiple systems—there is a human cost to create and manage the pipeline, there may be licensing costs due to the software used, and last but not least, it consumes system resources such as computation, network, and storage. How can your organization deal with the costs from multiple environments? How do you join heterogeneous data that is siloed across various environments? Where do you end up copying the data as a result of the join process?
- Skill sets: Having two clouds (or on premises and cloud) means teams have to know and build expertise in two environments. Since the public cloud is a fast-moving environment, there is a significant overhead associated with upskilling and maintaining the skills of employees in one cloud, let alone two. Skill sets can also be a challenge for hiring systems integrators (SIs)—even though most large SIs have practices for each of the major clouds, very few have teams that know two or more clouds. As time goes on, we anticipate that it will become increasingly difficult to hire people willing to learn bespoke on-premises technologies.
- Economics: The fact that the data is split between two environments can bring unforeseen costs: you might have data in one cloud that you want to make available in another, incurring egress costs.
Despite these challenges, a hybrid setup can work. We’ll look at how in the next subsection.
Why Hybrid Can Work
Cloud providers are aware of these needs and these challenges. Therefore, they provide some support for hybrid environments. These fall into three areas:
- Choice: Cloud providers often make large contributions to open source technologies. For example, although Kubernetes and TensorFlow were developed at Google, they are open source, so managed execution environments for them exist in all the major clouds and they can be leveraged even in on-premises environments.
- Flexibility: Frameworks such as Databricks and Snowflake allow you to run the same software on any of the major public cloud platforms. Thus, teams can learn one set of skills that will work everywhere. Note that the flexibility offered by tools that work on multiple clouds does not mean that you have escaped lock-in. You will have to choose between (1) lock-in at the framework level and flexibility at the cloud level (offered by technologies such as Databricks or Snowflake) and (2) lock-in at the cloud level and flexibility at the framework level (offered by the cloud native tools).
- Openness: Even when the tool is proprietary, code for it is written in a portable manner because of the embrace of open standards and import/export mechanisms. For example, even though Redshift runs nowhere but on AWS, its queries are written in standard SQL and there are multiple import and export mechanisms. Together, these capabilities make Redshift, BigQuery, and Synapse open platforms. This openness allows for use cases like Teads, where data is collected using Kafka on AWS, aggregated using Dataflow and BigQuery on Google Cloud, and written back to AWS Redshift (see Figure 1-9).
Cloud providers are making a commitment to choice, flexibility, and openness by making heavy investments in open source projects that help customers use multiple clouds. Therefore, multicloud DWHs or hybrid data processing frameworks are becoming reality. So you can build out hybrid and multicloud deployments with better cloud software production, release, and management—the way you want, not how a vendor dictates.
Edge Computing
Another incarnation of the hybrid pattern arises when you need computational power outside the usual data platform perimeter, perhaps to interact directly with connected devices. This is edge computing: it brings computation and data storage closer to the place where data is generated and needs to be processed, with the aim of improving response times and saving bandwidth. Edge computing can unlock many use cases and accelerate digital transformation, with application areas such as security, robotics, predictive maintenance, and smart vehicles.
As edge computing is adopted and goes mainstream, there are many potential advantages for a wide range of industries:
- Faster response time
-
In edge computing, storage and computation are distributed and made available at the point where a decision needs to be made. Avoiding a round trip to the cloud reduces latency and enables faster responses. In predictive maintenance, this helps stop critical machines from breaking down or hazardous incidents from occurring. In interactive gaming, edge computing can provide the millisecond response times that are required. In fraud prevention and security scenarios, it can protect against privacy breaches and denial-of-service attacks.
- Intermittent connectivity
-
Unreliable internet connectivity at remote assets such as oil wells, farm pumps, solar farms, or windmills can make monitoring those assets difficult. Edge devices’ ability to locally store and process data ensures no data loss or operational failure in the event of limited internet connectivity.
- Security and compliance
-
Edge computing can eliminate a lot of data transfer between devices and the cloud. It is possible to filter sensitive information locally and transmit only the data needed for model building to the cloud. For example, with smart devices, wake-word detection such as listening for “OK Google” or “Alexa” can happen on the device itself, so potentially private data never needs to be collected or sent to the cloud. This allows users to build an appropriate security and compliance framework, which is essential for enterprise security and audits.
- Cost-effective solutions
-
One of the practical concerns around IoT adoption is the up-front cost of network bandwidth, data storage, and computational power. By performing much of the computation locally, edge computing lets businesses decide which services to run at the edge and which to send to the cloud, reducing the final cost of an overall IoT solution. This is where low-memory binary deployment of embedded models in a format like Open Neural Network Exchange (ONNX), built with a modern compiled language like Rust or Go, can excel (see the sketch after this list).
- Interoperability
-
Edge devices can act as a communication liaison between legacy and modern machines. This allows legacy industrial machines to connect to modern machines or IoT solutions and provides immediate benefits of capturing insights from legacy or modern machines.
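As a simple illustration of the embedded-model idea mentioned under cost-effective solutions, the sketch below loads an exported ONNX model and scores sensor data locally with the onnxruntime library. The model file, input layout, and threshold are hypothetical; in production this logic might instead be compiled into a small Rust or Go binary, as noted above, so that only predictions and alerts leave the device.

```python
# Minimal sketch: run a pre-exported ONNX model on an edge device so that
# only predictions (not raw sensor data) are sent upstream.
# The model path, input shape, and threshold are hypothetical.
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("vibration_anomaly.onnx")
input_name = session.get_inputs()[0].name

def score_locally(sensor_window: np.ndarray) -> float:
    """Return an anomaly score for one window of sensor readings."""
    batch = sensor_window.astype(np.float32).reshape(1, -1)
    outputs = session.run(None, {input_name: batch})
    return float(outputs[0].ravel()[0])

# Score a 64-sample vibration window; only notify the cloud if the local
# score crosses a threshold.
window = np.random.rand(64)
if score_locally(window) > 0.9:
    print("anomaly detected - send event to cloud")
```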
All these concepts give architects a great deal of flexibility in defining their data platform. In Chapter 9 we will dive deeper into these concepts and see how this pattern is becoming a standard.
Applying AI
Many organizations are thrust into designing a cloud data platform because they need to adopt AI technologies, so it is important to ensure that the platform is future proof and capable of supporting AI use cases. Considering the impact AI is having on society and its growing diffusion within enterprises, let’s take a quick look at how it can be implemented in an enterprise environment. You will find a deeper discussion in Chapters 10 and 11.
Machine Learning
These days, a branch of AI called supervised machine learning has become tremendously successful to the point where the term AI is more often used as an umbrella term for this branch. Supervised ML works by showing the computer program lots of examples where the correct answers (called labels) are known. The ML model is a standard algorithm (i.e., the exact same code) that has tunable parameters that “learn” how to go from the provided input to the label. Such a learned model is then deployed to make decisions on inputs for which the correct answers are not known.
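As a minimal illustration of this train-on-labeled-examples, predict-on-new-inputs loop, here is a sketch using scikit-learn; the feature values and labels are made up for illustration, not drawn from any real dataset.

```python
# Minimal supervised learning sketch: a standard algorithm (logistic
# regression) tunes its parameters from labeled examples, then predicts
# labels for inputs whose answers are unknown. All data is made up.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Labeled examples: [amount, known_device] per transaction, label 1 = fraud.
X_train = np.array([[120.0, 0], [15.5, 1], [980.0, 0], [22.0, 1]])
y_train = np.array([1, 0, 1, 0])

model = LogisticRegression()
model.fit(X_train, y_train)      # "learning" = tuning the parameters

# The deployed model makes decisions on inputs with no known answer.
X_new = np.array([[870.0, 0]])
print(model.predict(X_new))      # e.g., array([1])
```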
Unlike expert systems, there is no need to explicitly program the AI model with the rules to make decisions. Because many real-world domains involve human judgment where experts struggle to articulate their logic, having the experts simply label input examples is much more feasible than capturing their logic.
Modern-day chess-playing algorithms and medical diagnostic tools use ML. The chess-playing algorithms learn from records of games that humans have played in the past,2 whereas medical diagnostic systems learn from having expert physicians label diagnostic data.
Generative AI, a branch of AI/ML that has recently become extremely capable, can not only understand images and text but also generate realistic images and text. Besides creating new content in applications such as marketing, generative AI streamlines the interaction between machines and users: users can ask questions in natural language and automate many operations using English, or other languages, instead of having to know programming languages.
In order for these ML methods to operate, they require tremendous amounts of training data and readily available custom hardware. Because of this, organizations adopting AI start out by building a cloud data/ML platform.
Uses of ML
There are a few key reasons for the spectacular adoption of ML in industry:
- Data is easier.
-
It is easier to collect labeled data than to capture logic. Every piece of human reasoning has exceptions that would have to be discovered and coded up over time. It is easier to get a team of ophthalmologists to label a thousand images than it is to get them to describe how they identify that a blood vessel is hemorrhaging.
- Retraining is easier.
-
When ML is used for systems such as recommending items to users or running marketing campaigns, user behavior changes quickly, so it is important to retrain models continually. Continual retraining is practical with ML, but much harder with hand-coded logic.
- Better user interface.
-
A class of ML called deep learning has proven capable of being trained even on unstructured data such as images, video, and natural language text. These types of inputs are notoriously difficult to program against. This enables you to use real-world data as inputs—consider how much better the user interface of depositing checks becomes when you can simply take a photograph of a check instead of having to type all the information into a web form.
- Automation.
-
The ability of ML models to understand unstructured data makes it possible to automate many business processes. Forms can be easily digitized, instrument dials can be more easily read, and factory floors can be more easily monitored because of the ability to automatically process natural language text, images, or video.
- Cost-effectiveness.
-
ML APIs that give machines the ability to understand and create text, images, music, and video cost a fraction of a cent per invocation, whereas paying a human to do so would cost several orders of magnitude more. This enables the use of technology in situations such as recommendations, where a personal shopping assistant would be prohibitively expensive.
- Assistance.
-
Generative AI can empower developers, marketers, and other white-collar workers to be more productive. Coding assistants and workflow copilots are able to simplify parts of many corporate functions, such as sending out customized sales emails.
Given these advantages, it is not surprising that a Harvard Business Review article found that AI generally supports three main business requirements:
-
Automating business processes—typically automating back-office administrative and financial tasks
-
Gaining insight through data analysis
-
Engaging with customers and employees
ML makes it possible to solve these problems at scale from data examples, without needing to write custom code for everything. Techniques such as deep learning extend this to cases where the data consists of unstructured information such as images, speech, video, and natural language text.
Why Cloud for AI?
A key impetus behind designing a cloud data platform might be that the organization is rapidly adopting AI technologies such as deep learning. In order for these methods to operate, they require tremendous amounts of training data. Therefore, an organization that plans to build ML models will need to build a data platform to organize and make the data available to their data science teams. The ML models themselves are very complex, and training the models requires copious amounts of specialized hardware called graphics processing units (GPUs). Further, AI technologies such as speech transcription, machine translation, and video intelligence tend to be available as SaaS software on the cloud. In addition, cloud platforms provide key capabilities such as democratization, easier operationalization, and the ability to keep up with the state of the art.
Cloud Infrastructure
The bottom line is that high-quality AI requires a lot of data—a famous paper titled “Deep Learning Scaling Is Predictable, Empirically” found that to get a 5% improvement in a natural language model, it was necessary to train on twice as much data as was used to get the first result. The best ML models are not the most advanced ones—they are the ones trained on more data of high-enough quality, because increasingly sophisticated models require more data, whereas even simple models improve in performance when trained on a sufficiently large dataset.
To give you an idea of the quantity of data required to complete the training of modern ML models, image classification models are routinely trained on one million images and leading language models are trained on multiple terabytes of data.
As shown in Figure 1-10, this quantity of data requires a lot of efficient, bespoke computation—provided by accelerators such as GPUs and custom application-specific integrated circuits (ASICs) called tensor processing units (TPUs)—to harness it and make sense of it.
Many recent AI advances can be attributed to increases in data size and compute power. The synergy between the large datasets in the cloud and the numerous computers that power it has enabled tremendous breakthroughs in ML. Breakthroughs include reducing word error rates in speech recognition by 30% over traditional approaches, the biggest gain in 20 years.
Democratization
Architecting ML models, especially in complex domains such as time-series processing or natural language processing (NLP), requires knowledge of ML theory. Writing code for training ML models using frameworks such as PyTorch, Keras, or TensorFlow requires knowledge of Python programming and linear algebra. In addition, data preparation for ML often requires data engineering expertise, and evaluating ML models requires knowledge of advanced statistics. Deploying ML models and monitoring them requires knowledge of DevOps and software engineering (often termed MLOps). Needless to say, it is rare that all these skills are present in every organization. Given this, leveraging ML for business problems can be difficult for a traditional enterprise.
Cloud technologies offer several options to democratize the use of ML:
- ML APIs
-
Cloud providers offer prebuilt ML models that can be invoked via APIs. At that point, a developer can consume the ML model like any other web service; all they require is the ability to program against representational state transfer (REST) web services. Examples of such ML APIs include Google Translate, Azure Text Analytics, and Amazon Lex—these APIs can be used without any knowledge of NLP. Cloud providers also offer generative models for text and image generation as APIs where the input is just a text prompt.
- Customizable ML models
-
Some public clouds offer “AutoML”: end-to-end ML pipelines that can be trained and deployed with the click of a mouse. AutoML models carry out “neural architecture search,” essentially automating the design of ML models through a search mechanism. While training takes longer than if a human expert chooses an effective model for the problem, an AutoML system can suffice for lines of business that don’t have the capability to architect their own models. Note that not all AutoML is the same—sometimes what’s called AutoML is just parameter tuning. Make sure you are getting a custom-built architecture rather than simply a choice among prebuilt models, and double-check which steps are actually automated (e.g., feature engineering, feature extraction, feature selection, model selection, parameter tuning, problem checking, etc.).
- Simpler ML
-
Some DWHs (BigQuery and Redshift at the time of writing) provide the ability to train ML models on structured data using just SQL, and they support more complex models by delegating to Vertex AI and SageMaker, respectively. Tools like DataRobot and Dataiku offer point-and-click interfaces to train ML models, and cloud platforms make fine-tuning of generative models much easier than doing it yourself. A sketch of the SQL-only approach appears after this list.
- ML solutions
-
Some applications are so common that end-to-end ML solutions are available to purchase and deploy. Product Discovery on Google Cloud offers an end-to-end search and ranking experience for retailers. Amazon Connect offers a ready-to-deploy contact center powered by ML. Azure Knowledge Mining provides a way to mine a variety of content types. In addition, companies such as Quantum Metric and C3 AI offer cloud-based solutions for problems common in several industries.
- ML building blocks
-
Even if no solution exists for the entire ML workflow, parts of it can take advantage of prebuilt building blocks. For example, recommender systems require the ability to match items and products; a general-purpose matching algorithm called two-tower encoders is available from Google Cloud. While there is no end-to-end back-office automation ML model, you could take advantage of form parsers to implement that workflow more quickly.
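To make the “Simpler ML” option above concrete, here is a hedged sketch of training and scoring a model with nothing but SQL, using BigQuery ML syntax submitted through the Python client. The dataset, table, and column names are hypothetical placeholders; Redshift ML offers a broadly similar SQL-based workflow.

```python
# Sketch: training an ML model with SQL alone (BigQuery ML).
# Dataset, table, and column names are hypothetical placeholders.
from google.cloud import bigquery

client = bigquery.Client()

# Train a logistic regression model directly in the warehouse.
client.query("""
    CREATE OR REPLACE MODEL mydataset.churn_model
    OPTIONS (model_type = 'logistic_reg', input_label_cols = ['churned']) AS
    SELECT tenure_months, monthly_spend, support_tickets, churned
    FROM mydataset.customers
""").result()

# Score new rows with the trained model, still in SQL.
rows = client.query("""
    SELECT customer_id, predicted_churned
    FROM ML.PREDICT(MODEL mydataset.churn_model,
                    (SELECT * FROM mydataset.new_customers))
""").result()

for row in rows:
    print(row.customer_id, row.predicted_churned)
```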
These capabilities allow enterprises to adopt AI even if they don’t have deep expertise in it, thereby making AI more widely available.
Even if the enterprise does have expertise in AI, these capabilities prove very useful because you still have to decide whether to buy or build an ML system. There are usually more ML opportunities than there are people to solve them, so there is an advantage to carrying out noncore functionality with prebuilt tools and solutions. These out-of-the-box solutions can deliver a lot of value immediately without requiring custom applications. For example, natural language text can be passed to a prebuilt model via an API call to translate it from one language to another. This not only reduces the effort to build applications but also enables non-ML experts to use AI. On the other end of the spectrum, the problem may require a custom solution. For example, retailers often build ML models to forecast demand so they know how much product to stock. These models learn buying patterns from the company’s historical sales data, combined with in-house expert intuition.
Another common pattern is to use prebuilt, out-of-the-box models for quick experimentation, and once the ML solution has proven its value, a data science team can build it in a bespoke way to get greater accuracy and hopefully more differentiation against the competition.
Real Time
The ML infrastructure needs to be integrated with a modern data platform because real-time, personalized ML is where the value is. Speed of analytics therefore becomes critical: the data platform must be able to ingest, process, and serve data in real time, or opportunities are lost. This is complemented by speed of action. ML drives personalized services based on the customer’s context, but it has to provide inference before that context switches—for most commercial transactions there is a closing window within which the ML model needs to give the customer an option to act. To achieve this, the results of ML models must arrive at the point of action in real time.
Being able to supply ML models with data in real time and get the ML prediction in real time is the difference between preventing fraud and discovering fraud. To prevent fraud, it is necessary to ingest all payment and customer information in real time, run the ML prediction, and provide the result of the ML model back to the payment site in real time so that the payment can be rejected if fraud is suspected.
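As an illustration of that request/response loop, the sketch below shows a minimal synchronous scoring function that could sit behind a payment endpoint. The feature names, scoring logic, and threshold are hypothetical stand-ins for a trained model; in a real deployment this would be a low-latency service fed by streaming ingestion of payment and customer events.

```python
# Minimal sketch of inline (real-time) fraud scoring at payment time.
# The features, weights, and threshold are hypothetical stand-ins for a
# trained model served behind a low-latency endpoint.
from dataclasses import dataclass

@dataclass
class Payment:
    amount: float
    seconds_since_last_payment: float
    country_mismatch: bool

def fraud_score(p: Payment) -> float:
    """Stand-in for a trained model's probability estimate."""
    score = 0.0
    score += 0.4 if p.amount > 1000 else 0.0
    score += 0.3 if p.seconds_since_last_payment < 10 else 0.0
    score += 0.3 if p.country_mismatch else 0.0
    return score

def authorize(p: Payment, threshold: float = 0.7) -> bool:
    """Reject the payment synchronously if the score crosses the threshold."""
    return fraud_score(p) < threshold

print(authorize(Payment(amount=2500.0,
                        seconds_since_last_payment=4.2,
                        country_mismatch=True)))   # False -> reject
```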
Other situations where real-time processing saves money are customer service and cart abandonment. Catching customer frustration in a call center and immediately escalating the situation is important to make the service effective—it costs far more to reacquire a customer once lost than to give them good service in the moment. Similarly, if a cart is at risk of being abandoned, offering an enticement such as 5% off or free shipping may cost less than the much larger promotions required to get the customer back on the website.
In other situations, batch processing is simply not an effective option. Real-time traffic data and real-time navigation models are required for Google Maps to allow drivers to avoid traffic.
As you will see in Chapter 8, the resilience and autoscaling capability of cloud services is hard to achieve on premises. Thus, real-time ML is best done in the cloud.
MLOps
Another reason that ML is better in the public cloud is that operationalizing ML is hard. Effective and successful ML projects require operationalizing both data and code. Observing, orchestrating, and acting on the ML lifecycle is termed MLOps.
Building, deploying, and running ML applications in production entails several stages, as shown in Figure 1-11. For the incoming data, you have to perform data preprocessing and validation to make sure there are no data quality issues, followed by feature engineering, then model training, and finally hyperparameter tuning. All these steps need to be orchestrated and monitored; if, for example, data drift is detected, the models may need to be retrained automatically. Models have to be retrained on a constant basis and redeployed, after verifying that they are safe to deploy.
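The skeleton below sketches how those stages might be wired together, with retraining triggered by a naive drift check. The stage bodies are toy stand-ins chosen only for illustration; a real deployment would run each stage as a managed, monitored task in one of the pipeline services mentioned shortly.

```python
# Skeleton of an orchestrated ML workflow: validation -> feature
# engineering -> training/tuning -> deployment, with retraining triggered
# when the serving data drifts from the training data. Stage bodies are
# toy stand-ins for illustration only.
import numpy as np

def validate_data(raw):
    # Data quality gate: drop rows containing NaNs.
    return raw[~np.isnan(raw).any(axis=1)]

def engineer_features(data):
    # Toy feature engineering: standardize each column.
    return (data - data.mean(axis=0)) / (data.std(axis=0) + 1e-9)

def train_and_tune(features):
    # Stand-in for model training plus hyperparameter tuning.
    return features.mean(axis=0)

def drift_detected(live, reference, tol=0.5):
    # Naive drift check: mean shift of any raw feature beyond a tolerance.
    return bool(np.any(np.abs(live.mean(axis=0) - reference.mean(axis=0)) > tol))

def run_pipeline(raw):
    data = validate_data(raw)
    model = train_and_tune(engineer_features(data))
    # A deploy_if_safe(model) step with canary checks would go here.
    return model, data   # keep training data as the drift reference

# Continuous operation: retrain when live data no longer looks like
# the data the current model was trained on.
model, reference = run_pipeline(np.random.rand(100, 3))
live_batch = np.random.rand(50, 3) + 1.0   # simulated drifted data
if drift_detected(live_batch, reference):
    model, reference = run_pipeline(live_batch)
```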
In addition to the data-specific aspects of monitoring discussed, you also have the monitoring and operationalization that is necessary for any running service. A production application is often running continuously 24/7/365, with new data coming in regularly. Thus, you need tooling that makes it easy to orchestrate and manage these multiphase ML workflows and to run them reliably and repeatedly.
Cloud AI platforms such as Google’s Vertex AI, Microsoft’s Azure Machine Learning, and Amazon’s SageMaker provide managed services for the entire ML workflow. Doing this on premises requires you to cobble together the underlying technologies and manage the integrations yourself.
At the time of writing this book, MLOps capabilities are being added at a breakneck pace to the various cloud platforms. This brings up an ancillary point, that with the rapid pace of change in ML, you are better off delegating the task of building and maintaining ML infrastructure and tooling to a third party and focusing on data and insights that are relevant to your core business.
In summary, a cloud-based data and AI platform can help resolve traditional challenges with data silos, governance, and capacity while enabling the organization to prepare for a future where AI capabilities become more important.
Core Principles
When designing a data platform, it can help to set down key design principles to adhere to and the weight that you wish to assign to each of these principles. It is likely that you will need to make trade-offs between these principles, and having a predetermined scorecard that all stakeholders have agreed to can help you make decisions without having to go back to first principles or getting swayed by the squeakiest wheel.
Here are the five key design principles for a data analytics stack that we suggest, although the relative weighting will vary from organization to organization:
- Deliver serverless analytics, not infrastructure.
-
Design analytics solutions for fully managed environments and avoid a lift-and-shift approach as much as possible. Focus on a modern serverless architecture so that your data scientists (we use this term broadly to include data engineers, data analysts, and ML engineers) can keep their focus purely on analytics and move away from infrastructure considerations. For example, use automated data transfer to extract data from your systems and provide an environment for shared data with federated querying across any service (see the sketch after this list). This eliminates the need to maintain custom frameworks and data pipelines.
- Embed end-to-end ML.
-
Allow your organization to operationalize ML end to end. It is impossible to build every ML model that your organization needs, so make sure you are building a platform within which it is possible to embed democratized ML options such as prebuilt ML models, ML building blocks, and easier-to-use frameworks. Ensure that when custom training is needed, there is access to powerful accelerators and customizable models. Ensure that MLOps is supported so that deployed ML models don’t drift and become no longer fit for purpose. Make the ML lifecycle simpler on the entire stack so that the organization can derive value from its ML initiatives faster.
- Empower analytics across the entire data lifecycle.
-
The data analytics platform should offer a comprehensive set of core data analytics workloads. Ensure that your data platform provides data storage, data warehousing, streaming data analytics, data preparation, big data processing, data sharing and monetization, business intelligence (BI), and ML. Avoid buying one-off solutions that you will have to integrate and manage yourself. Looking at the analytics stack holistically will, in return, allow you to break down data silos, power applications with real-time data, add read-only datasets, and make query results accessible to anyone.
- Enable open source software (OSS) technologies.
-
Wherever possible, ensure that open source is at the core of your platform. You want to ensure that any code that you write uses OSS standards such as standard SQL, Apache Spark, TensorFlow, etc. By enabling the best open source technologies, you will be able to provide flexibility and choice in data analytics projects.
- Build for growth.
-
Ensure that the data platform that you build will be able to scale to the data size, throughput, and number of concurrent users that your organization is expected to face. Sometimes, this will involve picking different technologies (e.g., SQL for some use cases and NoSQL for other use cases). If you do so, ensure that the two technologies that you pick interoperate with each other. Leverage solutions and frameworks that have been proven and used by the world’s most innovative companies to run their mission-critical analytics apps.
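As an example of the serverless, federated access advocated in the first principle above, the sketch below joins warehouse data with rows queried in place from an operational database, using BigQuery's EXTERNAL_QUERY function. The connection ID, dataset, and table names are hypothetical, and comparable federation features exist on other platforms.

```python
# Sketch of a federated query: join warehouse data with rows queried in
# place from an operational Cloud SQL database, with no pipeline to build
# or maintain. Connection ID, dataset, and table names are placeholders.
from google.cloud import bigquery

client = bigquery.Client()

sql = """
    SELECT o.customer_id,
           c.segment,
           SUM(o.amount) AS total_spend
    FROM mydataset.orders AS o
    JOIN EXTERNAL_QUERY(
        'my-project.us.crm-connection',
        'SELECT customer_id, segment FROM customers'
    ) AS c
    USING (customer_id)
    GROUP BY o.customer_id, c.segment
"""

for row in client.query(sql).result():
    print(row.customer_id, row.segment, row.total_spend)
```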
Overall, these factors are listed in the order that we typically recommend them. Since the two primary motivations of enterprises in choosing to do a cloud migration are cost and innovation, we recommend that you prioritize serverless (for cost savings and freeing employees from routine work) and end-to-end ML (for the wide variety of innovation that it enables).
In some situations, you might want to prioritize some factors over others. For startups, we typically recommend that the most important factors are serverless, growth, and end-to-end ML. Comprehensiveness and openness can be sacrificed for speed. Highly regulated enterprises might favor comprehensiveness, openness, and growth over serverless and ML (i.e., on premises might be necessitated by regulators). For digital natives, we recommend, in order, end-to-end ML, serverless, growth, openness, and comprehensiveness.
Summary
This was a high-level introduction to data platform modernization. Starting from the definition of the data lifecycle, we looked at the evolution of data processing, the limitations of traditional approaches, and how to create a unified analytics platform on the cloud. We also looked at how to extend the cloud data platform to be a hybrid one and to support AI/ML. The key takeaways from this chapter are as follows:
-
The data lifecycle has five stages: collect, store, process, analyze/visualize, and activate. These need to be supported by a data and ML platform.
-
Traditionally, organizations’ data ecosystems consist of independent solutions that lead to the creation of silos within the organization.
-
Data movement tools can break data silos, but they impose a few drawbacks: latency, data engineering resource bottlenecks, maintenance overhead, change management, and data gaps.
-
Centralizing control of data within IT leads to organizational challenges: IT departments don’t have the necessary skills, analytics teams get poor data, and business teams do not trust the results.
-
Organizations need to build a cloud data platform to obtain best-of-breed architectures, handle consolidation across business units, scale on-prem resources, and plan for business continuity.
-
A cloud data platform leverages modern approaches and aims to enable data-led innovation through replatforming data, breaking down silos, democratizing data, enforcing data governance, enabling decision making in real time and using location information, and moving seamlessly from descriptive analytics to predictive and prescriptive analytics.
-
All data can be exported from operational systems to a centralized data lake for analytics. The data lake serves as the central repository for analytics workloads and for business users. The drawback, however, is that business users do not have the skills to program against a data lake.
-
DWHs are centralized analytics stores that support SQL, something that business users are familiar with.
-
The data lakehouse is based on the idea that all users, regardless of their technical skills, can and should be able to use data. By providing a centralized and underlying framework for making data accessible, different tools can be used on top of the lakehouse to meet the needs of each user.
-
Data mesh introduces a way of seeing data as a self-contained product. Distributed teams in this approach own the data production and serve internal/external consumers through well-defined data schemas.
-
A hybrid cloud environment is a pragmatic approach to meet the realities of the enterprise world such as acquisitions, local laws, and latency requirements.
-
The ability of the public cloud to provide ways to manage large datasets and provision GPUs on demand makes it indispensable for all forms of ML, but deep learning and generative AI in particular. In addition, cloud platforms provide key capabilities such as democratization, easier operationalization, and the ability to keep up with the state of the art.
-
The five core principles of a cloud data platform are to prioritize serverless analytics, end-to-end ML, comprehensiveness, openness, and growth. The relative weights will vary from organization to organization.
Now that you know where you want to land, in the next chapter, we’ll look at a strategy to get there.
1 Not just the cost of the technology or license fees—the cost here includes people costs, and SQL skills tend to cost less to an organization than Java or Python skills.
2 Recent ML systems such as AlphaGo learn by looking at games played between machines themselves: this is an advanced type of ML called reinforcement learning, but most industrial uses of ML are of the simpler supervised kind.