Chapter 1. Should AI and Analytics at Scale Be Difficult?

Compromise is a fine thing to do. Unless it’s unnecessary.

 

It’s true that building artificial intelligence (AI) and analytics systems at large scale isn’t easy, but there’s more to the story than that.

Successful AI and analytical systems—systems that work well in production, where value is realized—don’t have to be as difficult as people often make them. It’s not that people intentionally make life harder than necessary; it’s just that they may start with some faulty assumptions that can lead to accepting unnecessary compromises. But it doesn’t have to be that way.

We’ll examine some of these faulty assumptions and what they are based on throughout this chapter, but first it’s important to realize that they are not necessarily the result of bad reasoning. Instead, faulty assumptions often arise from limitations imposed by underlying technologies and weak design, limitations that could be avoided.

Some limitations aren’t apparent at small scale but show up as systems go into production at large scale. Some limitations involve legacy technologies that don’t adapt well to all the demands of scale in production. Some modern technologies designed for large-scale systems don’t mix well with legacy applications, or with a wide range of tools.

In any of these cases, if your current technology choices for basics such as data infrastructure and a framework to orchestrate computation lack key capabilities, it’s natural for you to assume you have to accept trade-offs. You may feel you have to choose between scalability and reliability, or between scalability and affordability. You may feel that each AI or analytics system needs to have separate data infrastructure, or feel stuck with trade-offs between scalability and availability or agility. But these trade-offs, and the limitations that give rise to them, are not inescapable.

Note

Recognizing how your current technology and practices may be forcing your choices is a good first step to architect and build better production systems at lower cost.

Innovations in technology can change the whole paradigm for what is possible. One only needs to have the vision to see—and take advantage of—the new possibilities.

Consider what happened in the late nineteenth and early twentieth centuries as skyscrapers first began to change entire cityscapes. Previously, the large, imposing buildings found in cities like New York and Chicago—the brick- or stone-sided mansions, office buildings, banks, and department stores—were mostly less than four stories tall, and always less than ten stories. Then, over the course of about 45 years, everything changed. Tall buildings began to rise up above the rest, creating an entirely new skyline.

Skyscrapers were not just built on a bigger scale. A 50–100 story building, or even a 10-story building, is not just a taller version of a 3-story building. There are fundamental differences between even the early skyscrapers and prominent but relatively short city buildings of a few stories that came before. These differences became possible because of technological advances.

One early enabling technology was the development of the passenger-bearing safety elevator. Without elevators, it would have been impractical to use tall buildings even if it were structurally possible to build them. Fireproofing was another technology that was essential for taller buildings.

But the biggest shift in possibilities came about with the technological advance of using a steel framework as the fundamental infrastructure to support the weight of a building, rather than relying on load-bearing walls of brick and stone (as was previously the standard way to build). This was a revolutionary idea.

Steel beam construction freed architects from limitations imposed by traditional ways of building. Although early skyscrapers resembled taller versions of 3-story buildings (from the outside, anyway), they were radically different thanks to new technology. Figure 1-1 shows a New York skyline.

Figure 1-1. The Chrysler Building at 42nd Street and Lexington Avenue rose up above the New York cityscape in May 1930 as a classic Art Deco–style skyscraper. It was briefly the tallest building in the city, and it remains one of the most beautiful. (Image by Vivienne Gucwa)

New architectural design options based on a combination of technological advances introduced a whole new way of using space. Not only was there much more usable building space on each acre of real estate, but businesses could also develop new ideas about how to interact and collaborate.

Larger scale for buildings would not have been possible by just doing more of the same. It is a classic example of the modern saying, often used in relation to growing start-ups, that what got you here won’t get you there.

By analogy, if your assumptions about large-scale AI and analytics systems are based on old approaches that do not leverage technological advances in data infrastructure and computational orchestration, you may be missing out on the equivalent of the new skyscraper.

It’s not about whether you can get large-scale AI or analytics systems working at all. It’s about whether or not your system is what we call “scale-efficient.” Such systems are optimized to be cost-effective and provide the flexibility needed for you to quickly respond to changes and new opportunities, even as scale increases.

We’ve been surprised to find that some people think they must repeatedly re-architect their systems in order to increase scale or to try out new ideas. That’s because their design and underlying technologies make their system cumbersome and brittle, unable to pivot effectively as conditions change (and they will change!). Equally surprising is that many people assume AI and machine learning systems must be developed and run on different infrastructure than that of analytics systems. Or they think that building large systems has to be enormously expensive.

This is simply not the case.

We’ve seen a big difference in the way people work with AI and analytics systems at scale, and there is a particular style that characterizes highly successful, scale-efficient systems.

Advanced technology alone is not sufficient to address the problems imposed by large-scale systems. A variety of practices also help make large-scale systems work better and at reasonable cost. And that matters, because for most situations, saying “it can’t be done at a reasonable cost” is the same as saying “it can’t be done.”

In this report, we examine the challenges of working at scale, and, starting in Chapter 2, we identify a pattern of fundamental approaches seen in successful scale-efficient systems. Those insights can keep you from missing out on the advantages that result from this style of work.

Note

Some aspects of building and maintaining scale-efficient AI and analytics systems may superficially resemble DevOps, but there are fundamental differences.

Before we examine what works to build scale-efficient AI and analytics, let’s first start by digging into why working at scale can be difficult and, more specifically, how some people unintentionally make it even harder.

Data Challenges at Scale

If you think that being able to store lots of data means you’ve met the challenges of data, think again. How you store data and how you manage it shape your architecture, affect how teams work with data and with each other, and affect how applications can interact. They also affect how well your system scales and how cost-effective it is. One common reason that people make working with large-scale AI and analytics systems harder than it has to be is that they underestimate the importance of data and how to manage it, from development to production. Here are some of the data challenges that need to be addressed in a scale-efficient system.

Effective Data Storage

To deal with large amounts of data and the applications that use it, modern system architects have turned to breaking things into pieces and working in a distributed way. This is the essence of what is known as a scale-out architecture. Computations are broken into pieces that can run in parallel across many computers. That allows larger computations and lets them run faster than they would on a single computer. We’ll talk about that more in the next section.
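To make the scale-out idea concrete, here is a minimal sketch in Python. It splits one computation into independent chunks and runs them in parallel; the chunk size, worker count, and toy workload are placeholder assumptions, and a local process pool stands in for what a real system would spread across many machines.

```python
# A minimal sketch of scale-out: split one large computation into independent
# chunks and run them in parallel. Here the "cluster" is just local processes;
# a real distributed system would place chunks on many machines.
from concurrent.futures import ProcessPoolExecutor

def process_chunk(chunk):
    # Stand-in for real work done on one piece of the data.
    return sum(x * x for x in chunk)

def split(data, n_chunks):
    size = max(1, len(data) // n_chunks)
    return [data[i:i + size] for i in range(0, len(data), size)]

if __name__ == "__main__":
    data = list(range(1_000_000))
    chunks = split(data, n_chunks=8)
    with ProcessPoolExecutor(max_workers=8) as pool:
        partial_results = list(pool.map(process_chunk, chunks))
    total = sum(partial_results)   # combine the partial results
    print(total)
```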

Typically, as data grows it is also broken into pieces and stored on multiple machines. That, too, is a clever approach, but breaking things into many pieces also requires having a way to coordinate the parts, and this is a point where scale-efficient systems often differ from less effective alternatives. In a scale-efficient system, software serves as the data infrastructure, and it needs to orchestrate where the pieces of data are stored in a way that does not lose data and that avoids performance bottlenecks. The management capabilities of the data storage infrastructure also need to conveniently control who has (and does not have) access to data and provide efficient mechanisms for data movement within a cluster or between clusters.

All this should be scalable, reliable, and highly available and should allow a variety of data formats coming from a variety of data sources. When you have a lot of data, you need a convenient way to find it, so exactly how files and tables are named and located makes a big difference to the efficiency and speed of a data infrastructure. And, for a scale-efficient system, data infrastructure should provide a way to tier less frequently used data to cost-effective storage systems.

Another challenge for data in large-scale systems, given the fast-growing use of containers, is allowing applications to be containerized but still interact with other applications, as we discuss in the next section.

Data, Kubernetes, and Containerized Applications

Distributed computations require three major things. First is some way to package up programs that do the computation. Second is some way to orchestrate all the pieces of computation. Third is access to data.

Containerization of applications is ideal for the packaging step because it allows programs to be run in predictable, customized environments. A key advantage is the ability to control the specific environment your program requires (including system libraries, language version, and additional packages). The environment for one containerized application doesn’t have to match the environment needed for other applications.

That leaves the second requirement: an organizing framework to coordinate where and when these containers will run. Kubernetes, originally developed at Google, has emerged as the leading framework for orchestrating containerized computations, both on premises and in cloud deployments.

In Chapter 2, we examine more about how containerization and Kubernetes work in real-world use cases running many applications. But even as many people turn to Kubernetes to orchestrate computation, they may overlook the third requirement: access to data and an analogous need to orchestrate the data used by containerized applications.

Most applications need to interact with persistent data: as input, to store intermediate values, and to store final results. This persistent data is referred to as “state.” Also, different containers, from the same or different applications, often need to interact with the same data. Unless you want to be limited to stateless applications, you need a way for containerized applications to store and access data. Further, that data cannot be tied to individual containers. It isn’t good enough to just give each container some private disk space.

So you must not only think about orchestrating the computational parts of a large application (a problem Kubernetes largely solves); you also have to plan how those parts will coordinate with your data infrastructure. Kubernetes provides some help in the form of a standardized method known as the Container Storage Interface (CSI) to connect containerized workloads with data storage systems.
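As a rough illustration of how that connection is typically requested, the sketch below uses the official Kubernetes Python client to ask for a shared, ReadWriteMany volume through a CSI-backed StorageClass. The StorageClass name ("shared-csi"), the namespace ("analytics"), and the volume size are assumptions made for the example; whether ReadWriteMany access is actually available depends on the CSI driver your data platform provides.

```python
# A minimal sketch: request a shared volume for containerized workloads via a
# CSI-backed StorageClass, instead of giving each container private disk space.
# "shared-csi" and the "analytics" namespace are hypothetical names.
from kubernetes import client, config

config.load_kube_config()   # use config.load_incluster_config() inside a pod
core = client.CoreV1Api()

pvc_manifest = {
    "apiVersion": "v1",
    "kind": "PersistentVolumeClaim",
    "metadata": {"name": "shared-data"},
    "spec": {
        "accessModes": ["ReadWriteMany"],      # many pods can read and write
        "storageClassName": "shared-csi",      # provided by the CSI driver
        "resources": {"requests": {"storage": "100Gi"}},
    },
}

core.create_namespaced_persistent_volume_claim(namespace="analytics", body=pvc_manifest)
```

Pods from different applications can then mount the same claim, which is what makes shared state between containers practical.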

That is, however, not the only issue for data. Is your system limited to just storage in the form of raw disk blocks? Or in the form of container-local file systems? Can it handle distributed files and tables? Does it support event streams? Can individual objects scale to a size larger than any single physical machine can store? And do you have to use many different systems to meet different needs, thus introducing complications due to managing multiple systems? As popular and powerful as it is, Kubernetes doesn’t address the need for data orchestration, nor does it free you from limitations that your data infrastructure may impose. It’s therefore important to identify what those limitations are and determine to what extent they are escapable.

In-Production Data from the Start

Data can effectively be in production even before any code is written, but this fact is frequently overlooked until it’s too late. Think about that: by the time an application is tested and ready to be deployed into production, critical data may have been “corrected” or deleted or lost. This problem is particularly acute for AI or machine learning systems, where you do not know a priori which aspects of the data will turn out to be important.

Flexibility of Data Access

Another data challenge is flexibility of access. Does your data infrastructure force you to copy data between systems in order to work with different tools? Does your system provide fine-grained and consistent control over who has access without imposing a giant burden on IT teams?

If your storage system imposes limitations on how easily you can deal with changes in scale and changes in tools, or if it limits access by legacy applications, you’re accepting trade-offs that aren’t necessary. That will have rippling negative impacts throughout development and production. In short, it’s not just a matter of whether or not your system works at all; it’s a question of whether it works reliably, efficiently, and affordably at large scale.

Of course, data is just one of the challenges you face in dealing with analytics and AI at scale in production. Data needs are so often overlooked, however, that we are going to focus more heavily on issues surrounding data than on other issues in this report. But before we look in more detail at the aspects of data infrastructure that can best support your systems, let’s delve into a more fundamental question about the overall challenge of going from development into production.

What Makes Production Break?

It’s not unusual to find that projects that seemed to be running well in development begin to encounter problems as they go into production. We have all heard somebody say, “It runs on my machine.” What, then, is it about going from development to production that makes such a difference?

Common reasons that things may break in going from development to production include:

  1. Changes in execution environment

  2. Requirement to meet service-level agreements (SLAs)

  3. Working with larger scale data or different data

All of these issues have to do with changes between development and production settings. To the degree that we can minimize this difference, we can also minimize the risk of encountering problems during deployment.

We’ve already mentioned how containers help control the execution environment of code. The ability to easily set up a customized environment lets us develop, test, and deploy programs under more specific and predictable conditions and control exactly which code dependencies they have. In addition, Kubernetes lets us control the network environment much more tightly, since networking is a key part of an application’s environment.

Perhaps the biggest difference between development and production is that in production your system has to meet service-level agreements (SLAs). Obviously, we try to build correct systems in the first place, and we test an application before putting it into production, but we may miss some issues because our testing environment differs from production. A particular application may be able to meet its SLA for latency when tested in isolation during development, yet fail when run in a production setting where other applications are competing for resources. There can also be expectations about data, or about how a system interacts with other systems, that aren’t written down explicitly and thus don’t get fully tested before deployment. These expectations effectively function as SLAs in production.

It’s important to get ahead of these issues; fixing is always harder than doing things right in the first place. Moreover, continual fixes are a form of instability that can lead to further complications as more and more systems are deployed and interact. It isn’t news, but it bears repeating: it is better to test against your SLAs (both explicit and implicit) well in advance of production and have infrastructure that helps you develop and test in as realistic an environment as possible. Also keep in mind that it’s not just code that must meet SLAs; data also can have SLAs attached. In Chapter 2, we explore how people meet these data guarantees as part of an efficient overall design.
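As one small, concrete illustration of testing an explicit SLA early, here is a sketch of an automated latency check. The endpoint, request count, and 200 ms p99 target are placeholder assumptions; the point is simply that the check runs against as realistic an environment as possible before deployment, not after.

```python
# A minimal sketch of testing a latency SLA before deployment: measure request
# latencies and fail the test if the 99th percentile exceeds the agreed target.
# The endpoint, load level, and 200 ms threshold are hypothetical.
import statistics
import time
import urllib.request

SLA_P99_SECONDS = 0.200
ENDPOINT = "http://localhost:8080/score"   # hypothetical service under test

def measure_latencies(n_requests: int = 500) -> list[float]:
    latencies = []
    for _ in range(n_requests):
        start = time.perf_counter()
        urllib.request.urlopen(ENDPOINT, timeout=5).read()
        latencies.append(time.perf_counter() - start)
    return latencies

def test_latency_sla():
    latencies = measure_latencies()
    p99 = statistics.quantiles(latencies, n=100)[98]   # 99th percentile
    assert p99 <= SLA_P99_SECONDS, f"p99 latency {p99:.3f}s exceeds SLA"
```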

Larger scale can be a major challenge in testing. This is particularly true if your architecture and your infrastructure are not designed to handle changes in scale well. The old moan we mentioned above, “it runs on my machine,” turns into “it ran at my scale!” It’s difficult to extrapolate from the behavior of a program at small scale to its behavior at large scale. For that reason, if you only test at a small scale, the transition into production can be risky. Keep this transition in mind and consider developing and testing your applications at production scale and speed. Start as you plan to go.

Security at Scale

The need for security is a given in any system, but it is particularly important for production systems. There are a number of ways to deal with security, but the issue we want to address is that security does not always scale. The manner by which you ensure security at a local or small scale will not necessarily work reliably at large scale. More importantly, people may be caught unaware by a false sense of confidence in security measures that appear to work at the scale used during development but fail at production scale. The point at which you deploy applications into production is not a good time to figure out if your security measures can handle production scale.

Furthermore, the expanding use of containers to run large applications raises new questions about how to handle security. In this case, it’s not just scale in terms of the amount of data that is the issue. Rather, the problem arises because many containers work together in these large applications; the old concept of a Linux user ID as the owner of a process becomes unusable because it is inherently embedded in the concept of an application running on a single machine. As applications become more complex by running on many containers, our approach to security must adapt and change.

One interesting new open source technology, SPIFFE (Secure Production Identity Framework for Everyone), offers a way to deal with security for such large applications by defining a cryptographically attestable workload identity that can be used to secure communication channels between processes and to establish trust relationships between services.
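To give a feel for what a workload identity looks like, the sketch below checks a SPIFFE ID of the form spiffe://trust-domain/workload-path against a simple allow list. This is a conceptual illustration only, not the API of any SPIFFE library: in practice the ID arrives inside a cryptographically verifiable SVID and is validated by SPIFFE/SPIRE tooling, and the trust domain and workload paths shown here are hypothetical.

```python
# A conceptual sketch only (not a real SPIFFE library API): a SPIFFE ID is a
# URI such as spiffe://example.org/billing/api. Real deployments verify the ID
# cryptographically via an SVID; here we only show the shape of an
# authorization decision made on the identity rather than on a Linux user ID.
from urllib.parse import urlparse

TRUSTED_DOMAIN = "example.org"                        # hypothetical trust domain
ALLOWED_CALLERS = {"/billing/api", "/reporting/etl"}  # hypothetical workload paths

def is_authorized(spiffe_id: str) -> bool:
    parsed = urlparse(spiffe_id)
    return (
        parsed.scheme == "spiffe"
        and parsed.netloc == TRUSTED_DOMAIN
        and parsed.path in ALLOWED_CALLERS
    )

print(is_authorized("spiffe://example.org/billing/api"))   # True
print(is_authorized("spiffe://evil.example/billing/api"))  # False
```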

What Does Scale Really Mean?

Today, enterprises are building analytic and AI applications that harvest the great value of large amounts of data. You can ask and answer questions with large data sets that were just not possible at smaller scale. It’s natural for organizations to look for new data sources and to collect and use larger quantities of data than in the past. Such large-scale data can be a challenge for production use, but data size is just one of the aspects of scale that should be considered in order to build a successful production system. Let’s take a look at the different meanings of scale, starting with the most straightforward.

Scale in Terms of Data Size

What amount of data should be thought of as large scale? Think of data as large when the size itself becomes a technical challenge. Roughly speaking, people who work with less than 50 terabytes may not particularly need to address the challenges of large scale, unless they are likely to see rapid data growth. In contrast, the businesses we work with routinely have data ranging from 50 terabytes to 500 petabytes and beyond, so they need data infrastructure that can handle this scale without having to resort to costly work-arounds, such as cluster proliferation.

The critical thing about large data size, however, is to not make it bigger than it needs to be. In response to limitations imposed by some data technologies, people make unnecessary copies of huge data sets. One way this happens is through the use of data infrastructure that lacks open access APIs. Such data systems cause people to copy data between specialized systems to accommodate different analytics or machine learning tools and languages. This unnecessary copying is particularly common in machine learning and AI projects, where data scientists routinely use a wide variety of favorite tools that depend on standard file access. In contrast, data engineers commonly use legacy big-data tools. AI and machine learning tools generally do not directly access data stored in some of the so-called big-data platforms. The result is a proliferation of redundant copies.

Another reason data gets copied unnecessarily is to deal with “noisy neighbors”—competing applications that may create hot spots or congestion. Data infrastructure that lacks fully distributed metadata can also result in metadata congestion when multiple applications access large data sets. All these limitations can lead to unnecessary data copying and result in sprawling, expensive, cumbersome systems. These problems are not inescapable, however. There are data technologies that avoid these limitations, as we discuss in Chapter 2.

All of those issues of scale have to do with the amount of data, but there are other concerns of scale as well.

Scale in Terms of the Number of Files or Other Objects

While people generally are aware that total data size is an important requirement, they may overlook the equally important requirement of being able to handle a large number of files or other data objects. Working with hundreds of millions or billions or even more files can swamp you unless your data infrastructure is designed to handle a very large number of objects as well as a large amount of data. Even worse, if you have three applications that each require 80 million files and a platform that only handles 100 million at most, you are forced to have three separate platforms. Furthermore, if someone were to run a test on one of these three platforms at production scale, that could overwhelm the platform and take out a production system.

While most businesses don’t have to deal with truly colossal scale in terms of the number of objects, use cases typically do involve tens of millions to hundreds of millions of objects, which is enough to cause difficulties with many systems. Metrics-oriented service companies, for instance, tend to have a large number of files. Consumer websites are a typical example: they often need to display many images of the things they sell, and they may offer many versions of each item or service. An online retail catalog lists multiple colors, sizes, and views for each of many products. All of these variations must be stored as images, resulting in hundreds of images per product. When this requirement is multiplied by millions of products, you have to handle a huge number of small images. So remember to look beyond total data size as you assess your system’s ability to handle scale: can your system reliably handle large numbers of objects as well? Can it handle that many objects for, say, 100 applications simultaneously?
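Back-of-the-envelope arithmetic makes the point. The catalog size, variant counts, and image sizes below are hypothetical, but they show how quickly an unremarkable retail catalog turns into hundreds of millions of small objects even though the total data volume stays modest.

```python
# Back-of-the-envelope arithmetic with hypothetical numbers: a mid-sized retail
# catalog can produce hundreds of millions of small objects, which is a
# different challenge from total data volume.
products = 2_000_000          # hypothetical catalog size
variants_per_product = 12     # colors, sizes, etc.
images_per_variant = 25       # views, zoom levels, thumbnails

total_images = products * variants_per_product * images_per_variant
print(f"{total_images:,} image objects")          # 600,000,000 image objects

avg_image_kb = 150            # hypothetical average image size
total_tb = total_images * avg_image_kb / 1_000_000_000
print(f"~{total_tb:.0f} TB of image data")        # ~90 TB
```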

Scale Can Mean Many Applications or Many Teams

As we mentioned in the context of the usefulness of Kubernetes, one key issue for production is that an application almost certainly won’t be the only program running on the system. Your architecture and infrastructure should not limit you to a few applications on the same platform at the same time. If it does limit you this way, your system will quickly become too expensive due to cluster proliferation. The entry cost of every new application or project becomes unnecessarily high if you have to set up a new cluster to support each one. In addition, creating lots of new systems or platforms to support new applications means you wind up with lots of data silos. If that happens, you will lose the advantage of a comprehensive data strategy and lose opportunities for collaboration.

Many applications generally means lots of teams as well. To make these teams productive, they should have as much autonomy as is compatible with security and the ability to manage the data infrastructure. For efficiency, and to reduce the burden on IT, it is important to delegate many simple tasks to these teams.

All this requires a data infrastructure that makes it easy to manage who does and who does not have access to data across huge systems and huge numbers of objects. Unless you have effective and expressive ways to control access by different teams, no matter what tools they use, you will find it very difficult to maintain secure production systems. The same thing will happen if you have multiple platforms that cannot use the same languages to express access controls.
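What “expressive” means in practice is the ability to state access rules once, in terms of groups or roles, at the data-platform level, and have them apply no matter which tool or language touches the data. The sketch below is a deliberately generic illustration of that idea, not the ACL syntax of any particular platform; the users, groups, and policy are made up.

```python
# A minimal, generic sketch (not any particular product's ACL syntax): a
# group-based access rule evaluated at the data-platform level, so the same
# rule applies regardless of the tool used to access the data.
USER_GROUPS = {                      # hypothetical directory lookup
    "asha": {"data-science", "fraud-team"},
    "bo":   {"marketing"},
}

# Rule: readable by data-science or fraud-team, writable only by fraud-team.
POLICY = {"read": {"data-science", "fraud-team"}, "write": {"fraud-team"}}

def allowed(user: str, action: str) -> bool:
    return bool(USER_GROUPS.get(user, set()) & POLICY.get(action, set()))

print(allowed("asha", "write"))   # True  (asha is in fraud-team)
print(allowed("bo", "read"))      # False (marketing has no read grant)
```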

Scale in Terms of Geo-distributed Locations

Another kind of scale happens when data comes from geo-distributed sources or when you need to run applications in many dispersed locations, possibly due to data regulatory requirements or just simple physics.

The problems of dealing with this geographical scale include how to capture large amounts of data near the source, how and when to send some or all of that data back to core data centers, and how to deliver analytic applications and machine learning models to these edge locations.

One classic example is a system that captures IoT sensor data and does some partial data processing or data modeling in the same place. Other examples include service providers, retail companies, and even financial companies. In general, the edge is where your business actually happens, where your business touches your customers: in stores and online. In Chapter 4, we treat the edge as a special topic in building successful production systems at scale.

Scalability Is as Important as Scale

No person ever steps in the same river twice.

Adage based on the philosophy of Heraclitus

It’s one thing to have a design and infrastructure that can handle your current scale, but what happens as things change? Because they will.

More data is collected, new data sources are tapped, new applications are deployed, new lines of business are opened. This is the natural progression in successful enterprises, so it’s not enough to be able to handle your current needs. You must also be prepared for a change in scale.

Sometimes scale increases as a result of having a successful platform. A technically efficient and convenient platform can attract projects away from less successful platforms. This shift rapidly drives up the scale on the attractive platform. Success attracts success. That’s a good thing as long as your current infrastructure and design can deal with change. Businesses shouldn’t have to re-architect their systems as scale changes, or say “no” to new, potentially valuable opportunities because current production systems and the IT teams that support them cannot handle the added pressure. They also shouldn’t maroon projects on second-rate platforms. But that’s what often happens if the overall design and infrastructure of your organization impose unnecessary trade-offs due to lack of this kind of flexibility.

The idea is to think beyond scale to scalability—the ability of your system to withstand change in scale in multiple ways.

Unfortunately, people may not correctly assess the potential scalability of their systems, or they think it’s fine to be “getting by” with a system that works at their current scale but cannot meet expanding needs. This problem may be caused by acceptance of a false economy. Figure 1-2 shows how such false economies can lead to a (temporary) preference for an unscalable system.

Infrastructure that appears to be cheaper and adequate at lower scale can turn out to be quite expensive. Furthermore, you can wind up in real trouble with seemingly innocent changes. Imagine you have two platforms, each operating at the top of the false economy range. Combining these platforms could be disastrous because both systems will be pushed into an unscalable regime of operations.

Figure 1-2. A scalable system is one where costs go up in direct proportion to scale. An unscalable system is one where the costs go up at an accelerating rate as scale increases. There may be a region of false economy where an unscalable system appears cheaper. But if scale is increased, the situation will eventually reverse, possibly catastrophically (for the unscalable system).
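The sketch below puts hypothetical numbers on the two cost profiles in Figure 1-2: a linear cost model for the scalable system and an accelerating (here quadratic) model for the unscalable one. The constants and the quadratic form are assumptions chosen only to show the crossover; real cost curves will differ, but the shape of the argument is the same.

```python
# A sketch of the two cost profiles in Figure 1-2, with made-up constants:
# a scalable system whose cost grows in proportion to scale, and an unscalable
# one whose cost accelerates. Below the crossover point, the unscalable system
# looks cheaper -- the "false economy" region.
def cost_scalable(scale, unit_cost=2.0):
    return unit_cost * scale                 # cost proportional to scale

def cost_unscalable(scale, base=0.5, exponent=2.0):
    return base * scale ** exponent          # cost accelerates with scale

for scale in (1, 2, 4, 8, 16):
    s, u = cost_scalable(scale), cost_unscalable(scale)
    flag = "unscalable looks cheaper" if u < s else "scalable wins"
    print(f"scale={scale:2d}  scalable={s:6.1f}  unscalable={u:6.1f}  ({flag})")
```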

Consider people using the poorly scalable systems with the profile shown by the dotted line in Figure 1-2. It’s not that they cannot increase scale. It’s that they may feel pressure to stay in the false economy range to avoid the sharply increasing costs they will face if scale increases. They won’t want to allow other projects to use their systems. They won’t want to take on large new data sources that could push them well into the bad scaling region. Their technology is limiting their options unnecessarily. Feeling pressure to stay in the false economy zone is a symptom that you are accepting trade-offs that need to be re-examined because they might well be escapable.

Cost in this case is not limited to monetary cost. There can also be time costs in how quickly development is carried out. There are performance costs when applications run slower on a cumbersome system. There are complexity costs associated with work-arounds that try to avoid scaling. Another particularly painful cost is the slow response from IT teams that results from bad human scaling. One of the most likely outcomes of that burden on IT is frustration among users. That frustration, in turn, can lead developers, analysts, and data scientists to resort to the work-around of building shadow IT. When that happens, hidden costs stack up and the ability to fix problems decreases. The impact can reach beyond the particular teams involved in this work-around. This style is, as you might imagine, not how an efficient, cost-effective, agile organization works.

Proliferation as a Symptom

Another warning sign of a system that is not scale-efficient is runaway proliferation. If the underlying data infrastructure is imposing unnecessary limitations, people may try to address challenges and meet SLAs by just “adding more stuff.” This usually manifests as proliferation of machines and clusters, driven in part by the unnecessary copying of large data sets, which we have already discussed. That in turn results in an additional burden on IT. There’s also an increased likelihood of siloed data, which can have a negative impact on the accuracy of results for data scientists and analysts.

Such proliferation is an incredibly insidious form of technical debt that can be hard to see, since it usually happens a step at a time and each step has a logical reason, usually involving compensating for deficiencies in existing systems. Each step is locally plausible, but together they often leave a system dominated by the cost of making and maintaining redundant data copies.

As scale increases, either in terms of the amount and diversity of data or amount and diversity of computation, it’s natural that there will be some expansion. Additional hardware, sometimes in new locations, can be a healthy pattern of growth in an organization. This is not the type of proliferation we are warning against. Instead, be alert to teams assuming that additional clusters or platforms are the default way to handle larger scale or new projects. If that is the style of work in your organization, it may be a symptom of inefficient data infrastructure.

Alternatively, the problem may be a human one. Even with a good foundation in efficiently scalable data infrastructure, you may find that people who have previously worked on infrastructure that did not scale well or that imposed serious trade-offs are so used to assuming proliferation is necessary that they don’t take full advantage of a scalable system. This problem is escapable through education and increased awareness of how others are approaching production at scale, assuming you have good infrastructure.

Scale Up Without Scaling IT

Perhaps the most powerful and possibly surprising message in what we are recommending with regard to scalability is that you should be able to increase scale without increasing central IT resources. Furthermore, you should be able to do this without offloading infrastructure work to application teams. People may not be looking for the solutions that make this work because they may assume it’s not possible. It definitely is. In Chapter 2, we describe how we’ve seen it done.

Note

Scalability without scaling up IT team size is a sign of a well-designed system built on efficient data infrastructure.

In order to break down unnecessary compromises, accept the premise that you should have scalability for data, applications, teams, projects, and even locations without matching increases in IT team size. What, then, are the implications of not having to scale IT?

Decoupling scalability of data and applications from scaling IT helps break down the trade-off between reliability, scale, and affordability. It also lowers both the entry cost and the risk of trying new projects. This next-project advantage not only opens the door to new lines of business and to cost savings through automation via new applications; it also lets your organization pivot quickly to respond to new situations in a timely manner.

Achieving this ability to scale without scaling IT teams is a combination of design decisions, cooperation at the human level, and data infrastructure that handles many aspects of data management at the platform level rather than the application level. Not all data technologies that handle large-scale data storage (such as many distributed big-data systems and block storage) enable the platform-level capabilities needed to take the burden off IT teams. The diagram in Figure 1-3 shows a decision flow about the interaction of data infrastructure and its impact on IT. Notice which conditions produce the bottlenecks shown on the right-hand side of the diagram.

Figure 1-3. This flowchart shows how certain mistaken approaches can undermine scalability at both organizational and technical levels.

These choices have widespread impact, particularly because many organizations have a fixed or declining IT budget. We are describing choices you can make that let you do what needs to be done—including making new projects practical and cost-effective—even with a steady level of IT resources.

Lowering entry costs for new projects also depends on a system that encourages users to share resources, including sharing data. When you have a system where experiments or even full-fledged new projects can easily coexist with existing production systems, you have an advantage. New projects can effectively make use of the sunk costs involved in setting up the cluster in the first place. That helps you afford to experiment more because you don’t have to invest upfront to get these projects going. We call this the “next-project effect” since it is the second (and later) projects that have the benefit.

If your underlying data infrastructure is not adequately scalable and does not support open data access by a variety of tools and languages, it is difficult to share and get the next-project advantage. These technical limitations can lead to IT discouraging new projects, especially speculative ones, to avoid the risk of failures and the higher initial investment that would be required. That, in turn, limits your organization’s ability to stay competitive by taking advantage of innovation.

Note

A no-failure policy is a no-innovation policy.

It is better to have an overall infrastructure and design that allows new projects to run on the same system as existing projects.

Look for Faulty Assumptions

We started this chapter with the idea that your design choices and data infrastructure may be imposing limitations that force you into unnecessary compromises. This happens when you make faulty assumptions that are based on limitations that could be avoided. The resulting trade-offs are part of what makes building successful large-scale AI and analytics harder than it needs to be.

How can you tell whether your system and architecture are scale-efficient, or whether they are inefficient at scale or lack true scalability? Look out for these faulty assumptions:

  • AI and analytics should be run on separate systems (clusters).

  • The IT team has to scale as the scale of data and applications grows.

  • It is very expensive to do large-scale projects in production.

  • The best way for different teams or applications to use the same data is to make private copies, even for very large data sets.

  • Data motion, including between edge and on-premises core, or edge and cloud, must be programmed at the application level.

  • Legacy applications cannot run directly on modern big-data infrastructure.

  • Big-data platforms are for specialized projects instead of serving as a universally available general platform across your organization.

  • You need to know the final scale for data and applications ahead of time in order to plan and set up your architecture and infrastructure.

  • You need to re-architect your existing system if your data or number of applications increase or if you need to use new technologies.

  • Multitenancy is not practical in production and must be restricted.

If you see yourself in any of these assumptions, you may want to consider a new style of work that mirrors the successes other businesses are having with large-scale systems in production. To do this, you can reexamine your ideas about architecture, user interactions, and the data infrastructure that is the foundation of all you do.

The Shape of the Solution

The idea that a new style of work can be built on data infrastructure engineered to handle the challenges of scale and scalability is not just aspirational: we’ve observed this over the past several years, as we describe in more detail in Chapter 2. We will explain successful approaches we see being undertaken by businesses in a wide range of sectors so that you may see what approaches could make a difference for your business.
