The future is already here—it’s just not evenly distributed.
Big data has grown up. Many people are already harvesting huge value from large-scale data via data-intensive applications in production. If you’re not yet doing that or not doing it successfully, you’re missing out. This book aims to help you design and build production-ready systems that deliver value from large-scale data. We offer practical advice on how to do this based on what we’ve observed across a wide range of industries.
The first thing to keep in mind is that finding value isn’t just about collecting and storing a lot of data, although that is an essential part of it. Value comes from acting on that data, through data-intensive applications that connect to real business goals. And this means that you need to identify practical actions that can be taken in response to the insights revealed by these data-driven applications. A report by itself is not an action; instead, you need a way to connect the results to value-based business goals, whether internal or customer facing. For this to work in production, the entire pipeline—from data ingestion, through processing and analytic applications, to action—must be doable in a predictable, dependable, and cost-effective way.
Big data isn’t just big. It’s much more than just an increase in data volume. When used to full advantage, big data offers qualitative changes as well as quantitative. In aggregate, data often has more value than just the sum of the parts. You often can ask—and, if you’re lucky, answer—questions that could not have been addressed previously.
Value in big data can be based on building more efficient ways of doing core business processes, or it might be found through new lines of business. Either way, it can involve working not only at new levels of scale in terms of data volume but also at new speeds. The world is changing: data-intensive applications and the business goals they address need to match the new microcycles that modern businesses often require. It’s no longer just a matter of generating reports at yearly, quarterly, monthly, weekly, or even daily cycles. Modern businesses move at a new rhythm, often needing to respond to events in seconds or even subseconds. When decisions are needed at very low latency, especially at large scale, they usually require automation. This is a common goal of modern systems: to build applications that automate essential processes.
Another change in modern enterprises has to do with the way applications are designed, developed, and deployed: for your organization to take full advantage of innovative new approaches, you need to work on a foundation and in a style that can allow applications to be developed over a number of iterations.
These are just a few examples of the new issues that modern businesses working with large-scale systems face. We’re going to delve into the goals and challenges of big data in production and how you can get the most out of the applications and systems you build, but first, we want to make one thing clear: the possibilities are enormous and well worth pursuing, as depicted in Figure 1-1. Don’t fall for doom-and-gloom blogs that claim big data has failed because some early technologies for big data have not performed well in production. If you do, you’ll miss out on some great opportunities. The business of getting value from large-scale data is alive and well and growing rapidly. You just have to know how to do it right.
Production brings its own challenges as compared to work in development or experimentation. These challenges can seem like a barrier, but they don’t need to be. The first step is to clearly recognize the challenges and pitfalls that you might encounter as you move into production so that you can have a clear and well-considered plan in advance for how to avoid or address them. In this chapter, we talk not only about the goals of production but also about the challenges, and we offer some hints about how you can recognize a system that is in fact ready for production.
There is no magic formula for success in production, however. Success requires making good choices about how to design a production-capable architecture, how to handle data and build effective applications, and which technologies and organizational culture fit your particular business. Several themes stand out among the organizations that are doing this successfully: what “production” really is, why multitenancy matters, the importance and power of simplicity, and the value of flexibility. It’s not a detailed or exhaustive list, but we think these ideas make a big difference as you go to production, so we touch on them in this chapter and then dig deeper throughout the rest of the book into how you can best address them.
Let’s begin by taking a look at what production is. We have a bit of a different view than what is traditionally meant by production. This new view can help you be better prepared as you tackle the challenge of production in large-scale, modern systems.
What do we mean by “in production”? The first thing that you might have in mind is to assume that production systems are applications that are customer facing. Although that is often true, it’s not the only important characteristic. For one thing, there are internal systems that run mainstream processes and are critical to business success. The fact that business deliverables depend on such systems means they, too, are in production.
There’s a better way to think about what production really means. If a process truly matters to your business, consider it as being in production and plan for it accordingly. We take that a step further: being in production means making promises you must keep. These promises are about connecting to real business value and meeting goals in a reasonable time frame. They also have to do with collecting and providing access to the right data, being able to survive a disaster, and more.
“In production” means making and keeping value-oriented promises. These promises are made and kept because they are about the stuff that matters to somebody.
The key is to correctly identify what really matters, to document (formalize) the promises you are making to address these issues, and to have a way to monitor whether the promises are met. This somewhat different view of production—the making and keeping of promises for processes essential to your business—helps to ensure that you take into account all aspects of what matters for production to be successful across a complete pipeline rather than focusing on just one step. This view also helps you to future-proof your systems so that you can take advantage of new opportunities in a practical, timely, and cost-effective way. We have more to say about that later, but first, think about the idea that “in production” is about much more than just the deployment of applications.
The idea of what is meant by in production also should extend to data. With data-driven business, keep in mind that data is different from code: data is, importantly, in production sooner. In fact, you might say that data has a longer memory than code. Developers work through multiple iterations of code as an application evolves, but data can have a role in production from the time it is ingested and for decades after, and so it must be treated with the same care as any production system.
There are several scenarios that can cause data to need to be considered in production earlier than traditionally thought, and, of course, that will depend on the particular situation. For instance, it’s an unfortunate fact that messing up your data can cause you problems much longer than messing up your code ever could. The problem, of course, comes from the fact that you can fix code and deploy a new version. Problem sorted. But if you mess up archival data, you often can’t fix the problem at all. If you build a broken model, version control will give you the code you used, but what about the data? Or, what about when you use an archive of nonproduction data to build that model? Is that nonproduction data suddenly promoted retrospectively? In fact, data often winds up effectively in production long before your code is ready, and it can wind up in production without you even knowing at the time.
Another example is the need for compliance. Increasingly, businesses are being held responsible for documenting what was known and when it was known for key processes and decisions, whether manual or automated. With new regulations, the situations that require this sort of promise regarding data are expanding.
Newly ingested, or so-called “raw,” data also surprisingly might need to be treated as production-grade even if data at all known subsequent steps in processing and Extract, Transform, and Load (ETL) for a particular application do not need to be. Here’s why. Newly developed applications might come to need particular features of the raw data that were discarded by the original application. To be prepared for that possibility, you would need to preserve raw data reliably as a valuable asset for future production systems even though much of the data currently seems useless.
We don’t mean to imply that all data at all stages of a workflow should be treated as production grade. But one way to recognize whether you’re building production-ready systems is to have a proactive approach to planning data integrity across multiple applications and lines of business. This kind of commonality in planning is a strength in preparing for production. The question of when to consider data as “in production,” and how to treat it once you do, is difficult but important. Another useful approach is to securely archive raw data or partially raw data and treat that storage process as a production process even if downstream use is not (yet) for production. Then, document the boundary. We provide some suggestions in Chapter 6 that should help.
The goal of producing real value through analytics often comes down to asking the right question. But which questions you can actually ask may be severely limited by how much data you keep and how you can analyze it. Inherently, you have more degrees of freedom in terms of which questions and what analyses are possible if you retain the original data in a form closer to how events in the real world happened.
Let’s take a simplified example as an illustration of this. Assume for the moment that we have sent out three emails and want to determine which is the most effective at getting the response that we want. This example isn’t really limited to emails, of course; the same reasoning applies to all kinds of problems involving customer response or even physical effects, such as how a manufacturing process responds to various changes.
Which email is the best performer? Look at the dashboard showing the number of responses per hour in Figure 1-2. It makes option C appear to be the best by far. Moreover, if all we have is the number of responses in the most recent hour, this is the only question we can ask and the only answer we can get. But it is really misleading. It is only telling us which email performs best at tnow, and that’s not what we want to know.
There is a lot more to the story. Plotting the response rate against time gives us a very different view, as shown in the top graph in Figure 1-3. Now we see that each email was sent at different times, which means that the instantaneous response rate at tnow is mostly just a measure of which email was most recently sent. Accumulating total responses instead of instantaneous response rate doesn’t fix things, because that just gives a big advantage to the email that was sent first instead of most recently.
In contrast to comparing instantaneous rates as in the upper panel of Figure 1-3, by aligning these response curves according to their launch times we get a much better picture of what is happening, as shown in the lower panel. Doing this requires that we retain a history of click rates as well as record the events corresponding to each email’s launch.
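The alignment step can be sketched in a few lines of Python. The launch times, response timestamps, and function name below are all invented for illustration; the point is simply that responses are bucketed by hours since each email’s own launch rather than by wall-clock time.

```python
from datetime import datetime

# Hypothetical launch time for each email campaign.
launches = {
    "A": datetime(2024, 1, 1, 9, 0),
    "B": datetime(2024, 1, 1, 12, 0),
    "C": datetime(2024, 1, 1, 15, 0),
}

# Hypothetical individual response timestamps, retained as events
# rather than pre-aggregated counts.
responses = {
    "A": [datetime(2024, 1, 1, 9, 30), datetime(2024, 1, 1, 10, 15),
          datetime(2024, 1, 1, 13, 0)],
    "B": [datetime(2024, 1, 1, 12, 20), datetime(2024, 1, 1, 12, 50)],
    "C": [datetime(2024, 1, 1, 15, 10), datetime(2024, 1, 1, 15, 40),
          datetime(2024, 1, 1, 15, 55), datetime(2024, 1, 1, 16, 5)],
}

def responses_by_hour_since_launch(email):
    """Count responses in hourly buckets measured from the email's
    own launch time, not from wall-clock time."""
    launch = launches[email]
    buckets = {}
    for t in responses[email]:
        hour = int((t - launch).total_seconds() // 3600)
        buckets[hour] = buckets.get(hour, 0) + 1
    return buckets

aligned = {email: responses_by_hour_since_launch(email) for email in launches}
```

With all three emails on this common time axis, bucket 0 for email A is directly comparable to bucket 0 for email C, regardless of when each was sent, which is what the lower panel of Figure 1-3 depicts. Note that none of this is possible unless both the individual response times and the launch events were retained.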
But what if we want to do some kind of analysis that depends on which time zone the recipient was in? Our aggregates are unlikely to make this distinction. At some point, the only viable approach is to record all the information we have about each response to every email as a separate business event. Recording just the count of events that fit into particular predefined categories (like A, B, or C) takes up a lot less space but vastly inhibits what we can understand about what is actually happening.
What technology we use to record these events is not nearly as important as the simple fact that we do record them (we have suggestions on how to do this in Chapter 5). Getting this wrong by summarizing event data too soon and too much has led some people to conclude that big data technologies are of no use to them. Often, however, this conclusion is based on using these new technologies to do the same analysis on the same summarized data as they had always done and getting results that are no different. But the alternative approach of recording masses of detailed events inevitably results in a lot more data. That is, often, big data.
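As a sketch of the trade-off, compare storing pre-summarized counts with storing full events. The field names here are purely illustrative, not a schema any particular system requires:

```python
from collections import Counter

# Each response recorded as a full business event, keeping everything
# known at the time (field names are hypothetical).
events = [
    {"email": "A", "tz": "US/Eastern", "device": "mobile"},
    {"email": "A", "tz": "US/Pacific", "device": "desktop"},
    {"email": "C", "tz": "US/Eastern", "device": "mobile"},
    {"email": "C", "tz": "US/Eastern", "device": "desktop"},
]

# The summarized view: one counter per predefined category. Compact,
# but it can answer only the question it was designed for.
responses_per_email = Counter(e["email"] for e in events)

# Because the raw events were retained, a question nobody anticipated
# at ingest time -- responses by recipient time zone -- is still answerable.
responses_per_tz = Counter(e["tz"] for e in events)
```

The count-only table could never yield `responses_per_tz`; the detailed event log can produce both views. That is exactly the trade described above: more storage in exchange for more analytical freedom.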
Production promises built into business goals define the Service Level Agreements (SLAs) for data-intensive applications. Among the most common criteria to be met are speed, scale, reliability, and sustainability.
There often is time-value to large-scale data. Examples occur across many industries. Real-time traffic and navigation insights are more valuable while the commuter is still en route to their destination; hearing about a traffic jam that occurred yesterday is of little use. Data for market reports, predictive analytics, utilities or telecommunications usage levels, or recommendations for an ecommerce site all have a time-based value. You build low-latency data-intensive applications because your business needs to know what’s happening in the real world fast enough to be able to respond.
That said, it’s not always true that faster is better. Just making an application or model run faster might not have any real advantage if the timing of that process is already faster than reasonable requirements. Make the design fit the business goal; otherwise, you’re wasting effort and possibly resources. The thing that motivates the need for speed (your SLA) is getting value from data, not bragging rights.
Does it fit? Fit your design and technology to the needs particular to specific business goals, anticipating what will be required for production and planning accordingly. This is an overarching lesson, not just about speed. Each situation defines its own requirements. A key to success is to recognize those requirements and address them appropriately.
In other words, don’t pick a solution before you understand the problem.
Much of the value of big data lies in its scale. But scale—in terms of processing and storage of very large data volumes of many terabytes or petabytes—can be challenging for production, especially if your systems and processes have been tested at only modest scale. In addition, do you look beyond your current data volume requirements to be ready to scale up when needed? This change can sometimes need to happen quickly depending on your business model and timeline, and of course it should be doable in a cost-effective way and without unwanted disruption. A key characteristic of organizations that deploy into production successfully is being able to handle large volume and velocity of data for known projects but also being prepared for growth without having to completely rebuild their system.
A different twist on the challenge of scale isn’t just about data volume. It can also be about the number of files you need to handle, especially if the files are small. This might sound like a simple challenge but it can be a show-stopper. We know of a financial institution that needed to track all incoming and outgoing texts, chats, and emails for compliance reasons. This was a production-grade promise that absolutely had to be kept. In planning for this critical goal, these customers realized that they would need to store and be able to retrieve billions of small files and large files and run a complex set of applications including legacy code. From their previous experience with a Hadoop Distributed File System (HDFS)–based Apache Hadoop system, the company knew that this would likely be very difficult to do using Hadoop and would require complicated workarounds and dozens of name nodes to meet stringent requirements for long-term data safety. They also knew that the size would make conventional storage systems implausibly expensive. They avoided the problem in this particular situation by building and deploying the project on technology designed to handle large numbers of small as well as large files and to have legacy applications directly access the files. (We discuss that technology, a modern big data platform, in Chapter 4). The point is, this financial company was successful in keeping its promises because potential problems were recognized in advance and planned for accordingly. These customers made certain that their SLAs fit their critical business needs and, clearly understanding the problem, found a solution to fit the needs.
Additional issues to consider in production planning are the range of applications that you’ll want to run and how you can do this reliably and without resulting in cluster sprawl or a nightmare of administration. We touch on these challenges in the sections on multitenancy and simplicity in this chapter as well as with the solutions introduced in Chapter 2.
Reliability is important even during development stages of a project in order to make efficient use of developer time and resources, but obviously pressures change as work goes into production. This change is especially true for reliability. One way to think of the difference between a production-ready project and one that is not ready is to compare the behavior of a professional musician to an amateur. The amateur musician practices a song until they can play it through without a mistake. In contrast, the professional musician practices until they cannot play it wrong. It’s the same with data and software. Development is the process of getting software to work. Production is the process of setting up a system so that it (almost) never fails.
Issues of reliability for Hadoop-based systems built on HDFS might have left some people thinking that big data systems are not suitable for serious production deployments, especially for mission-critical processes, but this should not be generalized to all big data systems. That’s a key point to keep in mind: big data does not equal Hadoop. Reliability is not the only issue that separates these systems, but it is an important one. Well-designed big data systems can be relied on with extreme confidence. Here’s an example for which reliability and extreme availability are absolutely required.
An example of when it matters to get things right is an impressive project in which data has been used to change society in India: the Aadhaar project run by the Unique Identification Authority of India (UIDAI). The basic idea of the project is to provide a unique, randomly chosen 12-digit government-issued identification number to every resident of India and to provide a biometric database so that anybody with an Aadhaar number can prove their identity. The biometric data includes an iris scan of both eyes plus the fingerprints of all ten fingers, as suggested by the illustration in Figure 1-4. This record-scale biometric system requires reliability, low latency, and complete availability 24/7 from anywhere in India.
Previously in India, most of the population lacked a passport or any other identification documents, and most documents that were available were easily forged. Without adequately verifiable identification, it was difficult or impossible for many citizens to set up a bank account or otherwise participate in a modern economy, and there was also a huge amount of so-called “leakage”: government aid that disappeared through apparent fraud. Aadhaar is helping to change that.
The Aadhaar database can be used to authenticate the identity of every citizen, even in rural villages, where a wide range of mobile devices from cell phones to microscanners is used to verify identity when a transaction is requested. Aadhaar ID authentication is also used to verify qualification for government aid programs such as food deliveries for the poor or pension payments for the elderly. Implementation of this massive digital identification system has spurred economic growth and saved a huge amount of money by thwarting fraud.
From a technical point of view, what are the requirements for such an impressive big data project? For this project to be successful in production, reliability and availability are a must. Aadhaar must meet strict SLAs for availability of the authentication service every day, at any time, across India. The authentication process, which involves a profile look-up, supports thousands of concurrent transactions with end-to-end response times on the order of 100 milliseconds. The authentication system was originally designed to run on Apache Hadoop and Apache HBase, but the system was neither fast enough nor reliable enough, even with multiple redundant datacenters. Late in 2014, the authentication service was moved to a MapR platform to make use of MapR-DB, a NoSQL database that supports the HBase API but avoids compaction delays. Since then, there has been no downtime, and India has reaped the benefits of this successful big data project in production.
Predictability and repeatability also are key factors for business and for engineering. If you don’t have confidence in those qualities, it’s not a business; it’s a lottery—it’s not engineering; it’s a lucky accident.
These qualities are especially important in the relationship between test environments and production settings. Whether it’s a matter of scale, meeting latency requirements, or running in a very specific environment, it’s important for test conditions to accurately reflect what will happen in production. You don’t want surprises. Just observing that an application worked in a test setting is not in itself sufficient to determine that it is production ready. You must examine the gap between test conditions and what you expect for real-world production settings and, as much as is feasible, have them match, or at least understand the implications of their differences. How do you get better predictability and repeatability? In Chapter 2, we explain several approaches that help with this, including running containerized applications and using Kubernetes as an orchestration layer. This is also one of the ways in which data should be considered in production from early stages, because it’s important to preserve enough data to replay operations. We discuss that further in the design patterns presented in Chapter 5.
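A small illustration of why retained data supports repeatability: if the input events behind a result are preserved, the computation can be replayed later and checked against the recorded outcome. This is only a toy sketch with invented names, not a replay framework:

```python
# Hypothetical append-only log of input events, retained as production
# data even though the application that produced it may change.
event_log = [
    {"user": "u1", "amount": 10},
    {"user": "u2", "amount": 25},
    {"user": "u1", "amount": 5},
]

def compute_totals(events):
    """Deterministic aggregation: total amount per user."""
    totals = {}
    for e in events:
        totals[e["user"]] = totals.get(e["user"], 0) + e["amount"]
    return totals

# The result recorded when the job first ran...
recorded = compute_totals(event_log)

# ...and a later replay from the retained log, which must match.
# A mismatch would mean the data or the code has silently changed.
replayed = compute_totals(event_log)
```

Replayability of this kind narrows the gap between test and production: the same retained inputs can be run through a new version of an application before it is promoted, so its behavior in production is predictable rather than a surprise.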
Like reliability, data and system security are a must. You should address them from the start, not as an add-on afterthought when you are ready to deploy to production. People who are highly experienced with security know that it is a process of design, data handling, management, and good technology rather than a fancy tool you plug in and forget. Security should extend from on-premises deployments across multiple datacenters and to cloud and multicloud systems, as well.
Depending solely on perimeter security implemented in user-level software is not a viable approach for production systems unless it is part of a layered defense that extends all the way down to the data itself.
Pressures change as you move from development into production, partly because the goals are different and partly because the scale or SLAs change. Also, the people who handle requirements might not be the same in production as they were in development.
First, consider the difference in goals. In development, the goal is to maximize potential. You will likely explore a range of possibilities for a given business goal, experimenting to see what approach will provide the best performance for the predetermined goal. It’s smart to keep in mind the SLAs your application will need to meet in production, but in development your need to meet these promises right away is more relaxed. In development, you can better afford some risk because the impact of a failure is less serious.
The balance between potential and risk changes as you move into production. In production, the goal is to minimize risk, or at least to keep it to acceptable levels. The potentially broad goals that you had in development become narrower: you know what needs to be done, and it must be delivered in a predictable, reproducible, cost-effective, and reliable way, without requiring an army for effective administration. Possible sources of risk come from the pressures of scale and speed; ironically, these are two of the same characteristics that are often the basis for value.
Let’s look for a moment at the consequences of unreliable systems. As we stated earlier, outages in development systems can have consequences for lost time and lost morale when systems are down, but these pale in comparison to the consequences for unreliability in production because of the more immediate impact on critical business function. Reliability also applies to data safety, and, as we mentioned earlier, data can have in-production pressures much sooner than code does. Because the focus in production shifts to minimizing risk, it’s often good to consider the so-called “blast radius” or impact of a failure. The blast radius is generally much more limited for application failures than for an underlying platform failure, so the requirements for stability are higher for the platform.
Furthermore, the potential blast radius is larger for multitenant systems, but this can have a paradoxical effect on overall business risk. It might seem that the simple solution here is to avoid multitenancy to minimize blast radius, but that’s not the best answer. If you don’t make use of multitenancy, you are missing out on some substantial advantages of a modern big data system. The trick is to pick a very-high-reliability data platform to set up an overall multitenant design but to logically isolate systems at the application level, as we explain later in this chapter.
With a well-designed system and the right data and analytics platform capabilities, it is possible to run development and production applications on the same cluster, but generally we feel it is better to keep these at least logically separated. That separation can be physical so that development and production run on separate clusters, and data is stored separately, as well, but it does not need to be so. Mainly, the impact of development and production applications should be separated. To do that requires that the system you use lets you exert this control with reasonable effort rather than inflicting a large burden for administration. In Chapters 2 and 4, we describe techniques that help with either physical separation or separation of impact.
There are additional implications when production data is stored on a separate cluster from development. As more and more processes depend on real data, it is becoming increasingly difficult to do serious development without access to production data. Again, there is more than one way to deal with this issue, but it is important to recognize in advance whether this is needed in your projects and to plan accordingly.
A different issue arises over data that comes from development applications. Development-grade processes should not produce production data. To do otherwise would introduce an obligation to live up to promises that you aren’t ready to keep. We have already stated that any essential data pipeline should be treated as being in production. This means that you should consider all of the data sources for that pipeline and all of its components as production status. Development-stage processes can still read production data, but any output produced as a result will not be production grade.
Your system should also make it possible for an entire data flow to be versioned and permission controlled, easily and efficiently. Surprisingly, even in a system with strong separation of development and production, you probably still need multitenancy. Here’s why.
Multitenancy refers to an assignment of resources such that multiple applications, users, and user groups and multiple datasets all share the same cluster. This approach requires the ability to strictly and securely insulate separate tenants as appropriate while still being able to allow shared access to data when desired. Multitenancy should be one of the core goals of a well-designed large data system because it helps support large-scale analytics and machine learning systems both in development and in production. Multitenancy is valuable in part because it makes these systems more cost effective. Sharing resources among applications, for instance, results in resource optimization, keeping CPUs busy and having fewer under-used disks. Well-designed and executed multitenancy offers better optimization for specialized hardware such as Graphics Processing Units (GPUs), as well. You could provide one GPU machine to each of 10 separate data scientists, but that gives each one only limited compute power. In contrast, with multitenancy you can give each data scientist shared access to a larger, more powerful, shared GPU cluster for bursts of heavy computation. This approach uses the same number of GPUs or fewer yet delivers much more effective resources for data-intensive applications.
There are also long-term reasons that multitenancy is a desirable goal. Properly done, multitenancy can substantially reduce administrative costs by allowing a single platform to be managed independently of how many applications are using it. In addition, multitenancy makes collaboration more effective while helping to keep overall architectures simple. A well-designed multitenant system is also better positioned to support development and deployment of your second (and third, and so on) big data project by taking advantage of sunk costs. That is, you can do all of this if your platform makes multitenancy safe and practical. Some large platforms don’t have robust controls over access or might not properly isolate resource-hungry applications from one another or from delay-sensitive applications. The ability to control data placement is also an important requirement of a data platform suitable for multitenancy.
Multitenancy also serves as a key strategy because many high-value applications are also the ones that pose the highest development risk. Taking advantage of the sunk costs of a platform intended for current production or development by using it for speculative projects allows high-risk/high-reward projects to proceed to a go/no-go decision without large upfront costs. That means experimentation with new ideas is easier because projects can fail fast and cheap. Multitenancy also allows much less data duplication, thus driving down amortized cost, which again allows more experimentation.
Putting lots of applications onto a single large cluster instead of a number of smaller clusters can pose an obvious risk, as well. That is, an outage in a cluster that supports a large number of applications can be very serious because all of those applications are subject to failure if the platform fails. It will also be possible (no matter the system) for some applications to choke off access to critical resources unless you have suitable operational controls and platform-level controls. This means that you should not simply put lots of applications on a single cluster without considering the increased reliability required of a shared platform. We explain how to deal with this risk in Chapter 2.
If you are thinking it’s too risky or too complicated to use a truly multitenant system, look more closely at your design and the capabilities of your underlying platform and other tools: multitenancy is practical to achieve, and it’s definitely worth it, but it won’t happen by accident. We talk more about how to achieve it in later chapters.
Multitenancy is just one aspect of an efficient and reliable production system. You need to be able to maintain performance, preserve data locality, manage computation and storage resources, and deploy new applications into a predictable and controlled environment, and you should be able to do all of this without requiring an army of administrators. Otherwise, systems become too expensive and too complicated to sustain in the long run.
We have experience with a very large retail company that maintains a mixed collection of hundreds of critical business processes running on a small number of production clusters with very effective multitenancy. The entire set of clusters is managed by a very small team of administrators, who are able to share resources across multiple applications and can deploy experimental programs even in these essential systems. This retailer found its big data platform so reliable and easy to manage that it didn't even need a war room for the data platform during the Christmas lockdown months. These systems are returning large amounts of traceable incremental revenue with modest ongoing overhead costs, and the savings have gone into developing new applications and exploring new opportunities.
The lesson here is that simplicity is a strength. Again, keep in mind that big data does not equal Hadoop. HDFS is a write-once, append-only distributed file system that is difficult to access from legacy software and many machine learning tools. These traits can make HDFS a barrier to streamlined design. Having to copy data out of HDFS to process it or run machine learning on it, and then copy the results back, is an unnecessary complication. That is just one example of how your choices in design and technology affect how easily you can make a system production ready.
Big data systems don’t need to be cumbersome. If you find your design requires a lot of workarounds, that’s a warning flag that you might not have the best architecture and technology to support large-scale data-intensive applications in production.
For a system to be sustainable in production, it must be cost effective in terms of both infrastructure and administrative costs. A well-designed big data system does not take an army of administrators to maintain.
The best success comes from systems that can expand to larger scale or broaden to include new data, new applications, and even new technologies without having to be rebuilt from scratch. A data and analytics platform should support a range of data types, storage structures, and access APIs, along with both legacy and new code, to give you the flexibility needed for modern data-driven work. Even a system running very well in production today will need to change easily in the future because the world doesn’t stay the same: you need a system that delivers reliability now but lets you adapt to changing conditions and new opportunities.
If you are working with a system design and platform that give you the flexibility to add new applications or new data sources, you are in an excellent position to capture targets of opportunity. These are chances to capture value that arise almost on the spur of the moment, whether through insights from data exploration, through experimentation with new design patterns, or simply because the right two people sat together at lunch and came up with a great idea. With a system that gives you flexibility, you can take advantage of these situations, perhaps even launching a new line of business.
Although flexible systems and architectures are critical, it is equally important that your organization has a flexible culture. You won’t get the full advantage of the big data tools and data-intensive approaches you’ve adopted if you are stuck in a rigid style of managing human resources. We recently asked Terry McCann, a consultant with Adatis, a company that helps people get their big data systems into production, what he thought was one of the most important issues for production. Somewhat surprisingly, McCann said that the lack of a DevOps and DataOps approach is one of the biggest issues because it makes the execution of everything else so much more difficult. That observation is in line with a 2017 New Vantage Partners survey that identified the difficulty of organizational and cultural change around big data as a major challenge for the success of big data projects.
We have discussed the goals that matter if big data systems are to be production ready. You need a clear connection to business value and a clearly defined way to act on the results of data-intensive applications. Data may need to be treated with production care as soon as it is ingested if it could later be required as a system of record or as critical input to a production process, now or in the future. Reliability is an essential requirement for production, as is the ability to handle scale and speed appropriate to your SLAs. Effective systems take advantage of multitenancy but are not cumbersome to maintain. They should also provide a good degree of flexibility, making it easy to adapt to changing conditions and to take advantage of new opportunities.
These are desirable goals, but how do you get there? We said earlier there is no magic formula for getting value from big data in production, and that is true. We should clarify, however: there is a formula for success—it’s just not magic.
That’s what we show you in Chapter 2: what you can do to deploy into production successfully.