Chapter 4. Decisions That Drive Successful Hadoop Projects

What are the decisions that drive successful Hadoop projects? The answer lies in part in decisions you make as you plan what project to tackle, the way you design your workflow, and how you’ve set up your cluster to begin with. Apache Hadoop is a powerful technology that has enormous potential, and the ways in which it can and will be used are still being discovered. Whether you are a seasoned Hadoop user or a newcomer to the technology, there are some key decisions and strategic approaches to using Hadoop for big data systems that can help ensure your own success with Hadoop and related tools, and we offer some suggestions here that may help with your choices.

The following list is not a comprehensive “how-to” guide, nor is it detailed documentation about Hadoop. Instead, it’s eclectic. We provide technical and strategic tips—some major and some relatively minor or specialized—that are based on what has helped other Hadoop users succeed. Some of these tips will be helpful before you start using Hadoop, and others are intended for more seasoned users, to guide choices as you put Hadoop to work in development and production settings.

Note that the first three tips are particularly important if you are new to Hadoop and NoSQL databases such as HBase or MapR-DB.

Tip #1: Pick One Thing to Do First

If you work with large volumes of data and need scalability and flexibility, you can use Hadoop in a wide variety of ways to reduce costs, increase revenues, advance your research, and keep you competitive. Adopting Hadoop as your big data platform is a big change from conventional computing, and if you want to be successful quickly, it helps to focus initially on one specific use for this new technology.

Don’t expect that from the start you can come up with all the different ways that you might eventually want to use Hadoop. Instead, examine your own needs (immediate or long-term goals), pick one need for which Hadoop offers a near-term advantage, and begin planning your initial project. As your team becomes familiar with Hadoop and with the ecosystem tools required for your specific goal, you’ll be well positioned to try other things as you see new ways in which Hadoop may be useful to you.

There’s no single starting point that’s best for everyone. In Chapter 5, we describe some common use cases that fit Hadoop well. Many of those use cases would make reasonable first projects. As you consider what to focus on first, whether it comes from our list or not, make a primary consideration that there is a good match between what you need done and what Hadoop does well. For your first project, don’t think about picking the right tool for the job—be a bit opportunistic and pick the right job for the tool.

By focusing on one specific goal to start with, the learning curve for Hadoop and other tools from the ecosystem can be a little less steep. For example, for your first Hadoop project, you might want to pick one with a fairly short development horizon. You can more quickly see whether or not your planning is correct, determine if your architectural flow is effective, and begin to gain familiarity with what Hadoop can do for you. This approach can also get you up and running quickly and let you develop the expertise needed to handle the later, larger, and likely more critical projects.

Many if not most of the successful and large-scale Hadoop users today started with a single highly focused project. That first project led in a natural way to the next project and the next one after that. There is a lot of truth in the saying that big data doesn’t cause Hadoop to be installed, but that instead installing Hadoop creates big data. As soon as there is a cluster available, you begin to see the possibilities of working with much larger (and new) datasets. It is amazing to find out how many people had big data projects in their hip pocket and how much value can be gained from bringing them to life.

Tip #2: Shift Your Thinking

Think in a different way so that you change the way that you design systems. This idea of changing how you think may be one of the most important bits of advice we can offer to someone moving from a traditional computing environment into the world of Hadoop. To make this transition of mind may sound trivial, but it actually matters a lot if you are to take full advantage of the potential that Hadoop offers. Here’s why.

What we mean by a shift in thinking is that methods and patterns that succeed for large-scale computing are very different from methods and patterns that work in more traditional environments, especially those that involve relational databases and data warehouses. A large-scale shift in thinking is required for the operations, analytics, and applications development teams. This change is what will let you build systems that make use of what Hadoop does well. It is undeniably very hard to change the assumptions that are deeply ingrained by years of experience working with traditional systems. The flexibility and innovation of Hadoop systems is a great advantage, but to be fully realized, they must be paired with your own willingness to think in new ways.

Here are a couple of specific examples of how to do this:

  • Learn to delay decisions. This advice likely feels counterintuitive. We’re not advocating procrastination in general—we don’t want to encourage bad habits—but it is important to shift your thinking away from the standard idea that you have to design and structure how you will format, transform, and analyze data from the start, before you ingest, store, or analyze any of it.

    This change in thinking is particularly hard to do if you’re used to using relational databases, where the application lifecycle of planning, specifying, designing, and implementing can be fairly important and strict. In traditional systems, just how you prepare data (i.e., do ETL) is critically important; you need to choose well before you load, because changing your mind late in the process with a traditional system can be disastrous. That means that with traditional systems such as relational databases, your early decisions really need to be fully and carefully thought through and locked down.

    With Hadoop, you don’t need to be locked into your first decisions. It’s not only unnecessary to narrow your options from the start, it’s also not advised. To do so limits too greatly the valuable insights you can unlock through various means of data exploration.

    It’s not that with Hadoop you should store data without any regard at all for how you plan to use it. Instead, the new idea here is that with Hadoop, the massively lower cost of large-scale data storage and the ability to use a wider variety of data formats means that you can load and use data in relatively raw form, including unstructured or semistructured. In fact, it can be useful to do so because it leaves you open to use it for a known project but also to decide later how else you may want to use the same data. This flexibility is particularly useful since you may use the data for a variety of different projects, some of which you’ve not yet conceived at the time of data ingestion. The big news is, you’re not stuck with your first decisions.

  • Save more data. If you come from a traditional data storage background, you’re probably used to automatically thinking in terms of extracting, transforming, summarizing, and then discarding the raw data. Even where you run analytics on all incoming data for a particular project, you likely do not save more than a few weeks or months of data because the costs of doing so would quickly become prohibitive.

    With Hadoop, that changes dramatically. You can benefit by shifting your thinking to consider saving much longer time spans of your data because data storage can be orders of magnitude less expensive than before. These longer histories can prove valuable to give you a finer-grained view of operations or for retrospective studies such as forensics. Predictive analytics on larger data samples tends to give you a more accurate result. You don’t always know what will be of importance in data at the time it is ingested, and the insights that can be gained from a later perspective will not be possible if the pertinent data has already been discarded.

    “Save more data” means saving data for longer time spans, from larger-scale systems, and also from new data sources. Saving data from more sources also opens the way to data exploration—experimental analysis of data alongside your mainstream needs that may unlock surprising new insights. This data exploration is also a reason for delaying decisions about how to process or downsample data when it is first collected.

    Saving data longer can even simplify the basic architecture of system components such as message-queuing systems. Traditional queuing systems worry about deleting messages as soon as the last consumer has acknowledged receipt, but new systems keep messages for a fixed and long time period. If messages that should be processed in seconds will actually persist for a week, the need for fancy acknowledgement mechanisms vanishes. Your architecture may have similar assumptions and similar opportunities.

Tip #3: Start Conservatively But Plan to Expand

A good guideline for your initial purchase of a Hadoop cluster is to start conservatively and then plan to expand at a later date. Don’t try to commit to finalizing your cluster size from the start—you’ll know more six months down the line about how you want to use Hadoop and therefore what size cluster makes sense than you will when you begin for the first time. Some very rough planning can be helpful just to budget the overall costs of seeing your Hadoop project through to production, but you can make these estimates much better after a bit of experience. Remember, it is fairly easy to expand an initial Hadoop cluster, even by very large factors. To help with that, we provide some tips for successful cluster expansion further along in this list.

That said, make sure to provide yourself with a reasonably sized cluster for your initial development projects. You need to have sufficient computing power and storage capacity to make your first Hadoop project a success, so give it adequate resources. Remember that extra uses for your cluster will pop out of the woodwork almost as soon you get it running. When ideas for new uses arise, be sure to consider whether your initial cluster can handle them or whether it’s time to expand. Capacity planning is a key to success.

A common initial cluster configuration as of the writing of this book is 6–12 machines, each with 12–24 disks and 128–192 GB of RAM. If you need to slim this down initially, go for fewer nodes with good specs rather than having more nodes that give very poor performance. If you can swing it, go for 10Gb/s networking.

Tip #4: Be Honest with Yourself

Hadoop offers huge potential cost savings, especially as you scale out wider, because it uses commodity hardware. But it isn’t magic. If you set a bad foundation, Hadoop cannot make up for inadequate hardware and setup. If you try to run Hadoop on a couple of poor-quality machines with a few disks and shaky network connections, you won’t see very impressive results.

Be honest with yourself about the quality of your hardware and your network connections. Is the disk storage sufficient? Do you have a reasonable balance of cores to disk? Do you have a reasonable balance of CPU and disk capacity for the scale of data storage and analysis you plan to do? And perhaps most important of all, how good are your network connections?

A smoothly running Hadoop cluster will put serious pressure on the disks and network—it’s supposed to do so. Make sure each machine can communicate with each other machine at the full bandwidth for your network. Get good-quality switches and be certain that the system is connected properly.

In order to do this, plan time to test your hardware and network connections before you install Hadoop, even if you think that the systems are working fine. That helps you avoid problems or makes it easier to isolate the source of problems if they do arise. If you do not take these preparatory steps and a problem occurs, you won’t know if it is hardware or a Hadoop issue that’s at fault. Lots of people waste lots of time doing this. In fact, trying to build a high-performance cluster with misconfigured network or disk controllers or memory is common enough that we considered putting it into the chapter on use cases.

The good news is that we have some pointers to good resources for how to test machines for performance. See Appendix A at the end of this book for details.

Tip #5: Plan Ahead for Maintenance

This tip is especially aimed at larger-scale Hadoop users who have expanded to clusters with many machines. Lots of machines in a cluster give you a big boost in computing power, but it also means that in any calendar period, you should expect more maintenance. Many machines means many disks; it’s natural that in any few months, some number of them will fail (typical estimates are about 5–8% per year). This is a normal part of the world of large-scale computing (unless you rely solely on cloud computing). It’s not a problem, but to have a smoothly running operation, you should build in disk replacement as a regular part of your schedule.

Tip #6: Think Big: Don’t Underestimate What You Can (and Will) Want to Do

Hadoop provides some excellent ways to meet some familiar goals, and one of these may be the target of your initial projects when you begin using Hadoop and HBase. In other words, an initial reason to adopt Hadoop is that often it can provide a way to do what you already need to do but in a way that scales better at lower cost. This is a great way to start. It’s also likely that as you become familiar with how Hadoop works, you will notice other ways in which it can pay off for you. Some of the ways you choose to expand your use of a Hadoop data system may be new—that’s one of the strengths of Hadoop. It not only helps you do familiar things at lower cost, it also unlocks the door to new opportunities.

Stay open to these new possibilities as you go forward; the rewards can surprise you. We see this pattern with people already using Hadoop. A large majority of MapR customers double the size of their clusters in the first year due to increased scope of requirements. What happens is that it quickly becomes apparent that there so many targets of opportunity—unforeseen applications that solve important problems—that additional workload on the clusters quickly justifies additional hardware. It is likely that your experience will be like theirs; your first goals will be fairly straightforward and possibly familiar, but other opportunities will appear quickly. So start relatively small in cluster size, but whatever you do, try not to commit to an absolute upper bound on your final cluster size until you’ve had a chance to see how Hadoop best fits your needs.

Tip #7: Explore New Data Formats

Some of the most successful decisions we have seen involving Hadoop projects have been to make use of new data formats, including semi-structured or unstructured data. These formats may be unfamiliar to you if you’ve worked mainly with traditional databases. Some useful new formats such as Parquet or well-known workhorses like JSON allow nested data with very flexible structure. Parquet is a binary data form that allows efficient columnar access, and JSON allows the convenience of a human readable form of data, as displayed in Example 4-1.

Example 4-1. Hadoop enables you to take advantage of nested data formats such as JSON (shown here), thus expanding the number of data sources you can use to gain new insights. Social media sources and web-oriented APIs such as Twitter streams often use JSON. The new open source SQL-on-Hadoop query engine Apache Drill is able to query nested data structures in either JSON or Parquet. See Ford VIN Decoder for more information.
{
    "VIN":"3FAFW33407M000098",
    "manufacturer":"Ford",
    "model": {
      "base": "Ford F-Series, F-350",
      "options": [
        "Crew Cab", "4WD", "Dual Rear Wheels"
      ]
    },
    "engine":{
      "class": "V6,Essex",
      "displacement": "3.8 L",
      "misc": ["EFI","Gasoline","190hp"]
    },
    "year":2007
}

Nested data provides you with some interesting new options. Think of it as you would this book—the book is an analogy for nested data. It’s one thing, but it contains subsets of content at different levels, such as chapters, figure legends, and individual sentences. Nested data can be treated as a unit, but with the right access, the data at each internal layer can be used.

Nested data formats such as Parquet combine flexibility with performance. A key benefit of this flexibility is that it allows you to “future-proof” your applications. Old applications will silently ignore new data fields, and new applications can still read old data. Combined with a little bit of discipline, these methods lead to very flexible interfaces. This style of data structure migration was pioneered by Google and has proved very successful in a wide range of companies.

Besides future-proofing, nested data formats let you encapsulate structures. Just as with programming languages, encapsulation allows data to be more understandable and allows you to hide irrelevant details.

These new formats seem very strange at first if you come from a relational data background, but they quickly become second nature if you give them a try. One of your challenges for success is to encourage your teams to begin to consider unstructured and semistructured data among their options. What all this means from a business perspective is that access to semistructured, unstructured, and nested data formats greatly expand your chances to reap the benefits of analyzing social data, of combining insights from diverse sources, and reducing development time through more efficient workflows for some projects.

Tip #8: Consider Data Placement When You Expand a Cluster

Having just recommended (in Tip #3: Start Conservatively But Plan to Expand) that you start small and expand your cluster as you understand your workload, we should also suggest that some care is in order when you do expand a cluster, especially if you expand a heavily loaded cluster just by a small amount. This situation is a very common use case because it often happens that usage of a new cluster expands far more quickly than planned, and people may initially choose to upgrade by adding only a few additional nodes. The challenge can arise because new data will tend to be loaded onto the newly available and more empty new nodes unless you take measures to regulate what happens. Figure 4-1 shows what can happen if you start with five very full nodes and add two new ones.

An odd effect that is sometimes observed when adding a few nodes to a nearly full cluster. New data (shown here in red) can become unevenly distributed across the cluster. If new data is hotter than old data, this focus on new data on new nodes will make them hotter as well—it’s as though for a subset of operations, the cluster has the appearance of being smaller than before. In that case, adding just a few new machines to the cluster can inadvertently decrease current cluster throughput.
Figure 4-1. An odd effect that is sometimes observed when adding a few nodes to a nearly full cluster. New data (shown here in red) can become unevenly distributed across the cluster. If new data is hotter than old data, this focus on new data on new nodes will make them hotter as well—it’s as though for a subset of operations, the cluster has the appearance of being smaller than before. In that case, adding just a few new machines to the cluster can inadvertently decrease current cluster throughput.

Tip #9: Plot Your Expansion

The advice here is really simple: Don’t wait until your cluster is almost full before you decide to expand, particularly if you are not working in the cloud. Instead, plan ahead to include sufficient time for ordering new hardware and getting it delivered, set up, and pretested before you install your Hadoop software and begin to ingest data.

To do this capacity planning well, you will need to estimate not only how long it will take from decision to order to having up-and-running machines, you also will need to estimate growth rates for your system and set a target threshold (e.g., 70% of capacity used) as the alert to start the expansion process. You can measure and record machine loads, probably using a time series database, and plot your rate of decrease in available space. That lets you more easily incorporate an appropriate lead time when you order new hardware in order to avoid crunches.

Tip #10: Form a Queue to the Right, Please

When you are dealing with streaming data, it makes good sense to plan a queuing step as part of data ingestion to the Hadoop cluster or for the output of a realtime analytics application in your project architecture. If there is any interruption of the processing of streaming data due to application error, software upgrades, traffic problems, or cosmic rays, losing data is usually less acceptable than delayed processing. A queuing layer lets you go back and pick up where your process left off once the data processing can resume. Some data sources inherently incorporate replayable queues, but many do not. Especially for data sources that are non-local, queuing is just a very good idea.

As mentioned in our overview of the Hadoop ecosystem presented in Chapter 2, there are several useful tools for providing the safety of a queue for streaming data. Apache Kafka is particularly useful for this purpose.

Tip #11: Provide Reliable Primary Persistence When Using Search Tools

As powerful and useful as they are, search tools are not suitable as primary data stores, although it may be tempting to think of them that way when you are storing data in them. Successful projects are planned with built-in protections, and one aspect of that planning is to respect search technologies such as ElasticSearch for what they do well but not to assume that they provide a dependable primary data store. They were not designed to do so, and trying use them for that purpose can result in data loss or substantial performance loss.

The alternative is safe and easy: archive data in your Hadoop system as the primary store rather than relying on what is stored in the search tool. With the low cost of data storage in Hadoop systems, there isn’t a big penalty for persisting data for primary storage in Hadoop rather than relying on your search engine. In terms of protecting data, there are big benefits for doing so.

Tip #12: Establish Remote Clusters for Disaster Recovery

Your initial Hadoop project may not be business critical, but there is a good chance that a business-critical application will be on your cluster by at least the end of the first year of production. Even if the processing isn’t business critical, data retention may well be. Before that happens, you will need to think through how to deal with the unlikely but potentially devastating effects of a disaster. For small startups, the answer might be that it is better to roll the dice to avoid the distraction and cost of a second data center, but that is rarely an option with an established business.

If you have data of significant value and applications that run in critical-path operations for your organization, it makes sense to plan ahead for data protection and recovery in case of a disaster such as fire, flood, or network access failure in a data center. The remote mirroring capabilities of MapR make it possible to do this preparation for disaster recovery (DR) even for very large-scale systems in a reliable, fast, and convenient way using remote mirroring.

Chapter 3 explains more about how MapR mirroring works. Keep in mind that once remote mirroring is established, only the incremental differences in data files are transmitted, rather than complete new copies. Compression of this data differential for transfer to the remote cluster makes the process fast and keeps bandwidth costs down. There is no observable performance penalty and each completed mirror will atomically advance to a consistent state of the mirror source.

Mirrors like this are a MapR-specific feature. What are your options for DR in other Hadoop distributions? On HDFS-based systems, you can use the built-in distributed copy (distcp) utility to copy entire directory trees between clusters. This utility can make copies appear nearly atomically, but it doesn’t have any way to do a true incremental copy. When distcp is used in production, it is common for it to be combined with conventions so that files that are currently being modified are segregated into a separate directory tree. Once no more modifications are being made to this directory tree and a new live tree is created, the copy to the remote cluster can be initiated.

One thing to keep in mind is that a DR cluster is not entirely a waste of resources in the absence of a disaster. Such a backup cluster can be used as a development sandbox or staging environment. If you have frequent data synchronization and can protect the backup data from inadvertent modification, this can be a very fruitful strategy.

Tip #13: Take a Complete View of Performance

Performance is not only about who wins a sprint. Sometimes the race is a marathon. Or the goal is to throw a javelin. The key is to know which event you are competing in. The same is true with Hadoop clusters.

Too often, an organization assesses which tool they want to adopt just by comparing how fast each one completes running a single specific job or query. Speed is important, and benchmark speeds like this can be informative, but the speed on a small set of queries is only a small part of the picture in terms of what may matter most to success in your particular project. Another consideration is long-term throughput: which tool has stayed up and running and therefore supported the most work two weeks or two months later?

Performance quality should also be judged relative to the needs of the particular project. As mentioned briefly in Chapter 2, what’s most commonly important as a figure of merit in streaming and realtime analytics is latency (the time from arrival of a record to completion of the processing for that record). In contrast, for interactive processing, the data remains fixed and the time of interest is the time that elapses between presenting a query and getting a result—in other words, the response time.

In each of these situations, a measure of performance can be helpful in picking the right tool and the right workflow for the job, but you must be careful that you’re measuring the right form of performance.

Once you have picked key performance indicators for the applications you are running, it is just as important to actually measure these indicators continuously and record the results. Having a history of how a particular job has run over time is an excellent diagnostic for determining if there are issues with the cluster, possibly due to hardware problems, overloading from rogue processes, or other issues.

Another very useful trick is to define special “canary” jobs that have constant inputs and that run the same way each time. Since their inputs are constant and since they are run with the same resources each time they are run, their performance should be comparable each time they are run. If their performance changes, something may have happened to the cluster. With streaming environments, such tests are usually conducted by putting special records known as “tracer bullets” into the system. The processing of tracers triggers additional logging and diagnostics, and the results are used very much like the performance of canary jobs.

Tip #14: Read Our Other Books (Really!)

We’ve written several short books published by O’Reilly that provide pointers to handy ways to build Hadoop applications for practical machine learning, such as how to do more effective anomaly detection (Practical Machine Learning: A New Look at Anomaly Detection), how to build a simple but very powerful recommendation engine (Practical Machine Learning: Innovations in Recommendation), and how to build high-performance time series databases (Time Series Databases: New Ways to Store and Access Data). Each of these short books takes on a single use case and elaborates on the most important aspects of that use case in an approachable way. In our current book, we are doing the opposite, treating many use cases at a considerably lower level of detail. Both approaches are useful.

So check those other books out—you may find lots of good tips that fit your project.

Tip # 15: Just Do It

Hadoop offers practical, cost-effective benefits right now, but it’s also about innovation and preparing for the future. One decision with large potential payoffs is to get up to speed now with new technologies before the need for them is absolutely critical. This gives you time to gain familiarity, to fit the way a tool is used to your own specific needs, and to build your organization’s foundation of expertise with new approaches before you are pressed against an extreme deadline.

The only way to gain this familiarity is to pick a project and just do it!

Get Real-World Hadoop now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.