Chapter 1. Cloud Migration: From Data Center to Hadoop in the Cloud

How do you move a large portfolio of more than 400 batch analytical programs from a proprietary database appliance architecture to the Hadoop ecosystem in the cloud?

During a session at Strata + Hadoop World New York 2015, Jaipaul Agonus, the technology director in the market regulation department of FINRA (Financial Industry Regulatory Authority) described this real-world case study of how one organization used Hive, Amazon Elastic MapReduce (Amazon EMR) and Amazon Simple Storage Service (S3) to move a surveillance application to the cloud. This application consists of hundreds of thousands of lines of code and processes 30 billion or more transactions every day.

FINRA is often called “Wall Street’s watch dogs.” It is an independent, not-for-profit organization authorized by Congress to protect United States investors by ensuring that the securities industry operates fairly and honestly through effective and efficient regulation. FINRA’s goal is to maintain the integrity of the market by governing the activities of every broker doing business in the US. That’s more than 3,940 securities firms with approximately 641,000 brokers.

How does it do it? It runs surveillance algorithms on approximately 75 billion transactions daily to identify violation activities such as market manipulation, compliance breaches, and insider trading. In 2015, FINRA expelled 31 firms, suspended 736 brokers, barred 496 brokers, fined firms more than $95 million, and ordered $96 million in restitution to harmed investors.

The Balancing Act of FINRA’s Legacy Architecture

Before Hadoop, Massively Parallel Processing (MPP) methodologies were used to solve big data problems. As a result, FINRA’s legacy applications, which were first created in 2007, relied heavily on MPP appliances.

MPP tackles big data by partitioning the data across multiple nodes. Each node has its own local memory and processor, and the distributed nodes are handled by a sophisticated centralized SQL engine, which is essentially the brain of the appliance.

According to Agonus, FINRA’s architects originally tried to design a system in which they could find a balance between cost, performance, and flexibility. As such, it used two main MPP appliance vendors. “The first appliance was rather expensive because it had specialized hardware due to their SQL engines; the second appliance, a little less expensive because they had commodity hardware in the mix,” he said.

FINRA kept a year’s worth of data in the first appliance, including analytics that relied on a limited dataset and channel, and a year’s worth of data in the second appliance—data that can run for a longer period of time and that needs a longer date range. After a year, this data was eventually stored offline.

Legacy Architecture Pain Points:
Silos, High Costs, Lack of Elasticity

Due to FINRA’s tiered storage design, data was physically distributed across appliances, including MPP appliances, Network-Attached Storage (NAS), and tapes; therefore, there was no one place in its system where it could run all its analytics across the data. This affected accessibility and efficiency. For example, to rerun old data, FINRA had to do the following:

  • To rerun data that was more than a month old, it had to rewire analytics to be run against appliance number two.

  • To rerun data that was more than a year old, it had to call up tapes from the offline storage, clear up space in the appliances for the data, restore it, and revalidate it.

The legacy hardware was expensive and was highly tuned for CPU, storage, and network performance. Additionally, it required costly proprietary software, forcing FINRA to spend millions annually, which indirectly resulted in a vendor lock-in.

Because FINRA was bound by the hardware in the appliances, scaling was difficult. To gauge storage requirements, it essentially needed to predict the future growth of data in the financial markets. “If we don’t plan well, we could either end up buying more or less capacity than we need, both causing us problems,” said Agonus.

The Hadoop Ecosystem in the Cloud

Many factors were driving FINRA to migrate to the cloud—the difficulty of analyzing siloed data, the high cost of hardware appliances and proprietary software, and the lack of elasticity. When FINRA’s leaders started investigating Hadoop, they quickly realized that many of their pain points could be resolved. Here’s what they did and how they did it.

FINRA’s cloud-based Hadoop ecosystem is made up of the following three tools:

Hive

This is the de facto standard for SQL-on-Hadoop. It’s a component of Hortonworks Data Platform (HDP) and provides a SQL-like interface.

Amazon EMR

This is a managed Hadoop framework that brings elasticity to Hadoop clusters in the cloud.

Amazon S3

This is Amazon’s storage service with practically infinite storage (up to five terabytes).

SQL and Hive

FINRA couldn’t abandon SQL because it already had invested heavily in SQL-based applications running on MPP appliances. It had hundreds of thousands of lines of legacy SQL code that had been developed and iterated over the years. And it had a workforce with strong SQL skills. “Giving up on SQL would also mean that we are missing out on all the talent that we’ve attracted and strengthened over the years,” said Agonus.

As for Hive, users have multiple execution engines:

MapReduce

This is a mature and reliable batch-processing platform that scales well for terabytes of data. It does not perform well enough for small data or iterative calculations with long data pipelines.

Tez

Tez aims to balance performance and throughput by streaming the data from one process to another without actually using HDFS. It translates complex SQL statements into optimized, purpose-built data processing graphs.

Spark

This takes advantage of fast in-memory computing by fitting all intermediate data into memory and spilling back to disk only when necessary.

Amazon EMR

Elastic MapReduce makes easy work of deploying and managing Hadoop clusters in the cloud. “It basically reduces the complexity of the time-consuming set up, management, and tuning of the Hadoop clusters and provides you a resizable cluster of Amazon’s EC2 instances,” said Agonus. EC2 instances are essentially virtual Linux servers that offer different combinations of CPU, memory, storage, and networking capacity in various pricing models.

Amazon S3

S3 is a cost-effective solution that handles storage. Because one of FINRA’s architecture goals was to separate the storage and compute resources so that it could scale them independently, S3 met its requirements. “And since you have the source data set available in S3, you can run multiple clusters against that same data set without overloading your HDFS nodes,” said Agonus.

All input and output data now resides in S3, which acts like HDFS. The cluster is accessible only for the duration of the job. S3 also fits Hadoop’s file system requirements and works well as a storage layer for EMR.

Capabilities of a Cloud-Based Architecture

With the right architecture in place, FINRA found it had new capabilities that allowed it to operate in an isolated virtual network (VPC, or virtual private cloud). “Every surveillance has a profile associated with it that lets these services know about the instance type and the size needed for the job to complete,” said Agonus.

The new architecture also made it possible for FINRA to store intermediate datasets; that is, the data produced and transferred between the two stages of a MapReduce computation—map and reduce. The cluster brings in the required data from S3 through Hive’s external tables and then sends it to the local HDFS for further storing and processing. When the processing is complete, the output data is written back to S3.

Lessons Learned and Best Practices

What worked? What didn’t? And where should you focus your efforts if you are planning on migrating to the cloud? According to Agonus, your primary objective in designing your Hive analytics would be to focus on direct data access and maximizing your resource utilization. Following are some key lessons learned from the FINRA team.

Secure the financial data

The audience asked how FINRA secured the financial data. “That’s the very first step that we took,” said Agonus. FINRA has an app security group that performed a full analysis on the cloud, which was a combined effort with the cloud vendor. They also used encryption on their datacenter. This is part of Amazon’s core, he explained. “Everything that is encrypted stays encrypted,” he said. “Amazon’s security infrastructure is far more extensive than anything we could build in-house.”

Conserve resources by processing necessary data

Because Hive analytics enable direct data access, you need only partition the data you require. FINRA partitions its trade dataset based on a trade date. It then process only the data that it needs. As a result, it don’t waste resources trying to scan millions upon millions of rows.

Prep enhances join performance

According to Agonus, bucketing and sorting data ahead of time enhances join performance and reduces the I/O scan significantly. “Joins also work much faster because the buckets are aligned against each other and a merge sort is applied on them,” he said.

Tune the cluster to maximize resource utilization

Agonus emphasized the ease of making adjustments to your configurations in the cloud. Tuning the Hive configurations in your cluster lets you maximize resource utilization. Because Hive consumes data in chunks, he says, “You can adjust minimum/maximum splits to increase or decrease the number of mappers or reducers to take full advantage of all the containers available in your cluster.” Furthermore, he suggests, you can measure and profile your clusters from the beginning and adjust them continuously as your data size changes or the execution framework changes.

Achieve flexibility with Hive UDFs when SQL falls short

Agonus stressed that SQL was a perfect fit for FINRA’s application; however, during the migration process, FINRA found two shortcomings with Hive, which it overcame by using Hive user defined functions (UDFs).

The first shortcoming involved Hive SQL functionality compared to other SQL appliances. For example, he said, “The Windows functions in Netezza allow you to ignore nulls during the implementation of PostgreSQL, but Hive does not.” To get around that, FINRA wrote a Java UDF that can do the same thing.

Similarly, it discovered that Hive did not have the date formatting functions it was used to in Oracle and other appliances. “So we wrote multiple Java UDFs that can convert formats in the way we like,” said Agonus. He also reported that Hive 1.2 supports date conversion functions well.

The second shortcoming involved procedural tasks. For example, he said, “If you need to de-dupe a dataset by identifying completely unique pairs based on the time sequence in which you receive them, SQL does not offer a straightforward way to solve that.” However, he suggested writing a Java or Python UDF to resolve that outside of SQL and bring it back into SQL.

Choose an optimized storage format and compression type

A key component of operating efficiently is data compression. According to Agonus, the primary benefit of compressing data is the space you save on the disk; however, in terms of compression algorithms, there is a bit of a balancing act between compression ratio and compression performance. Therefore, Hadoop provides support for several compression algorithms, including gzip, bzip2, Snappy, LZ4 and others. The abundance of options, though, can make it difficult for users to select the right ones for their MapReduce jobs.

Some are designed to be very fast, but might not offer other features that you need. “For example,” says Agonus, “Snappy is one of the fastest available, but it doesn’t offer much in space savings comparatively.” Others offer great space savings, but they’re not as fast and might not allow Hadoop to split and distribute the workload. According to Agonus, Gzip compression offers the most space saving, but it is also among the slowest and is not splittable. Agonus advises choosing a type that best fits your use case.

Run migrated processes for comparison

One of the main mitigation strategies FINRA used during the migration was to conduct an apples-to-apples comparison of migrated processes with its legacy output. “We would run our migrated process for an extensive period of time, sometimes for six whole months, and compare that to the output in legacy data that were produced for the same date range,” said Agonus. “This proved very effective in identifying issues instantly.” FINRA also partnered with Hadoop and cloud vendors who could look at any core issues and provide it with an immediate patch.

Benefits Reaped

With FINRA’s new cloud-based architecture, it no longer had to project market growth or spend money upfront on heavy appliances based on projections. Nor did it need to invest in a powerful appliance to be shared across all processes. Additionally, FINRA’s more dynamic infrastructure allowed it to improve efficiencies, running both faster and more easily. Due to the ease of making configuration changes, it was also able to utilize its resources according to its needs.

FINRA was also able to mine data and do machine learning on data in a far more enhanced manner. It was also able to decrease its emphasis on software procurement and license management because the cloud vendor performs much of the heavy lifting in those areas.

Scalability also improved dramatically. “If it’s a market-heavy day, we can decide that very morning that we need bigger clusters and apply that change quickly without any core deployments,” said Agonus. For example, one process consumes up to five terabytes of data, whereas others can run on three to six months worth of data. Lastly, FINRA can now reprocess data immediately without the need to summon tapes, restore them, revalidate them, and rerun them.

Get Data Infrastructure for Next-Gen Finance now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.