The Apache Hadoop ecosystem is very data-centric, making it a natural fit for database developers with SQL experience. Much application development work for Hadoop consists of writing programs to copy, convert or reorganize, and analyze data files. A lot of effort goes into finding ways to do these things reliably, on a large scale, and in parallel across clusters of networked machines. Impala focuses on making these activities fast and easy, without requiring you to have a PhD in distributed computing, learn a lot of new APIs, or write a complete program when your intent can be conveyed with a single SQL statement.
The Cloudera Impala project arrives in the Big Data world at just the right moment. Data volume is growing fast, outstripping what can be realistically stored or processed on a single server. The Hadoop software stack is opening that field up to a larger audience of users and developers.
Impala brings a high degree of flexibility to the familiar database ETL process. You can query data that you already have in various standard Hadoop file formats (see File Formats). You can access the same data with a combination of Impala and other Hadoop components such as Apache Hive, Apache Pig, and Cloudera Search without duplicating or converting the data. When query speed is critical, the Parquet columnar file format makes it simple to reorganize data for maximum performance of data warehouse-style queries.
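For instance, pointing Impala at data you already have can be as simple as declaring a table over the existing files. The following sketch assumes a hypothetical directory of tab-delimited log files in HDFS; the table name, columns, and path are illustrative only:

```sql
-- Define a table over delimited text files already in HDFS.
-- EXTERNAL means Impala queries the files in place; nothing is copied.
CREATE EXTERNAL TABLE web_logs (
  log_time    TIMESTAMP,
  url         STRING,
  status_code INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/user/etl/web_logs';

-- The same files remain readable by Hive, Pig, and other Hadoop tools.
SELECT status_code, COUNT(*) AS hits
FROM web_logs
GROUP BY status_code;
```

Because the table definition is just metadata, other components can keep reading and writing those same files without any duplication.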
Traditionally, Big Data processing has resembled batch jobs from the mainframe era where unexpected or tough questions required running jobs overnight or all weekend. The goal of Impala is to express even complicated queries directly with familiar SQL syntax, running fast enough that you can get an answer to an unexpected question in seconds or at most a few minutes. We refer to this human-scale type of responsiveness as “interactive.”
For users and business intelligence tools that speak SQL, Impala brings a more effective development model than writing a new Java program to handle each new kind of analysis. Although the SQL language has a long history in the computer industry, with the combination of Big Data and Impala, it is once again cool.
Now you can write sophisticated analysis queries using natural expressive notation, the same way Perl mongers do with text-processing scripts. You can interactively traverse large data sets and data structures, like a Pythonista inside the Python shell. You can avoid memorizing verbose specialized APIs; SQL is like a RISC instruction set that focuses on a standard set of powerful commands. When you do need access to API libraries for capabilities such as visualization and graphing, you can access Impala data from programs written in languages such as C++, Java, and Python through the standard JDBC and ODBC interfaces.
You can also take advantage of business tools that use SQL behind the scenes but don’t require you to code SQL directly. For example, you can use traditional business intelligence tools such as IBM Cognos, SAP Business Objects, and MicroStrategy, as well as the new generation of data discovery tools such as Tableau.
Impala integrates with existing Hadoop components, security, metadata, storage management, and file formats. You keep the flexibility you already have with these Hadoop strong points and add capabilities that make SQL queries much easier and faster than before.
With SQL, you can turn complicated analysis programs into simple, straightforward queries. To help answer questions and solve problems, you can enlist a wide audience of analysts who already know SQL or the standard business intelligence tools built on top of SQL. They know how to use SQL or BI tools to analyze large data sets and how to quickly get accurate answers for many kinds of business questions and “what if” scenarios. They know how to design data structures and abstractions that let you perform this kind of analysis both for common use cases and unique, unplanned scenarios.
The filtering, calculating, sorting, and formatting capabilities of SQL let you delegate those operations to the Impala query engine, rather than generating a large volume of raw results and coding client-side logic to organize the final results for presentation.
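As a sketch of that delegation, a single statement can filter, aggregate, round, and sort inside the engine, so the client sees only the finished result set. The table and column names here are hypothetical:

```sql
-- Filtering (WHERE), calculation (SUM, ROUND), sorting (ORDER BY),
-- and trimming (LIMIT) all happen in the Impala query engine.
SELECT customer_id,
       ROUND(SUM(amount), 2) AS total_spent
FROM orders
WHERE order_date BETWEEN '2014-01-01' AND '2014-12-31'
GROUP BY customer_id
ORDER BY total_spent DESC
LIMIT 10;
```

Without this, a client program would have to pull down every matching row and reimplement the grouping, math, and sorting itself.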
Impala embodies the Big Data philosophy that large data sets should be just as easy and economical to work with as small ones. Large volumes of data can be imported instantaneously, without any changes to the underlying data files. You have the flexibility to query data in its raw original form, or convert frequently queried data to a more compact, optimized form. Either way, you don’t need to guess which data is worth saving; you preserve the original values, rather than condensing the data and keeping only the summarized form. There is no required step to reorganize the data and impose rigid structure, such as you might find in a traditional data warehouse environment.
The data files that Impala works with are all in open, documented, interoperable formats. (Some are even human-readable.) If you want to use Impala alongside other Hadoop components, you can do that without copying or converting the data. When you work with future generations of data-processing software, you can keep using the original data files rather than being faced with a difficult migration.
The Impala architecture provides such a speed boost to SQL queries on Hadoop data that it will change the way you work. Whether you currently use MapReduce jobs or even other SQL-on-Hadoop technologies such as Hive, the fast turnaround for Impala queries opens up whole new categories of problems that you can solve. Instead of treating Hadoop data analysis as a batch process that requires extensive planning and scheduling, you can get results any time you want them. Instead of making a mental context switch while you wait for each query to finish, you can run a query, evaluate the results immediately, and fine-tune it. This rapid iteration helps you zero in on the best solution without disrupting your workflow. Instead of trying to shrink your data down to a representative subset, you can analyze everything you have, producing the most accurate answers and discovering new trends and correlations.
Perhaps you have had the experience of using software or a slow computer where after every command or operation, you waited so long that you had to take a coffee break or switch to another task. Then when you switched to faster software or upgraded to a faster computer, the system became so responsive that it lifted your mood, reengaged your intellect, and sparked creative new ideas. This is the type of reaction Impala aims to inspire in Hadoop users.
Previously, if you were writing queries for business intelligence, the data typically had already been condensed to a manageable volume of high-value information, and gone through a complicated extract-transform-load (ETL) cycle to be loaded into a database system.
With Impala, this procedure is shortened. The data arrives in Hadoop after fewer steps, and Impala is ready to query it immediately. The high-capacity, high-speed storage of a Hadoop cluster lets you bring in all the data, not just the subset that you think is the most valuable. Because Impala can query the raw data files, you can skip the time-consuming stages of loading and reorganizing data that you might have encountered with older database systems.
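One concrete example of this shortened procedure: when new files land in a table's HDFS directory, a single metadata refresh makes them queryable, with no load or conversion step. (The table name below is hypothetical; `REFRESH` is the Impala statement that picks up new data files for a table.)

```sql
-- New files were just copied into the web_logs directory by an
-- upstream process. Refresh the metadata, then query immediately.
REFRESH web_logs;

SELECT COUNT(*) AS recent_rows
FROM web_logs
WHERE log_time >= NOW() - INTERVAL 1 DAY;
```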
This fast end-to-end process opens up new possibilities for analytic queries. You can use techniques such as exploratory data analysis and data discovery. With earlier generations of software, you were unlikely to do these kinds of operations: either because it was too expensive to store all the data in your data warehouse or too time-consuming to load and convert it all into a usable form.
You might receive raw data in simple formats such as delimited text files. Text files are bulky and not particularly efficient to query, but these aren’t critical aspects for exploratory business intelligence (BI). The queries you run against such data are intended to determine what new insights you can gather by analyzing a comprehensive set of data. You might spot trends, identify interesting subsets, and learn how to design a schema that matches well with the underlying structure of the data. Exploratory BI typically involves ad hoc queries: ones that are made up on the spot and then fine-tuned over several iterations. To tease out answers to questions such as “Are there any…?”, “What is the most…?”, and so on, the queries often involve aggregation functions such as MAX(), MIN(), SUM(), COUNT(), and AVG().
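Ad hoc queries of that shape might look like the following sketch, run directly against the raw text data (the `listings` table and its columns are hypothetical):

```sql
-- "What is the most...?"
SELECT MAX(price) AS highest_price FROM listings;

-- "Are there any...?" -- explore the shape of the data, then refine.
SELECT category,
       COUNT(*)   AS n,
       AVG(price) AS avg_price
FROM listings
GROUP BY category
ORDER BY n DESC;
```

Each result typically prompts the next, slightly refined query; that iterate-and-refine loop is what exploratory BI depends on Impala's fast turnaround for.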
Once you know the queries you want to run on a regular basis, you can optimize your data and your schema to be as efficient as possible. For data you intend to intensively analyze, expect to graduate from text or other unoptimized file formats, and convert the data to a compressed columnar file format—namely the Parquet format. (If you are an experienced Hadoop shop, you might already be using Parquet format in your data pipeline. In that case, enjoy the extra query speed during the exploratory BI phase, and skip any subsequent data conversion steps.)
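The conversion step can itself be a single SQL statement. This sketch assumes the hypothetical text-format `listings` table from the exploratory phase; the new table name is illustrative:

```sql
-- One statement converts the data to compressed, columnar Parquet.
-- The original text table and its files stay untouched.
CREATE TABLE listings_parquet
STORED AS PARQUET
AS SELECT * FROM listings;
```

Subsequent reporting queries run against `listings_parquet`, while the original files remain available to other tools and to future reprocessing.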