If you are a Unix-oriented tools hacker, Impala fits in nicely at the tail end of your workflow. You create data files with a wide choice of formats for convenience, compactness, or interoperability with different Apache Hadoop components. You tell Impala where those data files are and what fields to expect inside them. That’s it! Then, let the SQL queries commence. You can see the results of queries in a terminal window through the
impala-shell command, save them to a file for processing with other scripts or applications, or pull them straight into a visualization or reporting application through the standard ODBC or JDBC interfaces. It’s transparent to you that behind the scenes, the data is spread across multiple storage devices and processed by multiple servers.
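This workflow can be sketched in a few shell commands. The file, table, and path names below are illustrative examples, not from any real deployment; the steps that need a running cluster are shown as comments, while the `-q`, `-B`, `-o`, and `--output_delimiter` options of impala-shell are what make the "save to a file for other scripts" step work:

```shell
# Create a small tab-delimited data file (names and paths here are
# purely illustrative).
printf 'alice\t42\nbob\t17\n' > users.tsv

# The remaining steps assume a running cluster, so they are shown as
# comments: copy the file into HDFS, tell Impala where it is and what
# fields to expect, then query it.
# hdfs dfs -mkdir -p /user/impala/demo
# hdfs dfs -put users.tsv /user/impala/demo/
# impala-shell -q "CREATE EXTERNAL TABLE users (name STRING, visits INT)
#     ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
#     LOCATION '/user/impala/demo'"
#
# Run a query non-interactively and save comma-delimited output to a
# file, ready for other scripts or applications to consume:
# impala-shell -B --output_delimiter=',' -o results.csv \
#     -q 'SELECT name, visits FROM users ORDER BY visits DESC'
```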
Administering Impala is a straightforward matter of a few daemons communicating with each other through a predefined set of ports: an
impalad daemon that runs on each data node in the cluster and does most of the work, and a
statestored daemon that runs on one node and performs periodic health checks on the
impalad daemons, with one more service planned on the roadmap. Log files show the Impala activity occurring on each node.
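Because the daemons listen on well-known ports, a quick check from the shell can tell you whether they are up. This is a sketch, not an Impala tool; it assumes the default debug web UI ports (25000 for impalad, 25010 for statestored; confirm the defaults for your release) and checks the local host only:

```shell
# Hedged sketch: probe the default debug web UI ports of the Impala
# daemons on this host.
#   impalad:     25000  (also 21000 for impala-shell, 21050 for JDBC/ODBC)
#   statestored: 25010
check_daemon_port() {
  # Report whether anything answers HTTP on the given local port.
  if curl -s --max-time 2 "http://localhost:$1/" > /dev/null 2>&1; then
    echo "port $1: daemon responding"
  else
    echo "port $1: no response (daemon not running on this host?)"
  fi
}

for port in 25000 25010; do
  check_daemon_port "$port"
done
```

Either way the check prints a status line per port, so it slots easily into a larger health-check script.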
Administration for Impala is typically folded into administration for the overall cluster through the Cloudera Manager product. You monitor all nodes for out-of-space problems, CPU spikes, network failures, and so on, rather ...