Apache Parquet is a columnar storage format for data where the structure of the data is incorporated into the file. It is available to any project in the Hadoop ecosystem and is a key format for analytics. It was designed to meet the goals of interoperability, space efficiency, and query efficiency. Parquet files can be stored in HDFS as well as non-HDFS filesystems.
Columnar storage works well for analytics as the data is stored and arranged by table columns instead of rows. Analytics use cases typically select multiple columns and perform aggregation functions on the values, such as sum, average, or standard deviation. ...