July 2017
Beginner to intermediate
378 pages
10h 26m
English
We will return to the GPS position data example to set up a retention policy. The raw data is landed into an HDFS table and kept uncompressed. A series of data refineries transforms the data into cleaned and useful analytical datasets. All of the analytical datasets are compressed in Parquet format–still in HDFS.
Upon the review of the raw data, it was determined that any problems in the initial transformation are typically discovered within a week. No problems were found more than a month later. Due to this, the raw data was retained for the latest month, and then moved into S3 Standard. In S3, a ruleset was created to move the data into Standard-Infrequent Access after another month, and then into Amazon Glacier ...