May 2017
596 pages
Since we want to write Parquet files into HDFS, we will use the same BucketingSink as before, but with a custom Parquet writer and a DateTimeBucketer that partitions by minute, as follows. The bucket sink path is passed as a command-line argument, hdfsPath, which we will discuss later. Update the HADOOP_USER_NAME system property in the code to your user account name on CentOS:
System.setProperty("HADOOP_USER_NAME", "centos");

// HDFS Sink
BucketingSink<Tuple2<IntWritable, Text>> hdfsSink =
    new BucketingSink<Tuple2<IntWritable, Text>>(parameterTool.getRequired("hdfsPath"));
hdfsSink.setBucketer(new DateTimeBucketer("yyyy-MM-dd--HH-mm"));
hdfsSink.setWriter(new SinkParquetWriter<Tuple2<IntWritable, ...
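To see what the minute-based partitioning produces, note that DateTimeBucketer simply appends the formatted timestamp to the base path, so a pattern of "yyyy-MM-dd--HH-mm" yields one subdirectory per minute. The following is a minimal stdlib-only sketch of that path derivation (it is not Flink code, and the hdfsPath value shown is hypothetical):

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

public class BucketPathDemo {
    // Mimics DateTimeBucketer's behavior: bucket path = basePath + "/" + formatted time.
    static String bucketPath(String basePath, long epochMillis) {
        SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd--HH-mm");
        fmt.setTimeZone(TimeZone.getTimeZone("UTC")); // fixed zone so the demo is deterministic
        return basePath + "/" + fmt.format(new Date(epochMillis));
    }

    public static void main(String[] args) {
        // 1493634600000L is 2017-05-01T10:30:00Z; the base path below is a made-up example.
        System.out.println(bucketPath("hdfs://localhost:9000/flink/parquet", 1493634600000L));
        // -> hdfs://localhost:9000/flink/parquet/2017-05-01--10-30
    }
}
```

Each record written between 10:30 and 10:31 would therefore land in the 2017-05-01--10-30 bucket directory; the actual DateTimeBucketer uses the processing-time clock of the sink's task.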