The process for loading data using the bulk load utilities closely mirrors a standard ETL (extract, transform, load) flow:
- Extracting data from the source.
- Transforming the data into HFiles.
- Loading the files into HBase by telling the region servers where to find them.
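As a rough illustration, the following is a minimal Java sketch of these three steps, assuming the HBase 1.x/2.x-era MapReduce bulk load APIs (`HFileOutputFormat2` and `LoadIncrementalHFiles`). The table name, column family, input and staging paths, and the TSV-parsing mapper are all hypothetical placeholders.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.RegionLocator;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class BulkLoadSketch {

  // Hypothetical transform step: parse "rowkey<TAB>value" lines into Puts.
  public static class TsvToPutMapper
      extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      String[] fields = line.toString().split("\t");
      byte[] row = Bytes.toBytes(fields[0]);
      Put put = new Put(row);
      put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("q"),
          Bytes.toBytes(fields[1]));
      ctx.write(new ImmutableBytesWritable(row), put);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    TableName tableName = TableName.valueOf("my_table"); // hypothetical table
    Path input = new Path("/data/raw");                  // hypothetical source
    Path hfiles = new Path("/data/staging-hfiles");      // hypothetical staging dir

    try (Connection conn = ConnectionFactory.createConnection(conf);
         Admin admin = conn.getAdmin();
         Table table = conn.getTable(tableName);
         RegionLocator locator = conn.getRegionLocator(tableName)) {

      // Steps 1 and 2: extract the source data and transform it into HFiles.
      Job job = Job.getInstance(conf, "bulk-load-transform");
      job.setJarByClass(BulkLoadSketch.class);
      job.setInputFormatClass(TextInputFormat.class);
      job.setMapperClass(TsvToPutMapper.class);
      job.setMapOutputKeyClass(ImmutableBytesWritable.class);
      job.setMapOutputValueClass(Put.class);
      FileInputFormat.addInputPath(job, input);
      FileOutputFormat.setOutputPath(job, hfiles);
      // Wires in the partitioner and sort reducer matching the table's regions.
      HFileOutputFormat2.configureIncrementalLoad(job, table, locator);
      if (!job.waitForCompletion(true)) {
        System.exit(1);
      }

      // Step 3: point the region servers at the finished HFiles.
      new LoadIncrementalHFiles(conf).doBulkLoad(hfiles, admin, table, locator);
    }
  }
}
```

`HFileOutputFormat2.configureIncrementalLoad` reads the table's region boundaries and configures a matching total-order partitioner and sort reducer, which is why the job must be configured against the live table rather than in isolation.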
The following points need to be kept in mind when using the bulk load utilities:
- The HBase and Hadoop cluster, with MapReduce/YARN, must be running; you can run jps to verify that the daemons are up.
- The user executing the program needs the appropriate access rights (user/group).
- The table schema needs to be designed to match the structure of the input data.
- Region split points need to be taken into consideration, since pre-splitting the table determines how the generated HFiles are distributed across region servers (see the sketch after this list).
- The entire stack (compaction, splits, block size, max file size, flush size, versions, compression, memstore size, block cache, garbage collection, nproc, and so on) needs to be fine-tuned ...
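To illustrate the schema and split-point considerations above, here is a minimal sketch that creates a pre-split table, assuming the HBase 2.x client API. The table name, column family, split keys, and per-family settings are hypothetical; in practice the split keys would be derived from the key distribution of the input data, and values such as compression, block size, and versions would come out of the tuning exercise described in the last point.

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
import org.apache.hadoop.hbase.io.compress.Compression;
import org.apache.hadoop.hbase.util.Bytes;

public class PreSplitTableSketch {
  public static void main(String[] args) throws Exception {
    try (Connection conn =
             ConnectionFactory.createConnection(HBaseConfiguration.create());
         Admin admin = conn.getAdmin()) {

      // Hypothetical split points chosen from the input's key distribution,
      // so the generated HFiles spread evenly across regions.
      byte[][] splitKeys = {
          Bytes.toBytes("d"), Bytes.toBytes("h"),
          Bytes.toBytes("m"), Bytes.toBytes("r")
      };

      admin.createTable(
          TableDescriptorBuilder.newBuilder(TableName.valueOf("my_table"))
              .setColumnFamily(
                  ColumnFamilyDescriptorBuilder.newBuilder(Bytes.toBytes("cf"))
                      .setCompressionType(Compression.Algorithm.SNAPPY)
                      .setBlocksize(64 * 1024) // HFile block size (hypothetical)
                      .setMaxVersions(1)       // keep a single version
                      .build())
              .build(),
          splitKeys);
    }
  }
}
```

Creating the table pre-split like this means the transform job's HFiles each fall within a single region's key range, avoiding region splits and hotspotting during the load.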