File handling with Hadoopy

Hadoopy is a Python library that provides an API for interacting with Hadoop, both to manage files in HDFS and to run MapReduce jobs on them. Installation instructions are available at http://www.Hadoopy.com/en/latest/tutorial.html#installing-Hadoopy.

Let's use Hadoopy to put a few files into HDFS. First, create an HDFS directory called data:

$ hadoop fs -mkdir data
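The directory can also be created from a Python script. The following is a minimal sketch, not part of Hadoopy: `build_mkdir_command` and `make_hdfs_dir` are hypothetical helper names, and the sketch assumes the `hadoop` executable is on the `PATH`. The optional `-p` flag (create missing parent directories) assumes Hadoop 2 or later.

```python
from __future__ import print_function

import subprocess


def build_mkdir_command(hdfs_dir, parents=False):
    """Return the argument list for `hadoop fs -mkdir` (hypothetical helper)."""
    cmd = ["hadoop", "fs", "-mkdir"]
    if parents:
        # Assumption: Hadoop 2+ shell, where -p creates missing parents.
        cmd.append("-p")
    cmd.append(hdfs_dir)
    return cmd


def make_hdfs_dir(hdfs_dir, parents=False, runner=subprocess.check_call):
    """Build the command and hand it to `runner`; return the command used.

    `runner` defaults to subprocess.check_call, which raises
    CalledProcessError if the hadoop command exits with a non-zero status.
    """
    cmd = build_mkdir_command(hdfs_dir, parents)
    runner(cmd)
    return cmd


if __name__ == "__main__":
    # Dry run: print the command instead of executing it.
    make_hdfs_dir("data", runner=print)
```

Passing a different `runner` keeps the sketch usable on a machine without Hadoop; dropping the argument runs the real command.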

Here is the code that puts the data into HDFS:

import hadoopy
import os

hdfs_path = ''

def read_local_dir(local_path):
  # Yield the path of every regular file directly inside local_path
  for fn in os.listdir(local_path):
    path = os.path.join(local_path, fn)
    if os.path.isfile(path):
      yield path

def main():
  local_path = './BigData/dummy_data'
  for file in read_local_dir(local_path):
    # Copy each local file into the HDFS directory 'data'
    hadoopy.put(file, 'data')
    print "The file %s has been put ...
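To try the upload loop without a running cluster, the copy step can be passed in as a parameter. The following is a minimal sketch under that assumption: `upload_dir` and its `put` parameter are hypothetical names introduced here, not part of Hadoopy. In real use, `hadoopy.put` (the call used in the listing) would be passed as `put`.

```python
from __future__ import print_function

import os


def read_local_dir(local_path):
    """Yield the path of every regular file directly inside local_path."""
    # sorted() makes the upload order deterministic.
    for fn in sorted(os.listdir(local_path)):
        path = os.path.join(local_path, fn)
        if os.path.isfile(path):
            yield path


def upload_dir(local_path, hdfs_dir, put):
    """Call put(local_file, hdfs_dir) for each file; return the files sent.

    `put` is any callable taking (local_path, hdfs_path), for example
    hadoopy.put on a machine with Hadoop configured.
    """
    uploaded = []
    for local_file in read_local_dir(local_path):
        put(local_file, hdfs_dir)
        uploaded.append(local_file)
    return uploaded


if __name__ == "__main__":
    # Stand-in for hadoopy.put that only reports what would be copied.
    def fake_put(local_file, hdfs_dir):
        print("would copy", local_file, "to", hdfs_dir)

    local_dir = "./BigData/dummy_data"  # local directory used in the listing
    if os.path.isdir(local_dir):
        upload_dir(local_dir, "data", fake_put)
```

With a cluster available, `upload_dir(local_dir, 'data', hadoopy.put)` would perform the real copy; subdirectories are skipped, as in the listing.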

