SampleData – a simple API for loading data

Loading data into a notebook is one of the most repetitive tasks a data scientist performs, yet depending on the framework or data source being used, writing the code to do it can be difficult and time-consuming.

Let's take a concrete example: loading a CSV file from an open data site (say https://data.cityofnewyork.us) into both a pandas DataFrame and an Apache Spark DataFrame.

Note: Going forward, all the code is assumed to run in a Jupyter Notebook.

For pandas, the code is pretty straightforward, as it provides an API to load directly from a URL:

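import pandas

# Direct CSV download link for a dataset on the NYC open data site
data_url = "https://data.cityofnewyork.us/api/views/e98g-f8hy/rows.csv?accessType=DOWNLOAD"
# read_csv accepts a URL as well as a local file path
building_df = pandas.read_csv(data_url)
building_df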

The last statement, ...
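To try the same pandas call without network access, here is a minimal sketch that assumes `pandas.read_csv` is given an in-memory text buffer instead of a URL. The column names and rows are made up for illustration and do not come from the NYC dataset; `csv_text` and `sample_df` are hypothetical names.

```python
import io

import pandas

# Hypothetical sample rows standing in for the downloaded CSV file
# (not taken from the NYC open data site).
csv_text = """borough,building_count
Manhattan,3
Brooklyn,5
"""

# read_csv accepts a file-like object as well as a path or URL,
# so the same call works on an in-memory buffer.
sample_df = pandas.read_csv(io.StringIO(csv_text))

# Two data rows and two columns were parsed.
print(sample_df.shape)  # prints (2, 2)
```

Swapping the buffer for `data_url` gives the original call; the rest of the code stays the same.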