Wrangle your data with Python

Learn techniques for programmatically acquiring data and how to extract what you need from it.

By Jacqueline Kazil and Katharine Jarmul
August 24, 2016
Close-up of a sorting container (source: Unsplash via Pixabay)

Acquiring and Storing Data

Finding your first dataset(s) to investigate might be the most important step toward achieving your goal of answering your questions. As we mentioned in not available, you should first spend some time refining your question until you have one specific enough to identify good data about, but broad enough to be interesting to you and others.

Alternatively, you might have a dataset you already find interesting, but no compelling question. If you don’t already know and trust the data source, you should spend some time investigating. Ask yourself: is the data valid? Is it updated? Can I rely on future or current updates and publications?


In this chapter, we will review where you can save and store data for later use. If databases are new to you, we will review when and how to use them and demonstrate how to set up a simple database to store your data. For those of you who are already familiar with databases or if your source is a database, we will cover some basic database connection structures in Python.

Don’t worry if you haven’t yet decided on a dataset; we will use several examples you can access in this book’s repository.

Note

We strongly encourage you to come up with some applicable questions to use throughout this book so you can better learn by doing. These could be questions you’ve been meaning to research or questions related to the data explored in the book. Even if the questions you pick are simple, it’s best to learn by writing some of your own code.

Not All Data Is Created Equal

Although we’d like to believe in the veracity and quality of every dataset we see, not all datasets will measure up to our expectations. Even datasets you currently use could prove to be ineffective and inefficient sources after further research. As you explore automated solutions to the data wrangling problems you face, you will find that the tools Python provides can help you distinguish good data from bad and suss out the viability of your data. We will cover more about those tools as we unravel data cleaning and data exploration with Python in Chapters not available and not available, and automation in not available.

When first getting hold of new data, we recommend performing a data smell test to decide whether you trust the data and if it is a reliable source of information. You can ask yourself:

  • Is the author a credible source I can contact if I have questions or concerns?
  • Does the data appear to be regularly updated and checked for errors?
  • Does the data come with information as to how it was acquired and what types of samples were used in its acquisition?
  • Is there another source of data that can verify and validate this dataset?
  • Given my overall knowledge of the topic, does this data seem plausible?

If you answered “yes” to at least three of those questions, you are on the right track! If you answered “no” to two or more of them, you might need to do some more searching to find data you can reliably defend.

Tip

You might need to reach out to the author and/or organization who initially published and collected the data to request more information. Often, a quick call or email to the right person can help you answer one or more of those questions and prove how reliable and informed your data source is.

Fact Checking

Fact checking your data, although sometimes annoying and exhausting, is paramount to the validity of your reporting. Depending on your dataset, fact checking may involve:

  • Contacting the source(s) and verifying their latest methods and releases
  • Determining other good sources for comparison
  • Calling an expert and talking with them about good sources and reliable information
  • Researching your topic further to determine whether your sources and/or datasets are credible

Libraries and universities with access to subscriber-only publishing and educational archives are great resources for fact checking. If you can access tools like LexisNexis, the Congressional Quarterly Press Library, JSTOR, Cornell University’s arXiv project, and Google’s Scholar search, you can determine what others have studied and said about the topic.

Google Search can also help in fact checking. If someone says the data comes from a published source, chances are there are other folks who have either fact checked that claim or have proof of that claim. Again, you need to use your own discretion when reviewing things published online. Is the source credible? Does the argument seem cogent and make sense? Does the proof appear valid? Evaluate your results with these questions in mind.

Tip

Government bureaus have vast datasets. If you’d like to study a phenomenon in your local city, state, or country, you can usually find someone via phone or email who has a useful dataset. Census bureaus worldwide regularly release census data and are a good first place to start if you’re stumped on what questions you’d like to answer.

Once you have verified and fact checked your initial dataset, it will be easier to both script it and determine the data’s validity in the future. You can even use some of the tips you learn throughout this book (particularly in not available) to create scripts and auto-update your data.

Readability, Cleanliness, and Longevity

If your dataset appears completely illegible, there is still hope: you can use the lessons in not available to clean it with code. Luckily, if it was created by a computer, it can likely be read by a computer. More difficulty exists in attempting to get data from “real life” onto our computers. As we saw in not available, PDFs and uncommon data file types can be difficult, but not impossible to work with.

We can use Python to help us read illegible data, but the illegibility may mean the data doesn’t come from a good source. If it is massive and generated by a computer, that is one thing—database dumps are never pretty. However, if the data you have is illegible and from a human source, it may point to an issue of data cleanliness and veracity.

Another issue you face is whether your data has already been cleaned. You can determine this by asking more about how the data is collected, reported, and updated. You should be able to determine:

  • How clean is the data?
  • Has someone taken the time to show statistical error rates or update erroneous entries or misreported data?
  • Will further updates be published or sent to you?
  • What methods were used in the collection of the data, and how were those methods verified?
Tip

If your source uses standardized and rigorous research and collection methods, you can likely reuse your cleaning and reporting scripts with little modification for years to come. Those systems don’t normally change regularly (as change is both costly and time-intensive). Once you’ve scripted your cleanup, you can easily process next year’s data and skip directly to data analysis.

In addition to cleanliness and readability, you care about the longevity of your data. Are you dealing with regularly collected and updated data? On what schedule is the data released and updated? Knowing how often an organization updates its data will help you determine your ability to use the data for years to come.

Where to Find Data

Just as there is more than one way to verify a source or script a PDF parser, there are many ways to find data. In this section, we’ll review the methods you can use both on and offline.

Using a Telephone

Look at the data file and ask yourself, how did the data get there? File types like Excel, PDF, or even Word usually involve a human in the process, and that person got the data from a source.

If you identify the person who collected the data, you might be able to get ahold of the raw data. This raw data may be in an easy-to-parse file format, like a CSV or database. The person you speak with can also answer questions about methods of collection and update timelines.

Here are some tips for finding a human from a data file:

  • Search the file for contact information.
  • Look for a byline—if there is no name, look for the organization.
  • Search for the filename and the title of the document on the Web.
  • Look at the file metadata by right-clicking and selecting “Properties” on Windows or “Get Info” on a Mac.

Reach out to any person you find. If that is not the person who created the file, then simply ask them if they know who did. Don’t be shy—your interest in their topic of study and work can be refreshing and flattering.

After you find someone to reach out to, try to reach them by phone or in person. Emails are easy to misread and usually end in extended messaging. Here are some example questions to help you think about what to ask:

  • Where did you get the data on pages 6 through 200?
  • Does it come in other formats, such as JSON, CSV, XML, or a database?
  • How was the data gathered?
  • Can you describe the data collection methods?
  • What do the abbreviations mean?
  • Will this data be updated? How and when?
  • Is there anyone else who can add more information?

Depending on your time constraints and the goal of your project, you may want to get started with data exploration while you wait for your questions to be answered.

US Government Data

For those interested in studying phenomena in the United States, recent pushes by the Obama administration to release readily accessible data online have provided easy access to regular government agency reporting. A quick browse on Data.gov reveals storm data, graduation and dropout rates, endangered species data, crime statistics, and many other interesting datasets.

Beyond federal data, states and local agencies have their own sites to release data—we’ve highlighted a few here:

If you can’t find the information you need publicly listed, do not hesitate to give the agency or bureau a call and request data over the phone. Many government offices have interns and staffers who handle public requests for information.

Government and Civic Open Data Worldwide

Depending on what country you want to research and whether you live there, there are many ways to acquire government data. Because we are more familiar with US policies, we do not claim this is an extensive listing of possibilities. If you come across other useful open data not covered in this book that you’d like to share, feel free to reach out to us!

Warning

We still recommend fact checking government datasets, particularly if the government has a history of human rights violations. Use your best judgment when approaching all data and do not hesitate to pick up the phone or email the listed contact to further inquire about data collection methods.

EU and UK

If you are interested in data from the European Union or the United Kingdom, there are many data portals available. Several of the following sites are put together by organizations and open data enthusiasts, so feel free to reach out directly to the site owner if you are looking for a particular dataset:

Africa

If you are interested in data from African nations, there are many projects working to amass data and build APIs for developer use. Many African nations also use their own open data portals (a quick Google search can usually identify these). We’ve singled out some useful regional projects:

Asia

If you are interested in data from Asian nations, most run their own open data sites. We’ve identified a few with impressive datasets and some regional data from organizations:

Non-EU Europe, Central Asia, India, the Middle East, and Russia

Many Central Asian, Central European, and Middle Eastern countries outside of the EU have their own government open data sites. We have highlighted a few, but your linguistic skills will be paramount if you know what regions and countries you’d like to target and want to access data in the native tongue (although Google Chrome will attempt to translate web pages automatically, so you may still be able to find useful data even if you don’t speak the language):

South America and Canada

Many South American nations have their own open data sites, found easily by search. Canada also has an open data portal for statistics. We have highlighted a few sites but encourage you to seek out particular sectors or governments you are interested in by searching online:

Organization and Non-Government Organization (NGO) Data

Organizations—both locally run and international—are great sources for datasets that cross state or national borders, such as data on climate change, international business and trade, and global transportation. If your topic is something the government might not collect (data on religious details, drug use, community-based support networks, etc.) or if the government in question is an unreliable source or lacks an open data portal, you might be able to find the data via an NGO or open data organization. We’ve listed some here, but there are many more fighting for open exchange and access to data:

Education and University Data

Universities and graduate departments around the world are constantly researching and releasing datasets, covering everything from advances in biological science to the interrelatedness of native cultures with neighboring ecological habitats. It’s hard to imagine a subject not broached within the educational sphere, so universities are a great place to get the latest topical data. Most researchers are happy to hear someone is
interested in their topic, so we encourage you to reach out to the appropriate departments and authors directly for more information. If you’re not sure where to start, here are a few good options:

Medical and Scientific Data

Similar to universities, scientific and medical research departments and organizations are an excellent resource for a broad array of data. Navigating scientific research can prove daunting, but don’t fret—if you can find the datasets used for the research, they usually come without the same research paper jargon. If you have a specific researcher or study in mind, we recommend reaching out directly; we’ve collected a few of the aggregators into the following list:

Crowdsourced Data and APIs

If your idea or question is better answered by crowdsourcing, the Internet and its many forums, services, and social media outlets allow you to create your own questions and answer them with the help of some data mining. Services like Twitter and Instagram boast billions of users and easy-to-use application programming interfaces (or APIs). APIs are protocols or tools that allow software or code to interact with another system. In our case, we are usually dealing with web-based APIs, where we can send web requests and ask for data from the service. Normally, with less than an hour of setup time, API access will put millions of records at your fingertips.

We will take a more in-depth look at APIs in not available, but for now some of the basic challenges and benefits of using an API are reviewed in Table 1-1.

Table 1-1. Using an API
  • Pro: Immediate access to data you can use. Con: Unreliability of a mass API system (selection bias).
  • Pro: Vast quantities of data. Con: Data overload.
  • Pro: You don’t have to worry about storage; you can just access the data from the service’s storage. Con: Reliability of and dependence on access (API limitations or downtime).

As you can see, there are benefits and compromises. If you find an API you want to use, create a few rules around how you will use it and what to do if it is not accessible (you may want to store responses locally to avoid downtime issues). Collecting enough responses over time can also help eliminate some selection bias in your research.
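For example, here is a minimal sketch of the “store responses locally” idea using the requests library; the API URL and cache filename are hypothetical placeholders rather than a real service:

import json
import os

import requests

API_URL = 'https://api.example.com/v1/records'  # hypothetical endpoint
CACHE_FILE = 'api_response_cache.json'

def get_records():
    """Fetch records from the API, falling back to the local cache on failure."""
    try:
        response = requests.get(API_URL, timeout=10)
        response.raise_for_status()
        records = response.json()
        with open(CACHE_FILE, 'w') as f:
            json.dump(records, f)  # keep a local copy for downtime or rate limits
        return records
    except (requests.RequestException, ValueError):
        if os.path.exists(CACHE_FILE):
            with open(CACHE_FILE) as f:
                return json.load(f)
        raise

Caching responses like this also gives you a growing local dataset to work with whenever the API is unavailable.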

Outside of social web services, there are a variety of sites where you can post your own questions and ideas and ask for a crowdsourced reply. Whether you want to go to an expert forum related to the topic or post a survey and circulate it through your own channels is up to you, but be aware when using your own research questions and methods that you must account for whatever size and sampling errors arise. For a more detailed introduction to writing
your own survey along with citations for more information, the University of Wisconsin’s survey guide can be a good starting point.

For other crowdsourced data, take a look at:

The amount of data available is enormous, and it’s no small task sorting through all of the noise to get a good idea of what questions you can answer and how you should go about answering them. Let’s walk through a few case studies to give you a better idea of how to go about pursuing the data that helps answer your questions.

Case Studies: Example Data Investigation

We will outline a few different areas of interest and questions so you have a good idea of how you can begin.

Ebola Crisis

Let’s say you are interested in investigating the Ebola crisis in West Africa. How might you begin? To start with, you might quickly Google “Ebola crisis data.” You will find there are many international organizations working to track the spread of the virus, and those organizations put numerous tools at your disposal. First, you may find the World Health Organization’s situation report. The WHO site has information on the latest cases
and deaths, interactive maps showing affected regions, and key performance indicators for response measures, and it appears to be updated on a weekly basis. The data is available in CSV and JSON formats, and it’s a credible, reliable, and regularly updated source of information.

Rather than stopping at the first result you turn up, you keep digging to see what other sources are available. Upon further searching, we find a GitHub repository run by user cmrivers, which collects raw data from a variety of governmental and media sources. Because we know the user and can contact them via their contact information, we can also verify when these sources were last updated and ask any questions about the collection methods. The formats are ones we know how to handle (CSV, PDF files), so they shouldn’t present a problem.

As you dig further, you might focus on one particular question, like “What precautions are being taken for safe burial?” You find a report on safe and dignified burials maintained by Sam Libby. Perfect! If and when you have any questions, you can contact Sam directly.

You’ve found a good initial list of sources, verified they are from organizations you can trust, and identified someone you can ask for more information as your research progresses. Now, let’s take a look at another example.

Train Safety

Let’s say you’re interested in train safety in the United States. Maybe your question is: What are the negative factors affecting train safety? To begin, you might look at some previous research around train safety. You come across the Federal Railroad Administration (FRA), whose core duty is to ensure railroad safety and usability. As you read some of the reports and fact sheets on the FRA website, you determine most reports indicate train accidents occur
due to poor track maintenance or human error.

You’re interested in the human side of things, so you start digging a little more. You find that the FRA has numerous reports on railroad employees and safety. You find a report on sleep patterns of railroad workers, which could shed some light on how human errors occur. You also find some information about federal regulations for drug and alcohol testing of railroad employees.

Now you might have more questions you can specify to narrow down what you really want to know. Maybe your question is now, “How often are railroad accidents caused by alcohol?” Or “How often are train engineers overworked or working while exhausted?” You have some initial trusted datasets and the ability to call the FRA and request more information as your research progresses.

Football Salaries

Now let’s say you’re interested in football (the kind with the feet, not the pigskin one) salaries. How much are these players making, and how significant is each player’s impact on his team?

As you first start searching, you determine you should focus on one league, given the disparate data. Let’s say you choose the English Premier League. You find a listing of Premier League club salaries on a site you may never have heard of before. It seems the author has also compiled lists of each team and how much each player is getting paid. To
better understand where the data is coming from and to make sure you can trust the source, you should reach out to the author listed on the page and get more information.

If you’re also searching for endorsements, you may come across the Statista charts outlining endorsement and salary data for the top-paid football players. You’ll probably want to reach out and see if there is updated data on endorsements so you can compare the most recent season.

Now that you have some salary data, you’ll want to look at statistics on how good the top-paid players are. You find some player statistics on the Premier League’s website. This is data you can likely only get by web scraping (more on this in not available), but you know you can trust the source. Another search for statistics on players turns up some more data on
top assists. You can also analyze penalty try statistics. Again, you should investigate the validity of any source you use that’s not easily verified.

Now you can begin your analysis to see how much each football player’s goals, red cards, and penalty kicks are worth!

Child Labor

Finally, let’s dive into the questions we will be answering in later chapters of this book. We’ll focus on the international crisis of child labor. When we think about international topics, we immediately turn to international organizations.

We find UNICEF’s open data site dedicated to reporting on child labor. In fact, UNICEF has entire datasets on women and children’s well-being and status across the globe. These might prove fruitful for answering questions such as “Does early marriage affect child labor rates?”

When looking for government data, we identify the US Department of Labor’s annual reports on child labor across the globe. These will be a great cross reference for our UNICEF datasets.

Additionally, we find the International Labor Organization’s (ILO) trend report on child labor. The ILO report seems to have links to many different datasets and should be a good reference for historical data on child labor.

We’ve amassed several datasets to use throughout the next chapters. We’ve included them all in the data repository so you can use them and follow along.

Now that we’ve explored how to identify questions and find resources, let’s look at storing our data.

Storing Your Data: When, Why, and How?

Once you’ve located your data, you need a place to store it! Sometimes, you’ll have received data in a clean, easy-to-access, and machine-readable format. Other times, you might want to find a different way to store it. We’ll review some data storage tools to use when you first extract your data from a CSV or PDF, or you can wait and store your data once it’s fully processed and cleaned (which we will cover in not available).

So, let’s say your dataset is disparate: a file from here, a report from there; some of it easy to download and access, but other bits you might need to copy or scrape from the Web. We will review how to clean and combine datasets in Chapters 7 and 9, but now let’s talk about how we can store data in a shared place.

Tip

If you are going to be using datasets from multiple computers, it’s always a good idea to store them on a network or the Internet (hello cloud computing!), or have them on an external hard drive or USB stick. Keep this in mind when you are working with a team who may need data access from different locations or computers. If you’re working on a single computer, make sure you have a data backup strategy. The worst thing about losing
your laptop will be losing the data you spent months acquiring and cleaning.

Databases: A Brief Introduction

Databases—learn to love them, love to hate them. As a developer, you’ll likely find yourself using many different types of databases throughout your education and work. This section is by no means intended as a comprehensive overview of databases, but we aim to provide a brief introduction to basic database concepts. If you already know and use databases actively, give this section a quick review and move on to the other storage solutions and when to use them.

Have you ever looked up a number on your phone using Siri? Have you ever searched Google? Have you ever clicked on a hashtag on Twitter or Instagram? Each of these actions involves a simple search and a response from a database (or a series of databases, or a database cache). You have a question (What funny new Maru videos are on YouTube?), you ask a particular database (YouTube Search), and get a fun(ny) response—a listing of search results to enjoy.

In the following sections, we outline two major database types, highlighting the pros and cons of each as well as their different strengths and weaknesses. For the purposes of data wrangling, you absolutely do not need to use a database; however, as you become more advanced at data wrangling and analysis, database use and knowledge will become more important and help advance your ability to store and analyze data.

If you’re interested in databasing, we will give a few tips on how to use Python with databases; but we clearly don’t have enough time here to fully cover the topic. We highly recommend you search out more information, videos, and tutorials based on your interest in this section.

Relational Databases: MySQL and PostgreSQL

Relational databases are great for data coming from a variety of sources with varying levels of interconnectedness. Relational data exemplifies its name: if your data has connections similar to a family tree, a relational database like MySQL will likely work well for you.

Relational data usually uses a series of unique identifiers to actively match datasets. In SQL, we normally call them IDs. These IDs can be used by other sets of data to find and match connections. From these connected datasets, we can make what we call joins, which allow us to access connected data from many different datasets at once. Let’s look at an example.

I have a really awesome friend. Her name is Meghan. She has black hair and works at The New York Times. In her spare time, she likes to go dancing, cook food, and teach people how to code. If I had a database of my friends and used SQL to represent their attributes, I might break it down like so:

**friend_table:  (1)
friend_id  (2)
friend_name
friend_date_of_birth
friend_current_location
friend_birthplace
friend_occupation_id

**friend_occupation_table:
friend_occupation_id
friend_occupation_name
friend_occupation_location

**friends_and_hobbies_table:
friend_id
hobby_id

**hobby_details_table:
hobby_id
hobby_name
hobby_level_of_awesome

(1) In my database of friends, each of these sections (marked with **) would be tables. In relational databasing, tables usually hold information about a specific topic or object.

(2) Each piece of information a table holds is called a field. In this case, the friend_id field holds a unique ID for each friend in my friend_table.

With my database, I can ask: What are Meghan’s hobbies? To access the information, I would say to the database, “Hey, I’m looking for my friend Meghan. She lives in New York and here is her birthday; can you tell me her ID?” My SQL database will respond to this query with her friend_id. I can then ask my friends_and_hobbies_table (which properly matches hobby IDs with friend IDs) what hobbies match up with this friend ID, and it will respond with a list of three new hobby IDs.

Because these IDs are numbers, I want to learn more about what they mean. I ask the hobby_details_table, “Can you tell me more about these hobby IDs?” and it says, “Sure! One is dancing, one is cooking food, and one is teaching people how to code.” Aha! I have solved the riddle, using just an initial friend description.
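To make the join concrete, here is a small sketch using Python’s built-in sqlite3 module and an in-memory database; the table and column names follow the hypothetical friend schema above, trimmed to just the fields the query needs:

import sqlite3

conn = sqlite3.connect(':memory:')

# Create a trimmed-down version of the friend schema and load Meghan's data.
conn.executescript("""
    CREATE TABLE friend_table (friend_id INTEGER PRIMARY KEY, friend_name TEXT);
    CREATE TABLE hobby_details_table (hobby_id INTEGER PRIMARY KEY, hobby_name TEXT);
    CREATE TABLE friends_and_hobbies_table (friend_id INTEGER, hobby_id INTEGER);
    INSERT INTO friend_table VALUES (1, 'Meghan');
    INSERT INTO hobby_details_table VALUES (1, 'dancing');
    INSERT INTO hobby_details_table VALUES (2, 'cooking food');
    INSERT INTO hobby_details_table VALUES (3, 'teaching people how to code');
    INSERT INTO friends_and_hobbies_table VALUES (1, 1);
    INSERT INTO friends_and_hobbies_table VALUES (1, 2);
    INSERT INTO friends_and_hobbies_table VALUES (1, 3);
""")

# Join the three tables to go from a friend's name to her hobby names.
cursor = conn.execute("""
    SELECT h.hobby_name
    FROM friend_table f
    JOIN friends_and_hobbies_table fh ON f.friend_id = fh.friend_id
    JOIN hobby_details_table h ON fh.hobby_id = h.hobby_id
    WHERE f.friend_name = 'Meghan';
""")
print(cursor.fetchall())

A single joined query answers the riddle in one step, which is exactly what joins are for.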

Setting up and getting data into a relational database can involve many steps, but if your datasets are complex with many different relationships, it shouldn’t take more than a few steps to figure out how to join them and get the information you desire. When building relational databases, spend time mapping out the relations and their attributes, similar to what we did with the friend database. What are the different types of data, and how can they be mapped to one another?

In a relational database schema, we figure out how we want to match data by thinking about how we will most often use the data. You want the queries you ask the database to be easy to answer. Because we thought we might use occupation to help identify a friend, we put the friend_occupation_id in the friend_table.

Another thing to note is there are several different kinds of relationships. For example, I can have many friends with cooking as a hobby. This is what we call a many-to-many relationship. If we were to add a table such as pets, that would add a different kind of relationship—a many-to-one. This is because a few of my friends have more than one pet, but each pet belongs to only one friend. I could look up all the pets of a friend by using their friend_id.

If learning more about SQL and relational databases interests you, we recommend taking a longer look at SQL. Learn SQL The Hard Way and SQLZOO are great first places to start. There are some slight differences in syntax between PostgreSQL and MySQL, but they both follow the same basics, and learning one over the other is a matter of personal choice.

MySQL and Python

If you are familiar with (or learning) MySQL and you’d like to use a MySQL database, there are Python bindings to easily connect. You will need to perform two steps. First, you must install a MySQL driver. Then, you should use Python to send authentication information (user, password, host, database name). There is a great Stack Overflow write-up covering both.
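As a rough sketch of what those two steps look like, here is one option using the pure-Python PyMySQL driver (pip install pymysql); this is not necessarily the driver the Stack Overflow write-up uses, and the credentials and database name here are placeholders for your own:

import pymysql

# Placeholder credentials -- substitute your own host, user, password, and database.
connection = pymysql.connect(
    host='localhost',
    user='data_wrangler',
    password='supersecret',
    db='data_wrangling',
)

try:
    with connection.cursor() as cursor:
        cursor.execute('SELECT VERSION();')
        print(cursor.fetchone())  # a quick check that the connection works
finally:
    connection.close()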

PostgreSQL and Python

If you are familiar with (or learning) PostgreSQL and you’d like to use a PostgreSQL database, there are Python bindings for PostgreSQL, too. You will also need to perform two steps: installing a driver and then connecting with Python.

There are many PostgreSQL drivers for Python, but the most popular one is Psycopg. Psycopg’s installation page has details about getting it running on your machine and there is a lengthy introduction on how to use it with Python on the PostgreSQL site.
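A minimal connection sketch with Psycopg (installed as psycopg2) might look like the following; the connection details are placeholders for your own database:

import psycopg2

# Placeholder connection details -- substitute your own.
conn = psycopg2.connect(
    dbname='data_wrangling',
    user='data_wrangler',
    password='supersecret',
    host='localhost',
)

cur = conn.cursor()
cur.execute('SELECT version();')
print(cur.fetchone())  # a quick check that the connection works

cur.close()
conn.close()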

Non-Relational Databases: NoSQL

Let’s say you like the idea of using a database, but mapping out all those relations gives you the heebie-jeebies. Maybe it’s just that you don’t really understand how the data connects right now. Maybe it’s that you have flat data (i.e., nonrelational data that doesn’t necessarily map well). Or maybe you don’t have a deeper interest in learning SQL. Luckily, there is a database out there for you.

NoSQL and other nonrelational databases store data in a flat format, usually JSON. As we discussed in not available, JSON uses a simple lookup for information. Going back to the data shared in the previous section about my friend, what if I just had the data stored in nodes that allow me to look up more information about that friend? It might look like so:

{
    'name': 'Meghan',
    'occupation': { 'employer': 'NYT',
                    'role': 'design editor',
                  },
    'birthplace': 'Ohio',
    'hobbies': ['cooking', 'dancing', 'teaching'],
}

As you can see, I can have a simple list of all of my friend’s attributes, without having to create tables.

What, you may ask, is the benefit of relational data? Depending on who you ask, you might get very different responses—within computer science and among developers this is a hotly debated topic. Our opinion is there are many advances SQL has made to allow for quick lookups when your data is structured with a vast network of relations. There are also plenty of advances nonrelational databases have made in speed, availability, and duplication.

In the end, if you have a stronger interest in learning one over the other, let that guide your decision rather than determining now what your dataset looks like. If you need to migrate it one way or another, there are tools that will help you migrate in either direction.[1]

MongoDB with Python

If you already have data in a nonrelational database structure or you are hoping to learn by doing, it’s very easy to connect to NoSQL databases using Python. Although there are plenty to choose from, one of the most popular NoSQL database frameworks is MongoDB. To use MongoDB, you need to first install the drivers and then use Python to connect. There was a great “Getting Started with MongoDB” presentation at PyCon 2012; it is a useful place to start learning about MongoDB and how to connect using Python.
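As a taste of what that looks like, here is a minimal sketch using the PyMongo driver (pip install pymongo) with a MongoDB server running locally on the default port; the database and collection names are placeholders, and insert_one/find_one are the PyMongo 3.x method names:

from pymongo import MongoClient

client = MongoClient('localhost', 27017)
db = client['data_wrangling']  # databases and collections are created lazily

friends = db['friends']
friends.insert_one({
    'name': 'Meghan',
    'occupation': {'employer': 'NYT', 'role': 'design editor'},
    'hobbies': ['cooking', 'dancing', 'teaching'],
})

print(friends.find_one({'name': 'Meghan'}))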

Setting Up Your Local Database with Python

One of the easiest ways to get started with databases and Python is to use a simple library to help you ramp up quickly. For the purpose of this book, we recommend starting with Dataset. Dataset is a wrapper library, which means it helps speed development by translating our readable Python code into the database code we want to work with.

If you already have a SQLite, PostgreSQL, or MySQL database, you can plug it right in by following the quickstart guide. If you don’t yet have one of those, it will create one for you as you use the tool. Let’s take a look at getting it running on your computer.

The first thing you will need to do is install Dataset. If you are already using pip, then simply type pip install dataset.

You then need to decide on the backend you will use. If you are already using PostgreSQL or MySQL, simply set up a new database following the proper syntax for your chosen database. If you are new to databases, we will use SQLite. First, download your operating system’s SQLite binary. Open the downloaded file and follow the installation instructions.

Open your terminal and change (cd) to your project folder holding your Python data wrangling scripts. To create your new SQLite database, type:

sqlite3 data_wrangling.db

You should see a prompt beginning with sqlite> asking you to enter SQL. Now that you’ve confirmed you have sqlite3 running on your computer, you can exit the SQLite terminal by typing .q. Upon exit, list the files in your current folder. You should now have a file called data_wrangling.db—that’s your database!
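If you would rather stay inside Python, the standard library’s sqlite3 module can create and inspect the same file; here is a minimal sketch using the database name from above:

import sqlite3

# Connecting creates data_wrangling.db if it doesn't already exist.
conn = sqlite3.connect('data_wrangling.db')
print(conn.execute('SELECT sqlite_version();').fetchone())
conn.close()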

Once you have SQLite installed and your first database running, it’s time to get it working with Dataset. Let’s try the following code in a Python shell:

import dataset

db = dataset.connect('sqlite:///data_wrangling.db')

my_data_source = {
  'url':
  'http://www.tsmplug.com/football/premier-league-player-salaries-club-by-club/',
  'description': 'Premier League Club Salaries',
  'topic': 'football',
  'verified': False,
}  # (1)

table = db['data_sources']  # (2)
table.insert(my_data_source)  # (3)

another_data_source = {
  'url':
  'http://www.premierleague.com/content/premierleague/en-gb/players/index.html',
  'description': 'Premier League Stats',
  'topic': 'football',
  'verified': True,
}

table.insert(another_data_source)

sources = db['data_sources'].all()  # (4)

print(sources)

(1) Creates a Python dictionary of the data we are looking to save. We are saving the sources for our football research. We added information about the topic, description, URL, and whether we have verified the data yet.

(2) Creates a new table called data_sources.

(3) Inserts our first data source into our new table.

(4) Shows all of the data sources we have stored in our data_sources table.

You have now set up your first relational table using SQLite, and done your first Python database interactions. As the book progresses, you will be able to add more data and tables to your database. Having all of your data stored in one place helps keep your data organized and your research focused.
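If you later want to check what you have stored, the dataset library also provides simple query helpers. Here is a short sketch that reconnects to the same database and looks records up by the fields we saved:

import dataset

db = dataset.connect('sqlite:///data_wrangling.db')
table = db['data_sources']

# Look up a single record by one of its fields.
print(table.find_one(topic='football'))

# Iterate over every football source saved so far.
for source in table.find(topic='football'):
    print(source['url'], source['verified'])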

When to Use a Simple File

If your dataset is small, a simple file rather than a database will most likely do. You might want to take a look at not available and start with some sanitizing techniques before saving it, but keeping it in a CSV or some other simple file format is perfectly fine. The same csv module we worked with to import CSVs (see not available) also has some easy-to-use writer classes.
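For example, here is a minimal sketch of writing rows out with csv.DictWriter; the filename and fields are made up for illustration:

import csv

rows = [
    {'player': 'Player One', 'club': 'Example FC', 'weekly_salary': 50000},
    {'player': 'Player Two', 'club': 'Example United', 'weekly_salary': 65000},
]

# On Python 3, pass newline='' to open(); on Python 2, open the file in 'wb' mode.
with open('salaries.csv', 'w') as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames=['player', 'club', 'weekly_salary'])
    writer.writeheader()
    writer.writerows(rows)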

Your main consideration when using simple files is making sure you have easy access and backup. To manage these needs, you can store your data on a shared network drive or in a cloud-based service (Dropbox, Box, Amazon, Google Drive). Using one of these services usually means you will also have backup options, management capabilities, and the ability to share files. This is extremely helpful for those “oops, I overwrote the data file” moments.

Cloud-Storage and Python

Depending on your cloud storage solution, you should research the best way to get Python connected to your data. Dropbox has great Python support, and its Getting Started with Python guide provides a good introduction. Google Drive is a bit more complex, but the Python quickstart guide will help you through the first steps. There are also some Google Drive Python API wrappers, like PyDrive, that allow you to use Google Drive without knowing much Python. We highly recommend gspread for managing spreadsheets on Google Drive.

If you have your own cloud servers, you might need to research the best way to connect to them. Python has built-in URL request, FTP (File Transfer Protocol), and SSH/SCP (Secure Shell/Secure Copy) methods, all documented in the Python stdlib. We will also cover some useful libraries for managing cloud services in not available.
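As one example of the built-in options, here is a minimal sketch of downloading a file over FTP with the standard library’s ftplib; the host and file path are placeholders for your own server:

from ftplib import FTP

# Placeholder host and path -- substitute your own server details.
ftp = FTP('ftp.example.com')
ftp.login()  # anonymous login; pass user and passwd arguments for authenticated access

with open('remote_data.csv', 'wb') as local_file:
    ftp.retrbinary('RETR /pub/data/remote_data.csv', local_file.write)

ftp.quit()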

Local Storage and Python

The simplest and most straightforward way to store your data is locally. You can open documents on your filesystem with one line of Python (the built-in open function). You can also update and save new files as you work with the data, using the write method of the file objects open returns.
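For instance, saving a quick note about a dataset locally and reading it back takes only a few lines; the filename here is arbitrary:

# Write a small text file to the local filesystem.
notes = 'Premier League club salaries -- still need to verify the source.\n'

with open('data_notes.txt', 'w') as outfile:
    outfile.write(notes)

# Read the file back to confirm it was saved.
with open('data_notes.txt') as infile:
    print(infile.read())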

Alternative Data Storage

There are many new and interesting ways to store data that don’t involve the aforementioned paths. Depending on your use case, there may be a better way to store the data you are looking to use. Here are a few interesting ones, with short sketches of each following the list:

Hierarchical Data Format (HDF)
HDF is a file-based scalable data solution allowing you to quickly store large datasets to a filesystem (local or otherwise). If you are already familiar with HDF, Python has an HDF5 driver, h5py, which connects Python to HDF5.
Hadoop
Hadoop is a big data distributed storage system, allowing you to store and process data across clusters. If you are already working with Hadoop or are familiar with Hadoop, Cloudera has a Guide to Python Frameworks for Hadoop with some great getting-started code samples.
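Here is a minimal h5py sketch, assuming you also have NumPy installed; the file and dataset names are made up for illustration:

import numpy as np
import h5py

data = np.random.random((1000, 3))  # stand-in for a large numeric dataset

# Write the array to an HDF5 file, then read a slice back.
with h5py.File('wrangling_example.h5', 'w') as f:
    f.create_dataset('measurements', data=data)

with h5py.File('wrangling_example.h5', 'r') as f:
    print(f['measurements'][:5])

For Hadoop, one widely used Python framework is mrjob (pip install mrjob); a classic word-count sketch, runnable locally for testing with python word_count.py input.txt, might look like this:

from mrjob.job import MRJob

class MRWordCount(MRJob):
    """Count how many times each word appears in the input files."""

    def mapper(self, _, line):
        for word in line.split():
            yield word.lower(), 1

    def reducer(self, word, counts):
        yield word, sum(counts)

if __name__ == '__main__':
    MRWordCount.run()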

Summary

Congratulations! You’ve gotten through some of the largest questions facing your project: How can I find useful data? How can I access and store the data? We hope you feel confident with the sources you have acquired and the reliability of your first dataset(s). We also hope you have a solid plan for backup and data storage.

You can use the skills you’ve honed in this chapter on future datasets, even if it’s just spending a few hours on data sites exploring questions that pop into your mind.

You should now feel confident:

  • Determining the value and use of a dataset you find
  • Picking up the phone to reach out for more information
  • Deciding where you might first look for data to answer a question
  • Implementing a safe and hassle-free way to store your data
  • Validating the data you have found
  • Building relational models of the data

You’ve also been introduced to the concepts in Table 1-2.

Table 1-2. New Python and programming concepts and libraries
  • Relational databases (e.g., MySQL and PostgreSQL): storing relational data in an easy way
  • Non-relational databases (e.g., MongoDB): storing data in a flat way
  • SQLite setup and usage: easy-to-use SQL-based storage that works well for simple projects
  • Dataset installation and usage: easy-to-use Python database wrapper

You’ll be using all of these skills and more as you move forward in future chapters. In the next chapter, you’ll be learning all about cleaning your data, finding inconsistencies with the code, and getting it closer to a fully running script or program so you can analyze your data and output results to share with the world.

[1] For more reading on database migration between SQL and NoSQL databases, check out Matt Asay’s writeup on migrating Foursquare to a NoSQL database from a relational database. Additionally, there are some Quora writeups covering migration in the opposite direction.
