Data Wrangling with Python

by Jacqueline Kazil, Katharine Jarmul

February 2016

Beginner to intermediate

508 pages

12h 27m

English

O'Reilly Media, Inc.

Content preview from Data Wrangling with Python

Chapter 5. PDFs and Problem Solving in Python

Publishing data only in PDFs is criminal, but sometimes you don’t have other options. In this chapter, you are going to learn how to parse PDFs, and in doing so you will learn how to troubleshoot your code.

We will also cover how to write a script, starting with basic concepts like imports and then gradually introducing more complexity. Throughout this chapter, you will learn a variety of ways to think about and tackle problems in your code.

Avoid Using PDFs!

The data used in this section is the same data as in the previous chapter, but in PDF form. Normally, one does not seek data in difficult-to-parse formats, but we did for this book because the data you need to work with may not always be in the ideal format. You can find the PDF we use in this chapter in the book’s GitHub repository.

There are a few things you need to consider before you start parsing PDF data:

Have you tried to find the data in another form? If you can’t find it online, try using a phone or email.
Have you tried to copy and paste the data from the document? Sometimes, you can easily select, copy, and paste data from a PDF into a spreadsheet. This doesn’t always work, though, and it is not scalable (meaning you can’t do it for many files or pages quickly).
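
As a rough illustration of the copy-and-paste route (not something the chapter itself provides), here is a minimal sketch of turning pasted table text into CSV rows. It assumes the pasted columns arrive separated by tabs or by runs of two or more spaces, which is common but not guaranteed. The sample text and the function names `rows_from_pasted_text` and `to_csv` are hypothetical, and the code is written for Python 3.

```python
import csv
import io
import re

# Hypothetical stand-in for text pasted from a PDF table; real output will vary.
PASTED = """\
Name      Column A    Column B
Row one   1           2
Row two   3           4
"""


def rows_from_pasted_text(text):
    """Split each non-empty line into cells on tabs or runs of 2+ whitespace."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if line:
            # A single space stays inside a cell (e.g. "Column A").
            rows.append(re.split(r"\s{2,}|\t", line))
    return rows


def to_csv(rows):
    """Serialize a list of rows to a CSV-formatted string."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


rows = rows_from_pasted_text(PASTED)
csv_text = to_csv(rows)
```

If the rows come out with uneven numbers of cells, the separator assumption probably does not hold for your document, which is a good sign that manual copy and paste will not be reliable for it.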

If you can’t avoid dealing with PDFs, you’ll need to learn how to parse your data with Python. Let’s get started.