Data Wrangling with Python

by Jacqueline Kazil, Katharine Jarmul

February 2016

Beginner to intermediate

508 pages

12h 27m

English

O'Reilly Media, Inc.


Who Should Read This Book · Who Should Not Read This Book · How This Book Is Organized · What Is Data Wrangling? · What to Do If You Get Stuck · Conventions Used in This Book · Using Code Examples · O’Reilly Safari · How to Contact Us · Acknowledgments
Why Python · Getting Started with Python · Which Python Version · Setting Up Python on Your Machine · Test Driving Python · Install pip · Install a Code Editor · Optional: Install IPython · Summary
Basic Data Types · Strings · Integers and Floats · Data Containers · Variables · Lists · Dictionaries · What Can the Various Data Types Do? · String Methods: Things Strings Can Do · Numerical Methods: Things Numbers Can Do · List Methods: Things Lists Can Do · Dictionary Methods: Things Dictionaries Can Do · Helpful Tools: type, dir, and help · type · dir · help · Putting It All Together · What Does It All Mean? · Summary
CSV Data · How to Import CSV Data · Saving the Code to a File; Running from Command Line · JSON Data · How to Import JSON Data · XML Data · How to Import XML Data · Summary
Installing Python Packages · Parsing Excel Files · Getting Started with Parsing · Summary
Avoid Using PDFs! · Programmatic Approaches to PDF Parsing · Opening and Reading Using slate · Converting PDF to Text · Parsing PDFs Using pdfminer · Learning How to Solve Problems · Exercise: Use Table Extraction, Try a Different Library · Exercise: Clean the Data Manually · Exercise: Try Another Tool · Uncommon File Types · Summary
Not All Data Is Created Equal · Fact Checking · Readability, Cleanliness, and Longevity · Where to Find Data · Using a Telephone · US Government Data · Government and Civic Open Data Worldwide · Organization and Non-Government Organization (NGO) Data · Education and University Data · Medical and Scientific Data · Crowdsourced Data and APIs · Case Studies: Example Data Investigation · Ebola Crisis · Train Safety · Football Salaries · Child Labor · Storing Your Data: When, Why, and How? · Databases: A Brief Introduction · Relational Databases: MySQL and PostgreSQL · Non-Relational Databases: NoSQL · Setting Up Your Local Database with Python · When to Use a Simple File · Cloud-Storage and Python · Local Storage and Python · Alternative Data Storage · Summary
Why Clean Data? · Data Cleanup Basics · Identifying Values for Data Cleanup · Formatting Data · Finding Outliers and Bad Data · Finding Duplicates · Fuzzy Matching · RegEx Matching · What to Do with Duplicate Records · Summary
Normalizing and Standardizing Your Data · Saving Your Data · Determining What Data Cleanup Is Right for Your Project · Scripting Your Cleanup · Testing with New Data · Summary
Exploring Your Data · Importing Data · Exploring Table Functions · Joining Numerous Datasets · Identifying Correlations · Identifying Outliers · Creating Groupings · Further Exploration · Analyzing Your Data · Separating and Focusing Your Data · What Is Your Data Saying? · Drawing Conclusions · Documenting Your Conclusions · Summary
Avoiding Storytelling Pitfalls · How Will You Tell the Story? · Know Your Audience · Visualizing Your Data · Charts · Time-Related Data · Maps · Interactives · Words · Images, Video, and Illustrations · Presentation Tools · Publishing Your Data · Using Available Sites · Open Source Platforms: Starting a New Site · Jupyter (Formerly Known as IPython Notebooks) · Summary
What to Scrape and How · Analyzing a Web Page · Inspection: Markup Structure · Network/Timeline: How the Page Loads · Console: Interacting with JavaScript · In-Depth Analysis of a Page · Getting Pages: How to Request on the Internet · Reading a Web Page with Beautiful Soup · Reading a Web Page with LXML · A Case for XPath · Summary
Browser-Based Parsing · Screen Reading with Selenium · Screen Reading with Ghost.Py · Spidering the Web · Building a Spider with Scrapy · Crawling Whole Websites with Scrapy · Networks: How the Internet Works and Why It’s Breaking Your Script · The Changing Web (or Why Your Script Broke) · A (Few) Word(s) of Caution · Summary
API Features · REST Versus Streaming APIs · Rate Limits · Tiered Data Volumes · API Keys and Tokens · A Simple Data Pull from Twitter’s REST API · Advanced Data Collection from Twitter’s REST API · Advanced Data Collection from Twitter’s Streaming API · Summary
Why Automate? · Steps to Automate · What Could Go Wrong? · Where to Automate · Special Tools for Automation · Using Local Files, argv, and Config Files · Using the Cloud for Data Processing · Using Parallel Processing · Using Distributed Processing · Simple Automation · CronJobs · Web Interfaces · Jupyter Notebooks · Large-Scale Automation · Celery: Queue-Based Automation · Ansible: Operations Automation · Monitoring Your Automation · Python Logging · Adding Automated Messaging · Uploading and Other Reporting · Logging and Monitoring as a Service · No System Is Foolproof · Summary
Duties of a Data Wrangler · Beyond Data Wrangling · Become a Better Data Analyst · Become a Better Developer · Become a Better Visual Storyteller · Become a Better Systems Architect · Where Do You Go from Here?
C, C++, and Java Versus Python · R or MATLAB Versus Python · HTML Versus Python · JavaScript Versus Python · Node.js Versus Python · Ruby and Ruby on Rails Versus Python
Online Resources · In-Person Groups
Bash · Navigation · Modifying Files · Executing Files · Searching with the Command Line · More Resources · Windows CMD/Power Shell · Navigation · Modifying Files · Executing Files · Searching with the Command Line · More Resources
Step 1: Install GCC · Step 2: (Mac Only) Install Homebrew · Step 3: (Mac Only) Tell Your System Where to Find Homebrew · Step 4: Install Python 2.7 · Step 5: Install virtualenv (Windows, Mac, Linux) · Step 6: Set Up a New Directory · Step 7: Install virtualenvwrapper · Installing virtualenvwrapper (Mac and Linux) · Installing virtualenvwrapper-win (Windows) · Testing Your Virtual Environment (Windows, Mac, Linux) · Learning About Our New Environment (Windows, Mac, Linux) · Advanced Setup Review
Hail the Whitespace · The Dreaded GIL · = Versus == Versus is, and When to Just Copy · Default Function Arguments · Python Scope and Built-Ins: The Importance of Variable Names · Defining Objects Versus Modifying Objects · Changing Immutable Objects · Type Checking · Catching Multiple Exceptions · The Power of Debugging
Why Use IPython? · Getting Started with IPython · Magic Functions · Final Thoughts: A Simpler Terminal
Spinning Up an AWS Server · AWS Step 1: Choose an Amazon Machine Image (AMI) · AWS Step 2: Choose an Instance Type · AWS Step 7: Review Instance Launch · AWS Extra Question: Select an Existing Key Pair or Create a New One · Logging into an AWS Server · Get the Public DNS Name of the Instance · Prepare Your Private Key · Log into Your Server · Summary

Content preview from Data Wrangling with Python

Chapter 12. Advanced Web Scraping: Screen Scrapers and Spiders

In Chapter 11, you began developing your web scraping skills, learning how to decide what, how, and where to scrape. In this chapter, we’ll look at more advanced scrapers, such as browser-based scrapers and spiders, to gather content.

We’ll also learn about debugging common problems with advanced web scraping and cover some of the ethical questions presented when scraping the Web. To begin, we’ll investigate browser-based web scraping: using a browser directly with Python to scrape content from the Web.

Browser-Based Parsing

Sometimes a site uses a lot of JavaScript or other post-page-load code to populate its pages with content. In these cases, it’s almost impossible to analyze the site with a normal web scraper: what you’ll end up with is a nearly empty page. You’ll have the same problem if you want to interact with pages (e.g., if you need to click a button or enter some search text). In either situation, you’ll want to screen read the page. Screen readers work by opening the page in a browser, then reading and interacting with the page after it loads.
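
To make the "empty-looking page" problem concrete, here is a minimal sketch using only Python's standard library. The HTML string, the `results` element, and the record count are hypothetical stand-ins for a real site; the parser imitates what a simple scraper sees when it reads the raw response without running any JavaScript.

```python
from html.parser import HTMLParser

# Hypothetical page: the server sends an empty container, and a script
# fills it in only after the page loads in a browser.
RAW_HTML = """
<html>
  <body>
    <div id="results"></div>
    <script>
      document.getElementById("results").textContent = "42 records found";
    </script>
  </body>
</html>
"""


class VisibleTextParser(HTMLParser):
    """Collect text outside <script> and <style> tags, as a simple scraper would."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # greater than 0 while inside script/style
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text that a reader would actually see.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    """Return the human-visible text found in the raw HTML."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

Calling `visible_text(RAW_HTML)` returns an empty string: the text exists only as an instruction inside the script, so a scraper that never executes the script never sees it. A browser would run the script and show the text, which is the gap that screen reading fills.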

Tip

Screen readers are great for tasks performed by walking through a series of actions to get information. For this very reason, screen reader scripts are also an easy way to automate routine web tasks.

The most commonly used screen reading library in Python is Selenium. Selenium is a Java program used to open ...
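
As a rough sketch of what a browser-driven read can look like with Selenium's Python bindings (not necessarily the approach the chapter goes on to use): the helper below relies only on `get`, `title`, and `page_source`, which Selenium WebDriver objects provide. The function names, the example URL, and the choice of Firefox are assumptions made for illustration.

```python
def read_rendered_page(driver, url):
    """Load url in the browser behind driver and return (title, html)."""
    # get() returns once the browser reports the page as loaded; content that
    # scripts add later may still need an explicit wait.
    driver.get(url)
    return driver.title, driver.page_source


def main():
    # Imported here so the helper above can be exercised without Selenium installed.
    from selenium import webdriver

    # Assumption: Firefox and a matching driver are available on this machine.
    driver = webdriver.Firefox()
    try:
        # Hypothetical target URL.
        title, html = read_rendered_page(driver, "http://example.com")
        print(title, len(html))
    finally:
        driver.quit()  # always close the browser, even if the read fails
```

Running `main()` opens a real browser window, so it needs Selenium and a browser installed. Passing the driver in as a parameter keeps the reading logic separate from browser setup, which makes it easy to swap in a different browser.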


Publisher Resources

ISBN: 9781491948804
