Book description
What is bad data? Some people consider it a technical phenomenon, like missing values or malformed records, but bad data includes a lot more. In this handbook, data expert Q. Ethan McCallum has gathered 19 colleagues from every corner of the data arena to reveal how they’ve recovered from nasty data problems.
From cranky storage to poor representation to misguided policy, there are many paths to bad data. Bottom line? Bad data is data that gets in the way. This book explains effective ways to get around it.
Among the many topics covered, you’ll discover how to:
- Test drive your data to see if it’s ready for analysis
- Work spreadsheet data into a usable form
- Handle encoding problems that lurk in text data
- Develop a successful web-scraping effort
- Use NLP tools to reveal the real sentiment of online reviews
- Address cloud computing issues that can impact your analysis effort
- Avoid policies that create data analysis roadblocks
- Take a systematic approach to data quality analysis
Table of contents
- Bad Data Handbook
- SPECIAL OFFER: Upgrade this ebook with O’Reilly
- About the Authors
- Preface
- 1. Setting the Pace: What Is Bad Data?
- 2. Is It Just Me, or Does This Data Smell Funny?
- 3. Data Intended for Human Consumption, Not Machine Consumption
- 4. Bad Data Lurking in Plain Text
- 5. (Re)Organizing the Web’s Data
- 6. Detecting Liars and the Confused in Contradictory Online Reviews
- 7. Will the Bad Data Please Stand Up?
- 8. Blood, Sweat, and Urine
- 9. When Data and Reality Don’t Match
- 10. Subtle Sources of Bias and Error
- 11. Don’t Let the Perfect Be the Enemy of the Good: Is Bad Data Really Bad?
- 12. When Databases Attack: A Guide for When to Stick to Files
- 13. Crouching Table, Hidden Network
- 14. Myths of Cloud Computing
  - Introduction to the Cloud
  - What Is “The Cloud”?
  - The Cloud and Big Data
  - Introducing Fred
  - At First Everything Is Great
  - They Put 100% of Their Infrastructure in the Cloud
  - As Things Grow, They Scale Easily at First
  - Then Things Start Having Trouble
  - They Need to Improve Performance
  - Higher IO Becomes Critical
  - A Major Regional Outage Causes Massive Downtime
  - Higher IO Comes with a Cost
  - Data Sizes Increase
  - Geo Redundancy Becomes a Priority
  - Horizontal Scale Isn’t as Easy as They Hoped
  - Costs Increase Dramatically
  - Fred’s Follies
  - Myth 1: Cloud Is a Great Solution for All Infrastructure Components
  - Myth 2: Cloud Will Save Us Money
  - Myth 3: Cloud IO Performance Can Be Improved to Acceptable Levels Through Software RAID
  - Myth 4: Cloud Computing Makes Horizontal Scaling Easy
  - Conclusion and Recommendations
- 15. The Dark Side of Data Science
- 16. How to Feed and Care for Your Machine-Learning Experts
- 17. Data Traceability
- 18. Social Media: Erasable Ink?
- 19. Data Quality Analysis Demystified: Knowing When Your Data Is Good Enough
- Index
- About the Author
- Colophon
- SPECIAL OFFER: Upgrade this ebook with O’Reilly
- Copyright
Product information
- Title: Bad Data Handbook
- Author(s): Q. Ethan McCallum
- Release date: November 2012
- Publisher(s): O'Reilly Media, Inc.
- ISBN: 9781449324971