Practical Data Analysis Cookbook

Name: Practical Data Analysis Cookbook
Author: Tomasz Drabas
ISBN: 9781783551668

April 2016

Beginner to intermediate

384 pages

8h 36m

English

Packt Publishing

Table of Contents
Practical Data Analysis Cookbook
Credits
About the Author
Acknowledgments
About the Reviewers
www.PacktPub.com
Support files, eBooks, discount offers, and more; Why Subscribe?; Free Access for Packt account holders
Preface
What this book covers
What you need for this book

Who this book is for
Sections
Getting ready; How to do it…; How it works…; There's more…; See also
Conventions
Reader feedback
Customer support
Downloading the example code; Downloading the color images of this book; Errata; Piracy; Questions
1. Preparing the Data
Introduction
Reading and writing CSV/TSV files with Python
Getting ready; How to do it…; How it works…; There's more…; See also
Reading and writing JSON files with Python
Getting ready; How to do it…; How it works…; There's more…; See also
Reading and writing Excel files with Python
Getting ready; How to do it…; How it works…; There's more…; See also
Reading and writing XML files with Python
Getting ready; How to do it…; How it works…
Retrieving HTML pages with pandas
Getting ready; How to do it…; How it works…
Storing and retrieving from a relational database
Getting ready; How to do it…; How it works…; There's more…; See also
Storing and retrieving from MongoDB
Getting ready; How to do it…; How it works…; See also
Opening and transforming data with OpenRefine
Getting ready; How to do it…; See also
Exploring the data with Open Refine
Getting ready; How to do it…
Removing duplicates
Getting ready; How to do it…
Using regular expressions and GREL to clean up data
Getting ready; How to do it…; See also
Imputing missing observations
Getting ready; How to do it…; How it works…; There's more…
Normalizing and standardizing the features
Getting ready; How to do it…; How it works…
Binning the observations
Getting ready; How to do it…; How it works…; There's more…
Encoding categorical variables
Getting ready; How to do it…; How it works…
2. Exploring the Data
Introduction
Producing descriptive statistics
Getting ready; How to do it…; How it works…; There's more…; See also…
Exploring correlations between features
Getting ready; How to do it…; How it works…; See also…
Visualizing the interactions between features
Getting ready; How to do it…; How it works…; See also…
Producing histograms
Getting ready; How to do it…; How it works…; There's more…; See also…
Creating multivariate charts
Getting ready; How to do it…; How it works…; See also…
Sampling the data
Getting ready; How to do it…; How it works…; There's more…
Splitting the dataset into training, cross-validation, and testing
Getting ready; How to do it…; How it works…; There's more…
3. Classification Techniques
Introduction
Testing and comparing the models
Getting ready; How to do it…; How it works…; There's more…; See also
Classifying with Naïve Bayes
Getting ready; How to do it…; How it works…; See also
Using logistic regression as a universal classifier
Getting ready; How to do it…; How it works…; There's more…; See also
Utilizing Support Vector Machines as a classification engine
Getting ready; How to do it…; How it works…; There's more…
Classifying calls with decision trees
Getting ready; How to do it…; How it works…; There's more…
Predicting subscribers with random tree forests
Getting ready; How to do it…; How it works…; There's more…
Employing neural networks to classify calls
Getting ready; How to do it…; How it works…; There's more…; See also
4. Clustering Techniques
Introduction
Assessing the performance of a clustering method
Getting ready; How to do it…; How it works…; See also…
Clustering data with k-means algorithm
Getting ready; How to do it…; How it works…; There's more…; See also…
Finding an optimal number of clusters for k-means
Getting ready; How to do it…; How it works…; There's more…
Discovering clusters with mean shift clustering model
Getting ready; How to do it…; How it works…; See also…
Building fuzzy clustering model with c-means
Getting ready; How to do it…; How it works…
Using hierarchical model to cluster your data
Getting ready; How to do it…; How it works…; There's more…; See also…
Finding groups of potential subscribers with DBSCAN and BIRCH algorithms
Getting ready; How to do it…; How it works…; See also…
5. Reducing Dimensions
Introduction
Creating three-dimensional scatter plots to present principal components
Getting ready; How to do it…; How it works…
Reducing the dimensions using the kernel version of PCA
Getting ready; How to do it…; How it works…; There's more…; See also
Using Principal Component Analysis to find things that matter
Getting ready; How to do it…; How it works…; There's more…; See also
Finding the principal components in your data using randomized PCA
Getting ready; How to do it…; How it works…; There's more…
Extracting the useful dimensions using Linear Discriminant Analysis
Getting ready; How to do it…; How it works…
Using various dimension reduction techniques to classify calls using the k-Nearest Neighbors classification model
Getting ready; How to do it…; How it works…
6. Regression Methods
Introduction
Identifying and tackling multicollinearity
Getting ready; How to do it…; How it works…; There's more…
Building Linear Regression model
Getting ready; How to do it…; How it works…; There's more…
Using OLS to forecast how much electricity can be produced
Getting ready; How to do it…; How it works…; There's more…; See also
Estimating the output of an electric plant using CART
Getting ready; How to do it…; How it works…; There's more…; See also
Employing the kNN model in a regression problem
Getting ready; How to do it…; How it works…
Applying the Random Forest model to a regression analysis
Getting ready; How to do it…; How it works…
Gauging the amount of electricity a plant can produce using SVMs
Getting ready; How to do it…; How it works…; There's more…; See also
Training a Neural Network to predict the output of a power plant
Getting ready; How to do it…; How it works…; See also
7. Time Series Techniques
Introduction
Handling date objects in Python
Getting ready; How to do it…; How it works…; There's more…
Understanding time series data
Getting ready; How to do it…; How it works…; There's more…
Smoothing and transforming the observations
Getting ready; How to do it…; How it works…; There's more…
Filtering the time series data
Getting ready; How to do it…; How it works…; There's more…
Removing trend and seasonality
Getting ready; How to do it…; How it works…; There's more…
Forecasting the future with ARMA and ARIMA models
Getting ready; How to do it…; How it works…; See also
8. Graphs
Introduction
Handling graph objects in Python with NetworkX
Getting ready; How to do it…; How it works…; There's more…; See also
Using Gephi to visualize graphs
Getting ready; How to do it…; There's more…; See also
Identifying people whose credit card details were stolen
Getting ready; How to do it…; How it works…; There's more…
Identifying those responsible for stealing the credit cards
Getting ready; How to do it…; How it works…; See also
9. Natural Language Processing
Introduction
Reading raw text from the Web
Getting ready; How to do it…; How it works…
Tokenizing and normalizing text
Getting ready; How to do it…; How it works…; See also
Identifying parts of speech, handling n-grams, and recognizing named entities
Getting ready; How to do it…; How it works…; There's more…
Identifying the topic of an article
Getting ready; How to do it…; How it works…
Identifying the sentence structure
Getting ready; How to do it…; How it works…; See also
Classifying movies based on their reviews
Getting ready; How to do it…; How it works…
10. Discrete Choice Models
Introduction
Preparing a dataset to estimate discrete choice models
Getting ready; How to do it…; How it works…; There's more…
Estimating the well-known Multinomial Logit model
Getting ready; How to do it…; How it works…; See also
Testing for violations of the Independence from Irrelevant Alternatives
Getting ready; How to do it…; How it works…; There's more…
Handling IIA violations with the Nested Logit model
Getting ready; How to do it…; How it works…
Managing sophisticated substitution patterns with the Mixed Logit model
Getting ready; How to do it…; How it works…
11. Simulations
Introduction
Using SimPy to simulate the refueling process of a gas station
Getting ready; How to do it…; How it works…; There's more…
Simulating out-of-energy occurrences for an electric car
Getting ready; How to do it…; How it works…
Determining if a population of sheep is in danger of extinction due to a wolf pack
Getting ready; How to do it…; How it works…
Index

Content preview from Practical Data Analysis Cookbook

Clustering data with k-means algorithm

The k-means clustering algorithm is likely the most widely known data mining technique for clustering vectorized data. It partitions the observations into a fixed number of discrete clusters based on their similarity: each observation is assigned to the cluster whose centroid is nearest to it in terms of Euclidean distance.
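The assignment rule described above can be sketched in a few lines of NumPy. This is an illustrative sketch only, not the book's code: the function names, the toy observations, and the starting centroids are all made up for the example, and the update step assumes every cluster keeps at least one member.

```python
import numpy as np


def assign_to_nearest_centroid(observations, centroids):
    """Return, for each observation, the index of the nearest centroid."""
    # Pairwise Euclidean distances, shape (n_observations, n_centroids)
    distances = np.linalg.norm(
        observations[:, np.newaxis, :] - centroids[np.newaxis, :, :],
        axis=2,
    )
    return distances.argmin(axis=1)


def update_centroids(observations, labels, n_clusters):
    """Move each centroid to the mean of the observations assigned to it.

    Assumption: every cluster has at least one member.
    """
    return np.array(
        [observations[labels == k].mean(axis=0) for k in range(n_clusters)]
    )


# Hypothetical toy data: two obvious groups in two dimensions
observations = np.array(
    [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
)
centroids = np.array([[1.0, 1.0], [9.0, 9.0]])

labels = assign_to_nearest_centroid(observations, centroids)
new_centroids = update_centroids(observations, labels, n_clusters=2)
```

A full k-means run repeats these two steps until the assignments stop changing.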

Getting ready

To run this recipe, you need pandas and scikit-learn. No other prerequisites are required.

How to do it…

scikit-learn offers several clustering models in its cluster submodule. Here, we will use .KMeans(...) to estimate our clustering model (the clustering_kmeans.py file):

def findClusters_kmeans(data):
    '''
        Cluster data using k-means
    '''
    # create the classifier object
    kmeans = cl.KMeans( ...
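The preview cuts off before the arguments passed to KMeans are shown, so the book's actual settings are unknown. The following is an independent minimal sketch of fitting scikit-learn's KMeans to a pandas DataFrame. Everything specific to it is an assumption: the function name, the toy data, the `cl` import alias (guessed from the excerpt), and the choices of `n_clusters`, `n_init`, and `random_state`.

```python
import pandas as pd
import sklearn.cluster as cl  # alias assumed from the excerpt's `cl.KMeans`


def find_clusters_kmeans_sketch(data, n_clusters=2):
    """Fit a k-means model; the parameter values are illustrative only."""
    kmeans = cl.KMeans(
        n_clusters=n_clusters,  # number of clusters to look for (assumed)
        n_init=10,              # restarts with different initial centroids
        random_state=0,         # fixed seed so the run is repeatable
    )
    return kmeans.fit(data)


# Hypothetical toy data: two well-separated groups
data = pd.DataFrame({
    'x': [0.0, 0.2, 0.1, 5.0, 5.1, 4.9],
    'y': [0.1, 0.0, 0.2, 5.0, 4.8, 5.2],
})

model = find_clusters_kmeans_sketch(data)
labels = model.labels_             # cluster index for each row
centers = model.cluster_centers_   # one centroid per cluster
```

The fitted model's `labels_` gives the cluster index of each row and `cluster_centers_` holds the centroids. Which group receives index 0 or 1 is arbitrary.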
