
Python: Data Analytics and Visualization

by Phuong Vo.T.H, Martin Czygan, Ashish Kumar, Kirthi Raman

March 2017

Beginner to intermediate

866 pages

18h 4m

English

Packt Publishing


What this learning path covers
Downloading the example code | Errata | Piracy | Questions

Data analysis and processing
NumPy | Pandas | Matplotlib | PyMongo | The scikit-learn library
NumPy arrays | Data types | Array creation | Indexing and slicing | Fancy indexing | Numerical operations on arrays
Loading and saving data | Saving an array | Loading an array
An overview of the Pandas package
Series | The DataFrame
Reindexing and altering labels | Head and tail | Binary operations | Functional statistics | Function application | Sorting
Hierarchical indexing | The Panel data
The matplotlib API primer | Line properties | Figures and subplots
Scatter plots | Bar plots | Contour plots | Histogram plots
Bokeh | MayaVi
Time series primer
Interacting with data in text format | Reading data from text format | Writing data to text format
HDF5
The simple value | List | Set | Ordered set
Data munging | Cleaning data | Filtering | Merging data | Reshaping data
An overview of machine learning models
Introducing predictive modelling | Scope of predictive modelling | Ensemble of statistical algorithms | Statistical tools | Historical data | Mathematical function | Business context | Knowledge matrix for predictive modelling | Task matrix for predictive modelling
LinkedIn's "People also viewed" feature | What it does? | How is it done? | Correct targeting of online ads | How is it done? | Santa Cruz predictive policing | How is it done? | Determining the activity of a smartphone user using accelerometer data | How is it done? | Sport and fantasy leagues | How was it done?
Anaconda | Standalone Python | Installing a Python package | Installing pip | Installing Python packages with pip
Reading the data – variations and examples | Data frames | Delimiters
Case 1 – reading a dataset using the read_csv method
Passing the directory address and filename as variables | Reading a .txt dataset with a comma delimiter | Specifying the column names of a dataset from a list
Reading a dataset line by line | Changing the delimiter of a dataset
Reading from an .xls or .xlsx file | Writing to a CSV or Excel file
Checking for missing values | What constitutes missing data? | How missing values are generated and propagated | Treating missing values | Deletion | Imputation
Scatter plots | Histograms | Boxplots
Subsetting a dataset | Selecting columns | Selecting rows | Selecting a combination of rows and columns | Creating new columns
Various methods for generating random numbers | Seeding a random number | Generating random numbers following probability distributions | Probability density function | Cumulative density function | Uniform distribution | Normal distribution | Using the Monte-Carlo simulation to find the value of pi | Geometry and mathematics behind the calculation of pi | Generating a dummy data frame
Aggregation | Filtering | Transformation | Miscellaneous operations
Method 1 – using the Customer Churn Model | Method 2 – using sklearn | Method 3 – using the shuffle function
Inner Join | Left Join | Right Join | An example of the Inner Join | An example of the Left Join | An example of the Right Join | Summary of Joins in terms of their length
Random sampling and the central limit theorem
Null versus alternate hypothesis | Z-statistic and t-statistic | Confidence intervals, significance levels, and p-values | Different kinds of hypothesis test | A step-by-step guide to do a hypothesis test | An example of a hypothesis test
Understanding the maths behind linear regression | Linear regression using simulated data | Fitting a linear regression model and checking its efficacy | Finding the optimum value of variable coefficients
p-values | F-statistics | Residual Standard Error
Linear regression using the statsmodel library | Multiple linear regression | Multi-collinearity | Variance Inflation Factor
Training and testing data split | Summary of models | Linear regression with scikit-learn | Feature selection with scikit-learn
Handling categorical variables | Transforming a variable to fit non-linear relations | Handling outliers | Other considerations and assumptions for linear regression
Linear regression versus logistic regression
Contingency tables | Conditional probability | Odds ratio | Moving on to logistic regression from linear regression | Estimation using the Maximum Likelihood Method | Likelihood function | Log likelihood function | Building the logistic regression model from scratch | Making sense of logistic regression parameters | Wald test | Likelihood Ratio Test statistic | Chi-square test
Processing the data | Data exploration | Data visualization | Creating dummy variables for categorical variables | Feature selection | Implementing the model
Cross validation
The ROC curve | Confusion matrix
Introduction to clustering – what, why, and how? | What is clustering? | How is clustering used? | Why do we do clustering?
Distances between two observations | Euclidean distance | Manhattan distance | Minkowski distance | The distance matrix | Normalizing the distances | Linkage methods | Single linkage | Complete linkage | Average linkage | Centroid linkage | Ward's method | Hierarchical clustering | K-means clustering
Importing and exploring the dataset | Normalizing the values in the dataset | Hierarchical clustering using scikit-learn | K-Means clustering using scikit-learn | Interpreting the cluster
The elbow method | Silhouette Coefficient
Introducing decision trees | A decision tree
Homogeneity | Entropy | Information gain | ID3 algorithm to create a decision tree | Gini index | Reduction in Variance | Pruning a tree | Handling a continuous numerical variable | Handling a missing value of an attribute
Visualizing the tree | Cross-validating and pruning the decision tree
Regression tree algorithm | Implementing a regression tree using Python
The random forest algorithm | Implementing a random forest using Python | Why do random forests work? | Important parameters for random forests
Best practices for coding | Commenting the codes | Defining functions for substantial individual tasks | Example 1 | Example 2 | Example 3 | Avoid hard-coding of variables as much as possible | Version control | Using standard libraries, methods, and formulas
Data, information, knowledge, and insight | Data | Information | Knowledge | Data analysis and insight
Transforming data into information | Data collection | Data preprocessing | Data processing | Organizing data | Getting datasets | Transforming information into knowledge | Transforming knowledge into insight
Visualization before computers | Minard's Russian campaign (1812) | The Cholera epidemics in London (1831-1855) | Statistical graphics (1850-1915) | Later developments in data visualization
Where does visualization fit in? | Data visualization today | What is a good visualization?
Bar graphs and pie charts | Bar graphs | Pie charts | Box plots | Scatter plots and bubble charts | Scatter plots | Bubble charts | KDE plots
Why does visualization require planning?
Visually representing the results
Why are stories so important? | Reader-driven narratives | Gapminder | The State of the Union address | Mortality rate in the USA | A few other example narratives | Author-driven narratives
The Gestalt principles of perception
Comparison and ranking | Correlation | Distribution | Location-specific or geodata | Part-to-whole relationships | Trends over time
Development tools | Canopy from Enthought | Anaconda from Continuum Analytics
Event listeners | Layouts | Circular layout | Radial layout | Balloon layout
The IDE tools in Python | Python 3.x versus Python 2.7 | Types of interactive tools | IPython | Plotly | Types of Python IDE | PyCharm | PyDev | Interactive Editor for Python (IEP) | Canopy from Enthought | Anaconda from Continuum Analytics | An overview of Spyder | An overview of conda
The surface-3D plot | The square map plot
Bokeh | VisPy
NumPy, SciPy, and MKL functions | NumPy | NumPy universal functions | Shape and reshape manipulation | An example of interpolation | Vectorizing functions | Summary of NumPy linear algebra | SciPy | An example of linear equations | The vectorized numerical derivative | MKL functions | The performance of Python
Slice using flat
Numerical indexing | Logical indexing
Stacks | Tuples | Sets | Queues | Dictionaries | Dictionaries for matrix representation | Sparse matrices | Visualizing sparseness | Dictionaries for memoization | Tries
Word clouds | Installing word clouds | Input for word clouds | Web feeds | The Twitter text | Plotting the stock price chart | Obtaining data
The deterministic modelGross returns
Monte Carlo simulation | What exactly is Monte Carlo simulation? | An inventory problem in Monte Carlo simulation | Monte Carlo simulation in basketball | The volatility plot | Implied volatilities | The portfolio valuation | The simulation model | Geometric Brownian simulation | The diffusion-based simulation
Schelling's Segregation Model
K-nearest neighbors | Generalized linear models | Bayesian linear regression
Classification methods
An example
Installing TextBlob | Downloading corpora | The Naïve Bayes classifier using TextBlob
Installing scikit-learn
Directed graphs and multigraphs | Storing graph data | Displaying graphs | igraph | NetworkX | Graph-tool | PageRank
Computer simulation | Python's random package | SciPy's random functions | Simulation examples | Signal processing | Animation | Visualization methods using HTML5 | How is Julia different from Python? | D3.js for visualization | Dashboards
An overview of conda

Content preview from Python: Data Analytics and Visualization

Model validation and evaluation

The preceding logistic regression model was built on the entire dataset. Let us now split the data into training and testing sets, build the model on the training set, and check its accuracy on the testing set. The goal is to see whether the prediction accuracy changes when it is measured on data the model has not seen:

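# sklearn.cross_validation was removed in scikit-learn 0.20; use sklearn.model_selection
from sklearn.model_selection import train_test_split
# Hold out 30% of the rows for testing; random_state=0 makes the split reproducible
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=0)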

The preceding code snippet creates training and testing datasets for both the predictor and the outcome variables. Let us now build a logistic regression model on the training set:

from sklearn import linear_model
from sklearn import metrics
clf1 = linear_model.LogisticRegression()
...
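Since the preview is cut off here, the following is only a minimal sketch of the split-fit-evaluate workflow described above, not the book's actual code. It assumes a recent scikit-learn release and uses a synthetic dataset from `make_classification` as a stand-in for the book's `X` and `Y`. The names `clf`, `train_accuracy`, and `test_accuracy` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Assumption: synthetic binary-classification data standing in for the book's X and Y
X, Y = make_classification(n_samples=500, n_features=8, random_state=0)

# Same split settings as in the text: 30% of the rows are held out for testing
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.3, random_state=0
)

# Fit the model on the training rows only
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, Y_train)

# Accuracy on the rows the model was fitted on, and on the held-out rows
train_accuracy = clf.score(X_train, Y_train)
test_accuracy = accuracy_score(Y_test, clf.predict(X_test))

print("training accuracy:", train_accuracy)
print("testing accuracy:", test_accuracy)
```

Comparing the two numbers gives a rough sense of how well the model generalizes: a testing accuracy far below the training accuracy suggests overfitting.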



Publisher Resources

ISBN: 9781788290098