
Data Analysis with Python

Name: Data Analysis with Python
Author: David Taieb
ISBN: 9781789950069

by David Taieb

December 2018

Beginner to intermediate

490 pages

10h 38m

English

Packt Publishing


Data Analysis with Python
Table of Contents
Data Analysis with Python
Why subscribe?
PacktPub.com
Contributors
About the author
About the reviewers
Packt is searching for authors like you
Preface
Why am I writing this book?
Who this book is for
What this book covers

To get the most out of this book
Download the example code files
Download the color images
Conventions used
Get in touch
Reviews
1. Programming and Data Science – A New Toolset
What is data science
Is data science here to stay?
Why is data science on the rise?
What does that have to do with developers?
Putting these concepts into practice
Deep diving into a concrete example
Data pipeline blueprint
What kind of skills are required to become a data scientist?
IBM Watson DeepQA
Back to our sentiment analysis of Twitter hashtags project
Lessons learned from building our first enterprise-ready data pipeline
Data science strategy
Jupyter Notebooks at the center of our strategy
Why are Notebooks so popular?
Summary
2. Python and Jupyter Notebooks to Power your Data Analysis
Why choose Python?
Introducing PixieDust
SampleData – a simple API for loading data
Wrangling data with pixiedust_rosie
Display – a simple interactive API for data visualization
Filtering
Bridging the gap between developers and data scientists with PixieApps
Architecture for operationalizing data science analytics
Summary
3. Accelerate your Data Analysis with Python Libraries
Anatomy of a PixieApp
Routes
Generating requests to routes
A GitHub project tracking sample application
Displaying the search results in a table
Invoking the PixieDust display() API using pd_entity attribute
Invoking arbitrary Python code with pd_script
Making the application more responsive with pd_refresh
Creating reusable widgets
Summary
4. Publish your Data Analysis to the Web - the PixieApp Tool
Overview of Kubernetes
Installing and configuring the PixieGateway server
PixieGateway server configuration
PixieGateway architecture
Publishing an application
Encoding state in the PixieApp URL
Sharing charts by publishing them as web pages
PixieGateway admin console
Python Console
Displaying warmup and run code for a PixieApp
Summary
5. Python and PixieDust Best Practices and Advanced Concepts
Use @captureOutput decorator to integrate the output of third-party Python libraries
Create a word cloud image with @captureOutput
Increase modularity and code reuse
Creating a widget with pd_widget
PixieDust support of streaming data
Adding streaming capabilities to your PixieApp
Adding dashboard drill-downs with PixieApp events
Extending PixieDust visualizations
Debugging
Debugging on the Jupyter Notebook using pdb
Visual debugging with PixieDebugger
Debugging PixieApp routes with PixieDebugger
Troubleshooting issues using PixieDust logging
Client-side debugging
Run Node.js inside a Python Notebook
Summary
6. Analytics Study: AI and Image Recognition with TensorFlow
What is machine learning?
What is deep learning?
Getting started with TensorFlow
Simple classification with DNNClassifier
Image recognition sample application
Part 1 – Load the pretrained MobileNet model
Part 2 – Create a PixieApp for our image recognition sample application
Part 3 – Integrate the TensorBoard graph visualization
Part 4 – Retrain the model with custom training data
Summary
7. Analytics Study: NLP and Big Data with Twitter Sentiment Analysis
Getting started with Apache Spark
Apache Spark architecture
Configuring Notebooks to work with Spark
Twitter sentiment analysis application
Part 1 – Acquiring the data with Spark Structured Streaming
Architecture diagram for the data pipeline
Authentication with Twitter
Creating the Twitter stream
Creating a Spark Streaming DataFrame
Creating and running a structured query
Monitoring active streaming queries
Creating a batch DataFrame from the Parquet files
Part 2 – Enriching the data with sentiment and most relevant extracted entity
Getting started with the IBM Watson Natural Language Understanding service
Part 3 – Creating a real-time dashboard PixieApp
Refactoring the analytics into their own methods
Creating the PixieApp
Part 4 – Adding scalability with Apache Kafka and IBM Streams Designer
Streaming the raw tweets to Kafka
Enriching the tweets data with the Streaming Analytics service
Creating a Spark Streaming DataFrame with a Kafka input source
Summary
8. Analytics Study: Prediction - Financial Time Series Analysis and Forecasting
Getting started with NumPy
Creating a NumPy array
Operations on ndarray
Selections on NumPy arrays
Broadcasting
Statistical exploration of time series
Hypothetical investment
Autocorrelation function (ACF) and partial autocorrelation function (PACF)
Putting it all together with the StockExplorer PixieApp
BaseSubApp – base class for all the child PixieApps
StockExploreSubApp – first child PixieApp
MovingAverageSubApp – second child PixieApp
AutoCorrelationSubApp – third child PixieApp
Time series forecasting using the ARIMA model
Build an ARIMA model for the MSFT stock time series
StockExplorer PixieApp Part 2 – add time series forecasting using the ARIMA model
Summary
9. Analytics Study: Graph Algorithms - US Domestic Flight Data Analysis
Introduction to graphs
Graph representations
Graph algorithms
Graph and big data
Getting started with the networkx graph library
Creating a graph
Visualizing a graph
Part 1 – Loading the US domestic flight data into a graph
Graph centrality
Part 2 – Creating the USFlightsAnalysis PixieApp
Part 3 – Adding data exploration to the USFlightsAnalysis PixieApp
Part 4 – Creating an ARIMA model for predicting flight delays
Summary
10. The Future of Data Analysis and Where to Develop your Skills
Forward thinking – what to expect for AI and data science
References
A. PixieApp Quick-Reference
Annotations
Custom HTML attributes
Methods
Other Books You May Enjoy
Leave a review – let other readers know what you think
Index

Content preview from Data Analysis with Python

Summary

In this chapter, we built a data pipeline that analyzes large quantities of streaming data containing unstructured text and applies NLP algorithms from external cloud services to extract sentiment and other important entities found in the text. We then built a PixieApp dashboard that displays live metrics with insights extracted from the tweets. We also discussed various techniques for analyzing data at scale, including Apache Spark Structured Streaming, Apache Kafka, and IBM Streaming Analytics. As always, the goal of these sample applications is to show the art of the possible in building data pipelines, with a special focus on leveraging existing frameworks, libraries, and cloud services.

In the next chapter, we'll discuss ...



