Elegant SciPy

by Juan Nunez-Iglesias, Stéfan van der Walt, Harriet Dashnow

August 2017

Intermediate to advanced

280 pages

6h 19m

English

O'Reilly Media, Inc.

Preface
Who Is This Book For?; Why SciPy?; What Is the SciPy Ecosystem?; The Great Cataclysm: Python 2 Versus Python 3; SciPy Ecosystem and Community; Free and Open Source Software (FOSS); GitHub: Taking Coding Social; Make Your Mark on the SciPy Ecosystem; A Touch of Whimsy with Your Py; Getting Help; Installing Python; Accessing the Book Materials; Diving In; Conventions Used in This Book; Use of Color; Using Code Examples; O’Reilly Safari; How to Contact Us; Acknowledgments
1. Elegant NumPy: The Foundation of Scientific Python
Introduction to the Data: What Is Gene Expression?; NumPy N-Dimensional Arrays; Why Use ndarrays Instead of Python Lists?; Vectorization; Broadcasting; Exploring a Gene Expression Dataset; Reading in the Data with pandas; Normalization; Between Samples; Between Genes; Normalizing Over Samples and Genes: RPKM; Taking Stock
2. Quantile Normalization with NumPy and SciPy
Getting the Data; Gene Expression Distribution Differences Between Individuals; Biclustering the Counts Data; Visualizing Clusters; Predicting Survival; Further Work: Using the TCGA’s Patient Clusters; Further Work: Reproducing the TCGA’s clusters
3. Networks of Image Regions with ndimage
Images Are Just NumPy Arrays; Exercise: Adding a Grid Overlay; Filters in Signal Processing; Filtering Images (2D Filters); Generic Filters: Arbitrary Functions of Neighborhood Values; Exercise: Conway’s Game of Life; Exercise: Sobel Gradient Magnitude; Graphs and the NetworkX library; Exercise: Curve Fitting with SciPy; Region Adjacency Graphs; Elegant ndimage: How to Build Graphs from Image Regions; Putting It All Together: Mean Color Segmentation
4. Frequency and the Fast Fourier Transform
Introducing Frequency; Illustration: A Birdsong Spectrogram; History; Implementation; Choosing the Length of the DFT; More DFT Concepts; Frequencies and Their Ordering; Windowing; Real-World Application: Analyzing Radar Data; Signal Properties in the Frequency Domain; Windowing, Applied; Radar Images; Further Applications of the FFT; Further Reading; Exercise: Image Convolution
5. Contingency Tables Using Sparse Coordinate Matrices
Contingency Tables; Exercise: Computational Complexity of Confusion Matrices; Exercise: Alternative Algorithm to Compute the Confusion Matrix; Exercise: Multiclass Confusion Matrix; scipy.sparse Data Formats; COO Format; Exercise: COO Representation; Compressed Sparse Row Format; Applications of Sparse Matrices: Image Transformations; Exercise: Image Rotation; Back to Contingency Tables; Exercise: Reducing the Memory Footprint; Contingency Tables in Segmentation; Information Theory in Brief; Exercise: Computing Conditional Entropy; Information Theory in Segmentation: Variation of Information; Converting NumPy Array Code to Use Sparse Matrices; Using Variation of Information; Further Work: Segmentation in Practice
6. Linear Algebra in SciPy
Linear Algebra Basics; Laplacian Matrix of a Graph; Exercise: Rotation Matrix; Laplacians with Brain Data; Exercise: Showing the Affinity View; Exercise Challenge: Linear Algebra with Sparse Matrices; PageRank: Linear Algebra for Reputation and Importance; Exercise: Dealing with Dangling Nodes; Exercise: Equivalence of Different Eigenvector Methods; Concluding Remarks
7. Function Optimization in SciPy
Optimization in SciPy: scipy.optimize; An Example: Computing Optimal Image Shift; Image Registration with Optimize; Avoiding Local Minima with Basin Hopping; Exercise: Modify the align Function; “What Is Best?”: Choosing the Right Objective Function
8. Big Data in Little Laptop with Toolz
Streaming with yield; Introducing the Toolz Streaming Library; k-mer Counting and Error Correction; Currying: The Spice of Streaming; Back to Counting k-mers; Exercise: PCA of Streaming Data; Markov Model from a Full Genome; Exercise: Online Unzip
Epilogue
Where to Next?; Mailing Lists; GitHub; Conferences; Beyond SciPy; Contributing to This Book; Until Next Time...

Appendix. Exercise Solutions
Solution: Adding a Grid Overlay; Solution: Conway’s Game of Life; Solution: Sobel Gradient Magnitude; Solution: Curve Fitting with SciPy; Solution: Image Convolution; Solution: Computational Complexity of Confusion Matrices; Solution: Alternative Confusion Matrix Computing; Solution: Computing the Confusion Matrix; Solution: COO Representation; Solution: Image Rotation; Solution: Reducing the Memory Footprint; Solution: Computing Conditional Entropy; Solution: Rotation Matrix; Solution: Showing the Affinity View; Challenge Accepted: Linear Algebra with Sparse Matrices; Solution: Dealing with Dangling Nodes; Solution: Verify Methods; Solution: Modify the align Function; Solution: scikit-learn Library; Solution: Add a Step to the Start of the Pipe
Index

Content preview from Elegant SciPy

Chapter 8. Big Data in Little Laptop with Toolz

GRACIE: A knife? The guy’s twelve feet tall!
JACK: Seven. Hey, don’t worry, I think I can handle him.

Jack Burton, Big Trouble in Little China

Streaming is not a SciPy feature per se, but rather an approach that allows us to efficiently process large datasets, like those often seen in science. The Python language contains some useful primitives for streaming data processing, and these can be combined with Matt Rocklin’s Toolz library to generate elegant, concise code that is extremely memory-efficient. In this chapter, we will show you how to apply these streaming concepts to enable you to handle much larger datasets than can fit in your computer’s RAM.

You have probably already done some streaming, perhaps without thinking about it in these terms. The simplest form is iterating through the lines in a file, processing each line without ever reading the entire file into memory. For example, the following loop calculates the mean of each row and sums those means:

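import numpy as np
# Iterating over the file object reads one line at a time,
# so the whole file is never held in memory.
with open('data/expr.tsv') as f:
    sum_of_means = 0
    for line in f:
        # Parse the tab-separated integers in this row and add the row mean.
        sum_of_means += np.mean(np.fromstring(line, dtype=int, sep='\t'))
print(sum_of_means)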

1463.0

This strategy works really well for cases where your problem can be neatly solved with by-row processing. But things can quickly get out of hand when your code becomes more sophisticated.
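One way to keep by-row processing manageable as it grows is to split it into small generator functions that each handle one step. The following is only a minimal sketch of that idea, not the book's own code: the function names (read_rows, row_means, sum_of_row_means) are hypothetical, plain-Python parsing stands in for np.fromstring so the example has no dependencies, and the demonstration file contains made-up numbers rather than the expression data used above.

```python
import os
import tempfile


def read_rows(path):
    """Yield one parsed row (a list of ints) at a time from a tab-separated file."""
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                yield [int(field) for field in line.split('\t')]


def row_means(rows):
    """Yield the mean of each row as it arrives from the upstream iterable."""
    for row in rows:
        yield sum(row) / len(row)


def sum_of_row_means(path):
    """Chain the generators so that only one row is in memory at a time."""
    return sum(row_means(read_rows(path)))


# Demonstration on a throwaway file (made-up values, not the book's dataset).
with tempfile.NamedTemporaryFile('w', suffix='.tsv', delete=False) as tmp:
    tmp.write('1\t2\t3\n4\t5\t6\n')
demo_path = tmp.name

demo_total = sum_of_row_means(demo_path)
os.remove(demo_path)
print(demo_total)  # 7.0 (row means are 2.0 and 5.0)
```

Because each stage consumes and produces an iterator, a further step (filtering rows, say) can be added as another generator without changing the existing ones.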

In streaming programs, a function processes some of the input data, returns the processed chunk, then, while ...

Publisher Resources

ISBN: 9781491922927
