Feature Engineering for Machine Learning

by Alice Zheng, Amanda Casari

April 2018

Beginner to intermediate

215 pages

5h 36m

English

O'Reilly Media, Inc.

Preface
Introduction; Conventions Used in This Book; Using Code Examples; O’Reilly Safari; How to Contact Us; Acknowledgments; Special Thanks from Alice; Special Thanks from Amanda
1. The Machine Learning Pipeline
Data; Tasks; Models; Features; Model Evaluation
2. Fancy Tricks with Simple Numbers
Scalars, Vectors, and Spaces; Dealing with Counts; Binarization; Quantization or Binning; Log Transformation; Log Transform in Action; Power Transforms: Generalization of the Log Transform; Feature Scaling or Normalization; Min-Max Scaling; Standardization (Variance Scaling); ℓ2 Normalization; Interaction Features; Feature Selection; Summary; Bibliography
3. Text Data: Flattening, Filtering, and Chunking
Bag-of-X: Turning Natural Text into Flat Vectors; Bag-of-Words; Bag-of-n-Grams; Filtering for Cleaner Features; Stopwords; Frequency-Based Filtering; Stemming; Atoms of Meaning: From Words to n-Grams to Phrases; Parsing and Tokenization; Collocation Extraction for Phrase Detection; Summary; Bibliography
4. The Effects of Feature Scaling: From Bag-of-Words to Tf-Idf
Tf-Idf: A Simple Twist on Bag-of-Words; Putting It to the Test; Creating a Classification Dataset; Scaling Bag-of-Words with Tf-Idf Transformation; Classification with Logistic Regression; Tuning Logistic Regression with Regularization; Deep Dive: What Is Happening?; Summary; Bibliography
5. Categorical Variables: Counting Eggs in the Age of Robotic Chickens
Encoding Categorical Variables; One-Hot Encoding; Dummy Coding; Effect Coding; Pros and Cons of Categorical Variable Encodings; Dealing with Large Categorical Variables; Feature Hashing; Bin Counting; Summary; Bibliography
6. Dimensionality Reduction: Squashing the Data Pancake with PCA
Intuition; Derivation; Linear Projection; Variance and Empirical Variance; Principal Components: First Formulation; Principal Components: Matrix-Vector Formulation; General Solution of the Principal Components; Transforming Features; Implementing PCA; PCA in Action; Whitening and ZCA; Considerations and Limitations of PCA; Use Cases; Summary; Bibliography
7. Nonlinear Featurization via K-Means Model Stacking
k-Means Clustering; Clustering as Surface Tiling; k-Means Featurization for Classification; Alternative Dense Featurization; Pros, Cons, and Gotchas; Summary; Bibliography
8. Automating the Featurizer: Image Feature Extraction and Deep Learning
The Simplest Image Features (and Why They Don’t Work); Manual Feature Extraction: SIFT and HOG; Image Gradients; Gradient Orientation Histograms; SIFT Architecture; Learning Image Features with Deep Neural Networks; Fully Connected Layers; Convolutional Layers; Rectified Linear Unit (ReLU) Transformation; Response Normalization Layers; Pooling Layers; Structure of AlexNet; Summary; Bibliography
9. Back to the Feature: Building an Academic Paper Recommender
Item-Based Collaborative Filtering; First Pass: Data Import, Cleaning, and Feature Parsing; Academic Paper Recommender: Naive Approach; Second Pass: More Engineering and a Smarter Model; Academic Paper Recommender: Take 2; Third Pass: More Features = More Information; Academic Paper Recommender: Take 3; Summary; Bibliography

A. Linear Modeling and Linear Algebra Basics
Overview of Linear Classification; The Anatomy of a Matrix; From Vectors to Subspaces; Singular Value Decomposition (SVD); The Four Fundamental Subspaces of the Data Matrix; Solving a Linear System; Bibliography
Index

Content preview from Feature Engineering for Machine Learning

Chapter 8. Automating the Featurizer: Image Feature Extraction and Deep Learning

Sight and sound are innate sensory inputs for humans. Our brains are hardwired to rapidly evolve our abilities to process visual and auditory signals, with some systems developing to respond to stimuli even before birth (Eliot, 2000). Language skills, on the other hand, are learned. They take months to develop and years to master. Many people take the development of their vision and hearing for granted, but all of us have had to intentionally train our brains to understand and use language.

Interestingly, the situation is the reverse for machine learning. We have made much more headway with text analysis applications than with image or audio applications. Take the problem of search, for example. People have enjoyed years of relative success in information retrieval and text search, whereas image and audio search are still being perfected (though the breakthrough in deep learning models in the last five years may finally herald the long-awaited revolution in image and speech analysis).

The difficulty of progress is directly related to the difficulty of extracting meaningful features from the respective types of data. Machine learning models require semantically meaningful features to make semantically meaningful predictions. In text analysis, particularly for languages such as English where a basic unit of semantic meaning (a word) is easily extractable, progress can be made very fast. Images and audio, on the other ...

Publisher Resources

ISBN: 9781491953235
