Feature Engineering for Machine Learning

by Alice Zheng, Amanda Casari

April 2018

Beginner to intermediate

215 pages

5h 36m

English

O'Reilly Media, Inc.

Preface
Introduction; Conventions Used in This Book; Using Code Examples; O’Reilly Safari; How to Contact Us; Acknowledgments; Special Thanks from Alice; Special Thanks from Amanda
1. The Machine Learning Pipeline
Data; Tasks; Models; Features; Model Evaluation
2. Fancy Tricks with Simple Numbers
Scalars, Vectors, and Spaces; Dealing with Counts; Binarization; Quantization or Binning; Log Transformation; Log Transform in Action; Power Transforms: Generalization of the Log Transform; Feature Scaling or Normalization; Min-Max Scaling; Standardization (Variance Scaling); ℓ2 Normalization; Interaction Features; Feature Selection; Summary; Bibliography
3. Text Data: Flattening, Filtering, and Chunking
Bag-of-X: Turning Natural Text into Flat Vectors; Bag-of-Words; Bag-of-n-Grams; Filtering for Cleaner Features; Stopwords; Frequency-Based Filtering; Stemming; Atoms of Meaning: From Words to n-Grams to Phrases; Parsing and Tokenization; Collocation Extraction for Phrase Detection; Summary; Bibliography
4. The Effects of Feature Scaling: From Bag-of-Words to Tf-Idf
Tf-Idf: A Simple Twist on Bag-of-Words; Putting It to the Test; Creating a Classification Dataset; Scaling Bag-of-Words with Tf-Idf Transformation; Classification with Logistic Regression; Tuning Logistic Regression with Regularization; Deep Dive: What Is Happening?; Summary; Bibliography
5. Categorical Variables: Counting Eggs in the Age of Robotic Chickens
Encoding Categorical Variables; One-Hot Encoding; Dummy Coding; Effect Coding; Pros and Cons of Categorical Variable Encodings; Dealing with Large Categorical Variables; Feature Hashing; Bin Counting; Summary; Bibliography
6. Dimensionality Reduction: Squashing the Data Pancake with PCA
Intuition; Derivation; Linear Projection; Variance and Empirical Variance; Principal Components: First Formulation; Principal Components: Matrix-Vector Formulation; General Solution of the Principal Components; Transforming Features; Implementing PCA; PCA in Action; Whitening and ZCA; Considerations and Limitations of PCA; Use Cases; Summary; Bibliography
7. Nonlinear Featurization via K-Means Model Stacking
k-Means Clustering; Clustering as Surface Tiling; k-Means Featurization for Classification; Alternative Dense Featurization; Pros, Cons, and Gotchas; Summary; Bibliography
8. Automating the Featurizer: Image Feature Extraction and Deep Learning
The Simplest Image Features (and Why They Don’t Work); Manual Feature Extraction: SIFT and HOG; Image Gradients; Gradient Orientation Histograms; SIFT Architecture; Learning Image Features with Deep Neural Networks; Fully Connected Layers; Convolutional Layers; Rectified Linear Unit (ReLU) Transformation; Response Normalization Layers; Pooling Layers; Structure of AlexNet; Summary; Bibliography
9. Back to the Feature: Building an Academic Paper Recommender
Item-Based Collaborative Filtering; First Pass: Data Import, Cleaning, and Feature Parsing; Academic Paper Recommender: Naive Approach; Second Pass: More Engineering and a Smarter Model; Academic Paper Recommender: Take 2; Third Pass: More Features = More Information; Academic Paper Recommender: Take 3; Summary; Bibliography

A. Linear Modeling and Linear Algebra Basics
Overview of Linear Classification; The Anatomy of a Matrix; From Vectors to Subspaces; Singular Value Decomposition (SVD); The Four Fundamental Subspaces of the Data Matrix; Solving a Linear System; Bibliography
Index

Content preview from Feature Engineering for Machine Learning

Preface

Introduction

Machine learning fits mathematical models to data in order to derive insights or make predictions. These models take features as input. A feature is a numeric representation of an aspect of raw data. Features sit between data and models in the machine learning pipeline. Feature engineering is the act of extracting features from raw data and transforming them into formats that are suitable for the machine learning model. It is a crucial step in the machine learning pipeline, because the right features can ease the difficulty of modeling, and therefore enable the pipeline to output results of higher quality. Practitioners agree that the vast majority of time in building a machine learning pipeline is spent on feature engineering and data cleaning. Yet, despite its importance, the topic is rarely discussed on its own. Perhaps this is because the right features can only be defined in the context of both the model and the data; since data and models are so diverse, it’s difficult to generalize the practice of feature engineering across projects.

Nevertheless, feature engineering is not just an ad hoc practice. There are deeper principles at work, and they are best illustrated in situ. Each chapter of this book addresses one data problem: how to represent text data or image data, how to reduce the dimensionality of autogenerated features, when and how to normalize, etc. Think of this as a collection of interconnected short stories, as opposed to a single long novel. ...
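
To make the idea of a feature as "a numeric representation of an aspect of raw data" concrete, here is a minimal sketch in Python. It is not taken from the book: the record layout, the `featurize` function, and the tiny vocabulary are all hypothetical, chosen only for illustration. It combines two techniques named in the table of contents, bag-of-words counts and a log transformation of a count.

```python
import math
from collections import Counter


def featurize(record, vocabulary):
    """Turn one raw record into a flat list of numbers (hypothetical helper)."""
    # Text aspect: count how often each vocabulary word occurs (bag-of-words).
    tokens = record["text"].lower().split()
    counts = Counter(tokens)
    bag_of_words = [float(counts[word]) for word in vocabulary]

    # Count aspect: compress a heavy-tailed count with a log transform.
    # Adding 1 keeps the transform defined when the count is zero.
    log_votes = math.log10(1 + record["votes"])

    return bag_of_words + [log_votes]


# Made-up raw data, assumed for this example only.
raw_record = {"text": "Great coffee great service", "votes": 99}
vocabulary = ["coffee", "great", "service", "slow"]

features = featurize(raw_record, vocabulary)
print(features)
```

The resulting list is something a model can consume directly, whereas the raw dictionary is not. Which transformations are appropriate depends on both the data and the model, which is the point the preface makes.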

Publisher Resources

ISBN: 9781491953235