Feature Engineering for Machine Learning

by Alice Zheng, Amanda Casari

April 2018

Beginner to intermediate

215 pages

5h 36m

English

O'Reilly Media, Inc.

Preface
Introduction; Conventions Used in This Book; Using Code Examples; O’Reilly Safari; How to Contact Us; Acknowledgments; Special Thanks from Alice; Special Thanks from Amanda
1. The Machine Learning Pipeline
Data; Tasks; Models; Features; Model Evaluation
2. Fancy Tricks with Simple Numbers
Scalars, Vectors, and Spaces; Dealing with Counts; Binarization; Quantization or Binning; Log Transformation; Log Transform in Action; Power Transforms: Generalization of the Log Transform; Feature Scaling or Normalization; Min-Max Scaling; Standardization (Variance Scaling); ℓ2 Normalization; Interaction Features; Feature Selection; Summary; Bibliography
3. Text Data: Flattening, Filtering, and Chunking
Bag-of-X: Turning Natural Text into Flat Vectors; Bag-of-Words; Bag-of-n-Grams; Filtering for Cleaner Features; Stopwords; Frequency-Based Filtering; Stemming; Atoms of Meaning: From Words to n-Grams to Phrases; Parsing and Tokenization; Collocation Extraction for Phrase Detection; Summary; Bibliography
4. The Effects of Feature Scaling: From Bag-of-Words to Tf-Idf
Tf-Idf: A Simple Twist on Bag-of-Words; Putting It to the Test; Creating a Classification Dataset; Scaling Bag-of-Words with Tf-Idf Transformation; Classification with Logistic Regression; Tuning Logistic Regression with Regularization; Deep Dive: What Is Happening?; Summary; Bibliography
5. Categorical Variables: Counting Eggs in the Age of Robotic Chickens
Encoding Categorical Variables; One-Hot Encoding; Dummy Coding; Effect Coding; Pros and Cons of Categorical Variable Encodings; Dealing with Large Categorical Variables; Feature Hashing; Bin Counting; Summary; Bibliography
6. Dimensionality Reduction: Squashing the Data Pancake with PCA
Intuition; Derivation; Linear Projection; Variance and Empirical Variance; Principal Components: First Formulation; Principal Components: Matrix-Vector Formulation; General Solution of the Principal Components; Transforming Features; Implementing PCA; PCA in Action; Whitening and ZCA; Considerations and Limitations of PCA; Use Cases; Summary; Bibliography
7. Nonlinear Featurization via K-Means Model Stacking
k-Means Clustering; Clustering as Surface Tiling; k-Means Featurization for Classification; Alternative Dense Featurization; Pros, Cons, and Gotchas; Summary; Bibliography
8. Automating the Featurizer: Image Feature Extraction and Deep Learning
The Simplest Image Features (and Why They Don’t Work); Manual Feature Extraction: SIFT and HOG; Image Gradients; Gradient Orientation Histograms; SIFT Architecture; Learning Image Features with Deep Neural Networks; Fully Connected Layers; Convolutional Layers; Rectified Linear Unit (ReLU) Transformation; Response Normalization Layers; Pooling Layers; Structure of AlexNet; Summary; Bibliography
9. Back to the Feature: Building an Academic Paper Recommender
Item-Based Collaborative Filtering; First Pass: Data Import, Cleaning, and Feature Parsing; Academic Paper Recommender: Naive Approach; Second Pass: More Engineering and a Smarter Model; Academic Paper Recommender: Take 2; Third Pass: More Features = More Information; Academic Paper Recommender: Take 3; Summary; Bibliography

A. Linear Modeling and Linear Algebra Basics
Overview of Linear Classification; The Anatomy of a Matrix; From Vectors to Subspaces; Singular Value Decomposition (SVD); The Four Fundamental Subspaces of the Data Matrix; Solving a Linear System; Bibliography
Index

Content preview from Feature Engineering for Machine Learning

Chapter 4. The Effects of Feature Scaling: From Bag-of-Words to Tf-Idf

A bag-of-words representation is simple to generate but far from perfect. If we count all words equally, then some words end up being emphasized more than we need. Recall our example of Emma and the raven from Chapter 3. We’d like a document representation that emphasizes the two main characters. The words “Emma” and “raven” both appear three times, but “the” appears a whopping eight times, “and” appears five times, and “it” and “was” both appear four times. The main characters do not stand out by simple frequency count alone. This is problematic.
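
To make the problem concrete, here is a minimal sketch of raw bag-of-words counting. The sample sentence and the `bag_of_words` helper are hypothetical stand-ins (the Emma and raven paragraph from Chapter 3 is not reproduced here), and the regular-expression tokenizer is a simplifying assumption rather than the book's own parsing code.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Return a Counter mapping each lowercased word to its raw count."""
    # Simplistic tokenizer (an assumption): runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)


# Hypothetical sample text, not the paragraph from Chapter 3.
sample = "The raven watched the garden, and the cat watched the raven."
counts = bag_of_words(sample)

# List words from most to least frequent.
for word, n in counts.most_common():
    print(word, n)
```

In this small sample, "the" is counted more often than "raven", which is the same effect the paragraph above describes: a function word outranks the words that carry the meaning.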

It would also be nice to pick out words such as “magnificently,” “gleamed,” “intimidated,” “tentatively,” and “reigned,” because they help to set the overall tone of the paragraph. They indicate sentiment, which can be very valuable information to a data scientist. So, ideally, we’d like a representation that highlights meaningful words.

Tf-Idf: A Simple Twist on Bag-of-Words

Tf-idf is a simple twist on the bag-of-words approach. It stands for term frequency–inverse document frequency. Instead of looking at the raw counts of each word in each document in a dataset, tf-idf looks at a normalized count where each word count is divided by the number of documents this word appears in. That is:

bow(w, d) = # times word w appears in document d

tf-idf(w, d) = bow(w, d) * N / (# documents in which word w appears)

N is the total number of documents in the dataset. The ...
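
The two formulas above can be sketched directly in Python. This is a minimal illustration that follows the definitions exactly as written (raw count times N divided by the document count). The toy corpus, the whitespace tokenization, and the function name `tf_idf` are assumptions made for the example. Library implementations often apply a logarithm to the inverse document frequency, which this sketch does not do.

```python
from collections import Counter


def tf_idf(docs):
    """Compute bow(w, d) * N / (# documents containing w) for every word.

    docs: list of documents, each a list of tokens.
    Returns one dict per document, mapping word -> tf-idf score.
    """
    n_docs = len(docs)                    # N in the formula
    bows = [Counter(d) for d in docs]     # bow(w, d) for each document
    doc_freq = Counter()                  # number of documents containing each word
    for bow in bows:
        doc_freq.update(bow.keys())
    return [
        {w: count * n_docs / doc_freq[w] for w, count in bow.items()}
        for bow in bows
    ]


# Hypothetical toy corpus, tokenized by whitespace.
docs = [
    "the raven sat on the fence".split(),
    "the cat sat on the mat".split(),
    "emma saw the raven".split(),
]
scores = tf_idf(docs)
print(scores[0])
```

In the first document, "the" has the highest raw count but appears in all three documents, so its score is 2 * 3 / 3 = 2.0, while "fence", which appears once in a single document, scores 1 * 3 / 1 = 3.0.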

ISBN: 9781491953235