Feature Engineering Made Easy

Book Description

A perfect guide to boosting the predictive power of machine learning algorithms

About This Book

  • Design, discover, and create dynamic, efficient features for your machine learning application
  • Understand your data in depth and derive astonishing data insights with the help of this guide
  • Grasp powerful feature-engineering techniques and build machine learning systems

Who This Book Is For

If you are a data science professional or a machine learning engineer looking to strengthen your predictive analytics model, then this book is a perfect guide for you. A basic understanding of machine learning concepts and Python scripting is enough to get started with this book.

What You Will Learn

  • Identify and leverage different feature types
  • Clean features in data to improve predictive power
  • Understand why and how to perform feature selection and model error analysis
  • Leverage domain knowledge to construct new features
  • Deliver features based on mathematical insights
  • Use machine-learning algorithms to construct features
  • Master feature engineering and optimization
  • Harness feature engineering for real world applications through a structured case study

In Detail

Feature engineering is the most important step in creating powerful machine learning systems. This book will take you through the entire feature-engineering journey to make your machine learning much more systematic and effective.

You will start by understanding your data. Often the success of your ML models depends on how you leverage different feature types, such as continuous, categorical, and more. You will learn when to include a feature, when to omit it, and why, all by understanding error analysis and the acceptability of your models. You will learn to convert a problem statement into useful new features. You will learn to deliver features driven by business needs as well as mathematical insights. You'll also learn how to use machine learning on your machines, automatically learning amazing features for your data.

By the end of the book, you will become proficient in Feature Selection, Feature Learning, and Feature Optimization.

Style and approach

This step-by-step guide with use cases, examples, and illustrations will help you master the concepts of feature engineering.

Along with explaining the fundamentals, the book will also introduce you to slightly advanced concepts later on and will help you implement these techniques in the real world.

Downloading the example code for this book

You can download the example code files for all Packt books you have purchased from your account at http://www.PacktPub.com. If you purchased this book elsewhere, you can visit http://www.PacktPub.com/support and register to have the files e-mailed directly to you.

Table of Contents

  1. Preface
    1. Who this book is for
    2. What this book covers
    3. To get the most out of this book
      1. Download the example code files
      2. Download the color images
      3. Conventions used
    4. Get in touch
      1. Reviews
  2. Introduction to Feature Engineering
    1. Motivating example – AI-powered communications
    2. Why feature engineering matters
    3. What is feature engineering?
      1. Understanding the basics of data and machine learning
        1. Supervised learning
        2. Unsupervised learning
      2. Unsupervised learning example – marketing segments
    4. Evaluation of machine learning algorithms and feature engineering procedures
      1. Example of feature engineering procedures – can anyone really predict the weather?
      2. Steps to evaluate a feature engineering procedure
      3. Evaluating supervised learning algorithms
      4. Evaluating unsupervised learning algorithms
    5. Feature understanding – what’s in my dataset?
    6. Feature improvement – cleaning datasets
    7. Feature selection – say no to bad attributes
    8. Feature construction – can we build it?
    9. Feature transformation – enter math-man
    10. Feature learning – using AI to better our AI
    11. Summary
  3. Feature Understanding – What's in My Dataset?
    1. The structure, or lack thereof, of data
    2. An example of unstructured data – server logs
    3. Quantitative versus qualitative data
      1. Salary ranges by job classification
    4. The four levels of data
      1. The nominal level
        1. Mathematical operations allowed
      2. The ordinal level
        1. Mathematical operations allowed
      3. The interval level
        1. Mathematical operations allowed
        2. Plotting two columns at the interval level
      4. The ratio level
        1. Mathematical operations allowed
    5. Recap of the levels of data
    6. Summary
  4. Feature Improvement - Cleaning Datasets
    1. Identifying missing values in data
      1. The Pima Indian Diabetes Prediction dataset
      2. The exploratory data analysis (EDA)
    2. Dealing with missing values in a dataset
      1. Removing harmful rows of data
      2. Imputing the missing values in data
      3. Imputing values in a machine learning pipeline
        1. Pipelines in machine learning
    3. Standardization and normalization
      1. Z-score standardization
      2. The min-max scaling method
      3. The row normalization method
      4. Putting it all together
    4. Summary
  5. Feature Construction
    1. Examining our dataset
    2. Imputing categorical features
      1. Custom imputers
      2. Custom category imputer
      3. Custom quantitative imputer
    3. Encoding categorical variables
      1. Encoding at the nominal level
      2. Encoding at the ordinal level
      3. Bucketing continuous features into categories
      4. Creating our pipeline
    4. Extending numerical features
      1. Activity recognition from the Single Chest-Mounted Accelerometer dataset
      2. Polynomial features
        1. Parameters
        2. Exploratory data analysis
    5. Text-specific feature construction
      1. Bag of words representation
      2. CountVectorizer
        1. CountVectorizer parameters
      3. The Tf-idf vectorizer
      4. Using text in machine learning pipelines
    6. Summary
  6. Feature Selection
    1. Achieving better performance in feature engineering
      1. A case study – a credit card defaulting dataset
    2. Creating a baseline machine learning pipeline
    3. The types of feature selection
      1. Statistical-based feature selection
        1. Using Pearson correlation to select features
        2. Feature selection using hypothesis testing
          1. Interpreting the p-value
          2. Ranking the p-value
      2. Model-based feature selection
        1. A brief refresher on natural language processing
        2. Using machine learning to select features
          1. Tree-based model feature selection metrics
        3. Linear models and regularization
          1. A brief introduction to regularization
          2. Linear model coefficients as another feature importance metric
    4. Choosing the right feature selection method
    5. Summary
  7. Feature Transformations
    1. Dimension reduction – feature transformations versus feature selection versus feature construction
    2. Principal Component Analysis
      1. How PCA works
      2. PCA with the Iris dataset – manual example
        1. Creating the covariance matrix of the dataset
        2. Calculating the eigenvalues of the covariance matrix
        3. Keeping the top k eigenvalues (sorted by the descending eigenvalues)
        4. Using the kept eigenvectors to transform new data-points
    3. Scikit-learn's PCA
    4. How centering and scaling data affects PCA
    5. A deeper look into the principal components
    6. Linear Discriminant Analysis
      1. How LDA works
        1. Calculating the mean vectors of each class
        2. Calculating within-class and between-class scatter matrices
        3. Calculating eigenvalues and eigenvectors for S_W^{-1} S_B
        4. Keeping the top k eigenvectors by ordering them by descending eigenvalues
        5. Using the top eigenvectors to project onto the new space
      2. How to use LDA in scikit-learn
    7. LDA versus PCA – iris dataset
    8. Summary
  8. Feature Learning
    1. Parametric assumptions of data
      1. Non-parametric fallacy
      2. The algorithms of this chapter
    2. Restricted Boltzmann Machines
      1. Not necessarily dimension reduction
      2. The graph of a Restricted Boltzmann Machine
      3. The restriction of a Boltzmann Machine
      4. Reconstructing the data
      5. MNIST dataset
    3. The BernoulliRBM
      1. Extracting PCA components from MNIST
    4. Extracting RBM components from MNIST
    5. Using RBMs in a machine learning pipeline
      1. Using a linear model on raw pixel values
      2. Using a linear model on extracted PCA components
      3. Using a linear model on extracted RBM components
    6. Learning text features – word vectorizations
      1. Word embeddings
      2. Two approaches to word embeddings - Word2vec and GloVe
      3. Word2vec - another shallow neural network
      4. The gensim package for creating Word2vec embeddings
      5. Application of word embeddings - information retrieval
    7. Summary
  9. Case Studies
    1. Case study 1 - facial recognition
      1. Applications of facial recognition
      2. The data
      3. Some data exploration
      4. Applied facial recognition
    2. Case study 2 - predicting topics of hotel reviews data
      1. Applications of text clustering
      2. Hotel review data
      3. Exploration of the data
      4. The clustering model
      5. SVD versus PCA components
      6. Latent semantic analysis 
    3. Summary
  10. Other Books You May Enjoy
    1. Leave a review - let other readers know what you think