Latent semantic analysis 

Latent semantic analysis (LSA) is a feature extraction tool. It is helpful for text data and is simply a chain of these three steps, all of which we have already learned in this book and which are sketched one at a time just below the list:

  • A tf-idf vectorization
  • A PCA-style dimensionality reduction (truncated SVD in this case, to account for the sparsity of text)
  • Row normalization
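
Before chaining these steps together, here is a minimal sketch of the same three steps run one at a time. It assumes that sentences is an iterable of raw text documents, the same variable we fit on later in this section:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import Normalizer

# step 1: tf-idf vectorization of the raw documents (a sparse matrix)
tfidf_matrix = TfidfVectorizer(ngram_range=(1, 2), stop_words='english').fit_transform(sentences)

# step 2: truncated SVD reduces the sparse tf-idf matrix to 10 components ("topics")
svd_matrix = TruncatedSVD(n_components=10).fit_transform(tfidf_matrix)

# step 3: scale each document (row) to unit length
lsa_matrix = Normalizer().fit_transform(svd_matrix)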

We can create a scikit-learn pipeline to perform LSA:

from sklearn.pipeline import Pipeline

tfidf = TfidfVectorizer(ngram_range=(1, 2), stop_words='english')
svd = TruncatedSVD(n_components=10)  # will extract 10 "topics"
normalizer = Normalizer()  # will give each document a unit norm

lsa = Pipeline(steps=[('tfidf', tfidf), ('svd', svd), ('normalizer', normalizer)])

Now, we can fit and transform our sentences data, like so:

lsa_sentences = lsa.fit_transform(sentences)

lsa_sentences.shape ...
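
Each row of lsa_sentences is now a unit-length, 10-dimensional vector representing one document. One way to sanity-check the result is to look at the highest-weighted terms in each SVD component, which gives a rough sense of the ten "topics". A minimal sketch, assuming the lsa pipeline above has already been fitted and that a recent scikit-learn version (one that provides get_feature_names_out) is installed:

# map each SVD component back to the tf-idf vocabulary
terms = lsa.named_steps['tfidf'].get_feature_names_out()
for topic_idx, component in enumerate(lsa.named_steps['svd'].components_):
    # indices of the five largest weights in this component
    top_terms = [terms[i] for i in component.argsort()[::-1][:5]]
    print(topic_idx, top_terms)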
