Latent semantic analysis 

Latent semantic analysis (LSA) is a feature extraction tool that is especially helpful for text. It is a series of these three steps, all of which we have already learned in this book:

  • A TF-IDF vectorization
  • A PCA (a truncated SVD in this case, which does not center the data and therefore keeps the sparse text matrix sparse)
  • Row normalization

We can create a scikit-learn pipeline to perform LSA:

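from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import Normalizer
from sklearn.pipeline import Pipeline

tfidf = TfidfVectorizer(ngram_range=(1, 2), stop_words='english')  # unigrams and bigrams, English stop words removed
svd = TruncatedSVD(n_components=10)  # will extract 10 "topics"
normalizer = Normalizer()  # will give each document a unit norm

lsa = Pipeline(steps=[('tfidf', tfidf), ('svd', svd), ('normalizer', normalizer)])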
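To make the three steps concrete, the following is a minimal sketch that runs them one at a time and compares the result with an equivalent pipeline. The corpus toy_docs is a hypothetical stand-in and is not the sentences data used in this book. n_components is reduced to 2 because the toy corpus has only four documents, and random_state=0 is an assumption added here so that both runs use the same randomized SVD.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Normalizer

# Hypothetical toy corpus (an assumption for illustration only).
toy_docs = [
    "cats chase mice in the garden",
    "dogs chase cats in the garden",
    "mice eat cheese in the kitchen",
    "dogs eat food in the kitchen",
]

# Step 1: TF-IDF turns the raw text into a sparse document-term matrix.
tfidf_matrix = TfidfVectorizer(
    ngram_range=(1, 2), stop_words='english'
).fit_transform(toy_docs)

# Step 2: truncated SVD projects each document onto 2 components ("topics").
reduced = TruncatedSVD(n_components=2, random_state=0).fit_transform(tfidf_matrix)

# Step 3: scale every row (document) to unit L2 norm.
manual_result = Normalizer().fit_transform(reduced)

# The same three steps chained in a single pipeline.
toy_lsa = Pipeline(steps=[
    ('tfidf', TfidfVectorizer(ngram_range=(1, 2), stop_words='english')),
    ('svd', TruncatedSVD(n_components=2, random_state=0)),
    ('normalizer', Normalizer()),
])
pipeline_result = toy_lsa.fit_transform(toy_docs)

# Both routes should produce the same matrix.
print(np.allclose(manual_result, pipeline_result))
```

The pipeline adds no new computation; it only chains the three transformers so that one fit_transform call handles them all.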

Now, we can fit and transform our sentences data, like so:

lsa_sentences = lsa.fit_transform(sentences)

lsa_sentences.shape ...
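As a quick sanity check on what the pipeline returns, the sketch below fits the same kind of pipeline on a small hypothetical corpus (toy_docs, which is not the book's sentences data) and inspects the output. n_components=2 and random_state=0 are assumptions chosen to suit the tiny corpus. We would expect one row per document, one column per component, and rows of unit length because of the final normalization step.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Normalizer

# Hypothetical toy corpus (an assumption for illustration only).
toy_docs = [
    "cats chase mice in the garden",
    "dogs chase cats in the garden",
    "mice eat cheese in the kitchen",
    "dogs eat food in the kitchen",
]

toy_lsa = Pipeline(steps=[
    ('tfidf', TfidfVectorizer(ngram_range=(1, 2), stop_words='english')),
    ('svd', TruncatedSVD(n_components=2, random_state=0)),
    ('normalizer', Normalizer()),
])
toy_result = toy_lsa.fit_transform(toy_docs)

# Shape is (number of documents, n_components).
print(toy_result.shape)

# Each row is scaled to unit L2 norm by the Normalizer step.
row_norms = np.linalg.norm(toy_result, axis=1)
print(row_norms)
```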

