Latent semantic analysis (LSA) is a feature extraction tool for text. It is a chain of the following three steps, each of which we have already learned in this book (a minimal sketch of the steps run one at a time follows the list):
- A tf-idf vectorization
- A PCA (a truncated SVD in this case, because, unlike standard PCA, it can operate directly on the sparse matrices that text vectorization produces)
- Row normalization, which scales each document vector to unit length
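To make the chain concrete, here is a minimal sketch of the three steps applied one at a time. The toy corpus docs and the variable names are illustrative stand-ins, not the book's data, and two components are used only so the example runs on such a tiny vocabulary:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import Normalizer

docs = ["the cat sat on the mat",
        "the dog sat on the log",
        "cats and dogs and logs"]  # toy stand-in corpus
tfidf_matrix = TfidfVectorizer().fit_transform(docs)  # step 1: sparse tf-idf matrix
reduced = TruncatedSVD(n_components=2).fit_transform(tfidf_matrix)  # step 2: dense, low-rank matrix
unit_rows = Normalizer().fit_transform(reduced)  # step 3: every row now has unit norm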
Rather than running those steps by hand every time, we can create a scikit-learn pipeline to perform LSA:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Normalizer

tfidf = TfidfVectorizer(ngram_range=(1, 2), stop_words='english')
svd = TruncatedSVD(n_components=10)  # will extract 10 "topics"
normalizer = Normalizer()  # will give each document a unit norm
lsa = Pipeline(steps=[('tfidf', tfidf), ('svd', svd), ('normalizer', normalizer)])
Now, we can fit and transform our sentences data, like so:
lsa_sentences = lsa.fit_transform(sentences)
lsa_sentences.shape
...
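Because the final Normalizer step leaves every row with unit length, the dot product of any two rows of lsa_sentences is already their cosine similarity in the LSA space, and the fitted SVD step exposes which terms weigh most on each "topic". Here is a hedged sketch of both ideas; it assumes the pipeline was fit as above and a scikit-learn recent enough (1.0+) to provide get_feature_names_out:

import numpy as np

# unit-norm rows make this a document-by-document cosine-similarity matrix
similarities = lsa_sentences @ lsa_sentences.T

# each row of components_ weights the tf-idf vocabulary for one "topic"
terms = lsa.named_steps['tfidf'].get_feature_names_out()
for i, component in enumerate(lsa.named_steps['svd'].components_):
    top_five = np.argsort(component)[::-1][:5]  # indices of the five largest weights
    print(i, [terms[j] for j in top_five])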