Truncated SVD for categorical and sparse data

Dimensionality reduction can be very useful for datasets with many categorical variables, especially when each of these variables has a lot of possible values.
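To make the dimensionality problem concrete, here is a minimal sketch of one-hot encoding, which is one common way (an assumption here, not necessarily the encoding used later) to turn categorical values into a matrix. The class name OneHotSketch, its methods, and the sample values are hypothetical. Each row is stored only as the column indices of its ones, so a row with two categorical variables has two non-zero entries no matter how many distinct values exist overall.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration only.
class OneHotSketch {
    // Assigns a column index to every distinct (variable, value) pair.
    static Map<String, Integer> buildIndex(String[][] rows) {
        Map<String, Integer> index = new LinkedHashMap<>();
        for (String[] row : rows) {
            for (int c = 0; c < row.length; c++) {
                // Prefix with the variable position so equal strings in
                // different variables get different columns.
                index.putIfAbsent(c + "=" + row[c], index.size());
            }
        }
        return index;
    }

    // Encodes each row as the column indices of its ones (sparse form).
    static int[][] encode(String[][] rows, Map<String, Integer> index) {
        int[][] encoded = new int[rows.length][];
        for (int r = 0; r < rows.length; r++) {
            encoded[r] = new int[rows[r].length];
            for (int c = 0; c < rows[r].length; c++) {
                encoded[r][c] = index.get(c + "=" + rows[r][c]);
            }
        }
        return encoded;
    }

    public static void main(String[] args) {
        // Made-up sample rows: (product, state).
        String[][] rows = {
            {"Mortgage", "CA"},
            {"Credit card", "NY"},
            {"Mortgage", "NY"}
        };
        Map<String, Integer> index = buildIndex(rows);
        int[][] encoded = encode(rows, index);
        System.out.println("columns: " + index.size());
        System.out.println("non-zeros in last row: " + encoded[2].length);
    }
}
```

The number of columns grows with the number of distinct values, while the number of non-zero entries per row stays equal to the number of variables. This is why such matrices are both very wide and very sparse.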

When we have sparse matrices of very high dimensionality, computing the full SVD is typically very expensive. Truncated SVD, which computes only the leading singular values and vectors, is especially well suited for this case, and here we will see how to use it. In the next chapter, we will see that it is also very useful for text data. For now, we will have a look at how to use it for categorical variables.
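The idea can be illustrated with a small, self-contained sketch. It is not the implementation of any particular library: it swaps in plain power iteration with deflation, a simple technique chosen for readability, and assumes a binary matrix stored as the column indices of the ones in each row. The class name TruncatedSvdSketch and all of its methods are hypothetical, and the sketch assumes that k does not exceed the rank of the matrix. The point is that only sparse matrix-vector products are needed, so the full decomposition is never formed.

```java
import java.util.Random;

// Hypothetical sketch: top-k singular values and right singular vectors
// of a sparse binary matrix A via power iteration on A^T A with deflation.
class TruncatedSvdSketch {
    // y = A v, where row r of A has ones at the columns listed in rows[r].
    static double[] times(int[][] rows, double[] v) {
        double[] y = new double[rows.length];
        for (int r = 0; r < rows.length; r++)
            for (int c : rows[r]) y[r] += v[c];
        return y;
    }

    // x = A^T y
    static double[] transposeTimes(int[][] rows, double[] y, int nCols) {
        double[] x = new double[nCols];
        for (int r = 0; r < rows.length; r++)
            for (int c : rows[r]) x[c] += y[r];
        return x;
    }

    static double norm(double[] v) {
        double s = 0;
        for (double x : v) s += x * x;
        return Math.sqrt(s);
    }

    // Removes the components of v along the first `count` basis vectors,
    // then rescales v to unit length (assumes the remainder is non-zero).
    static void orthonormalize(double[] v, double[][] basis, int count) {
        for (int j = 0; j < count; j++) {
            double dot = 0;
            for (int i = 0; i < v.length; i++) dot += v[i] * basis[j][i];
            for (int i = 0; i < v.length; i++) v[i] -= dot * basis[j][i];
        }
        double n = norm(v);
        for (int i = 0; i < v.length; i++) v[i] /= n;
    }

    // Fills V (k x nCols) with right singular vectors, returns singular values.
    static double[] fit(int[][] rows, int nCols, int k, int iterations, double[][] V) {
        Random rnd = new Random(42); // fixed seed for reproducibility
        double[] sigma = new double[k];
        for (int j = 0; j < k; j++) {
            double[] v = new double[nCols];
            for (int i = 0; i < nCols; i++) v[i] = rnd.nextGaussian();
            orthonormalize(v, V, j);
            for (int it = 0; it < iterations; it++) {
                v = transposeTimes(rows, times(rows, v), nCols); // A^T (A v)
                orthonormalize(v, V, j); // stay orthogonal to earlier vectors
            }
            V[j] = v;
            sigma[j] = norm(times(rows, v)); // sigma = ||A v|| for unit v
        }
        return sigma;
    }

    // Reduced representation A V^T: one k-dimensional vector per row.
    static double[][] project(int[][] rows, double[][] V) {
        double[][] out = new double[rows.length][V.length];
        for (int r = 0; r < rows.length; r++)
            for (int j = 0; j < V.length; j++)
                for (int c : rows[r]) out[r][j] += V[j][c];
        return out;
    }

    public static void main(String[] args) {
        // Made-up 4 x 4 one-hot encoded matrix (two variables per row).
        int[][] rows = {{0, 2}, {0, 3}, {1, 3}, {0, 2}};
        int k = 2;
        double[][] V = new double[k][4];
        double[] sigma = fit(rows, 4, k, 200, V);
        double[][] reduced = project(rows, V);
        System.out.println("singular values: " + sigma[0] + ", " + sigma[1]);
        System.out.println("reduced dimensionality: " + reduced[0].length);
    }
}
```

Power iteration converges slowly when neighbouring singular values are close to each other, so production code normally relies on Lanczos-style or randomized methods instead. The cost structure is the same, however: every step touches only the non-zero entries of the matrix.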

For this, we will use a dataset about customer complaints from Kaggle. You can download it from here: https://www.kaggle.com/cfpb/us-consumer-finance-complaints ...
