Getting the data

To get this dataset, you need to accept the competition's terms and conditions and its data usage rules. For a direct download, you can get the train and test data from the data tab on the challenge website.

Alternatively, you can use the official Kaggle API (github link) to download the data from a terminal or a Python program. Whether you use the direct download or the Kaggle API, you have to split the train data into smaller train and validation sets for this notebook. You can create these splits with the sklearn.model_selection.train_test_split utility. You can also download the split data directly from the code repository that accompanies this book.
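As a minimal sketch of the splitting step, the snippet below applies train_test_split to a small stand-in dataset. The texts, labels, the 80/20 ratio, and the random seed are all illustrative assumptions, not values taken from the competition or the book; in practice you would load the downloaded train file and pass its text and label columns instead.

```python
from sklearn.model_selection import train_test_split

# Toy stand-in for the competition's train data.
# These texts and labels are hypothetical placeholders.
texts = [f"example comment {i}" for i in range(10)]
labels = [i % 2 for i in range(10)]

# Hold out 20% of the rows for validation (an assumed ratio).
# random_state makes the split reproducible; stratify keeps the
# label proportions the same in both splits.
train_texts, val_texts, train_labels, val_labels = train_test_split(
    texts,
    labels,
    test_size=0.2,
    random_state=42,
    stratify=labels,
)

print(len(train_texts), len(val_texts))
```

With ten rows and test_size=0.2, this leaves eight rows for training and two for validation. Fixing random_state is optional, but it means the same rows land in the validation set on every run, which makes results easier to compare.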
