The first challenge is to write the trained model to disk. Let's start by training the pipeline.
Let's get the imports out of the way:
```python
import gzip
import logging
import os
from pathlib import Path
from urllib.request import urlretrieve

import joblib
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression as LR
from sklearn.pipeline import Pipeline
from tqdm import tqdm
```

Note that `joblib` is now imported directly rather than via `sklearn.externals.joblib`, which was removed in scikit-learn 0.23.
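With the imports in place, the pipeline can be assembled and persisted with `joblib`. Here is a minimal sketch of that shape; the toy texts, labels, and the `model.joblib` filename are placeholders for illustration, not the tutorial's actual data:

```python
import joblib
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Assemble the same vectorize -> tf-idf -> classifier pipeline the imports suggest.
pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", LogisticRegression()),
])

# Placeholder training data, just to make the sketch runnable.
texts = ["good movie", "bad movie", "great film", "terrible film"]
labels = [1, 0, 1, 0]
pipeline.fit(texts, labels)

# Persist the whole fitted pipeline (vectorizer vocabulary included) to disk,
# then load it back to verify the round trip.
joblib.dump(pipeline, "model.joblib")
loaded = joblib.load("model.joblib")
```

Because the vectorizer and classifier travel together in one `Pipeline` object, a single `joblib.dump` captures everything needed to serve predictions later.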
Let's write some utils for reading the data from text files and downloading them if absent:
Let's start by setting up a download progress bar. We will do this by building a small abstraction over the tqdm
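One common way to wire tqdm into `urlretrieve` is to subclass `tqdm` and expose a method matching the reporthook signature. The sketch below follows that pattern; the `TqdmUpTo` and `download` names are illustrative, not necessarily the names used in the rest of the tutorial:

```python
from urllib.request import urlretrieve

from tqdm import tqdm


class TqdmUpTo(tqdm):
    """tqdm subclass with an update method matching urlretrieve's reporthook."""

    def update_to(self, blocks=1, block_size=1, total_size=None):
        if total_size is not None:
            self.total = total_size
        # tqdm.update() expects an increment, so convert the cumulative
        # byte count (blocks * block_size) into a delta from self.n.
        self.update(blocks * block_size - self.n)


def download(url, path):
    """Download url to path, rendering a byte-scaled progress bar."""
    name = url.split("/")[-1]
    with TqdmUpTo(unit="B", unit_scale=True, miniters=1, desc=name) as t:
        urlretrieve(url, path, reporthook=t.update_to)
```

`urlretrieve` calls the reporthook with `(block_number, block_size, total_size)` after each chunk, so `update_to` can keep the bar's position in sync with the bytes received.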