We use torchtext again to download the IMDB dataset, tokenize it, and build the vocabulary. When creating the Field object, we leave the batch_first argument at False, because by default RNNs expect input of shape (sequence_length, batch_size, features). The following code prepares the dataset:
TEXT = data.Field(lower=True, fix_length=200, batch_first=False)
LABEL = data.Field(sequential=False)
train, test = IMDB.splits(TEXT, LABEL)
TEXT.build_vocab(train, vectors=GloVe(name='6B', dim=300), max_size=10000, min_freq=10)
LABEL.build_vocab(train)
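To make the effect of fix_length, max_size, min_freq and batch_first=False more concrete, here is a minimal pure-Python sketch of the same ideas. It is not torchtext's implementation: the helpers build_vocab and to_sequence_first, the toy reviews, and the small parameter values are all assumptions made for illustration.

```python
from collections import Counter

PAD, UNK = "<pad>", "<unk>"

def build_vocab(token_lists, max_size, min_freq):
    """Hypothetical helper: map frequent tokens to integer ids."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    # Keep tokens seen at least min_freq times, most frequent first,
    # and cap the list at max_size (special tokens are added on top).
    kept = [tok for tok, c in counts.most_common() if c >= min_freq][:max_size]
    itos = [UNK, PAD] + kept
    return {tok: i for i, tok in enumerate(itos)}

def to_sequence_first(token_lists, stoi, fix_length):
    """Hypothetical helper: pad or truncate, then return [fix_length][batch_size]."""
    padded = []
    for toks in token_lists:
        toks = toks[:fix_length] + [PAD] * max(0, fix_length - len(toks))
        # Tokens missing from the vocabulary fall back to the unknown id.
        padded.append([stoi.get(t, stoi[UNK]) for t in toks])
    # Transpose from batch-first to sequence-first layout.
    return [list(step) for step in zip(*padded)]

reviews = ["a great great movie", "a dull movie"]
tokens = [r.lower().split() for r in reviews]
stoi = build_vocab(tokens, max_size=10, min_freq=2)
batch = to_sequence_first(tokens, stoi, fix_length=5)

print(len(batch), len(batch[0]))  # sequence length first, then batch size
```

The outer dimension of batch is the sequence length and the inner one is the batch size, which is the layout that batch_first=False produces. An embedding layer would then add the third, features, dimension.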