Preprocessing of the corpora

The first step is to retrieve the corpora. We've already seen how to do this, but let's now formalize it in a function. To make it generic enough, let's enclose these functions in a file named

  1. Let's do some imports that we will use later on:
import pickle
import re
from collections import Counter
from nltk.corpus import comtrans
  2. Now, let's create the function to retrieve the corpora:
def retrieve_corpora(translated_sentences_l1_l2='alignment-de-en.txt'):
    print("Retrieving corpora: {}".format(translated_sentences_l1_l2))
    als = comtrans.aligned_sents(translated_sentences_l1_l2)
    sentences_l1 = [sent.words for sent in als]
    sentences_l2 = [sent.mots for sent in als]
    return sentences_l1, sentences_l2
...
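To make the shape of the returned value concrete, here is a minimal sketch that mirrors the two list comprehensions above without needing the comtrans corpus to be downloaded. `FakeAlignedSent`, `split_aligned`, and the toy sentences are hypothetical names and data invented for this illustration; only the `words` and `mots` attributes come from the function above.

```python
from collections import namedtuple

# Hypothetical stand-in for an aligned sentence: it models only the two
# attributes that retrieve_corpora reads (`words` and `mots`).
FakeAlignedSent = namedtuple("FakeAlignedSent", ["words", "mots"])


def split_aligned(aligned_sents):
    """Split aligned sentence pairs into two parallel lists of token lists."""
    sentences_l1 = [sent.words for sent in aligned_sents]
    sentences_l2 = [sent.mots for sent in aligned_sents]
    return sentences_l1, sentences_l2


# Toy data invented for illustration; not taken from the comtrans corpus.
toy_pairs = [
    FakeAlignedSent(["Guten", "Morgen"], ["Good", "morning"]),
    FakeAlignedSent(["Danke"], ["Thank", "you"]),
]

sentences_l1, sentences_l2 = split_aligned(toy_pairs)
print(sentences_l1)
print(sentences_l2)
```

The two lists stay index-aligned: the sentence at position i in the first list is the translation counterpart of the sentence at position i in the second, which is what later preprocessing steps can rely on.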

