Preprocessing of the corpora

The first step is to retrieve the corpora. We've already seen how to do this, but let's now formalize it in a function. To keep things generic, let's put this function and the related helpers in a file named corpora_tools.py.

  1. Let's do some imports that we will use later on:
import pickle
import re
from collections import Counter
from nltk.corpus import comtrans
  2. Now, let's create the function to retrieve the corpora:
def retrieve_corpora(translated_sentences_l1_l2='alignment-de-en.txt'):
    print("Retrieving corpora: {}".format(translated_sentences_l1_l2))
    als = comtrans.aligned_sents(translated_sentences_l1_l2)
    sentences_l1 = [sent.words for sent in als]
    sentences_l2 = [sent.mots for sent in als]
    return sentences_l1, sentences_l2
...
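
To see what the two returned lists look like without downloading the comtrans corpus, the splitting step can be sketched on hand-made data. This is only an illustration: AlignedPair and split_aligned are hypothetical names, and the namedtuple merely stands in for NLTK's aligned-sentence objects by exposing the same words and mots attributes that the function above reads. The sample sentences are invented.

```python
from collections import namedtuple

# Hypothetical stand-in for an NLTK aligned sentence: it only mimics the
# two attributes used by retrieve_corpora (`words` and `mots`).
AlignedPair = namedtuple("AlignedPair", ["words", "mots"])


def split_aligned(aligned_sents):
    """Split aligned pairs into two parallel lists of tokenized sentences."""
    sentences_l1 = [sent.words for sent in aligned_sents]
    sentences_l2 = [sent.mots for sent in aligned_sents]
    return sentences_l1, sentences_l2


# Invented sample data, not taken from the real corpus.
sample = [
    AlignedPair(words=["Guten", "Morgen"], mots=["Good", "morning"]),
    AlignedPair(words=["Danke"], mots=["Thank", "you"]),
]

sentences_l1, sentences_l2 = split_aligned(sample)
print(sentences_l1[0], "->", sentences_l2[0])
```

The two lists stay index-aligned: the sentence at position i in the first list corresponds to the sentence at position i in the second, which is the property later preprocessing steps rely on.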
