Data preparation

First, we read the source text (French) and the target text (English), one sentence per line, pair them in a pandas DataFrame, and record the space-delimited word count of each sentence:

import pandas as pd

# Read the French (source) and English (target) sentences, one per line
frdata = []
endata = []
with open('data/train_fr_lines.txt') as frfile:
    for li in frfile:
        frdata.append(li)
with open('data/train_en_lines.txt') as enfile:
    for li in enfile:
        endata.append(li)

# Pair the sentences and store the space-delimited word count of each
mtdata = pd.DataFrame({'FR': frdata, 'EN': endata})
mtdata['FR_len'] = mtdata['FR'].apply(lambda x: len(x.split(' ')))
mtdata['EN_len'] = mtdata['EN'].apply(lambda x: len(x.split(' ')))

print(mtdata['FR'].head(2).values)
print(mtdata['EN'].head(2).values)

Output:

['Voici Bill Lange. Je suis Dave Gallo.\n' 'Nous allons vous raconter quelques histoires de la mer en vidéo.\n']
["This is Bill Lange. I'm Dave Gallo.\n" "And we're going to tell you some stories ...
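The listing above keeps the trailing newline on every sentence and assumes both files contain the same number of lines. As a minimal sketch of a slightly more defensive variant (not the book's own code), the hypothetical helper load_parallel below strips line endings, checks that the two files are aligned, and counts words with split(), which ignores repeated whitespace. The helper name, the UTF-8 encoding, and the two temporary demo files are assumptions made for illustration only.

```python
import os
import tempfile


def load_parallel(src_path, tgt_path, encoding="utf-8"):
    """Read two line-aligned files and return one record per sentence pair.

    Hypothetical helper for illustration; the encoding is an assumption.
    """
    with open(src_path, encoding=encoding) as src, \
            open(tgt_path, encoding=encoding) as tgt:
        # Drop the trailing newline that iterating over a file keeps.
        src_lines = [line.rstrip("\n") for line in src]
        tgt_lines = [line.rstrip("\n") for line in tgt]

    # A parallel corpus only makes sense if line i matches line i.
    if len(src_lines) != len(tgt_lines):
        raise ValueError(
            f"line count mismatch: {len(src_lines)} vs {len(tgt_lines)}"
        )

    return [
        {
            "FR": fr,
            "EN": en,
            # split() with no argument splits on any run of whitespace.
            "FR_len": len(fr.split()),
            "EN_len": len(en.split()),
        }
        for fr, en in zip(src_lines, tgt_lines)
    ]


# Demo on two tiny temporary files standing in for the real corpus files.
with tempfile.TemporaryDirectory() as tmp:
    fr_path = os.path.join(tmp, "fr.txt")
    en_path = os.path.join(tmp, "en.txt")
    with open(fr_path, "w", encoding="utf-8") as f:
        f.write("Voici Bill Lange. Je suis Dave Gallo.\n")
    with open(en_path, "w", encoding="utf-8") as f:
        f.write("This is Bill Lange. I'm Dave Gallo.\n")
    rows = load_parallel(fr_path, en_path)

print(rows[0]["FR_len"], rows[0]["EN_len"])
```

The returned list of dictionaries can be passed directly to pd.DataFrame(rows) to obtain the same four columns as mtdata.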
