How it works...
In step 1, we preprocessed the raw data by removing punctuation and other non-alphanumeric characters, normalizing Unicode characters to ASCII, and converting all text to lowercase. We then created lists of German and English phrases and combined them into a DataFrame for easier data manipulation.
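The cleaning described for step 1 can be sketched as follows. This is a minimal illustration in Python using only the standard library, not the recipe's own code: the function name `clean_phrase` and the sample phrase pairs are hypothetical, and the DataFrame step is left out.

```python
import re
import unicodedata


def clean_phrase(text: str) -> str:
    """Hypothetical helper: ASCII-normalize, lowercase, and strip punctuation."""
    # Decompose accented characters (e.g. "é" -> "e" + combining accent),
    # then drop every character that has no ASCII representation.
    ascii_text = (
        unicodedata.normalize("NFKD", text)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    lowered = ascii_text.lower()
    # Remove everything that is not a lowercase letter, a digit, or whitespace.
    alnum_only = re.sub(r"[^a-z0-9\s]", "", lowered)
    # Collapse any repeated whitespace into single spaces.
    return " ".join(alnum_only.split())


# Made-up (English, German) pairs, used only for illustration.
raw_pairs = [
    ("Go.", "Geh."),
    ("I'm cold!", "Mir ist kalt!"),
    ("Where is the café?", "Wo ist das Café?"),
]

english = [clean_phrase(en) for en, _ in raw_pairs]
german = [clean_phrase(de) for _, de in raw_pairs]
```

Note that this approach silently drops characters with no ASCII decomposition (for example, the German "ß"), so a real pipeline may want to map such characters explicitly before normalizing.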
In a sequence-to-sequence model, both the input and output phrases need to be converted into integer sequences of a fixed length. Thus, in step 2, we found the number of words in the longest phrase in each of the two lists. These maximum lengths are used in the upcoming steps to pad the sentences of each language to a uniform length.
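To make the role of the maximum length concrete, here is a small sketch in plain Python. It is an assumption-laden illustration rather than the recipe's code: the helper names, the toy phrases, the word-to-integer mapping, and the choice to pad at the end with zeros are all hypothetical.

```python
def max_word_count(phrases):
    """Return the number of words in the longest phrase."""
    return max(len(phrase.split()) for phrase in phrases)


def build_vocab(phrases):
    """Map each distinct word to an integer id; 0 is reserved for padding."""
    vocab = {}
    for phrase in phrases:
        for word in phrase.split():
            if word not in vocab:
                vocab[word] = len(vocab) + 1
    return vocab


def encode_and_pad(phrase, vocab, length, pad_value=0):
    """Convert a phrase to integer ids and pad it at the end to `length`."""
    ids = [vocab[word] for word in phrase.split()]
    return (ids + [pad_value] * length)[:length]


# Made-up, already-cleaned phrases, used only for illustration.
english = ["go", "i am cold", "where is the station"]
german = ["geh", "mir ist kalt", "wo ist der nachste bahnhof"]

eng_len = max_word_count(english)  # longest English phrase, in words
ger_len = max_word_count(german)   # longest German phrase, in words

eng_vocab = build_vocab(english)
eng_padded = [encode_and_pad(p, eng_vocab, eng_len) for p in english]
```

Each language gets its own maximum length, so the English and German sequences can have different fixed widths. Whether padding goes at the start or the end of a sequence is a design choice; this sketch pads at the end.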
Next, in step 3, we created tokenizers for both the German and English phrases. For working with language ...