Chapter 17. Supporting Multiple Languages

When building an NLP system, the first question you should answer is which language or languages you will support. The answer can affect everything from data storage to modeling to the user interface. In this chapter, we will talk about what you should consider when productionizing a multilingual NLP system.

At the end of the chapter, we will have a checklist of questions to ask yourself about your project.

Language Typology

When supporting multiple languages, one way to manage complexity is to identify commonalities among the languages you expect. For example, if you are dealing with only Western European languages, you know that you need to consider only the Latin alphabet and its extensions. You also know that the languages are fusional, so stemming or lemmatizing will work. Most of them also have similar grammatical gender systems: masculine, feminine, and sometimes a neuter.
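One quick way to check the writing-system assumption is to count which scripts actually appear in a sample of your text. The following is a minimal sketch that uses only Python's standard `unicodedata` module; the function name `script_counts` and the sample strings are hypothetical, and taking the first word of a character's Unicode name is a rough heuristic for its script, not a full script-detection method.

```python
import unicodedata
from collections import Counter


def script_counts(text: str) -> Counter:
    """Count alphabetic characters by the first word of their Unicode name.

    For example, "LATIN SMALL LETTER E WITH ACUTE" is counted as "LATIN".
    This is a heuristic, not an implementation of the Unicode Script property.
    """
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, punctuation, and whitespace
        name = unicodedata.name(ch, "")
        if name:
            counts[name.split(" ", 1)[0]] += 1
    return counts


# Hypothetical sample sentences, one per language.
samples = {
    "English": "The quick brown fox",
    "French": "Où est la bibliothèque ?",
    "German": "Straße und Größe",
    "Russian": "Привет, мир",
}

for language, sample in samples.items():
    print(language, dict(script_counts(sample)))
```

If a language in your set reports a script other than Latin, that is an early signal that tokenization, normalization, and font or storage choices need a second look.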

Let’s look at a hypothetical scenario.

Scenario: Academic Paper Classification

In this scenario, your inputs will be text documents, PDF documents, or scans of text documents. The expected output is a set of JSON documents, each with text, title, and tags. The languages you will accept as input are English, French, German, and Russian. You have labeled data, but only from the last five years of articles, because that is when the publisher started requiring that articles be tagged during submission. The initial classifications can be at the ...
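The output format described in this scenario can be sketched as a small record type that is serialized to JSON. This is an illustration only: the class name `PaperRecord`, the helper `to_json`, and the sample values are hypothetical, and the only fields taken from the scenario are text, title, and tags.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class PaperRecord:
    """One output document: the three fields named in the scenario."""
    title: str
    text: str
    tags: List[str] = field(default_factory=list)


def to_json(record: PaperRecord) -> str:
    # ensure_ascii=False keeps non-Latin text (for example, Russian)
    # human-readable instead of escaping it as \uXXXX sequences.
    return json.dumps(asdict(record), ensure_ascii=False)


# Hypothetical example record.
record = PaperRecord(
    title="Пример статьи",
    text="Текст статьи ...",
    tags=["linguistics"],
)
print(to_json(record))
```

Writing the JSON with `ensure_ascii=False` is a design choice rather than a requirement; either form parses back to the same strings, but the unescaped form is easier to inspect when the corpus mixes Latin and Cyrillic text.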
