15 Strategies for Corpus Development for Low-Resource Languages: Insights from Nepal

Bal Krishna Bal¹*, Balaram Prasain², Rupak Raj Ghimire¹ and Praveen Acharya³

¹ Information and Language Processing Research Lab, Department of Computer Science and Engineering, Kathmandu University, Dhulikhel, Nepal

² Central Department of Linguistics, Tribhuvan University, Kirtipur, Nepal

³ School of Computing, Dublin City University, Dublin, Ireland

Abstract

Datasets, or corpora, are crucial ingredients in the development of any language technology. In most cases, however, these resources are a major bottleneck, especially for low-resource languages. Typically, a low-resource language lacks the technological support needed to encode its script or language computationally. Even for languages that have such support, the resources are sparsely developed and lack benchmarking mechanisms, which raises questions about the validity of any research and development that relies on them. It is high time that low-resource languages develop specific short-, medium-, and long-term strategies to address these issues, so that research and development of language technologies for these languages can advance without falling too far behind high-resource languages, even if not on par with them. This chapter explores the scenario of language computing with a particular focus on the speech and machine translation domains in the context of low-resource ...

Get Automatic Speech Recognition and Translation for Low Resource Languages now with the O’Reilly learning platform.