20 Domain Adaptation of Parts of Speech Annotators in Hindi Biomedical Corpus: An NLP Approach
Pitambar Behera1* and Om Prakash Jena2
1Centre for Linguistics, SLL & CS, Jawaharlal Nehru University, New Delhi, India
2Department of Computer Science, Ravenshaw University, Cuttack, India
Abstract
This study describes the development of a biomedical Parts of Speech (POS) annotated corpus in Hindi. It presents the adaptation of a POS tagger trained on a general-domain corpus to automatically annotate a health-domain corpus. The tagger is trained on 200,000 word tokens drawn from the mixed-domain ILCI (Indian Languages Corpora Initiative) data (in addition to 50k newswire tokens of biomedical data), which yields a satisfactory accuracy of 92%. When adapted and tested on fresh data from the biomedical domain, the tagger registers an accuracy of 86.5%. In addition, the paper sheds light on the resource-poor scenario of Hindi and other Indian regional languages in the general domain, and of biomedical corpora in particular. Furthermore, the study provides a detailed account of the issues and challenges encountered with respect to inter-rater reliability, domain adaptation of the corpus, linguistics, and NLP (Natural Language Processing).
Keywords: Parts of Speech Annotation, NLP, ILCI, biomedical text processing, Hindi, resource-poor, domain adaptation
20.1 Introduction
In the age of information revolution, the scientific research output of biomedical research ...