Published Date: 2018-02-08
Registration: ISSN 2374-3468 (Online) ISSN 2159-5399 (Print)
Copyright: Published by AAAI Press, Palo Alto, California USA Copyright © 2018, Association for the Advancement of Artificial Intelligence All Rights Reserved.
Fine-grained Entity Recognition (FgER) is the task of detecting and classifying entity mentions into more than 100 types. The type set can span various domains including biomedical (e.g., disease, gene), sport (e.g., sports event, sports player), religion and mythology (e.g., religion, god) and entertainment (e.g., movies, music). Most of the existing literature for Entity Recognition (ER) focuses on coarse-grained entity recognition (CgER), i.e., recognition of entities belonging to few types such as person, location and organization. In the past two decades, several manually annotated datasets spanning different genre of texts were created to facilitate the development and evaluation of CgER systems (Nadeau and Sekine 2007). The state-of-the-art CgER systems use supervised statistical learning models trained on manually annotated datasets (Ma and Hovy 2016). In contrast, FgER systems are yet to match the performance level of CgER systems. There are two major challenges associated with failure of FgER systems. First, manually annotating a large-scale multi-genre training data for FgER task is expensive, time-consuming and error-prone. Note that, a human-annotator will have to choose a subset of types from a large set of types and types for the same entity might differ in sentences based on the contextual information. Second, supervised statistical learning models when trained on automatically generated noisy training data fits to noise, impacting the model’s performance. The objective of my thesis is to create a FgER system by exploring an off the beaten path which can eliminate the need for manually annotating large-scale multi-genre training dataset. The path includes: (1) automatically generating a large-scale single-genre training dataset, (2) noise-aware learning models that learn better in noisy datasets, and (3) use of knowledge transfer approaches to adapt FgER system to different genres of text.