Scaling up Labeling, Mining, and Inferencing on Event Extraction
dc.contributor.advisor | Grant, Christan | |
dc.contributor.author | Liang, Yan | |
dc.contributor.committeeMember | Fagg, Andrew | |
dc.contributor.committeeMember | Lu, Kun | |
dc.contributor.committeeMember | Hougen, Dean | |
dc.contributor.committeeMember | Cheng, Qi | |
dc.date.accessioned | 2022-05-06T19:27:20Z | |
dc.date.available | 2022-05-06T19:27:20Z | |
dc.date.issued | 2022-05 | |
dc.date.manuscript | 2022-05 | |
dc.description.abstract | Numerous important events happen every day and are reported in different media sources with varying narrative styles across different knowledge domains and languages. Detecting real-world events reported in online articles and posts is one of the main tasks in event extraction. Other tasks include identifying event triggers and trigger types, identifying event arguments and argument types, clustering and tracking similar events from different texts, event prediction, and event evolution. As one of the most important research themes in natural language processing and understanding, event extraction has wide applications in diverse domains and has been intensively researched for decades. This work scales up the end-to-end event extraction task in three ways. First, we scale up the event labeling process to different languages and domains. We designed and implemented four approaches to accurately and efficiently produce multilingual labels for events. Using these approaches, we were able to complete Arabic actor and verb dictionaries with coverage equivalent to English in less than two years of work, compared to two decades for English dictionary development. Second, we scale up event extraction by using document topic information in a topic-aware deep learning framework. We propose a domain-aware event extraction method that uses topic name embeddings to enrich the sentences' contextual representations, together with a multi-task setup of event extraction and topic classification. With the topic-aware model we developed, we were able to improve F1 by 1.8% on all event types and by 13.34% on few-shot event types. Third, we scale up event extraction by designing containerized and efficient pipelines that researchers can comfortably adopt. The pipeline has a container-based architecture that adapts to the available systems and load to process text. With Kalman-filter-based batch size optimization, we were able to achieve a 20.33% improvement in processing time compared to a static batch size. Using the pipeline we developed, we were able to publish the largest machine-coded political event dataset, covering 1979 to 2016 (2 TB, 300 million documents). | en_US |
dc.identifier.uri | https://hdl.handle.net/11244/335586 | |
dc.language | en_US | en_US |
dc.subject | Machine Learning | en_US |
dc.subject | Information Extraction | en_US |
dc.subject | Natural Language Processing | en_US |
dc.subject | Text Mining | en_US |
dc.thesis.degree | Ph.D. | en_US |
dc.title | Scaling up Labeling, Mining, and Inferencing on Event Extraction | en_US |
ou.group | Gallogly College of Engineering::School of Computer Science | en_US |
shareok.nativefileaccess | restricted | en_US |
shareok.orcid | 0000-0002-1192-7288 | en_US |