
dc.contributor.advisor: Thomas, Johnson
dc.contributor.author: Mylavarapu, Sesha Sai Goutam Sarma
dc.date.accessioned: 2021-02-22T22:24:11Z
dc.date.available: 2021-02-22T22:24:11Z
dc.date.issued: 2020-07
dc.identifier.uri: https://hdl.handle.net/11244/328620
dc.description.abstract: Data analysis is a crucial process in the field of data science that extracts useful information from any form of data. Ease of access and maintenance makes structured data the most popular choice among many organizations even today. On the other hand, with the rapid growth of technology, more and more unstructured data, such as text and images, is being produced in large amounts. Apart from the techniques used, the quality of the data plays a prominent role in accurate analysis. Data quality deteriorates due to poor maintenance and mediocre data generation strategies employed by amateur users. This problem escalates with the advent of big data. Data cleaning is one possible solution to this problem. However, it requires a great deal of domain knowledge and expert inference to verify and repair the data. Data Quality Assessment (DQA) is an effective alternative that differentiates between good and bad quality data. Although DQA requires domain knowledge, it does not repair or change the underlying data, so it is more viable to automate. In this dissertation, we propose two quality assessment models for structured data and for the textual form of unstructured data. The context of data plays an important role in determining the quality of the data. Therefore, we automate the process of context extraction in structured data using machine learning techniques. For textual data, we use natural language processing to identify data errors and assess quality. However, an accurate source of information is necessary to identify data errors. Therefore, we propose an automated mechanism to identify the closest dataset using deep neural networks with minimal user intervention. In addition, we look into multiple dimensions of data quality, such as completeness, accuracy, and consistency, to create a comprehensive quality assessment model. Our experimental results show the importance of the data context and multiple dimensions in quality assessment.
dc.format: application/pdf
dc.language: en_US
dc.rights: Copyright is held by the author who has granted the Oklahoma State University Library the non-exclusive right to share this material in its institutional repository. Contact Digital Library Services at lib-dls@okstate.edu or 405-744-9161 for the permission policy on the use, reproduction or distribution of this material.
dc.title: Context-aware quality assessment of structured and unstructured data
dc.contributor.committeeMember: George, K. M.
dc.contributor.committeeMember: Crick, Christopher
dc.contributor.committeeMember: Sheng, Weihua
osu.filename: Mylavarapu_okstate_0664D_16833.pdf
osu.accesstype: Open Access
dc.type.genre: Dissertation
dc.type.material: Text
dc.subject.keywords: context-aware
dc.subject.keywords: data quality
dc.subject.keywords: data science
dc.subject.keywords: machine learning
dc.subject.keywords: structured data
dc.subject.keywords: unstructured data
thesis.degree.discipline: Computer Science
thesis.degree.grantor: Oklahoma State University

