OUTLIER EXPLANATIONS FOR DATA STREAMS: OUTLYING ATTRIBUTES AND ROOT CAUSE FACTORS OF OUTLIERS

Panjei, Egawati

dc.contributor.advisor	Gruenwald, Le
dc.contributor.author	Panjei, Egawati
dc.date.accessioned	2024-07-15T15:42:04Z
dc.date.available	2024-07-15T15:42:04Z
dc.date.issued	2024-08-01
dc.identifier.uri	https://hdl.handle.net/11244/340474
dc.description.abstract	Data streams, which are continuous sequences of timestamped data points, necessitate real-time monitoring due to their time-sensitive nature. They have special characteristics such as the notion of infinity (continuous arrival of data points and unbounded volume of data) and concept drift. Detecting outliers, which are data points that differ significantly from the rest of the data in the dataset, is crucial in many data stream applications. For example, in network security and credit card transaction monitoring, real-time detection of outliers is vital, as these outliers often signify potential threats. However, investigating detected outliers usually requires significant time and effort from users. Therefore, providing real-time outlier explanations is equally important, as it enables users to gain insights and shorten their investigation time. When an outlier is detected as a multidimensional object, the time needed to investigate it grows with the number of its attributes. Hence, providing an outlier explanation in the form of the set of attributes responsible for the outlier's abnormality (also known as the outlying attributes) is necessary. In some applications, such as cloud monitoring, expert users spend considerable time and effort investigating the root cause factors of an outlier detected in the front-end service. This investigation involves examining everything from the front-end service all the way to the back-end services. Providing an outlier explanation in the form of the root cause factors of the outlier is important to minimize the user effort needed to identify the reasons for its occurrence. Techniques exist that discover outlying attributes and root cause factors of outliers for data streams. However, they do not simultaneously address the characteristics of data streams, especially the notion of infinity and concept drift.
This dissertation proposes two outlier explanation algorithms, EXOS and Ocular, for discovering outlying attributes and root cause factors of outliers, respectively. EXOS is designed for discovering outlying attributes of multidimensional outliers in data streams. Unlike existing techniques, EXOS leverages cross-correlations among data streams, accommodates varying data stream schemas and arrival rates, and effectively addresses challenges related to the unbounded volume of data and concept drift. The algorithm provides real-time explanations based on the local context of the outlier, derived from a time-based tumbling window. Ocular is an algorithm designed to identify root cause factors of point outliers in continuous real-time data streams. It utilizes a user-provided normal causal graph, which depicts the causal relationships or dependencies between variables in a system. When the value of a variable at a particular timestamp is detected as an outlier, Ocular employs this causal graph to identify the variables responsible for the anomalous value of the target variable. The algorithm simultaneously addresses inherent characteristics of data streams: the notion of time, the notion of infinity, and concept drift. Extensive theoretical and empirical analyses have been conducted to evaluate the performance of EXOS and Ocular using both real and synthetic datasets. The evaluation results show that, on average, EXOS achieves a 45.6% better F1 Score and a 7.3 times lower explanation time than existing outlying attribute algorithms. Additionally, Ocular outperforms current root cause identification algorithms by 170% in F1 Score on average, while maintaining comparable or lower explanation times.	en_US
dc.language	en_US	en_US
dc.rights	Attribution-NonCommercial-ShareAlike 4.0 International	*
dc.rights.uri	https://creativecommons.org/licenses/by-nc-sa/4.0/	*
dc.subject	Outlier Explanation	en_US
dc.subject	Root Cause Analysis	en_US
dc.subject	Data Mining	en_US
dc.subject	Data Stream Analysis	en_US
dc.title	OUTLIER EXPLANATIONS FOR DATA STREAMS: OUTLYING ATTRIBUTES AND ROOT CAUSE FACTORS OF OUTLIERS	en_US
dc.contributor.committeeMember	Cheng, Qi
dc.contributor.committeeMember	Mudduluru, Sanjana
dc.contributor.committeeMember	Trafalis, Theodore
dc.date.manuscript	2024-07-10
dc.thesis.degree	Ph.D.	en_US
ou.group	Gallogly College of Engineering::School of Computer Science	en_US
shareok.orcid	https://orcid.org/0000-0001-6681-7847	en_US
shareok.nativefileaccess	restricted	en_US