OUTLIER EXPLANATIONS FOR DATA STREAMS: OUTLYING ATTRIBUTES AND ROOT CAUSE FACTORS OF OUTLIERS

Panjei, Egawati

Date

2024-08-01

Abstract

Data streams are continuous sequences of timestamped data points and, because they are time-sensitive, require real-time monitoring. They have special characteristics such as the notion of infinity (continuous arrival of data points and an unbounded volume of data) and concept drift. Detecting outliers, which are data points that differ significantly from the rest of the data, is crucial in many data stream applications. For example, in network security and credit card transaction monitoring, real-time detection of outliers is vital, as these outliers often signify potential threats. However, investigating detected outliers usually requires significant time and effort from users. Therefore, providing real-time outlier explanations is equally important, as it enables users to gain insights and shorten their investigation time. When an outlier is detected as a multidimensional object, the time needed to investigate it grows with the number of its attributes. Hence, providing an outlier explanation in the form of the set of attributes responsible for the outlier's abnormality (also known as the outlying attributes) is necessary. In some applications, such as cloud monitoring, expert users spend considerable time and effort investigating the root cause factors of an outlier detected in the front-end service. This investigation involves examining everything from the front-end service all the way to the back-end services. Providing an outlier explanation in the form of the outlier's root cause factors is important to minimize the user effort needed to identify why the outlier occurred. Techniques exist that discover outlying attributes and root cause factors of outliers for data streams. However, they do not simultaneously address the characteristics of data streams, especially the notion of infinity and concept drift.
This dissertation proposes two outlier explanation algorithms, EXOS and Ocular, for discovering outlying attributes and root cause factors of outliers, respectively. EXOS is designed to discover the outlying attributes of multidimensional outliers in data streams. Unlike existing techniques, EXOS leverages cross-correlations among data streams, accommodates varying data stream schemas and arrival rates, and effectively addresses the challenges posed by the unbounded volume of data and concept drift. The algorithm provides real-time explanations based on the local context of the outlier, derived from a time-based tumbling window. Ocular is designed to identify the root cause factors of point outliers in continuous real-time data streams. It uses a user-provided normal causal graph, which depicts the causal relationships or dependencies between variables in a system. When the value of a variable at a particular timestamp is detected as an outlier, Ocular employs this causal graph to identify the variables responsible for the anomalous value of the target variable. The algorithm simultaneously addresses the inherent characteristics of data streams: the notion of time, the notion of infinity, and concept drift. Extensive theoretical and empirical analyses were conducted to evaluate the performance of EXOS and Ocular using both real and synthetic datasets. The evaluation results show that, on average, EXOS achieves a 45.6% better F1 Score and a 7.3 times lower explanation time than existing outlying attribute algorithms. Additionally, Ocular outperforms current root cause identification algorithms by 170% in F1 Score on average, while maintaining comparable or lower explanation times.
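
To make the idea of a time-based tumbling window and of outlying attributes concrete, the following is a minimal sketch and not the EXOS algorithm itself. It assumes points arrive as `(timestamp, values)` pairs, uses a hypothetical 60-second window width, and substitutes a simple per-attribute z-score for whatever scoring EXOS actually applies. All function names and the threshold are illustrative assumptions.

```python
from collections import defaultdict
from statistics import mean, pstdev

WINDOW_SECONDS = 60  # hypothetical window width, not taken from the dissertation


def window_id(timestamp, width=WINDOW_SECONDS):
    """Map a timestamp to its tumbling window.

    Tumbling windows are fixed-size and non-overlapping, so every
    timestamp belongs to exactly one window.
    """
    return int(timestamp // width)


def group_by_window(points, width=WINDOW_SECONDS):
    """Group (timestamp, values) pairs by tumbling window id."""
    windows = defaultdict(list)
    for timestamp, values in points:
        windows[window_id(timestamp, width)].append(values)
    return dict(windows)


def outlying_attributes(outlier, context, threshold=3.0):
    """Return indices of attributes that deviate strongly from the context.

    Assumption: a plain z-score per attribute stands in for the real
    scoring; `context` is the list of other points in the same window.
    """
    flagged = []
    for i, value in enumerate(outlier):
        column = [row[i] for row in context]
        mu = mean(column)
        sigma = pstdev(column)
        if sigma == 0:
            # Constant attribute in the window: any different value stands out.
            if value != mu:
                flagged.append(i)
            continue
        if abs(value - mu) / sigma > threshold:
            flagged.append(i)
    return flagged


# Usage: four normal points fall in window 0, one later point in window 1.
points = [
    (0, (1.0, 10.0)),
    (10, (1.1, 10.2)),
    (20, (0.9, 9.8)),
    (30, (1.0, 10.1)),
    (65, (5.0, 5.0)),
]
windows = group_by_window(points)
explanation = outlying_attributes((1.0, 50.0), windows[0])
```

Here only the second attribute of the detected outlier differs from its window's local context, so a user would be pointed to that attribute alone rather than having to inspect every dimension. Because each window is processed and then discarded, memory use stays bounded even though the stream is unbounded, and statistics are always computed from recent data, which is one simple way to tolerate concept drift.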