Date

2013

In applications such as Web clicks and environmental monitoring, data arrive as streams. Each stream is an infinite sequence of data points with explicit or implicit timestamps and has special characteristics, such as transiency, uncertainty, dynamic data distribution, multi-dimensionality, asynchronous data arrival, dynamic relationships, and schema heterogeneity across data from different sources. In these applications, outliers arise for many reasons, including human error, instrument error, catastrophe, and malicious behavior. Detecting outliers effectively is critical to many data management and mining tasks. However, little research has addressed outlier detection in data stream applications, especially those involving multi-dimensional, related, heterogeneous, and asynchronous streams.


This dissertation presents two innovative outlier detection algorithms, Orion and Wadjet, that take all of these data stream characteristics into consideration. Orion is designed for applications where data come from a single stream. With the help of an evolutionary algorithm, it looks for a projected dimension that reveals the outlier nature of multi-dimensional data points, and it identifies a data point as an outlier if the point resides in a low-density region in that dimension. Wadjet is designed for applications where data come from multiple, heterogeneous, and asynchronous streams. It has two phases. In the first phase, it processes each stream independently, as Orion does. In the second phase, it captures and continuously evaluates the cross-correlation, if any, among the data points from multiple streams, and identifies a data point as an outlier if its value does not conform to the captured cross-correlation.
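
The two ideas described above can be illustrated with a minimal Python sketch. It is not the dissertation's implementation of Orion or Wadjet. The projection direction is supplied by hand rather than found by an evolutionary algorithm, the density estimate uses a Gaussian kernel over a sliding window, and the cross-stream check is a simple least-squares line fit over already aligned value pairs, so asynchronous arrival is not handled. All names and parameters (`ProjectedDensityDetector`, `violates_correlation`, `bandwidth`, `threshold`, `warmup`, `tolerance`) are hypothetical and chosen only for illustration.

```python
import math
from collections import deque


def project(point, direction):
    """Project a multi-dimensional point onto a direction (scalar coordinate)."""
    norm = math.sqrt(sum(d * d for d in direction))
    return sum(p * d for p, d in zip(point, direction)) / norm


def kernel_density(x, samples, bandwidth):
    """Gaussian kernel density estimate at x from one-dimensional samples."""
    if not samples:
        return 0.0
    coeff = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))
    return coeff * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)


class ProjectedDensityDetector:
    """Flags a point whose projection falls in a low-density region.

    Assumption: the direction is fixed by the caller, not learned.
    """

    def __init__(self, direction, window=200, bandwidth=1.0,
                 threshold=0.01, warmup=10):
        self.direction = direction
        self.window = deque(maxlen=window)  # sliding window of projections
        self.bandwidth = bandwidth
        self.threshold = threshold
        self.warmup = warmup  # do not flag until this many points are seen

    def update(self, point):
        """Process one stream point and return True if it looks like an outlier."""
        z = project(point, self.direction)
        density = kernel_density(z, self.window, self.bandwidth)
        is_outlier = len(self.window) >= self.warmup and density < self.threshold
        self.window.append(z)
        return is_outlier


def linear_fit(pairs):
    """Least-squares slope and intercept for (x, y) pairs from two streams."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    slope = sxy / sxx if sxx > 0 else 0.0
    return slope, mean_y - slope * mean_x


def violates_correlation(x, y, pairs, tolerance):
    """Return True if (x, y) deviates from the relationship fitted on pairs."""
    slope, intercept = linear_fit(pairs)
    return abs(y - (slope * x + intercept)) > tolerance
```

In this sketch, a detector built with `ProjectedDensityDetector(direction=(1.0, 1.0))` is fed one point at a time through `update`, and `violates_correlation` is called with a recent history of paired values from two streams. A real system would also need to choose the direction, bandwidth, and thresholds from the data.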


Extensive theoretical and empirical analyses have been conducted to evaluate the performance of Orion and Wadjet using real and synthetic datasets. The evaluation results show that both algorithms achieve better accuracy and execution time than state-of-the-art techniques when applied to homogeneous data stream applications. The results also show that Wadjet is effective in detecting outliers in heterogeneous data streams, which existing algorithms cannot handle.

Keywords

Outliers (Statistics), Algorithms, Data mining
