Classifying discoveries: Implementing a generalized multiple testing protocol for exploratory data analysis
Abstract
In large-scale exploratory data analysis, one objective is to discover interesting attributes, that is, attributes worthy of further study. Standard statistical analysis employs a multiple testing procedure that aims to discover as many attributes as possible subject to the constraint that an error rate, such as the false discovery rate (FDR), is controlled at a prespecified level. However, the objective of this statistical protocol need not align with the objectives of the study at hand: discovered attributes need not be interesting, and interesting attributes need not be discovered. This work provides a new statistical method that allows the nature of the follow-up analysis to be considered when determining which attributes are discovered. The methodology is illustrated on a dataset in which the objective is to discover bacterial species near the roots of wheat plants that are associated with plant health and to classify the discovered species into groups based on the nature (positive or negative) and degree (strong or weak) of their association. This definition of interesting leads to a procedure that ranks attributes according to their local misclassification rates (LMCR). Theoretical and numerical results show that the proposed LMCR procedure outperforms the current standard procedure: it has a smaller misclassification rate among discoveries while still controlling the FDR. The new method also compares favorably with the traditional approach when applied to real-world datasets, including the aforementioned plant health data, where expectation-maximization (EM) algorithms are used to estimate unknown parameters.
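The idea of ranking attributes by a local error rate and then choosing a cutoff can be illustrated with a minimal Python sketch. This is not the dissertation's LMCR procedure, whose details are not given in the abstract. It assumes the per-attribute local rates are already available (for example, from a fitted model) and uses a running-average cutoff, which is one common rule for rankings of this kind. The function name `select_by_local_rate` and the example values are hypothetical.

```python
from typing import List, Sequence


def select_by_local_rate(local_rates: Sequence[float], alpha: float) -> List[int]:
    """Return indices of selected attributes, smallest local rate first.

    Hypothetical helper: ranks attributes by an assumed, precomputed local
    error rate and keeps the largest top-k set whose average rate is at
    most alpha. Illustrative only; not the author's method.
    """
    # Rank attributes from smallest to largest local rate.
    order = sorted(range(len(local_rates)), key=lambda i: local_rates[i])

    running_sum = 0.0
    best_k = 0
    for k, idx in enumerate(order, start=1):
        running_sum += local_rates[idx]
        # The average rate of the top-k set estimates the error rate
        # among those k discoveries.
        if running_sum / k <= alpha:
            best_k = k

    return order[:best_k]


# Made-up local rates for five attributes.
example_rates = [0.01, 0.5, 0.03, 0.9, 0.08]
selected = select_by_local_rate(example_rates, alpha=0.05)
```

With these made-up rates, the three attributes with the smallest local rates are kept, because their average rate (0.04) is below the 0.05 level, while adding a fourth would push the average above it.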
Collections
- OSU Dissertations