A Simulation Study Comparing the Use of Supervised Machine Learning Variable Selection Methods in the Psychological Sciences

Bain, Catherine

Date

2022

Abstract

When specifying a predictive model for classification, variable selection (or subset selection) is one of the most important steps for researchers to consider. Reducing the number of variables a prediction model requires is vital for many reasons, including reducing the burden of data collection and increasing model efficiency and generalizability. The pool of variable selection methods from which to choose is large, and researchers often struggle to identify which method to use given the specific features of their dataset. Yet there is a scarcity of literature to guide researchers in their choice; the existing literature centers on comparing different implementations of a given method rather than comparing different methodologies under varying data features. Through a large-scale Monte Carlo simulation and an application to three psychological datasets, we evaluated the prediction error rates, area under the receiver operating characteristic curve, number of variables selected, computation times, and true positive rates of five variable selection methods in R under varying parameterizations (i.e., default vs. grid tuning): the genetic algorithm (ga), LASSO (glmnet), Elastic Net (glmnet), Support Vector Machines (svmfs), and random forest (Boruta). Performance measures did not converge on a single best method; researchers should therefore base their method selection on the measure of performance they deem most important. Results do, however, indicate that the genetic algorithm is the most widely applicable method, exhibiting the lowest error rates in hold-out samples when compared with the other variable selection methods. Thus, if the researcher knows little about the features of the data, implementing the genetic algorithm will provide strong results.