
dc.contributor.advisor: Loeffelman, Jordan
dc.contributor.author: Bain, Catherine
dc.date.accessioned: 2022-05-02T16:09:45Z
dc.date.available: 2022-05-02T16:09:45Z
dc.date.issued: 2022
dc.identifier.uri: https://hdl.handle.net/11244/335400
dc.description.abstract: When specifying a predictive model for classification, variable selection (or subset selection) is one of the most important steps for researchers to consider. Reducing the number of variables in a prediction model is vital for many reasons, including reducing the burden of data collection and increasing model efficiency and generalizability. The pool of variable selection methods from which to choose is large, and researchers often struggle to identify which method they should use given the specific features of their data set. Yet, there is a scarcity of literature available to guide researchers in their choice; the literature centers on comparing different implementations of a given method rather than comparing different methodologies under varying data features. Through a large-scale Monte Carlo simulation and an application to three psychological datasets, we evaluated the prediction error rates, area under the receiver operating characteristic curve, number of variables selected, computation times, and true positive rates of five different variable selection methods implemented in R under varying parameterizations (i.e., default vs. grid tuning): the genetic algorithm (ga), LASSO (glmnet), Elastic Net (glmnet), Support Vector Machines (svmfs), and random forest (Boruta). Performance measures did not converge upon a single best method; as such, researchers should guide their method selection by the measure of performance they deem most important. Results do, however, indicate that the genetic algorithm is the most widely applicable method, exhibiting minimum error rates in hold-out samples when compared to the other variable selection methods. Thus, if the researcher knows little about the structure of the data, implementing the genetic algorithm will provide strong results.
dc.language: en_US
dc.rights: Attribution-NonCommercial-NoDerivatives 4.0 International
dc.rights.uri: https://creativecommons.org/licenses/by-nc-nd/4.0/
dc.subject: Psychology
dc.subject: Quantitative Psychology
dc.subject: Machine Learning
dc.title: A Simulation Study Comparing the Use of Supervised Machine Learning Variable Selection Methods in the Psychological Sciences
dc.contributor.committeeMember: Ethridge, Lauren
dc.contributor.committeeMember: Shi, Dingjing
dc.date.manuscript: 2022
dc.thesis.degree: Master of Science
ou.group: Dodge Family College of Arts and Sciences::Department of Psychology
shareok.orcid: 0000-0002-2767-6882
shareok.nativefileaccess: restricted



