Effect of Data Scaling Methods on Machine Learning Algorithms and Model Performance

Ahsan, Md Manjurul; Mahmud, M. A. Parvez; Saha, Pritom Kumar; Gupta, Kishor Datta; Siddique, Zahed

dc.contributor.author	Ahsan, Md Manjurul
dc.contributor.author	Mahmud, M. A. Parvez
dc.contributor.author	Saha, Pritom Kumar
dc.contributor.author	Gupta, Kishor Datta
dc.contributor.author	Siddique, Zahed
dc.date.accessioned	2022-01-31T21:42:59Z
dc.date.available	2022-01-31T21:42:59Z
dc.date.issued	2021-07-24
dc.identifier.citation	Ahsan MM, Mahmud MAP, Saha PK, Gupta KD, Siddique Z. Effect of Data Scaling Methods on Machine Learning Algorithms and Model Performance. Technologies. 2021; 9(3):52. https://doi.org/10.3390/technologies9030052	en_US
dc.identifier.uri	https://hdl.handle.net/11244/334438
dc.description.abstract	Heart disease, one of the main causes of the high mortality rate around the world, requires a sophisticated and expensive diagnosis process. Recent literature has shown that machine learning approaches can help diagnose heart disease patients efficiently. However, dataset problems such as missing data, inconsistent data, and mixed data (containing inconsistent and missing values in both numerical and categorical form) are often obstacles in medical diagnosis. Such inconsistencies raise the probability of misprediction and misleading results. Data preprocessing steps like feature reduction, data conversion, and data scaling are employed to form a standard dataset; such measures play a crucial role in reducing inaccuracy in the final prediction. This paper aims to evaluate eleven machine learning (ML) algorithms: Logistic Regression (LR), Linear Discriminant Analysis (LDA), K-Nearest Neighbors (KNN), Classification and Regression Trees (CART), Naive Bayes (NB), Support Vector Machine (SVM), XGBoost (XGB), Random Forest Classifier (RF), Gradient Boost (GB), AdaBoost (AB), and Extra Tree Classifier (ET), together with six different data scaling methods: Normalization (NR), Standscale (SS), MinMax (MM), MaxAbs (MA), Robust Scaler (RS), and Quantile Transformer (QT), on a dataset comprising information on patients with heart disease. The results show that CART, along with RS or QT, outperforms all other ML algorithms with 100% accuracy, 100% precision, 99% recall, and 100% F1 score. The study outcomes demonstrate that the model’s performance varies depending on the data scaling method.	en_US
dc.description.sponsorship	Open Access fees paid for in whole or in part by the University of Oklahoma Libraries.	en_US
dc.language	en_US	en_US
dc.rights	Attribution 4.0 International	*
dc.rights.uri	https://creativecommons.org/licenses/by/4.0/	*
dc.subject	Heart Disease	en_US
dc.subject	Machine Learning Algorithm	en_US
dc.subject	Data Scaling	en_US
dc.subject	Prediction	en_US
dc.subject	Automated Model	en_US
dc.title	Effect of Data Scaling Methods on Machine Learning Algorithms and Model Performance	en_US
dc.type	Article	en_US
dc.description.peerreview	Yes	en_US
dc.identifier.doi	10.3390/technologies9030052	en_US
ou.group	Gallogly College of Engineering::School of Industrial and Systems Engineering	en_US
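
The abstract names six data scaling methods without defining them. As a minimal sketch only, the following shows the conventional definitions of four of them (SS, MM, MA, RS) applied to a single feature column, using only the Python standard library. The function names and the sample column are hypothetical and are not taken from the paper, and this record does not state which implementation the authors used. Normalization (NR) and Quantile Transformer (QT) are omitted because they are not simple per-column affine rescalings.

```python
import statistics


def standard_scale(values):
    """SS: subtract the mean, divide by the population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values) or 1.0  # assumption: leave constant columns unscaled
    return [(v - mean) / std for v in values]


def min_max_scale(values):
    """MM: map the column linearly onto the range [0, 1]."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # assumption: guard against constant columns
    return [(v - lo) / span for v in values]


def max_abs_scale(values):
    """MA: divide by the largest absolute value, giving a range within [-1, 1]."""
    largest = max(abs(v) for v in values) or 1.0
    return [v / largest for v in values]


def robust_scale(values):
    """RS: subtract the median, divide by the interquartile range (Q3 - Q1)."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = (q3 - q1) or 1.0
    return [(v - median) / iqr for v in values]


if __name__ == "__main__":
    # Hypothetical feature column with one outlier (10.0).
    column = [1.0, 2.0, 3.0, 4.0, 10.0]
    for scaler in (standard_scale, min_max_scale, max_abs_scale, robust_scale):
        print(scaler.__name__, [round(x, 3) for x in scaler(column)])
```

In this toy column the outlier compresses the min-max and max-abs outputs of the other four values toward zero, while the robust variant keeps them evenly spaced around the median. Differences of this kind are one way the choice of scaler can change what a downstream classifier sees.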