Date

2023-05-12

Creative Commons
Except where otherwise noted, this item's license is described as Attribution-NonCommercial-ShareAlike 4.0 International

Class imbalance in data is one of the most significant challenges in various industrial domains. Consequently, numerous data-balancing approaches have been proposed over the years. However, most of these methods have limitations that can affect data-driven decision-making models in critical sectors such as product quality assurance, manufacturing defect identification, and pattern recognition in healthcare diagnostics. This dissertation addresses three research questions related to data-balancing approaches: 1) What are the scopes of data-balancing approaches toward the major and minor samples? 2) What is the effect of traditional Machine Learning (ML) and Synthetic Minority Over-sampling Technique (SMOTE)-based data-balancing on imbalanced data analysis? and 3) How does imbalanced data affect the performance of Deep Learning (DL)-based models?

To answer these questions, this dissertation thoroughly analyzes existing reference works and identifies their limitations. Most existing data-balancing approaches share several limitations, such as creating noise during oversampling, removing important information during undersampling, and performing poorly on multidimensional data. SMOTE-based approaches have been the most widely used data-balancing approaches because they create synthetic samples and are easier to implement than other existing techniques. However, SMOTE also has its limitations, so it is necessary to determine whether SMOTE-based oversampling has any significant effect on the performance of ML-based data-driven models. To do that, the study conducts several hypothesis tests on several popular ML algorithms, with and without hyperparameter tuning. Based on these hypothesis tests, it is found that, in many cases on the reference dataset, balancing the imbalanced data with SMOTE approaches yields no significant performance improvement in data-driven ML models.
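For readers unfamiliar with SMOTE, the core interpolation idea can be sketched as follows. This is a minimal illustration in plain Python, not the dissertation's implementation: the function name `smote_sketch`, the neighbour count `k`, and the toy data are all assumptions made for the example.

```python
import random


def smote_sketch(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority-class neighbours.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority points by squared Euclidean distance.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        # Pick a random position on the segment from base to neighbour.
        gap = rng.random()
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


# Toy minority class with two features (made-up values).
minority = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5]]
new_points = smote_sketch(minority, n_new=6)
```

Because every synthetic point lies on a line segment between two existing minority samples, the method cannot produce values outside the range already covered by the minority class, which is one reason the generated samples may not match the original distribution.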

Additionally, the study finds that SMOTE-based synthetic samples often do not follow a Gaussian distribution or do not follow the same distribution as the original dataset. Therefore, the study suggests that Generative Adversarial Network (GAN)-based approaches could be a better alternative for generating more realistic samples and might overcome the limitations of SMOTE-based data-balancing approaches. However, GANs are often difficult to train and computationally inefficient, and because GANs were mainly developed for image data generation, very few studies demonstrate promising outcomes for GAN-based tabular data balancing. To overcome these limitations, the present study proposes several data-balancing approaches: GAN-based oversampling (GBO), Support Vector Machine (SVM)-SMOTE-GAN (SSG), and Borderline-SMOTE-GAN (BSGAN). The proposed approaches outperform existing SMOTE-based data-balancing approaches on various highly imbalanced tabular datasets and can produce realistic samples whose distribution follows that of the original dataset.
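One simple way to check whether synthetic values of a single feature follow the same distribution as the original values is the two-sample Kolmogorov-Smirnov statistic. The abstract does not say which distribution check the study used, so the sketch below is only an assumed example; `ks_statistic` and the sample values are hypothetical.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest absolute gap
    between the empirical cumulative distribution functions of two samples.
    """
    a, b = sorted(sample_a), sorted(sample_b)

    def ecdf(sorted_values, x):
        # Fraction of values less than or equal to x.
        return bisect_right(sorted_values, x) / len(sorted_values)

    # The gap can only change at observed values, so checking those suffices.
    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in a + b)


original = [1.0, 2.0, 3.0, 4.0]      # made-up feature values
synthetic = [3.0, 4.0, 5.0, 6.0]     # made-up synthetic values
same = ks_statistic(original, original)
shifted = ks_statistic(original, synthetic)
disjoint = ks_statistic([0.0, 1.0], [2.0, 3.0])
```

A value near 0 indicates closely matching empirical distributions, while a value near 1 indicates almost no overlap. Turning the statistic into a p-value requires an additional step that is omitted here.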

The dissertation later examines two case scenarios where data-balancing approaches can play crucial roles: healthcare diagnostics and additive manufacturing. For the healthcare diagnostics scenario, the study considers several chest radiography (X-ray) and Computed Tomography (CT)-scan image datasets to detect patients with COVID-19 symptoms. The study employs six different Transfer Learning (TL) approaches, namely Visual Geometry Group (VGG)16, Residual Network (ResNet)50, ResNet101, Inception-ResNet Version 2 (InceptionResNetV2), Mobile Network version 2 (MobileNetV2), and VGG19. The analysis shows that, except for the ResNet-based model, most of the TL models detect patients with COVID-19 symptoms with an accuracy of almost 99%. However, one potential drawback of the TL approaches is that the models learn from the wrong regions. For example, instead of focusing on the infected lung regions, the TL-based models focus on the non-infected regions. To address this issue, the study updates the TL-based models to reduce this incorrect localization.

Similarly, the study conducts an additional investigation on an imbalanced dataset containing defect and non-defect images of 3D-printed cylinders. The results show that TL-based models are unable to locate the defect regions, highlighting the challenge of detecting defects using imbalanced data. To address this limitation, the study proposes preprocessing-based approaches, including algorithms such as Region of Interest Net (ROIN), Region of Interest and Histogram Equalizer Net (ROIHEN), and Region of Interest with Histogram Equalization and Details Enhancer Net (ROIHEDEN) to improve the model's performance and accurately identify the defect region.
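The abstract does not describe the internals of ROIN, ROIHEN, or ROIHEDEN. As background only, the generic histogram-equalization step that the ROIHEN name refers to can be sketched in plain Python as below; the function `equalize_histogram` and the 3x3 toy image are assumptions for illustration, not the dissertation's pipeline.

```python
def equalize_histogram(image, levels=256):
    """Histogram-equalize a grayscale image given as a list of rows of
    integer intensities in the range [0, levels - 1]."""
    pixels = [v for row in image for v in row]
    hist = [0] * levels
    for v in pixels:
        hist[v] += 1
    # Cumulative count of pixels at or below each intensity.
    cdf, running = [], 0
    for count in hist:
        running += count
        cdf.append(running)
    cdf_min = next(c for c in cdf if c > 0)
    total = len(pixels)
    if total == cdf_min:
        # Constant image: nothing to spread out.
        return [row[:] for row in image]
    scale = (levels - 1) / (total - cdf_min)
    # Map each intensity through the rescaled cumulative distribution.
    return [[round((cdf[v] - cdf_min) * scale) for v in row] for row in image]


# Low-contrast toy image (made-up values).
image = [[52, 55, 61], [59, 79, 61], [85, 76, 62]]
equalized = equalize_histogram(image)
```

The mapping stretches the narrow intensity range of the input across the full range, so the darkest pixel maps to 0 and the brightest to 255 while the relative ordering of intensities is preserved.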

Furthermore, this dissertation employs various model interpretation techniques, such as Local Interpretable Model-Agnostic Explanations (LIME), SHapley Additive exPlanations (SHAP), and Gradient-weighted Class Activation Mapping (Grad-CAM), to gain insights into the features in numerical, categorical, and image data that characterize the models' predictions. These techniques are used across multiple experiments and contribute significantly to a better understanding of the models' decision-making processes.

Lastly, the study considers a small mixed dataset containing numerical, categorical, and image data. Such diverse data types are often challenging for developing data-driven ML models. The study proposes a computationally efficient and simple ML model to address these data types by leveraging the Multilayer Perceptron and Convolutional Neural Network (MLP-CNN). The proposed MLP-CNN models demonstrate superior accuracy in identifying COVID-19 patients' patterns compared to existing methods.

In conclusion, this research proposes various approaches to tackle significant challenges associated with class imbalance problems, including the sensitivity of ML models to multidimensional imbalanced data, distribution issues arising from data expansion techniques, and the need for model explainability and interpretability. By addressing these issues, this study can potentially mitigate data balancing challenges across various industries, particularly those that involve quality, defect, and pattern analysis, such as healthcare diagnostics, additive manufacturing, and product quality. By providing valuable insights into the models' decision-making process, this research could pave the way for developing more accurate and robust ML models, thereby improving their performance in real-world applications.

Keywords

Engineering, Industrial., Computer Science., Engineering, Biomedical., Health Sciences, Radiology.
