An optimal imputation algorithm for reducing bias and errors in missing data handling for AI models
Anu Maria Sebastian, David Peter, Rinu Ann Sebastian
Data is the essential fuel that powers the machine learning (ML) algorithms underlying artificial intelligence (AI). Missing data is common in most real-world datasets due to measurement errors, non-responses, and human errors during data collection, and can ultimately reduce the accuracy and reliability of AI models. Moreover, many ML algorithms are designed to work only with complete datasets. Data imputation (DI) helps create a comprehensive representation of the data, allowing AI models to learn from a richer dataset and generate more accurate results. Choosing the proper imputation technique is therefore essential to minimizing the errors and biases introduced during imputation. The difficulty in creating an imputation method that performs optimally across the entire spectrum of data stems from the disparity in the inherent characteristics of different datasets. Existing DI selection approaches are computationally intensive, demanding repetitive and exhaustive experimentation with the popular DI methods on every new dataset to evaluate their suitability, wasting significant time and effort. This research proposes an algorithm for systematically selecting an optimal imputation technique based on the intrinsic characteristics of the dataset. It associates the performance of DI algorithms with the specific characteristics of a given dataset using a characteristics chart (C-chart); the resulting DI recommendation remains valid for any other dataset with a similar C-chart. Our method thus eliminates the need for exhaustive experimentation to find the proper DI method and offers reliable imputation for real-world datasets that lack a verifiable ground truth. We demonstrate the performance of our method using a suite of six benchmark DI algorithms, eight public datasets, and two ML classifiers.
We use both Normalized Root Mean Square Error (NRMSE) and Jensen-Shannon distance (JSD) scores to evaluate the potential of the DI algorithms. We observed that the recommended DI algorithms could enhance ML classifier accuracy by up to 19.8%. We believe the proposed algorithm is a significant step towards automating the selection of an optimal DI technique based on data characteristics.
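The two evaluation scores mentioned in the abstract can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the range-based normalization for NRMSE and the histogram binning used to turn samples into distributions for the JSD are our own assumptions, as the paper may use different conventions.

```python
import numpy as np

def nrmse(true_vals, imputed_vals):
    """RMSE between ground truth and imputed values,
    normalized here by the range of the ground truth (one common convention)."""
    rmse = np.sqrt(np.mean((true_vals - imputed_vals) ** 2))
    return rmse / (true_vals.max() - true_vals.min())

def jsd(true_vals, imputed_vals, bins=20):
    """Jensen-Shannon distance (base 2, so bounded in [0, 1]) between
    histogram estimates of the two samples' distributions."""
    lo = min(true_vals.min(), imputed_vals.min())
    hi = max(true_vals.max(), imputed_vals.max())
    p, _ = np.histogram(true_vals, bins=bins, range=(lo, hi))
    q, _ = np.histogram(imputed_vals, bins=bins, range=(lo, hi))
    p = p / p.sum()
    q = q / q.sum()
    m = 0.5 * (p + q)

    def kl(a, b):
        # Kullback-Leibler divergence, skipping empty bins of a
        mask = a > 0
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))

    # JS divergence is the mean of the two KL terms; the distance is its square root
    return np.sqrt(0.5 * kl(p, m) + 0.5 * kl(q, m))
```

NRMSE measures per-value reconstruction error, while JSD compares the overall distribution of imputed values against the ground truth, so the two scores capture complementary aspects of imputation quality.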
| Year of publication: | 2025 |
|---|---|
| Authors: | Sebastian, Anu Maria ; Peter, David ; Sebastian, Rinu Ann |
| Subject: | Algorithm optimization | Bias reduction | Data imputation | Data preprocessing | Machine learning | Predictive analytics | Algorithm | Artificial intelligence | Bias | Forecasting model | Statistical error | Statistical method | Missing data | Data collection | Theory | Data quality |
Similar items by subject
- Weighting and imputation for missing data in a cost and earnings fishery survey (Lew, Daniel K., 2015)
- Forecasting mortality using imputed data : the case of Taiwan (Luo, Sheng-Feng, 2016)
- Non-linear missing data imputation for healthcare data via index-aware autoencoders (Kabir, Sadaf, 2022)