An optimal imputation algorithm for reducing bias and errors in missing data handling for AI models
Anu Maria Sebastian, David Peter, Rinu Ann Sebastian
Data is the essential fuel that powers the machine learning (ML) algorithms underlying artificial intelligence (AI). Missing data is common in most real-world datasets due to measurement errors, non-responses, and human errors during data collection, and can ultimately reduce the accuracy and reliability of AI models. Moreover, many ML algorithms are designed to work only with complete datasets. Data imputation (DI) helps create a comprehensive representation of the data, allowing AI models to learn from a richer dataset and generate more accurate results. Choosing the proper imputation technique is therefore essential to minimizing the errors and biases introduced during imputation. The difficulty in creating an imputation method that performs optimally across the entire spectrum of data stems from the disparity in the inherent characteristics of different datasets. Existing DI selection approaches are computationally intensive, demanding repetitive and exhaustive experimentation with the popular DI methods on every new dataset to evaluate their suitability, wasting significant time and effort. This research proposes an algorithm for systematically selecting an optimal imputation technique based on the intrinsic characteristics of the dataset. It associates the performance of DI algorithms with the specific characteristics of a given dataset using a characteristics chart (C-chart); the resulting DI recommendation remains valid for any other dataset with a similar C-chart. Our method thus eliminates the need for exhaustive experimentation to find the proper DI method and offers reliable imputation for real-world datasets that lack a verifiable ground truth. We demonstrate the performance of our method using a suite of six benchmark DI algorithms, eight public datasets, and two ML classifiers.
We use both Normalized Root Mean Square Error (NRMSE) and Jensen-Shannon distance (JSD) scores to evaluate the potential of the DI algorithms. We observed that the recommended DI algorithms could enhance ML classifier accuracy by up to 19.8%. We believe the proposed algorithm is a significant step towards automating the selection of an optimal DI technique based on data characteristics.
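The two evaluation scores mentioned in the abstract can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the range-based normalization for NRMSE and the histogram binning used to turn samples into distributions for the JSD are our own assumptions, as the paper may use different conventions.

```python
import numpy as np

def nrmse(true_vals, imputed_vals):
    """RMSE between ground truth and imputed values,
    normalized here by the range of the ground truth (one common convention)."""
    rmse = np.sqrt(np.mean((true_vals - imputed_vals) ** 2))
    return rmse / (true_vals.max() - true_vals.min())

def jsd(true_vals, imputed_vals, bins=20):
    """Jensen-Shannon distance (base 2, so bounded in [0, 1]) between
    histogram estimates of the two samples' distributions."""
    lo = min(true_vals.min(), imputed_vals.min())
    hi = max(true_vals.max(), imputed_vals.max())
    p, _ = np.histogram(true_vals, bins=bins, range=(lo, hi))
    q, _ = np.histogram(imputed_vals, bins=bins, range=(lo, hi))
    p = p / p.sum()
    q = q / q.sum()
    m = 0.5 * (p + q)

    def kl(a, b):
        # Kullback-Leibler divergence, skipping empty bins of a
        mask = a > 0
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))

    # JS divergence is the mean of the two KL terms; the distance is its square root
    return np.sqrt(0.5 * kl(p, m) + 0.5 * kl(q, m))
```

NRMSE measures per-value reconstruction error, while JSD compares the overall distribution of imputed values against the ground truth, so the two scores capture complementary aspects of imputation quality.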
| Year of publication: | 2025 |
|---|---|
| Authors: | Sebastian, Anu Maria ; Peter, David ; Sebastian, Rinu Ann |
| Subject: | Algorithm optimization | Bias reduction | Data imputation | Data preprocessing | Machine learning | Predictive analytics | Algorithm | Artificial intelligence | Bias | Forecasting model | Statistical error | Statistical method | Missing data | Data collection | Theory | Data quality |
Similar items by subject
- Weighting and imputation for missing data in a cost and earnings fishery survey (Lew, Daniel K., 2015)
- Forecasting mortality using imputed data : the case of Taiwan (Luo, Sheng-Feng, 2016)
- Non-linear missing data imputation for healthcare data via index-aware autoencoders (Kabir, Sadaf, 2022)