Surprisal-based algorithm for detecting anomalies in categorical data
Ossama Cherkaoui, Houda Anoun, Abderrahim Maizate
Anomaly detection is an important research area with a diverse range of real-world applications. Although many algorithms have been proposed for anomaly detection in numerical datasets, categorical and mixed datasets remain a significant challenge, primarily because they lack a natural distance metric. Consequently, the methods proposed in the literature rest on entirely different assumptions about what constitutes a categorical anomaly. This paper presents a novel categorical anomaly detection approach that makes two key contributions. First, it introduces a surprisal-based anomaly score, which provides a more accurate assessment of anomalies by considering the full distribution of categorical values. Second, the method accounts for complex correlations in the data beyond pairwise feature interactions. The resulting categorical surprisal anomaly detection algorithm (CSAD) is evaluated against six competitors. The experimental results indicate that CSAD achieves the best overall performance, with the highest average ROC-AUC and PR-AUC values of 0.8 and 0.443, respectively. Furthermore, CSAD's execution time remains satisfactory even on large, high-dimensional datasets.
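For readers unfamiliar with the term, the surprisal of a value with probability p is -log2(p), so rare categorical values carry high surprisal. The sketch below is only a naive illustration of that idea and is not the CSAD algorithm: it assumes features are independent and simply sums per-feature surprisals estimated from empirical frequencies, whereas the abstract states that CSAD also accounts for correlations beyond pairwise interactions. The function name `surprisal_scores` and the toy data are hypothetical.

```python
import math
from collections import Counter


def surprisal_scores(rows):
    """Score each row by the sum of its per-feature surprisals, -log2 p(value).

    Assumption: features are treated as independent, which is a
    simplification and NOT the method described in the paper.
    """
    n = len(rows)
    n_features = len(rows[0])
    # Empirical frequency of every value, counted separately per column.
    counts = [Counter(row[j] for row in rows) for j in range(n_features)]
    return [
        sum(-math.log2(counts[j][row[j]] / n) for j in range(n_features))
        for row in rows
    ]


# Toy data (hypothetical): the last row uses rare values in both features.
rows = [
    ("red", "small"),
    ("red", "small"),
    ("red", "small"),
    ("blue", "large"),
]
scores = surprisal_scores(rows)
```

In this toy dataset the last row receives the highest score because both of its values occur in only one of four records.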
| Year of publication: | 2025 |
|---|---|
| Authors: | Cherkaoui, Ossama ; Anoun, Houda ; Maizate, Abderrahim |
| Published in: | Data science and management : DSM. - [Amsterdam] : Elsevier B.V., ISSN 2666-7649, ZDB-ID 3108238-5. - Vol. 8.2025, 2, p. 185-195 |
| Subject: | Anomaly detection ; Categorical data ; Surprisal anomaly score ; Unsupervised learning ; Theorie ; Theory ; Algorithmus ; Algorithm ; Qualitative Methode ; Qualitative method |
Similar items by subject
- Dlugosz, Stephan, (2011)
- Repairing non-monotone ordinal data sets by changing class labels
  Pijls, Wim, (2015)
- Explaining and predicting customer churn by monotonic rules induced from ordinal data
  Szeląg, Marcin, (2024)