Enhancing OCR in historical documents with complex layouts through machine learning

Abstract This paper addresses the challenge of processing and extracting information from large quantities of printed serial sources from the 19th century, which have remained largely untapped because existing extraction techniques are inadequate. We focus on the Hof- und Staatsschematismus of Habsburg Central Europe, a comprehensive record published between 1702 and 1918 that documents the hierarchy of the Habsburg civil service and the evolution of its central administration over two centuries. Our approach invests heavily in machine learning-driven layout detection before the OCR step. We generated synthetic data mimicking the style of the Hof- und Staatsschematismus for the initial training of a Faster R-CNN model, then fine-tuned the model on a smaller dataset of manually annotated historical documents. Subsequently, we optimised Tesseract OCR for our document style to improve the combined structure extraction and OCR process. Our evaluation demonstrates significant improvements in the OCR performance metrics character error rate (CER) and word error rate (WER): the combined structure detection and fine-tuned OCR process reduces CER by 15.68 percentage points and WER by 19.95 percentage points. These findings underscore the potential of ML techniques to facilitate the extraction and analysis of historical documents.
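
The abstract reports its results as CER and WER. As a minimal sketch of how these two metrics are commonly defined (edit distance divided by the length of the reference, at character and word level respectively), the following Python may help. It is not the authors' evaluation code; the function names `edit_distance`, `cer`, and `wer` are hypothetical, and whitespace tokenisation for WER is an assumption.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or token lists)."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: character-level edits per reference character."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word error rate: word-level edits per reference word.

    Assumption: words are separated by whitespace.
    """
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)
```

Under these definitions, a reduction stated in percentage points is the absolute difference between two such rates, each multiplied by 100.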

Year of publication:	2025
Authors:	Fleischhacker, David ; Kern, Roman ; Göderle, Wolfgang
Published in:	International Journal on Digital Libraries. - London : Springer London, ISSN 1432-1300. - Vol. 26.2025, 1
Publisher:	London : Springer London
Subject:	PDF extraction | Layout detection | OCR fine-tuning | Synthetic training data | Document analysis and recognition

Type of publication:	Article
Type of publication (narrower categories):	Article
Language:	English
Other identifiers:	10.1007/s00799-025-00413-z [DOI]
Source:	EconStor

Persistent link: https://www.econbiz.de/10015409590