Enhancing OCR in historical documents with complex layouts through machine learning
Abstract: This paper explores the challenge of processing and extracting information from large quantities of printed serial sources from the 19th century, which have remained largely untapped because existing extraction techniques are inadequate. We focus on Habsburg Central Europe’s Hof- und Staatsschematismus, a comprehensive record published between 1702 and 1918 that documents the hierarchy of the Habsburg civil service and the evolution of its central administration over two centuries. Our approach invests significantly in machine learning-driven layout detection before the OCR step. We generated synthetic data mimicking the style of the Hof- und Staatsschematismus for initial training of a Faster R-CNN model, then fine-tuned the model on a smaller dataset of manually annotated historical documents. Subsequently, we optimised Tesseract OCR for our document style to enhance the combined structure extraction and OCR process. Our evaluation demonstrates significant improvements in two OCR performance metrics, character error rate (CER) and word error rate (WER): the combined structure detection and fine-tuned OCR process reduces error rates by 15.68 percentage points for CER and 19.95 percentage points for WER. These findings underscore the potential of ML techniques in facilitating the extraction and analysis of historical documents.
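The abstract reports its gains as percentage-point reductions in CER and WER. As a minimal sketch of how these two metrics are commonly defined (edit distance divided by the length of the reference, counted in characters or in words), the following may help; it is not the authors' evaluation code, and the function names and sample strings are invented for illustration.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (characters or words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: character edits per reference character."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word error rate: word edits per reference word (whitespace split)."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


# Hypothetical strings, not taken from the paper's data.
reference = "Hof- und Staatsschematismus"
baseline = "Hof- nnd Staatsschematismns"   # imagined output before tuning
improved = "Hof- und Staatsschematismns"   # imagined output after tuning

# A percentage-point change is the plain difference of two rates in percent.
cer_gain_pp = 100 * (cer(reference, baseline) - cer(reference, improved))
wer_gain_pp = 100 * (wer(reference, baseline) - wer(reference, improved))
print(f"CER reduced by {cer_gain_pp:.2f} percentage points")
print(f"WER reduced by {wer_gain_pp:.2f} percentage points")
```

A percentage-point reduction is an absolute difference between two rates, so a drop from 20% to 5% CER would be 15 percentage points, not a 15% relative improvement.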
Year of publication: | 2025 |
---|---|
Authors: | Fleischhacker, David ; Kern, Roman ; Göderle, Wolfgang |
Published in: | International Journal on Digital Libraries. - London : Springer London, ISSN 1432-1300. - Vol. 26.2025, 1 |
Publisher: | London : Springer London |
Subject: | PDF extraction ; Layout detection ; OCR fine-tuning ; Synthetic training data ; Document analysis and recognition |