Extraction of Type Style Based Meta-Information from Imaged Documents

This paper considers the extraction of meta-information from printed documents without performing OCR. It can be verified statistically that important terms in technical articles are mostly printed in italic, bold, or all-capital style. A fast approach to detecting these styles is proposed, based on global shape heuristics that hold for them in any font. Important words in a document are sometimes also printed in a larger size, so an approach for determining font size is presented as well. Detecting type styles helps improve OCR performance, especially when reading italicized text. Identifying word type styles and font size is also discussed as a means of extracting (i) different logical labels and (ii) important terms from the document. Experimental results on the performance of the approach are presented for a large number of good-quality as well as degraded document images.
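
The abstract does not spell out the shape heuristics it relies on. As an illustration only of what OCR-free italic detection can look like, the sketch below estimates the slant of a binarized word image by trying a few shear factors and keeping the one that gives the sharpest column profile. This is a generic slant-estimation technique and is not claimed to be the authors' method; the function names, the candidate shear values, and the threshold are all assumptions made for this example.

```python
# Illustrative sketch only: generic shear-based slant estimation,
# not the method of the catalogued paper. All names are hypothetical.

def column_profile(img, shear):
    """Count ink pixels per column after de-shearing the image.

    img is a list of rows (top to bottom) of 0/1 pixels.
    Each row is shifted left in proportion to its height above the
    bottom row, which straightens right-leaning strokes.
    """
    h, w = len(img), len(img[0])
    pad = int(abs(shear) * h) + 1          # room for shifted pixels
    profile = [0] * (w + 2 * pad)
    for r, row in enumerate(img):
        shift = round(shear * (h - 1 - r))  # bottom row is not shifted
        for c, px in enumerate(row):
            if px:
                profile[c - shift + pad] += 1
    return profile


def estimate_shear(img, candidates=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Return the candidate shear whose profile is most concentrated."""
    def sharpness(s):
        # Sum of squares rewards tall, narrow peaks (upright strokes).
        return sum(v * v for v in column_profile(img, s))
    return max(candidates, key=sharpness)


def looks_italic(img, threshold=0.2):
    """Flag a word image as italic when its estimated shear is large.

    The threshold is an assumed value, not taken from the paper.
    """
    return estimate_shear(img) >= threshold


# A vertical stroke and a right-leaning stroke, 5 pixels tall.
upright = [[0, 0, 1, 0, 0] for _ in range(5)]
slanted = [[1 if c == 4 - r else 0 for c in range(5)] for r in range(5)]

print(looks_italic(upright), looks_italic(slanted))
```

Comparable global measures, such as ink density for bold or uniform character height for all-capital words, could be computed on the same binarized input, again as assumptions rather than as the paper's stated procedure.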


Year of publication:	2018
Authors:	Chaudhuri, B.B.
Other Persons:	Garain, U. (contributor)
Publisher:	[2018]: [S.l.] : SSRN

Extent:	1 online resource (14 p.)
Series:	Computer Science Preprint Archive ; Vol. 2002, Issue 7, pp 670-683
Type of publication:	Book / Working Paper
Language:	English
Notes:	According to SSRN, the original version of the document was created in July 2002
Source:	ECONIS - Online Catalogue of the ZBW

Persistent link: https://www.econbiz.de/10012927335