On Development and Statistical Analysis of a Corpus for Printed and Handwritten Mathematical Expressions
This paper deals with the development of a corpus for a diagram language used to write mathematical expressions. This language involves a large set of special symbols and Greek letters in addition to English letters and digits. The symbols are not standardized and show wide variations in font size and style. Moreover, mathematical notation conveys meaning through a variety of spatial relationships among the symbols. Scientific and technical documents (printed in English, Devnagari and Bangla) containing mathematics are considered for corpus construction. Both machine-printed and handwritten expressions are collected. Handwritten expressions are stored in a format different from that used for printed mathematics. A statistical analysis of the corpus is also carried out. The computed statistics are presented in tabular form, and their usefulness for automatic processing of mathematics is discussed.