Compression-based distance between string data and its application to literary work classification based on authorship
Many document classification and clustering algorithms are well known. This paper focuses on compression-based distances between documents, in particular the normalized compression distance (NCD), a popular and powerful metric between strings. A new distance <InlineEquation ID="IEq1"> <EquationSource Format="TEX">$$D_\alpha $$</EquationSource> </InlineEquation> between strings, with one parameter <InlineEquation ID="IEq2"> <EquationSource Format="TEX">$$\alpha $$</EquationSource> </InlineEquation>, is designed on the basis of the NCD, and several properties of <InlineEquation ID="IEq3"> <EquationSource Format="TEX">$$D_\alpha $$</EquationSource> </InlineEquation> are studied. It is also proved that every pair of strings <InlineEquation ID="IEq4"> <EquationSource Format="TEX">$$(x,y)$$</EquationSource> </InlineEquation> can be plotted on the contour graphs of the NCD and <InlineEquation ID="IEq5"> <EquationSource Format="TEX">$$D_\alpha $$</EquationSource> </InlineEquation> (and of some other compression-based distances) in a two-dimensional plane.
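For readers unfamiliar with the NCD, the following minimal sketch computes it in its standard form, NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C denotes compressed length. The choice of zlib as the compressor and the helper names `compressed_size` and `ncd` are assumptions made for illustration; the abstract does not state which compressor the paper uses, and the new distance with parameter alpha is not sketched here because its definition is not given in this text.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed data (assumed compressor)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings."""
    cx = compressed_size(x)
    cy = compressed_size(y)
    cxy = compressed_size(x + y)  # compressed length of the concatenation
    return (cxy - min(cx, cy)) / max(cx, cy)
```

With a real compressor the value is only approximately within [0, 1]: it is close to 0 for near-identical strings and close to 1 for strings that share little structure.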
The distance <InlineEquation ID="IEq6"> <EquationSource Format="TEX">$$D_\alpha (x,y)$$</EquationSource> </InlineEquation> is defined to take a relatively small value if a string <InlineEquation ID="IEq7"> <EquationSource Format="TEX">$$x$$</EquationSource> </InlineEquation> is a portion of a string <InlineEquation ID="IEq8"> <EquationSource Format="TEX">$$y$$</EquationSource> </InlineEquation>. Literary works <InlineEquation ID="IEq9"> <EquationSource Format="TEX">$$x$$</EquationSource> </InlineEquation> and <InlineEquation ID="IEq10"> <EquationSource Format="TEX">$$y$$</EquationSource> </InlineEquation> are usually assumed to be written by the same author(s) if <InlineEquation ID="IEq11"> <EquationSource Format="TEX">$$x$$</EquationSource> </InlineEquation> is a portion of <InlineEquation ID="IEq12"> <EquationSource Format="TEX">$$y$$</EquationSource> </InlineEquation>. Therefore, literary work classification based on authorship may be an appropriate benchmark for the performance of <InlineEquation ID="IEq13"> <EquationSource Format="TEX">$$D_\alpha $$</EquationSource> </InlineEquation>. An algorithm that uses the contour graphs to determine an appropriate value of <InlineEquation ID="IEq14"> <EquationSource Format="TEX">$$\alpha $$</EquationSource> </InlineEquation> is presented; it does not require knowing the author of each work. According to experimental results on the area under the receiver operating characteristic (ROC) curve and on clustering, <InlineEquation ID="IEq15"> <EquationSource Format="TEX">$$D_\alpha $$</EquationSource> </InlineEquation> with such an appropriate value of <InlineEquation ID="IEq16"> <EquationSource Format="TEX">$$\alpha $$</EquationSource> </InlineEquation> performs somewhat better in literary work classification based on authorship. Copyright Springer-Verlag 2013
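As a rough illustration of how an area under the ROC curve can be obtained from pairwise distances, the sketch below treats each pair of works as a binary example (same author or not) and scores it by its distance. The function name `auc_from_distances`, the pairwise setup, and the toy numbers are assumptions for illustration only; the abstract does not describe the paper's exact evaluation protocol.

```python
def auc_from_distances(distances, same_author):
    """AUC for the rule 'smaller distance means same author'.

    Computed as the fraction of (same-author pair, different-author pair)
    combinations in which the same-author pair has the smaller distance,
    counting ties as one half.
    """
    pos = [d for d, s in zip(distances, same_author) if s]
    neg = [d for d, s in zip(distances, same_author) if not s]
    if not pos or not neg:
        raise ValueError("need at least one pair of each kind")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p < n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy, made-up distances: two same-author pairs, two different-author pairs.
toy_auc = auc_from_distances([0.1, 0.2, 0.8, 0.9], [True, True, False, False])
```

An AUC of 1.0 means every same-author pair is closer than every different-author pair, and 0.5 corresponds to chance-level separation.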