DOCUMENT RETRIEVAL EXPERIMENTS USING INDEXING VOCABULARIES OF VARYING SIZE. II. HASHING, TRUNCATION, DIGRAM AND TRIGRAM ENCODING OF INDEX TERMS
This paper describes the use of fixed‐length character strings for controlling the size of indexing vocabularies in reference retrieval systems. Experiments with the Cranfield test collection show that trigram encoding of words performs noticeably better than digram encoding; however, using only the least frequent digram in each term produces more acceptable results. Hashing of terms gives better performance than a vocabulary of comparable size produced by right‐hand truncation. The application of small indexing vocabularies to the sequential searching of large document files is discussed.
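The four encodings named above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the CRC-32 hash, the bucket count, the alphabetical tie-break for equally rare digrams, and the choice to count digram frequencies over a list of vocabulary terms are all assumptions made for illustration only.

```python
from collections import Counter
import zlib


def truncate(term, length):
    """Right-hand truncation: keep only the first `length` characters."""
    return term[:length]


def ngrams(term, n):
    """Overlapping character n-grams (n=2 digrams, n=3 trigrams).

    A term shorter than n is returned whole (an assumption of this sketch).
    """
    if len(term) < n:
        return [term]
    return [term[i:i + n] for i in range(len(term) - n + 1)]


def hash_term(term, buckets):
    """Map a term to one of `buckets` codes. CRC-32 is a stand-in hash,
    not the function used in the paper."""
    return zlib.crc32(term.encode("utf-8")) % buckets


def digram_counts(vocabulary):
    """Count digram occurrences over a list of terms (assumed frequency source)."""
    counts = Counter()
    for term in vocabulary:
        counts.update(ngrams(term, 2))
    return counts


def least_frequent_digram(term, counts):
    """Represent a term by its rarest digram; ties are broken alphabetically
    (an assumption of this sketch)."""
    return min(ngrams(term, 2), key=lambda d: (counts[d], d))
```

Under these assumptions, truncation and hashing each map a term to a single code, full digram or trigram encoding maps it to several, and the least frequent digram again yields a single code per term. The vocabulary size is bounded by the truncation length, the number of hash buckets, or the number of possible n-grams, respectively.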