The Dirichlet-Multinomial Model for Bayesian Information Retrieval
In the standard probabilistic approach to information retrieval, it is commonly assumed that the occurrences of query terms are conditionally independent given a relevant or non-relevant document. Together with the Probability Ranking Principle, this assumption has led to the development of the binary independence model for information retrieval. Owing to its simplicity and its self-updating property through the use of conjugate priors, Bayesian information retrieval using the binary independence model has emerged as an appealing approach. In a real-world context, however, the assumption of conditional independence of query terms rarely holds. In this paper, we develop a retrieval model based on the Dirichlet-Multinomial distribution and show that this model provides an exact characterization of the Bayesian information retrieval process without any assumption of conditional independence. We invoke classical statistical methods to motivate the selection of initial parameters of the prior distribution. We illustrate that Bayesian information retrieval under the Dirichlet-Multinomial model can easily implement the self-updating process of Bayesian information retrieval based on the binary independence model. Finally, we conclude with a discussion of how frequency weightings of query terms may be used to improve model performance.
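To make the self-updating property concrete, the standard Dirichlet-Multinomial conjugate update can be sketched as follows. The notation here (term probabilities \theta over K terms, prior parameters \alpha_i, observed term counts n_i) is assumed for illustration and is not necessarily the notation used in the paper.

\[
\theta \sim \mathrm{Dir}(\alpha_1, \dots, \alpha_K), \qquad
(n_1, \dots, n_K) \mid \theta \sim \mathrm{Mult}(N, \theta)
\;\Longrightarrow\;
\theta \mid n_1, \dots, n_K \sim \mathrm{Dir}(\alpha_1 + n_1, \dots, \alpha_K + n_K),
\]
\[
\mathbb{E}[\theta_i \mid n_1, \dots, n_K] = \frac{\alpha_i + n_i}{\sum_{j=1}^{K} \alpha_j + N},
\qquad N = \sum_{j=1}^{K} n_j .
\]

Because the posterior is again a Dirichlet distribution, it can serve as the prior for the next batch of observations, so each update amounts to adding the new counts to the current parameters.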