Evaluation of Pseudo-Relevance Feedback using Wikipedia

Murtadha Aljubran, Alex James

ABSTRACT

Users have specific information needs when they search with an information retrieval system. They try to express those needs in unstructured queries, which tend to be short and, in most cases, do not describe the information needs well. Shallow language statistics, such as the probabilistic BM25 model or the Indri language model, can improve retrieval metrics like Mean Average Precision (MAP). However, such methods depend on the presence of the query terms in a retrieved document to define relevance. Query expansion is a technique that overcomes this problem by adding terms to the query that have a high probability of being important in defining the relevance of retrieved documents. In this project, we explore query expansion using the documents returned by the initial query on the same corpus on which we run the final query. Several parameters must be optimized, including the number of documents and terms used in query expansion and internal parameters of the Indri model such as smoothing factors. We also build an index of the Wikipedia corpus and use it to construct the expansion query. The question we try to answer is whether the quality of the corpus used for expansion, along with the basis for expansion, produces a significant improvement in metrics such as MAP and precision at the top 30 retrieved documents. We show that the quality and the selection criteria of expansion documents are important factors in query expansion performance that can improve these metrics.
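The pseudo-relevance feedback loop described above can be sketched as follows. This is a minimal illustration only: the frequency-based term-scoring heuristic, the function name, and the parameters are assumptions for exposition, not the Indri implementation or the scoring used in this project.

```python
from collections import Counter

def expand_query(query_terms, top_docs, num_terms=10, stopwords=frozenset()):
    """Toy pseudo-relevance feedback (assumed scoring, not Indri's):
    take the documents retrieved for the initial query, score candidate
    terms by raw frequency in those documents, and append the top-scoring
    terms to the original query."""
    counts = Counter()
    for doc in top_docs:
        for term in doc.lower().split():
            # Skip stopwords and terms already in the query.
            if term not in stopwords and term not in query_terms:
                counts[term] += 1
    expansion = [term for term, _ in counts.most_common(num_terms)]
    return list(query_terms) + expansion

# Hypothetical example: two "top-ranked" documents feed the expansion.
expanded = expand_query(
    ["fox"],
    ["the quick brown fox", "the brown dog"],
    num_terms=2,
    stopwords={"the"},
)
print(expanded)  # ['fox', 'brown', 'quick']
```

In a real system the expansion terms would also be weighted (e.g., down-weighted relative to the original query terms), and the candidate documents could come from a higher-quality external corpus such as Wikipedia rather than the target collection itself, which is the comparison this project evaluates.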