Learning to Estimate Query Difficulty
Including Applications to Missing Content Detection and Distributed Information Retrieval
Elad Yom-Tov, Shai Fine, David Carmel, Adam Darlow
IBM Haifa Research Labs
SIGIR 2005

Abstract
- Novel learning methods are used for estimating the quality of results returned by a search engine in response to a query.
- The estimation is based on the agreement between the top results of the full query and the top results of its sub-queries.
- Quality estimates are useful for several applications, including improving retrieval, detecting queries for which no relevant content exists in the document collection, and distributed information retrieval.

Introduction (1/2)
- Many IR systems suffer from a radical variance in performance across queries. Estimating query difficulty is an attempt to quantify the quality of the results returned by a given system for a query.
- Reasons for estimating query difficulty:
  - Feedback to the user, who can rephrase "difficult" queries.
  - Feedback to the search engine, which can invoke alternative retrieval strategies for different queries.
  - Feedback to the system administrator, who can identify queries related to a specific subject and expand the document collection accordingly.
  - Distributed information retrieval.

Introduction (2/2)
- The observation and motivation: queries that are answered well are those whose query terms agree on most of the returned documents, where agreement is measured by the overlap between the top results.
- Difficult queries are those where:
  - The query terms cannot agree on the top results, or
  - Most of the terms do agree, except for a few outliers.
- An example TREC query: "What impact has the chunnel (the Channel Tunnel) had on the British economy and/or the life style of the British?"

Related Work (1/2)
- In the Robust track of TREC 2004, systems were asked to rank the topics by predicted difficulty. The goal is eventually to use such predictions for topic-specific processing.
- Prediction methods suggested by the participants:
  - Measuring clarity, based on the system's scores for the top results.
  - Analyzing the ambiguity of the query terms.
  - Learning a predictor, using old TREC topics as training data.
- (Ounis, 2004) showed that an IDF-based predictor is positively related to query precision.
- (Diaz, 2004) used the temporal distribution of the documents, together with their content, to improve the prediction of average precision (AP) for a query.

Related Work (2/2)
- The Reliable Information Access (RIA) workshop investigated the reasons for system performance variance across queries.
- Ten failure categories were identified, four of which are due to emphasizing only partial aspects of the query.
- One of the conclusions of this workshop: "comparing a full topic ranking against ranking based on only one aspect of the topic will give a measure of the importance of that aspect to the retrieved set".

Estimating Query Difficulty
- Query terms are defined as the keywords and the lexical affinities of the query.
- Features used for learning (see the sketch after this list):
  - The overlap between the top results of each sub-query and the top results of the full query, measured by the κ-statistic.
  - The rounded logarithm of the document frequency, log(DF), of each of the sub-queries.
- Two challenges for learning:
  - The number of sub-queries is not constant, so a canonical representation is needed.
  - The sub-queries are not ordered.
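
The overlap and log(DF) features can be computed directly from ranked result lists. Below is a minimal Python sketch, assuming a search(query, n) callable that returns the top-n document IDs and a doc_frequency(subquery) lookup; both names, and the use of base-10 logarithms, are assumptions for illustration rather than the paper's exact implementation.

    import math

    def subquery_features(search, doc_frequency, query, subqueries, n=10):
        """For each sub-query, compute (rounded log document frequency,
        overlap of its top-n results with the full query's top-n)."""
        full_top = set(search(query, n))
        features = []
        for sq in subqueries:
            sub_top = set(search(sq, n))
            overlap = len(full_top & sub_top)              # 0..n shared documents
            log_df = round(math.log10(max(doc_frequency(sq), 1)))
            features.append((log_df, overlap))
        return features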

Query Estimator Using a Histogram (1/2)
- The basic procedure:
  1. Find the top N results for the full query and for each sub-query.
  2. Build a histogram of the overlaps, h(i, j), to form a feature vector (see the sketch below):
     - Values of log(DF) are split into 3 discrete bins: 0-1, 2-3, and 4+.
     - h(i, j) counts the sub-queries whose log(DF) falls in bin i and whose overlap is j.
     - The rows of h(i, j) are concatenated into a single feature vector.
  3. Compute the linear weight vector c for prediction.
- An example, for a query with 4 sub-queries:
  log(DF(n)) = (0, 1, 1, 2), overlap(n) = (2, 0, 0, 1),
  h = (0 0 1, 2 0 0, 0 1 0)
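
A sketch of the histogram construction, reusing the (log_df, overlap) pairs from the previous sketch. The binning of log(DF) follows the slide; note that the row/column convention in the slide's worked example is not fully spelled out, so the exact layout of the flattened vector here is an assumption.

    import numpy as np

    def df_bin(log_df):
        """Map a rounded log(DF) value to the slide's 3 bins: 0-1, 2-3, 4+."""
        if log_df <= 1:
            return 0
        if log_df <= 3:
            return 1
        return 2

    def histogram_feature(pairs, max_overlap=2):
        """pairs: one (log_df, overlap) tuple per sub-query.
        Returns the rows of h(i, j) concatenated into one feature vector."""
        h = np.zeros((3, max_overlap + 1))
        for log_df, overlap in pairs:
            h[df_bin(log_df), min(overlap, max_overlap)] += 1
        return h.flatten()

    # The slide's example: 4 sub-queries with the listed log(DF) and overlap values.
    features = histogram_feature([(0, 2), (1, 0), (1, 0), (2, 1)])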

Query Estimator Using a Histogram (2/2)
- Two additional features:
  - The score of the top-ranked document.
  - The number of words in the query.
- Estimate the linear weight vector c using the Moore-Penrose pseudo-inverse:
  c = (H H^T)^(-1) H t^T
  where H is the matrix whose columns are the feature vectors of the training queries, and t is a vector of the target measure (P@10 or MAP) for the training queries. (H and t can be modified according to the objective.)
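
In code, the least-squares fit and the prediction are one line each with NumPy. A minimal sketch, assuming H holds one feature vector per column and t holds the corresponding P@10 or MAP values; pinv is used instead of a plain inverse for numerical robustness.

    import numpy as np

    def fit_linear_estimator(H, t):
        """H: (n_features, n_queries) matrix, feature vectors as columns.
        t: (n_queries,) vector of target measures (P@10 or MAP).
        Returns the weight vector c = (H H^T)^-1 H t^T."""
        return np.linalg.pinv(H @ H.T) @ H @ t

    def predict_quality(c, feature_vector):
        """Predicted quality of a new query from its feature vector."""
        return float(c @ feature_vector)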

Query Estimator Using a Modified Decision Tree (1/2)
- Useful when the data is sparse, i.e. when queries are too short.
- A binary decision tree:
  - Pairs of (overlap, log(DF)) of the sub-queries form the features.
  - Each node consists of a weight vector, a threshold, and a score.
- An example: [tree diagram in the original slides]

Query Estimator Using a Modified Decision Tree (2/2)
- The concept of a Random Forest: better decision trees can be obtained by training a multitude of trees, each in a slightly different manner or on slightly different data.
- The AdaBoost algorithm is applied to resample the training data.
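
The paper's modified decision tree is custom, but the resampling idea can be illustrated with off-the-shelf components. The sketch below is a stand-in using recent scikit-learn (AdaBoost over shallow regression trees), not the authors' implementation.

    from sklearn.ensemble import AdaBoostRegressor
    from sklearn.tree import DecisionTreeRegressor

    def train_tree_ensemble(X, y, n_trees=50):
        """X: (n_queries, n_features) feature matrix; y: target measure.
        Each boosting round refits a tree on reweighted (resampled)
        training data, as the slide describes."""
        model = AdaBoostRegressor(
            estimator=DecisionTreeRegressor(max_depth=3),
            n_estimators=n_trees,
        )
        return model.fit(X, y)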

Experiment and Evaluation (1/2)
- The IR system is Juru.
- Two document collections:
  - TREC-8: 528,155 documents, 200 topics.
  - WT10G: 1,692,096 documents, 100 topics.
- Four-fold cross-validation, measured by Kendall's τ coefficient (see the example below).
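
Kendall's τ measures how well the predicted difficulty ranking of the topics agrees with their actual ranking. For example, with SciPy (the numbers here are toy values):

    from scipy.stats import kendalltau

    predicted = [0.31, 0.12, 0.57, 0.44]   # estimated quality per topic (toy values)
    actual    = [0.25, 0.05, 0.60, 0.50]   # measured P@10 or MAP per topic (toy values)

    tau, p_value = kendalltau(predicted, actual)
    print(f"Kendall's tau = {tau:.3f}")    # 1.0 means identical rankings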

Experiment and Evaluation (2/2)
- Compared with several other algorithms:
  - Estimation based on the score of the top result.
  - Estimation based on the average score of the top ten results.
  - Estimation based on the standard deviation of the IDF values of the query terms.
  - Estimation based on learning an SVM for regression.

Application 1: Improving IR Using Query Estimation (1/2)
- Selective automatic query expansion:
  - Adds terms to the query, based on frequently appearing terms in the top retrieved documents.
  - This only works for easy queries.
  - The same features are used to train an SVM classifier that decides when to expand (see the sketch below).
- Deciding which part of the topic should be used:
  - TREC topics contain two parts: a short title and a longer description.
  - Some topics that are not answered well by the description part are better answered by the title part.
  - Difficult topics use the title part; easy topics use the description.

Application 1: Improving IR Using Query Estimation (2/2)
[results shown as figures in the original slides]
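
A sketch of the selective-expansion decision, assuming the same feature vectors as above and training labels that mark whether expansion actually helped each training query; the engine object and its expand_query/search methods are hypothetical stand-ins.

    from sklearn.svm import SVC

    def train_expansion_classifier(X, helped):
        """X: query feature vectors; helped: 1 if automatic expansion
        improved the training query, else 0."""
        return SVC().fit(X, helped)

    def retrieve(engine, query, features, clf):
        # Expand only the queries the classifier predicts will benefit
        # (i.e., the "easy" queries, per the slide).
        if clf.predict([features])[0] == 1:
            query = engine.expand_query(query)   # hypothetical engine API
        return engine.search(query)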

Application 2: Detecting Missing Content (1/2)
- Missing content queries (MCQs) are those that have no relevant document in the collection.
- Experiment method:
  - 166 MCQs were created artificially from 400 TREC queries (200 TREC topics, each consisting of a title and a description).
  - Ten-fold cross-validation.
  - A tree-based classifier is trained to separate MCQs from non-MCQs.
  - A query difficulty estimator may or may not be used as a pre-filter that removes easy queries before the MCQ classifier (see the sketch below).

Application 2: Detecting Missing Content (2/2)
[results shown as figures in the original slides]
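
The pre-filter logic is simple: a query estimated to be easy almost certainly has relevant content, so only hard-looking queries reach the MCQ classifier. A minimal sketch; the threshold value and the two model callables are assumptions for illustration.

    def detect_missing_content(features, difficulty_estimator, mcq_classifier,
                               easy_threshold=0.5):
        """Return True if the query appears to have no relevant content."""
        # Pre-filter: a high estimated quality means an easy query,
        # which is assumed to have relevant content in the collection.
        if difficulty_estimator(features) >= easy_threshold:
            return False
        # Otherwise let the tree-based classifier decide MCQ vs. non-MCQ.
        return bool(mcq_classifier(features))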

Application 3: Merging the Results of Distributed Retrieval (1/2)
- It is difficult to rerank documents coming from different datasets, since the scores are local to each specific dataset.
- CORI (W. Croft, 1995) is one of the state-of-the-art algorithms for distributed retrieval; it uses an inference network to rank the collections.
- Applying the estimator to this problem (see the sketch below):
  - A query estimator is trained for each dataset.
  - The estimated difficulty is used for weighting the scores.
  - The weighted scores are merged to build the final ranking.
  - Ten-fold cross-validation.
  - Only minimal information is supplied by the search engine.

Application 3: Merging the Results of Distributed Retrieval (2/2)
- Selective weighting: all queries are clustered (2-means) based on their estimations for each of the datasets. In one cluster, the variance of t[...] [the source text is cut off here]
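
A sketch of the difficulty-weighted merge, assuming each dataset returns (doc_id, local_score) pairs and that a per-dataset estimator maps that dataset's query features to an estimated quality; all names are illustrative.

    import heapq

    def merge_results(per_dataset_results, estimators, features, k=10):
        """per_dataset_results: {dataset: [(doc_id, local_score), ...]}
        estimators: {dataset: callable(features) -> estimated quality}
        features: {dataset: feature vector of the query on that dataset}
        Weights each dataset's local scores by the estimated quality of
        the query on that dataset, then merges into one global ranking."""
        merged = []
        for dataset, results in per_dataset_results.items():
            weight = estimators[dataset](features[dataset])
            for doc_id, score in results:
                merged.append((weight * score, doc_id))
        return heapq.nlargest(k, merged)   # top-k (weighted_score, doc_id) pairs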
