Learning to Estimate Query Difficulty: Including Applications to Missing Content Detection and Distributed Information Retrieval
Elad Yom-Tov, Shai Fine, David Carmel, Adam Darlow (IBM Haifa Research Labs), SIGIR 2005

Abstract

Novel learning methods are used for estimating the quality of results returned by a search engine in response to a query. Estimation is based on the agreement between the top results of the full query and the top results of its sub-queries. Quality estimates are useful for several applications, including improvement of retrieval, detection of queries for which no relevant content exists in the document collection, and distributed information retrieval.

Introduction (1/2)

Many IR systems suffer from a radical variance in performance. Estimating query difficulty is an attempt to quantify the quality of the results returned by a given system for a query.
Reasons for estimating query difficulty:
- Feedback to the user: the user can rephrase "difficult" queries.
- Feedback to the search engine: invoke alternative retrieval strategies for different queries.
- Feedback to the system administrator: identify queries related to a specific subject and expand the document collection accordingly.
- Distributed information retrieval.

Introduction (2/2)

Observation and motivation: queries that are answered well are those whose query terms agree on most of the returned documents, where agreement is measured by the overlap between the top results. Difficult queries are those where:
- the query terms cannot agree on the top results, or
- most of the terms agree except for a few outliers.
An example TREC query: "What impact has the chunnel (the Channel Tunnel) had on the British economy and/or the life style of the British?"

Related Work (1/2)

In the Robust track of TREC 2004, systems were asked to rank the topics by predicted difficulty; the goal is eventually to use such predictions for topic-specific processing. Prediction methods suggested by the participants:
- measuring clarity based on the system's scores of the top results;
- analyzing the ambiguity of the query terms;
- learning a predictor using old TREC topics as training data.
(Ounis, 2004) showed that an IDF-based predictor is positively correlated with query precision. (Diaz, 2004) used the temporal distribution together with the content of the documents to improve the prediction of average precision (AP) for a query.

Related Work (2/2)

The Reliable Information Access (RIA) workshop investigated the reasons for system performance variance across queries. Ten failure categories were identified, four of which are due to emphasizing only partial aspects of the query. One conclusion of the workshop: "comparing a full topic ranking against ranking based on only one aspect of the topic will give a measure of the importance of that aspect to the retrieved set".

Estimating Query Difficulty

Query terms are defined as the keywords and the lexical affinities. Features used for learning:
- the overlap between the top results of each sub-query and those of the full query, measured by the κ-statistic;
- the rounded logarithm of the document frequency, log(DF), of each sub-query.
Two challenges for learning:
- the number of sub-queries is not constant, so a canonical representation is needed;
- the sub-queries are not ordered.

Query Estimator Using a Histogram (1/2)

The basic procedure:
1. Find the top N results for the full query and for each sub-query.
2. Build a histogram of the overlaps, h(i, j), to form a feature vector. Values of log(DF) are binned into three discrete values: {0, 1}, {2, 3}, and {4+}. The entry h(i, j) counts the sub-queries with binned log(DF) = i and overlap = j, and the rows of h(i, j) are concatenated into a feature vector.
3. Compute a linear weight vector c for prediction.
For example, suppose a query has 4 sub-queries with binned log(DF(n)) = (0, 1, 1, 2) and overlap = (2, 0, 0, 1); then h = (0 0 1 | 2 0 0 | 0 1 0).

Query Estimator Using a Histogram (2/2)

Two additional features: the score of the top-ranked document, and the number of words in the query. The linear weight vector c is estimated by least squares (Moore-Penrose pseudo-inverse):
    c = (H H^T)^-1 H t
where H is the matrix whose columns are the feature vectors of the training queries and t is the vector of the target measure (P@10 or MAP) of the training queries. (H and t can be modified according to the objective.)

Query Estimator Using a Modified Decision Tree (1/2)

Useful when the data are sparse, i.e. the queries are too short. A binary decision tree is used: pairs of overlap and log(DF) of the sub-queries form the features, and each node consists of a weight vector, a threshold, and a score. (An example tree was shown as a figure.)

Query Estimator Using a Modified Decision Tree (2/2)

The concept of a random forest: better decision trees can be obtained by training a multitude of trees, each in a slightly different manner or on different data. The AdaBoost algorithm is applied to resample the training data.

Experiment and Evaluation (1/2)

The IR system is Juru. Two document collections:
- TREC-8: 528,155 documents, 200 topics
- WT10G: 1,692,096 documents, 100 topics
Four-fold cross-validation; performance is measured by the Kendall's-τ coefficient.

Experiment and Evaluation (2/2)

Compared with several other algorithms:
- estimation based on the score of the top result;
- estimation based on the average score of the top ten results;
- estimation based on the standard deviation of the IDF values of the query terms;
- estimation based on learning an SVM for regression.

Application 1: Improving IR Using Query Estimation (1/2)

- Selective automatic query expansion: adding terms to the query based on frequently appearing terms in the top retrieved documents. Expansion only works for easy queries, so the same features are used to train an SVM classifier that decides when to expand.
- Deciding which part of the topic should be used: TREC topics contain two parts, a short title and a longer description. Some topics that are not answered well by the description part are better answered by the title part, so difficult topics use the title part and easy topics use the description.

Application 1: Improving IR Using Query Estimation (2/2)

(Results figure.)

Application 2: Detecting Missing Content (1/2)

Missing content queries (MCQs) are those that have no relevant document in the collection. Experimental method:
- 166 MCQs are created artificially from 400 TREC queries (200 TREC topics, each consisting of a title part and a description part).
- Ten-fold cross-validation.
- A tree-based classifier is trained to separate MCQs from non-MCQs.
- A query difficulty estimator may or may not be used as a pre-filter that removes easy queries before the MCQ classifier.

Application 2: Detecting Missing Content (2/2)

(Results figure.)

Application 3: Merging the Results of Distributed Retrieval (1/2)

It is difficult to re-rank documents coming from different datasets, since the scores are local to each specific dataset. CORI (W. Croft, 1995) is one of the state-of-the-art algorithms for distributed retrieval; it uses an inference network for collection ranking. Applying the estimator to this problem:
- a query estimator is trained for each dataset;
- the estimated difficulty is used for weighting the scores;
- the weighted scores are merged to build the final ranking.
Ten-fold cross-validation; only minimal information is supplied by the search engine.

Application 3: Merging the Results of Distributed Retrieval (2/2)

Selective weighting: all queries are clustered (2-means) based on their estimations for each of the datasets. In one cluster, the variance of t
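The histogram construction from the "Query Estimator Using a Histogram" slides can be sketched in a few lines of Python. This is a minimal sketch, not the paper's code: the function and variable names are my own, the log(DF) binning follows the {0,1}/{2,3}/{4+} split described in the slides, and overlaps are assumed to be small integers that fit within the histogram columns.

```python
def histogram_feature(log_dfs, overlaps, n_df_bins=3, n_overlap_bins=3):
    """Concatenate the rows of h(i, j), where h(i, j) counts sub-queries
    whose binned log(DF) is i and whose overlap with the full query is j."""
    def df_bin(log_df):
        # Rounded log(DF) values {0, 1} -> bin 0, {2, 3} -> bin 1, 4+ -> bin 2.
        if log_df <= 1:
            return 0
        if log_df <= 3:
            return 1
        return 2

    h = [[0] * n_overlap_bins for _ in range(n_df_bins)]
    for log_df, overlap in zip(log_dfs, overlaps):
        h[df_bin(log_df)][overlap] += 1
    # Row-major concatenation of the histogram gives the feature vector.
    return [count for row in h for count in row]

# The slides' worked example: raw rounded log(DF) values (1, 2, 3, 4)
# bin to (0, 1, 1, 2); with overlaps (2, 0, 0, 1) this yields:
feature = histogram_feature([1, 2, 3, 4], [2, 0, 0, 1])
# → [0, 0, 1, 2, 0, 0, 0, 1, 0]
```

The fixed-size vector is what resolves the two learning challenges the slides mention: it is independent of the number of sub-queries and of their order.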
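The least-squares estimate c = (H H^T)^-1 H t can be sketched without any linear-algebra library. This is an illustrative implementation, not the paper's code: it assumes H H^T is invertible, and the function name is hypothetical.

```python
def pseudo_inverse_weights(H, t):
    """Solve c = (H H^T)^-1 H t, where the columns of H are training
    feature vectors and t holds the target measure (e.g. P@10) per query."""
    m = len(H)      # feature dimension (rows of H)
    n = len(H[0])   # number of training queries (columns of H)
    # Normal equations: A c = b with A = H H^T (m x m) and b = H t.
    A = [[sum(H[i][k] * H[j][k] for k in range(n)) for j in range(m)]
         for i in range(m)]
    b = [sum(H[i][k] * t[k] for k in range(n)) for i in range(m)]
    # Gaussian elimination with partial pivoting.
    for col in range(m):
        piv = max(range(col, m), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, m):
            f = A[r][col] / A[col][col]
            for cc in range(col, m):
                A[r][cc] -= f * A[col][cc]
            b[r] -= f * b[col]
    # Back substitution.
    c = [0.0] * m
    for i in reversed(range(m)):
        c[i] = (b[i] - sum(A[i][j] * c[j] for j in range(i + 1, m))) / A[i][i]
    return c

# Three training queries with 2-dimensional features; the targets were
# generated from the weights (0.5, 2.0), which the solver recovers.
c = pseudo_inverse_weights([[1, 0, 1], [0, 1, 1]], [0.5, 2.0, 2.5])
```

In practice one would use a library pseudo-inverse (e.g. `numpy.linalg.pinv`), which also handles the rank-deficient case; the explicit normal equations are shown only to mirror the formula on the slide.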
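The evaluation compares the predicted difficulty ranking of the topics against their actual ranking using Kendall's τ. A minimal sketch of the statistic (the tau-a variant, with no tie correction; the name is my own):

```python
def kendall_tau(x, y):
    """Kendall's tau-a between two equal-length score lists: concordant
    minus discordant pairs, divided by the total number of pairs."""
    n = len(x)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)
```

Identical rankings give τ = 1, reversed rankings give τ = -1, and unrelated rankings give values near 0, which is why τ is a natural score for "how well did the estimator order the topics by difficulty".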
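The merging scheme of Application 3 (weight each dataset's local scores by its estimated query quality, then merge into one ranking) might look like the following sketch. The data layout and names are assumptions for illustration, not the paper's implementation.

```python
def merge_ranked_lists(results_per_dataset, estimated_quality):
    """Merge per-dataset result lists by scaling each local score with the
    dataset's estimated query quality, then sorting globally.

    results_per_dataset: {dataset: [(doc_id, local_score), ...]}
    estimated_quality:   {dataset: per-query quality weight from its estimator}
    """
    merged = []
    for dataset, results in results_per_dataset.items():
        weight = estimated_quality[dataset]
        merged.extend((doc, weight * score) for doc, score in results)
    # Sort by the weighted score to build the final global ranking.
    merged.sort(key=lambda pair: pair[1], reverse=True)
    return merged

# Dataset A's local scores are discounted because the query is estimated
# to be hard there, so dataset B's top document wins the merge.
merged = merge_ranked_lists(
    {"A": [("a1", 0.9), ("a2", 0.5)], "B": [("b1", 0.8)]},
    {"A": 0.5, "B": 1.0},
)
```

This captures the core idea from the slides: the local scores are incomparable across collections, and the per-dataset difficulty estimate supplies the normalization that makes them comparable.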