




An Introduction to IR (10th Course)
Chapter 11: Probabilistic Information Retrieval

Recap of IR ranking
- VSM: term weighting in a vector space; the similarity measure is the cosine of the vector angles.
- Query expansion: improving search results, especially for high recall. E.g., searching for "aircraft" so it matches with "plane"; "thermodynamic" with "heat".
- Relevance feedback: options for improving results.
  - Global methods: query expansion, thesauri, automatic thesaurus generation, global indirect relevance feedback.
  - Local methods: relevance feedback, pseudo relevance feedback.

Probabilistic relevance feedback
- Rather than reweighting in a vector space: if the user has told us some relevant and some irrelevant documents, we can proceed to build a probabilistic classifier, such as a Naive Bayes model:
  $P(t_k|R) = |D_{rk}| / |D_r|$
  $P(t_k|NR) = |D_{nrk}| / |D_{nr}|$
- Here $t_k$ is a term; $D_r$ is the set of known relevant documents; $D_{rk}$ is the subset that contains $t_k$; $D_{nr}$ is the set of known irrelevant documents; $D_{nrk}$ is the subset that contains $t_k$.
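As a quick illustration of these estimates, here is a minimal sketch (not from the slides; the set-of-terms document representation and all names are assumptions made for illustration):

```python
# Minimal sketch: estimating P(t_k|R) and P(t_k|NR) by counting documents.
# A "document" here is just the set of terms it contains; `relevant` and
# `nonrelevant` stand in for the judged sets D_r and D_nr.

def term_probabilities(term, relevant, nonrelevant):
    """Return (P(term|R), P(term|NR)) as containing-document fractions."""
    p_rel = sum(1 for doc in relevant if term in doc) / len(relevant)
    p_nonrel = sum(1 for doc in nonrelevant if term in doc) / len(nonrelevant)
    return p_rel, p_nonrel

relevant = [{"aircraft", "engine"}, {"aircraft", "wing"}]     # D_r
nonrelevant = [{"bird", "wing"}, {"car", "engine"}]           # D_nr
print(term_probabilities("aircraft", relevant, nonrelevant))  # (1.0, 0.0)
```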
Why probabilities in IR?
- User information need is turned into a query representation; documents are turned into document representations; how do we match them?
- In traditional IR systems, matching between each document and query is attempted in a semantically imprecise space of index terms.
- Probabilities provide a principled foundation for uncertain reasoning. Can we use probabilities to quantify our uncertainties? Our understanding of the user need is uncertain, and our guess of whether a document has relevant content is uncertain.

Probabilistic IR topics
- Classical probabilistic retrieval model: the probability ranking principle, etc.
- (Naive) Bayesian text categorization.
- Bayesian networks for text retrieval.
- Language model approach to IR: an important emphasis in recent work.
- Probabilistic methods are one of the oldest but also one of the currently hottest topics in IR. Traditionally: neat ideas, but they've never won on performance. It may be different now.
The document ranking problem
- We have a collection of documents; the user issues a query; a list of documents needs to be returned.
- The ranking method is the core of an IR system: in what order do we present documents to the user? We want the "best" document to be first, the second best second, etc.
- Idea: rank by the probability of relevance of the document w.r.t. the information need: $P(\text{relevant} \mid \text{document}_i, \text{query})$.

Recall a few probability basics. For events $a$ and $b$:
- Bayes' Rule: $P(a|b) = \dfrac{P(b|a)\,P(a)}{P(b)}$, where $P(a|b)$ is the posterior and $P(a)$ is the prior.
- Odds: $O(a) = \dfrac{P(a)}{P(\bar a)} = \dfrac{P(a)}{1 - P(a)}$.

The Probability Ranking Principle
"If a reference retrieval system's response to each request is a ranking of the documents in the collection in order of decreasing probability of relevance to the user who submitted the request, where the probabilities are estimated as accurately as possible on the basis of whatever data have been made available to the system for this purpose, the overall effectiveness of the system to its user will be the best that is obtainable on the basis of those data."
1960s/1970s: S. Robertson, W.S. Cooper, M.E. Maron; van Rijsbergen (1979: 113); Manning & Schütze (1999: 538).
Probability Ranking Principle
- Let $x$ be a document in the collection. Let $R$ represent relevance of a document w.r.t. a given (fixed) query, and let $NR$ represent non-relevance. (Equivalently, $R = 1$ vs. $R = 0$.)
- $p(x|R)$, $p(x|NR)$: the probability that if a relevant (non-relevant) document is retrieved, it is $x$.
- $p(R)$, $p(NR)$: the prior probability of retrieving a (non-)relevant document.
- We need to find $p(R|x)$, the probability that document $x$ is relevant. By Bayes' Rule:
  $p(R|x) = \dfrac{p(x|R)\,p(R)}{p(x)}$, with $p(R|x) + p(NR|x) = 1$.

Probability Ranking Principle (PRP)
- Simple case: no selection costs or other utility concerns that would differentially weight errors.
- Bayes Optimal Decision Rule: $x$ is relevant iff $p(R|x) > p(NR|x)$.
- PRP in action: rank all documents by $p(R|x)$.
- Theorem: using the PRP is optimal, in that it minimizes the loss (Bayes risk) under 1/0 loss. Provable if all probabilities are correct, etc.; e.g., Ripley 1996.
Probability Ranking Principle
- More complex case: retrieval costs. Let $d$ be a document, $C$ the cost of retrieval of a relevant document, and $C'$ the cost of retrieval of a non-relevant document.
- Probability Ranking Principle: if
  $C \cdot p(R|d) + C' \cdot (1 - p(R|d)) \;\le\; C \cdot p(R|d') + C' \cdot (1 - p(R|d'))$
  for all $d'$ not yet retrieved, then $d$ is the next document to be retrieved.
- We won't further consider loss/utility from now on.

Probability Ranking Principle
- How do we compute all those probabilities? We do not know the exact probabilities and have to use estimates. The Binary Independence Retrieval (BIR) model, which we discuss later today, is the simplest model.
- Questionable assumptions:
  - "Relevance" of each document is independent of the relevance of other documents. (Really, it's bad to keep on returning duplicates.)
  - A Boolean model of relevance.
  - That one has a single-step information need. (Seeing a range of results might let the user refine the query.)
Probabilistic Retrieval Strategy
- Estimate how terms contribute to relevance: how do things like tf, df, and document length influence your judgments about document relevance? One answer is the Okapi formulae (S. Robertson).
- Combine these to find the document relevance probability.
- Order documents by decreasing probability.

Probabilistic Ranking
Basic concept: "For a given query, if we know some documents that are relevant, terms that occur in those documents should be given greater weighting in searching for other relevant documents. By making assumptions about the distribution of terms and applying Bayes' Theorem, it is possible to derive weights theoretically." (Van Rijsbergen)
Binary Independence Model
- Traditionally used in conjunction with the PRP.
- "Binary" = Boolean: documents are represented as binary incidence vectors of terms (cf. lecture 1): $\vec{x} = (x_1, \ldots, x_n)$, with $x_i = 1$ iff term $i$ is present in document $x$.
- "Independence": terms occur in documents independently. Different documents can be modeled as the same vector.
- This is the Bernoulli Naive Bayes model (cf. text categorization!).

Binary Independence Model
- Queries: binary term incidence vectors.
- Given query $q$, for each document $d$ we need to compute $p(R|q,d)$. We replace this with computing $p(R|q,\vec{x})$, where $\vec{x}$ is the binary term incidence vector representing $d$. We are interested only in ranking.
- Using odds and Bayes' Rule:
  $O(R|q,\vec{x}) = \dfrac{p(R|q,\vec{x})}{p(NR|q,\vec{x})} = \dfrac{p(R|q)}{p(NR|q)} \cdot \dfrac{p(\vec{x}|R,q)}{p(\vec{x}|NR,q)}$
  The first factor is constant for a given query; the second needs estimation.
- Using the independence assumption:
  $\dfrac{p(\vec{x}|R,q)}{p(\vec{x}|NR,q)} = \prod_{i=1}^{n} \dfrac{p(x_i|R,q)}{p(x_i|NR,q)}$, so $O(R|q,\vec{x}) = O(R|q) \cdot \prod_{i=1}^{n} \dfrac{p(x_i|R,q)}{p(x_i|NR,q)}$.
- Since each $x_i$ is either 0 or 1, split the product:
  $O(R|q,\vec{x}) = O(R|q) \cdot \prod_{x_i=1} \dfrac{p(x_i=1|R,q)}{p(x_i=1|NR,q)} \cdot \prod_{x_i=0} \dfrac{p(x_i=0|R,q)}{p(x_i=0|NR,q)}$
- Let $p_i = p(x_i=1|R,q)$ and $r_i = p(x_i=1|NR,q)$. Assume, for all terms not occurring in the query ($q_i = 0$), that $p_i = r_i$. (This can be changed, e.g., in relevance feedback.) Then the products run over query terms only:
  $O(R|q,\vec{x}) = O(R|q) \cdot \prod_{x_i=q_i=1} \dfrac{p_i}{r_i} \cdot \prod_{x_i=0,\, q_i=1} \dfrac{1-p_i}{1-r_i}$
  (all matching terms; non-matching query terms).
- Rearranging so the document-dependent part ranges over matching terms only:
  $O(R|q,\vec{x}) = O(R|q) \cdot \prod_{x_i=q_i=1} \dfrac{p_i(1-r_i)}{r_i(1-p_i)} \cdot \prod_{q_i=1} \dfrac{1-p_i}{1-r_i}$
  (all matching terms; all query terms). The last product is constant for each query, so the only quantity to be estimated for ranking is the Retrieval Status Value:
  $RSV = \log \prod_{x_i=q_i=1} \dfrac{p_i(1-r_i)}{r_i(1-p_i)} = \sum_{x_i=q_i=1} c_i, \qquad c_i = \log \dfrac{p_i(1-r_i)}{r_i(1-p_i)}$
Binary Independence Model
- So it all boils down to computing the RSV. How do we compute the $c_i$'s from our data?
- Estimating the RSV coefficients: for each term $i$, look at this table of document counts:

                 relevant    non-relevant     total
    x_i = 1      s           n - s            n
    x_i = 0      S - s       N - n - S + s    N - n
    total        S           N - S            N

- Estimates: $p_i \approx s/S$ and $r_i \approx (n-s)/(N-S)$, so
  $c_i \approx \log \dfrac{s/(S-s)}{(n-s)/(N-n-S+s)}$
  For now, assume no zero terms. More next lecture.

Estimation: the key challenge
- If non-relevant documents are approximated by the whole collection, then $r_i$ (the probability of occurrence in non-relevant documents for the query) is $n/N$, and
  $\log \dfrac{1-r_i}{r_i} = \log \dfrac{N-n}{n} \approx \log \dfrac{N}{n} = \text{IDF}!$
- $p_i$ (the probability of occurrence in relevant documents) can be estimated in various ways:
  - from relevant documents, if we know some (relevance weighting can be used in a feedback loop);
  - as a constant (the Croft and Harper combination match): then we just get idf weighting of the terms;
  - proportional to the probability of occurrence in the collection; more accurately, to the log of this (Greiff, SIGIR 1998).
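A minimal sketch of the $c_i$ computation under these estimates (my code, not from the slides; the add-half smoothing is an assumption standing in for the zero-count handling deferred to the next lecture):

```python
import math

# c_i = log [p_i (1 - r_i)] / [r_i (1 - p_i)] from the contingency table.
# N: collection size, S: relevant documents, n: documents containing the
# term, s: relevant documents containing the term. The 0.5 / 1.0 terms are
# add-half smoothing to keep counts away from zero (an assumption).

def c_i(N, S, n, s):
    p = (s + 0.5) / (S + 1.0)           # estimate of p_i = P(x_i = 1 | R)
    r = (n - s + 0.5) / (N - S + 1.0)   # estimate of r_i = P(x_i = 1 | NR)
    return math.log(p * (1 - r) / (r * (1 - p)))

# With no relevance information (S = s = 0), c_i is close to idf = log(N/n):
print(c_i(N=100_000, S=0, n=100, s=0))  # ~6.90
print(math.log(100_000 / 100))          # ~6.91
```

A document's RSV is then just the sum of $c_i$ over the terms it shares with the query.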
Iteratively estimating $p_i$
1. Assume that $p_i$ is constant over all $x_i$ in the query: $p_i = 0.5$ (even odds) for any given document.
2. Determine a guess of the relevant document set: $V$ is a fixed-size set of the highest-ranked documents on this model (note: now a bit like tf.idf!).
3. We need to improve our guesses for $p_i$ and $r_i$, so use the distribution of $x_i$ in the documents in $V$. Let $V_i$ be the set of documents in $V$ containing $x_i$: $p_i = |V_i| / |V|$. Assume that documents not retrieved are not relevant: $r_i = (n_i - |V_i|) / (N - |V|)$.
4. Go to step 2 until the ranking converges, then return it.

Probabilistic Relevance Feedback
1. Guess a preliminary probabilistic description of $R$ and use it to retrieve a first set of documents $V$, as above.
2. Interact with the user to refine the description: learn some definite members of $R$ and $NR$.
3. Re-estimate $p_i$ and $r_i$ on the basis of these. Or combine the new information with the original guess (using a Bayesian prior):
   $p_i^{(2)} = \dfrac{|V_i| + \kappa\, p_i^{(1)}}{|V| + \kappa}$, where $\kappa$ is the prior weight.
4. Repeat, thus generating a succession of approximations to $R$.
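A runnable sketch of the pseudo-relevance-feedback loop above (steps 1-4 of "Iteratively estimating $p_i$"; my code, where the data layout, the fixed round count standing in for a convergence test, and the light smoothing are all assumptions):

```python
import math

# docs: doc id -> set of terms; query: set of terms. V is the top-k ranked
# set, V_i its members containing term i. Smoothing avoids log(0) when a
# term appears in all or none of the documents in V.

def rank(docs, query, p, r):
    def rsv(terms):
        return sum(math.log(p[t] * (1 - r[t]) / (r[t] * (1 - p[t])))
                   for t in query if t in terms)
    return sorted(docs, key=lambda d: rsv(docs[d]), reverse=True)

def iterate_pi(docs, query, k=10, rounds=5):
    N = len(docs)
    n = {t: sum(1 for d in docs if t in docs[d]) for t in query}
    p = {t: 0.5 for t in query}                        # step 1: even odds
    r = {t: (n[t] + 0.5) / (N + 1.0) for t in query}   # r_i from collection
    for _ in range(rounds):                            # stand-in for "until converged"
        V = rank(docs, query, p, r)[:k]                # step 2: guess V
        Vi = {t: sum(1 for d in V if t in docs[d]) for t in query}
        # step 3: p_i = |V_i|/|V| and r_i = (n_i - |V_i|)/(N - |V|), smoothed
        p = {t: (Vi[t] + 0.5) / (len(V) + 1.0) for t in query}
        r = {t: (n[t] - Vi[t] + 0.5) / (N - len(V) + 1.0) for t in query}
    return rank(docs, query, p, r)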
PRP and BIR
- Getting reasonable approximations of the probabilities is possible, but it requires restrictive assumptions:
  - term independence;
  - terms not in the query don't affect the outcome;
  - a Boolean representation of documents/queries/relevance;
  - document relevance values are independent.
- Some of these assumptions can be removed.
- Problem: we either require partial relevance information or can only derive somewhat inferior term weights.

Removing term independence
- In general, index terms aren't independent, and the dependencies can be complex.
- van Rijsbergen (1979) proposed a model of simple tree dependencies: each term is dependent on exactly one other. This is exactly Friedman and Goldszmidt's Tree Augmented Naive Bayes (AAAI 1996).
- In the 1970s, estimation problems held back the success of this model.
Food for thought
- Think through the differences between standard tf.idf and the probabilistic retrieval model in the first iteration.
- Think through the differences between vector space (pseudo) relevance feedback and probabilistic (pseudo) relevance feedback.

Good and Bad News
- Standard Vector Space Model: empirical for the most part; success measured by results; few properties provable.
- Probabilistic Model advantages: based on a firm theoretical foundation; a theoretically justified optimal ranking scheme.
- Disadvantages: making the initial guess to get $V$; binary word-in-doc weights (not using term frequencies); independence of terms (can be alleviated); the amount of computation; it has never worked convincingly better in practice.

Bayesian Networks for Text Retrieval (Turtle and Croft 1990)
- The standard probabilistic model assumes you can't estimate $P(R|D,Q)$; instead it assumes independence and uses $P(D|R)$.
- But maybe you can, with a Bayesian network.
What is a Bayesian network?
- A directed acyclic graph.
- Nodes: events or variables; they assume values (for our purposes, all Boolean).
- Links: model direct dependencies between nodes.

Bayesian Networks
- Example: $a$, $b$, $c$ are propositions (events), with links $a \to c$ and $b \to c$. The network stores $p(a)$, $p(b)$, and $p(c|a,b)$ for all values of $a$, $b$, $c$. Bayesian networks model causal relations between events.
- Inference in Bayesian nets: given probability distributions for the roots and the conditional probabilities, we can compute the a priori probability of any instance. Fixing assumptions (e.g., that $b$ was observed) causes recomputation of the other probabilities (conditional dependence).
- For more information, see:
  - R.G. Cowell, A.P. Dawid, S.L. Lauritzen, and D.J. Spiegelhalter. 1999. Probabilistic Networks and Expert Systems. Springer Verlag.
  - J. Pearl. 1988. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann.
Toy Example
- Nodes: Finals (f), Project Due (d), No Sleep (n), Gloom (g), Triple Latte (t); finals and a project due cause gloom, finals cause no sleep, and gloom causes a triple latte.

Independence Assumptions
- Independence assumption: $P(t|g,f) = P(t|g)$.
- Joint probability:
  $P(f, d, n, g, t) = P(f)\, P(d)\, P(n|f)\, P(g|f,d)\, P(t|g)$
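Here is the toy network as code, with made-up CPT numbers since the slides give none; the query at the end asks for the cause (Finals) given an observed effect (Triple Latte):

```python
from itertools import product

# Joint: P(f,d,n,g,t) = P(f) P(d) P(n|f) P(g|f,d) P(t|g).
# All probability values below are invented for illustration.
P_f = {True: 0.10, False: 0.90}                       # Finals
P_d = {True: 0.20, False: 0.80}                       # Project Due
P_n = {True: 0.80, False: 0.10}                       # P(NoSleep=T | f)
P_g = {(True, True): 0.99, (True, False): 0.90,
       (False, True): 0.70, (False, False): 0.05}     # P(Gloom=T | f, d)
P_t = {True: 0.70, False: 0.20}                       # P(TripleLatte=T | g)

def joint(f, d, n, g, t):
    pn = P_n[f] if n else 1 - P_n[f]
    pg = P_g[(f, d)] if g else 1 - P_g[(f, d)]
    pt = P_t[g] if t else 1 - P_t[g]
    return P_f[f] * P_d[d] * pn * pg * pt

# Inference by enumeration: P(Finals=T | TripleLatte=T observed).
num = sum(joint(True, d, n, g, True) for d, n, g in product((True, False), repeat=3))
den = sum(joint(f, d, n, g, True) for f, d, n, g in product((True, False), repeat=4))
print(num / den)
```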
Chained inference
- Evidence: a node takes on some value.
- Inference: compute the belief (probabilities) of the other nodes, conditioned on the known evidence.
- Two kinds of inference: diagnostic and predictive.

Computational complexity
- General networks: NP-hard; tree-like networks are easily tractable.
- Much other work exists on efficient exact and approximate Bayesian network inference: clever dynamic programming; approximate inference ("loopy belief propagation").
Model for Text Retrieval
- Goal: given a user's information need (evidence), find the probability that a document satisfies the need.
- Retrieval model: model the documents in a document network; model the information need in a query network.

Bayesian Nets for IR: Idea
- Document network: large, but computed once for each document collection. Nodes: $d_i$ (documents) link to $t_i$ (document representations), which link to $r_i$ ("concepts").
- Query network: small, computed once for every query. Nodes: $c_i$ (query concepts) link to $q_i$ (high-level concepts), which link to $I$ (the goal node representing the information need).

Bayesian Nets for IR
- Construct the document network (once!).
- For each query: construct the best query network, attach it to the document network, and find the subset of $d_i$'s which maximizes the probability value of node $I$ (the best subset). Retrieve these $d_i$'s as the answer to the query.
- [Figure: the two networks stacked, with documents and terms/concepts in the document network, and concepts, query operators (AND/OR/NOT), and the information need in the query network.]

Link matrices and probabilities
- Prior document probability: $P(d) = 1/n$.
- $P(r|d)$: within-document term frequency; tf.idf-based.
- $P(c|r)$: 1-to-1, or based on a thesaurus.
- $P(q|c)$: canonical forms of the query operators. Always use things like AND and NOT; never store a full CPT (conditional probability table).
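The canonical operator forms admit closed-form evaluation, which is why the full $2^k$-row CPT never needs to be stored. A minimal sketch (my code; Turtle and Croft's system also defines weighted-sum operators not shown here):

```python
from math import prod

# Closed-form "canonical link matrices" for query operator nodes, evaluated
# directly from the parents' belief values (probabilities of being true).

def p_and(parents):            # true iff every parent is true
    return prod(parents)

def p_or(parents):             # true iff at least one parent is true
    return 1.0 - prod(1.0 - p for p in parents)

def p_not(parent):             # negation of a single parent
    return 1.0 - parent

# e.g. a node for OR(c1, c2) with beliefs 0.3 and 0.5, and NOT(c3) with 0.4:
print(p_or([0.3, 0.5]))   # 0.65
print(p_not(0.4))         # 0.6
```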
Example: "reason trouble two"
- [Figure: a document network for Hamlet and Macbeth, with term nodes such as "reason", "trouble", "double", and "two", attached to a query network that combines the concepts with OR and NOT operators to represent the user query.]

Extensions
- The prior probabilities don't have to be $1/n$.
- The "user information need" doesn't have to be a query: it can be words typed, documents read, or any combination.
- Phrases, inter-document links.
- Link matrices can be modified over time: user feedback; the promise of "personalization".

Computational details
- The document network is built at indexing time; the query network is built and scored at query time.
- Representation: link matrices from documents to any single term are like the postings entry for that term; canonical link matrices are efficient to store and compute.
- Attach evidence only at the roots of the network; then a single pass from roots to leaves suffices.

Bayes Nets in IR
- Flexible ways of combining term weights, which can generalize previous approaches: the Boolean model, the binary independence model, probabilistic models with weaker assumptions.
- Efficient large-scale implementation: the InQuery text retrieval system from UMass; Turtle and Croft (1990); the commercial version is defunct(?).
- Need approximations to avoid intractable inference; need to estimate all the probabilities by some means (whether more or less ad hoc).
- Much new Bayes net technology yet to be applied?