版权说明:本文档由用户提供并上传,收益归属内容提供方,若内容存在侵权,请进行举报或认领
文档简介
1、 基本的匹配计算主要内容 关键词查询 结构化查询 字符串的匹配算法 允许出错的字符串的匹配算法关键词查询目前关键词查询是最常用的信息查询方式。又可分为:1. 单个词2. 多个词组成的上下文 (高级检索)3. 多个词用and,or 或not 组成的句子4. 自然语言句子单个词查询 用一个最贴切的词表示查询的意思 考研 原理:文档用词组成的向量或文本。匹配变成是否文档中含有查询词。上下文查询类型:类型: 短语Phrase e.g, Shandong University 近似句子 允许有拼写错误等 Shadong University高级查询中组成的上下文: 书名作者,出版社, 发表时间,价格De
2、finitionA syntax (语法) composed of atoms that retrieve documents, and of Boolean operators which work on their operandse.g, translation AND (syntax OR syntactic) (布尔表达式)Fuzzy Boolean Retrieve documents appearing in some operands (The AND may require it to appear in more operands than the OR) Ranked h
3、igher which has a larger number of elements Boolean 查询语法句法自然语言 Generalization of “fuzzy Boolean” A query is an enumeration of words and context queries All the documents matching a portion of the user query are retrieved Set a threshold so that the document with very low weight are not retrieved结构查询
4、 内容与结构混合查询内容与结构混合查询 - 给出匹配模板进行匹配 三种结构三种结构 - 固定结构 - 超链结构 - 层次结构Fixed (固定)StructureDocument:a fixed set of fieldsEX: a mail has a sender, a receiver, a date, a subject and a body fieldSearch for the mails sent to a given person with “football” in the Subject fieldA hypertext is a directed graph where
5、nodes hold some text (text contents)the links represent connections between nodes or between positions inside nodes (structural connectivity)HypertextHierarchical StructureHierarchical Structure层次查询的处理 从根到叶逐层限制的多次查询 基于树或图的匹配算法String Matchingdetecting the occurrence of a particular substring (pattern
6、) in another string (text)A straightforward SolutionThe Knuth-Morris-Pratt AlgorithmStraightforward solution Algorithm: Simple string matching Input: P and T, the pattern and text strings; m, the length of P. The pattern is assumed to be nonempty. Output: The return value is the index in T where a c
7、opy of P begins, or -1 if no match for P is found.Definition 11.1 Notation for patterns and textP the pattern being search for;T the text in which P is sought;m the length of Pn the length of T, not known to the algorithm; m m ) /* m is the length of P*/ match = i; /match found. / success case / bre
8、ak; /*exit the loop here*/ if(tj = pk) j+; k+; else /Back up over matched characters. int backup=k-1; /从本次查询点的下一个顶点开始从本次查询点的下一个顶点开始/ j = j-backup; k = k-backup; /Slide pattern forward,start over. j+; i=j;return match;ikjPTAnalysis Worst-case complexity is in (mn)P=aaab T=aaaaaaaaaaaaaab Need to back
9、 up. However, it works quite well on average for natural language.The Knuth-Morris-Pratt Algorithm Pattern Matching with Finite Automata (自动机) e.g. P = “AABC” start is the beginning index of T Idea: remembering the matched part by utilizing the prefix ofpattern P and do not consider T. However, it i
10、s not scalable for the size of term table.The Knuth-Morris-Pratt Flowchart (流程图) Character labels are inside the nodes, not on the arcs. Each node has two arrows out to other nodes: success link, or fail link next character is read only after a success link A special node, node 0, called “get next c
11、har” which read in next text character. e.g. P = “ABABCB”T=ABABABCBConstruction of the KMP Flowchart Definition:Fail links We define failk as the largest r (with rk) such that p1,.pr-1 matches pk-r+1.pk-1.That is the (r-1) character prefix of P is identical to the one (r-1) character substring endin
12、g at index k-1. Thus the fail links are determined by repetition within P itself.P: |A B A B A B| C B | |T: |A B A B A B| x mismatchP: |A B A B | A B C B | |T: A B |A B A B| x A B A B A B A B A B A B xr = 5; k = 7, p1.p4 = p3p6Fail7 = 5注意:T的指针没有改变函数 fail 是从左到右计算的,即递推方式进行的。我们可以假设 在计算failk 时,所有failt,
13、tk 已经被计算了。fail1=0. | p1 ps-1 | ps p1 | pk-r-1 pk-2 | pk-1 pk pm (a) by definition of failk-1 (which is s) p1. pfails-1 |pfails | | p1 ps-1 | ps | p1 | pk-r-1 pk-2 | pk-1 | pk pm do these match? (b) looking back for a match for pk-1. failk = s+1 = Pfailk-1+1Algorithm: KMP flowchart construction Input
14、: P,a string of characters;m,the length of P. Output: fail,the array of failure links,defined for indexes 1,.,m.The array is passed in and the algorithm fills it. Step: void kmpSetup(char P, int m, int fail) int k,s 1. fail1=0; /start point/ 2. for(k=2;k=1) / p1,ps 与 pk-s+1,pk-1比较 5. if(ps=pk-1) /*就
15、是它!*/ 6. break; 7. s=fails; /*否则递归向下找*/ 8. failk=s+1; fail1 = 0 ; fail2-1 =fail1= s = 0; fail2 =1; k= 3; s=fail3-1= 1; p2 p1, s= fail1=0; fail3= s+1=1. k=4; s= fail3 =1; p1=p3; fail4=s+1=2; To compute fail8, s= fail7 = 5, but p7 p5, recompute s= fail5 = 3, but p7 p3 either, so re-compute s= fail3 =
16、1. still p7 p1. Finally, s= fail1 = 0, end the search, and fail8 is assigned s+1 = 1;P = A B A B A B C Bfail 0 1 1 2 3 4 5 1 index 1 2 3 4 5 6 7 8The Knuth-Morris-Pratt Scan Algorithm int kmpScan(char P,char T,int m,int fail) int match, j,k; match= -1; j=1; k=1; while(endText(T,j)=false) if(km) / su
17、ccess/ match = j-m; break; if(k= =0) /the point of T moves ahead, and rescan/ j+; k=1; else if(tj= =pk) / success at position k of P/ j+; k+; else /Follow fail arrow. k=failk; / fail and go back to the point of P / /continue loop. return match; 没有使用变量iAnalysis Based on the similar method on analyzin
18、g the time complexity of algorithm KMP setup, The scan algorithm requires 2n character comparisons in the worst case Overall: worst case complexity is (n+m)RK 算法 输入: Two n bit strings A(a1,a2, ,an) and B(b1,b2, ,bn) 输出: whether A = B. 传统方法:传输n位依次比较。 指纹机制: 定义n位整数 根据指纹函数 Fp(x)=x mod p , p是一个素数是一个素数 比较
19、Fp(a)是否等于Fp(b),传输位数减小为O(logp)112iniiaa112iniibb设 代表字符集合,x , 定义函数ord(x), d = | |, ord(x): 0,1,2,d-1 对任意的模式P, |P| = m, 利用多项式指纹 Q(P) = ord(P1)dm-1+ord(P2)dm-2+ord(Pm-1)d+ord(Pm) 代表 P 同样对文本T = T1,T2,.,Tn 从左到右计算长度为m的连续子串的指纹,如Q(i)= ord(Ti)dm-1+ord(Ti+1)dm-2 +ord(Ti+m-2)d+ ord(T i+m-1)并和Q(P)相比较。若相同,则找到匹配的子
20、串。起始位置为i, 0i0, b- 1 aa 0*2+0=0, bb- 2*1+1=3, 0 3 ba- (3-2*1)*2+0 = 2 0 2 ab- (2-1*2)*2+ 1 = 1 ba- (1-0*2)*2+0 = 2 aa - (2-1*2)*2+0 = 0 find the position问题是得到的整数无法表示了,过于大取素数 q, Q(i) (mod q) = Q(p) (mod q)Q(i+1) (mod q) = (Q(i) ord(Ti)d m-1)*d ) (mod q) + ord(Ti+m) 但这样的话,当 Q(i) (mod q) = Q(p) (mod q),
21、 不一定 对应的字符串相同,这时可以逐位进行检查,有人证明该算法的期望时间复杂性为O(m+n), 是较好的算法。 特点:可以推广到高维的字符串匹配,是否可以应用到对2维图像的匹配?应用到对3维物体的匹配?计算具有一定误差的匹配Elements of Dynamic Programming Constructing solution to a problem by building it up dynamically from solutions to smaller (or simpler) sub-problems sub-instances are combined to obtain s
22、ub-instances of increasing size, until finally arriving at the solution of the original instance. make a choice at each step, but the choice may depend on the solutions to sub-problemsPrinciple of optimalityQthe optimal solution to any nontrivial instance of a problem is a combination of optimal sol
23、utions to some of its sub-instances. Memorization (for overlapping sub-problems)Qavoid calculating the same thing twice, Qusually by keeping a table of know results that fills up as sub-instances are solved. Principle of optimalityQthe optimal solution to any nontrivial instance of a problem is a co
24、mbination of optimal solutions to some of its sub-instances. Memorization (for overlapping sub-problems)Qavoid calculating the same thing twice, Qusually by keeping a table of know results that fills up as sub-instances are solved. Memorization for Dynamic programming version of a recursive algorith
25、m e.g. Trade space for speed by storing solutions to sub-problems rather than re-computing them. As solutions are found for suproblems, they are recorded in a dictionary, Before any recursive call, say on subproblem Q, check the dictionary to see if a solution for Q has been stored. If no solution h
26、as been stored, go ahead with recursive call. If a solution has been stored for Q, retrieve the stored solution, and do not make the recursive call. Just before returning the solution, store it in the dictionary. Dynamic programming version of the fib.Development of a dynamic programming algorithm C
27、haracterize the structure of an optimal solution Breaking a problem into sub-problem whether principle of optimality apply Recursively define the value of an optimal solution define the value of an optimal solution based on value of solutions to sub-problems Compute the value of an optimal solution
28、in a bottom-up fashion compute in a bottom-up fashion and save the values along the way later steps use the save values of pervious steps Construct an optimal solution from computed information字符串的近似匹配(Approximate string matching) In many applications we cant expect an exact copy, we want to find a
29、approximating string match with at most k mistakes, e.g., a spelling corrector. We will develop a dynamic programming algorithm for the k-approximate match.Definition: Let k be a nonnegative integer. A k-approximate match is a match of P in T that has at most k differences. The differences can be an
30、y of the following three types, the name of the difference is the operation needed on T to bring it closer to P.Revise: The corresponding characters in P and T are different;Delete: T contains a character that is missing from P.Insert: T is missing a character that appears in P. 如何修改T中的子串,使其能匹配上e.g.
31、 3-approximate match P: u n n e c e s s a r i l y T: un e s c e s s a r a l y (made three spelling errors)Definition 11.6 Difference tableDij = the minimum number of difference between P1,Pi and a segment of T ending at tj. 1i m, 1j m.定义: D0j=0; Di,0 = i; There will be a k-approximate match ending a
32、t tj for any j such that Dmj k, so we can stop as soon as we find an entry less than or equal to k in the last row of D, which is the first k -approximate match.The rules for the computation of D Dij = Di-1j-1 if pi = tj /*no error*/ Dij = Di-1j-1+1 if pi tj and revise tj to pi and both i and j increase; Dij = Di-1j+1 if insert pi into T, only i increase. Dij = Dij-1+1 if delete tj from T and only j increase. Each entry requires only entries above it and to its left in the table0 0 0 0 012m
温馨提示
- 1. 本站所有资源如无特殊说明,都需要本地电脑安装OFFICE2007和PDF阅读器。图纸软件为CAD,CAXA,PROE,UG,SolidWorks等.压缩文件请下载最新的WinRAR软件解压。
- 2. 本站的文档不包含任何第三方提供的附件图纸等,如果需要附件,请联系上传者。文件的所有权益归上传用户所有。
- 3. 本站RAR压缩包中若带图纸,网页内容里面会有图纸预览,若没有图纸预览就没有图纸。
- 4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
- 5. 人人文库网仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对用户上传分享的文档内容本身不做任何修改或编辑,并不能对任何下载内容负责。
- 6. 下载文件中如有侵权或不适当内容,请与我们联系,我们立即纠正。
- 7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。
最新文档
- 陶瓷销售年终工作总结5篇
- 普通护士实习个人小结参考五篇
- 铺面房屋租赁5篇
- 豆制品深加工技改扩建项目可行性实施报告
- 滑雪场项目可行性研究报告
- 请遗骨协议书
- 三轮车事故协议书
- 山西焦煤就业协议书
- 酒店销售经理个人工作计划模板5篇
- 地下管廊机械施工合同
- 2024年福建省托育服务职业技能竞赛理论考试题库(含答案)
- 2024下半年江苏苏州城市学院招聘管理岗位工作人员27人历年(高频重点提升专题训练)共500题附带答案详解
- 二年级乘除法口算题大全500题(可直接打印)
- 测量复核记录
- 建造节活动策划书
- sk239g报警器说明书
- 半导体芯片项目创业计划书(参考范文)
- 困难职工基本情况汇总统计表
- 档案统计台帐
- 七大浪费实战案例(消除企业中的浪费)
- 停用常压储罐管理办法
评论
0/150
提交评论