生物资讯相关演算法AlgorithmsinBioinformatics-课件_第1页
生物资讯相关演算法AlgorithmsinBioinformatics-课件_第2页
生物资讯相关演算法AlgorithmsinBioinformatics-课件_第3页
生物资讯相关演算法AlgorithmsinBioinformatics-课件_第4页
生物资讯相关演算法AlgorithmsinBioinformatics-课件_第5页
已阅读5页,还剩90页未读 继续免费阅读

下载本文档

版权说明:本文档由用户提供并上传,收益归属内容提供方,若内容存在侵权,请进行举报或认领

文档简介

1、生物資訊相關演算法Algorithms in Bioinformatics呂學一 (中央研究院 資訊科學所).tw/hil/hil/2019/10/141Algorithms in Bioinformatics, Lecture 4生物資訊相關演算法Algorithms in BioinfReminderDont forget to do Homework 2!2019/10/142Algorithms in Bioinformatics, Lecture 4ReminderDont forget to do HomTodayConstructing suffix tree in linear

2、 time.Applications of suffix treesSubstring problem (暖身)“Exact string matching” revisitedLinearization of circular string (挪移乾坤)Longest common substring (異中求同)Intermission小巨s magic show: “flaming”2019/10/143Algorithms in Bioinformatics, Lecture 4TodayConstructing suffix tree Constructing suffix tree

3、 in linear time2019/10/144Algorithms in Bioinformatics, Lecture 4Constructing suffix tree in liTaking a closer lookCase-1 Step2019/10/145Algorithms in Bioinformatics, Lecture 4Taking a closer lookCase-1 StCloser look at Case-1 stepAlways occurs at a leaf (growing point).At the beginning of iteration

4、 k, the path above a leaf has label ?, k1.Each Case-1 step in iteration k simply changes the label ?, k1 to ?, k.?, k1?, k2019/10/146Algorithms in Bioinformatics, Lecture 4Closer look at Case-1 stepAlwaAn idea for Case-1 stepsUse ?, - to label the path above each leaf.Thus, no need to do anything fo

5、r each case-1 step.?, -2019/10/147Algorithms in Bioinformatics, Lecture 4An idea for Case-1 stepsUse ?Taking a closer lookCase-2 Steps: 節外生枝2019/10/148Algorithms in Bioinformatics, Lecture 4Taking a closer lookCase-2 StCloser look at Case-2 stepAlways occurs at a growing point that is not a leaf, wh

6、ich is not necessarily an internal node.When the growing point is an internal node Label = k, -, where k is the current iteration index.k, -2019/10/149Algorithms in Bioinformatics, Lecture 4Closer look at Case-2 stepAlwaCloser look at Case-2 stepWhen the growing point is not an internal node k, -ti,

7、 ji+t, ji, i+t-12019/10/1410Algorithms in Bioinformatics, Lecture 4Closer look at Case-2 stepWhenTaking a closer lookCase-3 Steps: 若無其事2019/10/1411Algorithms in Bioinformatics, Lecture 4Taking a closer lookCase-3 StCloser look at Case-3 stepAlways occurs at a growing point that is not a leaf, which

8、is not necessarily an internal node. When the new position of the growing point is an internal node ti, j02019/10/1412Algorithms in Bioinformatics, Lecture 4Closer look at Case-3 stepAlwaCloser look at Case-3 stepWhen the new position of the growing point is not an internal node ti, jt + 12019/10/14

9、13Algorithms in Bioinformatics, Lecture 4Closer look at Case-3 stepWhenUsing ?,- for each leafS =bbabbaab1,-2,-3,-3,-3,37,-4,-3,37,-4,-2,37,-4,-11,1121112019/10/1414Algorithms in Bioinformatics, Lecture 4Using ?,- for each leafS =bbSaving a lot of effortsWe can simply ignore all Case-1 steps.Recall

10、that the number of Case-2 steps is at most |S|.Q: Is this good enough?Case 1: 長此以往Case 2: 節外生枝Case 3: 若無其事1 2 3 4 5 6 7 8b b a b b a a b b a b b a a b a b b a a b b b a a b b a a b a a b a b b 2019/10/1415Algorithms in Bioinformatics, Lecture 4Saving a lot of effortsWe can ExerciseGive a sequence S

11、such that the number of Case-3 steps throughout the process of growing its suffix trie (or suffix tree) is W(|S|2).2019/10/1416Algorithms in Bioinformatics, Lecture 4ExerciseGive a sequence S suchHow does Ukkonen overcome the problem of too many Case-3 (若無其事) steps?Completely ignore them以若無其事的態度處理若無

12、其事的狀況2019/10/1417Algorithms in Bioinformatics, Lecture 4How does Ukkonen overcome the Why Case-3 steps?ti, j0ti, jt + 12019/10/1418Algorithms in Bioinformatics, Lecture 4Why Case-3 steps?ti, j0ti, Why Case-3 steps?For correctly maintaining the position of each growing point. (Why?)For correctly runn

13、ing Case-2 steps. (By what?)k, -ti, ji+t, ji, i+t-12019/10/1419Algorithms in Bioinformatics, Lecture 4Why Case-3 steps?For correctlyWhat if Saving the book keeping efforts in all Case-3 steps 2019/10/1420Algorithms in Bioinformatics, Lecture 4What if Saving the book keepiSaving even more effortsCase

14、 1: 長此以往Case 2: 節外生枝Case 3: 若無其事1 2 3 4 5 6 7 8b b a b b a a b b a b b a a b a b b a a b b b a a b b a a b a a b a b b 2019/10/1421Algorithms in Bioinformatics, Lecture 4Saving even more effortsCase 1Rough ideaJust keep one current growing point throughout the execution.Deriving the new position of

15、the current growing point from its previous position (with the help of suffix links (斷頭指標)2019/10/1422Algorithms in Bioinformatics, Lecture 4Rough ideaJust keep one currenOnly one growing pointCase 2: 節外生枝Case 3: 若無其事1 2 3 4 5 6 7 8b b a b b a a b b a b b a a b a b b a a b b b a a b b a a b a a b a

16、b b The challenges: How do we derive the position of the current growing point?Vertically (case 2)Horizontally (case 3)Q: Which one is easier?2019/10/1423Algorithms in Bioinformatics, Lecture 4Only one growing pointCase 2: Horizontally, Moving from iteration k 1 to iteration k.The growing point does

17、 not move!This is the easier case.Case 2: 節外生枝Case 3: 若無其事1 2 3 4 5 6 7 8b b a b b a a b b a b b a a b a b b a a b b b a a b b a a b a a b a b b 2019/10/1424Algorithms in Bioinformatics, Lecture 4Horizontally, Moving from iteHorizontal movement1 2 3 4 5 6 7 8b b a b b a a b b a b b a a b a b b a a b

18、 b b a a b b a a b a a b a b b Case 1: 長此以往Case 2: 節外生枝Case 3: 若無其事2019/10/1425Algorithms in Bioinformatics, Lecture 4Horizontal movement1 2 3 4 5 6Vertically, Moving from Step i to Step i+1 in the same iteration.The growing point moves dramatically.This is the tougher case.Case 2: 節外生枝Case 3: 若無其事1

19、 2 3 4 5 6 7 8b b a b b a a b b a b b a a b a b b a a b b b a a b b a a b a a b a b b 2019/10/1426Algorithms in Bioinformatics, Lecture 4Vertically, Moving from Step Vertical movement1 2 3 4 5 6 7 8b b a b b a a b b a b b a a b a b b a a b b b a a b b a a b a a b a b b Case 1: 長此以往Case 2: 節外生枝Case 3

20、: 若無其事2019/10/1427Algorithms in Bioinformatics, Lecture 4Vertical movement1 2 3 4 5 6 7斷頭指標 (suffix link)前人種樹後人涼的哲學2019/10/1428Algorithms in Bioinformatics, Lecture 4斷頭指標 (suffix link)前人種樹後人涼的哲學前人種樹後人涼每次千辛萬苦找到vertical movement的目的時, 把這個movement的起點與終點用一個link記錄下來.下回遇到這個起點時, 就可以直接走到終點去,不用再重新找一次.這些link就叫

21、做suffix link (斷頭指標).2019/10/1429Algorithms in Bioinformatics, Lecture 4前人種樹後人涼每次千辛萬苦找到vertical moveme為何稱為斷頭指標?終點所對應的字串,是起點所對應之字串的斷頭字串(second suffix)Case 2: 節外生枝Case 3: 若無其事1 2 3 4 5 6 7 8b b a b b a a b b a b b a a b a b b a a b b b a a b b a a b a a b a b b 2019/10/1430Algorithms in Bioinformatics,

22、 Lecture 4為何稱為斷頭指標?終點所對應的字串,是起點所對應之字串的斷頭指標的性質(1)每個斷頭指標的起點一定是個internal node, 不會葉子不會是某個suffix tree edge的中間Why?Case 2: 節外生枝Case 3: 若無其事1 2 3 4 5 6 7 8b b a b b a a b b a b b a a b a b b a a b b b a a b b a a b a a b a b b 2019/10/1431Algorithms in Bioinformatics, Lecture 4斷頭指標的性質(1)每個斷頭指標的起點一定是個int斷頭指標

23、的性質(2)每個internal node一定是某個斷頭指標的起點Why?Case 2: 節外生枝Case 3: 若無其事1 2 3 4 5 6 7 8b b a b b a a b b a b b a a b a b b a a b b b a a b b a a b a a b a b b 2019/10/1432Algorithms in Bioinformatics, Lecture 4斷頭指標的性質(2)每個internal node一定是運用斷頭指標S =bbabbaab1,-2,-3,-3,-3,37,-4,-3,37,-4,-2,37,-4,-11,1121111 2 3 4

24、5 6 7 8b b a b b a a b b a b b a a b a b b a a b b b a a b b a a b a a b a b b 2019/10/1433Algorithms in Bioinformatics, Lecture 4運用斷頭指標S =bbabbaab1,-2,-Moving the growing point遇到必須節外生枝(case 2),就在長好新樹枝之後,移到斷頭位置,繼續從那裡檢查目前這個Sk是case 2還是case 3.遇到若無其事(case 3),就直接按照Sk的方向移動,繼續從那裡檢查下一個字元Sk+1是case 2還是case 3.

25、2019/10/1434Algorithms in Bioinformatics, Lecture 4Moving the growing point遇到必須節 1234567890123456 S=bbabbaabbbabbaab11,13,1,12,3,127,2,34,7,14,17,3,34,1210,4,56,110,2,23,323453,32019/10/1435Algorithms in Bioinformatics, Lecture 4 1234567890123456 S=bbabbaaTraversal with the help of suffix links: pha

26、se (1)Going up to a closest internal node (whose suffix link must be available). Suppose this upward traversal passes through t characters.Following the suffix link that starts from this internal node.ti, j2019/10/1436Algorithms in Bioinformatics, Lecture 4Traversal with the help of sufTraversal wit

27、h the help of suffix links: phase (2)Going down by matching the t-character substring Si, i + t 1 of S.ti, j2019/10/1437Algorithms in Bioinformatics, Lecture 4Traversal with the help of sufRunning Time?Navely: O(t).Cleverly: O(1+ d), where d is the number of internal nodes being went through during

28、phase (2). ti, j2019/10/1438Algorithms in Bioinformatics, Lecture 4Running Time?Navely: O(t).tiOverall Time = O(|S|)Suppose di is the d in the i-th Case-2-step traversal. It suffices to show d1+d2+d|S| =O(|S|).Case 2: 節外生枝Case 3: 若無其事1 2 3 4 5 6 7 8b b a b b a a b b a b b a a b a b b a a b b b a a b

29、 b a a b a a b a b b 2019/10/1439Algorithms in Bioinformatics, Lecture 4Overall Time = O(|S|)Suppose d = the “slack” of the growing pointThe slack means the distance between a position P and the closest internal node above P.ti, j2019/10/1440Algorithms in Bioinformatics, Lecture 4 = the “slack” of t

30、he growingcase-3 traversalEach case-3 traversal (i.e., horizontal movement) can only increase the value of by at most one. (It can even decrease the value of .)Case 2: 節外生枝Case 3: 若無其事1 2 3 4 5 6 7 8b b a b b a a b b a b b a a b a b b a a b b b a a b b a a b a a b a b b 2019/10/1441Algorithms in Bio

31、informatics, Lecture 4case-3 traversalEach case-3 trcase-2 traversalThe i-th case-2 traversal (i.e., vertical movement) decreases the value of by at least di.Case 2: 節外生枝Case 3: 若無其事1 2 3 4 5 6 7 8b b a b b a a b b a b b a a b a b b a a b b b a a b b a a b a a b a b b 2019/10/1442Algorithms in Bioin

32、formatics, Lecture 4case-2 traversalThe i-th case-d1+d2+d|S| = O(|S|)Initial = O(1). can be increased by one for at most |S| times (because there are at most |S| horizontal movements (i.e., case-3 traversals).Since is always non-negative, the above bound is proved.2019/10/1443Algorithms in Bioinform

33、atics, Lecture 4d1+d2+d|S| = O(|S|)Initial TodayConstructing suffix tree in linear time.Applications of suffix treesSubstring problem (暖身)“Exact string matching” revisitedLinearization of circular string (挪移乾坤)Longest common substring (異中求同)Intermission小巨s magic show2019/10/1444Algorithms in Bioinfo

34、rmatics, Lecture 4TodayConstructing suffix tree Application OneSubstring Problem (recap as a warm-up)2019/10/1445Algorithms in Bioinformatics, Lecture 4Application OneSubstring ProblSubstring ProblemInput: two strings P and S, where S is allowed to be preprocessed in O(|S|) time.Output: an occurrenc

35、e of P in S.Objective: done in O(|P|) time.2019/10/1446Algorithms in Bioinformatics, Lecture 4Substring ProblemInput: two st 12345678 S=bbabbaab1,13,1,12,3,127,2,34,7,14,17,3,34,13,32019/10/1447Algorithms in Bioinformatics, Lecture 4 12345678 S=bbabbaab1,1 12345678 S=bbabbaab1,17,2,34,7,4,7,3,34,3,3

36、Q: Where are abba, baa, bb?2019/10/1448Algorithms in Bioinformatics, Lecture 4 12345678 S=bbabbaab1,17 12345678 S=bbabbaab121,1347,2,34,57,4,67,3,34,3,33211Q: Where are abba, baa, bb?2019/10/1449Algorithms in Bioinformatics, Lecture 4 12345678 S=bbabbaab121,1Application TwoExact String Matching2019/

37、10/1450Algorithms in Bioinformatics, Lecture 4Application TwoExact String MaExact String MatchingInput: two strings P and S, where S is allowed to be preprocessed in O(|S|) time.Output: all occurrences of P in S.Challenge: solving this in O(|P| + k) time, where k is the number of occurrences of P in

38、 S.2019/10/1451Algorithms in Bioinformatics, Lecture 4Exact String MatchingInput: twIdeaEach internal node keeps the labels of all its descendant leaves.2019/10/1452Algorithms in Bioinformatics, Lecture 4IdeaEach internal node keeps t 12345678 S=bbabbaab121,1347,2,34,57,4,67,3,34,3,3Q: Somethings mi

39、ssing? 6,35,24,15,2,4,1Q: How do we fix this problem?2019/10/1453Algorithms in Bioinformatics, Lecture 4 12345678 S=bbabbaab121,1 123456789 S=bbabbaab$121,1347,2,34,57,4,67,3,34,3,35,24,15,2,4,1,8179,4,45,89,99,6,3,7Q: Obtainable in O(|S|) time?2019/10/1454Algorithms in Bioinformatics, Lecture 4 123

40、456789 S=bbabbaab$121,Perhaps notS = a a a a a $1654321,2,3,4,51,2,3,41,2,31,22019/10/1455Algorithms in Bioinformatics, Lecture 4Perhaps notS = a a a a a $165An observationConsider the sequence L of leaves from left to right. The descendant leaves of each internal node has to be consecutive in L.201

41、9/10/1456Algorithms in Bioinformatics, Lecture 4An observationConsider the seq 123456789 S=bbabbaab$121,1347,2,34,57,4,67,3,33,35,24,15,2,4,1,879,4,45,89,99,6,3,7 123456789L=6375241891,34,84,56,72019/10/1457Algorithms in Bioinformatics, Lecture 4 123456789 S=bbabbaab$121,Application ThreeCircular St

42、ring Linearization (挪移乾坤)2019/10/1458Algorithms in Bioinformatics, Lecture 4Application ThreeCircular StriNotationLet挪(S, i)denote the string Si|S| S1i 1.Si挪(S,i)2019/10/1459Algorithms in Bioinformatics, Lecture 4NotationLetSi挪(S,i)2019/10/145The problemInputa string S.Outputan index i that maximize

43、s the alphabetical order of 挪(S, i). 1 2 3 4 5 6 7 8挪(S,1) = b b a b b a a b挪(S,2) = b a b b a a b b挪(S,3) = a b b a a b b b挪(S,4) = b b a a b b b a挪(S,5) = b a a b b b a b挪(S,6) = a a b b b a b b挪(S,7) = a b b b a b b a挪(S,8) = b b b a b b a a2019/10/1460Algorithms in Bioinformatics, Lecture 4The p

44、roblemInput Nave algorithmlet j = 1;for i = 2 to |S| do if (挪(S,i) 挪(S,j) let j = i;output j;Time complexity?2019/10/1461Algorithms in Bioinformatics, Lecture 4Nave algorithmlet j = 1;Time Q: Can we beat O(|S|2)? 1 2 3 4 5 6 7 8挪(S,1) = b b a b b a a b挪(S,2) = b a b b a a b b挪(S,3) = a b b a a b b b

45、挪(S,4) = b b a a b b b a挪(S,5) = b a a b b b a b挪(S,6) = a a b b b a b b挪(S,7) = a b b b a b b a挪(S,8) = b b b a b b a a2019/10/1462Algorithms in Bioinformatics, Lecture 4Q: Can we beat O(|S|2)? Linear-Time Algorithm via Suffix Tree2019/10/1463Algorithms in Bioinformatics, Lecture 4Linear-Time Algor

46、ithm via SuffFirst attempt going right1 2 3 4 5 6 7 8b b a b b a a b b a b b a a b a b b a a b b b a a b b a a b a a b a b b Q: How to fix the problem?2019/10/1464Algorithms in Bioinformatics, Lecture 4First attempt going right1 2Second AttemptSuffix tree for SS2019/10/1465Algorithms in Bioinformati

47、cs, Lecture 4Second AttemptSuffix tree for Key observationEach length-|S| substring of SS is a 挪(S, j) for some index j with 1 j |S|.Each 挪(S, j) with 1 j |S| is a length-|S| substring of SS.2019/10/1466Algorithms in Bioinformatics, Lecture 4Key observationEach length-|S| 1234567890123456SS=bbabbaab

48、bbabbaab11,13,1,12,3,127,2,34,7,14,17,3,34,1210,4,56,110,2,23,323453,32019/10/1467Algorithms in Bioinformatics, Lecture 4 1234567890123456SS=bbabbaa1,17,4,7,4,7,3,310,4,56,10,2,23,33,3Q: How to use this suffix tree? 1234567890123456SS=bbabbaabbbabbaab2019/10/1468Algorithms in Bioinformatics, Lecture

49、 41,17,4,7,4,7,Equivalently, Output the index i such that SSi|SS| corresponds to the rightmost leaf of the suffix tree for SS.Clearly, this takes O(|S|) time.2019/10/1469Algorithms in Bioinformatics, Lecture 4Equivalently, Output the indeApplication FourLongest common substring (異中求同)2019/10/1470Alg

50、orithms in Bioinformatics, Lecture 4Application FourLongest commonThe problemInput: two strings A and B.Output: a longest string C that occurs in both A and B.A = b b b a b b a a bB = b a a b b a b b a bC = b bC = b a a bC = a b b aC = b b a b b a2019/10/1471Algorithms in Bioinformatics, Lecture 4Th

51、e problemInput: two strings Nave algorithmbuild suffix tree for B;for L = |A| downto 1 do for i = 1 to |A|-L+1 do if Aii+L-1 occurs in B output Aii+L-1 and halt; output “no common substring”;Time complexity?2019/10/1472Algorithms in Bioinformatics, Lecture 4Nave algorithmbuild suffix trO(|A|3+|B|)()

52、3|12|1|1|)()1|(|AOAiiAOiOiAALALAL=+-=+-=The for-loop takes timeCan we do better than this?2019/10/1473Algorithms in Bioinformatics, Lecture 4O(|A|3+|B|)()3|12|1|1|A faster algorithmbuild suffix tree for B;for i = 1 to |A| do find the largest integer L(i) such that Aii+L(i)-1 occurs in B by binary se

53、arch;output AiL(i) for the i with the largest L(i);Time complexity?2019/10/1474Algorithms in Bioinformatics, Lecture 4A faster algorithmbuild suffixO(|A|2 log|A|+|B|)The for-loop takes O(|A|2 log|A|) time.Each binary search takes time O(|A| log |A|).There are overall O(|A|) binary searches.Can we do

54、 better than this?2019/10/1475Algorithms in Bioinformatics, Lecture 4O(|A|2 log|A|+|B|)The for-loopDonald E. Knuth conjectured in 1970 that it is impossible to solve this longest common substring problem in O(|A|+|B|) time.2019/10/1476Algorithms in Bioinformatics, Lecture 4Donald E. Knuth conjecture

55、d inLongest Common Substringin O(|A|+|B|) time via suffix tree2019/10/1477Algorithms in Bioinformatics, Lecture 4Longest Common Substringin O(|IdeaConstruct a suffix tree T for A#B$, where # and $ are two characters not in A and B.There are exactly |A|+|B|+2 leaves in T, each leaf corresponds to a s

56、uffix of A#B$.A-leaf: with label in 1, 2, , |A|corresponds to an A-suffix.B-leaf: with label in |A|+2, ,|A|+|B|+1corresponds to a B-suffix.$#ABA-suffixB-suffix2019/10/1478Algorithms in Bioinformatics, Lecture 4IdeaConstruct a suffix tree T ObservationLet v be an arbitrary position of T (i.e., v is n

57、ot necessarily a node of T.)v has a descendant A-leaf if and only if v corresponds to a prefix of an A-suffix of A#B$.v has a descendant B-leaf if and only if v corresponds to a prefix of a B-suffix of A#B$.rootv2019/10/1479Algorithms in Bioinformatics, Lecture 4ObservationLet v be an arbitraLemmaLe

58、t v be a position of T.v has descendant A-leaf and B-suffix if and only if v corresponds to a common substring of A and B.root$#ABA-suffixB-suffixv2019/10/1480Algorithms in Bioinformatics, Lecture 4LemmaLet v be a position of TQuestionDo we really need # to separate A and B in the concatenated strin

59、g A#B$?root$ABA-suffixB-suffixv2019/10/1481Algorithms in Bioinformatics, Lecture 4QuestionDo we really need # The algorithmConstruct the suffix tree T of A#B$.Output the string corresponding to a deepest internal node v such that the subtree of T rooted at v contains both A-leaf and B-leaf.Q: why no

60、t checking leaves?Q: why not checking positions of T that are not internal nodes of T?2019/10/1482Algorithms in Bioinformatics, Lecture 4The algorithmConstruct the sufIt suffices to check internal nodesIf the position v contains both kinds of descendant leaves, then so does its closest internal node

温馨提示

  • 1. 本站所有资源如无特殊说明,都需要本地电脑安装OFFICE2007和PDF阅读器。图纸软件为CAD,CAXA,PROE,UG,SolidWorks等.压缩文件请下载最新的WinRAR软件解压。
  • 2. 本站的文档不包含任何第三方提供的附件图纸等,如果需要附件,请联系上传者。文件的所有权益归上传用户所有。
  • 3. 本站RAR压缩包中若带图纸,网页内容里面会有图纸预览,若没有图纸预览就没有图纸。
  • 4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
  • 5. 人人文库网仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对用户上传分享的文档内容本身不做任何修改或编辑,并不能对任何下载内容负责。
  • 6. 下载文件中如有侵权或不适当内容,请与我们联系,我们立即纠正。
  • 7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。

评论

0/150

提交评论