Dependency Parsing by Belief Propagation
David A. Smith (JHU; now UMass Amherst)
Jason Eisner (Johns Hopkins University)

Outline
- Edge-factored parsing (old)
  - Dependency parses
  - Scoring the competing parses: edge features
  - Finding the best parse
- Higher-order parsing (new!)
  - Throwing in more features: graphical models
  - Finding the best parse: belief propagation
  - Experiments
- Conclusions

Word Dependency Parsing (slide adapted from Yuji Matsumoto)
- Raw sentence: "He reckons the current account deficit will narrow to only 1.8 billion in September."
- Part-of-speech tagging yields the POS-tagged sentence: PRP VBZ DT JJ NN NN MD VB TO RB CD CD IN NNP .
- Word dependency parsing then attaches each word to its head, giving a word-dependency-parsed sentence whose edges carry labels such as SUBJ, ROOT, S-COMP, SPEC, MOD, and COMP.

What does parsing have to do with belief propagation? Loopy belief propagation.
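As a concrete picture of what such a parser outputs, here is a minimal sketch (not from the talk) of the data structure a dependency parse boils down to: one head index per word, with each edge additionally carrying a label such as SUBJ or MOD. The particular attachments below are an illustrative analysis of a sentence used later in the deck, not an annotation taken from the slides.

    # one head index per word; the attachments shown are illustrative only
    sentence = ["The", "cat", "in", "the", "hat", "wore", "a", "stovepipe", "."]
    heads    = [2, 6, 2, 5, 3, 0, 8, 6, 6]   # 1-based head index; 0 = artificial ROOT
    for i, (word, h) in enumerate(zip(sentence, heads), start=1):
        parent = "ROOT" if h == 0 else sentence[h - 1]
        print("%d  %-9s <- %s" % (i, word, parent))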

Outline (repeated between sections; the deck now turns to edge-factored parsing).

Great ideas in NLP: log-linear models (Berger, della Pietra & della Pietra 1996; Darroch & Ratcliff 1972)
- In the beginning, we used generative models: p(A) * p(B | A) * p(C | A,B) * p(D | A,B,C) * ...
  - Each choice depends on a limited part of the history, but which dependencies should we allow? What if they're all worthwhile?
  - p(D | A,B,C)? Or p(D | A,B) * p(C | A,B,D)?
- Solution: log-linear (max-entropy) modeling.
  - Features may interact in arbitrary ways.
  - Iterative scaling keeps adjusting the feature weights until the model agrees with the training data.
  - Which dependencies to allow (given limited training data)? Throw them all in:
    (1/Z) * Φ(A) * Φ(B,A) * Φ(C,A) * Φ(C,B) * Φ(D,A,B) * Φ(D,B,C) * Φ(D,A,C) * ...
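To make the slide's product-of-factors notation concrete, here is a minimal sketch (not from the talk; the variables, features, and weights are illustrative) of a log-linear model: an unnormalized score built by multiplying overlapping factors, then normalized by Z.

    from itertools import product
    import math

    def phi(feature_on, weight):
        # each factor contributes exp(weight) when its feature fires, else 1
        return math.exp(weight) if feature_on else 1.0

    def unnormalized(a, b, c):
        return (phi(a == 1, 0.5)                 # Phi(A): prefers A on
                * phi(a == b, 1.2)               # Phi(B, A): B agrees with A
                * phi(b == 1 and c == 1, -0.7))  # Phi(C, B): penalizes both on

    Z = sum(unnormalized(*cfg) for cfg in product([0, 1], repeat=3))
    p = {cfg: unnormalized(*cfg) / Z for cfg in product([0, 1], repeat=3)}
    print(sum(p.values()))   # 1.0 (up to rounding): a proper distribution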

How about structured outputs?
- Log-linear models are great for n-way classification.
- Also good for predicting sequences ("find preferred tags"), but to allow fast dynamic programming, only use n-gram features.
- Also good for dependency parsing ("find preferred links"), but to allow fast dynamic programming or MST parsing, only use single-edge features.

Edge-Factored Parsers (McDonald et al. 2005)
Running example (Czech): "Byl jasný studený dubnový den a hodiny odbíjely třináctou."
Gloss: "It was a bright cold day in April and the clocks were striking thirteen."
POS tags: V A A A N J N V C

Is this a good edge (jasný modifying den)?
- Yes: lots of "green" evidence.
- Lexical feature: jasný + den ("bright day").
- Backed-off features: jasný + N ("bright NOUN"); A + N; A + N preceding a conjunction.

How about this competing edge (jasný modifying hodiny)?
- Not as good: lots of "red" evidence.
- Lexical feature: jasný + hodiny ("bright clocks") ... undertrained ...
- Backed-off features: jasn + hodi ("bright clock", stems only — the stemmed sentence is "byl jasn stud dubn den a hodi odb tin"); A + N with a number clash (singular adjective, plural noun); A + N where the N follows a conjunction.

Which edge is better, "bright day" or "bright clocks"?
- Score of an edge e = θ · features(e), where θ is our current weight vector.
- Standard algorithms then find the valid parse with maximum total score.
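A minimal sketch (not code from the talk) of this scoring scheme: θ · features(e), with features that back off from the word pair to stems, POS tags, and agreement, as in the example above. The field names, feature templates, direction convention, and weights are assumptions for illustration.

    def edge_features(parent, child):
        # parent/child are dicts with "form", "stem", "pos", "number" (assumed fields)
        return {
            "pair=%s<%s"     % (parent["form"], child["form"]): 1.0,
            "stempair=%s<%s" % (parent["stem"], child["stem"]): 1.0,
            "pospair=%s<%s"  % (parent["pos"],  child["pos"]):  1.0,
            "number_mismatch": 1.0 if parent["number"] != child["number"] else 0.0,
        }

    def edge_score(parent, child, theta):
        return sum(theta.get(f, 0.0) * v for f, v in edge_features(parent, child).items())

    theta  = {"pospair=N<A": 2.0, "stempair=den<jasn": 1.5, "number_mismatch": -3.0}
    jasny  = {"form": "jasný",  "stem": "jasn", "pos": "A", "number": "sg"}
    den    = {"form": "den",    "stem": "den",  "pos": "N", "number": "sg"}
    hodiny = {"form": "hodiny", "stem": "hodi", "pos": "N", "number": "pl"}
    print(edge_score(den, jasny, theta), edge_score(hodiny, jasny, theta))
    # the well-attested "bright day" edge outscores the undertrained "bright clocks" edge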

Hard constraints on valid trees
- Certain pairs of edges can't both be used: each word gets only one parent, and links may not cross.
- Certain triples of edges can't all be used: no cycles.
- Thus, an edge may lose (or win) because of a consensus of other edges.

Outline (repeated; next: finding the best parse).

Finding the highest-scoring parse
- Example: "The cat in the hat wore a stovepipe. ROOT"
- Convert to a context-free grammar (CFG), then use dynamic programming; each subtree is a linguistic constituent (here, "the cat in the hat" is a noun phrase).
- The CKY algorithm for CFG parsing is O(n^3). Unfortunately, it is O(n^5) in this case: to score the "cat → wore" link it is not enough to know the subtree is an NP; we must know it is rooted at "cat". So the nonterminal set expands by a factor of O(n) — NP_the, NP_cat, NP_hat, ... — and CKY's "grammar constant" is no longer constant.
- Solution: use a different decomposition (Eisner 1996), which brings us back to O(n^3).

Spans vs. constituents: two kinds of substring
- Constituent of the tree: links to the rest of the sentence only through its headword (root).
- Span of the tree: links to the rest of the sentence only through its endwords.
- Decomposing a tree into spans: "The cat in the hat wore a stovepipe. ROOT" breaks into overlapping spans such as "The cat", "cat in the hat wore", "cat in", "in the hat wore", "in the hat", "hat wore", and "wore a stovepipe. ROOT", with adjacent spans sharing an endword.
- We can play the usual dynamic-programming parsing tricks: further refine the constituents or spans; let the probability model keep track of even more internal information; use A*, best-first, or coarse-to-fine search. Training by EM etc. requires "outside" probabilities of constituents, spans, or links.
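As a concrete picture of the O(n^3) decomposition just mentioned, here is a minimal sketch (not code from the talk) of the first-order projective recurrence of Eisner (1996), in its max-score form. The edge-score matrix, its indexing convention (row = head, column = modifier, index 0 = ROOT), and the toy numbers are assumptions for illustration.

    def eisner_max_score(score):
        n = len(score)               # tokens, including the artificial ROOT at index 0
        NEG = float("-inf")
        # incomplete[i][j][d]: best span i..j that still expects children on one side;
        # complete[i][j][d]:   best span i..j whose dependent side is finished.
        # d = 1 means the head is the left endpoint i; d = 0 means the head is j.
        incomplete = [[[NEG, NEG] for _ in range(n)] for _ in range(n)]
        complete   = [[[NEG, NEG] for _ in range(n)] for _ in range(n)]
        for i in range(n):
            complete[i][i][0] = complete[i][i][1] = 0.0

        for width in range(1, n):
            for i in range(n - width):
                j = i + width
                # attach one endpoint to the other (adds exactly one edge score)
                best = max(complete[i][k][1] + complete[k + 1][j][0] for k in range(i, j))
                incomplete[i][j][0] = best + score[j][i]   # j is the head of i
                incomplete[i][j][1] = best + score[i][j]   # i is the head of j
                # finish the span by absorbing an already-built incomplete span
                complete[i][j][0] = max(complete[i][k][0] + incomplete[k][j][0]
                                        for k in range(i, j))
                complete[i][j][1] = max(incomplete[i][k][1] + complete[k][j][1]
                                        for k in range(i + 1, j + 1))
        return complete[0][n - 1][1]   # best projective tree headed by ROOT

    # toy 4-token input (ROOT + 3 words), illustrative scores only
    toy = [[0, 1, 10, 1], [0, 0, 0, 0], [0, 9, 0, 8], [0, 0, 2, 0]]
    print(eisner_max_score(toy))   # 27.0: ROOT -> word 2, word 2 -> word 1, word 2 -> word 3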

Hard constraints on valid trees (recap)
- Score of an edge e = θ · features(e); standard algorithms find the valid parse with maximum total score, subject to one parent per word, no crossing links, and no cycles. An edge may lose (or win) because of a consensus of other edges.

Non-projective parses
- "No crossing links" is the projectivity restriction. Do we really want it?
- English (occasional non-projectivity): "I'll give a talk tomorrow on bootstrapping" — the subtree rooted at "talk" is a discontiguous noun phrase.
- Latin (frequent non-projectivity): "ista meam norit gloria canitiem" (that-NOM my-ACC may-know glory-NOM going-gray-ACC), "That glory may know my going-gray" (i.e., it shall last till I go gray).

Finding the highest-scoring non-projective tree (slide thanks to Dragomir Radev)
- Consider the sentence "John saw Mary" (left figure). The Chu-Liu-Edmonds algorithm finds the maximum-weight spanning tree (right figure), which may be non-projective, in O(n^2) time.
- [Figure: a weighted directed graph over ROOT, John, saw, Mary; the chosen tree takes ROOT → saw (10), saw → John (30), saw → Mary (30).]
- Idea: every node selects its best parent; if that creates cycles, contract them and repeat.
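A minimal sketch (not code from the talk) of just that first step: each non-root node picks its highest-scoring parent, and we check whether the choices form a cycle. The cycle contraction and re-expansion that complete Chu-Liu-Edmonds are not shown, and the weights below are illustrative rather than the ones in the slide's figure.

    def greedy_parents(score, root=0):
        """score[h][m]: weight of edge h -> m; returns {node: best parent}."""
        n = len(score)
        return {m: max((h for h in range(n) if h != m), key=lambda h: score[h][m])
                for m in range(n) if m != root}

    def find_cycle(parent):
        for start in parent:                      # follow parent pointers from each node
            seen, node = set(), start
            while node in parent and node not in seen:
                seen.add(node)
                node = parent[node]
            if node in parent and node == start:  # walked back to the start: a cycle
                return seen
        return None

    score = [[0, 5, 1, 1],
             [0, 0, 8, 2],
             [0, 9, 0, 3],
             [0, 4, 6, 0]]
    parent = greedy_parents(score)
    print(parent, find_cycle(parent))   # nodes 1 and 2 choose each other: a cycle to contract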

Summing over all non-projective trees
- The best tree can be found in O(n^2) time. But how about the total weight Z of all trees? How about outside probabilities or gradients?
- These can be found in O(n^3) time by matrix determinants and inverses (Smith & Smith, 2007). (Slide thanks to Dragomir Radev.)

Graph theory to the rescue: Tutte's Matrix-Tree Theorem (1948)
- The determinant of the Kirchhoff (a.k.a. Laplacian) adjacency matrix of a directed graph G, with row and column r struck out, equals the sum of the scores of all directed spanning trees of G rooted at node r.
- Exactly the Z we need, in O(n^3) time!

Building the Kirchhoff (Laplacian) matrix
- Negate the edge scores; put each column's score sum on the diagonal (one column per child); strike out the root row and column; take the determinant.
- N.B.: this construction allows multiple children of the root, but see Koo et al. 2007.

Why should this work?
- By analogy with Chu-Liu-Edmonds: every node selects its best parent; if there are cycles, contract and recur.
- The theorem is clear for a 1×1 matrix; use induction. (The classical statement is for the undirected case; the directed case needs special handling of the root.)
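A minimal sketch (not code from the talk) of exactly the recipe above, using numpy; the edge scores in the example are illustrative.

    import numpy as np

    def tree_partition_function(s):
        """s[h, m] >= 0 is the multiplicative score of edge h -> m (s[m, m] is ignored)."""
        n = s.shape[0]                           # nodes 0..n-1, with node 0 the artificial ROOT
        K = -s.copy()                            # negate edge scores ...
        np.fill_diagonal(K, 0.0)
        K[np.arange(n), np.arange(n)] = -K.sum(axis=0)   # ... put column sums on the diagonal
        return float(np.linalg.det(K[1:, 1:]))   # strike the root row/column, take determinant

    s = np.array([[0.0, 2.0, 3.0],
                  [0.0, 0.0, 5.0],
                  [0.0, 7.0, 0.0]])
    # the three trees rooted at ROOT score 2*3, 2*5, and 3*7, so Z should be 37
    # (note the first of these gives ROOT two children, matching the slide's caveat)
    print(tree_partition_function(s))   # -> 37.0 (up to floating point)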

Outline (repeated; next: higher-order parsing, throwing in more features).

Exactly finding the best parse
- With arbitrary features, runtime blows up.
- Projective parsing: O(n^3) by dynamic programming with single-edge features; grandparent features push it to O(n^4); grandparent + sibling bigrams to O(n^5); POS trigrams to O(n^3 g^6); non-adjacent sibling pairs to O(2^n).
- Non-projective parsing: O(n^2) by minimum spanning tree with single-edge features; but any of the above features, soft penalties for crossing links, or pretty much anything else makes it NP-hard.

This paper in a nutshell: let's reclaim our freedom (again!)
- The output probability is a product of local factors: (1/Z) * Φ(A) * Φ(B,A) * Φ(C,A) * Φ(C,B) * Φ(D,A,B) * Φ(D,B,C) * Φ(D,A,C) * ...
- Throw in any factors we want (a log-linear model). But then how could we find the best parse?
  - Integer linear programming (Riedel et al., 2006): doesn't give us probabilities when training or parsing.
  - MCMC: slow to mix? High rejection rate because of the hard TREE constraint?
  - Greedy hill-climbing (McDonald & Pereira 2006).
  - None of these exploit the tree structure of parses the way the first-order methods do.
- Instead, let the local factors negotiate via "belief propagation":
  - Links (and tags) reinforce or suppress one another.
  - Each iteration takes total time O(n^2) or O(n^3).
  - It converges to a pretty good (but approximate) global parse.
  - Certain global factors are OK too: each global factor can be handled fast via some traditional parsing algorithm (e.g., inside-outside).
- Training with many features vs. decoding with many features:
  - iterative scaling vs. belief propagation;
  - each weight in turn is influenced by the others vs. each variable in turn is influenced by the others;
  - iterate to achieve globally optimal weights vs. iterate to achieve locally consistent beliefs;
  - to train a distribution over trees, use dynamic programming to compute the normalizer Z vs. to decode a distribution over trees, use dynamic programming to compute messages.
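For reference, the "messages" mentioned here are the standard sum-product belief-propagation updates (stated for completeness; the preview slides do not spell them out). For a factor f with potential \psi_f over variables \mathbf{x}_f, and a variable v:

    \mu_{f \to v}(x_v) \propto \sum_{\mathbf{x}_f \sim x_v} \psi_f(\mathbf{x}_f) \prod_{u \in N(f) \setminus \{v\}} \mu_{u \to f}(x_u)
    \mu_{v \to f}(x_v) \propto \prod_{g \in N(v) \setminus \{f\}} \mu_{g \to v}(x_v)
    b_v(x_v) \propto \prod_{g \in N(v)} \mu_{g \to v}(x_v)

where the sum in the first line ranges over assignments to f's variables that agree with x_v, and b_v is the belief (approximate marginal) at v.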

Outline (repeated; next: throwing in more features with graphical models).

Local factors in a graphical model: first, a familiar example — a Conditional Random Field (CRF) for POS tagging.
- Observed input sentence (shaded in the figures): "find preferred tags".
- A possible tagging, i.e., an assignment to the remaining variables: v v v. Another possible tagging: v a n.
- A binary factor measures the compatibility of two adjacent tags; the model reuses the same parameters at each position:

          v  n  a
      v   0  2  1
      n   2  1  0
      a   0  3  1
      (row = left tag, column = right tag)

- A "unary" factor evaluates each tag; its values depend on the corresponding word (and could be made to depend on the entire observed sentence). There is a different unary factor at each position: for "find", v 0.3, n 0.02, a 0; for "preferred", v 0.3, n 0, a 0.1; for "tags", v 0.2, n 0.2, a 0 ("can't be adj").
- p(v a n) is proportional to the product of all the factors' values on the tagging v a n: 1 * 3 * 0.3 * 0.1 * 0.2.
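A minimal sketch that just reproduces the arithmetic above, using the factor values shown on these slides; the last two lines additionally normalize over all 27 taggings to turn the score into an actual probability.

    from itertools import product

    PAIR = {("v", "v"): 0, ("v", "n"): 2, ("v", "a"): 1,
            ("n", "v"): 2, ("n", "n"): 1, ("n", "a"): 0,
            ("a", "v"): 0, ("a", "n"): 3, ("a", "a"): 1}
    UNARY = {"find":      {"v": 0.3, "n": 0.02, "a": 0.0},
             "preferred": {"v": 0.3, "n": 0.0,  "a": 0.1},
             "tags":      {"v": 0.2, "n": 0.2,  "a": 0.0}}
    words = ["find", "preferred", "tags"]

    def unnormalized(tagging):
        p = 1.0
        for w, t in zip(words, tagging):
            p *= UNARY[w][t]                  # unary factor at each position
        for t1, t2 in zip(tagging, tagging[1:]):
            p *= PAIR[(t1, t2)]               # binary factor on each adjacent tag pair
        return p

    print(unnormalized(("v", "a", "n")))           # 0.3 * 0.1 * 0.2 * 1 * 3 = 0.018
    Z = sum(unnormalized(t) for t in product("vna", repeat=3))
    print(unnormalized(("v", "a", "n")) / Z)       # the normalized probability p(v a n)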

Now let's do dependency parsing!
- Use O(n^2) boolean variables, one per possible link; for the three-word input "find preferred links" there are six word-to-word link variables.
- A possible parse is encoded as an assignment to these variables (t f f t f f in the figure); another possible parse is another assignment (f f t f t f).
- Some assignments are illegal parses: in one, the "true" links form a cycle; in another, some word ends up with multiple parents.
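A minimal sketch (not code from the talk) of the legality test these figures illustrate: an assignment to the link variables is a legal parse only if exactly one word is left without a parent, every other word has exactly one parent, and there are no cycles. This is exactly the 0/1 test that the global TREE factor introduced just below encodes.

    def is_tree(links, n):
        """links: set of (head, dependent) pairs over words 0..n-1 marked 'true'."""
        parent = {}
        for h, d in links:
            if d in parent:                 # multiple parents: illegal
                return False
            parent[d] = h
        roots = [w for w in range(n) if w not in parent]
        if len(roots) != 1:                 # exactly one word may be left parentless
            return False
        for w in range(n):                  # follow parent pointers; they must not loop
            seen = set()
            while w in parent:
                if w in seen:               # cycle: illegal
                    return False
                seen.add(w)
                w = parent[w]
        return True

    # "find preferred links": words 0, 1, 2
    print(is_tree({(0, 1), (1, 2)}, 3))            # True  -> TREE factor value 1
    print(is_tree({(0, 1), (1, 2), (2, 0)}, 3))    # False: the links form a cycle, no root
    print(is_tree({(0, 2), (1, 2)}, 3))            # False: word 2 has two parents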

So what factors shall we multiply to define the parse probability?
- Unary factors that evaluate each link in isolation, e.g., t 2 / f 1 for one link and t 1 / f 2, t 1 / f 2, t 1 / f 6, t 1 / f 3, t 1 / f 8 for the others. As before, the goodness of a link can depend on the entire observed input context; given this input sentence, some other links just aren't as good.
- But what if the best assignment isn't a tree?
- Add a global TREE factor requiring that the links form a legal tree. This is a "hard" constraint: the factor's value is either 0 or 1. Over the six link variables of "find preferred links" it is a table with 2^6 = 64 entries, e.g., f f f f f f → 0, f f f f f t → 0, f f f f t f → 0, ..., f f t f f t → 1 ("we're legal!"), ..., t t t t t t → 0.
- Optionally, the factor can also require the tree to be projective (no crossing links).
- So far, this model is equivalent to edge-factored parsing (McDonald et al. 2005). Note: McDonald et al. (2005) don't loop through this table to consider exponentially many trees one at a time; they use combinatorial algorithms, and so should we!
- Second-order effects: factors on two link variables, e.g., a grandparent factor ...
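Looking ahead to how the deck's belief propagation ties these pieces together: a minimal sketch (not code from the paper) of sum-product message passing on a tiny piece of such a factor graph — two boolean link variables with unary factors, coupled by a second-order ("grandparent"-style) factor. The TREE factor is omitted here; as the earlier slides note, such global factors get their own fast propagators (e.g., inside-outside). All numeric values are illustrative.

    def normalize(m):
        s = sum(m)
        return [x / s for x in m]

    u1 = [1.0, 2.0]          # unary factor on link 1: values for (false, true)
    u2 = [6.0, 1.0]          # unary factor on link 2
    g = [[1.0, 1.0],         # g[x1][x2]: pairwise factor; rewards both links being true
         [1.0, 4.0]]

    # variable -> factor messages, initialized to uniform
    m1_to_g = [0.5, 0.5]
    m2_to_g = [0.5, 0.5]
    for _ in range(5):                               # iterate to (quick) convergence
        # factor -> variable messages: sum out the other variable
        g_to_1 = normalize([sum(g[x1][x2] * m2_to_g[x2] for x2 in (0, 1)) for x1 in (0, 1)])
        g_to_2 = normalize([sum(g[x1][x2] * m1_to_g[x1] for x1 in (0, 1)) for x2 in (0, 1)])
        # variable -> factor messages: product of the other incoming messages (here just the unary)
        m1_to_g = normalize([u1[x] for x in (0, 1)])
        m2_to_g = normalize([u2[x] for x in (0, 1)])

    belief1 = normalize([u1[x] * g_to_1[x] for x in (0, 1)])
    belief2 = normalize([u2[x] * g_to_2[x] for x in (0, 1)])

    # exact marginals by brute force, for comparison
    joint = [[u1[x1] * u2[x2] * g[x1][x2] for x2 in (0, 1)] for x1 in (0, 1)]
    Z = sum(sum(row) for row in joint)
    exact1 = [sum(joint[x1]) / Z for x1 in (0, 1)]
    print(belief1, exact1)   # the BP belief matches the exact marginal (no loops here, so BP is exact)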
