The Tradeoffs Between Open and Traditional Relation Extraction (translation, with the original English text in the same file)
Typically, the target relation (e.g., seminar location) is given to the RE system as input, together with hand-crafted extraction patterns or hand-labeled training examples (Brin, 1998; Riloff and Jones, 1999; Agichtein and Gravano, 2000). Such inputs are specific to the target relation. Open Information Extraction (OIE) removes this requirement: an OIE system extracts a rich set of relational tuples without any relation-specific input. For a corpus such as the Web, where the target relations are not given in advance and their number is very large, OIE is the ideal way to proceed. Traditional RE stands to the OIE paradigm roughly as a lexicalized parser stands to an unlexicalized one. Is it really feasible to discover relation-independent patterns, and what should such patterns look like? We first consider open extraction and ask how its precision and recall compare with those of relation-specific extraction, and whether combining the two can improve an IE system's performance.

This paper addresses these questions and makes the following contributions. We present O-CRF, a new OIE system, and show that it extracts a wide variety of relations, with a relative gain in F-measure of 63% over O-NB. We compare O-CRF to a traditional RE system and find that, without any relation-specific input, O-CRF reaches the same precision but lower recall. We also combine the two into a hybrid extractor and report its results. Section 6 presents related work and discusses future work.

How are relation-independent lexical and syntactic patterns distributed in English? We randomly selected 500 sentences from the corpus developed by Bunescu and Mooney and quantified how often such patterns occur. These observations help explain the success of open, relation-independent extraction models. Earlier work noted that relations such as "is-a" and "part-whole" are expressed by a small set of patterns (Hearst, 1992), and that such seed patterns can be automatically expanded into larger pattern sets for the is-a and part-whole relations (Etzioni et al., 2005; Snow et al., 2005; Girju et al., 2006). For each labeled relation instance we produced a lexical/syntactic pattern; interestingly, we found that 95% of the patterns can be grouped into a small set of categories.

In the open setting, the sole input to the OIE system is a corpus, plus a small set of relation-independent heuristics used to learn a general extraction model. For several reasons, open extraction is clearly harder than the traditional relation extraction task. First, the knowledge the system obtains should take the form of relational tuples (r, e_1, ..., e_n), where r is the relation and e_1, ..., e_n are the entities; for example, we want to extract (is headquartered in, Microsoft, Redmond). Further, the system must determine that "is headquartered in" and "is based in" are both ways of expressing the relation HEADQUARTERS(X, Y). The occurrence of words such as headquarters helps to detect HEADQUARTERS(X, Y), but such features fail for relations in general. In addition, RE systems usually use named-entity types as an aid (for example, the second argument of HEADQUARTERS should be a location), while in OIE the relations are unknown, and so are their argument types. These observations led us to develop O-CRF, a system for extracting relations from open text. The rest of this section describes O-CRF in detail, compares it with the model used by the first OIE system, and also presents an RE system for relation-specific extraction.

Conditional Random Fields (CRFs) (Lafferty et al., 2001) are undirected probabilistic graphical models trained to maximize a conditional probability. Under a first-order assumption, RE can be treated as sequence labeling, and CRFs have been applied to many extraction tasks, including relation extraction (Culotta et al., 2006). Like O-NB, O-CRF's training procedure is self-supervised. O-CRF uses a set of heuristics to find noun phrases that participate in a relation, e.g. "<Einstein> received <the Nobel Prize> in 1921." Another heuristic, used to extract negative examples, picks out objects that cross the boundary of an adverbial clause, e.g. "He studied <Einstein's work> when visiting <Germany>." The entity pair is taken as the two endpoints of a linear-chain conditional random field, and both entities are assigned a fixed label ENT. Tokens around the entity pair are labeled B-REL, marking the start of a relation; I-REL, marking its continuation; or O, meaning the current token does not clearly express a relation. Figure 1 gives an example.

Like most other natural-language extraction systems, O-CRF has some limitations. First, O-CRF only extracts relations that are stated explicitly; implicit relations would have to be derived from O-CRF's extraction results by further analysis. Given an input corpus, O-CRF makes a single pass over the data and performs entity recognition with a phrase-chunking tool. Then, after obtaining the extractions, O-CRF applies the RESOLVER algorithm (Yates and Etzioni, 2007) to find relation synonyms, that is, different textual forms that express the same relation. Based on relational features, RESOLVER predicts, without supervision, whether two strings denote the same item.

To compare open extraction with relation-specific extraction, we also built a conditional-random-field extractor under the traditional RE paradigm, referred to below as R1-CRF. A relation R is specified in advance, R1-CRF is trained from labeled instances of R, and every tuple it outputs is taken to be an instance of R.

We further present an ensemble-based, hybrid relation extraction method that merges the self-supervised view of the O-CRF open extraction system with the supervised view of R1-CRF. Following earlier work on stacking (Ting and Witten, 1999; Zenko and Dzeroski, 2002; Sigletos et al., 2005), the hybrid system, H-CRF, treats the outputs of O-CRF and R1-CRF as black boxes and learns to judge whether the words between a pair of entities express a relation. H-CRF uses the candidate relation words proposed by O-CRF and R1-CRF as features; to obtain probabilities from the conditional random fields it uses the technique of Culotta and McCallum (2004). H-CRF also computes the Monge-Elkan similarity (Monge and Elkan, 1996) between the relations produced by O-CRF and R1-CRF, and a meta-feature recording whether, for a given entity pair, one or both extractors output "no relation". Besides these numeric features, H-CRF also uses a subset of the base features of the two extractors.

Section 5 evaluates O-CRF's ability to identify relation instances when the number of relations is large and their identities are unknown. We show that, with no relation-specific input, O-CRF extracts binary relations with high precision and with recall nearly twice that of O-NB. Sections 5.2 and 5.3 examine how O-CRF, a traditional extractor, and the hybrid extractor perform when extracting a small set of known relations. We find that although the traditional extractor implemented by R1-CRF reaches higher recall, the extractor obtained by fusing O-CRF and R1-CRF improves precision and F-measure over either relation extraction system alone.

Both IE systems were trained before the sample sentences were used for testing, so results on the sentence sample are a fair measure; on it, O-CRF's recall is nearly double that of O-NB. O-CRF can extract the four most common relation categories: Verb, Noun+Prep, Verb+Prep, and Infinitive. Table 3 shows that O-CRF, without any relation-specific data, reaches a precision as high as 75%; R1-CRF, which uses labeled training data, is slightly lower at 73.9%. How much training data per relation does R1-CRF need to reach competitive precision? We varied the size of R1-CRF's training set and found that three of the four relations require hundreds or thousands of labeled examples. In the sentence "Yahoo To Acquire Inktomi", for instance, "Acquire" was mistaken for a noun and, lacking other sufficient evidence, the extractor missed the relation. Although RESOLVER can raise O-CRF's recall by nearly 50%, O-CRF discovers on average 6.5 synonyms per relation, versus 16.25 found for R1-CRF.

Taken together, these findings suggest the following tradeoffs between open and traditional relation extraction. When the relations are numerous and their identities unknown, an open IE system is necessary. When high recall on a small set of specified relations is required, a traditional RE system is more suitable, though at the cost of labeled data. Combining the two raises precision while recall drops only slightly; overall, F1 rises from 65.2% to 66.2%.

As the first open IE system, TEXTRUNNER is part of a line of work that seeks to avoid relation-specific input. Shinyama and Sekine's preemptive IE system (2006) discovers relations from sets of related articles. Most RE work targets a single relation at a time; in general, RE is treated as a binary classification problem (e.g., Bunescu and Mooney, 2005), and classification-based frameworks have also been used to jointly discover named entities and relations in a corpus (Roth and Yih, 2004). Hybrid methods that trade recall for higher precision have also been demonstrated (Feldman et al., 2005).
This paper examined the premises that allow the open IE paradigm to perform relation-independent extraction. We showed that binary relations can be captured by a small set of lexico-syntactic patterns and presented O-CRF, an open IE system based on conditional random fields. When the number and identity of the relations are unknown, an open IE system is necessary; compared with a traditional IE system, an open IE system has lower recall, but even in targeted extraction tasks open IE is competitive. In future work, O-CRF's recall can be further improved by strengthening its ability to recognize synonyms. We also plan to explore having an open IE system automatically supply labeled training data for cases that are better suited to traditional extraction.

References

E. Agichtein and L. Gravano. 2000. Snowball: Extracting relations from large plain-text collections. In Procs. of the Fifth ACM International Conference on Digital Libraries.
M. Banko, M. Cafarella, S. Soderland, M. Broadhead, and O. Etzioni. 2007. Open information extraction from the web. In Procs. of IJCAI.
S. Brin. 1998. Extracting patterns and relations from the World Wide Web. In WebDB Workshop at the 6th International Conference on Extending Database Technology, EDBT '98, pages 172–183, Valencia, Spain.
R. Bunescu and R. Mooney. 2005. Subsequence kernels for relation extraction. In Procs. of Neural Information Processing Systems.
R. Bunescu and R. Mooney. 2007. Learning to extract relations from the web using minimal supervision. In Proc. of ACL.
A. Culotta and A. McCallum. 2004. Confidence estimation for information extraction. In Procs. of HLT/NAACL.
A. Culotta, A. McCallum, and J. Betz. 2006. Integrating probabilistic extraction models and data mining to discover relations and patterns in text. In Procs. of HLT/NAACL, pages 296–.
P. Domingos. 1996. Unifying instance-based and rule-based induction. Machine Learning.
O. Etzioni, M. Cafarella, D. Downey, S. Kok, A. Popescu, T. Shaked, S. Soderland, D. Weld, and A. Yates. 2005. Unsupervised named-entity extraction from the web: An experimental study. Artificial Intelligence, 165(1):91–134.
R. Feldman, B. Rosenfeld, and M. Fresko. 2005. TEG, a hybrid approach to information extraction. Knowledge and Information Systems, 9(1):1–18.
D. Freitag. 2000. Machine learning for information extraction in informal domains. Machine Learning, 39(2-3):169–202.
R. Girju, A. Badulescu, and D. Moldovan. 2006. Automatic discovery of part-whole relations. Computational Linguistics, 32(1).
M. Hearst. 1992. Automatic acquisition of hyponyms from large text corpora. In Procs. of the 14th International Conference on Computational Linguistics, pages 539–545.
D. Klein and C. Manning. 2003. Accurate unlexicalized parsing. In Procs. of ACL.
J. Lafferty, A. McCallum, and F. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Procs. of ICML.
A. McCallum. 2002. MALLET: A machine learning for language toolkit.
A. E. Monge and C. P. Elkan. 1996. The field matching problem: Algorithms and applications. In Procs. of KDD.
E. Riloff and R. Jones. 1999. Learning dictionaries for information extraction by multi-level bootstrapping. In Procs. of AAAI-99, pages 1044–1049.
D. Roth and W. Yih. 2004. A linear programming formulation for global inference in natural language tasks. In Procs. of CoNLL.
S. Sekine. 2006. On-demand information extraction. In Proc. of COLING/ACL.
Y. Shinyama and S. Sekine. 2006. Preemptive information extraction using unrestricted relation discovery. In Proc. of HLT-NAACL.
G. Sigletos, G. Paliouras, C. D. Spyropoulos, and M. Hatzopoulos. 2005. Combining information extraction systems using voting and stacked generalization. Journal of Machine Learning Research, 6:1751–1782.
R. Snow, D. Jurafsky, and A. Ng. 2005. Learning syntactic patterns for automatic hypernym discovery. In Advances in Neural Information Processing Systems 17. MIT Press.
K. M. Ting and I. H. Witten. 1999. Issues in stacked generalization. Journal of Artificial Intelligence Research, 10:271–289.
D. Wolpert. 1992. Stacked generalization. Neural Networks.
A. Yates and O. Etzioni. 2007. Unsupervised resolution of objects and relations on the web. In Procs. of NAACL/HLT.
D. Zelenko, C. Aone, and A. Richardella. 2003. Kernel methods for relation extraction. JMLR, 3:1083–1106.
B. Zenko and S. Dzeroski. 2002. Stacking with an extended set of meta-level attributes and MLR. In Proc. of ECML.

The Tradeoffs Between Open and Traditional Relation Extraction
Michele Banko and Oren Etzioni
Turing Center, Box 352350

Abstract

Traditional Information Extraction (IE) takes a relation name and hand-tagged examples of that relation as input. Open IE is a relation-independent extraction paradigm
that is tailored to massive and heterogeneous corpora such as the Web. An Open IE system extracts a diverse set of relational tuples from text without any relation-specific input. How is Open IE possible? We analyze a sample of English sentences to demonstrate that numerous relationships are expressed using a compact set of relation-independent lexico-syntactic patterns, which can be learned by an Open IE system. What are the tradeoffs between Open IE and traditional IE? We consider this question in the context of two tasks. First, when the number of relations is massive, and the relations themselves are not pre-specified, we argue that Open IE is necessary. We present a new model for Open IE called O-CRF and show that it achieves increased precision and nearly double the recall of the model employed by TEXTRUNNER, the previous state-of-the-art Open IE system. Second, when the number of target relations is small, and their names are known in advance, we show that O-CRF is able to match the precision of a traditional extraction system, though at substantially lower recall. Finally, we show how to combine the two types of systems into a hybrid that achieves higher precision than a traditional extractor, with comparable recall.

1 Introduction

Relation Extraction (RE) is the task of recognizing the assertion of a particular relationship between two or more entities in text. Typically, the target relation (e.g., seminar location) is given to the RE system as input along with hand-crafted extraction patterns or patterns learned from hand-labeled training examples (Brin, 1998; Riloff and Jones, 1999; Agichtein and Gravano, 2000). Such inputs are specific to the target relation. Shifting to a new relation requires a person to manually create new extraction patterns or to specify new training examples by hand. This manual labor scales linearly with the number of target relations.

In 2007, we introduced a new approach to the RE task, called Open Information Extraction (Open IE), which scales RE to the Web. An Open IE system extracts a diverse set of relational tuples without requiring any relation-specific human input. Open IE's extraction process is linear in the number of documents in the corpus, and constant in the number of relations. Thus, Open IE is ideally suited to corpora such as the Web, where the target relations are not known in advance, and their number is massive.

The relationship between standard RE systems and the new Open IE paradigm is analogous to the relationship between lexicalized and unlexicalized parsers. Statistical parsers are usually lexicalized (i.e., they make parsing decisions based on n-gram statistics computed for specific lexemes). However, Klein and Manning (2003) showed that unlexicalized parsers are more accurate than previously believed, and can be learned in an unsupervised manner. Klein and Manning analyze the tradeoffs between the two approaches to parsing and argue that state-of-the-art parsing will benefit from employing both approaches in concert. In this paper, we examine the tradeoffs between relation-specific ("lexicalized") extraction and relation-independent ("unlexicalized") extraction and reach an analogous conclusion. We first consider the task of open extraction, in which the goal is to extract relationships from text when their number is large and their identity unknown. We then consider the targeted extraction task, in which the goal is to locate instances of a known relation. How do the precision and recall of Open IE compare with those of relation-specific extraction? Is it possible to combine Open IE with a "lexicalized" RE system to improve performance? This paper addresses these questions and makes the following contributions:

• We present O-CRF, a new Open IE system that uses Conditional Random Fields, and demonstrate its ability to extract a variety of relations with a precision of 88.3% and recall of 45.2%. We compare O-CRF to O-NB, the extraction model previously used by TEXTRUNNER (Banko et al., 2007), a state-of-the-art Open IE system, and show that O-CRF achieves a relative gain in F-measure of 63% over O-NB.

• We provide a corpus-based characterization of how binary relationships are expressed in English to demonstrate that learning a relation-independent extractor is feasible for the English language.

• In the targeted extraction case, we compare the performance of O-CRF to a traditional RE system and find that without any relation-specific input, O-CRF obtains the same precision, though with lower recall than a lexicalized extractor trained using hundreds, and sometimes thousands, of labeled examples per relation.

• We present H-CRF, an ensemble-based extractor that learns to combine the output of lexicalized and unlexicalized RE systems and achieves a 10% relative increase in precision with comparable recall over traditional RE.

The remainder of this paper is organized as follows. Section 2 assesses the promise of relation-independent extraction for the English language by characterizing how a sample of relations is expressed in text. Section 3 describes O-CRF, a new Open IE system, as well as R1-CRF, a standard RE system; our hybrid extractor is presented in Section 4. Section 5 reports on our experimental results. Section 6 considers related work, which is then followed by a discussion of future work.

2 The Nature of Relations in English

How are relationships expressed in English sentences? In this section, we show that many relationships are consistently expressed using a compact set of relation-independent lexico-syntactic patterns, and quantify their frequency based on a sample of 500 sentences selected at random from an IE corpus developed by Bunescu and Mooney (2007).[1] This observation helps to explain the success of open relation extraction, which learns a relation-independent extraction model as described in Section 3.1.

Previous work has noted that distinguished relations, such as hypernymy (is-a) and meronymy (part-whole), are routinely expressed using a small number of lexico-syntactic patterns (Hearst, 1992). The manual identification of these patterns inspired a body of work in which this initial set of extraction patterns is used to seed a bootstrapping process that automatically acquires additional patterns for is-a or part-whole relations (Etzioni et al., 2005; Snow et al., 2005; Girju et al., 2006). It is quite natural then to consider whether the same can be done for all binary relationships.

To characterize how binary relationships are expressed, one of the authors of this paper carefully studied the labeled relation instances and produced a lexico-syntactic pattern that captured the relation for each instance. Interestingly, we found that 95% of the patterns could be grouped into the categories listed in Table 1.

[1] For simplicity, we restrict our study to binary relationships.

Table 1: Taxonomy of Binary Relationships. Nearly 95% of the 500 randomly selected sentences belong to one of the eight categories below, each shown with its simplified lexico-syntactic pattern and an example:
Verb: E1 Verb E2 ("X established Y")
Noun+Prep: E1 NP Prep E2 ("X settlement with Y")
Verb+Prep: E1 Verb Prep E2 ("X moved to Y")
Infinitive: E1 to Verb E2 ("X plans to acquire Y")
Modifier: E1 Verb E2 Noun ("X is Y winner")
Coordinate_n: E1 (and|,|-|:) E2 NP ("X-Y deal")
Coordinate_v: E1 (and|,) E2 Verb ("X, Y merge")
Appositive: E1 NP (:|,)? E2 ("X hometown: Y")

Note, however, that the patterns shown in Table 1 are greatly simplified by omitting the exact conditions under which they will reliably produce a correct extraction. For instance, while many relationships are indicated strictly by a verb, detailed contextual cues are required to determine exactly which verb observed in the context, if any, expresses the relationship between the entities. In the next section, we show how we can use a Conditional Random Field, a model that can be described as a finite state machine with weighted transitions, to learn a model of how binary relationships are expressed in English.
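Read literally, the simplified patterns above are small templates over the text between two entity mentions. The sketch below is only an illustration of that reading: the regular expressions, word lists, and example strings are invented here and are far cruder than the conditions the paper alludes to.

import re

# Illustrative templates over the token string found between two entity mentions E1 and E2.
# Real patterns would consult POS tags and chunk labels; plain word regexes are a simplification.
TEMPLATES = {
    "Verb":       re.compile(r"^\w+ed$"),                   # "X established Y"
    "Verb+Prep":  re.compile(r"^\w+ed (to|with|in|from)$"),  # "X moved to Y"
    "Infinitive": re.compile(r"^plans to \w+$"),             # "X plans to acquire Y"
    "Noun+Prep":  re.compile(r"^\w+ (with|of|for)$"),        # "X settlement with Y"
}

def classify_context(between: str):
    """Return the first template category matching the text between E1 and E2, if any."""
    for category, pattern in TEMPLATES.items():
        if pattern.match(between.strip().lower()):
            return category
    return None

# Hypothetical strings in the spirit of the Table 1 examples.
print(classify_context("established"))        # Verb
print(classify_context("moved to"))           # Verb+Prep
print(classify_context("plans to acquire"))   # Infinitive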

3 Relation Extraction

Given a relation name, labeled examples of the relation, and a corpus, traditional Relation Extraction (RE) systems output instances of the given relation found in the corpus. In the open extraction task, relation names are not known in advance. The sole input to an Open IE system is a corpus, along with a small set of relation-independent heuristics, which are used to learn a general model of extraction for all relations at once.

Open extraction is considerably more challenging than the traditional RE task. The system must produce relational tuples (r, e_1, ..., e_n), locating both the entities believed to participate in a relation and the salient textual cues that indicate the relationship among them. For example, from the sentence "Microsoft is headquartered in beautiful Redmond", we expect to extract (is headquartered in, Microsoft, Redmond). Moreover, following extraction, the system must identify exactly which relation strings r correspond to a general relation of interest. To ensure high levels of coverage on a per-relation basis, we need, for example, to deduce that "'s headquarters in", "is headquartered in" and "is based in" are different ways of expressing HEADQUARTERS(X,Y). A heterogeneous corpus also makes it difficult to leverage the full set of features typically used when performing extraction on one relation at a time. For instance, the presence of words such as headquarters will be useful in detecting instances of the HEADQUARTERS(X,Y) relation, but they are not useful features for identifying relations in general. Finally, RE systems typically use named-entity types as a guide (e.g., the second argument to HEADQUARTERS should be a LOCATION). In Open IE, the relations are not known in advance, and neither are their argument types.

The unique nature of the open extraction task has led us to develop O-CRF, an open extraction system that uses the power of graphical models to identify relations in text.
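A minimal way to picture this output format and the synonym problem is sketched below; the class name and the hand-written synonym table are illustrative stand-ins (RESOLVER, described later, learns such groupings automatically rather than reading them from a dictionary).

from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class RelationTuple:
    relation: str            # the text that signifies the relation, e.g. "is headquartered in"
    args: Tuple[str, ...]    # the participating entities, e.g. ("Microsoft", "Redmond")

# Strings expressing the same relation must eventually be grouped together;
# this hand-written mapping only stands in for what a synonym resolver would learn.
SYNONYMS = {
    "is headquartered in": "HEADQUARTERS",
    "'s headquarters in": "HEADQUARTERS",
    "is based in": "HEADQUARTERS",
}

def canonicalize(t: RelationTuple) -> str:
    return SYNONYMS.get(t.relation, t.relation)

t = RelationTuple("is headquartered in", ("Microsoft", "Redmond"))
print(canonicalize(t), t.args)   # HEADQUARTERS ('Microsoft', 'Redmond')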

The remainder of this section describes O-CRF, and compares it to the extraction model employed by TEXTRUNNER, the first Open IE system (Banko et al., 2007). We then describe R1-CRF, a CRF-based extractor for the typical one-relation-at-a-time setting.

3.1 Extraction with Conditional Random Fields

TEXTRUNNER initially treated Open IE as a classification problem, using a Naive Bayes classifier to predict whether the tokens between two entities indicated a relationship or not. For the remainder of this paper, we refer to this model as O-NB.

Figure 1: Relation Extraction as Sequence Labeling. A CRF is used to identify the relationship, born in, between Kafka and Prague.
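The figure itself is not reproduced in this copy; as a plain-text stand-in, the labeling it depicts can be written out as token/label pairs using the tag set defined below (the exact tokenization and the O tag on "was" are assumptions made for the illustration).

# The sentence from Figure 1, labeled as a sequence: entities anchor the chain with the
# fixed label ENT, the relation "born in" gets B-REL/I-REL, everything else gets O.
tokens = ["Kafka", "was", "born",  "in",    "Prague"]
labels = ["ENT",   "O",   "B-REL", "I-REL", "ENT"]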

Whereas classifiers predict the label of a single variable, graphical models model multiple, interdependent variables. Conditional Random Fields (CRFs) (Lafferty et al., 2001) are undirected graphical models trained to maximize the conditional probability of a finite set of labels Y given a set of input observations X. By making a first-order Markov assumption about the dependencies among the output variables Y, and arranging the variables sequentially in a linear chain, RE can be treated as a sequence labeling problem. Linear-chain CRFs have been applied to a variety of sequential text processing tasks including named-entity recognition, part-of-speech tagging, word segmentation, semantic role identification, and recently relation extraction (Culotta et al., 2006).

As with O-NB, O-CRF's training process is self-supervised. O-CRF applies a handful of relation-independent heuristics to the Penn Treebank and obtains a set of labeled examples in the form of relational tuples. The heuristics were designed to capture dependencies typically obtained via syntactic parsing and semantic role labelling. For example, a heuristic used to identify positive examples is the extraction of noun phrases participating in a subject-verb-object relationship, e.g., "<Einstein> received <the Nobel Prize> in 1921." An example of a heuristic that locates negative examples is the extraction of objects that cross the boundary of an adverbial clause, e.g., "He studied <Einstein's work> when visiting <Germany>."

The resulting set of labeled examples is described using features that can be extracted without syntactic or semantic analysis and used to train a CRF, a sequence model that learns to identify spans of tokens believed to indicate explicit mentions of relationships between entities.

O-CRF first applies a phrase chunker to each document, and treats the identified noun phrases as candidate entities for extraction. Each pair of entities appearing no more than a maximum number of words apart, together with its surrounding context, is considered as possible evidence for RE. The entity pair serves to anchor each end of a linear-chain CRF, and both entities in the pair are assigned a fixed label of ENT. Tokens in the surrounding context are treated as possible textual cues that indicate a relation.
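For reference, the quantity a linear-chain CRF of this kind is trained to maximize can be written as follows; this is the standard formulation from Lafferty et al. (2001), with feature functions f_k and learned weights lambda_k, rather than an equation reproduced from this paper.

p(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \prod_{t=1}^{T} \exp\Big( \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, \mathbf{x}, t) \Big), \qquad Z(\mathbf{x}) = \sum_{\mathbf{y}'} \prod_{t=1}^{T} \exp\Big( \sum_{k} \lambda_k \, f_k(y'_{t-1}, y'_t, \mathbf{x}, t) \Big)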

Each token in the surrounding context can be assigned one of the following labels: B-REL, indicating the start of a relation; I-REL, indicating the continuation of a predicted relation; or O, indicating the token is not believed to be part of an explicit relationship. An illustration is given in Figure 1.

The set of features used by O-CRF is largely similar to those used by O-NB and other state-of-the-art relation extraction systems. They include part-of-speech tags (predicted using a separately trained tagger), regular expressions (e.g., detecting capitalization, punctuation, etc.), context words, and conjunctions of features occurring in adjacent positions within six words to the left and six words to the right of the word. A unique aspect of O-CRF is that it uses context words belonging only to closed classes (e.g., prepositions and determiners), but not function words such as verbs or nouns. Thus, unlike most RE systems, O-CRF does not try to recognize semantic classes of entities.

O-CRF has a number of limitations, most of which are shared with other systems that perform extraction from natural language text. First, O-CRF only extracts relations that are explicitly mentioned in the text; implicit relationships would need to be inferred from O-CRF extractions. Second, O-CRF focuses on relationships that are primarily word-based, and not indicated solely by punctuation or document-level features. Finally, relations must occur between entity names within the same sentence.

O-CRF was built using the CRF implementation provided by MALLET (McCallum, 2002), as well as part-of-speech tagging and phrase-chunking tools available from OpenNLP. Given an input corpus, O-CRF makes a single pass over the data, and performs entity identification using a phrase chunker. The CRF is then used to label instances of relations for each possible entity pair, subject to the constraints mentioned previously.

Following extraction, O-CRF applies the RESOLVER algorithm (Yates and Etzioni, 2007) to find relation synonyms, the various ways in which a relation is expressed in text. RESOLVER uses a probabilistic model to predict whether two strings refer to the same item, based on relational features, in an unsupervised manner. In Section 5.2 we report that RESOLVER boosts the recall of O-CRF by 50%.
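As a rough illustration of the kind of per-token features described above (a part-of-speech tag, capitalization and punctuation tests, closed-class context words, and conjunctions of neighboring tags), the sketch below builds a feature dictionary for one position. The function name, the tiny closed-class list, and the window handling are invented for the example; the real system trains far richer features through MALLET.

CLOSED_CLASS = {"in", "of", "to", "the", "a", "an", "on", "at", "by", "for", "with"}

def token_features(tokens, pos_tags, i):
    """Build features for position i (illustrative only; not O-CRF's actual feature set)."""
    word = tokens[i]
    feats = {
        "pos": pos_tags[i],
        "is_capitalized": word[:1].isupper(),
        "is_punct": not word.isalnum(),
    }
    # keep the surface word only when it belongs to a closed class (prepositions, determiners, ...)
    if word.lower() in CLOSED_CLASS:
        feats["word"] = word.lower()
    # conjunctions of features at adjacent positions, here just pairs of POS tags
    if i > 0:
        feats["prev_pos+pos"] = pos_tags[i - 1] + "_" + pos_tags[i]
    if i + 1 < len(tokens):
        feats["pos+next_pos"] = pos_tags[i] + "_" + pos_tags[i + 1]
    return feats

# The sentence from Figure 1; dictionaries like this could be fed to any linear-chain CRF trainer.
tokens = ["Kafka", "was", "born", "in", "Prague"]
pos_tags = ["NNP", "VBD", "VBN", "IN", "NNP"]
print(token_features(tokens, pos_tags, 3))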

3.2 R1-CRF

To compare the behavior of open, or "unlexicalized," extraction to relation-specific, or "lexicalized," extraction, we developed a CRF-based extractor under the traditional RE paradigm. We refer to this system as R1-CRF. Although the graphical structure of R1-CRF is the same as that of O-CRF, R1-CRF differs in a few ways. A given relation R is specified a priori, and R1-CRF is trained from hand-labeled positive and negative instances of R. The extractor is also permitted to use all lexical features, and is not restricted to closed-class words as is O-CRF. Since R is known in advance, if R1-CRF outputs a tuple at extraction time, the tuple is believed to be an instance of R.

4 Hybrid Relation Extraction

Since O-CRF and R1-CRF have complementary views of the extraction process, it is natural to wonder whether they can be combined into a more powerful extractor. In a variety of machine learning settings, the use of an ensemble of diverse classifiers during prediction has been observed to yield higher levels of performance compared to individual classifiers. We now describe an ensemble-based, or hybrid, approach to RE that leverages the different views offered by open, self-supervised extraction in O-CRF and lexicalized, supervised extraction in R1-CRF.

Stacked generalization, or stacking (Wolpert, 1992), learns a meta-classifier over the outputs of several base-level classifiers. The training set used to train the meta-classifier is generated using a leave-one-out procedure: for each base-level algorithm, a classifier is learned from all but one of the training examples and then used to generate a prediction for the left-out example. The meta-classifier is trained using the predictions of the base-level classifiers as features, and the true label as given by the training data. Previous studies (Ting and Witten, 1999; Zenko and Dzeroski, 2002; Sigletos et al., 2005) have shown that the probabilities of each class value as estimated by each base-level algorithm are effective features when training meta-learners. Stacking was shown to be consistently more effective than voting, another popular ensemble-based method in which the outputs of the base classifiers are combined either through majority vote or by taking the class value with the highest average probability.

We used the stacking methodology to build an ensemble-based extractor, referred to as H-CRF. Treating the output of O-CRF and R1-CRF as black boxes, H-CRF learns to predict which, if any, tokens found between a pair of entities (e1, e2) indicate a relationship. Due to the sequential nature of our RE task, H-CRF employs a CRF as the meta-learner, as opposed to a decision tree or regression-based classifier. H-CRF uses the probability distribution over the set of possible labels according to each of O-CRF and R1-CRF as features. To obtain the probability at each position of a linear-chain CRF, the constrained forward-backward technique described in (Culotta and McCallum, 2004) is used. H-CRF also computes the Monge-Elkan distance (Monge and Elkan, 1996) between the relations predicted by O-CRF and R1-CRF and includes the result in the feature set. An additional meta-feature utilized by H-CRF indicates whether either or both base extractors return "no relation" for a given pair of entities. In addition to these numeric features, H-CRF uses a subset of the base features used by O-CRF and R1-CRF: at each given position i between e1 and e2, the presence of the word observed at i is used as a feature, as is the presence of the part-of-speech tag at i.
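The Monge-Elkan score used as a meta-feature above averages, over the tokens of one string, the best match of each token against the tokens of the other string. The sketch below uses difflib's character-overlap ratio as the internal token similarity purely for illustration; the internal measure used by H-CRF is not specified here, and the original formulation is due to Monge and Elkan (1996).

from difflib import SequenceMatcher

def _sim(a: str, b: str) -> float:
    # character-level similarity between two tokens (a stand-in for the internal measure)
    return SequenceMatcher(None, a, b).ratio()

def monge_elkan(s1: str, s2: str) -> float:
    """Average, over s1's tokens, of each token's best similarity to any token of s2."""
    t1, t2 = s1.lower().split(), s2.lower().split()
    if not t1 or not t2:
        return 0.0
    return sum(max(_sim(a, b) for b in t2) for a in t1) / len(t1)

# e.g. comparing the relation strings predicted by the two base extractors
print(monge_elkan("is headquartered in", "is based in"))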

5 Experimental Results

The following experiments demonstrate the benefits of Open IE for two tasks: open extraction and targeted extraction. Section 5.1 assesses the ability of O-CRF to locate instances of relationships when the number of relationships is large and their identity is unknown, and shows that O-CRF extracts binary relationships with high precision and a recall that nearly doubles that of O-NB. Sections 5.2 and 5.3 compare O-CRF to traditional and hybrid RE when the goal is to locate instances of a small set of known target relations. We find that while single-relation extraction, as embodied by R1-CRF, achieves comparatively higher levels of recall, it takes hundreds, and sometimes thousands, of labeled examples per relation for R1-CRF to approach the precision obtained by O-CRF, which is self-trained without any relation-specific input. We also show that the combination of unlexicalized, open extraction in O-CRF and lexicalized, supervised extraction in R1-CRF improves precision and F-measure compared to a traditional extraction system.

5.1 Open Extraction

This section contrasts the performance of O-CRF with that of O-NB on an Open IE task, and finds that O-CRF achieves both double the recall and increased precision relative to O-NB. For this experiment, we used the set of 500 sentences described in Section 2. Both IE systems were designed and trained prior to the examination of the sample sentences; thus the results on this sentence sample provide a fair measurement of their performance.

While the TEXTRUNNER system was previously found to extract over 7.5 million tuples from a corpus of 9 million Web pages, these experiments are the first to assess its true recall over a known set of relational tuples. As reported in Table 2, O-CRF extracts relational tuples with a precision of 88.3% and a recall of 45.2%. O-CRF achieves a relative gain in F1 of 63.4% over the O-NB model employed by TEXTRUNNER, which obtains a precision of 86.6% and a recall of 23.2%. The recall of O-CRF nearly doubles that of O-NB.

O-CRF is able to extract instances of the four most frequently observed relation types: Verb, Noun+Prep, Verb+Prep and Infinitive. Three of the four remaining types, Modifier, Coordinate_n and Coordinate_v, which comprise only 8% of the sample, are not handled due to simplifying assumptions made by both O-CRF and O-NB that tokens indicating a relation occur between the entity mentions in the sentence.

5.2 O-CRF vs. R1-CRF

To compare the performance of the extractors when a small set of target relationships is known in advance, we used labeled data for four different relations: corporate acquisitions, birthplaces, inventors of products, and award winners. The first two datasets were collected from the Web and made available by Bunescu and Mooney (2007). To augment the size of our corpus, we used the same technique to gather additional sentences, labeled by hand over all collections. For each of the four relations in our collection, we trained R1-CRF from labeled training data, ran each of R1-CRF and O-CRF over the respective test sets, and compared the precision and recall of all tuples output by each system.

Table 3: Precision (P) and Recall (R) of O-CRF and R1-CRF.

Table 4: For 4 relations, a minimum of 4374 hand-tagged examples is needed for R1-CRF to approximately match the precision of O-CRF for each relation. A "∗" indicates the use of all available training data; in these cases, R1-CRF was unable to match the precision of O-CRF.

Table 3 shows that from the start, O-CRF achieves a high level of precision, 75.0%, without any relation-specific data. Using labeled training data, the R1-CRF system achieves a slightly lower precision of 73.9%. Exactly how many training examples per relation does it take R1-CRF to achieve a comparable level of precision? We varied the number of training examples given to R1-CRF, and found that in 3 out of 4 cases it takes hundreds, if not thousands, of labeled examples for R1-CRF to achieve acceptable levels of precision. In two cases, acquisitions and inventions, R1-CRF is unable to match the precision of O-CRF, even using all available training data.
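The relative F1 gain quoted in Section 5.1 follows directly from the four precision/recall figures reported there; a quick arithmetic check, assuming the standard balanced F-measure:

def f1(p, r):
    # harmonic mean of precision and recall, in percent
    return 2 * p * r / (p + r)

f1_ocrf = f1(88.3, 45.2)   # about 59.8
f1_onb = f1(86.6, 23.2)    # about 36.6
relative_gain = (f1_ocrf - f1_onb) / f1_onb
print(round(f1_ocrf, 1), round(f1_onb, 1), round(100 * relative_gain, 1))  # ~63.4% relative gain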
