
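Open Domain Event Extraction from Twitter

ABSTRACT

Previous work on extracting event information has focused mainly on newswire text; Twitter's unique characteristics present new challenges and opportunities for open-domain event extraction. This paper describes TwiCal, the first open-domain event extraction and categorization system for Twitter. We demonstrate that accurately extracting significant events from Twitter is indeed feasible. We also present an approach, based on latent variable models, for discovering important event categories and classifying extracted events, which achieves up to a 14% increase in F1 over a supervised baseline. A continuously updating demonstration of our system can be viewed at http://s and our NLP tools are available at /aritter/_nlp.

I.2.7 [Natural Language Processing]: Language parsing and understanding; H.2.8 [Database Management]: Database applications, data mining

INTRODUCTION

Social networking sites such as Facebook and Twitter are full of up-to-date information and discussion about current events. More than 200 million tweets are now posted every day, but many of them are redundant [57] or of limited interest, which leads to information overload. Clearly, we can benefit from structured representations of events that are synthesized from many individual tweets.

Previous work in event extraction [21, 1, 54, 18, 43, 11, 7] has focused largely on news articles, as this genre of text has been the best source of information on current events. In the meantime, social networking sites such as Facebook and Twitter have become an important complementary source. Much research has tracked trends and memes in social media [26, 29], but little work has addressed extracting structured representations of events from short or informal texts.

Challenges: Twitter users frequently mention mundane events from their daily lives, whereas an event mentioned in newswire text can safely be assumed to be important. Individual tweets are also very terse and often lack the context needed to categorize them into topics of interest. Finally, tweets are written in an informal style, which causes NLP tools designed for edited text to perform poorly.

Opportunities: tweets are short and self-contained. In newswire text, by contrast, complex reasoning about relations between events (e.g. before and after) is often required to accurately relate events to temporal expressions [32, 8]. To handle Twitter's noisy style, we follow recent work on NLP in noisy text [46, 31, 19] and annotate a corpus of tweets, which is then used to train sequence-labeling models.

Twitter users discuss a wide variety of topics, so it is not clear in advance which set of event types is appropriate for categorization. To address the diversity of events discussed on Twitter, we introduce a new approach to discovering event types and categorizing events within a new domain. The supervised alternative would require first choosing an appropriate set of types to annotate and then annotating a large corpus of events found on Twitter. In our approach, the automatically discovered types are inspected, any that are incoherent are filtered out, and the rest are annotated with informative labels (this annotation and filtering takes little effort). Examples of the discovered types are listed in Figure 3. The approach results in a 14% improvement in F1 score.

SYSTEM OVERVIEW

TwiCal extracts a 4-tuple representation of events that includes a named entity, an event phrase, a calendar date, and an event type (see Table 1). This representation closely matches the way important events are typically mentioned on Twitter.

An overview of the components of our Twitter event extraction system is presented in Figure 1. Given a raw stream of tweets, the system extracts named entities together with event phrases and unambiguous dates that are involved in significant events. It then measures the strength of association between each named entity and date, based on the number of tweets in which they co-occur, in order to determine whether an event is significant.

NLP tools such as named entity segmenters and part-of-speech taggers that were designed to process edited text (e.g. news articles) perform poorly on Twitter text because of its noisy and unique style. We therefore use a named entity tagger and a part-of-speech tagger trained on in-domain Twitter data. We also develop an event tagger trained on in-domain annotated data, as described in §4.

NAMED ENTITY SEGMENTATION

NLP tools such as named entity segmenters and part-of-speech taggers that were designed to process edited text (e.g. news articles) perform poorly when applied to Twitter text because of its noisy and unique style. For instance, capitalization is a key feature for named entity extraction in news, but it is unreliable in tweets, where named entities are often written entirely in lowercase. In addition, because of the 140-character limit and the creative spelling of Twitter users, tweets contain a high proportion of out-of-vocabulary words. To address these issues, we use the named entity tagger from previous work [46], which was trained on in-domain Twitter data.

Training on tweets vastly improves named entity segmentation. Table 2 compares performance against the state-of-the-art, news-trained Stanford Named Entity Recognizer [17]. Our system obtains a 52% increase in F1 score over the Stanford tagger at segmenting named entities.

Table 2: By training on in-domain data, we obtain a 52% increase in F1 over the Stanford tagger at segmenting named entities.

EXTRACTING EVENT MENTIONS

In order to extract event mentions from Twitter's noisy text, we first annotate a corpus of tweets. We apply an established approach to sequence labeling in noisy text, but this is the first work to extract event phrases on Twitter. Event phrases can belong to different parts of speech, for example:

Verbs: Apple to announce iPhone 5 on October 4th?! YES!
Nouns: iPhone 5 announcement coming Oct 4th

These phrases provide important context. For example, extracting the entity Steve Jobs and the event phrase died in connection with October 5th is more informative than simply extracting Steve Jobs. Event mentions are also helpful in upstream tasks such as categorizing events into types, as described in §6.

To build a tagger for recognizing events, we annotated 1,000 tweets (19,484 tokens) with event phrases, following the event annotations in Timebank [43]. We treat the recognition of event triggers as a sequence labeling task. The features include contextual, dictionary, and orthographic features, as well as features based on our Twitter-tuned part-of-speech tagger [46] and dictionaries of event terms gathered from WordNet by Sauri et al. [50]. Our classifier, TwiCal-Event, obtains an F-score of 0.64. To demonstrate the need for in-domain training data, we compare against a baseline trained on the Timebank corpus.

EXTRACTING AND RESOLVING TEMPORAL EXPRESSIONS

In addition to extracting events and related named entities, we also need to extract when they occur. Users can refer to the same calendar date in many ways: "tomorrow" or "yesterday" can both refer to the same day, depending on when the tweet was written. To resolve temporal expressions we use TempEx [33], which takes as input a reference date, some text, and parts of speech (from our Twitter-trained part-of-speech tagger) and marks temporal expressions with unambiguous calendar references. Its precision on tweets (94%, estimated over a sample of 268 extractions) is high enough for our purposes. TempEx's high precision on tweets can be explained by the fact that some temporal expressions are relatively unambiguous. There is still room to improve the recall of temporal extraction on Twitter by handling noisy temporal expressions (see Ritter et al. [46] for a list of over 50 spelling variations of the word "tomorrow"); we leave this adaptation as potential future work.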
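The dependence of a relative expression on the tweet's date can be illustrated with a small sketch. This is not TempEx, which handles far more expression types; the function `resolve_temporal`, the word lists, and the rule that a bare weekday name means its next occurrence are all assumptions made for illustration.

```python
from datetime import date, timedelta
from typing import Optional

# Hypothetical lookup tables; a real resolver covers many more expressions.
RELATIVE_DAYS = {"today": 0, "tonight": 0, "tomorrow": 1, "yesterday": -1}
WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]


def resolve_temporal(expression: str, reference: date) -> Optional[date]:
    """Map a relative temporal expression to a calendar date.

    `reference` is the date on which the tweet was written.
    Returns None when the expression is not recognized.
    """
    word = expression.strip().lower()
    if word in RELATIVE_DAYS:
        return reference + timedelta(days=RELATIVE_DAYS[word])
    if word.startswith("next "):
        word = word[len("next "):]
    if word in WEEKDAYS:
        # Assumption: a weekday name refers to its next occurrence
        # strictly after the reference date (1 to 7 days ahead).
        delta = (WEEKDAYS.index(word) - reference.weekday() - 1) % 7 + 1
        return reference + timedelta(days=delta)
    return None
```

Under these assumptions, "tomorrow" in a tweet written on 3 November 2011 and "yesterday" in a tweet written on 5 November 2011 both resolve to 4 November 2011, which is why the reference date has to be part of the input.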
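CLASSIFICATION OF EVENT TYPES

Many important categories are relatively infrequent, so even a large annotated dataset may contain only a few examples of these categories, which makes classification very difficult.

Each type corresponds to a distribution over the named entities involved in events of that type, together with a distribution over the dates d on which events of the type occur. Including dates in the model encourages events that occur on the same date to receive the same type, because distinct references to the same event should have the same type. The generative story is based on LinkLDA [15] and is presented as Algorithm 1. Because the model is probabilistic, it supports different probabilistic queries over the data, which is useful, for example, when categorizing aggregate events.

We ran Gibbs sampling for 1,000 iterations of burn-in and kept the hidden variable assignments found in the last sample. The resulting types were inspected manually and assigned labels such as SPORTS, POLITICS and MUSICRELEASE, based on the entities and the event words that have high probability under each type. Out of the 100 types, we found 52 that correspond to meaningful event types. Other types were incoherent or of no general interest; for example, one cluster contained phrases such as applied, call, contact and job interview, which relate to users looking for a job. Such types were simply labeled OTHER. The complete set of labels used to annotate the automatically discovered event types, along with the coverage of each type, is shown in Figure 2. Note that this labeling only needs to be done once. One direction for future work is automatic labeling and coherence evaluation of the discovered event types, analogous to recent work on topic models [38, 25].

To evaluate the ability of our model to classify events, we grouped together all (entity, date) pairs that occur 20 or more times in the data and annotated 500 of them (see §7). We compare against a supervised maximum entropy baseline, which is evaluated on the 500 annotated events using 10-fold cross-validation. Figure 4 compares our unsupervised approach with the supervised baseline through a precision-recall curve obtained by varying the threshold on the probability of the most likely type. In addition, Table 4 compares precision and recall at the point of maximum F-score. Our approach obtains a 14% increase in F1.

RANKING EVENTS

Frequency alone is not sufficient for ranking: an entity such as McDonalds co-occurs with calendar dates quite frequently. Important events can instead be identified by the strength of association between an entity and a date, which we measure with the G² log-likelihood statistic. G² has been argued to be more appropriate than χ² for text analysis [12]. An exact test would be more accurate, but it is difficult to compute at the scale of our data (on the order of 10^11). The G² test is based on the likelihood ratio between a model in which the entity is conditioned on the date and a model in which the entity and the date are independent. Given an entity e and a date d, the statistic is computed, up to a constant factor, as

$$G^2 = \sum_{x \in \{e, \neg e\},\; y \in \{d, \neg d\}} O_{x,y} \ln\left(\frac{O_{x,y}}{E_{x,y}}\right)$$

Here O_{e,d} is the observed fraction of tweets that contain both e and d, O_{e,¬d} is the observed fraction of tweets that contain e but not d, and so on. Similarly, E_{e,d} is the fraction of tweets expected to contain both e and d under a model of independence.

EXPERIMENTS

Using tweets gathered on November 3rd, 2011, we extracted the top 100, 500 and 1,000 calendar entries occurring within a two-week future window from November 3rd. To predict event types on the new data, we ran Gibbs sampling for 50 iterations while holding the hidden variables in the original data constant. This streaming approach to inference is similar to the work of Yao et al. [56].

From the top 100, 500 and 1,000 entries we sampled 50 events each. Each entry was judged on four separate criteria, the fourth being whether each of criteria 1-3 is correct, that is, whether the entry contains the correct entity, date, event phrase and type. We compare against a baseline that uses neither the named entity recognizer of Ritter et al. nor our event recognizer. Precision at different numbers of events is obtained by varying the threshold on the G² statistic. Note that the baseline is comparable only on the third criterion. For example, the baseline extracts only part of "Breaking Dawn", even though the word "Breaking" is strongly associated with November 18th. The precision of the highest-ranked entries is 90%, an 80% increase over the n-gram baseline. Highly ranked entity/date pairs are less prone to extraction errors. The errors that remain include weak association between an entity and a date, and cases in which the entity is segmented correctly but the entry is still wrong.

RELATED WORK

While we are the first to study open-domain event extraction on Twitter, there are two main related lines of research: extracting specific types of events from Twitter, and extracting open-domain events from news [43].

Recently there has been much interest in information extraction and event identification on Twitter. Benson et al. [5] use distant supervision to train a relation extractor on the tweets of users who list their location as New York. There has also been recent work on event detection and tracking on Twitter, which does not extract structured information but has the advantage of not being restricted to a narrow domain. Petrovic et al. investigate a streaming approach that uses locality-sensitive hashing to identify the tweets that are first to report a breaking event [40]. Becker et al. [3], Popescu et al. [42, 41] and Lin et al. [28] investigate clustering event-related words or tweets. In contrast to this previous work on Twitter event identification, our approach is independent of event type and domain.

Compared with news articles, Twitter status messages present both unique challenges and opportunities. Tweets are noisy, but they rarely require complex reasoning along the timeline, which is often needed in longer texts that contain narratives [51]. Finally, we note that applying NLP techniques to short texts such as tweets has recently been studied in order to help relief workers cope with the flood of information that follows a natural disaster [35, 27, 36].

CONCLUSIONS

Our manual evaluation found that the quality of the extracted events is clearly better than that of an n-gram baseline, and our approach to event categorization improves on a supervised baseline by 14%. A continuously updating demonstration of our system is available at http://s and our NLP tools are available at /aritter/_nlp.

ACKNOWLEDGEMENTS

We thank Luke Zettlemoyer and the reviewers for helpful feedback on a previous draft. This research was supported by NSF (IIS-0803481) and ONR (N00014-08-1-0431) and was carried out at a university research center.

REFERENCES

J. Allan, R. Papka, and V. Lavrenko. On-line new event detection and tracking. In SIGIR.
M. Banko, M. J. Cafarella, S. Soderland, M. Broadhead, and O. Etzioni. Open information extraction from the web. In IJCAI, 2007.
H. Becker, M. Naaman, and L. Gravano. Beyond trending topics: Real-world event identification on Twitter. In ICWSM, 2011.
C. Bejan, M. Titsworth, A. Hickl, and S. Harabagiu. Nonparametric Bayesian models for unsupervised event coreference resolution. In NIPS, 2009.
E. Benson, A. Haghighi, and R. Barzilay. Event discovery in social media feeds. In ACL.
S. Bethard and J. H. Martin. Identification of event mentions and their semantic class. In EMNLP.
N. Chambers and D. Jurafsky. Template-based information extraction without the templates. In Proceedings of ACL, 2011.
N. Chambers, S. Wang, and D. Jurafsky. Classifying temporal relations between events. In ACL.
C. Danescu-Niculescu-Mizil, M. Gamon, and S. Dumais. Mark my words! Linguistic style accommodation in social media. In Proceedings of WWW, pages 745-754, 2011.
G. Doddington, A. Mitchell, M. Przybocki, L. Ramshaw, S. Strassel, and R. Weischedel. The Automatic Content Extraction (ACE) Program: Tasks, Data, and Evaluation. LREC, 2004.
T. Dunning. Accurate methods for the statistics of surprise and
J. Eisenstein, B. O'Connor, N. A. Smith, and E. P. Xing. A latent variable model for geographic lexical variation. In EMNLP, 2010.
J. Eisenstein, N. A. Smith, and E. P. Xing. Discovering sociolinguistic associations with structured sparsity. In ACL-HLT, 2011.
E. Erosheva, S. Fienberg, and J. Lafferty. Mixed-membership models of scientific publications. PNAS, 2004.
A. Fader, S. Soderland, and O. Etzioni. Identifying relations for open information extraction. In EMNLP, 2011.
J. R. Finkel, T. Grenager, and C. Manning. Incorporating non-local information into information extraction systems by Gibbs sampling. In ACL, 2005.
E. Gabrilovich, S. Dumais, and E. Horvitz. Newsjunkie: providing personalized newsfeeds via analysis of information novelty. In WWW, 2004.
K. Gimpel, N. Schneider, B. O'Connor, D. Das, D. Mills, J. Eisenstein, M. Heilman, D. Yogatama, J. Flanigan, and N. A. Smith. Part-of-speech tagging for Twitter: Annotation, features, and experiments. In ACL, 2011.
T. L. Griffiths and M. Steyvers. Finding scientific topics. Proc Natl Acad Sci USA, 101 Suppl 1, 2004.
R. Grishman and B. Sundheim. Message understanding conference-6: A brief history. In Proceedings of the International Conference on Computational Linguistics, 1996.
Z. Kozareva and E. Hovy. Learning arguments and supertypes of semantic relations
G. Kumaran and J. Allan. Text classification and named entities for new event detection. In SIGIR, 2004.
J. D. Lafferty, A. McCallum, and F. C. N. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In ICML, 2001.
J. H. Lau, K. Grieser, D. Newman, and T. Baldwin. Automatic labelling of topic models. In ACL.
J. Leskovec, L. Backstrom, and J. Kleinberg. Meme-tracking and the dynamics of the news cycle. In KDD, 2009.
W. Lewis, R. Munro, and S. Vogel. Crisis MT: Developing a cookbook for MT in crisis situations. In Proceedings of the Sixth Workshop on Statistical Machine Translation, 2011.
C. X. Lin, B. Zhao, Q. Mei, and J. Han. PET: a statistical model for popular events tracking in social communities. In KDD, 2010.
J. Lin, R. Snow, and W. Morgan. Smoothing techniques for adaptive online language models: Topic tracking in tweet streams. In KDD, 2011.
X. Ling and D. S. Weld. Temporal information extraction. In AAAI.
X. Liu, S. Zhang, F. Wei, and M. Zhou. Recognizing named entities in tweets. In ACL.
I. Mani, M. Verhagen, B. Wellner, C. M. Lee, and J. Pustejovsky. Machine learning of temporal relations. In ACL, 2006.
I. Mani and G. Wilson. Robust temporal processing of news. In ACL.
R. C. Moore. On log-likelihood-ratios and the significance of rare events. In EMNLP.
R. Munro. Subword and spatiotemporal models for identifying actionable information in Haitian Kreyol. In CoNLL, 2011.
G. Neubig, Y. Matsubayashi, M. Hagiwara, and K. Murakami. Safety information mining - what can NLP do in a disaster -. In IJCNLP, 2011.
D. Newman, A. U. Asuncion, P. Smyth, and M. Welling. Distributed inference for latent Dirichlet allocation. In NIPS, 2007.
D. Newman, J. H. Lau, K. Grieser, and T. Baldwin. Automatic evaluation of topic coherence. In HLT-NAACL, 2010.
D. Ó Séaghdha. Latent variable models of selectional preference. In ACL, ACL '10.
S. Petrovic, M. Osborne, and V. Lavrenko. Streaming first story detection with application to Twitter. In HLT-NAACL, 2010.
A.-M. Popescu and M. Pennacchiotti. Dancing with the stars, NBA games, politics: An exploration of Twitter users' response to events. In ICWSM, 2011.
A.-M. Popescu, M. Pennacchiotti, and D. A. Paranjpe. Extracting events and event descriptions from Twitter. In WWW, 2011.
J. Pustejovsky, P. Hanks, R. Sauri, A. See, R. Gaizauskas, A. Setzer, D. Radev, B. Sundheim, D. Day, L. Ferro, and M. Lazo. The TIMEBANK corpus. In Proceedings of Corpus Linguistics 2003.
A. Ritter, C. Cherry, and B. Dolan. Unsupervised modeling of Twitter conversations. In HLT-NAACL, 2010.
A. Ritter, C. Cherry, and W. B. Dolan. Data-driven response generation in social media. In EMNLP, 2011.
A. Ritter, S. Clark, Mausam, and O. Etzioni. Named entity recognition in tweets:
A. Ritter, Mausam, and O. Etzioni. A latent Dirichlet allocation method for selectional preferences. In ACL, 2010.
K. Roberts and S. M. Harabagiu. Unsupervised learning of selectional restrictions and detection of argument coercions. In EMNLP, 2011.
T. Sakaki, M. Okazaki, and Y. Matsuo. Earthquake shakes Twitter users: real-time event detection by social sensors. In WWW, 2010.
R. Sauri, R. Knippen, M. Verhagen, and J. Pustejovsky. Evita: a robust event recognizer for QA systems. In HLT-EMNLP, 2005.
F. Song and R. Cohen. Tense interpretation in the context of narrative. In Proceedings of the Ninth National Conference on Artificial Intelligence - Volume 1, AAAI '91, 1991.
B. Van Durme and D. Gildea. Topic models for corpus-centric knowledge generalization. Technical Report TR-946, Department of Computer Science, University of Rochester, Rochester.
D. S. Weld, R. Hoffmann, and F. Wu. Using Wikipedia to bootstrap open information extraction. SIGMOD Rec., 2009.
Y. Yang, T. Pierce, and J. Carbonell. A study of retrospective and on-line event detection. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '98, 1998.
L. Yao, A. Haghighi, S. Riedel, and A. McCallum. Structured relation discovery using generative models. In EMNLP, 2011.
collections. In KDD, 2009.
F. M. Zanzotto, M. Pennacchiotti, and K. Tsioutsiouliklis. Linguistic redundancy in Twitter. In EMNLP, 2011.

Open Domain Event Extraction from Twitter

Alan
Computer Sci. & Eng., Seattle, WA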

Computer Sci. & Eng., Seattle, WA

Decide, Inc., Seattle, WA

Oren Etzioni
Computer Sci. & Eng., Seattle, WA

ABSTRACT

Tweets are the most up-to-date and inclusive stream of information and commentary on current events, but they are also fragmented and noisy, motivating the need for systems that can extract, aggregate and categorize important events. Previous work on extracting event information has focused mainly on newswire text; Twitter's unique characteristics present new challenges and opportunities for open-domain event extraction. This paper describes TwiCal, the first open-domain event-extraction and categorization system for Twitter. We demonstrate that accurately extracting an open-domain calendar of significant events from Twitter is indeed feasible. In addition, we present a novel approach for discovering important event categories and classifying extracted events based on latent variable models. By leveraging large volumes of unlabeled data, our approach achieves a 14% increase in maximum F1 over a supervised baseline. A continuously updating demonstration of our system can be viewed at http://s and our NLP tools are available at /aritter/_nlp.

Categories and Subject Descriptors: I.2.7 [Natural Language Processing]: Language parsing and understanding; H.2.8 [Database Management]: Database applications, data mining

General Terms: Algorithms

* Work was conducted at the University of

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
KDD'12, August 12-16, 2012, Beijing, China.
Copyright 2012 ACM 978-1-4503-1462-6/12/08

INTRODUCTION

Social networking sites such as Facebook and Twitter present the most up-to-date information and buzz about current

events. Yet the number of tweets posted daily has recently exceeded two-hundred million, many of which are either redundant [57], or of limited interest, leading to information overload. Clearly, we can benefit from more structured representations of events that are synthesized from individual tweets.

Table 1: Examples of events extracted by TwiCal.

Previous work in event extraction [21, 1, 54, 18, 43, 11, 7] has focused largely on news articles, as historically this genre of text has been the best source of information on current events. In the meantime, social networking sites such as Facebook and Twitter have become an important complementary source of such information. While status messages contain a wealth of useful information, they are very disorganized, motivating the need for automatic extraction, aggregation and categorization. Although there has been much interest in tracking trends and memes in social media [26, 29], little work has addressed extracting structured representations of events from short or informal texts.

Extracting useful structured representations of events from this disorganized corpus of noisy text is a challenging problem. On the other hand, individual tweets are short and self-contained and are therefore not composed of complex discourse structure as is the case for texts containing narratives. In this paper we demonstrate that open-domain event extraction from Twitter is indeed feasible; for example, the highest-ranked extracted events reach 90% precision, as demonstrated in §8.

Twitter has several characteristics that present unique challenges and opportunities for the task of open-domain event extraction.

Challenges: Twitter users frequently mention mundane events in their daily lives (such as what they ate for lunch) which are only of interest to their immediate social network. In contrast, if an event is mentioned in newswire text, it is safe to assume it is of general importance. Individual tweets are also very terse, often lacking sufficient context to categorize them into topics of interest (e.g. Sports, Politics, ProductRelease etc...). Further, because Twitter users discuss a wide variety of topics, it is unclear in advance which set of event types are appropriate. Finally, tweets are written in an informal style, causing NLP tools designed for edited texts to perform extremely poorly.

Opportunities: The short and self-contained nature of tweets means they have very simple discourse and pragmatic structure, issues which still challenge NLP systems. For example in newswire, complex reasoning about relations between events (e.g. before and after) is often required to accurately relate events to temporal expressions [32, 8]. The volume of Tweets is also much larger than the volume of news articles, so redundancy of information can be exploited more easily.

To address Twitter's noisy style, we follow recent work on NLP in noisy text [46, 31, 19], annotating a corpus of Tweets with events, which is then used to train sequence-labeling models to identify event mentions in millions of messages.

Because of the terse, sometimes mundane, but highly redundant nature of tweets, we were motivated to focus on extracting an aggregate representation of events which provides additional context for tasks such as event categorization, and also filters out mundane events by exploiting redundancy of information. We propose identifying important events as those whose mentions are strongly associated with references to a unique date as opposed to dates which are evenly distributed across the calendar.

Twitter users discuss a wide variety of topics, making it unclear in advance what set of event types are appropriate for categorization. To address the diversity of events discussed on Twitter, we introduce a novel approach to discovering important event types and categorizing aggregate events within a new domain.

Supervised or semi-supervised approaches to event categorization would require first designing annotation guidelines (including selecting an appropriate set of types to annotate), then annotating a large corpus of events found in Twitter. This approach has several drawbacks, as it is a priori unclear what set of types should be annotated; a large amount of effort would be required to manually annotate a corpus of events while simultaneously refining annotation standards.

We propose an approach to open-domain event categorization based on latent variable models that uncovers an appropriate set of types which match the data. The automatically discovered types are subsequently inspected to filter out any which are incoherent and the rest are annotated with informative labels (footnote 2); examples of types discovered using our approach are listed in Figure 3. The resulting set of types are then applied to categorize hundreds of millions of extracted events without the use of any manually annotated examples. By leveraging large quantities of unlabeled data, our approach results in a 14% improvement in F1 score over a supervised baseline.

Footnote 2: This annotation and filtering takes minimal effort. One of the authors spent roughly 30 minutes inspecting and annotating the automatically discovered event types.

Table 2: By training on in-domain data, we obtain a 52% improvement in F1 score over the Stanford Named Entity Recognizer at segmenting entities in Tweets [46].

SYSTEM OVERVIEW

TwiCal extracts a 4-tuple representation of events which includes a named entity, event phrase, calendar date, and event type (see Table 1). This representation was chosen to closely match the way important events are typically mentioned in Twitter.

An overview of the various components of our system for extracting events from Twitter is presented in Figure 1. Given a raw stream of tweets, our system extracts named entities in association with event phrases and unambiguous dates which are involved in significant events. First the tweets are POS tagged, then named entities and event phrases are extracted, temporal expressions are resolved, and the extracted events are categorized into types. Finally we measure the strength of association between each named entity and date based on the number of tweets they co-occur in, in order to determine whether an event is significant.

NLP tools, such as named entity segmenters and part of speech taggers which were designed to process edited texts (e.g. news articles) perform very poorly when applied to Twitter text due to its noisy and unique style. To address these issues, we utilize a named entity tagger and part of speech tagger trained on in-domain Twitter data presented in previous work [46]. We also develop an event tagger trained on in-domain annotated data as described in §4.

Figure 1: Processing pipeline for extracting events from Twitter. New components developed as part of this work are shaded in grey.

NAMED ENTITY SEGMENTATION

NLP tools, such as named entity segmenters and part of speech taggers which were designed to process edited texts (e.g. news articles) perform very poorly when applied to Twitter text due to its noisy and unique style.

For instance, capitalization is a key feature for named entity extraction within news, but this feature is highly unreliable in tweets; words are often capitalized simply for emphasis, and named entities are often left all lowercase. In addition, tweets contain a higher proportion of out-of-vocabulary words, due to Twitter's 140-character limit and the creative spelling of its users.

To address these issues, we utilize a named entity tagger trained on in-domain Twitter data presented in previous work [46].

Training on tweets vastly improves performance at segmenting Named Entities. For example, performance compared against the state-of-the-art news-trained Stanford Named Entity Recognizer [17] is presented in Table 2. Our system obtains a 52% increase in F1 score over the Stanford Tagger at segmenting named entities.

EXTRACTING EVENT MENTIONS

In order to extract event mentions from Twitter's noisy text, we first annotate a corpus of tweets, which is then used to train sequence models to extract events. While we apply an established approach to sequence-labeling tasks in noisy text [46, 31, 19], this is the first work to extract event-referring phrases in Twitter.

Event phrases can consist of many different parts of speech, as illustrated in the following examples:

Verbs: Apple to Announce iPhone 5 on October 4th?! YES!
Nouns: iPhone 5 announcement coming Oct 4th
Adjectives: WOOOHOO NEW CAN'T

These phrases provide important context, for example extracting the entity, Steve Jobs and the event phrase died in connection with October 5th, is much more informative than simply extracting Steve Jobs. In addition, event mentions are helpful in upstream tasks such as categorizing events into types, as described in §6.

In order to build a tagger for recognizing events, we annotated 1,000 tweets (19,484 tokens) with event phrases, following annotation guidelines similar to those developed for the Event tags in Timebank [43]. We treat the problem of recognizing event triggers as a sequence labeling task, using Conditional Random Fields for learning and inference [24]. Linear Chain CRFs model dependencies between predicted labels of adjacent words, which is beneficial for extracting event phrases. We use contextual, dictionary, and orthographic features, and also include features based on our Twitter-tuned POS tagger [46] and dictionaries of event terms gathered from WordNet by Sauri et al. [50].

The precision and recall at segmenting event phrases are reported in Table 3. Our classifier, TwiCal-Event, obtains an F-score of 0.64. To demonstrate the need for in-domain training data, we compare against a baseline of training our system on the Timebank corpus.

Table 3: Precision and recall at event phrase extraction. All results are reported using 4-fold cross validation over the 1,000 manually annotated tweets (about 19K tokens). We compare against a system which doesn't make use of features generated based on our Twitter-trained POS Tagger, in addition to a system trained on the Timebank corpus which uses the same set of features.

EXTRACTING AND RESOLVING TEMPORAL EXPRESSIONS

In addition to extracting events and related named entities, we also need to extract when they occur. In general there are many different ways users can refer to the same calendar date; for example, "tomorrow" or "yesterday" could refer to the same day, depending on when the tweet was written. To resolve temporal expressions we make use of TempEx [33], which takes as input a reference date, some text, and parts of speech (from our Twitter-trained POS tagger) and marks temporal expressions with unambiguous calendar references. Although this mostly rule-based system was designed for use on newswire text, we find its precision on Tweets (94%, estimated over a sample of 268 extractions) is sufficiently high to be useful for our purposes. TempEx's high precision on Tweets can be explained by the fact that some temporal expressions are relatively unambiguous. Although there appears to be room for improving the recall of temporal extraction on Twitter by handling noisy temporal expressions (for example see Ritter et al. [46] for a list of over 50 spelling variations on the word "tomorrow"), we leave adapting temporal extraction to Twitter as potential future work.

CLASSIFICATION OF EVENT TYPES

To categorize the extracted events into types we propose an approach based on latent variable models which infers an appropriate set of event types to match our data, and also classifies events into types by leveraging large amounts of unlabeled data.

Supervised or semi-supervised classification of event categories is problematic for several reasons. First, it is a priori unclear which categories are appropriate for Twitter. Secondly, a large amount of manual effort is required to annotate tweets with event types. Third, the set of important categories (and entities) is likely to shift over time, or within a focused user demographic. Finally many important categories are relatively infrequent, so even a large annotated dataset may contain just a few examples of these categories, making classification difficult.

For these reasons we were motivated to investigate unsupervised approaches that will automatically induce event types which match the data. We adopt an approach based on latent variable models inspired by recent work on modeling information extraction [4, 55].

Figure 2: Complete list of automatically discovered event types with percentage of data covered. Interpretable types representing significant events cover roughly half of the data.

For example, an event phrase might appear as part of either a PoliticalEvent, or a SportsEvent. Each type corresponds to a distribution over named entities n involved in specific instances of the type, in addition to a distribution over dates d on which events of the type occur. Including calendar dates in our model has the effect of encouraging (though not requiring) events which occur on the same date to be assigned the same type. This is helpful in guiding inference, because distinct references to the same event should also have the same type.

The generative story for our data is based on LinkLDA [15], and is presented as Algorithm 1. This approach has the advantage that information about an event phrase's type distribution is shared across its mentions, while ambiguity is also naturally preserved. In addition, because the approach is based on a generative probabilistic model, it is straightforward to perform different probabilistic queries about the data. This is useful for example when categorizing aggregate events.

For inference we use collapsed Gibbs Sampling [20] where each hidden variable, z_i, is sampled in turn, and parameters are integrated out. Example types are displayed in Figure 3. To estimate the distribution over types for a given event, a sample of the corresponding hidden variables is taken from the Gibbs sampler after burn-in. Prediction for new data is performed using a streaming approach to inference [56].
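The collapsed Gibbs sampling step described above can be sketched for a LinkLDA-style model in which each event phrase has one type mixture shared by its entity and date observations. This is an illustrative sketch rather than the authors' implementation: the function name `gibbs_linklda`, the data layout, and the hyperparameter values `alpha` and `eta` are assumptions.

```python
import random


def gibbs_linklda(docs, num_types, n_entities, n_dates,
                  iters=200, alpha=0.1, eta=0.01, seed=0):
    """Collapsed Gibbs sampler for a LinkLDA-style model (sketch).

    docs[e] = (entity_ids, date_ids) observed with event phrase e.
    Returns the sampled type assignments and per-phrase type counts.
    """
    rng = random.Random(seed)
    vocab = (n_entities, n_dates)  # kind 0 = entities, kind 1 = dates
    # Phrase-type counts are shared by both kinds of observation.
    phrase_type = [[0] * num_types for _ in docs]
    type_word = [[[0] * v for _ in range(num_types)] for v in vocab]
    type_total = [[0] * num_types for _ in vocab]
    # Start from a random type for every entity/date observation.
    z = [[[rng.randrange(num_types) for _ in obs] for obs in doc]
         for doc in docs]

    def update(e, kind, w, t, delta):
        phrase_type[e][t] += delta
        type_word[kind][t][w] += delta
        type_total[kind][t] += delta

    for e, doc in enumerate(docs):
        for kind, obs in enumerate(doc):
            for i, w in enumerate(obs):
                update(e, kind, w, z[e][kind][i], +1)

    for _ in range(iters):
        for e, doc in enumerate(docs):
            for kind, obs in enumerate(doc):
                for i, w in enumerate(obs):
                    # Remove the current assignment from the counts.
                    update(e, kind, w, z[e][kind][i], -1)
                    # Conditional with theta and beta integrated out.
                    weights = [
                        (phrase_type[e][k] + alpha)
                        * (type_word[kind][k][w] + eta)
                        / (type_total[kind][k] + vocab[kind] * eta)
                        for k in range(num_types)
                    ]
                    t = rng.choices(range(num_types), weights=weights)[0]
                    z[e][kind][i] = t
                    update(e, kind, w, t, +1)
    return z, phrase_type
```

Because the phrase-type counts are shared between entity and date observations, a date that is strongly tied to one type pulls the entities seen with the same event phrase toward that type, which is the coupling the text attributes to including calendar dates in the model.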

Figure 3: Example event types discovered by our model. For each type t, we list the top 5 entities which have highest probability given t, and the 5 event phrases which assign highest probability to t.

Top 5 Event Phrases: tailgate - scrimmage - tailgating - ing - regular - concert - presale - performs - concerts - tick - matinee - musical - new season - season finale - finished season - episodes - new episode - watch love - dialogue theme - inception - hall pass - movie - inning - innings - pitched - homered - ial debate - osama - candidate - republican debate - debate - network news broadcast - airing - primetime drama - channel - unveils - unveiled - announces - launches - wraps off - shows trading - hall - mtg - zoning - briefing - stocks - tumbled - trading report - opened higher - tumbles - maths - english test - exam - revise - physics - in stores - album out - debut album - drops on - hits - voted off - idol - idol season - dividend - sermon - preaching - preached - worship - declared war - war - sing - opened fire - senate - legislation - repeal - budget - election - winners - lotto results - enter - winner - contest - bail plea - murder trial - sentenced - plea - con - film festival - screening - starring - film - gosling - live forever - passed away - sad news - condolences - burried - add into - 50% off - up - save up - donate - tornado relief - disaster relief - raise

Top 5 Entities: espn - ncaa - tigers - eagles - varsity - taylor swift - toronto - britney spears - shrek - les mis - lee evans - wicked - broad - jersey shore - true - glee - dvr - netflix - black swan - insidious - tron - scott pil - mlb - red sox - twins - obama - obama - gop - cnn - nbc - espn - abc - fox - applecrosoft - uk - town hall - city hall - club - commerce - white - reuters - new york - china - english - maths - german - bio - itunes - ep - uk - lady gaga - american idol - america - beyonce - church - jesus - pastor - faith - god - libya - afghanistan - #syria - syria - nato - senate - house - obama - gop - ipad - award - good luck - casey anthony - new delhi - supreme court - hollywood - nyc - la - los angeles - new york - lennon - young - peace - groupon - earlybird - @etsy - etsy - japan - red cross - joplin - june

Algorithm 1: Generative story for our data involving event types as hidden variables. Bayesian Inference techniques

for each event type t = 1 ... T do
    Generate β^n_t according to symmetric Dirichlet distribution
    Generate β^d_t according to symmetric Dirichlet distribution
end for
for each unique event phrase e = 1 ... |E| do
    Generate θ_e according to Dirichlet distribution
    for each entity which co-occurs with e, i = 1 ... N_e do
        Generate z^n_{e,i} from Multinomial(θ_e)
        Generate the entity n_{e,i} from Multinomial(β^n_{z^n_{e,i}})
    end for
    for each date which co-occurs with e, i = 1 ... N_d do
        Generate z^d_{e,i} from Multinomial(θ_e)
        Generate the date d_{e,i} from Multinomial(β^d_{z^d_{e,i}})
    end for
end for

Table 4: Precision and recall of event type categorization at the point of maximum F score (rows: TwiCal-Classify, Supervised Baseline).

To evaluate the ability of our model to classify significant events, we gathered 65 million extracted events of the form listed in Figure 1 (not including the type). We then ran Gibbs Sampling with 100 types for 1,000 iterations of burn-in, keeping the hidden variable assignments found in the last sample.

One of the authors manually inspected the resulting types and assigned them labels such as Sports, Politics, MusicRelease and so on, based on their distribution over entities, and the event words which assign highest probability to that type. Out of the 100 types, we found 52 to correspond to coherent event types which referred to significant events; the other types were either incoherent, or covered types of events which are not of general interest, for example there was a cluster of phrases such as applied, call, contact, job interview, etc... which correspond to users discussing events related to searching for a job. Such event types which do not correspond to significant events of general interest were simply marked as OTHER. A complete list of labels used to annotate the automatically discovered event types along with the coverage of each type is listed in Figure 2. Note that this assignment of labels to types only needs to be done once and produces a labeling for an arbitrarily large number of event instances. Additionally the same set of types can easily be used to classify new event instances using streaming inference techniques [56]. One interesting direction for future work is automatic labeling and coherence evaluation of automatically discovered event types analogous to recent work on topic models [38, 25].

In order to evaluate the ability of our model to classify aggregate events, we grouped together all (entity, date) pairs which occur 20 or mo
