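Keno Albrecht and Roger Wattenhofer
ETH Zurich, Distributed Computing, 8092 Zurich, Switzerland

BuzzTrack is an email client extension that helps users deal with email overload [...] to summarize their contents. Lastly, we evaluate the clustering scheme in the context of existing work on topic detection and tracking (TDT) for news articles: our algorithm exhibits performance on email similar to that of current work on news text. We believe that BuzzTrack's organization structure, which can be obtained at no cost to the end user, will be helpful for managing the large amounts of email received [...]

ACM Classification: H5.2.f [Information interfaces and presentation]: User Interfaces - Graphical user interfaces. H4.3.c [Information technology and systems applications]: Communications applications - Electronic mail. H3.3.a [Information storage and retrieval]: Information Search and Retrieval - Clustering.

Keywords: Email, organization, clustering, topic detection, topic tracking, sender behavior.

INTRODUCTION
[...] an email newsletter, or an email exchange with a coworker about a research idea. Topics are not equivalent to threads. A thread [...] several [...] such as Outlook or Mozilla Thunderbird, [...] and display email in a folder hierarchy. Users must [...]

How do we group emails into topics? Our techniques are inspired by "topic detection and tracking" (TDT), a [...] behavioral measures such as the percentage of replied emails, contact rankings based on email volume, past behavior, and reply timing. We [...] NIST's measures. We [...] common subject or content words for multi-thread topics, or for cases where the quality of the subject line is insufficient.

RELATED WORK
The problem of email overload [23] is now widely acknowledged and has received much attention, even in the popular media. A variety of solutions have been proposed, including task-based [22], activity-based [10], priority-based [12], and sender-based [3] organization schemes. For example, Dredze et al. [10] provide successful algorithms to recognize [...] Many users distrust these schemes because they might move emails out of sight, never to be seen again [3]. [...] topic detection and tracking for news articles [18]. Allan et al. investigated making this technology more accessible to normal users [2], but we are not aware of any work that applied TDT to email. [...] Two examples are TaskMaster [5], which helps users organize emails, attachments, and other information [...] based on the importance that the user manually assigned to contacts. In contrast, we display the organization automatically, without any manual input.

APPLICATION
BuzzTrack is implemented as an extension to Mozilla Thunderbird 1.5, but clustering and labeling are performed in Python [...] The implementation contains about 3,000 lines of JavaScript/XUL code and 14,000 lines of Python code. Both components are platform-independent. Figure 1 shows a typical screenshot of BuzzTrack in use: the topic list was added as a sidebar on the left, and the right side [...] by [...] Similarly, users can manually rename [...]

Figure 2: Sidebar components and supported operations.

[...] by the spam filter integrated in the email client or by a filtering solution such as Spamato [1].

CLUSTERING
We use an [...] clustering algorithm, as we are interested in processing incoming email immediately. [...] mapped to any number of closest topics. We have experimented with different methods for generating decision scores from feature values, including [...]

[...] foreign characters are converted into a canonical form [...] HTML tags, style sheets, punctuation, and numbers. [...] the language used is known, we apply the Porter algorithm to reduce all words to their base form. We currently handle English and German. These two languages make up 99.6% of the corpus. We do not use the language [...]

Clustering Features
[...] $m_1, \dots, m_N$ and the $M$ existing clusters, each of which represents one topic and contains one or more emails, with $C_1, \dots, C_M$. There exists a set of all $n$ terms that appear in the emails. We refer to these terms as $t_1, \dots, t_n$. For a preprocessed email $m_j$, we refer to the term frequency of term $t_i$ as $tf_{i,j}$ and to its document frequency as $df_i$. We now define the term weight $w_{i,j}$ as follows:

\[ w_{i,j} = \begin{cases} (1 + \log tf_{i,j}) \, \log \dfrac{N}{df_i} & \text{if } tf_{i,j} \geq 1 \\ 0 & \text{if } tf_{i,j} = 0 \end{cases} \]

The second measure is also text-based and tests subject similarity: it calculates the overlap between the sets of words $S_i$ and $S_j$ in the subject lines of two emails:

\[ sim_{subject}(m_i, m_j) = \frac{2\,|S_i \cap S_j|}{|S_i| + |S_j|} \]

For each cluster $C_k$, there is a set of senders $ppl(C_k)$ which contains all [...] of the emails in the cluster. [...] the SimSubset and SimOverlap metrics introduced in [10]. [...] the user is removed from the ppl sets [...] parts of sender addresses, $sim_{domains,subset}$ and $sim_{domains,overlap}$. They help in recognizing emails coming from different people [...] Almost all contacts in the corpus used modern email clients that employ these headers. The $sim_{thread}$ measure gives the proportion of emails in the cluster that are in the same thread as the new email. The set $T$ contains all $m_j \in C_k$ which are in the same thread as $m_i$. [...] features by multiplying the sender rank with $sim_{subject}$ and $sim_{text}$. [...] count: the number of [...] previous emails in the header fields. [...] generating decision scores [...] trained a linear support vector machine on the development set by sequential minimal optimization [19] (SMO), as implemented in the Weka toolbox [24].

Time Window
During single-link clustering, we only consider clusters which have been active in the last 60 days. Statistics from the development corpus [...] As we are not interested in processing the user's entire email [...], topics older than 60 days are dropped. [...] this design choice [...]

[...] presented in [21], we cannot afford to spend much processing power on deriving cluster [...] is recalculated every time a new email is added. [...] topic. In previous work [2, 21], only lower case was displayed. [...] 28 [...]," "FTD Newsletter - 4 Sep 2006" [...] will be set to "FTD Newsletter" [...]

[Figure] 3: occurs-after relationships between high-tf·idf terms. The numbers in parentheses give the average first position. Here, "your," "with," and "shipped" are not nouns and not high-tf·idf words. Due to stemming, "item" and "items" map to the same term. The resulting label would be "Amazon order status items".

[...] we traverse the graph by choosing the highest weight. If two [...] share the same highest weight, we choose the word with [...] Note that we need to keep an additional tf·idf value: we regard the emails of the entire cluster as one [...] the word never occurs [...] the cluster. This word is very descriptive for the cluster, but with a per-[...] tf·idf measure, its value may [...]

Evaluation Method
For evaluating our clustering scheme, we follow the guidelines provided by NIST for the evaluation of topic detection and tracking [...] the first email that discusses a new [...] For each email in the stream, the output can be "yes" (the email is a new topic) or "no" (the email is not a new topic). [...] Detection quality is characterized in terms of the probability of miss and false alarm errors, $P_{miss}$ and $P_{FA}$. These error probabilities are combined into a single [...] $C_{miss}$ and $C_{FA}$ are the costs of a miss and a false alarm, respectively; $P_{miss}$ and $P_{FA}$ are the conditional probabilities of a miss and a false alarm, respectively; $P_{target}$ and $P_{nontarget}$ are the prior target probabilities ($P_{nontarget} = 1 - P_{target}$). [...] In our evaluation, we consider misses more important than false alarms: we choose $C_{miss} = 1.0$ and $C_{FA} = 0.1$. Finally, $C_{det}$ [...] 0 and a trivial system which always emits "yes" [...]

Corpus
Why not use or adapt an existing corpus? One consideration was to use the Enron corpus [14] and manually [...] there have been efforts to reconstruct the thread structure based on message content and Exchange headers [25]. Finally, [...] 4 [...] definition is very [...] Here are some typical examples from the corpus: an email from a colleague asking about experiences with a particular brand of digital camera. In a second thread, [...] two [...] are thread drift and [...] drift. Thread drift means that [...] contains several threads, as in the digital camera example. [...] an average of 2.44 [...] in the development set [...] threads. [...] drift means that the same thread contains messages on different topics: [...] simply look for a past email from the corpus owner, hit "Reply," and, sometimes, delete only the text. [...] The average number of topics per thread in the development corpus is only 1.04.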
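The evaluation method above names the ingredients of the detection cost but the combining formula itself did not survive. The sketch below assumes the standard form from the NIST TDT evaluation plan [18], $C_{det} = C_{miss} P_{miss} P_{target} + C_{FA} P_{FA} P_{nontarget}$, with the normalization dividing by the cost of the cheaper trivial system. The function names are hypothetical, and the prior `p_target` is a free parameter because the surviving text does not state its value.

```python
def detection_cost(p_miss, p_fa, p_target, c_miss=1.0, c_fa=0.1):
    """Weighted sum of miss and false-alarm errors (assumed standard TDT form).

    c_miss=1.0 and c_fa=0.1 are the costs chosen in the text above.
    """
    p_nontarget = 1.0 - p_target
    return c_miss * p_miss * p_target + c_fa * p_fa * p_nontarget


def normalized_detection_cost(p_miss, p_fa, p_target, c_miss=1.0, c_fa=0.1):
    """Detection cost relative to the cheaper of the two trivial systems.

    A system that always answers "no" misses every target (cost
    c_miss * p_target); one that always answers "yes" raises a false alarm
    on every non-target (cost c_fa * p_nontarget).
    """
    p_nontarget = 1.0 - p_target
    baseline = min(c_miss * p_target, c_fa * p_nontarget)
    return detection_cost(p_miss, p_fa, p_target, c_miss, c_fa) / baseline
```

Under this normalization a perfect system scores 0, and a system that is no better than always answering with one fixed label scores 1 or more.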
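Figure 5: Information gain of features on development corpus data for the new topic detection and tracking tasks.

Evaluation Results
Figure 5 shows the information gain of each feature for performing the two tasks on the development corpus. While text-, thread-, and people-based attributes are useful in both tasks, other attributes appear to be more valuable [...] One [...] result is that [...]

We use detection error trade-off (DET) curves [18] to visualize the trade-off between the miss rate $P_{miss}$ and the false alarm rate $P_{FA}$. The curves are produced by sweeping the system through the space of decision scores. At each point in the score space, $P_{miss}$ and $P_{FA}$ are estimated and plotted as a connected line. We mark the point on each curve at which the detection cost $C_{det}$ is minimized. Figures 6, 7, 8, and 9 show the results [...] TDT [...] 7 [...] TT task: [...]

These results are comparable to the performance of current work in TDT on news articles [11]: in comparison, our TT quality is slightly worse, while NTD performance is slightly better. This suggests that although the text quality of email is lower than that of news articles, we are able to exploit the additional information present in emails. [...] of the model is highly [...] definition and depends on the user's tastes.

[...] 8 [...] $(C_{det})_{norm}$, NTD, TT [...] TT [...] [9] [...] labeling each [...] takes about 100 ms; such a short delay is acceptable to email client users.

FUTURE WORK
We would also like to experiment with BuzzTrack's user interface and introduce useful additions. For example, [...] the grouping scheme here, the inbox view could provide context by showing a visual map of the emails related to the one currently selected.

CONCLUSION
[...] folders or create a seed set for learning folder contents: new [...] are identified automatically. Our clustering method and evaluation were inspired by work in the TDT (topic detection and tracking) community, and we found that these techniques perform well on email data. We developed the BuzzTrack plugin for a popular email client [...] this functionality through a user-friendly [...]

[Figure] 9: Quality measures at the minimum-cost point [...] NTD [...] TT.

ACKNOWLEDGMENTS
The authors would like to thank Markus Egli, Erol Koc, Fabian Siege, Michael Kuhn, Roland Flury, Aaron Harnly, Matt Brezina, Adam Smith, and Abhay Puri for helpful discussions on this [...], and several reviewers for their suggestions.

REFERENCES
[1] Keno Albrecht, Nicolas Burri, and Roger Wattenhofer. Spamato - an extendable spam filter system. In Proc. Conference on Email and Anti-Spam (CEAS) '05, 2005.
[2] James Allan, Stephen Harding, David Fisher, Alvaro Bolivar, Sergio Guzman-Lara, and Peter Amstutz. Taking topic detection from evaluation to practice. In Proc. Hawaii International Conference on System Sciences (HICSS) '05, page 101.1, 2005.
[3] Olle Bälter and Candace Sidner. Bifrost inbox organizer: giving users control over the inbox. In Proc. Nordic Conference on Human-Computer Interaction (NordiCHI) '02, pages 111–118, 2002.
[4] Ron Bekkerman, Andrew McCallum, and Gary Huang. Automatic categorization of email into folders: Benchmark experiments on Enron and SRI corpora. Technical Report IR-418, CIIR, UMass Amherst.
[5] Victoria Bellotti, Nicolas Ducheneaut, Mark Howard, and Ian Smith. Taking email to task: the design and evaluation of a task management centered email tool. In Proc. Conference on Human Factors in Computing Systems (CHI) '03, pages 345–352, 2003.
[6] David Blei and John Lafferty. Correlated topic models. In Advances in Neural Information Processing Systems, pages 147–154. MIT Press, Cambridge, MA, 2006.
[7] Gary Boone. Concept features in Re:Agent, an intelligent email agent. In Proc. Conference on Autonomous Agents (AGENTS) '98, pages 141–148, 1998.
[8] William B. Cavnar and John M. Trenkle. N-gram-based text categorization. In Symposium on Document Analysis and Information Retrieval (SDAIR) '94, pages 161–175, 1994.
[9] Gabor Cselle. Organizing email. Master's thesis, ETH Zurich.
[10] Mark Dredze, Tessa Lau, and Nicholas Kushmerick. Automatically classifying emails into activities. In Proc. Conference on Intelligent User Interfaces (IUI) '06, pages 77.
[11] Jonathan Fiscus and Barbara Wheatley. Overview of the TDT 2004 evaluation and results. National Institute of Standards and Technology, 2004.
[12] [...] Horvitz, [...] and [...]. Attention-sensitive alerting. In Proc. Conference on Uncertainty and Artificial Intelligence (UAI) '99, pages 305–313, 1999.
[13] Yoram M. Kalman and Sheizaf Rafaeli. Email chronemics: Unobtrusive profiling of response times. In Proc. Hawaii International Conference on System Sciences (HICSS) '05, page 108.2, 2005.
[14] Bryan Klimt and Yiming Yang. Introducing the Enron corpus. In Proc. Conference on Email and Anti-Spam (CEAS) '04, 2004.
[15] Natural language toolkit.
[16] Carman Neustaedter, A.J. Bernheim Brush, and Marc A. Smith. Beyond "from" and "received": Exploring the dynamics of email triage. In Proc. CHI '05, pages 1977–1980, 2005.
[17] Carman Neustaedter, A.J. Bernheim Brush, Marc A. Smith, and Danyel Fisher. The social network and relationship finder: Social sorting for email triage. In Proc. Conference on Email and Anti-Spam (CEAS) '05, 2005.
[18] NIST. The 2004 topic detection and tracking (TDT2004) task definition and evaluation plan. Technical report, National Institute of Standards and Technology, 2004.
[19] John C. Platt. Sequential minimal optimization: A fast algorithm for training support vector machines. Technical report, Microsoft Research.
[20] Richard Segal and Jeffrey O. Kephart. MailCat: An intelligent assistant for organizing [...]
[21] Arun C. Surendran, John C. Platt, and Erin Renshaw. Automatic discovery of personal topics to organize email. In Proc. Conference on Email and Anti-Spam (CEAS) [...]
[22] Steve Whittaker, Victoria Bellotti, and Jacek Gwizdka. Email in personal information management. Communications of the ACM, 49(1):68–73, 2006.
[23] Steve Whittaker and Candace Sidner. Email overload: exploring personal information management of email. In Proc. CHI '96, pages 276–283, 1996.
[24] Ian H. Witten and Eibe Frank. Data Mining: Practical Machine Learning Tools and Techniques (Second Edition). Morgan Kaufmann.
[25] Jen-Yuan Yeh and Aaron Harnly. Email thread reassembly using similarity [...] In Proc. Conference on Email and Anti-Spam (CEAS) '06, 2006.

Gabor Cselle
ETH Zurich, Distributed Computing Group, 8092 Zurich, Switzerland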

Keno Albrecht
ETH Zurich, Distributed Computing Group, 8092 Zurich, Switzerland

Roger Wattenhofer
ETH Zurich, Distributed Computing Group, 8092 Zurich, Switzerland

We present BuzzTrack, an email client extension that helps users deal with email overload. This plugin enhances the interface to present messages grouped by topic, instead of the traditional approach of organizing email in folders. We discuss a clustering algorithm that creates the topic-based grouping, and a heuristic for labeling the resulting clusters to summarize their contents. Lastly, we evaluate the clustering scheme in the context of existing work on topic detection and tracking (TDT) for news articles: our algorithm exhibits performance on email similar to that of current work on news text. We believe that BuzzTrack's organization structure, which can be obtained at no cost to the end user, will be helpful for managing the massive amounts of email that land in the inbox every day.

ACM Classification: H5.2.f. [Information interfaces and presentation]: User Interfaces - Graphical user interfaces. H4.3.c. [Information technology and systems applications]: Communications applications - Electronic mail. H3.3.a. [Information storage and retrieval]: Information Search and Retrieval - Clustering.

General terms: Algorithms, Measurement.

Keywords: Email, organization, clustering, topic detection, topic tracking, sender behavior.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Copyright 2007 ACM 1-59593-481-2/07/0001...$5.00.

INTRODUCTION
To help users deal with the growing amounts of email they receive, new structures of organization are needed. A paradigm [...] search functionality when looking for old emails in their archive. However, sensibly organizing the unstructured inbox is still a challenge. Typically, inbox data is viewed in a list sorted by arrival time: there is no sense of importance, coherence, or content.

In this paper, we address this challenge by grouping email into topics. A topic is a cohesive stream of information that is relevant to the user. Here, it consists of a number of emails which discuss or relate to the same idea, action, event, task, or question, among others. Examples of topics are sequences of emails in which a meeting is organized and the results are discussed, all emails in a newsletter, or an email exchange with a coworker about a research idea. Topics are not equivalent to threads. A thread consists of emails in the same reply sequence. A topic may span several threads, and a thread may be distributed over several topics.

Several comparable approaches, such as task-, activity-, or [...] more general in that it covers all kinds of email-based discussions instead of just business processes or task-related items. Another difference is that we seek to complement, not replace, existing email client functionality: the user's inbox stays untouched. We simply provide a view on the data, which we integrate into the email client as a plugin, shown in Figure 1.

Figure 1: Inbox view with BuzzTrack plugin; topic sidebar on the left, grouped inbox view on the right.

Email organization has been stuck in the file-system-like folder paradigm for a long time. Standard email clients, such as Microsoft Outlook or Mozilla Thunderbird, store and display email in a folder hierarchy. Users must create folders and move emails into them manually or via user-defined filter rules. While there has been much work on automatic [...] used. One reason for this is the distrust of users in the underlying classification algorithms: they fear that a misplaced message may never receive the attention it deserves. In contrast, we never move messages out of the inbox, but provide a view on the data.

[...] with just a few folders and filter rules for newsletters [16]. Our aim is to complement this flat structure with a view that provides a fully automatic grouping of emails based on topics, with no change to the inbox data and no additional cost to the user.

How do we group emails into topics? Our techniques are inspired by "topic detection and tracking" (TDT), a series of NIST competitions [18] aimed at organizing news articles into coherent topics. As in TDT for news articles, the core of our clustering algorithm is a simple text similarity measure. Although email text is often of lower quality than newspaper content, it is richer in non-textual information: [...] behavioral measures such as the percentage of replied emails, contact rankings based on email volume, past behavior, and reply timing. The focus of our work was designing a clustering algorithm which organizes email well, according to NIST's measures. We [...] clustering methods and features, and present our findings in the section on clustering.

Another problem we address is that of topic labeling: after we have identified topic groups, we label them so the user understands their contents at a glance. We briefly present a heuristic which uses the subject line for single-thread topics, and common subject or content words for multi-thread topics or cases where the quality of the subject line is insufficient.

The rest of this paper is split into two parts: after reviewing related work, a first section explains our email client plugin and its functionality, while the rest discusses our clustering and labeling algorithms.

RELATED WORK
The problem of email overload [23] is now widely acknowledged and has found a lot of attention even in the popular media. A variety of solutions have been proposed, including task-based [22], activity-based [10], priority-based [12], and sender-based [3] organization schemes. For example, Dredze et al. [10] provide successful algorithms to recognize emails that belong to particular activities, such as organizing a conference, reviewing papers, or purchasing equipment. Our approach is more general as it covers all kinds of discussions, but comes at the price of slightly decreased accuracy.

Most similar to our work are the "personal topics" proposed by Surendran et al. [21]. However, the mechanisms presented there provide a retrospective view of past emails, whereas our processing is an on-line scheme that is updated as new emails come in. We had to apply simpler but less computationally intensive schemes of topic clustering and labeling.

Automatic foldering [4,7,14,20] also offers help in organizing email, but the user needs to manually create folders and seed them with example data, a laborious task if many fine-grained topics need to be differentiated. Many users distrust these schemes because they might move emails out of sight, never to be seen again [3].

While other models of text-based topic identification have been proposed [6], the work presented here uses the same techniques as topic detection and tracking for news articles [18]. Allan et al. investigated making this technology more accessible for normal users [2], but we are not aware of any work that applied TDT to email.

There exists a considerable body of work on redesigning email user interfaces and grouping emails by various attributes. Two examples are TaskMaster [5], which helps users group emails, attachments, links, and other information [...] but ingenious idea of grouping emails based on the importance that the user manually assigned to contacts. In contrast, we display the organization automatically, without any manual input. The user can still make manual corrections, if necessary.

APPLICATION
This section presents the functionality of BuzzTrack, its implementation, and usage.

BuzzTrack is implemented as an extension to Mozilla Thunderbird 1.5, but topic clustering and labeling are performed in Python. We will make this software available for download [...] The implementation contains about 3,000 lines of JavaScript/XUL code and 14,000 lines of Python code. Both components are platform-independent.

Figure 1 shows a typical screenshot of BuzzTrack in use: the topic list was added as a sidebar on the left, and the email list on the right side groups emails by topic. When a new email arrives, it is processed by the clustering component, which either decides to create a new topic for the email or adds it to one of the existing topics. Depending on application settings, each email can also be added to multiple closest topics.

[...] a topic: the topic label, a count and the names of all people involved in the topic, and the number of unread and total messages. In an expanded view, the full names of all topic [...] the user clicks on a sidebar entry, the email list scrolls to the [...]

Topics in the sidebar are generally sorted by [...]. The user can "star" important topics to pull them to the top of the list: this solves a common user problem that important emails are quickly forgotten once they drop out of the first few screens of the inbox [23]. The email list ignores starring to mirror the traditional organization scheme of many users.

[...] need to manually create or edit topic contents. However, they can still manually fix mistakes made by the clustering algorithm [...] or through context menus. Similarly, users can manually rename topic clusters.

Figure 2: Components and supported operations in the topic sidebar.

We are also testing two experimental features. A "reply expected" indicator marks topics in which the newest email was addressed to the user and has not yet been replied to. The other feature is "expand topic" / "contract topic": these two options pull more or fewer emails into a certain topic if the user thinks that important messages have been left out or that the topic is too large. These operations retroactively modify the clustering threshold for a given topic.

At present, we concern ourselves solely with email arriving in the inbox. We assume that all spam has already been filtered out, either by the spam filter integrated in the email client or by a filtering solution such as Spamato [1].

Note that the user can exit the BuzzTrack view and return to the traditional three-pane setup by clicking a button in the [...]

CLUSTERING
This section explains our clustering algorithm for

grouping emails into topics. It is similar to methods used for clustering news messages. Email is much richer in information content than the text of news articles, and we are able to use a large number of features for matching together emails, which we will discuss in detail.

The basic algorithm is single-link clustering with a metric which consists of a tf·idf text similarity and several non-textual attributes. We use an on-line clustering algorithm, as we are interested in immediately processing incoming emails. The output of our clustering algorithm is a decision score that describes whether an email should be matched to a topic or not. When this score is below a clustering threshold for all existing clusters, the email is mapped to a new topic. Otherwise, it is mapped to any number of closest topics. We have experimented with different methods for generating decision scores from feature values, including brute-force guessing, linear regression models, and linear support vector machines.

This section describes the preprocessing steps for emails and gives details of the features used in generating decision scores.

In a first step, we parse each email's headers, body, and attachments. We remember header information but only keep the file names of attachments. For the text similarity metric at the core of our algorithm, we need to clean and tokenize the body and subject into terms of one word each and perform the following operations:

- We convert foreign characters into canonical form.
- We remove words with special characteristics. This includes [...] HTML tags, style sheets, punctuation, and numbers. These will often contain information not relevant to topic [...]
- Identify parts-of-speech. For topic labeling, which occurs later on, we apply a part-of-speech tagger from the NLTK Lite toolkit for Python [15].
- We run the full text of subject and body through TextCat [8], a language guesser. If the language in which the email was written is known, we apply the Porter stemming algorithm to reduce all words to their base form. We currently handle English and German. These two languages make up 99.6% of our corpus. We do not use the information from the language guesser for any other purpose.

We now review the features we have constructed from the data. We denote the $N$ emails in the inbox with $m_1, \dots, m_N$ and the $M$ existing clusters, which represent one topic each and contain one or more emails, with $C_1, \dots, C_M$. There exists a set of all $n$ stemmed terms that appear in emails. We refer to these terms as $t_1, \dots, t_n$.

The first feature is a text similarity metric. For an email $m_j$, we refer to the term frequency of term $t_i$ as $tf_{i,j}$, and to the document frequency of $t_i$ as $df_i$. We now define the term weight $w_{i,j}$ as follows:

\[ w_{i,j} = \begin{cases} (1 + \log tf_{i,j}) \, \log \dfrac{N}{df_i} & \text{if } tf_{i,j} \geq 1 \\ 0 & \text{if } tf_{i,j} = 0 \end{cases} \]

To determine text similarity, we use a standard cosine measure. Given two emails $m_i$ and $m_j$, we define the text similarity measure as follows:

\[ sim_{text}(m_i, m_j) = \frac{\sum_{k=1}^{n} w_{k,i} \, w_{k,j}}{\sqrt{\sum_{k=1}^{n} w_{k,i}^2} \; \sqrt{\sum_{k=1}^{n} w_{k,j}^2}} \]

The second measure is also text-based and tests subject similarity: it calculates the overlap between the sets of words $S_i$, $S_j$ in the subject lines of two emails:

\[ sim_{subject}(m_i, m_j) = \frac{2\,|S_i \cap S_j|}{|S_i| + |S_j|} \]
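The two text-based features above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: emails are taken to be already tokenized and stemmed, the logarithm base is natural (the text does not specify one), and the names `term_weights`, `sim_text`, `sim_subject`, and the toy corpus are hypothetical rather than taken from the BuzzTrack code.

```python
import math
from collections import Counter


def term_weights(tokens, doc_freq, n_docs):
    """Map each term of one email to (1 + log tf) * log(N / df).

    Terms that do not occur in the email are simply absent, i.e. weight 0.
    doc_freq must contain every term that appears in tokens.
    """
    counts = Counter(tokens)
    return {
        term: (1.0 + math.log(tf)) * math.log(n_docs / doc_freq[term])
        for term, tf in counts.items()
    }


def sim_text(weights_a, weights_b):
    """Cosine of the angle between two sparse term-weight vectors."""
    dot = sum(w * weights_b.get(term, 0.0) for term, w in weights_a.items())
    norm_a = math.sqrt(sum(w * w for w in weights_a.values()))
    norm_b = math.sqrt(sum(w * w for w in weights_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an all-zero vector has no direction
    return dot / (norm_a * norm_b)


def sim_subject(subject_a, subject_b):
    """Overlap 2|A & B| / (|A| + |B|) between two sets of subject words."""
    a, b = set(subject_a), set(subject_b)
    if not a and not b:
        return 0.0
    return 2.0 * len(a & b) / (len(a) + len(b))


# Toy corpus of three already-stemmed emails (made-up data).
emails = [
    ["amazon", "order", "status", "order"],
    ["amazon", "order", "ship"],
    ["meet", "agenda", "room"],
]
# Document frequency: in how many emails each term occurs.
doc_freq = Counter(term for email in emails for term in set(email))
weights = [term_weights(email, doc_freq, len(emails)) for email in emails]
```

With this toy data, the two Amazon emails get a similarity strictly between 0 and 1, while the meeting email shares no terms with them and scores 0. A term that occurs in every email receives weight 0, which is why the zero-norm guard in `sim_text` is needed.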

Next, we use two people similarity metrics that compare the set of people participating in a topic with the set of people to [...] set $ppl(m_i)$ with all addresses in the From, To, and Cc headers. Similarly, for each topic cluster $C_k$, there is a set of senders $ppl(C_k)$ which contains all addresses from emails in the cluster. We define two people-based similarity measures as follows:

\[ sim_{people,subset}(m_i, C_k) = \frac{|ppl(m_i) \cap ppl(C_k)|}{|ppl(m_i)|} \]

\[ sim_{people,overlap}(m_i, C_k) = \frac{2\,|ppl(m_i) \cap ppl(C_k)|}{|ppl(m_i)| + |ppl(C_k)|} \]

These are equivalent to the SimSubset and SimOverlap metrics introduced in [10]. We also remove the user from the ppl sets, as he or she is by definition present on the receiver list of every message (see footnote 1). In addition, we introduce two variants of these indicators that operate on the domain name parts of sender addresses only, $sim_{domains,subset}$ and $sim_{domains,overlap}$. They help in recognizing emails coming from different people in the same organization or company.

[...] information contained in the "References" and "In-Reply-To" headers. Almost all contacts in the corpus used modern email clients that employed these headers. The $sim_{thread}$ measure gives the percentage of emails in the cluster which are in the same thread as the new email:

\[ sim_{thread}(m_i, C_k) = \frac{|T|}{|C_k|} \]

The set $T$ contains all $m_j \in C_k$ which are in the same thread as $m_i$.

We also add a number of simpler features, partly taken from existing literature [10,17], and enriched with additions:

- Sender rank: A ranking of contacts by the number of emails received from them. We derive two additional features by multiplying the sender rank with $sim_{subject}$ and $sim_{text}$.
- Sender percentage: Fraction of total emails in the inbox which came from the same sender in the past.
- Sender answers: Percentage of emails from the same sender which have been answered by the user in the past.
- [...] time has passed since the last email in the topic.
- [...] of an [...]
- Reference count: Number of references to previous emails in the header fields.
- Known people: Number and percentage of addresses in the To/Cc headers which have been seen before.
- Known references: Number and percentage of emails in the references field whose ID matches an email sent or received.
- Cluster size: Cluster size of the topic being [...]
- Has attachment: 1 if the email has an attachment, 0 otherwise.

We use two methods to construct decision scores from features. Both employ a linear combination of the feature values, which we normalize into a range of [0,1]. In the "manual" method, we take four features with high information gain, guess appropriate ranges for their weights, and run a brute-force evaluator over the development set. For the second method, we determined feature values for twelve [...] rank features in the development corpus, and trained a linear support vector machine on the development set by sequential minimal optimization [19] ("SMO"), as implemented in the Weka toolbox [24].

Footnote 1: The drawback of this choice is that we need a list of all email addresses of the user. This is not immediately available from the email client, as there may exist many aliases or forwarding addresses which the [...] not know about.
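As an illustration of the people and thread features and of how a linear decision score could feed the thresholding step, here is a minimal Python sketch. All function names, the thread-ID representation, and the example weights are assumptions made for this example; the paper's actual weights come from brute-force search or a trained SVM and are not reproduced here.

```python
def without_user(addresses, user_addresses):
    """Drop the mailbox owner's own addresses from a set of participants."""
    return set(addresses) - set(user_addresses)


def sim_people_subset(email_people, cluster_people):
    """Share of the email's participants that already appear in the cluster."""
    email_people, cluster_people = set(email_people), set(cluster_people)
    if not email_people:
        return 0.0
    return len(email_people & cluster_people) / len(email_people)


def sim_people_overlap(email_people, cluster_people):
    """Symmetric overlap 2|A & B| / (|A| + |B|) of the two participant sets."""
    email_people, cluster_people = set(email_people), set(cluster_people)
    total = len(email_people) + len(cluster_people)
    if total == 0:
        return 0.0
    return 2.0 * len(email_people & cluster_people) / total


def sim_thread(email_thread, cluster_threads):
    """Fraction of the cluster's emails that share the new email's thread.

    cluster_threads holds one thread identifier per email in the cluster.
    """
    if not cluster_threads:
        return 0.0
    same = sum(1 for thread in cluster_threads if thread == email_thread)
    return same / len(cluster_threads)


def decision_score(features, weights):
    """Linear combination of feature values already normalized to [0, 1]."""
    return sum(weights[name] * value for name, value in features.items())


def closest_topics(scores, threshold, max_topics=1):
    """Pick up to max_topics clusters whose score reaches the threshold.

    An empty result means every cluster scored below the threshold, so the
    email would start a new topic.
    """
    hits = sorted(
        ((score, cid) for cid, score in scores.items() if score >= threshold),
        reverse=True,
    )
    return [cid for _, cid in hits[:max_topics]]
```

For example, `closest_topics({"c1": 0.2, "c2": 0.7, "c3": 0.9}, threshold=0.5, max_topics=2)` would select `c3` and `c2`, while a threshold of 0.95 would return an empty list and therefore open a new topic.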

Time Window
When clustering, we only consider clusters which have been active in the last 60 days. Statistics generated from the development corpus and existing research [13] [...] previous [...] or never. However, we still need to be able to catch topics with long inter-arrival times: for example, reminders or newsletters sent only once per month. As we are not interested in processing the user's entire email archive, topics older than 60 days are dropped. A great benefit of this design choice is that processing time per newly arriving email is significantly reduced.

We use a simple heuristic for labeling clusters. In short, we use significant common words from subject lines, if available, and resort to content words if the subject words are not descriptive. While more elaborate schemes have been presented [21], we cannot afford to spend much processing power on deriving cluster labels on-line. A topic's label is recalculated every time a new email is added to it.

For this task, we use information from several sources: in the preprocessing step, we use a tagger to identify nouns. During clustering, we derive a tf·idf value for each stemmed word term in each email. For each of these terms, we keep track of its most popular non-stemmed version and the most popular capitalized version. Capitalization is an important factor: users are well aware of the favored capitalization of certain terms, which are often very descriptive names or identifiers, such as "TDT," "Sarah," and the like. These words also tend to have high tf·idf values, and are likely to appear in the topic label. In previous work [2, 21], only lower case was displayed.

1. If the topic consists of emails from just one thread with at least two words in the common subject, and these words have sufficiently high tf·idf weights, set the label to the common subject (with "Re," "Fwd," and similar prefixes removed). The first constraint ensures that we cover all emails in a topic, while the last two constraints seek to guarantee sufficient descriptiveness for the [...]
2. If there is more than one thread, try to find a subset of at least two subject words which occur in every email's subject and have sufficiently high tf·idf weights. If such words are found, set them as the topic label. This method is very useful for finding newsletter labels, as they have subjects of the form "FTD Newsletter - 28 Aug 2006," "FTD Newsletter - 4 Sep 2006," and so on. In this example, the topic label will be "FTD Newsletter," a good description of the topic contents.
3. If the first two methods fail, we take the 3 highest-tf·idf noun words for the [...]

After having selected words for the label, how do we order them? We go back to the tokenized mail structure and look up the first occurrences of each word in each email's subject plus body. Depending on their relative order, we construct a directed, weighted graph of "occurs-after" relations. Each edge in the graph has an associated count with the number [...] target word. As the first word, we choose the one with the highest value for total outgoing weight minus total incoming weight. We then traverse the graph by choosing the highest-weight link each [...]. If two links share the same highest weight, we choose the word with the lowest average first position. Figure 3 provides an example graph and result.

Figure 3: Example graph of occurs-after relationships between high-tf·idf terms, for the subject lines "Your order with Amazon," "Amazon order status: your items," "Amazon order status: shipped," and "Amazon order: item status." The numbers in parentheses give the average first position. Here, "your," "with," and "shipped" are not nouns and not high-tf·idf words. Due to stemming, "item" and "items" map to the same term. The resulting label would be "Amazon order status items".

Note that we need to keep an additional tf·idf value: we regard the emails of the entire cluster as one [...] as compared to the entire collection. This is useful if many emails inside the cluster contain the same word, but the word never occurs outside the cluster. This word is very descriptive for the cluster, but with a per-[...] measure, its value may be too low.

We found that this method produces labels of sufficient quality. We did not specifically cover label quality in our evaluations. In case the generated label is bad, the user can still manually rename the topic as a last resort.

EVALUATION
For evaluating our clustering scheme, we follow the guidelines provided by NIST for the evaluation of topic detection and tracking on news articles [18]. We give a short overview [...] Detection and Topic Tracking. In effect, the clustering evaluation is recast into two detection tasks.

The New Topic Detection Task (NTD) is defined to be the task of detecting, in a chronologically ordered stream of emails, the first email that discusses a new topic. For each email in the stream, the output can be "yes" (the email is a new topic) or "no" (the email is not a new topic).

The Topic Tracking Task (TT) is defined to be the task of associating incoming emails with topics that are known to the system. A topic is "known" by its association with emails that discuss it. Each target topic is defined by the first email that discusses it. The tracking task is then to classify correctly all subsequent emails as to whether or not they discuss [...] output can be "yes" (the email belongs to the topic) or "no" (the email does not belong to the topic).

For each decision in each task, the system must output the decision score which describes the level of confidence with which the classification was made. These scores will later be used to find a threshold which presents the optimal trade-off between misses and false alarms.

Detection quality is characterized in terms of the probability of miss and false alarm errors, $P_{miss}$ and $P_{FA}$.
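How decision scores turn into miss and false alarm rates at a given threshold can be sketched as follows. This is a generic illustration, not the NIST scoring tool: it assumes that a score at or above the threshold means "yes," that ground-truth labels are available as booleans, and the function names and sample scores are made up.

```python
def error_rates(scores, is_target, threshold):
    """Return (p_miss, p_fa) when every score >= threshold is answered "yes".

    p_miss: share of true targets answered "no".
    p_fa:   share of non-targets answered "yes".
    """
    target_scores = [s for s, t in zip(scores, is_target) if t]
    nontarget_scores = [s for s, t in zip(scores, is_target) if not t]
    misses = sum(1 for s in target_scores if s < threshold)
    false_alarms = sum(1 for s in nontarget_scores if s >= threshold)
    p_miss = misses / len(target_scores) if target_scores else 0.0
    p_fa = false_alarms / len(nontarget_scores) if nontarget_scores else 0.0
    return p_miss, p_fa


def sweep_thresholds(scores, is_target):
    """Evaluate every observed score as a candidate threshold.

    Returns (threshold, p_miss, p_fa) tuples in ascending threshold order.
    """
    return [
        (th, *error_rates(scores, is_target, th))
        for th in sorted(set(scores))
    ]


# Made-up decision scores with their ground-truth labels.
scores = [0.9, 0.8, 0.3, 0.2]
is_target = [True, False, True, False]
points = sweep_thresholds(scores, is_target)
```

Raising the threshold trades false alarms for misses: in the sample data the lowest threshold yields no misses but flags every non-target, while the highest yields no false alarms but misses half of the targets.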
