
11 March 2024
Cross-Language Information Retrieval Technology

Road Map
- Cross-Lingual IR
- Motivation
- Definition
- General Issues With CLIR
- Basic Approaches to CLIR
- CLIR evaluation
- CLIR applications

Information Retrieval
- Single language: both the user's query and the documents to be searched are in the same language.
- Cross language: the documents are written in a language different from the language of the user's query.
[Figure: documents, query]

Growth in Internet language use by world region, 2000-2010 (data updated June 30, 2010)

The Internet Big Picture
World Internet Users and 2015 Population Stats

World Regions       Population      Internet Users  Penetration (% population)  Users % of Table  Growth 2000-2015
Africa              1,158,355,663   313,257,074     27.0%                       9.6%              6,839%
Asia                4,032,466,882   1,563,208,143   38.8%                       47.8%             1,268%
Europe              821,555,904     604,122,380     73.5%                       18.5%             475%
Middle East         236,137,235     115,823,882     49.0%                       3.5%              3,426%
North America       357,172,209     313,862,863     87.9%                       9.6%              191%
Latin America       617,776,105     333,115,908     53.9%                       10.2%             1,743%
Oceania/Australia   37,157,120      27,100,334      72.9%                       0.8%              256%
World Total         7,260,621,118   3,270,490,584   45%                         100%              806%

Usage of content languages for websites

2002                 2015
English      72%     English      54.5%
German       7%      Russian      5.9%
Japanese     6%      German       5.7%
Spanish      3%      Japanese     5.0%
French       3%      Spanish      4.7%
Italian      2%      French       4.1%
Dutch        2%      Portuguese   2.6%
Chinese      2%      Chinese      2.2%
Korean       1%      Italian      2.1%
Russian      1%      Polish       1.9%
Portuguese   1%      Turkish      1.6%

Source:
/technologies/overview/content_language/all
/research/activities/wcp/stats/intnl.html

Cross Language IR

Motivation
- Information unavailability in some languages
- Language barrier

Definition: Cross-language information retrieval (CLIR) is a subfield of information retrieval dealing with retrieving information written in a language different from the language of the user's query (Wikipedia).

Example: A user may pose a query in Chinese but retrieve relevant documents written in English.

Why do we need CLIR systems?
- We need technologies that enable access to information regardless of geographic or language barriers.
- The goal is to find, retrieve and understand relevant information in whatever language or form.
- CLIR has become one of the key factors affecting knowledge sharing all over the world.

General Issues With CLIR
- Multilingual text access (character sets, etc.)
- Differences between languages
  - stemming, compound words, breaks between words, etc.
- Term ambiguity between languages
- What to translate (query vs. document) and how

Matching strategies
- No translation
  (1) Cognate matching
- Translation
  (2) Query translation
  (3) Document translation
  (4) Interlingual techniques

Cognate matching
- In the most naive form of cognate matching, untranslatable terms such as proper nouns or technical terminology are left unchanged through the translation stage.
- The unchanged term can be expected to match a corresponding term in the other language if the two languages are closely related (for example, "generation" in English and French).
- When the two languages are very different, cognate matching may still be made feasible by measuring the similarity between a transliteration and its original word.

Query translation
[Figure labels: Chinese query, translation system, French query, search engine, French document collection, French document results, select, browse]
Process:
- Translate the Chinese query into French.
- Search the French document collection.
- Translate the retrieval results into Chinese.

Query translation
- Query translation is the most widely used matching strategy for CLIR because of its tractability.
- The retrieval system does not have to change its inverted files of index terms in any way to handle queries in any language.
- Translating a query is computationally cheaper than translating a large set of documents.
- Challenge: term ambiguity. "Queries are often short and short queries provide little context for disambiguation."
- Term disambiguation will be discussed later.

Pros and cons of query translation
Pros
- Simple and easy to carry out
- Flexible
- Saves time and space; efficient
Cons
- Lacks context
- For short queries, translation ambiguity is high

Document translation
[Figure labels: Chinese query, French document collection, translation system, Chinese document collection, search engine, results, select, browse]
Process:
- Translate the entire French document collection into Chinese.
- Search the Chinese documents directly.

Document translation
- Document translation has the opposite advantages and disadvantages to query translation.
- In CLIR experiments this approach is not usually used, and query translation is dominant.
- However, some researchers have used it to translate large sets of documents, since the more varied context available within each document can improve translation quality.
- Oard and Hackett (1998) reported that automatic machine translation of a set of documents using a commercial MT system outperformed query translation in a German-to-English CLIR experiment.

Pros and cons of document translation
Pros
- Each document is translated only once
- Documents provide richer context
- Documents can be translated offline in advance
Cons
- Translation is slow
- Consumes a large amount of space and time; inefficient
- Depends on the quality of the machine translation system

Query translation vs. document translation
- The choice depends on the language resources available.
- Query translation is generally more widely used.
- Both approaches raise "interactivity" challenges.

Interlingual approach
- An intermediate space of subject representation, into which both the query and the documents are converted, is used to compare them.
- One type of interlingual approach uses the "synsets" provided in WordNet, a well-known machine-readable thesaurus.
- For example, Diekema, Oroumchian, Sheridan, and Liddy (1999) employed WordNet synset numbers as language-independent representations for CLIR.
- Since a synset number (label) representing a concept corresponds to a set of concrete words in each of the supported languages (e.g., English and French), a query term in the source language can be linked to words in the target language via the synset number.

Translation techniques
- Dictionary-based methods
- Parallel corpora-based method
- Use of WWW resources

Dictionary-based methods
- These methods use a bilingual Machine Readable Dictionary (MRD).
- Most retrieval systems are still based on so-called "bag-of-words" architectures, in which both query statements and document texts are decomposed into a set of words (or phrases) through a process of indexing.
- Thus a query can be translated easily by replacing each query term with its translation equivalents appearing in a bilingual dictionary or a bilingual term list.

Bilingual dictionary
Manually built bilingual dictionaries
- Printed
  - Merriam-Webster's Dictionaries
  - Longman Dictionaries
- Electronic
  - Freedict at /
  - Travlang at /
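The term-by-term replacement described under dictionary-based methods can be sketched as follows. This is a minimal illustration under stated assumptions, not the method of any particular system: the toy English-to-French term list, the name TOY_DICTIONARY, and the function translate_query are all hypothetical. Terms missing from the dictionary are passed through unchanged, which corresponds to the naive cognate matching described earlier.

```python
from typing import Dict, List

# Toy English-to-French term list. The entries are illustrative assumptions,
# not taken from any real machine-readable dictionary.
TOY_DICTIONARY: Dict[str, List[str]] = {
    "bank": ["banque", "rive"],
    "tax": ["impôt", "taxe"],
    "reform": ["réforme"],
}


def translate_query(query: str, dictionary: Dict[str, List[str]]) -> List[str]:
    """Replace each query term with all of its dictionary equivalents.

    The query is treated as a bag of words: it is lowercased and split on
    whitespace. Terms that are not in the dictionary (for example proper
    nouns) are kept unchanged.
    """
    translated: List[str] = []
    for term in query.lower().split():
        # Keep every listed equivalent; no disambiguation is attempted here.
        translated.extend(dictionary.get(term, [term]))
    return translated


if __name__ == "__main__":
    print(translate_query("tax reform Lambsdorff", TOY_DICTIONARY))
```

Because every equivalent is kept, an ambiguous term such as "bank" contributes both "banque" and "rive" to the translated query, which is exactly the ambiguity that the disambiguation techniques are meant to reduce.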

Problems with such dictionaries
- They have to be processed to be readable by machine.
- Limited vocabulary
- Dictionary translations are inherently ambiguous and add extraneous information.
Dictionaries built automatically by machine are called Machine Readable Dictionaries (MRD).

Term translation
[Figure labels: "oil", "petroleum"; "probe", "survey", "take samples" (which translation to choose?); no translation!; "restrain", "cymbidium goeringii" (word segmentation error)]

Some issues in term translation
- Compound words, for example in German: decomposition
- No boundary between words, e.g. in Chinese: segmentation
- Specialized vocabulary not contained in the dictionary, e.g. named entities

Examples
- Compound decomposition
- Chinese word segmentation of 新西兰花:
  - 新西兰 / 花: New Zealand flowers
  - 新 / 西兰花: fresh broccoli

Corpora-based method
- Parallel or comparable corpora are useful resources from which information beneficial for CLIR can be extracted.
- For example, in order to translate English queries into Spanish, Davis and Dunning (1995) extracted moderately frequent Spanish terms from Spanish documents aligned with English documents that had been retrieved using an English query (the source query).

Parallel corpora
- A parallel corpus (pl. corpora) is a document collection composed of two or more disjoint subsets, each written in a different language, such that the documents in each subset are translations of the documents in each other subset.
- Very high accuracy

[Figure labels: hieroglyphs, ancient Egyptian script, Greek]

The Rosetta Stone
The Rosetta Stone, 1.14 m high and 0.73 m wide, is a granodiorite stele made in 196 BC that bears a decree of the Egyptian king Ptolemy V. The same content is inscribed on it in Greek, in ancient Egyptian script, and in the demotic script of the time. Because the stone carries the same text in three versions, modern archaeologists were able to compare the versions and decipher the meaning and structure of the Egyptian hieroglyphs, which had been lost for more than a thousand years. It has become an important milestone in the study of ancient Egyptian history.

More parallel corpora
- News:
  - DE-News (German-English)
  - Hong-Kong News, Xinhua News (Chinese-English)
- Government documents:
  - Canadian Hansards (French-English)
  - Europarl (Danish, Dutch, English, Finnish, French, German, Greek, Italian, Portuguese, Spanish, Swedish)
  - UN Treaties (Russian, English, Arabic, …)
- Bible (many, many languages)

Examples
English: Diverging opinions about planned tax reform
German: Unterschiedliche Meinungen zur geplanten Steuerreform

English: The discussion around the envisaged major tax reform continues.
German: Die Diskussion um die vorgesehene grosse Steuerreform dauert an.

English: The FDP economics expert, Graf Lambsdorff, today came out in favor of advancing the enactment of significant parts of the overhaul, currently planned for 1999.
German: Der FDP-Wirtschaftsexperte Graf Lambsdorff sprach sich heute dafuer aus, wesentliche Teile der fuer 1999 geplanten Reform vorzuziehen.

Comparable corpora
- A comparable corpus is a pair of corpora in two different languages that come from the same domain and discuss the same topics.
- Parallel sentences may also be mined from comparable corpora, such as news stories written on the same topic in different languages.
- Some researchers extract phrase pairs from comparable corpora using a classifier approach.

Use of WWW resources
- The WWW can provide rich and ubiquitous machine-readable resources, from which information useful for CLIR may be extracted automatically.
- For example, Chen (2002) and Chen and Gey (2003) used a general Internet search engine to find English translation equivalents of Chinese or Japanese terms (mainly proper nouns) by analyzing the contexts of these terms in the Chinese and Japanese Web documents returned by the engine.

Term disambiguation techniques (translation ambiguity)
Disambiguation means choosing from among multiple alternative term translations, e.g., for "Apple" or "Bank".
- Use of part-of-speech (POS) tags
- Use of a parallel corpus
- Use of co-occurrence statistics in the target corpus
- Use of the query expansion technique

Use of part-of-speech tags
- The basic idea of using part-of-speech (POS) tags for translation disambiguation is to select only translations that have the same POS as the source query term.
- This method requires that POS tagging software is available for both languages.

Parallel corpus-based disambiguation
- Davis (1997, 1998) used a parallel corpus to determine the "best" translation or set of translations: a single translation for each source term was selected from the set of translations listed in an MRD according to the result of searching a parallel corpus.

Translation probability
[Figure: the term 探测 with multiple candidate translations (survey, 试探 "probe", 样品 "sample", 测量 "measure"), shown with translation probabilities p=0.4, p=0.3, p=0.25, p=0.05]

Disambiguation based on co-occurrence statistics
- The correct translations of query terms should co-occur in target-language documents, and incorrect translations should tend not to co-occur.
- First, the two most related terms in the query are determined from co-occurrence statistics in the source-language corpus; then the "best" translations are selected from all pairs of translations of these two terms according to co-occurrence statistics in the target-language corpus.
- Note that these two corpora do not have to be parallel or comparable.

Query expansion for disambiguation
- Pseudo relevance feedback (PRF), also known as blind feedback, is widely recognized as an effective technique for enhancing the performance of information retrieval.
- PRF also works effectively for CLIR tasks.
- In CLIR, two kinds of PRF are feasible: pre-translation feedback and post-translation feedback.

Pre-translation feedback
- If a corpus in the source language is available, documents can be retrieved from it prior to translation in order to add a set of new terms to the source query (pre-translation feedback).
- Pre-translation feedback may contribute to improved precision.
- This is because the PRF is done using the entire query rather than each source term separately. That is, synonyms or related terms corresponding to the "correct" meaning of each source term within the context of the query are expected to be added automatically through the PRF process.

Post-translation feedback
- After translation, standard PRF can be applied using the target document collection (post-translation feedback).
- Post-translation feedback can be considered a device for improving recall, as shown in standard experiments on monolingual retrieval.
- In CLIR, two well-known methods for weighting terms in the top-ranked documents are often used to select "good" terms: the Rocchio method and the probabilistic method.

Bi-directional translation
- Boughanem et al. (2002) explored a "bi-directional translation" technique in which a form of backward translation is used for ranking translation candidates.
- Suppose that we need to translate English query terms into French ones. In bi-directional translation, a set of French equivalents for an English term is first found in an English-French dictionary.
- Next, using a French-English dictionary, each French equivalent is translated back into a set of English terms.
- If that set includes the original source term, the French translation equivalent is chosen as a preferred translation.

CLIR evaluation
Information retrieval evaluation
- Given a search topic, a document collection, and a set of manually judged relevant documents
- Judge the retrieval results returned by the system
Evaluation campaigns
- TREC CLIR (1996-2002): English to other languages
- CLEF (2000-): between European languages
- NTCIR (1999-): Asian languages and English

[Figure: CLIR evaluation model]

Applications of CLIR

2.1 Cross-language Search Engine
- April 25, 2006: European search engine "Quaero"
  - The French President announced 90 million euros of support.
- May 16, 2007: Google Translate
  - Provides CLIR for 12 languages
  - Goal: take "all the Web & translate into multiple langs."
- May 5, 2008: Yahoo Babel Fish
  - Provides CLIR between 12 languages
  - It was AltaVista's project, later bought by Yahoo.

Google Translate

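Yahoo Babel Fish

Question
Compare the cross-language search engines of Google and Yahoo!, and analyze the strengths and weaknesses of each.
- Google: one step (translate & search); the retrieval results are translated back into the source language.
  - Pros: fast; users can easily understand the retrieval results.
  - Cons: users cannot modify the translation.
- Yahoo!: two steps (translate + search); the retrieval results are not translated.
  - Pros: there is an intermediate step, so users can modify the translation.
  - Cons: more complex; users cannot read the retrieval results.

2.2 Cross-language retrieval in digital libraries
WorldWideScience, a global science cross-language retrieval platform, was released on June 11, 2010 at the summer conference of ICSTI (the International Council for Scientific and Technical Information) held in Helsinki, the capital of Finland.

WorldWideScience
/multilingual
- The members of the alliance are all professional library and information institutions or leading organizations in scientific and technical information, such as the US Department of Energy Office of Scientific and Technical Information (OSTI), the Library of Congress, the British Library, the Canada Institute for Scientific and Technical Information, the Korea Institute of Science and Technology Information, and the Institute of Scientific and Technical Information of China.
- The platform can also automatically perform cross-language, cross-database retrieval.

2.3 Cross-language patent retrieval
- According to the World Intellectual Property Organization (WIPO), patent documents contain 90% to 95% of the world's research results, while other technical documents (papers, journals, etc.) contain only 5% to 10%.
- Making good use of patent search in research work can shorten R&D time by 60% and reduce R&D costs by 40%.

PATENTSCOPE
/patentscope/search/en/clir/clir.jsp
- In May 2010, WIPO released a beta version of the cross-language patent retrieval system PATENTSCOPE, marking the move of cross-language information retrieval in patent search from the laboratory to practical use.
- The system only supports cross-language patent retrieval among five languages: English, French, German, Japanese, and Spanish.

2.4 Cross-language image retrieval
- At present, the representative cross-language image retrieval system in practical use is PanImages (/), a cross-language image search engine developed by the University of Washington.
- PanImages provides translation for more than 100 languages.
- The user enters a keyword and selects the language it belongs to; the keyword is converted by machine translation into other languages, and the translated keywords are searched in Google image search and Flickr image search.

PanImages
/

2.5 Applications in e-commerce
- CINDOR is currently a relatively successful commercial cross-language information retrieval system.
- The CINDOR system has three core technologies: Conceptual Interlingua, Language Analysis, and Search Management.
- CINDOR currently supports English, French, and Spanish; Simplified Chinese, Russian, and Arabic are under development.

CINDOR
/home.html

Reference
- Kazuaki Kishida. Technical issues of cross-language information retrieval: a review. Information Processing and Management, 2005 (41), pp. 433-455.
- Ge Yundong. Research on query translation techniques for cross-language information retrieval [跨语言信息检索查询翻译技术研究] [D]. Soochow University, 2010.
- Wang Xuwen. Research on cross-language information retrieval based on topic pseudo relevance feedback [基于主题伪相关反馈的跨语言信息检索技术研究] [D]. Beijing University of Posts and Telecommunications, 2014.
- Peng Lin. Research on measuring the semantic similarity of Chinese words and its application in cross-language information retrieval [汉语词语语义相似度度量及其在跨语言信息检索中的应用研究] [D]. Fudan University, 2010.

Challenges for "interaction"
CLIR poses some unique challenges for interaction:
- How do you help users select translated query terms?
- How do you help users se…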
