Cross-Language Information Retrieval Technology
Cross Language Information Retrieval
2024/1/12

Road Map
Cross-Lingual IR
  Motivation
  Definition
General Issues With CLIR
Basic Approaches to CLIR
CLIR evaluation
CLIR applications

Information Retrieval
Single language: both the user's query and the documents to be searched are in the same language.
Cross language: the documents are written in a language different from the language of the user's query.
[Figure labels: documents, query]

Growth in Internet use by world region, 2000-2015 (data updated June 30, 2015)
The Internet Big Picture: World Internet Users and 2015 Population Stats

World Regions | Population | Internet Users | Penetration (% population) | Users % of Table | Growth 2000-2015
Africa | 1,158,355,663 | 313,257,074 | 27.0% | 9.6% | 6,839%
Asia | 4,032,466,882 | 1,563,208,143 | 38.8% | 47.8% | 1,268%
Europe | 821,555,904 | 604,122,380 | 73.5% | 18.5% | 475%
Middle East | 236,137,235 | 115,823,882 | 49.0% | 3.5% | 3,426%
North America | 357,172,209 | 313,862,863 | 87.9% | 9.6% | 191%
Latin America | 617,776,105 | 333,115,908 | 53.9% | 10.2% | 1,743%
Oceania/Australia | 37,157,120 | 27,100,334 | 72.9% | 0.8% | 256%
World Total | 7,260,621,118 | 3,270,490,584 | 45% | 100% | 806%

Usage of content languages for websites

2002 | 2015
English 72% | English 54.5%
German 7% | Russian 5.9%
Japanese 6% | German 5.7%
Spanish 3% | Japanese 5.0%
French 3% | Spanish 4.7%
Italian 2% | French 4.1%
Dutch 2% | Portuguese 2.6%
Chinese 2% | Chinese 2.2%
Korean 1% | Italian 2.1%
Russian 1% | Polish 1.9%
Portuguese 1% | Turkish 1.6%

Source: w3techs/technologies/overview/content_language/all
/research/activities/wcp/stats/intnl.html

Cross Language IR: Motivation
Information is unavailable in some languages.
Language barrier.

Definition: Cross-language information retrieval (CLIR) is a subfield of information retrieval dealing with retrieving information written in a language different from the language of the user's query (Wikipedia).
Example: A user may pose a query in Chinese but retrieve relevant documents written in English.

Why do we need CLIR systems?
We need technologies that enable access to information regardless of geographic or language barriers.
The goal is to find, retrieve, and understand relevant information in whatever language or form.
CLIR has become one of the key factors affecting knowledge sharing all over the world.
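The Chinese-to-English example in the definition can be made concrete with a toy sketch. It assumes the simplest possible strategy: each query term is replaced by its equivalents from a small bilingual term list, and English documents are scored by term overlap. The dictionary entries, the documents, and the function names `translate_query` and `rank` are all hypothetical and chosen only for illustration; a real system would also need word segmentation and term disambiguation.

```python
from collections import Counter

# Hypothetical toy bilingual term list (Chinese -> English); the entries are
# illustrative only and are not taken from any real dictionary.
ZH_EN = {
    "信息": ["information"],
    "检索": ["retrieval", "search"],
    "语言": ["language"],
}

# Hypothetical English document collection.
DOCS = {
    "d1": "cross language information retrieval systems",
    "d2": "a history of the english language",
    "d3": "cooking with fresh broccoli",
}


def translate_query(terms, dictionary):
    """Replace each source term with all of its dictionary equivalents.

    Terms missing from the dictionary are kept unchanged.
    """
    translated = []
    for term in terms:
        translated.extend(dictionary.get(term, [term]))
    return translated


def rank(query_terms, docs):
    """Score each document by the number of query-term occurrences it contains."""
    scores = {}
    for doc_id, text in docs.items():
        counts = Counter(text.split())
        score = sum(counts[t] for t in query_terms)
        if score > 0:
            scores[doc_id] = score
    # Highest score first; ties broken by document id.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# The Chinese query is assumed to be already segmented into terms.
english_query = translate_query(["信息", "检索"], ZH_EN)
results = rank(english_query, DOCS)
print(english_query)
print(results)
```

Keeping every dictionary equivalent, as done here, avoids choosing a wrong translation but adds noise to the query; selecting among the candidates is a separate disambiguation problem.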

General Issues With CLIR
Multilingual text access (character sets, etc.)
Differences between languages: stemming, compound words, breaks between words, etc.
Term ambiguity between languages
What to translate (query vs. document) and how

Matching strategies
No translation
  (1) Cognate matching
Translation
  (2) Query translation
  (3) Document translation
  (4) Interlingual techniques

Cognate matching
In the most naive form of cognate matching, untranslatable terms such as proper nouns or technical terminology are left unchanged through the translation stage.
The unchanged term can be expected to match a corresponding term in the other language if the two languages are closely related linguistically (for example, "generation" in English and French).
When the two languages are very different, cognate matching may still be made feasible by measuring the similarity between a transliteration and its original word.

Query translation
[Figure labels: Chinese query, translation system, French query, search engine, French document collection, French document results, select, read]
Process:
1. Translate the Chinese query into French.
2. Search the French document collection.
3. Translate the retrieved results into Chinese.

Query translation
Query translation is the most widely used matching strategy for CLIR because of its tractability.
The retrieval system does not have to change its inverted files of index terms in any way to handle queries in any language.
Translating a query is less computationally costly than translating a large set of documents.
Challenge: term ambiguity. "Queries are often short and short queries provide little context for disambiguation."
Term disambiguation is discussed later.

Query translation: advantages and disadvantages
Advantages
  Simple and easy to operate
  Flexible
  Saves time and space; efficient
Disadvantages
  Lacks context
  For short queries, translation ambiguity is high

Document translation
[Figure labels: Chinese query, French document collection, search engine, translation system, Chinese document collection, results, select, read]
Process:
1. Translate the entire French document collection into Chinese.
2. Search the Chinese documents directly.

Document translation
Document translation has the opposite advantages and disadvantages from query translation.
In CLIR experiments this approach is not usually used, and query translation is dominant.
However, some researchers have used it to translate large sets of documents, since the more varied context within each document is available for translation, which can improve translation quality.
Oard and Hackett (1998) reported that automatic machine translation of a set of documents using a commercial MT system outperformed query translation in a CLIR experiment from German to English.

Document translation: advantages and disadvantages
Advantages
  Documents are translated only once
  Documents provide richer context
  Documents can be translated offline in advance
Disadvantages
  Translation is slow
  Consumes a great deal of space and time; inefficient
  Depends on the quality of the machine translation system

Query translation vs. document translation
The choice depends on the language resources available.
Query translation is generally more widely used.
Both approaches raise "interactivity" challenges.

Interlingual approach
An intermediate space of subject representation, into which both the query and the documents are converted, is used to compare them.
One type of interlingual approach uses the "synsets" provided by WordNet, a well-known machine-readable thesaurus.
For example, Diekema, Oroumchian, Sheridan, and Liddy (1999) employed WordNet synset numbers as language-independent representations for CLIR.
Since a synset number (label) representing a concept corresponds to a set of concrete words in each of the supported languages (e.g., English and French), a query term in the source language can be linked to words in the target language via the synset number.

Translation techniques
Dictionary-based methods
Parallel corpora-based methods
Use of WWW resources

Dictionary-based methods
These methods use a bilingual Machine Readable Dictionary (MRD).
Most retrieval systems are still based on so-called "bag-of-words" architectures, in which both query statements and document texts are decomposed into a set of words (or phrases) through a process of indexing.
Thus a query can be translated easily by replacing each query term with its translation equivalents appearing in a bilingual dictionary or a bilingual term list.

Bilingual dictionary
Manually built bilingual dictionaries
  Printed
    Merriam-Webster's Dictionaries
    Longman Dictionaries
  Electronic
    Freedict at freedict/
    Travlang at dictionaries.travlang/
Problems
  Has to be processed to be readable by machine
  Limited vocabulary
  Dictionary translations are inherently ambiguous and add extraneous information
Dictionaries built automatically by machine are called Machine Readable Dictionaries (MRD).

Term translation
[Figure: term translation example. Candidate translations shown: oil / petroleum; probe / survey / take samples (which translation should be chosen?). Some terms have no translation. Word segmentation errors produce "restrain" and "cymbidium goeringii".]

Some issues in term translation
Compound words, for example German: decomposition
No boundary between words, e.g. Chinese: segmentation
Specialized vocabulary not contained in the dictionary, e.g. named entities

Examples
Compound decomposition
Chinese word segmentation: 新西兰花
  新西兰 / 花: New Zealand flowers
  新 / 西兰花: fresh broccoli

Corpora-based method
Parallel or comparable corpora are useful resources from which information beneficial for CLIR can be extracted.
For example, in order to translate English queries into Spanish, Davis and Dunning (1995) extracted moderately frequent Spanish terms from Spanish documents aligned with English documents that had been retrieved using an English query (the source query).

Parallel corpora
A parallel corpus (pl. corpora) is a document collection composed of two or more disjoint subsets, each written in a different language, such that documents in each subset are translations of documents in each other subset.
Very high accuracy.

Rosetta Stone
[Figure labels: hieroglyphic script, ancient Egyptian script, Greek]
The Rosetta Stone, 1.14 m high and 0.73 m wide, is a granodiorite stele made in 196 BC and inscribed with a decree of the Egyptian king Ptolemy V.
The same content is carved on the stone in Greek script, ancient Egyptian script, and the demotic script of the time.
Because the stone carries three versions of the same text, modern archaeologists were able to compare them and decipher the meaning and structure of Egyptian hieroglyphs, which had been lost for more than a thousand years. The stone has become an important milestone in the study of ancient Egyptian history.

More parallel corpora
News
  DE-News (German-English)
  Hong Kong News, Xinhua News (Chinese-English)
Government documents
  Canadian Hansards (French-English)
  Europarl (Danish, Dutch, English, Finnish, French, German, Greek, Italian, Portuguese, Spanish, Swedish)
  UN Treaties (Russian, English, Arabic, ...)
Bible (many, many languages)

Examples

English | German
Diverging opinions about planned tax reform | Unterschiedliche Meinungen zur geplanten Steuerreform
The discussion around the envisaged major tax reform continues. | Die Diskussion um die vorgesehene grosse Steuerreform dauert an.
The FDP economics expert, Graf Lambsdorff, today came out in favor of advancing the enactment of significant parts of the overhaul, currently planned for 1999. | Der FDP-Wirtschaftsexperte Graf Lambsdorff sprach sich heute dafuer aus, wesentliche Teile der fuer 1999 geplanten Reform vorzuziehen.

Comparable corpora
A comparable corpus is a pair of corpora in two different languages that come from the same domain and discuss the same topics.
Parallel sentences may also be mined from comparable corpora, such as news stories written on the same topic in different languages.
Some researchers extract phrase pairs from comparable corpora using a classifier approach.

Use of WWW resources
The WWW can provide rich and ubiquitous machine-readable resources, from which information useful for CLIR may be extracted automatically.
For example, Chen (2002) and Chen and Gey (2003) used a general Internet search engine to find English translation equivalents of Chinese or Japanese terms (mainly proper nouns) by analyzing the contexts of these terms in the Chinese and Japanese Web documents returned by the engine.

Term disambiguation techniques
Disambiguation among multiple alternative term translations: how should one choose among several translations? E.g., Apple, Bank.
Use of part-of-speech (POS) tags
Use of a parallel corpus
Use of co-occurrence statistics in the target corpus
Use of the query expansion technique

Use of part-of-speech tags
The basic idea of using part-of-speech (POS) tags for translation disambiguation is to select only translations having the same POS as the source query term.
This method requires that POS tagging software be available for both languages.

Parallel corpus-based disambiguation
Davis (1997, 1998) used a parallel corpus to determine the "best" translation or set of translations: a single translation for each source term was selected from the set of translations listed in an MRD according to the result of searching a parallel corpus.

Translation probability
[Figure: the term 探测 has multiple candidate translations (survey, probe, sample, measure), each with a translation probability; the values shown are p = 0.4, 0.3, 0.25, and 0.05.]

Disambiguation based on co-occurrence statistics
The correct translations of query terms should co-occur in target-language documents, and incorrect translations should tend not to co-occur.
First, the two most related terms in the query are determined based on co-occurrence statistics in the source-language corpus. Then the "best" translations are selected from all pairs of translations of these two terms according to co-occurrence statistics in the target-language corpus.
Note that these two corpora do not have to be parallel or comparable.

Query expansion for disambiguation
Pseudo relevance feedback (PRF), also known as blind feedback, is widely recognized as an effective technique for enhancing the performance of information retrieval.
PRF also works effectively for CLIR tasks.
In the case of CLIR, two kinds of PRF are feasible: pre-translation feedback and post-translation feedback.

Pre-translation feedback
If a corpus in the source language is available, documents can be retrieved from it prior to translation in order to add a set of new terms to the source query (pre-translation feedback).
Pre-translation feedback may contribute to improving precision. This is because PRF is basically done using the entire query, not each source term separately. That is, synonyms or related terms corresponding to the "correct" meaning of each source term within the context of the query are expected to be added automatically through the PRF process.

Post-translation feedback
After translation, standard PRF can be applied using the target document collection (post-translation feedback).
Post-translation feedback can be considered a device for improving recall, as shown in standard experiments on monolingual retrieval.
In CLIR, two well-known methods for weighting terms in the top-ranked documents are often used to select "good" terms: the Rocchio method and the probabilistic method.

Bi-directional translation
Boughanem et al. (2002) explored a "bi-directional translation" technique in which a form of backward translation is used for ranking translation candidates.
Suppose that we need to translate English query terms into French ones. In bi-directional translation, a set of French equivalents for an English term is first found in an English-French dictionary. Next, using a French-English dictionary, each French equivalent is translated back into a set of English terms. Basically, if that set includes the original source term, the French translation equivalent is chosen as a preferred translation.

Cross-language retrieval evaluation
Information retrieval evaluation: given a search topic, a document collection, and a set of manually judged relevant documents, assess the results returned by the system.
TREC CLIR (96-02): English to other languages
CLEF (00-): between European languages
NTCIR (99-): Asian languages and English

[Figure: cross-language retrieval evaluation model]

Applications of CLIR

2.1 Cross-language search engines
April 25, 2006: European search engine "Quaero". The French President announced 90 million euros of support.
May 16, 2007: Google Translate. Provides CLIR for 12 languages. Goal: take "all the Web & translate into multiple langs."
May 5, 2021: Yahoo Babel Fish. Provides CLIR between 12 languages. It was AltaVista's project, later bought by Yahoo.

Google Translate

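translate.google

Yahoo Babel Fish
babelfish.yahoo

Question
Compare the cross-language search engines of Google and Yahoo!, and analyze the advantages and disadvantages of each.
Google: one step (translate & search); the search results are translated back into the source language.
  Advantages: fast; makes the search results easy for users to understand.
  Disadvantage: users cannot correct the translation.
Yahoo!: two steps (translate + search); the search results are not translated.
  Advantage: the intermediate step lets users correct the translation.
  Disadvantages: more complex; users cannot read the search results.

2.2 Cross-language retrieval in digital libraries
WorldWideScience, a global science cross-language search platform, was released on June 11, 2021 at the summer meeting of ICSTI (International Council for Scientific and Technical Information) in Helsinki, the capital of Finland.

WorldWideScience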

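/multilingual
The member organizations of the alliance are all professional library and information institutions or leading bodies in scientific and technical information, such as the U.S. Department of Energy Office of Scientific and Technical Information (OSTI), the Library of Congress, the British Library, the Canada Institute for Scientific and Technical Information, the Korea Institute of Science and Technology Information, and the Institute of Scientific and Technical Information of China.
The platform can also automatically perform cross-language, cross-database searches.

2.3 Cross-language patent retrieval
According to the World Intellectual Property Organization (WIPO), patent documents contain 90% to 95% of the world's research results, while other technical documents (papers, journals, etc.) contain only 5% to 10% of R&D results.
Making good use of patent search in research work can shorten R&D time by 60% and reduce R&D costs by 40%.

PATENTSCOPE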

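/patentscope/search/en/clir/clir.jsp
In May 2021, WIPO released a beta version of PATENTSCOPE, its cross-language patent retrieval system, marking the move of cross-language information retrieval in patent search from the laboratory to practical use.
The system supports cross-language patent retrieval only among five languages: English, French, German, Japanese, and Spanish.

2.4 Cross-language image retrieval
At present, the representative cross-language image retrieval system that has reached practical use is PanImages (/), a cross-language image search engine developed by the University of Washington.
PanImages provides translation for more than 100 languages.
The user enters a keyword and selects the language it belongs to. Machine translation converts the keyword into the languages of other countries, and the translated keywords are then searched in Google image search and Flickr image search.

PanImages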

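/

2.5 Applications in e-commerce
CINDOR is currently a relatively successful commercial cross-language information retrieval system.
The CINDOR system has three core technologies: Conceptual Interlingua, Language Analysis, and Search Management.
CINDOR currently supports English, French, and Spanish; Simplified Chinese, Russian, and Arabic are under development.

CINDOR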

cindorsearch/home.html

Challenges for "interaction"
CLIR poses some unique challenges for interaction.
How do you help users select translated query terms?
How do you help users sel
