Introduction to Cross-Language Information Retrieval
Hsin-Hsi Chen (陳信希)
Department of Computer Science and Information Engineering
National Taiwan University

Outline
- Multilingual Environments
- What is Cross-Language Information Retrieval?
- Major Problems in CLIR
- Major Approaches in CLIR
- Case Study: CLIR in NPDM
- Summary

Multilingual Collections
- There are 6,703 languages listed in the Ethnologue.
- Digital libraries: the OCLC Online Computer Library Center serves more than 17,000 libraries in 52 countries and contains over 30 million bibliographic records, with over 500 million ownership records attached, in more than 370 languages.
- World Wide Web: around 40% of Internet users do not speak English; however, 80% of Web sites are still in English.

Speakers by Language in the Real World (/faq.htm)
[Chart: Chinese, English, Hindi, Spanish, Portuguese, Bengali, Russian, Arabic, Japanese]

Speakers by Language on the Internet
[Chart (Statistics from Euro-Marketing Associates, 1998): Spanish, German, Japanese, French, Chinese, Dutch, Portuguese, Italian, Swedish, Korean]
[Chart (Statistics from Euro-Marketing Associates, 1999)]
- Share of Chinese speakers (6.1%) < share of French speakers (8.8%) (1998)

Internet Content (Network Wizards Jan 99 Internet Domain Survey)
English 33,878; Japanese 1,687; German 1,684; French 654; Dutch 546; Finnish 473; Spanish 458; Chinese 432; Swedish 546
- 40% of Internet users do not understand English, but 80% of Internet content is in English.

What is Cross-Language Information Retrieval?
- Definition: select information in one language based on queries in another.
- Terminologies
  - Cross-Language Information Retrieval (ACM SIGIR 96 Workshop on Cross-Linguistic Information Retrieval)
  - Translingual Information Retrieval (Defense Advanced Research Projects Agency, DARPA)

Generalization: Multi- & Cross-Lingual Information Access
[Figure]

MLIR Applications
- Multilingual information access in a multilingual country, organization, enterprise, etc.
- Cross-language information retrieval for users who read a second language (large passive vocabulary) but are not able to formulate good queries (small active vocabulary).
- Monolingual users may retrieve images by taking advantage of multilingual captions.
- Monolingual users may retrieve documents and have them translated (automatically or manually) into their language.

Why is Cross-Language Information Retrieval Important?
- More information workers with less time require fast access to global resources
- Global B2B interactions (virtual enterprises)
- Global B2C interactions (online trading, travelling)
- Time-critical information (translation comes too late)

History
- 1970: Salton runs retrieval experiments with a small English/German dictionary
- 1972: Pevzner shows for English and Russian that a controlled thesaurus can be used effectively for query term translation
- 1978: ISO Standard 5964 for developing multilingual thesauri (revised in 1985)
- 1990: Latent Semantic Indexing (LSI) applied to CLIR
- 1994: 1st PhD thesis on CLIR, by Khaled Radwan
- 1996: Similarity thesaurus applied to CLIR (ETH Zurich)
- 1996: Dictionary-based retrieval applied to CLIR (UMass & Xerox Grenoble)
- 1997: Generalized Vector Space Model (GVSM) applied to CLIR (CMU)
- 1997: CLIR (Cross-Language Information Retrieval) track starts within TREC
- 1998: NTCIR starts in Japan
- 1999: TIDES (Translingual Information Detection, Extraction, and Summarization) starts in the U.S.
- 2000: CLEF starts in Europe

An Architecture of Multilingual Information Access
[Figure]

Major Problems of CLIR
- Queries and documents are in different languages. => translation
- Words in a query may be ambiguous. => disambiguation
- Queries are usually short. => expansion
- Queries may have to be segmented. => segmentation
- A document may be written in several languages. => language identification

Enhancing Traditional Information Retrieval Systems
- Which part(s) should be modified for CLIR?
[Diagram: Documents and Queries are mapped to a Document Representation and a Query Representation, which feed the Comparison; the candidate modification points are labeled (1) to (4).]
- (1): text translation
- (2): vector translation
- (3): query translation
- (4): term vector translation
- (1) and (2), (3) and (4): interlingual form

What are the Problems?
- Ambiguous terms (e.g., performance)
- Multiword phrases may correspond to single-word phrases (e.g., South Africa => 南非, Südafrika)
- Coverage of the vocabulary
- There is not a one-to-one mapping between two languages
- Translating queries automatically (lack of syntax)
- Translating documents automatically (performance, …)
- Computing mixed result lists

Cross-Language Information Retrieval
[Figure]

Query Translation Based CLIR
English Query => Translation Device => Chinese Query => Monolingual Chinese Retrieval System => Retrieved Chinese Documents

Translating the 400 Million non-English Pages of the WWW
- ... would take 100,000 days (300 years) on one fast PC.
- Or, 1 month on 3,600 PCs.

Knowledge-Based Examples
- Subject Thesaurus: hierarchical and associative relations; a unique term assigned to each node.
- Concept List: term space partitioned into concept spaces.
- Term List: list of cross-language synonyms.
- Lexicon: machine-readable syntax and/or semantics.

Ontology-Based Approaches
- Exploit complex knowledge representations, e.g., EuroWordNet
- A Proposal for Conceptual Indexing using EuroWordNet

Dictionary-Based Approaches
- Exploit machine-readable dictionaries.
- Problems
  - translation ambiguity + target polysemy
  - coverage (unknown words, abbreviations, ...)

Dictionary-Based Approaches (Continued)
- Issue 1: selection strategy
  - Select all.
  - Select N randomly.
  - Select best N.
- Issue 2: which level
  - word
  - phrase

Selection Strategy: Select All
- Hull and Grefenstette 1996
- Take the concatenation of all term translations.
  E: politically motivated civil disturbances
  F: troubles civils a caractere politique
  trouble - turmoil, discord, trouble, unrest, disturbance, disorder
  civil - civil, civilian, courteous
  caractere - character, nature
  politique - political, diplomatic, politician, policy
- Original English (0.393) vs. automatic word-based transfer dictionary (0.235): 59.8% of the original-English score.
- Errors: multi-word expressions and ambiguity
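The select-all strategy can be sketched in a few lines of Python, reusing the glosses from the example above. This is an illustration only: the lemma table and the stopword set are assumptions introduced here to map the inflected query words onto the dictionary headwords, and they stand in for whatever morphological analysis the original experiments used.

```python
# French-to-English glosses copied from the example above.
GLOSSES = {
    "trouble": ["turmoil", "discord", "trouble", "unrest", "disturbance", "disorder"],
    "civil": ["civil", "civilian", "courteous"],
    "caractere": ["character", "nature"],
    "politique": ["political", "diplomatic", "politician", "policy"],
}

# Assumption: a tiny hand-written lemma table replaces real morphological analysis.
LEMMAS = {"troubles": "trouble", "civils": "civil"}

# Assumption: function words are dropped before dictionary lookup.
STOPWORDS = {"a"}


def translate_select_all(query, glosses, lemmas, stopwords):
    """Replace every content word with the concatenation of all its glosses."""
    translated = []
    for token in query.lower().split():
        if token in stopwords:
            continue
        headword = lemmas.get(token, token)
        # Words missing from the dictionary are passed through unchanged.
        translated.extend(glosses.get(headword, [headword]))
    return translated


english_query = translate_select_all(
    "troubles civils a caractere politique", GLOSSES, LEMMAS, STOPWORDS
)
print(" ".join(english_query))
```

The four French content words expand to fifteen English terms, including off-topic ones such as "courteous" and "politician", which is the ambiguity problem noted above.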
Selection Strategy: Select All (Continued)
- Davis 1997 (TREC5)
- Replace each English query term with all of its Spanish equivalent terms from the Collins bilingual dictionary.
- Monolingual (0.2895) vs. all-equivalent substitution (0.1422): 49.12%

Evaluation Method
- Average Precision (5-, 9-, 11-points)
- Model
  - Spanish Query => Mono IR Engine => TREC Spanish Corpus
  - English Query => Bilingual Dictionary => Spanish Equivalents => Mono IR Engine => TREC Spanish Corpus
  - English Query => POS + Bilingual Dictionary => Spanish Equivalents by POS => Mono IR Engine => TREC Spanish Corpus

Selection Strategy: Select N
- Simple word-by-word translation
- Each query term is replaced by the word or group of words given for the first sense of the term's definition.
- 50-60% drop in performance (average precision)
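The performance figures in these slides are average-precision scores. As a rough illustration of how such a score is obtained, the sketch below computes interpolated average precision over the 11 standard recall levels (0.0, 0.1, ..., 1.0). The function name, its arguments and the toy rankings are assumptions made for this example; the recall levels behind the 5- and 9-point variants are not given in the slides, so they can only be passed in explicitly.

```python
def interpolated_average_precision(ranked_relevance, total_relevant, recall_levels=None):
    """Average the interpolated precision over a set of recall levels.

    ranked_relevance: booleans for the ranked result list, best rank first.
    total_relevant: number of relevant documents in the collection (must be > 0).
    """
    if recall_levels is None:
        recall_levels = [i / 10 for i in range(11)]  # 0.0, 0.1, ..., 1.0

    # (recall, precision) after each rank position.
    points = []
    hits = 0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
        points.append((hits / total_relevant, hits / rank))

    # Interpolated precision at a level is the best precision at any recall >= level.
    total = 0.0
    for level in recall_levels:
        total += max((p for r, p in points if r >= level), default=0.0)
    return total / len(recall_levels)


# Toy rankings: the translated query pushes both relevant documents one rank down.
monolingual = interpolated_average_precision([True, False, True, False], total_relevant=2)
translated = interpolated_average_precision([False, True, False, True], total_relevant=2)
print(round(monolingual, 4), round(translated, 4))
print(f"drop: {100 * (1 - translated / monolingual):.1f}%")
```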
Selection Strategy: Select N (Continued)
- Word/phrase translation
  - Take at most three translations of each word, one from each of the first three senses.
  - Take the phrase translation if it appears in the dictionary.
- 30-50% worse than good translation
- Well-translated phrases can greatly improve effectiveness, but poorly translated phrases may negate the improvements.
- WBW (0.0244), phrasal (0.0148), good phrasal (0.0610)
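A minimal sketch of the select-N word/phrase strategy follows. The dictionary entries are invented for illustration and do not come from a real lexicon, and the way phrases are matched (longest match first, scanning left to right) is an assumption, since the slide only says that phrase translations are taken when the dictionary has them.

```python
# Hypothetical English-to-Spanish entries, one inner list per dictionary sense.
WORD_SENSES = {
    "bank": [["banco", "banca"], ["orilla", "ribera"], ["hilera"], ["reserva"]],
}

# Hypothetical phrase entries, keyed by the tuple of source words.
PHRASES = {("south", "africa"): "sudáfrica"}


def translate_select_n(terms, word_senses, phrase_table, n_senses=3, max_phrase_len=3):
    """Translate phrases whole; otherwise take one gloss from each of the first N senses."""
    translated = []
    i = 0
    while i < len(terms):
        # Try the longest candidate phrase starting at position i first.
        phrase_len = 0
        for length in range(min(max_phrase_len, len(terms) - i), 1, -1):
            if tuple(terms[i:i + length]) in phrase_table:
                phrase_len = length
                break
        if phrase_len:
            translated.append(phrase_table[tuple(terms[i:i + phrase_len])])
            i += phrase_len
            continue
        # Unknown words are passed through unchanged.
        senses = word_senses.get(terms[i], [[terms[i]]])
        translated.extend(sense[0] for sense in senses[:n_senses])
        i += 1
    return translated


query = ["south", "africa", "bank"]
print(translate_select_n(query, WORD_SENSES, PHRASES))               # three senses
print(translate_select_n(query, WORD_SENSES, PHRASES, n_senses=1))   # first sense only
```

Setting `n_senses=1` and passing an empty phrase table reduces this to simple first-sense, word-by-word translation.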
- Relative to WBW: phrasal -39.3%, good phrasal +150.3%

Selection Strategy: Select Best N
- Hayashi, Kikui and Susaki 1997
  - Search for a dictionary entry corresponding to the longest sequence of words from left to right.
  - Choose the most frequently used word (or phrase) in a text corpus collected from the WWW.
  - No results reported for this query translation approach.
- Davis 1997 (TREC5)
  - POS disambiguation
  - Monolingual (0.2895) vs. all-equivalent substitution (0.1422) vs. POS disambiguation (0.1949): near 67.3%

Corpus-Based Approaches
- Categorization
  - Term-Aligned
  - Sentence-Aligned
  - Document-Aligned (Parallel, Comparable)
  - Unaligned
- Usage
  - Setup Thesaurus
  - Vector Mapping

Term-Aligned Corpora
- Fine-grained alignment in parallel corpora
- Oard 1996
- Term alignment is a challenging problem.
[Diagram: Parallel Bilingual Corpus => Co-occurrence Statistics => Translation Tables => Machine Translation System; English Query => Spanish Query]

Sentence-Aligned Corpora
- Davis & Dunning 1996 (TREC4)
- High-frequency terms

Brief Summary
- Dictionary-based methods
  - Specialized vocabulary not in the dictionaries will not be translated.
  - Ambiguities will add extraneous terms to the query.
- Parallel/comparable corpora-based methods
  - Parallel corpora are not always available.
  - Available corpora tend to be relatively small or to cover only a small number of subjects.
  - Performance depends on how well the corpora are aligned.

Brief Summary (Continued)
- Dictionaries are very useful.
  - Achieve 50% on their own
- Parallel corpora have limitations.
  - Domain shifts
  - Term alignment accuracy
- Dictionaries and corpora are complementary.
  - Dictionaries provide broad and shallow coverage.
  - Corpora provide narrow (domain-specific) but deep (more terminology) coverage of the language.

Hybrid Methods
- What knowledge can be employed?
  - lexical knowledge
  - corpus knowledge
  - ...

Hybrid Methods (Continued)
- Query Expansion
- Issue 1: context
  - Pseudo relevance feedback (local feedback): a query is modified by the addition of terms found in the top retrieved documents.
  - Local context analysis: queries are expanded by the addition of the top-ranked concepts from the top passages.

Hybrid Methods (Continued)
- Issue 2: when
  - before query translation
  - after query translation

Hybrid Methods (Continued)
- Ballesteros & Croft 1997
[Diagram labels: Original Spanish TREC Queries; human translation; English (BASE) Queries; automatic dictionary translation; Spanish Queries; query expansion; English Queries; INQUERY]

Hybrid Methods (Continued)
- Performance Evaluation
pre-translation
  MRD (0.0823) vs. LF (0.1099) vs. LCA10 (0.1139): +33.5% (LF), +38.5% (LCA10)
post-translation
  MRD (0.0823) vs. LF (0.0916) vs. LCA20 (0.1022): +11.3% (LF), +24.1% (LCA20)
combined pre- and post-translation
  MRD (0.0823) vs. LF (0.1242) vs. LCA20 (0.1358)
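The LF columns in this evaluation refer to local feedback, defined earlier as adding terms found in the top retrieved documents to the query. The sketch below shows one way that step could look. It is not the INQUERY implementation: the raw term-frequency ranking, the toy documents and the parameter names are all assumptions made to keep the example self-contained.

```python
from collections import Counter


def match_score(query, doc):
    """Stand-in ranking function: total occurrences of query terms in the document."""
    return sum(doc.count(term) for term in set(query))


def local_feedback(query, docs, top_docs=2, new_terms=2):
    """Expand a query with the most frequent new terms from the top-ranked documents."""
    ranked = sorted(docs, key=lambda doc: match_score(query, doc), reverse=True)
    counts = Counter(
        term for doc in ranked[:top_docs] for term in doc if term not in query
    )
    return list(query) + [term for term, _ in counts.most_common(new_terms)]


# Toy tokenized collection (hypothetical).
DOCS = [
    ["museum", "painting", "sung", "dynasty", "painting", "landscape"],
    ["painting", "landscape", "scroll", "painting", "museum"],
    ["football", "match", "score"],
]

expanded = local_feedback(["museum"], DOCS)
print(expanded)
```

Run against a source-language collection before the dictionary lookup, this corresponds to pre-translation expansion; run against the target-language collection on the translated query, it corresponds to post-translation expansion.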
- Combined gains over MRD: +51.0% (LF), +65.0% (LCA20); still 32% below a monolingual baseline

Cross-Language Evaluation Forum
- A collaboration between the DELOS Network of Excellence for Digital Libraries and the US National Institute of Standards and Technology (NIST)
- Extension of the CLIR track at TREC (1997-1999)

Main Goals
- Promote research in cross-language system development for European languages by providing an appropriate infrastructure for:
  - CLIR system evaluation, testing and tuning
  - Comparison and discussion of results

CLEF 2000 Task Description
- Four evaluation tracks in CLEF 2000
  - multilingual information retrieval
  - bilingual information retrieval
  - monolingual (non-English) information retrieval
  - domain-specific IR

Case Study: CLIR for NPDM

3M in Digital Libraries/Museums
- Multi-media: selecting suitable media to represent contents
- Multi-linguality: decreasing the language barriers
- Multi-culture: integrating multiple cultures

NPDM Project
- Palace Museum, Taipei, one of the famous museums in the world
- NSC supports a pioneer study of a digital museum project, NPDM, starting from 2000
  - Enamels from the Ming and Ch'ing Dynasties
  - Famous Album Leaves of the Sung Dynasty
  - Illustrations in Buddhist Scriptures with Relative Drawings

Design Issues
- Standardization: a standard metadata protocol is indispensable for the interchange of resources with other museums.
- Multimedia: a suitable presentation scheme is required.
- Internationalization
  - to share the valuable resources of NPDM with users of different languages
  - to utilize knowledge presented in a foreign language

Translingual Issue
- CLIR
  - to allow users to issue queries in one language to access documents in another language
  - the query language is English and the document language is Chinese
- Two common approaches
  - Query translation
  - Document translation

Resources in NPDM pilot
- An enamel, a calligraphy, a painting, or an illustration
- MICI-DC: Metadata Interchange for Chinese Information
- Accessible fields to users
  - Short descriptions vs. full texts
  - Bilingual versions vs. Chinese only
- Fields for maintenance only

Search Modes
- Free search: users describe their information need using natural languages (Chinese or English)
- Specific topic search: users fill in specific fields denoting authors, titles, dates, and so on

Example
- Information need: retrieve “Travelers Among Mountains and Streams, Fan K'uan” (“范寬谿山行旅圖”)
- Possible queries
  - Author: Fan Kuan; Kuan, Fan
  - Time: Sung Dynasty
  - Title: Mountains and Streams; Travel among mountains; Travel among streams; Mountain and stream painting
  - Free search: landscape painting; travelers, huge mountain, Nature; scenery; Shensi province

EC IR in NPDM
[Diagram labels: English Names; Chinese Names; Machine Transliteration; English Titles; Chinese Titles; Document Translation; Name Search; Title Search; English Query; Query Disambiguation; Specific Bilingual Dictionary; Generic Bilingual Dictionary; Chinese Query; Query Translation; Chinese IR System; NPDM Collection; Results]

Specific Topic Search
- Proper names are important query terms
  - Creators such as “林逋” (Lin P'u), “李建中” (Li Chien-chung), “歐陽脩” (Ou-yang Hsiu), etc.
  - Emperors such as “康熙” (K'ang-hsi), “乾隆” (Ch'ien-lung), “徽宗” (Hui-tsung), etc.
  - Dynasties such as “宋” (Sung), “明” (Ming), “清” (Ch'ing), etc.

Name Transliteration
- Chinese and English use entirely different writing systems
- Wade-Giles (WG) and Pinyin are two well-known systems used in libraries to romanize Chinese
- Backward transliteration: transliterate target-language terms back to source-language ones
  - Chen, Huang, and Tsai (COLING, 1998)
  - Lin and Chen (ROCLING, 2000)

Name Mapping Table
- Divide a name into a sequence of Chinese characters, and transform each character into phonemes
- Look up the phoneme-to-WG (Pinyin) mapping table, and derive a canonical form for the name
- Example: “林逋” => “ㄌㄧㄣ ㄆㄨ” => “Lin P'u” (WG)
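The two steps of the name mapping table can be sketched as follows. The tables below are tiny illustrative subsets written for this example, not the project's actual mapping tables: the phonemes for “林逋” are the ones shown on the slide, while the entries for “朱熹” and the Pinyin column are added here as assumptions. The sketch also assumes a single-character surname, so two-character surnames such as 歐陽 would need extra handling.

```python
# Step 1 lexicon (hypothetical subset): Chinese character => Zhuyin phonemes.
PHONEMES = {"林": "ㄌㄧㄣ", "逋": "ㄆㄨ", "朱": "ㄓㄨ", "熹": "ㄒㄧ"}

# Step 2 tables (hypothetical subsets): phoneme => romanization.
WG_INITIALS = {"ㄌ": "l", "ㄆ": "p'", "ㄓ": "ch", "ㄒ": "hs"}
PINYIN_INITIALS = {"ㄌ": "l", "ㄆ": "p", "ㄓ": "zh", "ㄒ": "x"}
FINALS = {"ㄧㄣ": "in", "ㄨ": "u", "ㄧ": "i"}  # identical in both schemes for these entries


def romanize(syllable, initials):
    """Split a Zhuyin syllable into initial + final and look both up."""
    initial, final = syllable[0], syllable[1:]
    return initials[initial] + FINALS[final]


def canonical_name(name, initials):
    """Derive the 'LastName FirstName1-FirstName2' canonical form of a Chinese name."""
    syllables = [romanize(PHONEMES[char], initials) for char in name]
    surname, given = syllables[0], "-".join(syllables[1:])
    return f"{surname.capitalize()} {given.capitalize()}"


print(canonical_name("林逋", WG_INITIALS))
print(canonical_name("朱熹", WG_INITIALS))
print(canonical_name("朱熹", PINYIN_INITIALS))
```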

Name Similarity
- Extract the named entity from the query
- Select the most similar named entity from the name mapping table
- Naming sequence/scheme
  - LastName FirstName1, e.g., Chu Hsi (朱熹)
  - FirstName1 LastName, e.g., Hsi Chu (朱熹)
  - LastName FirstName1-FirstName2, e.g., Hsu Tao-ning (許道寧)
  - FirstName1-FirstName2 LastName, e.g., Tao-ning Hsu (許道寧)
  - Any order, e.g., Tao Ning Hsu (許道寧)
  - Any transliteration, e.g., Ju Shi (朱熹)

Title
- “谿山行旅圖”: “Travelers among Mountains and Streams”
- "travelers", "mountains", and "streams" are basic components
- Users ca