Introduction to Cross-Language Information Retrieval
Hsin-Hsi Chen (陳信希)
Department of Computer Science and Information Engineering
National Taiwan University

Outline
  Multilingual Environments
  What is Cross-Language Information Retrieval?
  Major Problems in CLIR
  Major Approaches in CLIR
  Case Study: CLIR in NPDM
  Summary

Multilingual Collections
  There are 6,703 languages listed in the Ethnologue.
  Digital libraries
    OCLC (Online Computer Library Center) serves more than 17,000 libraries in 52 countries and contains over 30 million bibliographic records, with over 500 million ownership records attached, in more than 370 languages.
  World Wide Web
    Around 40% of Internet users do not speak English; however, 80% of Web sites are still in English.

Language Populations in the Real World (/faq.htm)
  [Chart] Chinese, English, Hindi, Spanish, Portuguese, Bengali, Russian, Arabic, Japanese

(Statistics from Euro-Marketing Associates, 1998)
  [Chart] Spanish, German, Japanese, French, Chinese, Dutch, Portuguese, Italian, Swedish, Korean

(Statistics from Euro-Marketing Associates, 1999)
  Share of Chinese speakers (6.1%) < share of French speakers (8.8%) (1998)

Language Populations on the Internet

Internet Content (Network Wizards Jan 99 Internet Domain Survey)
  [Chart] Languages: English, Japanese, German, French, Dutch, Finnish, Spanish, Chinese, Swedish
  Chart values, in extraction order: 33,878; 1,687; 1,684; 654; 546; 473; 458; 432; 546
  40% of Internet users do not understand English, but 80% of Internet content is in English.

(Source:)

What is Cross-Language Information Retrieval?
  Definition: Select information in one language based on queries in another.
  Terminologies
    Cross-Language Information Retrieval (ACM SIGIR 96 Workshop on Cross-Linguistic Information Retrieval)
    Translingual Information Retrieval (Defense Advanced Research Projects Agency, DARPA)

Generalization: Multi- & Cross-Lingual Information Access

MLIR Applications
  Multilingual information access in a multilingual country, organization, enterprise, etc.
  Cross-language information retrieval for users who read a second language (large passive vocabulary) but are not able to formulate good queries (small active vocabulary).
  Monolingual users may retrieve images by taking advantage of multilingual captions.
  Monolingual users may retrieve documents and have them translated (automatically or manually) into their language.

Why is Cross-Language Information Retrieval Important?
  More information workers with less time require fast access to global resources
  global B2B interactions (virtual enterprises)
  global B2C interactions (online trading, travelling)
  time-critical information (translation comes too late)

History
  1970  Salton runs retrieval experiments with a small English/German dictionary
  1972  Pevzner shows for English and Russian that a controlled thesaurus can be used effectively for query term translation
  1978  ISO Standard 5964 for developing multilingual thesauri (revised in 1985)
  1990  Latent Semantic Indexing (LSI) applied to CLIR

History (Continued)
  1994  1st PhD thesis on CLIR by Khaled Radwan
  1996  Similarity thesaurus applied to CLIR (ETH Zurich)
  1996  Dictionary-based retrieval applied to CLIR (UMass & Xerox Grenoble)
  1997  Generalized Vector Space Model (GVSM) applied to CLIR (CMU)

History (Continued)
  1997  CLIR (Cross-Language Information Retrieval) track starts within TREC
  1998  NTCIR starts in Japan
  1999  TIDES (Translingual Information Detection, Extraction, and Summarization) starts in U.S.
  2000  CLEF starts in Europe

An Architecture of Multilingual Information Access

Major Problems of CLIR
  Queries and documents are in different languages. -> translation
  Words in a query may be ambiguous. -> disambiguation
  Queries are usually short. -> expansion

Major Problems of CLIR (Continued)
  Queries may have to be segmented. -> segmentation
  A document may be written in several languages. -> language identification

Enhancing Traditional Information Retrieval Systems
  Which part(s) should be modified for CLIR?
  [Diagram: Documents, Queries, Document Representation, Query Representation, Comparison; modification points labeled (1), (2), (3), (4)]

Enhancing Traditional Information Retrieval Systems (Continued)
  (1): text translation
  (2): vector translation
  (3): query translation
  (4): term vector translation
  (1) and (2), (3) and (4): interlingual form

What are the Problems?
  Ambiguous terms (e.g., performance)
  Multiword phrases may correspond to single-word phrases (e.g., South Africa => 南非, Südafrika)
  Coverage of the vocabulary
  There is not a one-to-one mapping between two languages
  Translating queries automatically (lack of syntax)
  Translating documents automatically (performance, …)
  Computing mixed result lists

Cross-Language Information Retrieval

Query Translation Based CLIR
  English Query -> Translation Device -> Chinese Query -> Monolingual Chinese Retrieval System -> Retrieved Chinese Documents

Translating the 400 Million non-English Pages of the WWW
  ... would take 100,000 days (300 years) on one fast PC.
  Or, 1 month on 3,600 PCs.

Knowledge-Based Examples
  Subject Thesaurus
    Hierarchical and associative relations.
    Unique term assigned to each node.
  Concept List
    Term space partitioned into concept spaces.
  Term List
    List of cross-language synonyms.
  Lexicon
    Machine-readable syntax and/or semantics.

Ontology-Based Approaches
  Exploit complex knowledge representations, e.g., EuroWordNet
  A Proposal for Conceptual Indexing using EuroWordNet

Dictionary-Based Approaches
  Exploit machine-readable dictionaries.
  Problems
    translation ambiguity + target polysemy
    coverage (unknown words, abbreviations, ...)

Dictionary-Based Approaches (Continued)
  Issue 1: selection strategy
    Select all.
    Select N randomly.
    Select best N.
  Issue 2: which level
    word
    phrase

Selection Strategy: Select All
  Hull and Grefenstette 1996
  Take the concatenation of all term translations.
    E: politically motivated civil disturbances
    F: troubles civils a caractere politique
    trouble - turmoil, discord, trouble, unrest, disturbance, disorder
    civil - civil, civilian, courteous
    caractere - character, nature
    politique - political, diplomatic, politician, policy
  Original English (0.393) vs. automatic word-based transfer dictionary (0.235): 59.8% of the original.
  Errors: multi-word expressions and ambiguity
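The select-all strategy can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Hull and Grefenstette's implementation: the dictionary holds just the four entries listed on the slide, and the names `FR_EN`, `normalize`, and `translate_select_all`, together with the crude plural stripping, are invented for the example.

```python
# Toy French-to-English dictionary: only the four entries shown on the slide.
FR_EN = {
    "trouble": ["turmoil", "discord", "trouble", "unrest", "disturbance", "disorder"],
    "civil": ["civil", "civilian", "courteous"],
    "caractere": ["character", "nature"],
    "politique": ["political", "diplomatic", "politician", "policy"],
}


def normalize(word):
    """Lowercase and strip a trailing plural 's' (a crude stand-in for real stemming)."""
    word = word.lower()
    if word not in FR_EN and word.endswith("s"):
        word = word[:-1]
    return word


def translate_select_all(query):
    """Replace every source term by ALL of its dictionary translations, concatenated."""
    target_terms = []
    for word in query.split():
        # Words missing from the dictionary contribute nothing (coverage problem).
        target_terms.extend(FR_EN.get(normalize(word), []))
    return target_terms


english_query = translate_select_all("troubles civils a caractere politique")
print(english_query)
```

On the slide's French query this produces 15 English terms. Unrelated senses such as "courteous" and "politician" end up in the query (the ambiguity error), and "a" is dropped because the toy dictionary has no entry for it (the coverage problem).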

Selection Strategy: Select All (Continued)
  Davis 1997 (TREC5)
  Replace each English query term with all of its Spanish equivalent terms from the Collins bilingual dictionary.
  Monolingual (0.2895) vs. all-equivalent substitution (0.1422): 49.12% of monolingual

Evaluation Method
  Average Precision (5-, 9-, 11-points)
  Model
  [Diagram, three retrieval runs over the TREC Spanish Corpus:]
    Spanish Query -> Mono IR Engine -> TREC Spanish Corpus
    English Query -> Bilingual Dictionary -> Spanish Equivalents -> Mono IR Engine -> TREC Spanish Corpus
    English Query -> POS -> Bilingual Dictionary -> Spanish Equivalents by POS -> Mono IR Engine -> TREC Spanish Corpus

Selection Strategy: Select N
  Simple word-by-word translation
  Each query term is replaced by the word or group of words given for the first sense of the term's definition.
  50-60% drop in performance (average precision)
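A minimal sketch of first-sense, word-by-word translation follows. It rests on several assumptions: the toy entries are the four French-to-English entries quoted in the Hull and Grefenstette example, each listed translation is treated as a separate sense in dictionary order, and the names `SENSES` and `translate_first_n` are hypothetical. Passing unknown terms through unchanged is also a choice made for the example, not something the slides prescribe.

```python
# Toy sense-ordered dictionary (assumption: list order stands in for sense order).
SENSES = {
    "trouble": ["turmoil", "discord", "trouble", "unrest", "disturbance", "disorder"],
    "civil": ["civil", "civilian", "courteous"],
    "caractere": ["character", "nature"],
    "politique": ["political", "diplomatic", "politician", "policy"],
}


def translate_first_n(query_terms, n=1):
    """Keep only the translations of the first n senses of each query term."""
    translated = []
    for term in query_terms:
        senses = SENSES.get(term)
        if senses is None:
            # Unknown term: pass it through untranslated (example-only choice).
            translated.append(term)
        else:
            translated.extend(senses[:n])
    return translated


first_sense = translate_first_n(["trouble", "civil", "caractere", "politique"])
print(first_sense)
```

With `n=1` each term maps to a single word, which keeps the query short but discards every translation outside the first sense; raising `n` trades that loss against the extra ambiguity seen with select-all.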

Selection Strategy: Select N (Continued)
  word/phrase translation
  Take at most three translations of each word, one from each of the first three senses.
  Take the phrase translation if it appears in the dictionary.
  30-50% worse than good translation
  Well-translated phrases can greatly improve effectiveness, but poorly translated phrases may negate the improvements.
  WBW (0.0244) vs. phrasal (0.0148, -39.3%) vs. good phrasal (0.0610, +150.3%)

Selection Strategy: Select Best N
  Hayashi, Kikui and Susaki 1997
    search for a dictionary entry corresponding to the longest sequence of words from left to right
    choose the most frequently used word (or phrase) in a text corpus collected from the WWW
    no results reported for this query translation approach
  Davis 1997 (TREC5)
    POS disambiguation
    Monolingual (0.2895) vs. all-equivalent substitution (0.1422) vs. POS disambiguation (0.1949): near 67.3% of monolingual

Corpus-Based Approaches
  Categorization
    Term-Aligned
    Sentence-Aligned
    Document-Aligned (Parallel, Comparable)
    Unaligned
  Usage
    Setup Thesaurus
    Vector Mapping

Term-Aligned Corpora
  Fine-grained alignment in parallel corpora
  Oard 1996
  Term alignment is a challenging problem.
  [Diagram labels: Parallel Bilingual Corpus, Cooccurrence Statistics, Translation Tables, Machine Translation System, English Query, Spanish Query]

Sentence-Aligned Corpora
  Davis & Dunning 1996 (TREC4)
  High-frequency Terms

Brief Summary
  dictionary-based methods
    Specialized vocabulary not in the dictionaries will not be translated.
    Ambiguities will add extraneous terms to the query.
  parallel/comparable corpora-based methods
    Parallel corpora are not always available.
    Available corpora tend to be relatively small or to cover only a small number of subjects.
    Performance is dependent on how well the corpora are aligned.

Brief Summary (Continued)
  Dictionaries are very useful.
    Achieve 50% of monolingual performance on their own
  Parallel corpora have limitations.
    Domain shifts
    Term alignment accuracy
  Dictionaries and corpora are complementary.
    Dictionaries provide broad and shallow coverage.
    Corpora provide narrow (domain-specific) but deep (more terminology) coverage of the language.

Hybrid Methods
  What knowledge can be employed?
    lexical knowledge
    corpus knowledge
    ...

Hybrid Methods (Continued)
  Query Expansion
  Issue 1: context
    pseudo relevance feedback (local feedback):
      A query is modified by the addition of terms found in the top retrieved documents.
    local context analysis:
      Queries are expanded by the addition of the top-ranked concepts from the top passages.

Hybrid Methods (Continued)
  Issue 2: when
    before query translation
    after query translation

Hybrid Methods (Continued)
  Ballesteros & Croft 1997
  [Diagram labels, in extraction order: Original Spanish TREC Queries; human translation; English (BASE) Queries; Spanish Queries; automatic dictionary translation; English Queries; query expansion; Spanish Queries; query expansion; Spanish Queries; automatic dictionary translation; INQUERY]

Hybrid Methods (Continued)
  Performance Evaluation
    pre-translation
      MRD (0.0823) vs. LF (0.1099, +33.5%) vs. LCA10 (0.1139, +38.5%)
    post-translation
      MRD (0.0823) vs. LF (0.0916, +11.3%) vs. LCA20 (0.1022, +24.1%)
    combined pre- and post-translation
      MRD (0.0823) vs. LF (0.1242, +51.0%) vs. LCA20 (0.1358, +65.0%)
      32% below a monolingual baseline

Cross-Language Evaluation Forum
  A collaboration between the DELOS Network of Excellence for Digital Libraries and the US National Institute of Standards and Technology (NIST)
  Extension of the CLIR track at TREC (1997-1999)

Main Goals
  Promote research in cross-language system development for European languages by providing an appropriate infrastructure for:
    CLIR system evaluation, testing and tuning
    Comparison and discussion of results

CLEF 2000 Task Description
  Four evaluation tracks in CLEF 2000
    multilingual information retrieval
    bilingual information retrieval
    monolingual (non-English) information retrieval
    domain-specific IR

Case Study: CLIR for NPDM

3M in Digital Libraries/Museums
  Multi-media
    Selecting suitable media to represent contents
  Multi-linguality
    Decreasing the language barriers
  Multi-culture
    Integrating multiple cultures

NPDM Project
  Palace Museum, Taipei, one of the famous museums in the world
  NSC supports a pioneer study of a digital museum project, NPDM, starting from 2000
    Enamels from the Ming and Ch’ing Dynasties
    Famous Album Leaves of the Sung Dynasty
    Illustrations in Buddhist Scriptures with Relative Drawings

Design Issues
  Standardization
    A standard metadata protocol is indispensable for the interchange of resources with other museums.
  Multimedia
    A suitable presentation scheme is required.
  Internationalization
    to share the valuable resources of NPDM with users of different languages
    to utilize knowledge presented in a foreign language

Translingual Issue
  CLIR
    to allow users to issue queries in one language to access documents in another language
    the query language is English and the document language is Chinese
  Two common approaches
    Query translation
    Document translation

Resources in NPDM pilot
  an enamel, a calligraphy, a painting, or an illustration
  MICI-DC
    Metadata Interchange for Chinese Information
  Accessible fields to users
    Short descriptions vs. full texts
    Bilingual versions vs. Chinese only
  Fields for maintenance only

Search Modes
  Free search
    users describe their information need using natural languages (Chinese or English)
  Specific topic search
    users fill in specific fields denoting authors, titles, dates, and so on
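Free-search queries tend to be short, which is where the pseudo relevance feedback (local feedback) expansion described under Hybrid Methods applies. The sketch below is a simplified illustration and not the Ballesteros and Croft setup: it ranks candidate terms by raw frequency in the top documents, and the function name `expand_query`, its parameters, and the toy documents are all assumptions made for the example.

```python
from collections import Counter


def expand_query(query_terms, ranked_docs, top_docs=2, new_terms=2):
    """Add the most frequent non-query terms from the top-ranked documents.

    ranked_docs is a list of tokenized documents, best match first,
    as returned by some initial retrieval run (not implemented here).
    """
    counts = Counter()
    for doc in ranked_docs[:top_docs]:
        counts.update(term for term in doc if term not in query_terms)
    # Sort by descending frequency, then alphabetically, so ties are deterministic.
    ranked_terms = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return list(query_terms) + [term for term, _ in ranked_terms[:new_terms]]


# Hypothetical initial ranking for the query "civil unrest".
ranked = [
    ["civil", "unrest", "protest", "police"],
    ["protest", "civil", "riot"],
    ["football", "match"],
]
expanded = expand_query(["civil", "unrest"], ranked)
print(expanded)
```

The same function can be run on the source-language query before translation, on the translated query afterwards, or both, which corresponds to the pre-translation, post-translation, and combined settings compared in the performance figures.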

Example
  Information need
    Retrieve “Travelers Among Mountains and Streams, Fan K’uan” (“范寬谿山行旅圖”)
  Possible queries
    Author: Fan Kuan; Kuan, Fan
    Time: Sung Dynasty
    Title: Mountains and Streams; Travel among mountains; Travel among streams; Mountain and stream painting
    Free search: landscape painting; travelers, huge mountain, Nature; scenery; Shensi province

ECIR in NPDM
  [Diagram labels: English Names, Chinese Names, Machine Transliteration; English Titles, Chinese Titles, Document Translation; Name Search, Title Search; English Query, Query Disambiguation, Specific Bilingual Dictionary, Generic Bilingual Dictionary, Chinese Query, Query Translation; Chinese IR System, NPDM Collection, Results]

Specific Topic Search
  proper names are important query terms
    Creators such as “林逋” (Lin P’u), “李建中” (Li Chien-chung), “歐陽脩” (Ou-yang Hsiu), etc.
    Emperors such as “康熙” (K’ang-hsi), “乾隆” (Ch’ien-lung), “徽宗” (Hui-tsung), etc.
    Dynasties such as “宋” (Sung), “明” (Ming), “清” (Ch’ing), etc.

Name Transliteration
  The writing systems of Chinese and English are totally different
  Wade-Giles (WG) and Pinyin are two famous systems to romanize Chinese in libraries
  backward transliteration
    Transliterate target language terms back to source language ones
    Chen, Huang, and Tsai (COLING, 1998)
    Lin and Chen (ROCLING, 2000)

Name Mapping Table
  Divide a name into a sequence of Chinese characters, and transform each character into phonemes
  Look up the phoneme-to-WG (Pinyin) mapping table, and derive a canonical form for the name
  Example
    “林逋” -> “ㄌㄧㄣ ㄆㄨ” -> “Lin P’u” (WG)
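The two lookup steps can be sketched as follows. The tables are tiny, hand-made stand-ins covering only the names 林逋 and 朱熹, with the phonemes for 林逋 taken from the example above; tone marks are omitted, and the helper name `canonical_name` and the single-character family name rule are assumptions for illustration (a two-character surname such as 歐陽 would need extra handling).

```python
# Step 1 table: Chinese character -> phonemes (Zhuyin, tones omitted).
CHAR_TO_PHONEMES = {"林": "ㄌㄧㄣ", "逋": "ㄆㄨ", "朱": "ㄓㄨ", "熹": "ㄒㄧ"}

# Step 2 table: phonemes -> romanized syllable under each scheme.
PHONEMES_TO_ROMAN = {
    "ㄌㄧㄣ": {"WG": "lin", "Pinyin": "lin"},
    "ㄆㄨ": {"WG": "p'u", "Pinyin": "pu"},
    "ㄓㄨ": {"WG": "chu", "Pinyin": "zhu"},
    "ㄒㄧ": {"WG": "hsi", "Pinyin": "xi"},
}


def canonical_name(name, scheme="WG"):
    """Romanize a Chinese personal name as 'Family Given-given'."""
    syllables = [PHONEMES_TO_ROMAN[CHAR_TO_PHONEMES[ch]][scheme] for ch in name]
    # Assumption: the first character is the family name.
    family, given = syllables[0], syllables[1:]
    parts = [family.capitalize()]
    if given:
        parts.append("-".join(given).capitalize())
    return " ".join(parts)


print(canonical_name("林逋"))            # Wade-Giles form
print(canonical_name("朱熹", "Pinyin"))  # Pinyin form
```

Going through phonemes rather than mapping characters straight to romanized strings means one character table can serve both romanization schemes.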

Name Similarity
  Extract named entity from the query
  Select the most similar named entity from the name mapping table
  Naming sequence/scheme
    LastName FirstName1, e.g., Chu Hsi (朱熹)
    FirstName1 LastName, e.g., Hsi Chu (朱熹)
    LastName FirstName1-FirstName2, e.g., Hsu Tao-ning (許道寧)
    FirstName1-FirstName2 LastName, e.g., Tao-ning Hsu (許道寧)
    Any order, e.g., Tao Ning Hsu (許道寧)
    Any transliteration, e.g., Ju Shi (朱熹)

Title
  “谿山行旅圖” “Travelers among Mountains and Streams”
  "travelers", "mountains", and "streams" are basic components
  Users ca
