Introduction to Cross-Language Information Retrieval (PPT courseware)

Introduction to Cross-Language Information Retrieval
Hsin-Hsi Chen (陳信希)
Department of Computer Science and Information Engineering
National Taiwan University

Outline
- Multilingual Environments
- What is Cross-Language Information Retrieval?
- Major Problems in CLIR
- Major Approaches in CLIR
- Case Study: CLIR in NPDM
- Summary

Multilingual Collections
- There are 6,703 languages listed in the Ethnologue
- Digital libraries
  - OCLC Online Computer Library Center serves more than 17,000 libraries in 52 countries and contains over 30 million bibliographic records with over 500 million records of ownership attached in more than 370 languages
- World Wide Web
  - Around
    40% of Internet users do not speak English; however, 80% of Web sites are still in English

[Figure: bar chart, Speakers (Millions), axis 0-800; languages shown: Chinese, Hindi-Urdu, Portuguese, Russian, Japanese]

Real-World Language Speaker Populations ( g11n/faq.htm)
[Figure: Chinese, English, Hindi, Spanish, Portuguese, Bengali, Russian, Arabic, Japanese]

(Statistics from Euro-Marketing Associates, 2019)
[Figure: Spanish, German, Japanese, French, Chinese, Dutch, Portuguese, Italian, Swedish, Korean; glreach/globstats/]

(Statistics from Euro-Marketing Associates, 2019)
Share of Chinese speakers (6.1%)

… 南非, Südafrika)
- Coverage of the vocabulary
- There is not a one-to-one mapping between two languages
- Translating queries automatically (lack of syntax)
- Translating documents automatically (performance, ...)
- Computing mixed result lists

Cross-Language Information Retrieval
[Figure: taxonomy of approaches]
- Controlled Vocabulary
- Free Text
  - Knowledge-based
    - Thesaurus-based
    - Ontology-based
    - Dictionary-based
  - Corpus-based
    - Term-aligned
    - Sentence-aligned
    - Document-aligned (Parallel, Comparable)
    - Unaligned
  - Hybrid

Cross-Language Information Retrieval
[Figure: what is translated]
- Query Translation
- Document Translation
  - Text Translation
  - Vector Translation
- No Translation

Query Translation Based CLIR
[Figure: English Query -> Translation Device -> Chinese Query -> Monolingual Chinese Retrieval System -> Retrieved Chinese Documents]

Translating the 400 Million Non-English Pages of the WWW
- ... would take 100,000 days (300 years) on one fast PC, or 1 month on 3,600 PCs.

Knowledge-Based
- Examples
  - Subject Thesaurus
    - Hierarchical and associative relations.
    - Unique term assigned to each node.
  - Concept List
    - Term space partitioned into concept spaces.
  - Term List
    - List of cross-language synonyms.
  - Lexicon
    - Machine-readable syntax and/or semantics.

Ontology-Based Approaches
- Exploit complex knowledge representations, e.g., EuroWordNet
- A Proposal for Conceptual Indexing using EuroWordNet

Dictionary-Based Approaches
- Exploit machine-readable dictionaries.
- Problems
  - translation ambiguity + target
    polysemy
  - coverage (unknown words, abbreviations, ...)

Dictionary-Based Approaches (Continued)
- Issue 1: selection strategy
  - Select all.
  - Select N randomly.
  - Select best N.
- Issue 2: which level
  - word
  - phrase

Selection Strategy: Select All
- Hull and Grefenstette 2019
  - Take the concatenation of all term translations.
    E: politically motivated civil disturbances
    F: troubles civils à caractère politique
    trouble - turmoil, discord, trouble, unrest, disturbance, disorder
    civil - civil, civilian, courteous
    caractère - character, nature
    politique - political, diplomatic, politician, policy
  - Original English (0.393) vs. automatic word-based transfer dictionary (0.235): 59.8%.
  - errors: multi-word expressions and ambiguity

Selection Strategy: Select All (Continued)
- Davis 2019 (TREC5)
  - Replace each English query term with all of its Spanish equivalent terms from the Collins bilingual dictionary.
  - Monolingual (0.2895) vs. all-equivalent substitution (0.1422): 49.12%

Evaluation Method
- Average Precision (5-, 9-, 11-points)
- Model
  [Figure: three retrieval pipelines over the TREC Spanish Corpus. Node labels: (1) Spanish Query, Mono IR Engine; (2) English Query, Bilingual Dictionary, Spanish Equivalents, Mono IR Engine; (3) English Query, POS, Bilingual Dictionary, Spanish Equivalents by POS, Mono IR Engine]

Selection Strategy: Select N
- Simple word-by-word translation
  - Each query term is replaced by the word or group of words given for the first sense of the term's definition.
  - 50-60% drop in performance (average precision)

Selection Strategy: Select N (Continued)
- word/phrase translation
  - Take at most three translations of each word, one from each of the first three senses. Take the phrase translation if it appears in the dictionary.
  - 30-50% worse than good translation
  - Well-translated phrases can greatly improve effectiveness, but poorly translated phrases may negate the improvements.
  - WBW (0.0244), phrasal (0.0148), good phrasal (0.0610): -39.3%, +150.3%

Selection Strategy: Select Best N
- Hayashi, Kikui and Susaki 2019
  - search for a dictionary entry corresponding to the longest sequence of words from left to right
  - choose the most frequently used word (or phrase) in a text corpus collected from the WWW
  - no report for this query translation approach
- Davis 2019 (TREC5)
  - POS disambiguation
  - Monolingual (0.2895) vs. all-equivalent substitution (0.1422) vs. POS disambiguation (0.1949): near 67.3%

Corpus-Based Approaches
- Categorization
  - Term-Aligned
  - Sentence-Aligned
  - Document-Aligned (Parallel, Comparable)
  - Unaligned
- Usage
  - Setup Thesaurus
  - Vector Mapping

Term-Aligned Corpora
- Fine-grained alignment in parallel corpora
- Oard 2019
  - Term alignment is a challenging problem.
[Figure: node labels: Parallel Bilingual Corpus, Co-occurrence Statistics, Translation Tables, Machine Translation System, English Query, Spanish Query]

Sentence-Aligned Corpora
- Davis & Dunning 2019 (TREC4)
  - High-frequency Terms

Brief Summary
- dictionary-based methods
  - Specialized vocabulary not in the dictionaries will not be translated.
  - Ambiguities will add extraneous terms to the query.
- parallel/comparable corpora-based methods
  - Parallel corpora are not always available.
  - Available
    corpora tend to be relatively small or to cover only a small number of subjects.
  - Performance depends on how well the corpora are aligned.

Brief Summary (Continued)
- Dictionaries are very useful.
  - Achieve 50% on their own
- Parallel corpora have limitations.
  - Domain shifts
  - Term alignment accuracy
- Dictionaries and corpora are complementary.
  - Dictionaries provide broad and shallow coverage.
  - Corpora provide narrow (domain-specific) but deep (more terminology) coverage of the language.

Hybrid Methods
- What knowledge can be employed?
  - lexical knowledge
  - corpus knowledge
  - ...

Hybrid Methods (Continued)
- Query Expansion
- Issue 1: context
  - pseudo relevance feedback (local feedback): a query is modified by the addition of terms found in the top retrieved documents.
  - local context analysis: queries are expanded by the addition of the top-ranked concepts from the top passages.

Hybrid Methods (Continued)
- Issue 2: when
  - before query translation
  - after query translation

Hybrid Methods (Continued)
- Ballesteros & Croft 2019
[Figure: experimental setup. Node labels: Original Spanish TREC Queries, human translation, English (BASE) Queries, Spanish Queries, automatic dictionary translation, English Queries, query expansion, Spanish Queries, query expansion, Spanish Queries, automatic dictionary translation, INQUERY]

Hybrid Methods (Continued)
- Performance Evaluation
  - pre-translation: MRD (0.0823) vs. LF (0.1099) vs. LCA10 (0.1): +33.5%, +38.5%
  - post-translation: MRD (0.0823) vs. LF (0.0916) vs. LCA20 (0.1022): +11.3%, +24.1%
  - combined pre- and post-translation: MRD (0.0823) vs. LF (0.1242) vs. LCA20 (0.8): +51.0%, +65.0%
  - 32% below a monolingual baseline

Cross-Language Evaluation Forum
- A collaboration between the DELOS Network of Excellence for Digital Libraries and the US National Institute of Standards and Technology (NIST)
- Extension of the CLIR track at TREC (2019-2019)

Main Goals
- Promote research in cross-language system development for European languages by providing an appropriate infrastructure for:
  - CLIR system evaluation, testing and tuning
  - Comparison and discussion of results

CLEF 2000 Task Description
- Four evaluation tracks in CLEF 2000
  - multilingual information retrieval
  - bilingual information
    retrieval
  - monolingual (non-English) information retrieval
  - domain-specific IR

Case Study: CLIR for NPDM

3M in Digital Libraries/Museums
- Multi-media
  - Selecting suitable media to represent contents
- Multi-linguality
  - Decreasing the language barriers
- Multi-culture
  - Integrating multiple cultures

NPDM Project
- Palace Museum, Taipei, one of the famous museums in the world
- NSC supports a pioneer study of a digital museum project, NPDM, starting from 2000
  - Enamels from the Ming and Ching Dynasties
  - Famous Album Leaves of the Sung Dynasty
  - Illustrations in Buddhist Scriptures with Relative Drawings

Design Issues
- Standardization
  - A standard metadata protocol is indispensable for the interchange of resources with other museums.
- Multimedia
  - A suitable presentation scheme is required.
- Internationalization
  - to share the valuable resources of NPDM with users of different languages
  - to utilize knowledge presented in a foreign language

Translingual Issue
- CLIR
  - to allow users to issue queries in one language to access documents in another language
  - the query language is English and the document language is Chinese
- Two common approaches
  - Query translation
  - Document translation

Resources in NPDM pilot
- an enamel, a
  calligraphy, a painting, or an illustration
- MICI-DC
  - Metadata Interchange for Chinese Information
- Accessible fields to users
  - Short descriptions vs. full texts
  - Bilingual versions vs. Chinese only
- Fields for maintenance only

Search Modes
- Free search
  - users describe their information need using natural languages (Chinese or English)
- Specific topic search
  - users fill in specific fields denoting authors, titles, dates, and so on

Example
- Information need
  - Retrieve “Travelers Among Mountains and Streams”, Fan Kuan (“范寬谿山行旅圖”)
- Possible queries
  - Author: Fan Kuan; Kuan, Fan
  - Time: Sung Dynasty
  - Title: Mountains and Streams; Travel among mountains; Travel among streams; Mountain and stream painting
  - Free search: landscape painting; travelers, huge mountain, Nature; scenery; Shensi province

ECIR in NPDM
[Figure: system architecture. Node labels: English Names, Chinese Names, Machine Transliteration, English Titles, Chinese Titles, Document Translation, Name Search, Title Search, English Query, Query Disambiguation, Specific Bilingual Dictionary, Generic Bilingual Dictionary, Chinese Query, Query Translation, Chinese IR System, NPDM Collection, Results]

Specific Topic Search
- proper names are important query terms
  - Creators such as “林逋” (Lin Pu), “李建中” (Li Chien-chung), “歐陽脩” (Ou-yang Hsiu), etc.
  - Emperors such as “康熙” (Kang-hsi), “乾隆” (Chien-lung), “徽宗” (Hui-tsung), etc.
  - Dynasties such as “宋” (Sung), “明” (Ming), “清” (Ching), etc.

Name Transliteration
- The alphabets of Chinese and English are totally different
- Wade-Giles (WG) and Pinyin are two famous systems to romanize Chinese in libraries
- backward transliteration
  - Transliterate target language terms back to source language ones
  - Chen, Huang, and Tsai (COLING, 2019)
  - Lin and Chen (ROCLING, 2000)

Name Mapping Table
- Divide a name into a sequence of Chinese characters, and transform each character into phonemes
- Look up the phoneme-to-WG (Pinyin) mapping table, and derive a canonical form for the name
- Example
  - “林逋” -> “Lin Pu” (WG)

Name Similarity
- Extract the named entity from the query
- Select the most similar named entity from the name mapping table
- Naming sequence/scheme
  - LastName FirstName1, e.g., Chu Hsi (朱熹)
  - FirstName1 LastName, e.g., Hsi Chu (朱熹)
  - LastName FirstName1-FirstName2, e.g., Hsu Tao-ning (許道寧)
  - FirstName1-FirstName2 LastName, e.g., Tao-ning Hsu (許道寧)
  - Any order, e.g., Tao Ning Hsu (許道寧)
  - Any transliteration, e.g., Ju Shi (朱熹)

Title
- 谿山行旅圖 “Travelers among Mountains and Streams”
  - travelers, mountains, and streams are basic components
- Users can express their information need through the descriptions of a desired art
- System will measure the similarity of art titles (descriptions) and a query

Free Search
- A query is composed of several concepts.
- Concepts are either transliterated or translated.
- The query translation is similar to a small-scale IR system
- Resources
  - Name-mapping
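
The naming sequences listed under Name Similarity (surname first or last, hyphenated or spaced given names) can be reconciled by comparing names as unordered sets of syllables. The following is a minimal sketch under stated assumptions, not the NPDM implementation: `NAME_TABLE`, `normalize`, and `match_name` are hypothetical names, the table holds only three Wade-Giles entries taken from the slides, and Jaccard overlap stands in for whatever similarity measure the real system uses. It covers reordering and hyphenation only; the "any transliteration" case (e.g., Ju Shi) would need a phonetic comparison that is not attempted here.

```python
# Hypothetical name mapping table: canonical Wade-Giles form -> Chinese name.
NAME_TABLE = {
    "Chu Hsi": "朱熹",
    "Hsu Tao-ning": "許道寧",
    "Lin Pu": "林逋",
}


def normalize(name):
    """Lowercase, split hyphenated syllables, and ignore syllable order."""
    syllables = name.lower().replace("-", " ").split()
    return frozenset(syllables)


def match_name(query_name, table=NAME_TABLE):
    """Return the Chinese name whose romanized syllables best overlap the query.

    Returns None when no table entry shares a syllable with the query.
    """
    query = normalize(query_name)
    best, best_score = None, 0.0
    for romanized, chinese in table.items():
        entry = normalize(romanized)
        # Jaccard overlap between the two syllable sets (an assumed measure).
        score = len(query & entry) / len(query | entry)
        if score > best_score:
            best, best_score = chinese, score
    return best
```

With this table, `match_name("Hsi Chu")` and `match_name("Tao Ning Hsu")` resolve to the same entries as their canonical forms, while `match_name("Ju Shi")` finds no overlap and returns `None`.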
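
The dictionary-based selection strategies discussed earlier (select all versus taking only the first few listed translations) can be illustrated with the French example from the Select All slide. This is a rough sketch under assumptions: `DICTIONARY` and `translate_query` are hypothetical names, the query is assumed to be already reduced to unaccented dictionary headwords, and each listed translation is treated as a separate sense, which the slides do not state.

```python
# Toy French-to-English dictionary; entries copied from the Select All slide.
# Treating each listed translation as its own sense is an assumption.
DICTIONARY = {
    "trouble": ["turmoil", "discord", "trouble", "unrest", "disturbance", "disorder"],
    "civil": ["civil", "civilian", "courteous"],
    "caractere": ["character", "nature"],
    "politique": ["political", "diplomatic", "politician", "policy"],
}


def translate_query(terms, dictionary, max_per_term=None):
    """Replace each query term with its dictionary translations.

    max_per_term=None keeps every translation (select all);
    max_per_term=N keeps only the first N listed translations.
    """
    translated = []
    for term in terms:
        candidates = dictionary.get(term, [])
        if not candidates:
            # Coverage problem: unknown terms pass through untranslated.
            translated.append(term)
            continue
        translated.extend(candidates[:max_per_term])
    return translated


query = ["trouble", "civil", "caractere", "politique"]
select_all = translate_query(query, DICTIONARY)
first_sense = translate_query(query, DICTIONARY, max_per_term=1)
```

Select all turns a four-term query into fifteen target terms, which shows how ambiguity adds extraneous terms; keeping one translation per term avoids the growth but depends entirely on the dictionary's ordering.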
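
Pseudo relevance feedback (local feedback), described under Hybrid Methods, adds terms from the top retrieved documents to the query. The sketch below is a simplification under assumptions: `pseudo_relevance_feedback` is a hypothetical function, documents are pre-tokenized lists already ranked by some retrieval system, and raw term frequency replaces the term weighting a real system such as INQUERY would apply.

```python
from collections import Counter


def pseudo_relevance_feedback(query_terms, ranked_docs, top_docs=2, new_terms=2):
    """Expand a query with frequent terms from the top-ranked documents.

    ranked_docs is a list of token lists, best match first. Only terms not
    already in the query are considered as expansion candidates.
    """
    counts = Counter()
    for doc in ranked_docs[:top_docs]:
        counts.update(t for t in doc if t not in query_terms)
    # Sort by frequency (descending), then alphabetically for determinism.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    expansion = [term for term, _ in ranked[:new_terms]]
    return list(query_terms) + expansion


docs = [
    ["civil", "unrest", "riot", "riot"],
    ["riot", "protest", "unrest"],
    ["weather"],
]
expanded = pseudo_relevance_feedback(["civil", "unrest"], docs)
```

Running the same function on the source-language query before translation, on the translated query afterwards, or on both corresponds to the pre-translation, post-translation, and combined settings compared in the performance evaluation.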
