keynote chengxiang zhai_第1页
keynote chengxiang zhai_第2页
keynote chengxiang zhai_第3页
keynote chengxiang zhai_第4页
keynote chengxiang zhai_第5页
已阅读5页,还剩56页未读 继续免费阅读

下载本文档

版权说明:本文档由用户提供并上传,收益归属内容提供方,若内容存在侵权,请进行举报或认领

文档简介

1、Beyond Search: StatisticalTopic Ms for Text AnalysisChengXiang ZhaiDepartment of Computer ScienceUniversity of Illinois at Urbana-Champaign1Keynote at SIGIR 2011, July 26, 2011,Search is a means to the end of finishing a taskInformation Synthesis& AnalysisSearchTask CompletionSearch 1Decision Ma

2、kingLearningSearch 2Information InterpretationInformation SynthesisPotentially iterateKeynote at SIGIR 2011, July 26, 2011,2Multiple SearchesExample Task 1: Comparing News ArticlesVietnam WarAfghan WarIraq War Fox During Iraq war BBC Post-Iraq war Before 9/11US blogEuropean blogAsian blogWhats in co

3、mmon? Whats unique?Keynote at SIGIR 2011, July 26, 2011,3Common Themes“Vietnam” specific“Afghan” specific“Iraq” specificUnited nationsDeath of peopleExample Task 2: Compare Customer ReviewsLaptopReviewsAPPLE LaptopReviewsDELL LaptopReviewsWhich laptop to buy?Keynote at SIGIR 2011, July 26, 2011,4Com

4、mon Themes“” specific“APPLE” specific“DELL” specificBattery Life.Hard diskSpeedExample Task 3:Identify Emerging Research TopicsSensor NetworksStructured data, XMLWhats hot in database research?Web dataData StreamsRanking, Top-K5Keynote at SIGIR 2011, July 26, 2011,Example Task 4: Analysis of Topic D

5、iffusionOne Week LaterHow did a discussion of a topic in blogs sp?Keynote at SIGIR 2011, July 26, 2011,6Sample Task 5:Opinion Analysis on Blog ArticlesQuery=“Da Vinci Code”Tom Hanks, who is myfavorite movie star act the leading role.ing. will loseyour faith bywatching the movie.PositiveNegative. so

6、sick of peoplemaking such a big dealabout a fiction booka good book to pasttime.What did people like/dislike about “Da Vinci Code”?Keynote at SIGIR 2011, July 26, 2011,7QuestionsCan we m general way?all these analysis problems in a Yes! Can we solve these problems with a unified approach? Yes! How c

7、an we bring users into the loop? Yes! Solutions: Statistical Topic MsKeynote at SIGIR 2011, July 26, 2011,8Rest of the talkOverview of Statistical Topic MsContextual Probabilistic Latent Sem Analysis (CPLSA)Text Analysis Enabled by CPLSAFrom Search Engines to Analysis EnginesKeynote at SIGIR 2011, J

8、uly 26, 2011,9What is a Statistical LM?A probability distribution over word sequences p(“Today is Wednesday”) » 0.001 p(“Today Wednesday is”) » 0.0000000000001 p(“The eigenvalue is positive”) » 0.00001Context/topic dependent!Can also be regarded as a probabilistic mechanism for “gener

9、ating” text, thus also called a “generative” mKeynote at SIGIR 2011, July 26, 2011,10The Simplest Language M(Unigram M)Generate a piece of text by generating each word independentlyThus, p(w1 w2 . wn)=p(w1)p(w2)p(wn)Parameters: p(wi)p(w1)+p(wN)=1 (N is voc. size)Essentially a multinomial distributio

10、n over wordsA piece of text can be regarded as a sample drawn according to this word distributionKeynote at SIGIR 2011, July 26, 2011,11Text Generation with Unigram LMq(Unigram) Language Mp(w| q)SamplingDocument dText miningpaperTopic 1: Text mining Given q, p(d| q) varies according to d Topic 2: He

11、althFood nutrition paperKeynote at SIGIR 2011, July 26, 2011,12food 0.25nutrition 0.1healthy 0.05diet 0.02text 0.2mining 0.1assocation 0.01clustering 0.02food 0.00001Estimation of Unigram LMq(Unigram) Language Mp(w| q)=?EstimationDocumenttext 10mining 5association 3database 3algorithm 2 query 1effic

12、ient 110/1005/1003/1003/100Total #words=1001/100language mas topic representation?Keynote at SIGIR 2011, July 26, 2011,13text ? mining ? assocation ? database ?query ?Language Mas Text Representation: Early Work1961: H. P. Luhns early idea of using relative frequency to represent text Luhn 611976: R

13、obertson & Sparck Jones BIR mRobertson & Sparck Jones 761989: Wong & Yaos work on multinomial distribution representation Wong & Yao 89Luhn, H. P (1961) The automatic derivation of information retrieval encodements frommachine-able texts. In A. Kent (Ed.), Information Retrieval and M

14、achine Translation,Vol. 3, Pt 2., pp. 1021-1028.S. Robertson and K. Sparck Jones. (1976). Relevance Weighting of Search Terms. JASIS, 27,129-146.S. K. M. Wong and Y. Y. Yao (1989), A probability distribution mretrieval. Information Processing and Management, 25(1):39-53.for informationKeynote at SIG

15、IR 2011, July 26, 2011,14Language M Two Importantas Text Representation: tones in 199819991998: Language mfor retrieval(i.e., querylikelihood scoring Ponte & Croft 98 (and also independently Hiemstra & Kraaij 99)1999: Probabilistic Latent Sem (PLSA) Hofmann 99AnalysisJ. M. Ponte and W. B. Cr

16、oft. A language ming approach to informationretrieval. In Proceedings of ACM-SIGIR 1998, pages 275-281.D. Hiemstra and W. Kraaij, Twenty-One at TREC-7: Ad-hoc and Cross-languagetrack, In Proceedings of the Seventh Text REtrieval Conference (TREC-7), 1999.Thomas Hofmann: Probabilistic Latent SemAnaly

17、sis. UAI 1999: 289-296Keynote at SIGIR 2011, July 26, 2011,15Probabilistic Latent SemAnalysis (PLSA)Topic 1Apple iPodP(w) = åP(z = i)P(w | Topici )i=1.KTopic 2Harry PotterKeynote at SIGIR 2011, July 26, 2011,16movie0.10potter0.05actress0.04music0.02hhaarrryry 00.0.099Idownloaded themusicofthemo

18、vie harry potter to my ipod nanonano0.08music0.05download0.02apple0.01ipipoodd 00.1.155Parameter Estimationizing data likelihood: Prior set by users L*= arg maxlog( P(Data | M)LParameter Estimation using EM algorithmPseudo-CountsGuess the affiliationEstimate the paramsKeynote at SIGIR 2011, July 26,

19、 2011,17movie0?.10harry0?.09potter0?.05actress0?.04music0?.02IIddoowwnnloloaaddeeddtthhee mmuussicic ooff tthheemmoovvieie hhaarrryyppoottteerr ttoo mmyyipipoodd nnaannooipod0?.15nano0?.08music0?.05download0?.02apple0?.01Context Features of a DocumentcommunitiesAuthorsourceLocationTimeAuthorsKeynote

20、 at SIGIR 2011, July 26, 2011,18Weblog ArticleA General View of Contextpapers written in199819981999Partition of documents Any combination ofcontext features(metadata) can define a context20052006papers written byBruce CroftWWW SIGIRACLKDDSIGMODKeynote at SIGIR 2011, July 26, 2011,19Empower PLSA wit

21、h Context Mei & Zhai 06Make topics depend on context variablesText is generated from a contextualized PLSAm(CPLSA)Fitting such a mto text enables a widerange of analysis tasks involving topics and contextQiaozhu Mei, ChengXiang Zhai, A Mixture Mfor Contextual Text Mining,Proceedings of the 2006

22、ACM SIGKDD International Conference on KnowledgeDiscovery and Data Mining , (KDD'06 ), pages 649-655Keynote at SIGIR 2011, July 26, 2011,20Contextual Probabilistic Latent SemAnalysissChoose a themeView1View2View3Draw a word fThemesgoventdonNeOrlTexasJuly 2005sociolo gistChoose a viewTheme covera

23、ges:Choose a CoverageTexasdocumentJuly 2005Keynote at SIGIR 2011, July 26, 2011,21roCmritiqciism of government response togothveernhmurernict aneprimariDly ococnusimsteednotf critirceisspmonosef its response to Thecotontatl esxhut:t-in oil prodTuimcteio=n fJroumly t2he00G5ulfof MLeoxiccaotiodnoanp=a

24、ptTreoexximasately24% oAf tuhtehaonrhn=eulpprOodaid io. n= aSndocthioelsohguistt-ccuuctpin gAasgeprGodrouucptio=n 45O+verseventy countries pledgedmonetary do Orleans ornationsother assistnaenwce. w eanscity 0.2new0.1orleans 0.05 .ationdonate 0.1relief 0.05help 0.02 .rnmegovernment 0.3response 0.2.Co

25、mparing News ArticlesIraq War (30 articles) vs. Afghan War (26 articles) The common theme indicates that “United Nations” is involved in both wars Collection-specific themes indicate different roles of “United Nations” in the two wars Keynote at SIGIR 2011, July 26, 2011,22Cluster 1Cluster 2Cluster

26、3CommonThemeunited0.042nations0.04killed0.035month0.032deaths0.023IraqThemen0.03Weapons0.024Inspections 0.023troops0.016hoon0.015sanches0.012AfghanThemeNorthern0.04alliance0.04kabul0.03taleban0.025aid0.02taleban0.026rumsfeld0.02hotel0.012front0.011Spatiotemporal Patterns in Blog ArticlesQuery= “Hurr

27、icane Katrina”Topics in the results:Government Responsebush 0.071president 0.061federal 0.051government 0.047fema 0.047administrate 0.023response 0.020brown 0.019blame 0.017governor 0.014New Orleanscity 0.063orleans 0.054new 0.034louisiana 0.023flood 0.022evacuate 0.021storm 0.017resident 0.016cente

28、r 0.016rescue 0.012Oil Priceprice 0.077oil 0.064gas 0.045increase 0.020product 0.020fuel 0.018company 0.018energy 0.017market 0.016gasoline 0.012Praying and Blessinggod 0.141pray 0.047prayer 0.041love 0.030life 0.025bless 0.025lord 0.017jesus 0.016will 0.013faith 0.012Aid and Donationdonate 0.120rel

29、ief 0.076red 0.070cross 0.065help 0.050victim 0.036organize 0.022effort 0.020fund 0.019volunteer 0.019ali 0.405my 0.116me 0.060am 0.029think 0.015feel 0.012know 0.011something 0.007guess 0.007myself 0.006Spatiotemporal patterns23Keynote at SIGIR 2011,July 26, 2011,Theme Life Cycles (“Hurricane Katri

30、na”)Oil PriceNew OrleansKeynote at SIGIR 2011, July 26, 2011,24city0.0634orleans 0.0541new 0.0342louisiana 0.0235flood 0.0227evacuate 0.0211storm 0.0177price 0.0772oil 0.0643gas 0.0454increase 0.0210product 0.0203fuel 0.0188company 0.0182Theme Snapshots (“Hurricane Katrina”)Week2: The discussion mov

31、es towards the north and westWeek1: The theme is the strongest along the Gulf of MexicoWeek3: The theme distributes more uniformly overthe statesWeek4: The theme isrong along the east coast and the Gulf of MexicoWeek5: The theme fades out in most statesKeynote at SIGIR 2011, July 26, 2011,25Theme Li

32、fe Cycles (KDD Papers)gene 0.0173expressions 0.0096probability 0.0081microarray 0.00380.02Biology Data Web Information Time Series Classification Association Rule ClusteringBussiness0.0180.0160.014marketing 0.00870.012customer 0.0086m0.00790.01business 0.00480.0080.006rules 0.0142association 0.0064s

33、upport 0.00530.0040.0020199920002001200220032004Time (year)Keynote at SIGIR 2011, July 26, 2011,26Normalized Strength of ThemeTheme Evolution Graph: KDD199920002001200220032004TKeynote at SIGIR 2011,July 26, 2011,27Classifica- tion0.015text0.013unlabeled 0.012document 0.008labeled0.008learning0.007d

34、ecision 0.006tree0.006classifier 0.005class0.005Bayes0.005topic0.010mixture 0.008LDA0.006sem0.005Informa- tion0.012web0.010social0.008retrieval 0.007distance 0.005networks 0.004mixture 0.005random 0.006cluster 0.006clustering 0.005variables0.005SVM0.007criteria 0.007 classifica tion0.006linear0.005w

35、eb0.009 classifica tion0.007 features0.006 topic0.005Multi-Faceted Sentiment Summary (query=“Da Vinci Code”)28Keynote at SIGIR 2011, July 26, 2011,NeutralPositiveNegativeFacet 1: Movie. Ron Howards selection of Tom Hanks to play Robert Langdon.Tom Hanks stars in the movie,who can be mad at that?But

36、the movie might get delayed, and even killed off if he loses.Directed by: Ron Howard Writing credits: Akiva Goldsman .Tom Hanks, who is my favorite movie star act the leading role.ing . will lose your faith by . watching the movie.After watching the movie I went online and some research on .Anybody

37、is interested in it?. so sick of people making such a big deal about a FICTION book and movie.Facet 2:BookI remembered when i first the book, I finished thebook in two days.Awesome book. so sick of people making such a big deal about a FICTION book and movie.Iming “Da Vinci Code”now.So still a good

38、book topast time.This controversy book causelotsin west society.Separate Theme Sentiment Dynamics“religious beliefs”“book”Keynote at SIGIR 2011, July 26, 2011,29Event Impact Analysis: IR ResearchTheme: retrieval mSIGIR paperssPublication of the paper “A language ing approach to information retrieval

39、”m1992year1998Starting of the TREC conferencesKeynote at SIGIR 2011, July 26, 2011,30probabilist 0.0778m0.0432logic0.0404ir0.0338boolean0.0281algebra0.0200estimate0.0119weight0.0111m0.1687language0.0753estimate0.0520parameter0.0281distribution0.0268probable0.0205smooth0.0198markov0.0137likelihood0.0

40、059term0.1599relevance0.0752weight0.0660feedback0.0372independence 0.0311m0.0310frequent0.0233probabilistic 0.0188document0.0173vector0.0514concept0.0298extend0.0297m0.0291space0.0236boolean0.0151function0.0123feedback0.0077xml0.06780.0197m0.0191collect0.0187judgment 0.0102rank0.0097subtopic0.0079Ma

41、ny Other VariationsLatent Dirichlet Allocation (LDA) Blei et al. 03 Impose priors on topic choices and word distributions Make PLSA a generative mMany variants of LDA!In practice, LDA and PLSA variants tend to work equally well for text analysis Lu et al. 11Blei et al. 02 D. Blei, A. Ng, and M. Jord

42、an. Latent dirichlet allocation. In T G Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems 14, Cambridge, MA, 2002. MIT Press.Yue Lu, Qiaozhu Mei, ChengXiang Zhai. Investigating Task Performance ofProbabilistic Topic Ms - An Empirical Study of PLSA a

43、nd LDA, InformationRetrieval, vol. 14, no. 2, April, 2011.Keynote at SIGIR 2011, July 26, 2011,31Other Uses of Topic Ms for Text AnalysisTopic analysis on social networks Mei et al. 08Opinion Integration Lu & Zhai 08Latent Aspect Rating AnalysisWang et al. 10Qiaozhu Mei, Deng Cai, Duo Zhang, Che

44、ngXiang Zhai. Topic Ming with NetworkRegularization, Proceedings of the World Wide Conference 2008 ( WWW'08), pages 101-110.Yue Lu, ChengXiang Zhai. Opinion Integration Through Semi-supervised Topic MProceedings of the World Wide Conference 2008 ( WWW'08), pages 121-130.ing,Hongning Wang, Yu

45、e Lu, ChengXiang Zhai. Latent Aspect Rating Analysis on Review TextData: A Rating Regression Approach, Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'10), pages 115-124, 2010.Keynote at SIGIR 2011, July 26, 2011,32Topic Ming + Social Netwo

46、rks:who work together on what?Authors writing about the same topic form a commu Separation of 3 research communities: IR, ML, Web Topic MOnlyTopic M+ Social NetworkKeynote at SIGIR 2011, July 26, 2011,33Topic Mfor Opinion IntegrationHow to digest all?190,451 posts4,773,658 resultsKeynote at SIGIR 20

47、11, July 26, 2011,34Two Kinds of Opinions4,773,658 results190,451 postsKeynote at SIGIR 2011, July 26, 2011,35How to benefit from both?Ordinary opinions Forum discussions Blog articles Represent the majority Up to date Hard to access fragmentalExpert opinions CNET editors review Wikipedia article We

48、ll-structured Easy to access Maybe biased Outdated soonGenerate an Integrative SummaryOutputSimilaropinionsSupplementaryopinionsTopic: iPodDesignBatteryDesignBatteryPr ice. PriceExpert reviewwith aspectsText collectionof ordinaryopinions, e.g.WeblogsKeynote at SIGIR 2011, July 26, 2011,36Extra Aspec

49、ts Review AspectsInputIntegrated SummaryiTunes easy to usewarrantybetter to extend.cute tiny.thicker.last many hrsdie out sooncould afford itstill expensiveMethodsSemi-Supervised Probabilistic Latent Sem Analysis (PLSA) The aspects extracted from expert reviews serve as cluesto define a conjugate pr

50、ior on topicsum a Posteriori (MAP) estimation Repeated applications of PLSA to integrate and align opinions in blog articles to expert reviewKeynote at SIGIR 2011, July 26, 2011,37Results: Product ()Opinion Integration with review aspectsReview articleSimilar opinionsSupplementary opinionsYou can makeemergency calls, butyou can't use anyN/A methods for unlocking thehave emerged on theUInntleorcnke/thianckthe past few weeks,other functionsalwith thethey involve tinkeringhardwareConfirm theopinions fro

温馨提示

  • 1. 本站所有资源如无特殊说明,都需要本地电脑安装OFFICE2007和PDF阅读器。图纸软件为CAD,CAXA,PROE,UG,SolidWorks等.压缩文件请下载最新的WinRAR软件解压。
  • 2. 本站的文档不包含任何第三方提供的附件图纸等,如果需要附件,请联系上传者。文件的所有权益归上传用户所有。
  • 3. 本站RAR压缩包中若带图纸,网页内容里面会有图纸预览,若没有图纸预览就没有图纸。
  • 4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
  • 5. 人人文库网仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对用户上传分享的文档内容本身不做任何修改或编辑,并不能对任何下载内容负责。
  • 6. 下载文件中如有侵权或不适当内容,请与我们联系,我们立即纠正。
  • 7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。

评论

0/150

提交评论