From Knowledge Mining to Business Intelligence: Advanced Statistics Application
(由知识挖掘提升商务智能应用: 统计分析的进阶加值应用)

Dr. 谢邦昌 (Ben-Chang Shia)
  • Chair Professor and doctoral supervisor, Xiamen University
  • Chair Professor and doctoral supervisor, Capital University of Economics and Business
  • Chair Professor and doctoral supervisor, Central University of Finance and Economics
  • Chair Professor, Southwestern University of Finance and Economics
  • Adjunct Professor, Renmin University of China
  • Professor, Department of Statistics and Information Science and Graduate Institute of Applied Statistics, Fu Jen Catholic University
  • President, Chinese Data Mining Association (中华资料采矿协会)

Outline
  • Knowledge mining (integrating data mining and text mining) and the development of business intelligence
  • Knowledge mining procedures, steps, outputs, and applications
  • How to integrate data mining with text mining
  • A review of technical developments in knowledge mining

The value of retaining knowledge
Reduce: cycle time, response time, duplicated investment, operating costs, meeting time, outside consultants, and so on.
Increase: productivity and quality, transfer of enterprise knowledge, fast and effective decisions, training, innovation, collaborative problem solving, and so on.

Retaining and transferring enterprise knowledge
Drivers: investment in knowledge assets; downsizing and retirement; staff turnover.
Symptoms: productivity and capability problems, duplicated effort, too many meetings, communication problems.
Organizational goal: decision making that is feasible, fast, and informal.

Why is knowledge so urgent?
"The chief economic priority for developed countries is to raise the productivity of knowledge . . . The country that does this first will dominate the twenty-first century economically." (Peter F. Drucker)

From data to knowledge: the formation process
Raw data -> (integration) -> data warehouse -> (selection/cleansing) -> target data -> (preprocessing) -> preprocessed data -> (transformation) -> transformed data -> (data mining) -> patterns -> (interpretation/evaluation) -> knowledge and understanding.

BI architecture
Data sources: operational DBs, other sources.
Monitor & Integrator: extract, transform, load, refresh; metadata.
Storage: complete data warehouse, data marts.
Serve: OLAP server.
Tools (Business Intelligence):
  1. Comprehensive performance management
  2. Analysis
  3. Query
  4. Reports
  5. Data mining

Data mining
Techniques shown: rule induction, neural networks, tree generators, support vector machines, regression, COBWEB, expectation maximization, k-means, rough sets, Apriori, granular computing, trend functions.
Tasks:
  • Classification: categorize your customers or clients
  • Prediction: forecast future sales or usage
  • Segmentation: group similar customers or clients
  • Association: discover products that are purchased together
  • Sequence: find patterns and trends over time

Gaining market intelligence from news feeds (Sreekumar Sukumaran and Ashish Sureka)
Integrated BI systems:
  • Structured data (DBMS, file system, XML, EA, legacy) -> ETL -> complete data warehouse
  • Unstructured data (CMS, scanned documents, email) -> ETL -> text tagger & annotator -> intermediate data (RDBMS, XML) -> complete data warehouse

Knowledge sources and value
"On average, professional users spend 11 hours per week looking for information. Seventy-one percent said they could not find what they were looking for." (Information Management Software, Lazard Freres & Co. LLC, February 2001)
"The volume of digitized information will double every year from 2000 to 2005 (an increase to 30 times today's volume)." (Knowledge Management vs. Information Management, Gartner Group, September 2000)
Sources: web information, news reports, patents, email, documents, literature.

The problem
Publication statistics: 8 TB (books), 25 TB (news), 20 TB (magazines), 2 TB (journals).
Scientific knowledge grows by an average of 2,000 pages per minute; reading the new material would take 5 years at 24 hours a day.
How Can I Keep Up With the Literature?

Evolution
"To study history one must know in advance that one is attempting something fundamentally impossible, yet necessary and highly important." Father Jacobus (Hesse's Magister Ludi), Das Glasperlenspiel (The Glass Bead Game)

Document knowledge discovery and management techniques
  • Information access: document retrieval, filtering
  • Information structuring: classification, summarization, clustering
  • Knowledge cognition: natural language, content analysis, extraction, mining
  • Knowledge generation: visualization, extraction applications, mining applications

Text processing pipeline
Raw text -> tokenized text -> stemming & stop words -> term weighting.
Term weighting yields an m x n document-term matrix W = (w_ij), with rows d1, d2, ..., dm (documents) and columns t1, t2, ..., tn (terms).
From W: term similarity, document similarity, and vector centroids, which feed clustering and classification (with metadata/annotation); sentence selection feeds summarization.

Text ETL to Mining
Unstructured data (original call record):
  Call Taker: James
  Date: Aug. 30, 2002
  Duration: 10 min.
  CustomerID: ADC00123
  Q: cust sys has stopped working.
  A: checked cust bios and it need updated.
Structured data (metadata):
  Call Taker: James
  Date: 2002/08/30
  Duration: 10 min.
  CustomerID: ADC00123
Linguistic analysis output:
  Noun: Customer, Software, BIOS
  Subj.-Verb: customer system..stop
  SW.Problem: BIOS..need
Steps: tagging, dependency analysis, named entity extraction, intention analysis, supported by a category dictionary and a synonym dictionary (category, item), followed by mining, visualization, and interactive mining.
IBM TAKMI (Nasukawa, Nagano, 1999). Mining target: individual text. Mining unit: texts; category-labeled items extracted from text using NLP.

Text is Tough
  • Text expresses abstract concepts that are extremely hard to represent (AI-complete).
  • It encodes endless abstract and complex relationships among many concepts.
  • One word can stand for many different concepts: CELL, IV.
  • Similar concepts can be expressed in many ways (aliases): space ship, flying saucer, UFO, figment of imagination.
  • Concepts are hard to visualize.
  • High dimensionality: the analysis may involve hundreds or thousands of features.

Text Mining is Easy
  • Text is highly redundant.
  • A few simple algorithms can get good results even on fairly crude tasks: finding important phrases, finding meaningfully related words, building summaries from articles.
  • Main problem: evaluating results. Goals and objectives must be defined.

Traditional IR-based Extraction
Document vectors 1..n are scored against a profile vector. Each score is compared with a threshold: documents that pass are accepted, the rest are rejected. Judgments on the accepted documents drive vector learning and threshold learning through a utility function.
Supporting elements: ontology, Text-DB, lexicons, vector initialization, threshold initialization.
Design note: reuse retrieval algorithms; new threshold algorithms.

Luhn's ideas
"It is here proposed that the frequency of word occurrence in an article furnishes a useful measurement of word significance. It is further proposed that the relative position within a sentence of words having given values of significance furnish a useful measurement for determining the significance of sentences. The significance factor of a sentence will therefore be based on a combination of these two measurements."

Information extraction: example template (Job2)
  JobTitle: Ice Cream Guru
  Employer:
  JobCategory: Travel/Hospitality
  JobFunction: Food Services
  JobLocation: Upper Midwest
  Contact Phone: 800-488-2611
  DateExtracted: January 8, 2001
  Source: /jobs_midwest.html
  OtherCompanyJobs: -Job1

Information Extraction
Given:
  • A source of textual documents
  • A well-defined, limited query (text based)
Find:
  • Sentences with relevant information
  • Extract the relevant information and ignore non-relevant information (important!)
  • Link related information and output it in a predetermined format

Example source text (job posting)
Advisory Programmer - Oracle (Austin, TX)
Response Code: 1008-0074-97-iexc-jcn
Responsibilities: This is an exciting opportunity with Siemens Wireless Terminals; a start-up venture fully capitalized by a Global Leader in Advanced Technologies. Qualified candidates will: Responsible for assisting with requirements definition, analysis, design and implementation that meet objectives, codes difficult and sophisticated routines. Develops project plans, schedules and cost data. Develop test plans and implement physical design of databases. Develop shell scripts for administrative and background tasks, stored procedures and triggers. Using Oracle's Designer 2000, assist with Data Model maintenance and assist with applications development using Oracle Forms.
Qualifications: BSCS, BSMIS or closely related field or related equivalent knowledge normally obtained through technical education programs. 5-8 years of professional experience in development, system design analysis, programming, installation using Oracle development

Automatic Pattern-Learning Systems
Pros:
  • Portable across domains
  • Tend to have broad coverage
  • Robust in the face of degraded input
  • Automatically find appropriate statistical patterns
  • System knowledge is not needed by those who supply the domain knowledge
Cons:
  • Annotated training data, and lots of it, is needed
  • Isn't necessarily better or cheaper than a hand-built solution
Examples: Riloff et al., AutoSlog; Soderland, WHISK (UMass); Mooney et al., Rapier (UTexas); Ciravegna (Sheffield). These learn lexico-syntactic patterns from templates.
Diagram: language input and answers -> trainer -> model; language input -> decoder (using the model) -> answers.

Text Analysis Spectrum
Clustering, classification, concept identification, entity extraction, targeted facts and events.
Questions along the spectrum: What is this document about? Who did what to whom, when, where, etc.?

Why is getting dimensional data so hard?
Example: "Hank bought plastic explosives from Henry in Tucson yesterday."
Named entity extraction (people, weapons, vehicles, dates): NER engine -> Hank, Henry, plastic explosives, Tucson, 11/01/07. Resource shown: FrameNet.

Name Extraction via HMMs
Pipeline: text, or speech through speech recognition -> extractor (using NE models) -> entities: locations, persons, organizations.
Training: training sentences and answers -> training program -> NE models.
Example sentence: "The delegation, which included the commander of the U.N. troops in Bosnia, Lt. Gen. Sir Michael Rose, went to the Serb stronghold of Pale, near Sarajevo, for talks with Bosnian Serb leader Radovan Karadzic."
An easy but successful HMM application:
  • Prior to 1997: no learning approach competitive with hand-built rule systems
  • Since 1997: statistical approaches (BBN (Bikel et al. 1997), NYU, MITRE, CMU/JustSystems) achieve state-of-the-art performance

NER database mining workflow
Document repository -> automatic clustering, automatic expert classification, event association analysis -> ontology, inference graph, knowledge map -> decision reference and decision recommendations.

Concept clustering
A document collection is described by frequent term sets over the terms sun, beach, surf, fun (for example {sun, beach}), giving candidate clusters C1 to C5.
Clustering: C1, C2, C4, C5.
Clustering description: surf, sun, beach, fun.
Example label: Anopheles.

Feedback as Model Interpolation
Concept C and document D produce results; the feedback documents are F = {d1, d2, ..., dn}. A feedback model is estimated from F (generative model or divergence minimization) and interpolated with the original model.
Interpolation weight = 0: no feedback. Interpolation weight = 1: full feedback.

Heterogeneous data
Many TDR sources; tens of thousands of historical records; large-scale analysis; document clustering; a case base of 1,000 solutions.
Example: Mooter (Scientific American, Chinese edition, March issue): clustering of document data.

Annotation and Tagging
Input: "On November 16, 2005, IBM announced it had acquired Collation, a privately held company based in Redwood City, California for undisclosed amount."
Annotations: Date, Acquiring Organization, Acquisition Event, Acquired Organization, Place, Amount.
Text annotator output (to RDBMS or as XML):
  Date          Organization   Place              Amount
  Nov. 16       IBM            Redwood City, CA   Undisclosed

Linguistic Concept Extraction from Customer Service Records
  • Bag-of-"words" extraction: Cstmr, ID, Customer, Yellow, Inc, Happy, Not, Switch, Cell, Phone
  • Expressions extraction: Cstmr, ID, Customer, Yellow Inc, switch, Cell Phone, Not happy
  • Named entities extraction: Customer (CRM term), Cstmr (?), Yellow Inc (telco company), Cell Phone (telco term), Not happy, Switch
  • Events/sentiment extraction: Customer (cstmr), cell phone, unhappy (negative); switch to (negative predicate) Yellow Inc (competition)
Combined with structured data -> decision making: churner -> special offer.
Layers: information retrieval, information extraction, knowledge inference.

Extracting Information From Text
Structuring knowledge from text: tagging, compounds, grammatical analysis, ontological interpretation, regular expressions, pattern recognition.
Text -> database, guided by an ontology. Minimal recursion semantics representations (Deep Thought EU project).

Knowledge Construction
Goal: extract prominent concepts and relations from text using tagging, compounds, NP recognition, term frequencies, stopwords, and language identification.
Domain document collection -> statistical & linguistic analyses, plus manual labor -> ontology.
Source: Brasethvik & Gulla, DKE, 38/1, 2001.

Patterns Construction
Sites (Taipei, Tokyo, New York) -> repository -> tagging & annotation -> CDW -> knowledge repository (or structured data) -> patterns -> patterns explorer (web browser).
Example concept graph: Products; Desktop computer; Laptop computers; Hard disk; Hard disk size 40 GB; Operating System; Windows XP; Linux; Macintosh. Relations: "is a", "crashes", "Installed from http:/."

Metadata for people, events, time, places, and objects
Entity groups: actors (people), conceptual objects, physical entities, temporal entities (events), places, time, resource index.
Relations: participate in; affect or refer to; refer to / refine; refer to / identify; location; at; within.
Layers: sources and metadata (XML/RDF); background knowledge / authorities; thesauri extent; CRM entities; ontology expansion; derived knowledge data (e.g. RDF). Schema: CIDOC CRM or DC.

Concept Lattice
Given the context (D1, T1, R), where D1 = {d1, d2, d3, d4} and T1 = {t1, t2, t3, t4, t5, t6}.
Table: the input relation R = documents x keywords
  R    t1  t2  t3  t4  t5  t6
  d1    1   0   1   0   1   1
  d2    1   0   1   0   1   1
  d3    0   1   0   1   0   0
  d4    1   0   0   1   0   1
Formal concepts (shown as a Hasse diagram):
  C1: (D1, {})
  C2: ({d1, d2, d4}, {t1, t6})
  C3: ({d3, d4}, {t4})
  C4: ({d1, d2}, {t1, t3, t5, t6})
  C5: ({d4}, {t1, t4, t6})
  C6: ({d3}, {t2, t4})
  C7: ({}, T1)
The formal concept C4 has two own terms, t3 and t5, and two inherited terms, t1 and t6.

CIDOC CRM example: explicit events, object identity, symmetry
  • E7 Activity "Crimea Conference": P11 participated in / P14 performed by three E39 Actors; P7 took place at E53 Place 7012124; P86 falls within E52 Time-Span February 1945
  • E65 Creation Event*: P94 has created E31 Document "Yalta Agreement"; P81 ongoing throughout / P82 at some time within E52 Time-Span 11-2-1945
  • E31 Document "Yalta Agreement": P67 is referred to by E38 Image

Rules Extraction
The formal concept C4 yields the following rules:
  R1: t3 -> t1 and t6
  R2: t5 -> t1 and t6
  R3: t3 <-> t5
Interpretation of R1 and R2: the use of term t3 or t5 is always associated with the use of terms t1 and t6.
Rule R3 expresses the mutual equivalence of t3 and t5: all documents that contain t3 also contain t5, and vice versa.

Literature knowledge groups
Experts and decisions; knowledge presentation; real-time clustering; real-time index; metadata of search results.

Official-document data: subsidies for low- and middle-income households
Cause-effect diagram for children without guardians:
  • Welfare in each county and city; establishment of trust funds
  • Status of children without guardians in each county and city
  • Intervention by county and city governments, social affairs bureaus, and others
  • Subsidies for single-parent families
  • Post-disaster reconstruction and related use of funds
  • Rules of the post-disaster reconstruction fund

Clustering example (survey responses about laundry detergent)
Well suited to machine washing; smells pleasant; strong stain removal; saves effort when washing; fresh scent; removes 99 kinds of stains; washes especially clean; smells pleasant; washes white socks cleanest; very fragrant; gentle on hands; removes stains very well; clothes do not fade easily; washing takes little effort; removes 99 kinds of stains; small amount needed; washes clean; less irritating to skin; cleans all kinds of stains well; washes clean; reasonably priced; washes clothes fairly well; smells good; have always used this brand; washed clothes come out whiter; smells pleasant; memorable advertising; washes clean; rinses out easily; fairly gentle on hands; washes clean; small amount needed; washes clean; needs less than other brands; heavily advertised; washes clean; small amount needed; good quality; small amount needed; washes clean; good packaging; lots of attractive advertising; smells pleasant; washes clean and white; good publicity, entertaining ads; many people say it is good.

Knowledge context
Knowledge map, event tracking, information retrieval, knowledge concepts.
Kuhn's Descriptive Project: immature science -> normal science -> anomalies -> crisis -> revolution.
Evolutionary theory is evolving.

Tasks in News Detection
News feeds -> segmentation -> detection (on-line, retro) -> tracking (might be relevant).
Examples: USS Cole, October 12, 2000; World Trade Center and Pentagon, September 11, 2001.
USS Cole record:
  Location: Aden, Yemen
  Date: October 12, 2000, 11:18 am (UTC+3)
  Attack type: suicide bombing
  Deaths: 19 (including the 2 perpetrators)
  Injured: 39
  Perpetrator(s): al-Qaeda, carried out by Ibrahim al-Thawr and Abdullah al-Misawa

Could 9/11 have been prevented?
  • FBI agents in Minnesota; Zacarias Moussaoui's personal computer
  • The FBI Phoenix memo (George Will)
  • Dr. Bhandari (Virtual Gold, Inc): data mining could have prevented the 9/11 tragedy
  • Terrorists; the 9/11 terrorist network

Red Army Faction (赤军旅)
  • Threat: Red Army Faction (RAF)
  • Horst Herold, head of Germany's Bundeskriminalamt, built a data mining information network (1972)
  • Data sources: housing sales, energy companies
  • Outcome: Rolf Heissler (RAF member)
  • Result: Herold was reported to have violated human rights and retired; the criminal statute was amended in 1986
  • Three of the 9/11 pilots came from Hamburg

Epidemic alert and response system
The World Health Organization set up its Epidemic Alert and Response system many years ago. Because some countries may play down reports of outbreaks out of concern for the economic impact, the WHO system includes software that collects relevant material from media websites around the world, and twenty experts analyze the information in that material.

Information and knowledge
HighW; Amazon; digital camera sales; news events; The Washington Times.
US National Institutes of Health (NIH) hot research topics: Proposals by Funding/Date across IRGs and Activity Types.
Clinical practice guidelines: Athena/EON, Stanford. Athena clinical guideline (R. D. Shankar, et al. 2001). Hypertension clinical guideline: Athena Hypertension Guideline (A. Advani, et al. 2003).

Disaster-affected households (financial assistance policy)
Loans (affected households, temporary housing); Generative vs. Discriminative.
Home reconstruction program; loans from financial institutions; Provisional Statute for Earthquake Reconstruction (震灾重建暂行条例); affected households; housing; interest; damage.
Representation: object, method, Object:attribute, Object:condition, Object:Attribute (condition); operations: Specify, Generalize.

Integrating Distributed Knowledge
  • Adaptive knowledge infrastructure is in place
  • Knowledge resources identified and shared appropriately
  • Timely knowledge gets to the right person to make decisions
  • Intelligent tools for authoring through archiving
  • Cohesive knowledge development between JPL, its partners, and customers
  • Instrument design is semi-automatic based on knowledge repositories
  • Mission software auto-instantiates based on unique mission parameters
  • KM principles are part of Lab culture and supported by layered COTS products
  • Remote data management allows spacecraft to self-command
  • Knowledge gathered anyplace from hand-held devices using standard formats on interplanetary Internet
  • Expert systems on spacecraft analyze and upload data
  • Autonomous agents operate across existing sensor and telemetry products
  • Industry and academia supply spacecraft parts based on collaborative designs derived from JPL's
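The Concept Lattice and Rules Extraction slides earlier in the deck can be checked mechanically. The following is a minimal sketch, not the author's implementation: it encodes the deck's documents-by-keywords relation R as a dictionary, enumerates the formal concepts by closing every subset of documents, and tests the three rules attached to C4. The helper names (`extent`, `intent`, `formal_concepts`, `implies`) are hypothetical, and the brute-force enumeration is only reasonable for a context this small.

```python
from itertools import combinations

# Input relation R from the Concept Lattice slide: document -> set of terms.
R = {
    "d1": {"t1", "t3", "t5", "t6"},
    "d2": {"t1", "t3", "t5", "t6"},
    "d3": {"t2", "t4"},
    "d4": {"t1", "t4", "t6"},
}
ALL_TERMS = {"t1", "t2", "t3", "t4", "t5", "t6"}


def extent(terms):
    """Documents that contain every term in `terms`."""
    return frozenset(d for d, ts in R.items() if terms <= ts)


def intent(docs):
    """Terms shared by every document in `docs` (all terms if `docs` is empty)."""
    shared = set(ALL_TERMS)
    for d in docs:
        shared &= R[d]
    return frozenset(shared)


def formal_concepts():
    """Enumerate (extent, intent) pairs by closing every subset of documents."""
    concepts = set()
    docs = sorted(R)
    for k in range(len(docs) + 1):
        for subset in combinations(docs, k):
            shared_terms = intent(subset)
            concepts.add((extent(shared_terms), shared_terms))
    return concepts


def implies(premise, conclusion):
    """True if every document containing `premise` also contains `conclusion`."""
    return extent(frozenset(premise)) <= extent(frozenset(conclusion))


concepts = formal_concepts()
# Print from the most general concept (largest extent) to the most specific.
for ext, itt in sorted(concepts, key=lambda c: (-len(c[0]), sorted(c[1]))):
    print(sorted(ext), sorted(itt))

print("R1 holds:", implies({"t3"}, {"t1", "t6"}))
print("R2 holds:", implies({"t5"}, {"t1", "t6"}))
print("R3 holds:", implies({"t3"}, {"t5"}) and implies({"t5"}, {"t3"}))
```

An implication is treated here as holding when the extent of the premise is contained in the extent of the conclusion, which matches the slide's reading of R1 to R3. For larger contexts a dedicated algorithm such as NextClosure would replace the subset enumeration.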

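The passage quoted on the Luhn's ideas slide combines two measurements: how often a word occurs in the article, and how closely significant words sit together within a sentence. A simplified sketch of that idea follows. The stop-word list, the `min_count` cutoff, the `max_gap` of four words, and the squared-count-over-span score are illustrative assumptions chosen for this example, not values taken from the slides.

```python
import re
from collections import Counter

# Illustrative stop-word list (an assumption; real systems use larger lists).
STOP_WORDS = {"the", "a", "an", "of", "in", "and", "to", "is", "it",
              "that", "for", "on", "with"}


def tokenize(text):
    """Lowercase word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def significant_words(text, min_count=2):
    """Words (excluding stop words) that occur at least `min_count` times."""
    counts = Counter(w for w in tokenize(text) if w not in STOP_WORDS)
    return {w for w, c in counts.items() if c >= min_count}


def sentence_score(sentence, significant, max_gap=4):
    """Best cluster score: (significant words in cluster)^2 / cluster length."""
    positions = [i for i, w in enumerate(tokenize(sentence)) if w in significant]
    if not positions:
        return 0.0
    best = 0.0
    start = 0
    for k in range(1, len(positions) + 1):
        # Close the cluster at the end, or when too many insignificant
        # words separate two neighbouring significant words.
        if k == len(positions) or positions[k] - positions[k - 1] - 1 > max_gap:
            n_sig = k - start
            span = positions[k - 1] - positions[start] + 1
            best = max(best, n_sig * n_sig / span)
            start = k
    return best


def summarize(text, n_sentences=1):
    """Return the top-scoring sentences in their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    significant = significant_words(text)
    ranked = sorted(sentences, key=lambda s: sentence_score(s, significant),
                    reverse=True)
    chosen = set(ranked[:n_sentences])
    return [s for s in sentences if s in chosen]
```

Calling `summarize(article_text, n_sentences=3)` on a plain-text article returns up to three sentences, kept in document order so the extract still reads naturally.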