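Enhancing Business Intelligence Applications through Knowledge Mining
-- Advanced Value-Added Applications of Statistical Analysis
From Knowledge Mining to Business Intelligence - Advanced Statistics Application

Dr. 谢邦昌
- Chair Professor and doctoral supervisor, Xiamen University
- Chair Professor and doctoral supervisor, Capital University of Economics and Business
- Chair Professor and doctoral supervisor, Central University of Finance and Economics
- Chair Professor, Southwestern University of Finance and Economics
- Adjunct Professor, Renmin University of China
- Professor, Department of Statistics and Information Science and Graduate Institute of Applied Statistics, Fu Jen Catholic University
- President, 中华资料采矿协会 (Chinese Data Mining Association)

Outline
- Knowledge mining (integrating data mining and text mining) and the development of business intelligence
- Knowledge mining procedures, steps, outputs, and applications
- How to integrate data mining and text mining
- A review of technical developments in knowledge mining

The value of knowledge retention
- Reduce: cycle time, response time, duplicated investment, operating costs, meeting time, outside consultants, and so on
- Increase: productivity and quality, transfer of enterprise knowledge, fast and effective decisions, training courses, innovation, collective effort, and so on

Retention and transfer of enterprise knowledge
- Investment in knowledge assets
- Downsizing and retirement
- Staff turnover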
Diagram labels: productivity; capability; duplicated effort; too many meetings; communication problems; organizational goals; issuing decisions; feasibility; fast; informal

Why is knowledge so urgent?
"The chief economic priority for developed countries is to raise the productivity of knowledge ... The country that does this first will dominate the twenty-first century economically."
-- Peter F. Drucker
(Slide gloss: the foremost economic goal of developed countries is the productivity of knowledge; whoever masters it first will lead the twenty-first-century economy.)

From data to knowledge: the formation process
- Raw Data -> Integration -> Data Warehouse
- Data Warehouse -> Selection/cleansing -> Target Data
- Target Data -> Preprocessing -> Preprocessed Data
- Preprocessed Data -> Transformation -> Transformed Data
- Transformed Data -> Data Mining -> Pattern
- Pattern -> Interpretation/Evaluation -> Knowledge
- Additional label: Understanding

BI architecture
- Data Sources: Operational DBs, Other sources
- Extract, Transform, Load, Refresh
- Complete Data Warehouse, with metadata and a Monitor & Integrator
- Data Marts
- OLAP Server (Serve)
- Tools: 1. Comprehensive Performance Management 2. Analysis 3. Query 4. Reports 5. Data mining
- Business Intelligence

Data mining
- Techniques: rule induction, neural networks, tree generators, support vector machine, regression, COBWEB, expectation maximization, k-means, rough sets, apriori, granular computing, trend functions
- Classification: categorize your customers or clients
- Prediction: forecast future sales or usage
- Segmentation: group similar customers or clients
- Association: discover products that are purchased together
- Sequence: find patterns and trends over time

Gaining market intelligence from news feeds (Sreekumar Sukumaran and Ashish Sureka)

Integrated BI Systems (Sreekumar Sukumaran and Ashish Sureka)
- Structural Data (DBMS, File System, XML, EA, Legacy) -> ETL -> Complete Data Warehouse
- Unstructured Data (CMS, Scanned Documents, Email) -> ETL -> Text tagger & Annotator -> Intermediate Data (RDBMS, XML)

Knowledge sources and value
"On average, professional users spend 11 hours per week looking for information. Seventy-one percent said they could not find what they were looking for."
-- "Information Management Software"
Lazard Freres & Co. LLC
February 2001

"The volume of digitized information will double every year from 2000 to 2005 (an increase to 30 times today's volume)."
-- "Knowledge Management vs. Information Management"
Gartner Group
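September 2000

Sources: web messages, news reports, patents, email, documents … literature

The literature problem
- Publication statistics: 8 TB (books), 25 TB (news), 20 TB (magazines), 2 TB (journals)
- Scientific knowledge grows by an average of 2,000 pages per minute
- Reading the new material would take 5 years (at 24 hrs/day)
- How Can I Keep Up With the Literature?

Evolution
"To study history one must know in advance that one is attempting something fundamentally impossible, yet necessary and highly important."
Father Jacobus (Hesse's Magister Ludi), Das Glasperlenspiel (The Glass Bead Game)

Document knowledge discovery and management techniques
- Retrieval, document filtering, classification, summarization, clustering
- Natural-language content analysis, extraction, mining, visualization
- Extraction applications, mining applications
- Information access, knowledge cognition, information structure, knowledge generation

Text processing pipeline (diagram)
- Raw text -> Tokenized text -> Stemming & Stopwords -> Term Weighting
- Term similarity, Doc similarity, Vector centroid
- Outputs: clustering, classification, META-DATA/ANNOTATION
- Sentence selection -> summary
- Term-document weight matrix (rows are documents d1…dm, columns are terms t1…tn):

        t1    t2    …    tn
  d1    w11   w12   …    w1n
  d2    w21   w22   …    w2n
  …     …     …          …
  dm    wm1   wm2   …    wmn

Text ETL to Mining
Unstructured Data (original call record):
  Call Taker: James
  Date: Aug. 30, 2002
  Duration: 10 min.
  Customer ID: ADC00123
  Q: cust sys has stopped working.
  A: checked cust bios and it need updated. ……
Structured Data (after processing):
  [Call Taker] James
  [Date] 2002/08/30
  [Duration] 10 min.
  [Customer ID] ADC00123
  [Noun] Customer
  [Software] BIOS
  [Subj...Verb] customer system .. stop
  [SW..Problem] BIOS .. need
- Stages: Original Data, Meta Data, Linguistic Analysis (Tagging, Dependency Analysis, Named Entity Extraction, Intention Analysis), Category Dictionary, Synonym Dictionary, Category Item, Mining, Visualization & Interactive Mining
- IBM TAKMI (Nasukawa, Nagano, 1999)
- Mining target: individual text
- Mining unit: texts > category-labeled items extracted from text using NLP

Text is Tough
- Text is an abstract concept that is extremely hard to represent (AI-Complete)
- It is an endless combination of abstract, complex relations among many concepts
- One term can stand for many different concepts: CELL, IV
- Similar concepts can be expressed in many ways (aliases): spaceship, flying saucer, UFO, figment of imagination
- Concepts are hard to visualize
- High dimensionality: the analysis may involve hundreds or thousands of dimensions

Text Mining is Easy
- Text is highly redundant
- A few simple algorithms can produce decent results from very crude processing:
  - find important phrases
  - find meaningful related words
  - build summaries from articles
- Main problem: evaluating results; the goals and objectives must be defined

Traditional IR-based Extraction (diagram)
- Inputs: doc vector 1 … doc vector n, profile vector
- Scoring produces a score; test: Score > threshold?
- yes -> accepted docs; no -> rejected docs
- Judgments feed vector learning and threshold learning, guided by a utility function
- Ontology, Vector initialization, Threshold initialization
- Reuse retrieval algorithms; new threshold algorithms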
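The filtering loop in the diagram above (score each document vector against a profile vector, then accept or reject by threshold) can be sketched as follows. This is a minimal illustration, not the method from the slide: the slide only says "scoring", so cosine similarity is an assumption, the threshold is fixed rather than learned, and the names `cosine`, `filter_docs`, `profile`, and `docs` as well as the sample weights are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def filter_docs(profile, docs, threshold):
    """Split documents into accepted and rejected by score > threshold."""
    accepted, rejected = [], []
    for name, vector in docs.items():
        if cosine(profile, vector) > threshold:
            accepted.append(name)
        else:
            rejected.append(name)
    return accepted, rejected

# Hypothetical profile and document vectors (term -> weight).
profile = {"bios": 1.0, "update": 1.0}
docs = {
    "call_log": {"bios": 2.0, "update": 1.0, "customer": 1.0},
    "beach_trip": {"sun": 1.0, "beach": 1.0},
}

accepted, rejected = filter_docs(profile, docs, threshold=0.5)
print("accepted:", accepted)
print("rejected:", rejected)
```

The vector learning and threshold learning boxes in the diagram would update `profile` and `threshold` from relevance judgments; that part is left out of this sketch.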
Diagram data stores: Text-DB, Lexicons

Luhn's ideas
"It is here proposed that the frequency of word occurrence in an article furnishes a useful measurement of word significance. It is further proposed that the relative position within a sentence of words having given values of significance furnish a useful measurement for determining the significance of sentences. The significance factor of a sentence will therefore be based on a combination of these two measurements."

Information extraction
-Job2
Job Title: Ice Cream Guru
Employer:
Job Category: Travel/Hospitality
Job Function: Food Services
Job Location: Upper Midwest
Contact Phone: 800-488-2611
Date Extracted: January 8, 2001
Source: /jobs_midwest.html
Other Company Jobs: -Job1

Information Extraction
Given:
- Source of textual documents
- Well defined limited query (text based)
Find:
- Sentences with relevant information
- Extract the relevant information and ignore non-relevant information (important!)
- Link related information and output in a predetermined format

Sample source document (job posting)
Advisory Programmer - Oracle (Austin, TX)
Response Code: 1008-0074-97-iexc-jcn
Responsibilities: This is an exciting opportunity with Siemens Wireless Terminals; a start-up venture fully capitalized by a Global Leader in Advanced Technologies. Qualified candidates will: Responsible for assisting with requirements definition, analysis, design and implementation that meet objectives, codes difficult and sophisticated routines. Develops project plans, schedules and cost data. Develop test plans and implement physical design of databases. Develop shell scripts for administrative and background tasks, stored procedures and triggers. Using Oracles Designer 2000, assist with Data Model maintenance and assist with applications development using Oracle Forms.
Qualifications: BSCS, BSMIS or closely related field or related equivalent knowledge normally obtained through technical education programs. 5-8 years of professional experience in development, system design analysis, programming, installation using Oracle development …

Automatic Pattern-Learning Systems
Pros:
- Portable across domains
- Tend to have broad coverage
- Robust in the face of degraded input
- Automatically find appropriate statistical patterns
- System knowledge not needed by those who supply the domain knowledge
Cons:
- Annotated training data, and lots of it, is needed
- Isn't necessarily better or cheaper than a hand-built solution
Examples: Riloff et al., AutoSlog, Soderland WHISK (UMass); Mooney et al. Rapier (UTexas); Ciravegna (Sheffield). These learn lexicon-syntactic patterns from templates.
Diagram: Language Input + Answers -> Trainer -> Model; Language Input -> Decoder (using the Model) -> Answers

Text Analysis Spectrum
- Entity Extraction
- Targeted Facts and Events
- Classification
- Clustering
- Concept Identification
- What is this document about?
- Who did what to whom when where, etc.

Why is getting dimensional data so hard?
- Example: "Hank bought plastic explosives from Henry in Tucson yesterday."
- Named Entity Extraction: People, Weapons, Vehicles, Dates
- NER Engine output: Hank, Henry, Plastic explosives, Tucson, 11/01/07
- FrameNet

Name Extraction via HMMs
- Text, or Speech -> Speech Recognition, feeds the Extractor, which outputs Entities
- NE Models: Locations, Persons, Organizations
- Training Program: training sentences + answers
- Example sentence (shown both as input and as tagged output): "The delegation, which included the commander of the U.N. troops in Bosnia, Lt. Gen. Sir Michael Rose, went to the Serb stronghold of Pale, near Sarajevo, for talks with Bosnian Serb leader Radovan Karadzic."

NER: an easy but successful HMM application
- Prior to 1997: no learning approach competitive with hand-built rule systems
- Since 1997: statistical approaches (BBN (Bikel et al. 1997), NYU, MITRE, CMU/JustSystems) achieve state-of-the-art performance

Database mining workflow
- Decision reference, decision recommendations
- Automatic clustering, automatic/expert classification, event association analysis
- Document repository, knowledge ontology, inference graph, knowledge map

Concept clustering
- document, Document Collection
- Frequent term sets: {sun}, {beach}, {surf}, {fun}, {sun, beach}
- Clusters: C1, C2, C3, C4, C5
- Clustering: {C1, C2, C4, C5}
- Clustering Description: {surf, sun, beach, fun}
- Example label: Anopheles

Feedback as Model Interpolation
- Concept C, Document D, Results
- Feedback Docs F = {d1, d2, …, dn}
- Generative model; Divergence minimization
- Interpolation weight = 0: no feedback; = 1: full feedback

Heterogeneous data
- TDR (repeated five times in the diagram)
- Tens of thousands of historical records; massive-scale analysis
- Document clustering: case base of 1000 solutions
- Mooter; 科学人 magazine, March issue
- Document data clustering

Annotation and Tagging
Source sentence: "On November 16, 2005, IBM announced it had acquired Collation, a privately held company based in Redwood City, California for undisclosed amount."
Annotation types: Date, Acquiring Organization, Acquisition Event, Acquired Organization, Place, Amount
Text Annotator, output to RDBMS:

  Date      Organization   Place              Amount
  Nov. 16   IBM            Redwood City, CA   Undisclosed

XML output:
On <Date>November 16, 2005</Date>, <ACQUIRINGORG>IBM</ACQUIRINGORG> announced it had <ACQUISITIONEVENT>acquired</ACQUISITIONEVENT> <ACQUIREDORG>Collation</ACQUIREDORG>, a privately held company based in <PLACE>Redwood City, California</PLACE> for <AMOUNT>undisclosed</AMOUNT> amount.

Linguistic Concept Extraction from Customer Service Records
- Bag of "Words" extraction: Cstmr, ID, Customer, Yellow, Inc, Happy, Not, Switch, Cell, Phone
- Expressions extraction: Cstmr ID, Customer, Yellow Inc, switch, Cell Phone, Not happy
- Named Entities extraction: Customer (CRM term), Cstmr (?), Yellow Inc (Telco Company), Cell Phone (Telco Term), Not happy, Switch
- Events/Sentiment Extraction: Customer (cstmr) cell phone unhappy (Negative); Switch to (Negative Predicate) yellow inc (Competition)
- Combined with structured data -> Decision Making: Churner -> Special Offer
- Layers: Information Retrieval, Information Extraction, Knowledge Inference

Extracting Information From Text
- Structuring knowledge from text: tagging, compounds, grammatical analysis, ontological interpretation, regular expressions, pattern recognition
- Text -> Database, Ontology
- Minimal recursion semantics representations [DeepThought EU project]

Knowledge Construction
- Want to extract prominent concepts/relations from text: tagging, compounds, NP recognition, term frequencies, stopwords, language identification [Brasethvik & Gulla, DKE, 38/1, 2001]
- Domain doc. coll. -> Statistical & linguistic analyses -> Manual labor -> Ontology

Patterns Construction
- Sites: Taipei, Tokyo, New York
- Repository -> Tagging & annotation -> CDW / Knowledge Repository (or structured data) -> Patterns -> Patterns Explorer
- Example concept graph labels: Web Browser, Hard disk, Windows XP, Desktop computer, Hard disk size 40GB, Products, Laptop computers, Operating System, Linux, Macintosh; relations: is a, crashes, Installed from http://...

Metadata for people, events, time, places, and objects
- Entities: persons, nature, Conceptual Objects, Physical Entities, Temporal Entities, application, place, time
- Relations: participate in; affect or / refer to; refer to / refine; refer to / identifie; location; at; within
- Resource index: person, event, object
- Derived knowledge data (e.g. RDF); Thesauri extent; CRM entities; Ontology expansion
- Sources and metadata (XML/RDF); Background knowledge / Authorities; CIDOC CRM or DC

Concept Lattice
Given the context (D1, T1) where D1 = {d1, d2, d3, d4} and T1 = {t1, t2, t3, t4, t5, t6}.

Table: The input relation R = documents × keywords

  R    t1  t2  t3  t4  t5  t6
  d1   1   0   1   0   1   1
  d2   1   0   1   0   1   1
  d3   0   1   0   1   0   0
  d4   1   0   0   1   0   1

Formal concepts (Hasse Diagram):
- C1: (D1, Ø)
- C2: ({d1, d2, d4}, {t1, t6})
- C3: ({d3, d4}, {t4})
- C4: ({d1, d2}, {t1, t3, t5, t6})
- C5: ({d4}, {t1, t4, t6})
- C6: ({d3}, {t2, t4})
- C7: (Ø, T1)
The formal concept C4 has two own terms {t3, t5} and two inherited terms {t1, t6}.

CIDOC CRM example: Explicit Events, Object Identity, Symmetry
- Entities: E31 Document "Yalta Agreement"; E7 Activity "Crimea Conference"; E65 Creation Event *; E38 Image; E39 Actor (three instances); E53 Place 7012124; E52 Time-Span February 1945; E52 Time-Span 11-2-1945
- Properties: P14 performed; P11 participated in; P94 has created; P86 falls within; P7 took place at; P67 is referred to by; P81 ongoing throughout; P82 at some time within

Rules Extraction
The formal concept C4 makes the following rules possible:
- R1: t3 -> t1, t6
- R2: t5 -> t1, t6
- R3: t3 <-> t5
Interpretation of R1 and R2: the use of term t3 or t5 is always associated with that of terms t1 and t6.
Rule R3 expresses the mutual equivalence of the terms {t3, t5}: all the documents which have the term t3 also have the term t5.

Literature knowledge groups
- Experts and decisions; knowledge presentation
- Real-time clustering; Real-time Index; Metadata of Searching Results

Official-document data
- Subsidies for low- and middle-income households
- Causal diagram: children without family support
- County and city welfare; establishment of trust funds
- Status of children without family support in each county and city
- Involvement of county and city governments, social affairs bureaus, and others
- Post-disaster reconstruction and related fund use within subsidies for single-parent families
- Rules of the post-disaster reconstruction fund

Clustering example (open-ended survey responses about a laundry detergent)
- well suited to machine washing; pleasant scent; strong stain removal
- saves effort when washing; fresh scent; removes 99 kinds of stains
- washes especially clean; pleasant scent; washes white socks cleanest
- very fragrant; gentle on hands; removes stains well
- clothes do not fade easily; washing takes little effort; removes 99 kinds of stains
- needs only a small amount; washes clean; low skin irritation
- cleans all kinds of stains; washes clean; reasonable price
- better washing results; nice smell; have always used this brand
- laundry comes out whiter; pleasant smell; memorable advertising
- washes clean; rinses out easily; fairly gentle on hands
- washes clean; needs only a small amount
- washes clean; uses less than other brands; heavily advertised
- washes clean; needs only a small amount; good quality
- needs only a small amount; washes clean; good packaging
- lots of attractive advertising; pleasant scent; washes clean and white
- good publicity, entertaining ads; many people say it is good

Knowledge context
- Knowledge map; event tracking; information retrieval; knowledge concepts

Kuhn's Descriptive Project
- Immature Science -> Normal Science -> Anomalies -> Crisis -> Revolution
- Evolutionary theory is evolving

Tasks in News Detection
- News Feeds, Detection, Segmentation, On-Line, Retro, Tracking, Might be Relevant

Event examples
- USS Cole, October 12, 2000
- World Trade Center and the Pentagon, September 11, 2001

USS Cole
Location: Aden, Yemen
Date: October 12, 2000
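11:18 am (UTC+3)
Attack type: suicide bombing
Deaths: 19 (including the 2 perpetrators)
Injured: 39
Perpetrator(s): al-Qaeda, carried out by Ibrahim al-Thawr and Abdullah al-Misawa

The 9/11 attacks were preventable
- FBI agents in Minnesota: Zacarias Moussaoui's personal computer
- FBI Phoenix memo (George Will)
- Dr. Bhandari (Virtual Gold, Inc): data mining could have prevented the 9/11 tragedy
- Terrorists; the 9/11 terrorist network

Red Army Faction
- Threat: Red Army Faction (RAF)
- Horst Herold (head of the German federal police) built a data-mining information network at Germany's Bundeskriminalamt, 1972
- Data sources: housing sales, energy companies …
- Outcome: Rolf Heissler (RAF member)
- Aftermath: Herold was reported to have violated human rights and retired; the criminal code was amended in 1986
- Three of the 9/11 pilots came from Hamburg

Epidemic alert and reporting system
The World Health Organization established its "Epidemic Alert and Response" system many years ago. Because some countries may play down reports of an outbreak out of concern for the economic impact, the system includes software that collects relevant material from media websites in each country, and twenty experts analyze the information in that material.

Information and knowledge
- Amazon: digital camera sales
- News events: The Washington Times
- US National Institutes of Health (NIH) hot research topics: Proposals by Funding/Date across IRGs and Activity Types

Disease diagnosis and treatment guidelines
- Athena/EON - Stanford
- Athena clinical guidelines: R.D. Shankar, et al. 2001
- Hypertension clinical guideline: Athena Hypertension Guideline, A. Advani, et al. 2003

Disaster-affected households (financial assistance policy)
- Loans (affected households, temporary housing)
- Generative / Discriminative
- Home reconstruction program; loans from financial institutions; provisional statute for earthquake reconstruction
- Terms: affected households, housing, interest, damage
- Model labels: object, method, Object: attribute, Object: condition, Object: Attribute (condition)
- Specify / Generalize

Integrating Distributed Knowledge
- Adaptive knowledge infrastructure is in place
- Knowledge resources identified and shared appropriately
- Timely knowledge gets to the right person to make decisions
- Intelligent tools for authoring through archiving
- Cohesive knowledge development between JPL, its partners, and customers
- Instrument design is semi-automatic based on knowledge repositories
- Mission software auto-instantiates based on unique mission parameters
- KM principles are part of Lab culture and supported by layered COTS products
- Remote data management allows spacecraft to self-command
- Knowledge gathered anyplace from hand-held devices using standard formats on interplanetary Internet
- Expert systems on spacecraft analyze and upload data
- Autonomous agents operate across existing sensor and telemetry products
- Industry and academia supply spacecraft parts based on collaborative designs derived from JPL's knowledge system

Stages: Capturing Knowledge, Sharing Knowledge, Integrating Distributed Knowledge, Modeling Expert Knowledge
Missions: MarsNet, Europa Orbiter, Space Interferometry Mission
- Enables capture of knowledge at the point of origin, human or robotic, without invasive technology
- Enables seamless integration of systems throughout the world and with robotic spacecraft
- Enables sharing of essential knowledge to complete Agency tasks

Modeling Expert Knowledge
Systems model expert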
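
Supplementary sketch for the Concept Lattice and Rules Extraction slides earlier in the deck: the seven formal concepts and the rules R1 to R3 can be recomputed from the input relation R. The relation below is copied from that slide; the brute-force enumeration and the function names `extent`, `intent`, `concepts`, and `implies` are illustrative assumptions, not the algorithm used by the author.

```python
from itertools import combinations

# Relation R from the Concept Lattice slide: document -> terms it contains.
R = {
    "d1": {"t1", "t3", "t5", "t6"},
    "d2": {"t1", "t3", "t5", "t6"},
    "d3": {"t2", "t4"},
    "d4": {"t1", "t4", "t6"},
}
TERMS = frozenset().union(*R.values())

def extent(terms):
    """Documents that contain every term in `terms`."""
    terms = frozenset(terms)
    return frozenset(d for d, doc_terms in R.items() if terms <= doc_terms)

def intent(docs):
    """Terms shared by every document in `docs` (all terms if `docs` is empty)."""
    if not docs:
        return TERMS
    return frozenset(set.intersection(*(R[d] for d in docs)))

def concepts():
    """Enumerate all formal concepts by closing every subset of terms."""
    found = set()
    for size in range(len(TERMS) + 1):
        for subset in combinations(sorted(TERMS), size):
            docs = extent(subset)
            found.add((docs, intent(docs)))
    return found

def implies(a, b):
    """True if every document containing term a also contains term b."""
    return extent({a}) <= extent({b})

for docs, terms in sorted(concepts(), key=lambda c: (-len(c[0]), sorted(c[1]))):
    print(sorted(docs), sorted(terms))

print("R1:", implies("t3", "t1") and implies("t3", "t6"))
print("R2:", implies("t5", "t1") and implies("t5", "t6"))
print("R3:", implies("t3", "t5") and implies("t5", "t3"))
```

Enumerating every subset of terms is exponential and only reasonable for a toy context like this one with six terms.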