Data Mining, Chapter 8: Standards, Tools, and Trends

Chapter contents
8.1 Data Mining Standards and Specifications
8.2 Data Mining Tools
8.3 Research Trends in Data Mining

Basic requirement: understand the standards and specifications relevant to applied data mining, and the future research trends.

8.1 Data Mining Standards and Specifications

A data mining process model is key to keeping a data mining project on track. Typical process models include:
- SPSS's 5A model: Assess, Access, Analyze, Act, Automate
- SAS's SEMMA model: Sample, Explore, Modify, Model, Assess
- CRISP-DM (Cross Industry Standard Process for Data Mining), the cross-industry process standard
- The Two Crows data mining process model, which has much in common with CRISP-DM (then still being established)

Standards related to data mining

CRISP-DM
Cross Industry Standard Process for Data Mining. SPSS, NCR, and DaimlerChrysler, three companies with extensive data mining experience, founded a consortium to establish a standard for data mining methods and processes.

CRISP-DM phases and deliverables
- Project Objectives (Business Understanding): background; requirements, assumptions, constraints; terminology; data mining goals and success criteria; project plan
- Data Understanding: initial data collection report; data description report; data exploration report; data quality report
- Data Preparation: data description report; data pre-processing steps
- Modeling: modeling assumptions; test design; model description; model assessment (inc. validation)
- Evaluation: assessment of data mining results with respect to objectives
- Reporting (Deployment): final report, comprising a summary (objectives, data mining process, data mining results, data mining assessment), conclusions, and future work

CRISP-DM is:
- A widely accepted process model for data mining
- A framework for describing the modeling process in detail
- "Best practice"

Business Understanding Phase
- Understand the business objectives
  - What is the status quo?
  - Understand business processes
  - Associated costs/pain
- Define the success criteria
- Develop a glossary of terms: speak the language
- Cost/benefit analysis
- Current systems assessment
- Identify the key actors
  - Minimum: the Sponsor and the Key User
- What forms should the output take?
- Integration of output with existing technology landscape
- Understand market norms and standards
- Task decomposition
  - Break down the objective into sub-tasks
  - Map sub-tasks to data mining problem definitions
- Identify constraints
  - Resources
  - Law, e.g. data protection
- Build a project plan
  - List assumptions and risk factors (technical/financial/business/organisational)

Data Understanding Phase
- Collect data
  - What are the data sources?
  - Internal and external sources (e.g. Axiom, Experian)
  - Document reasons for inclusion/exclusion
  - Depend on a domain expert
  - Accessibility issues
  - Are there issues regarding data distribution across different databases/legacy systems?
  - Where are the disconnects?
- Data description
  - Document data quality issues
  - Compute basic statistics
- Data exploration
  - Simple univariate data plots/distributions
  - Investigate attribute interactions
- Data quality issues
  - Missing values: understand their source
  - Strange distributions

Data Preparation Phase
- Integrate data
  - Joining multiple data tables
  - Summarisation/aggregation of data
- Select data
  - Attribute subset selection
  - Rationale for inclusion/exclusion
  - Data sampling
  - Training/validation and test sets
- Data transformation
  - Using functions such as log
  - Factor/principal components analysis
  - Normalization/discretisation/binarisation
- Clean data
  - Handling missing values/outliers
- Data construction
  - Derived attributes

The Modeling Phase
- Build model
  - Choose initial parameter settings
  - Study model behaviour: sensitivity analysis
- Assess the model
  - Beware of over-fitting
  - Investigate the error distribution: identify segments of the state space where the model is less effective
  - Iteratively adjust parameter settings

The Evaluation Phase
- Validate model
  - Human evaluation of results by domain experts
  - Evaluate usefulness of results from a business perspective
  - Define control groups
  - Calculate lift curves
  - Expected return on investment
- Review process
- Determine next steps
  - Potential for deployment
  - Deployment architecture
  - Metrics for success of deployment

PMML
Predictive Model Markup Language. Data mining applications often need several kinds of data mining software and algorithms to work together, which requires that mined models can be readily inherited, reused, and integrated. The Data Mining Group (DMG) proposed PMML for this purpose.

The latest PMML version is 4.1. It supports 16 kinds of data mining model:
- AssociationModel (association rules)
- BaselineModel (baseline model)
- ClusteringModel (clustering)
- GeneralRegressionModel (general regression)
- MiningModel (composite model)
- NaiveBayesModel (naive Bayes)
- NearestNeighborModel (nearest neighbor)
- NeuralNetwork (neural network)
- RegressionModel (linear, polynomial, and logistic regression)
- RuleSetModel (rule set)
- SequenceModel (sequence patterns)
- Scorecard
- TimeSeriesModel
- SupportVectorMachineModel (support vector machine)
- TextModel (text model)
- TreeModel (decision tree)

A PMML model definition consists of the following parts.

Header: the header element contains general information about the PMML document, such as copyright information for the model, its description, and information about the application used to generate the model, such as name and version.
[Listing fragment: <PMML version="3.2" ...> ...]

Data dictionary: records information about the data fields from which the model was built.
[Listing fragment: <DataField name="Species" ...> ...]

Data transformations: transformations allow for the mapping of user data into a more desirable form to be used by the mining model. PMML defines several kinds of simple data transformations.
- Normalization: map values to numbers; the input can be continuous or discrete.
- Discretization: map continuous values to discrete values.
- Value mapping: map discrete values to discrete values.
- Functions (custom and built-in): derive a
  value by applying a function to one or more parameters.
- Aggregation: used to summarize or collect groups of values.

Model: contains the definition of the data mining model.
- Model name (attribute modelName)
- Algorithm name (attribute algorithmName)
- Number of layers (attribute numberOfLayers)

Mining schema: lists all fields used in the model.
- Name: must refer to a field in the data dictionary.
- Usage type: defines the way a field is to be used in the model. Typical values are active, predicted, and supplementary. Predicted fields are those whose values are predicted by the model.
- Outlier treatment: defines the outlier treatment to be used.
- Missing value replacement policy: if this attribute is specified, a missing value is automatically replaced by the given value.
- Missing value treatment: indicates how the missing value replacement was derived.

Targets: allow for
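post-processing of the predicted value in the form of scaling if the output of the model is continuous.

PMML Example: Association Rule

Transactions:
- t1: Cracker, Coke, Water
- t2: Cracker, Water
- t3: Cracker, Water
- t4: Cracker, Coke, Water

[Listing: PMML for these transactions, annotated with model attributes, items, item sets, and association rules]

JDM
Java Data Mining API. JDM aims to provide a standard API for accessing data mining tools. It supports building and applying data mining models, and creating, storing, accessing, and maintaining data and metadata, so that Java applications can easily integrate data mining technology.

Semantic Web standards
In his talk at the XML 2000 conference, Tim Berners-Lee first presented the layered model of the Semantic Web, the "Layer Cake". Its defining feature is that it builds ontologies and logical inference rules on top of XML and RDF/RDFS, enabling semantics-based knowledge representation and reasoning that computers can understand and process.

- Layer 1: Unicode and URI (Uniform Resource Identifier). In 1993 Unicode became an international standard of the International Organization for Standardization (ISO), ISO/IEC 10646; its goal is a single unified encoding for all of the world's scripts. A URI has three parts: the naming scheme used to access the resource, the name of the machine hosting the resource, and the name of the resource in path form. Unlike a URL, a URI is only an identifier and does not directly provide a way to access the resource.
- Layer 2: XML (eXtensible Markup Language). XML is simple, self-describing, and extensible, and it separates content, structure, and presentation, which makes it well suited to data representation and exchange. The constraints in XML Schema are used mainly to validate the structure of XML documents.
- Layer 3: RDF (Resource Description Framework), the metadata layer. RDF is a framework for describing and exchanging metadata, built on XML. It describes objects as "Resource, Property, Property Value" triples.
  [Figures: an RDF example]
- Layer 4: RDF-S (RDF Schema). RDF-S extends RDF. It is RDF's Vocabulary Description Language, used to define the vocabulary that appears in RDF resource descriptions.
- Layer 5: Ontology and Rule, the domain knowledge layer. OWL explicitly represents the terms of a vocabulary and the relationships among them, and offers greater expressive power for vocabulary and semantics. Rules describe the premises and conclusions in domain knowledge.

SPARQL (SPARQL Protocol and RDF Query Language) is the W3C-recommended language and protocol for querying RDF data.

8.2 Data Mining Tools

Free open-source data mining software and applications
- GATE: a natural language processing and language engineering tool.
- Orange: a component-based data mining and machine learning software suite written in the Python language.
- R: a programming language and software environment for statistical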
  computing, data mining, and graphics.
- RapidMiner: an environment for machine learning and data mining experiments.
- UIMA: the Unstructured Information Management Architecture, a component framework for analyzing unstructured content such as text, audio, and video, originally developed by IBM.
- Weka: a suite of machine learning software applications written in the Java programming language.

Commercial data-mining software and applications
- IBM SPSS Modeler: data mining software provided by IBM.
- Microsoft Analysis Services: data mining software provided by Microsoft.
- Oracle Data Mining: data mining software by Oracle.
- SAS Enterprise Miner: data mining software provided by the SAS Institute.
- STATISTICA Data Miner: data mining software provided by StatSoft.

WEKA: Waikato Environment for Knowledge Analysis
- It is a data mining/machine learning tool developed by the Department of Computer
  Science, University of Waikato, New Zealand.
- Weka is also a bird found only on the islands of New Zealand.

Download and install WEKA
- Website: cs.waikato.ac.nz/ml/weka/index.html
- Supports multiple platforms (written in Java): Windows, Mac OS X, and Linux

Main features
- 49 data preprocessing tools
- 76 classification/regression algorithms
- 8 clustering algorithms
- 3 algorithms for finding association rules
- 15 attribute/subset evaluators + 10 search algorithms for feature selection

Main GUI
- "The Explorer" (exploratory data analysis)
- "The Experimenter" (experimental environment)
- "The KnowledgeFlow" (new process-model-inspired interface)

WEKA only deals with "flat files":

@relation heart-disease-simplified
@attribute age numeric
@attribute se
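
The flat-file layout WEKA reads is its ARFF format: a header of @relation and @attribute declarations followed by comma-separated rows under @data. The sketch below is a minimal reader for that layout, written only to illustrate the structure. It is not WEKA's own loader, and the function name `parse_arff`, the `risk` attribute, and the two data rows are hypothetical; only the relation name and the `age numeric` attribute come from the slide. Quoted values, sparse rows, and date attributes are deliberately not handled.

```python
def parse_arff(text):
    """Parse a minimal ARFF string into (relation, attributes, rows).

    Handles only @relation, @attribute, and comma-separated @data rows.
    Quoted values, sparse rows, and date attributes are not supported.
    """
    relation = None
    attributes = []   # (name, type) pairs in declaration order
    rows = []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):   # skip blank lines and comments
            continue
        if in_data:
            values = [v.strip() for v in line.split(",")]
            row = {}
            for (name, kind), value in zip(attributes, values):
                if value == "?":               # ARFF marks a missing value with '?'
                    row[name] = None
                elif kind.lower() in ("numeric", "real", "integer"):
                    row[name] = float(value)
                else:                          # nominal values are kept as strings
                    row[name] = value
            rows.append(row)
            continue
        keyword, _, rest = line.partition(" ")
        keyword = keyword.lower()
        if keyword == "@relation":
            relation = rest.strip()
        elif keyword == "@attribute":
            name, _, kind = rest.strip().partition(" ")
            attributes.append((name, kind.strip()))
        elif keyword == "@data":
            in_data = True
    return relation, attributes, rows


# Hypothetical sample: the 'risk' attribute and both data rows are invented
# for illustration and do not come from the original slide.
SAMPLE = """\
% a comment line
@relation heart-disease-simplified
@attribute age numeric
@attribute risk {low, high}
@data
63, high
?, low
"""

relation, attributes, rows = parse_arff(SAMPLE)
print(relation)
print(attributes)
print(rows)
```

Each row comes back as a dictionary keyed by attribute name, with numeric attributes converted to floats and missing values mapped to `None`, which is one simple way to mirror the "one table, one row per instance" shape that a flat file implies.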