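Graduation Project (Thesis): Translation of Foreign-Language Literature

Major: Computer Science and Technology
Student name:
Class:
Student ID:
Supervisor:
Boya College

Chinese Translation

Introduction to Data Mining Technology

Abstract: Microsoft SQL Server 2005 provides an integrated environment for creating and working with data mining models. This tutorial uses four scenarios (targeted mailing, forecasting, market basket analysis, and sequence clustering) to demonstrate how to use the mining model algorithms, the mining model viewers, and the data mining tools.

Introduction

The data mining tutorial is designed to walk you through the process of creating data mining models in Microsoft SQL Server 2005. The data mining algorithms and tools in SQL Server 2005 make it easy to build a comprehensive solution for a variety of projects, including market basket analysis, forecasting analysis, and targeted mailing analysis. The scenarios for these solutions are explained in more detail later in the tutorial.

The most visible components of SQL Server 2005 are the workspaces used to create and work with data mining models. The online analytical processing (OLAP) and data mining tools are consolidated into two working environments: Business Intelligence Development Studio and SQL Server Management Studio. With Business Intelligence Development Studio, you can develop an Analysis Services project while disconnected from the server. When the project is ready, you can deploy it to the server. You can also work directly against the server. The main function of SQL Server Management Studio is to manage the server. Each environment is described in detail later. For more information on choosing between the two environments, see "Choosing Between SQL Server Management Studio and Business Intelligence Development Studio" in SQL Server Books Online.

All of the data mining tools reside in the data mining editor. Using the editor, you can manage mining models, create new models, view models, compare models, and create predictions based on existing models.

After you build a mining model, you will want to explore it, looking for interesting patterns and rules. Each mining model viewer in the editor is customized to explore models built with a specific algorithm. For more information about the viewers, see "Viewing a Data Mining Model" in SQL Server Books Online.

A project will often contain several mining models, so before you can use a model to create predictions, you need to be able to determine which model is the most accurate. For this reason, the editor contains a model comparison tool, the Mining Accuracy Chart tab. Using this tool, you can compare the predictive accuracy of your models and determine the best one.

To create predictions, you use the Data Mining Extensions (DMX) language. DMX extends traditional SQL syntax with commands to create, modify, and predict against mining models. For more information about DMX, see "Data Mining Extensions (DMX) Reference" in SQL Server Books Online. Because creating a prediction can be complicated, the data mining editor includes a tool called Prediction Query Builder, which lets you build DMX queries in a graphical interface. You can also view the DMX statements that the tool generates.

Just as important as the tools described above is the structure of the data mining model itself. The key to building a mining model is the data mining algorithm, which finds patterns in the data that you give it and translates them into a usable mining model.

Some of the most important steps in building a data mining solution are consolidating and preparing the data used to build the mining models. SQL Server 2005 includes a DTS working environment and a set of DTS tools for cleaning, validating, and preparing data. For more information about DTS, see "DTS Data Mining Tasks and Transformations" in SQL Server Books Online.

Adventure Works Database

The AdventureWorksDW database is based on a fictional bicycle manufacturing company named Adventure Works Cycles (AW for short). AW produces metal and composite bicycles and sells them to commercial markets in North America, Europe, and Asia. Its base of operations is in Bothell, Washington, with 500 employees, and several regional sales teams are located throughout its markets.

AW sells its products wholesale and retail over the Internet. The data mining examples in this tutorial use these Internet sales data. For more information about the AW database, see "Sample Databases and Business Scenarios" in SQL Server Books Online.

Database Details

The Internet sales schema contains information about 9,242 customers. These customers live in six countries, which are combined into three regions:

- North America (83%)
- Europe (12%)
- Australia (7%)

The database contains data for three fiscal years: 2002, 2003, and 2004. The products in the database are classified by subcategory, model, and product.

Business Intelligence Development Studio

Business Intelligence Development Studio is a set of tools for creating business intelligence projects. Because it was built as an IDE, you can create a complete solution while working offline. You can change your data mining objects as much as you want, but the changes are not reflected on the server until you deploy the project.

An Analysis Services (SSAS) database integrates several technologies and serves as the foundation for mining models, OLAP, and related technologies. You can use Business Intelligence Development Studio to create and modify an SSAS project and deploy it to one or more SSAS servers. If you are developing an SSAS project, you can also use Business Intelligence Development Studio to connect directly to the database, so that your changes take effect in the database immediately.

SQL Server Management Studio

SQL Server Management Studio is a collection of administrative and scripting tools for working with Microsoft SQL Server components. This workspace differs from Business Intelligence Development Studio in that you work in a connected environment, where actions are propagated to the server as soon as you save your work.

After the data has been cleaned and prepared for data mining, most of the tasks associated with creating a data mining solution are performed in Business Intelligence Development Studio. Using its tools, you develop and test the data mining solution through an iterative process that determines which models work best for a given situation. Once the developer is satisfied with the solution, it is deployed to an Analysis Services server. From that point, the focus shifts from development to maintenance and use, and therefore to SQL Server Management Studio. In SQL Server Management Studio, you can administer your databases and perform some of the same functions as in Business Intelligence Development Studio, such as viewing mining models and creating predictions from them.

Data Transformation Services

Data Transformation Services (DTS) in SQL Server 2005 comprises the extract, transform, and load (ETL) tools. These tools can be used to perform some of the most important tasks in data mining: cleaning and preparing the data for model creation. In data mining, you typically perform repetitive data transformations to clean the data and then use the data to train a mining model. Using the tasks and transformations in DTS, you can combine data preparation and model creation into a single DTS package.

DTS also provides DTS Designer to help you easily build and run packages that contain all of the tasks and transformations. Using DTS Designer, you can deploy packages to a server and run them on a regular schedule. This is useful if, for example, you collect data weekly and want to perform the same cleaning transformations automatically each time.

You can work with a Data Transformation project and an Analysis Services project together as part of a business intelligence solution by adding each project to a solution in Business Intelligence Development Studio.

Mining Model Algorithms

Data mining algorithms are the foundation on which mining models are created. The variety of algorithms in SQL Server 2005 lets you perform many types of analysis. For more information about the algorithms and how to adjust their parameters, see "Data Mining Algorithms" in SQL Server Books Online.

Decision Trees

The Decision Trees algorithm supports both classification and regression, and it works well for predictive modeling. Using the algorithm, you can predict both discrete and continuous attributes.

In building a model, the algorithm examines how each input attribute in the dataset affects the result of the predicted attribute, and then uses the input attributes with the strongest relationships to create a series of splits, called nodes. As new nodes are added to the model, a tree structure begins to form. The top node of the tree describes the breakdown of the predicted attribute over the overall population. Each additional node is created based on the distribution of states of the predicted attribute as compared to the input attributes. If an input attribute is seen to cause the predicted attribute to favor one state over another, a new node is added to the model. The model continues to grow until none of the remaining attributes creates a split that provides an improved prediction over the existing node. The model seeks a combination of attributes and their states that creates a disproportionate distribution of states in the predicted attribute, which allows you to predict the outcome of the predicted attribute.

Clustering

The Clustering algorithm uses iterative techniques to group records from a dataset into clusters with similar characteristics. Using these clusters, you can explore the data and learn more about the relationships that exist, which may not be easy to derive through casual observation. You can also create predictions from the clustering model that the algorithm creates.

For example, consider a group of people who live in the same neighborhood, drive the same kind of car, eat the same kind of food, and buy a similar version of a product. This is a cluster of data. Another cluster may include people who go to the same restaurants, have similar salaries, and vacation twice a year outside the country. By observing how these clusters are distributed, you can better understand how the outcome of a predicted attribute is affected.

Naive Bayes

The Naive Bayes algorithm quickly builds mining models that can be used for classification and prediction. It calculates the probability of each possible state of each input attribute, given each state of the predictable attribute. These probabilities can later be used to predict an outcome of the predicted attribute based on the known input attributes. The probabilities used to generate the model are calculated and stored during the processing of the cube. The algorithm supports only discrete or discretized attributes, and it considers all input attributes to be independent.

The Naive Bayes algorithm produces a simple mining model that can be considered a starting point in the data mining process. Because most of the calculations used in creating the model are generated during cube processing, results are returned quickly. This makes the model a good option for exploring the data and for discovering how the various input attributes are distributed across the different states of the predicted attribute.

Time Series

The Microsoft Time Series algorithm creates models that can be used to predict continuous variables over time from both OLAP and relational data sources. For example, you can use the Microsoft Time Series algorithm to predict sales and profits based on the historical data in a cube.

Using the algorithm, you can choose one or more variables to predict, but they must be continuous. You can have only one case series for each model. The case series identifies the location in a series, such as the date when looking at sales over a period of several months or years.

A case may contain a set of variables (for example, sales at different stores). The Microsoft Time Series algorithm can use cross-variable correlations in its predictions. For example, prior sales at one store may be useful in predicting current sales at another store.

Neural Network

In Microsoft SQL Server 2005 Analysis Services, the Microsoft Neural Network algorithm creates classification and regression mining models by constructing a multilayer perceptron network of neurons. Similar to the Microsoft Decision Trees algorithm provider, the algorithm calculates probabilities for each possible state of the input attributes, given each state of the predictable attribute. The algorithm provider processes the entire set of cases, iteratively comparing the predicted classification of the cases with their known actual classification. The errors from the initial classification in the first iteration over the entire set of cases are fed back into the network and used to modify the network's performance for the next iteration, and so on. You can later use these probabilities to predict an outcome of the predicted attribute, based on the input attributes. One of the primary differences between this algorithm and the Microsoft Decision Trees algorithm, however, is that its learning process optimizes the network parameters toward minimizing the error, whereas the Microsoft Decision Trees algorithm splits rules in order to maximize information gain. The algorithm supports the prediction of both discrete and continuous attributes.

Linear Regression

The Linear Regression algorithm is a particular configuration of the Decision Trees algorithm, obtained by disabling splits (the whole regression formula is built in a single root node). The algorithm supports the prediction of continuous attributes.

Logistic Regression

The Logistic Regression algorithm is a particular configuration of the Neural Network algorithm, obtained by eliminating the hidden layer. The algorithm supports the prediction of both discrete and continuous attributes.

English Original

Introduction to Data Mining

Abstract: Microsoft SQL Server 2005 provides an integrated environment for creating and working with data mining models. This tutorial uses four scenarios, targeted mailing, forecasting, market basket, and sequence clustering, to demonstrate how to use the mining model algorithms, mining model viewers, and data mining tools that are included in this release of SQL Server.

Introduction

The data mining tutorial is designed to walk you through the process of creating data mining models in Microsoft SQL Server 2005. The data mining algorithms and tools in SQL Server 2005 make it easy to build a comprehensive solution for a variety of projects, including market basket analysis, forecasting analysis, and targeted mailing analysis. The scenarios for these solutions are explained in greater detail later in the tutorial.

The most visible components in SQL Server 2005 are the workspaces that you use to create and work with data mining models. The online analytical processing (OLAP) and data mining tools are consolidated into two working environments: Business Intelligence Development Studio and SQL Server Management Studio. Using Business Intelligence Development Studio, you can develop an Analysis Services project disconnected from the server. When the project is ready, you can deploy it to the server. You can also work directly against the server. The main function of SQL Server Management Studio is to manage the server. Each environment is described in more detail later in this introduction.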
For more information on choosing between the two environments, see Choosing Between SQL Server Management Studio and Business Intelligence Development Studio in SQL Server Books Online.

All of the data mining tools exist in the data mining editor. Using the editor, you can manage mining models, create new models, view models, compare models, and create predictions based on existing models.

After you build a mining model, you will want to explore it, looking for interesting patterns and rules. Each mining model viewer in the editor is customized to explore models built with a specific algorithm. For more information about the viewers, see Viewing a Data Mining Model in SQL Server Books Online.

Often your project will contain several mining models, so before you can use a model to create predictions, you need to be able to determine which model is the most accurate. For this reason, the editor contains a model comparison tool called the Mining Accuracy Chart tab. Using this tool, you can compare the predictive accuracy of your models and determine the best model.

To create predictions, you will use the Data Mining Extensions (DMX) language. DMX extends SQL, containing commands to create, modify, and predict against mining models. For more information about DMX, see Data Mining Extensions (DMX) Reference in SQL Server Books Online. Because creating a prediction can be complicated, the data mining editor contains a tool called Prediction Query Builder, which allows you to build queries using a graphical interface. You can also view the DMX code that is generated by the query builder.

Just as important as the tools that you use to work with and create data mining models are the mechanics by which they are created. The key to creating a mining model is the data mining algorithm. The algorithm finds patterns in the data that you pass it and translates them into a mining model; it is the engine behind the process.
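The model comparison that the Mining Accuracy Chart tab supports can be pictured as scoring every candidate model on the same held-out cases and keeping the one that predicts best. The following is a minimal Python sketch of that idea only, not the tool's implementation: the helper names, the two rule-based "models", the "bike buyer" flag, and the sample cases are all hypothetical and chosen purely for illustration.

```python
from typing import Callable, Dict, List, Tuple

Case = Dict[str, float]            # input attributes of one case
Model = Callable[[Case], int]      # returns the predicted state (0 or 1)


def accuracy(model: Model, cases: List[Tuple[Case, int]]) -> float:
    """Fraction of held-out cases whose prediction matches the known state."""
    correct = sum(1 for inputs, actual in cases if model(inputs) == actual)
    return correct / len(cases)


def best_model(models: Dict[str, Model],
               cases: List[Tuple[Case, int]]) -> Tuple[str, Dict[str, float]]:
    """Score every model on the same cases and return the most accurate one."""
    scores = {name: accuracy(model, cases) for name, model in models.items()}
    winner = max(scores, key=scores.get)
    return winner, scores


# Hypothetical held-out cases: (input attributes, known "bike buyer" flag).
HOLDOUT = [
    ({"age": 34, "commute_km": 2}, 1),
    ({"age": 61, "commute_km": 15}, 0),
    ({"age": 45, "commute_km": 3}, 1),
    ({"age": 29, "commute_km": 20}, 0),
]

# Two hypothetical stand-ins for trained mining models.
MODELS = {
    "by_commute": lambda c: 1 if c["commute_km"] < 5 else 0,
    "by_age": lambda c: 1 if c["age"] < 40 else 0,
}

if __name__ == "__main__":
    winner, scores = best_model(MODELS, HOLDOUT)
    print(winner, scores)
```

Scoring all models against identical cases is what makes the comparison fair; a real accuracy chart would typically plot results rather than reduce each model to a single number.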
Some of the most important steps in creating a data mining solution are consolidating, cleaning, and preparing the data to be used to create the mining models. SQL Server 2005 includes the Data Transformation Services (DTS) working environment, which contains tools that you can use to clean, validate, and prepare your data. For more information on using DTS in conjunction with a data mining solution, see DTS Data Mining Tasks and Transformations in SQL Server Books Online.

In order to demonstrate the SQL Server data mining features, this tutorial uses a new sample database called AdventureWorksDW. The database is included with SQL Server 2005, and it supports OLAP and data mining functionality. To make the sample database available, you need to select it at installation time in the “Advanced” dialog for component selection.

Adventure Works

AdventureWorksDW is based on a fictional bicycle manufacturing company named Adventure Works Cycles. Adventure Works produces and distributes metal and composite bicycles to North American, European, and Asian commercial markets. The base of operations is located in Bothell, Washington, with 500 employees, and several regional sales teams are located throughout their market base.

Adventure Works sells products wholesale to specialty shops and to individuals through the Internet. For the data mining exercises, you will work with the AdventureWorksDW Internet sales tables, which contain realistic patterns that work well for data mining exercises.

For more information on Adventure Works Cycles, see Sample Databases and Business Scenarios in SQL Server Books Online.

Database Details

The Internet sales schema contains information about 9,242 customers. These customers live in six countries, which are combined into three regions:

- North America (83%)
- Europe (12%)
- Australia (7%)

The database contains data for three fiscal years: 2002, 2003, and 2004.
The products in the database are broken down by subcategory, model, and product.

Business Intelligence Development Studio

Business Intelligence Development Studio is a set of tools designed for creating business intelligence projects. Because Business Intelligence Development Studio was created as an IDE in which you can create a complete solution, you work disconnected from the server. You can change your data mining objects as much as you want, but the changes are not reflected on the server until after you deploy the project.

Working in an IDE is beneficial for the following reasons:

The Analysis Services project is the entry point for a business intelligence solution. An Analysis Services project encapsulates mining models and OLAP cubes, along with supplemental objects that make up the Analysis Services database. From Business Intelligence Development Studio, you can create and edit Analysis Services objects within a project and deploy the project to the appropriate Analysis Services server or servers.

If you are working with an existing Analysis Services project, you can also use Business Intelligence Development Studio to work connected to the server. In this way, changes are reflected directly on the server without having to deploy the solution.

SQL Server Management Studio

SQL Server Management Studio is a collection of administrative and scripting tools for working with Microsoft SQL Server components. This workspace differs from Business Intelligence Development Studio in that you are working in a connected environment where actions are propagated to the server as soon as you save your work.

After the data has been cleaned and prepared for data mining, most of the tasks associated with creating a data mining solution are performed within Business Intelligence Development Studio.
Using the Business Intelligence Development Studio tools, you develop and test the data mining solution, using an iterative process to determine which models work best for a given situation. When the developer is satisfied with the solution, it is deployed to an Analysis Services server. From this point, the focus shifts from development to maintenance and use, and thus to SQL Server Management Studio. Using SQL Server Management Studio, you can administer your database and perform some of the same functions as in Business Intelligence Development Studio, such as viewing mining models and creating predictions from them.

Data Transformation Services

Data Transformation Services (DTS) comprises the Extract, Transform, and Load (ETL) tools in SQL Server 2005. These tools can be used to perform some of the most important tasks in data mining: cleaning and preparing the data for model creation. In data mining, you typically perform repetitive data transformations to clean the data before using the data to train a mining model. Using the tasks and transformations in DTS, you can combine data preparation and model creation into a single DTS package.

DTS also provides DTS Designer to help you easily build and run packages containing all of the tasks and transformations. Using DTS Designer, you can deploy the packages to a server and run them on a regularly scheduled basis. This is useful if, for example, you collect data weekly and want to perform the same cleaning transformations each time in an automated fashion.

You can work with a Data Transformation project and an Analysis Services project together as part of a business intelligence solution by adding each project to a solution in Business Intelligence Development Studio.

Mining Model Algorithms

Data mining algorithms are the foundation from which mining models are created. The variety of algorithms included in SQL Server 2005 allows you to perform many types of analysis.
For more specific information about the algorithms and how they can be adjusted using parameters, see Data Mining Algorithms in SQL Server Books Online.

Microsoft Decision Trees

The Microsoft Decision Trees algorithm supports both classification and regression, and it works well for predictive modeling. Using the algorithm, you can predict both discrete and continuous attributes.

In building a model, the algorithm examines how each input attribute in the dataset affects the result of the predicted attribute, and then it uses the input attributes with the strongest relationship to create a series of splits, called nodes. As new nodes are added to the model, a tree structure begins to form. The top node of the tree describes the breakdown of the predicted attribute over the overall population. Each additional node is created based on the distribution of states of the predicted attribute as compared to the input attributes. If an input attribute is seen to cause the predicted attribute to favor one state over another, a new node is added to the model. The model continues to grow until none of the remaining attributes create a split that provides an improved prediction over the existing node. The model seeks to find a combination of attributes and their states that creates a disproportionate distribution of states in the predicted attribute, therefore allowing you to predict the outcome of the predicted attribute.

Microsoft Clustering

The Microsoft Clustering algorithm uses iterative techniques to group records from a dataset into clusters containing similar characteristics. Using these clusters, you can explore the data, learning more about the relationships that exist, which may not be easy to derive logically through casual observation. Additionally, you can create predictions from the clustering model created by the algorithm.
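The iterative grouping described above can be sketched with a generic k-means loop: assign each record to its nearest cluster center, move each center to the mean of its members, and repeat until the assignments stop changing. This is one common iterative clustering technique, shown here only as an illustration; it is an assumption, not a statement of how the Microsoft Clustering algorithm is implemented. The customer attributes (yearly income in thousands, trips abroad per year) and the starting centers are hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def squared_distance(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two records."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(points: Sequence[Point], centers: Sequence[Point]) -> List[int]:
    """For every record, return the index of its nearest center."""
    return [
        min(range(len(centers)), key=lambda k: squared_distance(p, centers[k]))
        for p in points
    ]


def recompute(points: Sequence[Point], labels: Sequence[int],
              centers: Sequence[Point]) -> List[Point]:
    """Move each center to the mean of its members (unchanged if it has none)."""
    new_centers: List[Point] = []
    for k, old in enumerate(centers):
        members = [p for p, label in zip(points, labels) if label == k]
        if members:
            new_centers.append(
                tuple(sum(dim) / len(members) for dim in zip(*members)))
        else:
            new_centers.append(old)
    return new_centers


def k_means(points: Sequence[Point], centers: Sequence[Point],
            max_iter: int = 100) -> Tuple[List[int], List[Point]]:
    """Alternate assignment and center updates until the labels are stable."""
    centers = list(centers)
    labels = assign(points, centers)
    for _ in range(max_iter):
        centers = recompute(points, labels, centers)
        new_labels = assign(points, centers)
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centers


# Hypothetical records: (yearly income in thousands, trips abroad per year).
CUSTOMERS = [(30.0, 0.0), (32.0, 1.0), (28.0, 0.0),
             (90.0, 2.0), (95.0, 2.0), (88.0, 3.0)]

if __name__ == "__main__":
    labels, centers = k_means(CUSTOMERS, [CUSTOMERS[0], CUSTOMERS[3]])
    print(labels, centers)
```

Starting centers are passed in explicitly so the run is repeatable; in practice they are often chosen at random, which can change the clusters found.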
For example, consider a group of people who live in the same neighborhood, drive the same kind of car, eat the same kind of food, and buy a similar version of a product. This is a cluster of data. Another cluster may include people who go to the same restaurants, have similar salaries, and vacation twice a year outside the country. Observing how these clusters are distributed, you can better understand how the records in a dataset interact, as well as how that interaction affects the outcome of a predicted attribute.

Microsoft Naive Bayes

The Microsoft Naive Bayes algorithm quickly builds mining models that can be used for classification and prediction. It calculates probabilities for each possible state of the input attribute, given each state of the predictable attribute, which can later be used to predict an outcome of the predicted attribute based on the known input attributes. The probabilities used to generate the model are calculated and stored during the processing of the cube. The algorithm supports only discrete or discretized attributes, and it considers all input attributes to be independent.

The Microsoft Naive Bayes algorithm produces a simple mining model that can be considered a starting point in the data mining process. Because most of the calculations used in creating the model are generated during cube processing, results are returned quickly. This makes the model a good option for exploring the data and for discovering how various input attributes are distributed in the different states of the predicted attribute.

Microsoft Time Series

The Microsoft Time Series algorithm creates models that can be used to predict continuous variables over time from both OLAP and relational data sources. For example, you can use the Microsoft Time Series algorithm to predict sales and profits based on the historical data in a cube.

Using the algorithm, you can choose one or more variables to predict, but they must be continuous.
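You can have only one case series for each model. The case series identifies the location in a series, such as the date when looking at sales over a length of several months or years. A case may contain a set of variables (for example, sales at different stores). The Microsoft Time Series algorithm can use cross-variable correlations in its predictions. For example, prior sales at one store may be useful in predicting current sales at another store.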
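The idea that one variable in a case can help forecast another, such as sales at different stores, can be illustrated with a small lagged regression: fit a straight line that relates one store's sales in a period to another store's sales in the preceding period, then apply it to the latest observation. This is a deliberately simple stand-in and an assumption for illustration, not the method the Microsoft Time Series algorithm uses. The store names, sales figures, and function names are hypothetical.

```python
from typing import Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def forecast_next(leader: Sequence[float], follower: Sequence[float]) -> float:
    """Forecast the follower's next value from the leader's latest value.

    The line is fitted on pairs (leader[t-1], follower[t]), so it captures
    how the follower tends to respond one period after the leader.
    """
    intercept, slope = fit_line(leader[:-1], follower[1:])
    return intercept + slope * leader[-1]


# Hypothetical monthly sales for two stores over the same five months.
STORE_A = [10.0, 12.0, 14.0, 13.0, 16.0]
STORE_B = [18.0, 21.0, 25.0, 29.0, 27.0]

if __name__ == "__main__":
    print(forecast_next(STORE_A, STORE_B))
```

Both series must be continuous and aligned on the same periods, which mirrors the requirement that the predicted variables be continuous and share one case series.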