School of Electrical and Information Engineering, Foreign Literature Translation
English title: Data Mining: Clustering
Translated title: 数据挖掘聚类分析 (Data Mining: Cluster Analysis)
Major: Automation    Name: *    Class and student number: *    Supervisor: *
Source: Data Mining, by Ian H. Witten and Eibe Frank

Clustering

5.1 Introduction

Clustering is similar to classification in that data are grouped. However, unlike classification, the groups are not predefined. Instead,
the grouping is accomplished by finding similarities between data according to characteristics found in the actual data. The groups are called clusters. Some authors view clustering as a special type of classification. In this text, however, we follow the more conventional view that the two are different. Many definitions of clusters have been proposed:
- A set of like elements; elements from different clusters are not alike.
- The distance between points in a cluster is less than the distance between a point in the cluster and any point outside it.

A term similar to clustering is database segmentation, where like tuples (records) in a database are grouped together. This is done to partition or segment the database into components that then give the user a more general view of the data. In this text, we do not differentiate between segmentation and clustering. A simple example of clustering is found in Example 5.1. This example illustrates the fact that determining how to do the clustering is not straightforward. As illustrated in Figure 5.1, a given set of data may be clustered on different attributes. Here a group of homes in a geographic area is shown. The first type of clustering is based on the location of the home: homes that are geographically close to each other are clustered together. In the second clustering, homes are grouped based on the size of the house.

Clustering has been used in many application domains, including biology, medicine, anthropology,
marketing, and economics. Clustering applications include plant and animal classification, disease classification, image processing, pattern recognition, and document retrieval. One of the first domains in which clustering was used was biological taxonomy. Recent uses include examining web log data
to detect usage patterns.

When clustering is applied to a real-world database, many interesting problems occur:
- Outlier handling is difficult. Here the elements do not naturally fall into any cluster; they can be viewed as solitary clusters. However, if a clustering algorithm attempts to find larger clusters, these outliers will be forced into some cluster. This process may result in the creation of poor clusters by combining two existing clusters and leaving the outlier in its own cluster.
- Dynamic data in the database implies that cluster membership may change over time.
- Interpreting the semantic meaning of each cluster may be difficult. With classification, the labeling of the classes is known ahead of time; with clustering, this may not be the case. Thus, when the clustering process finishes creating a set of clusters, the exact meaning of each cluster may not be obvious. This is where a domain expert is needed to assign a label or interpretation to each cluster.
- There is no one correct answer to a clustering problem. In fact, many answers may be found. The exact number of clusters required is not easy to determine. Again, a domain expert may be required.

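The last point, that the number of clusters is an input and many answers may be found, can be made concrete with a small sketch. This is an illustrative nearest-centroid loop in the spirit of the partitional algorithms discussed later in the chapter, not a method defined in this section; the data and function names are our own.

```python
# Illustrative sketch: the number of clusters k is an input value, and
# different choices of k yield different, equally plausible clusterings.
# Plain k-means-style loop on 1-D data; not an algorithm from this section.

def kmeans_1d(points, centroids, iters=10):
    """Repeatedly assign points to the nearest centroid, then recompute."""
    clusters = []
    for _ in range(iters):
        clusters = [[] for _ in centroids]
        for p in points:
            j = min(range(len(centroids)), key=lambda j: abs(p - centroids[j]))
            clusters[j].append(p)
        centroids = [sum(c) / len(c) if c else centroids[j]
                     for j, c in enumerate(clusters)]
    return clusters

data = [1.0, 1.2, 1.1, 5.0, 5.2, 9.0, 9.1]
print(kmeans_1d(data, centroids=[1.0, 9.0]))       # k = 2
print(kmeans_1d(data, centroids=[1.0, 5.0, 9.0]))  # k = 3
```

With k = 2 the middle group is absorbed into a neighboring cluster; with k = 3 the three natural groups emerge. Both partitions are defensible answers, which is exactly why a domain expert may be needed.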
For example, suppose we have a set of data about plants collected during a field trip. Without any prior knowledge of plant classification, if we attempted to divide this set of data into similar groupings, it would not be clear how many groups should be created.
- Another related issue is what data should be used for clustering. Unlike learning during a classification process, where there is some a priori knowledge concerning what the attributes of each class should be, in clustering we have no supervised learning to aid the process. Indeed, clustering can be viewed as
similar to unsupervised learning.

We can then summarize some basic features of clustering (as opposed to classification):
- The (best) number of clusters is not known.
- There may not be any a priori knowledge concerning the clusters.
- Cluster results are dynamic.

The clustering problem is stated as shown in Definition 5.1. Here we assume that the number of clusters to be created is an input value, k. The actual content (and interpretation) of each cluster Kj, 1 <= j <= k, is determined as a result of the function definition. Without loss of generality, we will view the result of solving a clustering problem as the creation of a set of clusters K = {K1, K2, ..., Kk}.

Definition 5.1. Given a database D = {t1, t2, ..., tn} of tuples and an integer value k, the clustering problem is to define a mapping f: D -> {K1, ..., Kk} where each ti is assigned to one cluster Kj, 1 <= j <= k. A cluster Kj contains precisely those tuples mapped to it; that is, Kj = {ti | f(ti) = Kj, 1 <= i <= n, and ti ∈ D}.

A classification of the different types of clustering algorithms is shown in Figure 5.2. Clustering algorithms themselves may be viewed as hierarchical or partitional. With hierarchical clustering, a nested set of clusters is created. Each level in the hierarchy has a separate set of clusters. At the lowest level, each item
is in its own unique cluster. At the highest level, all items belong to the same cluster. With hierarchical clustering, the desired number of clusters is not input. With partitional clustering, the algorithm creates only one set of clusters; these approaches use the desired number of clusters to drive how the final set is created. Traditional clustering algorithms tend to be targeted to small numeric databases that fit into memory. There are, however, more recent clustering algorithms that look at categorical data and are targeted to larger, perhaps dynamic, databases. Algorithms targeted to larger databases may adapt to memory constraints either by sampling the database or by using data structures that can be compressed or pruned to fit into memory regardless of the size of the database. Clustering algorithms may also differ based on whether they produce overlapping or nonoverlapping clusters. Even though we consider only nonoverlapping clusters, it is possible to place an item in multiple clusters. In turn, nonoverlapping clusters can be viewed as extrinsic or intrinsic. Extrinsic techniques use labeling of the items to assist in the classification process. These algorithms are the
traditional classification (supervised learning) algorithms, in which a special input training set is used. Intrinsic algorithms do not use any a priori category labels, but depend only on the adjacency matrix containing the distance between objects. All algorithms we examine in this chapter fall into the intrinsic class.

The types of clustering algorithms can be further classified based on the implementation technique used. Hierarchical algorithms can be categorized as agglomerative or divisive. "Agglomerative" implies that the clusters are created in a bottom-up fashion, while divisive algorithms work in a top-down fashion. Although both hierarchical and partitional algorithms could be described using the agglomerative vs. divisive labels, these terms are typically associated with hierarchical algorithms. Another descriptive tag indicates whether each individual element is handled one by one,
serial (sometimes called incremental), or whether all items are examined together, simultaneous. If a specific tuple is viewed as having attribute values for all attributes in the schema, then clustering algorithms may also differ in how the attribute values are examined. As is usually done with decision tree classification techniques, some algorithms examine attribute values one at a time, monothetic; polythetic algorithms consider all attribute values at once. Finally, clustering algorithms can be labeled based on the mathematical formulation given to the algorithm: graph theoretic or matrix algebra. In this chapter we generally use the graph approach and describe the input to the clustering algorithm as an adjacency matrix labeled with distance measures. We discuss many clustering algorithms in the following sections. This is only a representative subset of the many algorithms that
have been proposed in the literature. Before looking at these algorithms, we first examine possible similarity measures and the impact of outliers.

5.2 Similarity and Distance Measures

There are many desirable properties for the clusters created by a solution to a specific clustering problem.

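Section 5.1 closes by describing the input to these algorithms as an adjacency matrix labeled with a distance measure. A minimal sketch of constructing such a matrix, assuming Euclidean distance (the function names are ours, not from the text):

```python
# Sketch: the adjacency (distance) matrix that serves as input to the
# graph-based clustering algorithms described above. Euclidean distance
# is assumed here; any metric satisfying the triangle inequality works.
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def adjacency_matrix(tuples, dist=euclidean):
    """Entry [i][j] holds dist(t_i, t_j); the diagonal is zero."""
    n = len(tuples)
    return [[dist(tuples[i], tuples[j]) for j in range(n)] for i in range(n)]

points = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
m = adjacency_matrix(points)
print(m[0][1])  # 5.0
```

The matrix is symmetric with a zero diagonal, which is exactly the labeled-graph view used in this chapter.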
The most important one is that a tuple within one cluster is more like tuples within that cluster than it is like tuples outside it. As with classification, then, we assume the definition of a similarity measure, sim(ti, tl), defined between any two tuples ti, tl ∈ D. This provides a stricter, alternative clustering definition, as found in Definition 5.2. Unless otherwise stated, we use the first definition rather than the second. Keep in mind that the similarity relationship stated within the second definition is a desirable, although not always obtainable, property.

A distance measure, dis(ti, tl), as opposed to similarity, is often used in clustering. The clustering problem then has the desirable property that, given a cluster Km, for all tml, tmm ∈ Km and ts ∉ Km, dis(tml, tmm) <= dis(tml, ts). Some clustering algorithms look only at numeric data, usually assuming metric data points. Metric attributes satisfy the triangle inequality. A cluster can then be described by using several characteristic values. Given a cluster Km of N points {tm1, tm2, ..., tmN}, we make the following definitions [ZRL96]:

- Centroid: Cm = (sum over i of tmi) / N
- Radius: Rm = sqrt( (sum over i of (tmi - Cm)^2) / N )
- Diameter: Dm = sqrt( (sum over i,j of (tmi - tmj)^2) / (N (N - 1)) )

Here the centroid is the "middle" of the cluster; it need not be an actual point in the cluster. Some clustering algorithms alternatively assume that the cluster is represented by one centrally located object in the cluster called a medoid. The radius is the square root of the average mean squared distance from any point in the cluster to the centroid, and the diameter is the square root of the average mean squared distance between all pairs of points in the cluster. We use the notation Mm to indicate the medoid for cluster Km.

Many clustering algorithms require
that the distance between clusters (rather than elements) be determined. This is not an easy task, given that there are many interpretations for distance between clusters. Given clusters Ki and Kj, there are several standard alternatives for calculating the distance between the clusters. A representative list is:

- Single link: smallest distance between an element in one cluster and an element in the other. We thus have dis(Ki, Kj) = min(dis(til, tjm)) for all til ∈ Ki and all tjm ∈ Kj.
- Complete link: largest distance between an element in one cluster and an element in the other. We thus have dis(Ki, Kj) = max(dis(til, tjm)) for all til ∈ Ki and all tjm ∈ Kj.
- Average: average distance between an element in one cluster and an element in the other. We thus have dis(Ki, Kj) = mean(dis(til, tjm)) for all til ∈ Ki and all tjm ∈ Kj.
- Centroid: if clusters have representative centroids, then the centroid distance is defined as the distance between the centroids. We thus have dis(Ki, Kj) = dis(Ci, Cj), where Ci is the centroid for Ki and similarly for Cj.
- Medoid: using a medoid to represent each cluster, the distance between the clusters can be defined by the distance between the medoids: dis(Ki, Kj) = dis(Mi, Mj).

5.3 Outliers

As mentioned earlier, outliers are sample points with values much different from those of the remaining set of data. Outliers may represent errors in the data (perhaps a malfunctioning sensor recorded an incorrect data value) or could be correct data values that are simply much different from the remaining data. A person who is 2.5 meters tall is much taller than most people; in analyzing the height of individuals, this value probably would be viewed as an outlier.

Some clustering techniques do not perform well in the presence of outliers. This problem is illustrated in Figure 5.3. Here, if three clusters are found (solid line), the outlier will occur in a cluster by itself. However, if two clusters are found (dashed line), the two obviously different sets of data will be placed in one cluster because they are closer to each other than either is to the outlier. This problem is complicated by the fact that many clustering algorithms actually take as input the number of desired clusters to be found.

Clustering algorithms may actually find and remove outliers to ensure that they perform better. However, care must be taken in removing outliers. For example, suppose that the data mining problem is to predict flooding. Extremely high water-level values occur very infrequently, and when compared with the normal water-level values they may seem to be outliers. However, removing these values may keep the data mining algorithms from working effectively, because there would be no data showing that floods ever actually occurred.

Outlier detection, or outlier mining, is the process of identifying outliers in a set of data. Clustering, or other data mining, algorithms may then choose to remove these values or treat them differently. Some outlier detection techniques are based on statistical methods. These usually assume that the set of data follows a known distribution and that outliers can be detected by well-known tests such as discordancy tests. However, these tests are not very realistic for real-world data, because real-world data values may not follow well-defined distributions. Also, most of these tests assume a single attribute value, while many attributes are involved in real-world datasets. Alternative detection techniques may be based on distance measures.

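In the spirit of that last remark, here is a sketch of a simple distance-based outlier test. The thresholds, data, and function names are our own illustrative choices, not a technique defined in this chapter: a point is flagged as an outlier when too few other points lie within distance d of it.

```python
# Illustrative distance-based outlier test: flag a point as an outlier if
# fewer than `min_neighbors` other points lie within distance `d` of it.
# The thresholds below are arbitrary choices for the height example.
def distance_outliers(values, d, min_neighbors):
    outliers = []
    for i, v in enumerate(values):
        neighbors = sum(1 for j, w in enumerate(values)
                        if j != i and abs(w - v) <= d)
        if neighbors < min_neighbors:
            outliers.append(v)
    return outliers

heights = [1.65, 1.70, 1.72, 1.68, 1.75, 1.80, 2.50]  # meters; 2.50 is the text's example
print(distance_outliers(heights, d=0.20, min_neighbors=2))
```

Unlike a discordancy test, this requires no distributional assumption, which is why distance-based alternatives suit real-world data better.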

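Returning to the inter-cluster distance alternatives listed in Section 5.2, the single-, complete-, and average-link definitions can be sketched for one-dimensional points as follows (the data and names are illustrative, not from the text):

```python
# Sketch of three inter-cluster distance alternatives from Section 5.2,
# using absolute difference as the element-level distance on 1-D points.
def single_link(Ki, Kj):
    """Smallest distance between an element of Ki and an element of Kj."""
    return min(abs(a - b) for a in Ki for b in Kj)

def complete_link(Ki, Kj):
    """Largest distance between an element of Ki and an element of Kj."""
    return max(abs(a - b) for a in Ki for b in Kj)

def average_link(Ki, Kj):
    """Average distance over all cross-cluster element pairs."""
    return sum(abs(a - b) for a in Ki for b in Kj) / (len(Ki) * len(Kj))

Ki = [1.0, 2.0]
Kj = [5.0, 7.0]
print(single_link(Ki, Kj))    # 3.0  (2.0 vs 5.0)
print(complete_link(Ki, Kj))  # 6.0  (1.0 vs 7.0)
print(average_link(Ki, Kj))   # 4.5
```

The three measures disagree on the same pair of clusters, which is why the choice of linkage changes what a hierarchical algorithm produces.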