Implementing Text Clustering by Compound Evolutionary Algorithm

QIAO Yingying, SONG Wei*
(School of Internet of Things Engineering, Jiangnan University)

Foundations: National Natural Science Foundation of China (61103129); Natural Science Foundation of Jiangsu Province (SBK201122266); the Specialized Research Fund for the Doctoral Program of Higher Education (20100093120004)
Brief author introduction: QIAO Yingying (1987-), female, graduate student; research interest: text clustering.
Corresponding author: SONG Wei (1981-), male, associate professor; research interests: machine learning, information retrieval.

Abstract: To overcome the premature convergence and the limited optimization ability of any single clustering algorithm, this paper proposes a new clustering algorithm combining GA and QPSO for high-dimensional text clustering. The robust global optimization performance of GA is used to initialize the particles of QPSO, so the searching efficiency and clustering performance of QPSO are improved by giving full play to the advantages of both algorithms. We conducted experiments on the 20Newsgroups dataset, and the results showed that our method outperforms GA or QPSO alone and achieves high text clustering accuracy and effectiveness as measured by precision, recall and F-measure, with tolerable time consumption.

Key words: text clustering; GA clustering; QPSO clustering; PSO clustering

0 Introduction

Clustering of text documents plays a vital role in efficient document organization, summarization, topic extraction and information retrieval [1]. It organizes a text set according to the nature of the text content and partitions the entire collection into several classes, so that documents belonging to the same class are as similar to each other as possible, while documents in different classes are as dissimilar as possible. Because it needs no training process and no manual annotation of objects, text clustering offers flexibility and a high degree of automatic processing.

Clustering high-dimensional datasets is a contemporary challenge [2]. With the development of science and technology and the broadening scope of engineering problems, text datasets are usually composed of tens of thousands of keywords. The high dimensionality of text data therefore makes many existing clustering algorithms fail to satisfy the requirements of information retrieval systems. Moreover, improving a single algorithm has its own limitations; in particular, premature convergence and poor local optimization ability are common faults of almost all stochastic optimization algorithms. Combining the advantages of different swarm intelligence algorithms to improve optimization performance has thus become a research hotspot in the optimization field.

The genetic algorithm (GA), a typical random search and optimization algorithm, was first proposed by Professor J. H. Holland in the 1970s [3]. It imitates the process of biological evolution and inheritance, following Darwin's principle of "survival of the fittest" [4], and realizes optimization through the reproductive evolution of advantageous individuals in a population. However, its local search ability is weak and it tends to converge prematurely. Quantum-behaved particle swarm optimization (QPSO) is a swarm intelligence optimization algorithm proposed by Sun J. et al. [5]. The inspiration for QPSO came from quantum mechanics and from trajectory analysis of individual particle behavior in the particle swarm optimization (PSO) algorithm. Since PSO has been proved not to be a globally convergent algorithm, QPSO was proposed as a variant of PSO to improve the global search ability of classical PSO.

In this paper, we propose a compound method that combines GA and QPSO (GA-QPSO) for text clustering. In this compound method, the robust global search ability of GA is used to initialize the particles of QPSO, which improves the robustness of the QPSO algorithm.

The remainder of this paper is organized as follows: Section 1 provides a general overview of the GA and QPSO clustering algorithms. The GA-QPSO clustering algorithm is described in Section 2. Section 3 presents how each document is represented for clustering and how the similarity between documents is computed. Section 4 provides the detailed experimental setup and compares our approach with other clustering algorithms, i.e. GA, PSO, and QPSO. The conclusion is given in Section 5.

1 Clustering algorithm

1.1 GA clustering

The genetic algorithm (GA), introduced by Professor J. H. Holland in the 1970s, is an effective optimization method based on the principles of natural selection and genetics. In GAs, the parameters of the search space are encoded in the form of strings called chromosomes. A collection of chromosomes is called a population, and a randomly distributed population is created first, like different points scattered over the search space [6]. The process of solving the problem is represented as the evolution of the chromosomes. Starting from the initial population, a number of individuals are selected by the selection strategy according to the fitness value associated with each chromosome; selection, crossover and mutation operators are then applied to produce the next generation. Evolution continues generation by generation until the termination condition is satisfied.

In the text clustering problem, a chromosome can be represented as

$$X_i = (M_{i1}, \ldots, M_{ij}, \ldots, M_{ik}) \qquad (1)$$

where $M_{ij}$ is the $j$-th cluster centroid vector of the $i$-th chromosome.
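To make Eq. (1) concrete, here is a minimal sketch (not the authors' implementation) in which each chromosome is a k x n array of centroid vectors evolved with roulette-wheel selection, one-point crossover on centroid boundaries and mutation; the population size, crossover and mutation rates, and the cosine-sum fitness (anticipating Eq. (5)) are illustrative assumptions.

```python
import numpy as np

def cos_sim(a, b):
    """Cosine similarity between two term-weight vectors."""
    na, nb = np.linalg.norm(a), np.linalg.norm(b)
    return 0.0 if na == 0.0 or nb == 0.0 else float(a @ b) / (na * nb)

def fitness(chrom, docs):
    """Sum over all documents of the cosine similarity to the closest
    centroid of the chromosome (the quantity later written as Eq. (5))."""
    return sum(max(cos_sim(d, m) for m in chrom) for d in docs)

def ga_clustering(docs, k, pop_size=20, gens=50, pc=0.8, pm=0.1, seed=0):
    """Toy GA over chromosomes X_i = (M_i1, ..., M_ik): each chromosome is a
    (k, n) array of centroid vectors, docs is an (n_docs, n) term-weight
    matrix.  All parameter values here are illustrative assumptions."""
    rng = np.random.default_rng(seed)
    n_docs = len(docs)
    # Initialize every chromosome with k distinct documents as its centroids.
    pop = [docs[rng.choice(n_docs, size=k, replace=False)] for _ in range(pop_size)]
    for _ in range(gens):
        fits = np.array([fitness(c, docs) for c in pop])
        new_pop = [pop[int(np.argmax(fits))].copy()]        # keep the elite chromosome
        probs = fits / fits.sum()
        while len(new_pop) < pop_size:
            i, j = rng.choice(pop_size, size=2, p=probs)     # roulette-wheel selection
            child = pop[i].copy()
            if k > 1 and rng.random() < pc:                  # one-point crossover on a centroid boundary
                cut = int(rng.integers(1, k))
                child[cut:] = pop[j][cut:]
            if rng.random() < pm:                            # mutation: re-seed one centroid from a random document
                child[int(rng.integers(k))] = docs[int(rng.integers(n_docs))]
            new_pop.append(child)
        pop = new_pop
    return max(pop, key=lambda c: fitness(c, docs))          # best chromosome = k centroid vectors
```

The best chromosome returned here plays the role of the rough partition that Step 2 of the compound algorithm in Section 2 uses to initialize the QPSO particles.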

1.2 QPSO clustering

Quantum-behaved particle swarm optimization (QPSO), inspired by concepts from quantum mechanics and by PSO, is a probabilistic global optimization algorithm. A particle in QPSO is assumed to be in a bound state, attracted by a quantum potential well centered on its local attractor, and therefore has a new stochastic update equation for its position [7]. It has been proved that this iterative equation makes QPSO globally convergent [8]. The particles move according to the following iterative equations:

$$p_{id} = \varphi \, P_{id} + (1 - \varphi) \, P_{gd}, \qquad \varphi = rand() \qquad (2)$$

$$X_{id} = p_{id} \pm \beta \, |mbest_d - X_{id}| \, \ln(1/u), \qquad u \sim U(0, 1) \qquad (3)$$

where $mbest$ is the mean of the personal best positions of all the particles in the whole population, that is,

$$mbest = \frac{1}{M}\sum_{i=1}^{M} P_i = \left( \frac{1}{M}\sum_{i=1}^{M} P_{i1}, \; \frac{1}{M}\sum_{i=1}^{M} P_{i2}, \; \ldots, \; \frac{1}{M}\sum_{i=1}^{M} P_{in} \right) \qquad (4)$$

Here $p_{id}$, a stochastic point between $P_{id}$ and $P_{gd}$, is the local attractor on the $d$-th dimension of the $i$-th particle, $\varphi$ is a random number distributed uniformly in $(0, 1)$, $u$ is another uniformly distributed random number in $(0, 1)$, and $\beta$, called the Contraction-Expansion coefficient, is the only parameter that has to be adjusted to control the search speed of QPSO.
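Below is a minimal sketch of one QPSO position update implementing Eqs. (2)-(4); the array shapes, the fixed value of the Contraction-Expansion coefficient and the equiprobable choice of the ± sign are assumptions made for illustration, not the paper's settings.

```python
import numpy as np

def qpso_step(X, P, g, beta=0.75, rng=None):
    """One QPSO position update.
    X : (M, D) current particle positions
    P : (M, D) personal best positions
    g : index of the globally best particle
    beta : Contraction-Expansion coefficient (0.75 is an assumed value)."""
    rng = rng or np.random.default_rng()
    M, D = X.shape
    mbest = P.mean(axis=0)                              # Eq. (4): mean personal best position
    phi = rng.random((M, D))
    p = phi * P + (1.0 - phi) * P[g]                    # Eq. (2): local attractor per dimension
    u = 1.0 - rng.random((M, D))                        # u in (0, 1], avoids log(1/0)
    sign = np.where(rng.random((M, D)) < 0.5, -1.0, 1.0)
    # Eq. (3): X = p +/- beta * |mbest - X| * ln(1/u)
    return p + sign * beta * np.abs(mbest - X) * np.log(1.0 / u)
```

After each such step the fitness of the new positions is evaluated, and the personal bests P and the global best index g are updated before the next iteration.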

2 Compound GA-QPSO clustering algorithm

GA is a mature algorithm with robust global search ability, and it yields better clustering performance than many other clustering algorithms. In our method, GA acts as the global optimizer that performs a fast, rough clustering and yields initial clustering partitions. The particles in QPSO are then initialized from the documents assigned to these partitions, and the corresponding cluster centers of all particles are initialized from the same categories, which also facilitates the calculation of the mean best position in QPSO. This strategy is effective, and the global optimization performance of QPSO becomes much stronger than with random initialization.

Like chromosomes in GA, particles in QPSO represent potential solutions to the optimization problem, so the particles are initialized from the clustering result of GA. Particles are represented in the form of Equation (1), the same as the chromosomes in GA. The fitness of a chromosome or particle is measured by the quantization

$$Fitness(i) = \sum_{j=1}^{K} \sum_{d_i \in C_{ij}} \cos(d_i, M_{ij}) \qquad (5)$$

where $d_i$ is a document vector belonging to cluster $C_{ij}$, $M_{ij}$ is the $j$-th cluster centroid vector of the $i$-th chromosome, i.e. the centroid of cluster $C_{ij}$, and $\cos(d_i, M_{ij})$ is the cosine measure between $d_i$ and $M_{ij}$, with

$$M_{ij} = (m_1, m_2, \ldots, m_n) \qquad (6)$$

where $n$ is the number of terms in the document $d_i$.

Some minor adjustments are also made to the parameters of the iterative formula according to the actual characteristics of text data. Different from update Equation (3), the adjusted iterative equation is

$$X_{id} = p_{id} \pm \alpha \, |mbest_d - X_{id}| \, u \qquad (7)$$

where $\alpha$ is a random number distributed uniformly in $(0, 1]$ and $u$ is another random constant in $(0, 0.5]$, which limits the search step of the particles so as to prevent them from overstepping the boundary of the search space.

The GA-QPSO algorithm can be summarized as follows:

Step 1: GA clustering. The clustering process is the same as in Section 1.1.

Step 2: Particle initialization. The corresponding cluster centers of all particles are initialized from the same clustering partition yielded by GA.

Step 3:
(a) Assign each document vector in the text collection to the closest centroid vector.
(b) Calculate the fitness value of each particle by Equation (5).
(c) Determine each particle's personal best position and the current global best position among the personal best positions.
(d) Calculate the mean best position among the particles by Equation (4) and obtain a stochastic point between $P_{id}$ and $P_{gd}$ by Equation (2) for each dimension of each particle.
(e) Generate the next solutions using Equation (7).

Step 4: Repeat Step 3 until one of the following termination conditions is satisfied: (a) the change in the centroid vectors falls below a predefined value; (b) the pre-specified number of iterations is exceeded.

Step 5: Output the final solution obtained by the best particle.
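The sketch below assembles Steps 2-5 around the adjusted update of Eq. (7); the GA stage of Step 1 is assumed to be available as the ga_clustering sketch from Section 1.1, and the swarm size, initialization jitter, iteration count and convergence tolerance are illustrative assumptions rather than the paper's settings.

```python
import numpy as np

def cos_sim(a, b):
    na, nb = np.linalg.norm(a), np.linalg.norm(b)
    return 0.0 if na == 0.0 or nb == 0.0 else float(a @ b) / (na * nb)

def fitness(centroids, docs):
    """Eq. (5): every document contributes its cosine similarity to the closest centroid."""
    return sum(max(cos_sim(d, m) for m in centroids) for d in docs)

def ga_qpso_clustering(docs, init_centroids, swarm=10, iters=100, tol=1e-4, seed=0):
    """Steps 2-5 of the compound algorithm.  `init_centroids` is the (k, n)
    centroid matrix produced by the GA stage (Step 1); swarm size, jitter,
    iteration count and tolerance are illustrative assumptions."""
    rng = np.random.default_rng(seed)
    k, n = init_centroids.shape
    # Step 2: all particles start from the same GA partition; a small jitter
    # keeps them distinct.
    X = np.stack([init_centroids + 0.01 * rng.standard_normal((k, n)) for _ in range(swarm)])
    P = X.copy()                                    # personal best positions
    pf = np.array([fitness(x, docs) for x in X])    # personal best fitness values
    g = int(np.argmax(pf))                          # index of the global best particle
    for _ in range(iters):
        mbest = P.mean(axis=0)                      # Eq. (4): mean best position
        prev_best = P[g].copy()
        for i in range(swarm):
            phi = rng.random((k, n))
            p = phi * P[i] + (1.0 - phi) * P[g]     # Eq. (2): local attractor, Step 3(d)
            alpha = rng.random((k, n))              # uniform factor of Eq. (7)
            u = rng.uniform(0.0, 0.5, (k, n))       # bounded step factor of Eq. (7)
            sign = np.where(rng.random((k, n)) < 0.5, -1.0, 1.0)
            X[i] = p + sign * alpha * np.abs(mbest - X[i]) * u   # Eq. (7), Step 3(e)
            f = fitness(X[i], docs)                 # Steps 3(a)-(b): assign and score
            if f > pf[i]:                           # Step 3(c): update personal best
                P[i], pf[i] = X[i].copy(), f
        g = int(np.argmax(pf))                      # ... and the global best
        if np.max(np.abs(P[g] - prev_best)) < tol:  # Step 4(a): centroids barely moved
            break
    return P[g]                                     # Step 5: centroids of the best particle

# Usage (with the ga_clustering sketch from Section 1.1):
#   centroids = ga_qpso_clustering(docs, ga_clustering(docs, k=5))
```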

3 Text clustering

3.1 Representation of documents

To apply any clustering algorithm to a dataset, the documents must first be represented as vectors in the term space. We use the widely adopted vector space model (VSM); most methods for text clustering and information retrieval are based on it. The VSM is implemented by creating a term-document matrix in which each document is represented as a vector whose components are the occurrence frequencies of the terms in that text. We represent each text as a vector $d = (w_1, w_2, \ldots, w_n)$, where $w_i$ is the term weight of the $i$-th indexing term. Each dimension of the vector corresponds to a distinct term in the term space of the text collection, and the term weight value denotes the significance of that term in the document.
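As a concrete illustration of this representation, the sketch below builds a term-document matrix of raw occurrence frequencies from tokenized texts; the whitespace tokenization (and the absence of stop-word removal or further weighting) is a simplifying assumption.

```python
import numpy as np

def build_vsm(texts):
    """Build a term-document matrix whose entries are raw term frequencies,
    as described in Section 3.1.  The lower-casing and whitespace tokenization
    are simplifying assumptions."""
    docs_tokens = [t.lower().split() for t in texts]
    vocab = sorted({w for tokens in docs_tokens for w in tokens})
    index = {w: i for i, w in enumerate(vocab)}
    D = np.zeros((len(texts), len(vocab)))
    for row, tokens in enumerate(docs_tokens):
        for w in tokens:
            D[row, index[w]] += 1.0          # w_i = occurrence frequency of term i
    return D, vocab

# Example: D, vocab = build_vsm(["space shuttle launch", "hockey game tonight"])
```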

3.2 Similarity metric

To use a clustering algorithm we need to judge the similarity between two documents represented by term vectors. The traditional way of determining the closeness of two vectors is to use the size of the angle between them, which is computed via the inner product. It is not necessary to use the actual angle; the expression "similarity coefficient" is often used instead. We use the cosine measure, which is defined as

$$\cos(d_i, d_j) = \frac{d_i \cdot d_j}{\|d_i\| \, \|d_j\|} \qquad (8)$$

where $d_i \cdot d_j = w_{i1} w_{j1} + w_{i2} w_{j2} + \cdots + w_{in} w_{jn}$ is the inner product of the two document vectors.
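Eq. (8) can be evaluated for every pair of documents at once by row-normalizing the term-document matrix from Section 3.1; this vectorized formulation is an illustrative convenience, not something prescribed by the paper.

```python
import numpy as np

def cosine_matrix(D):
    """Pairwise cosine similarities, Eq. (8), for a term-document matrix D
    whose rows are document vectors d_i = (w_i1, ..., w_in)."""
    norms = np.linalg.norm(D, axis=1, keepdims=True)
    norms[norms == 0.0] = 1.0           # guard against empty documents
    Dn = D / norms                      # each row now has unit length
    return Dn @ Dn.T                    # entry (i, j) = cos(d_i, d_j)

# Example (with build_vsm from Section 3.1):
#   D, vocab = build_vsm(["space shuttle launch", "hockey game tonight"])
#   S = cosine_matrix(D)                # S[i, j] = cos(d_i, d_j)
```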

4 Experiments

To demonstrate the performance of GA-QPSO on real text data, we used the 20Newsgroups dataset. Dataset1 contains 500 documents from five categories: comp.os.ms-windows.misc, soc.religion.christian, misc.forsale, rec.sport.hockey and sci.space. Dataset2 contains 800 documents chosen from eight categories: sci.space, misc.forsale, comp.os.ms-windows.misc, rec.sport.hockey, soc.religion.christian, talk.politics.guns, comp.sys.mac.hardware and comp.graphics. The datasets used in the experiments are summarized in Tab. 1.

Tab. 1 The dataset used for experiments
Dataset    Number of documents    Number of clusters    Number of all terms
tng500     500                    5                      9971

On the dataset, we ran each algorithm for twenty trials and recorded the average and maximum F-measure values achieved by each algorithm, as shown in Tab. 2 and Tab. 3.

Tab. 2 The average F-measure values produced by the various algorithms on the dataset
PSO         QPSO        GA          GA-QPSO
0.775688    0.753269    0.806737    0.886567

Tab. 3 The maximum F-measure values produced by the various algorithms on the dataset
PSO         QPSO        GA          GA-QPSO
0.850036    0.902578    0.880165    0.929634
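The paper reports precision, recall and F-measure without spelling out the formulas; the sketch below uses the common cluster-versus-class F-measure for clustering evaluation (each true class is matched to its best cluster and the scores are weighted by class size), which is an assumption about the exact variant used.

```python
import numpy as np

def clustering_f_measure(labels, clusters):
    """Cluster/class F-measure: for every true class, take the best F over all
    clusters, then average weighted by class size.  This is an assumed
    (common) variant; the paper does not give its exact formula."""
    labels, clusters = np.asarray(labels), np.asarray(clusters)
    n = len(labels)
    total = 0.0
    for cls in np.unique(labels):
        in_cls = labels == cls
        best_f = 0.0
        for c in np.unique(clusters):
            in_c = clusters == c
            overlap = np.sum(in_cls & in_c)
            if overlap == 0:
                continue
            precision = overlap / np.sum(in_c)
            recall = overlap / np.sum(in_cls)
            best_f = max(best_f, 2 * precision * recall / (precision + recall))
        total += np.sum(in_cls) / n * best_f
    return total

# Example: clustering_f_measure([0, 0, 1, 1], [1, 1, 0, 0]) == 1.0
```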

From Tab. 2 and Tab. 3, GA-QPSO produces the best clustering results for both the average and the maximum F-measure. From Tab. 3, the maximum F-measure of QPSO is higher than that of PSO, which indicates that QPSO's ability to find the global optimum exceeds that of PSO, even though the average F-measure of QPSO is slightly lower than that of PSO. From Tab. 2 and Tab. 3, both the average and the maximum F-measure of GA are always higher than those of PSO, which demonstrates that the clustering effect of GA is better than that of PSO.

Moreover, we use the fitness value to reflect the true variation over the generations. Fig. 1 shows the best, the worst and the common (typical) results of 20 runs of the GA-QPSO algorithm, together with the best of 20 runs of each of the other algorithms for comparison.

Fig. 1 Fitness values of the different algorithms on each generation for the dataset

From Fig. 1 we can see that GA-QPSO(Best) achieves the best performance, with a fitness value of 126.876. The results of GA-QPSO(Common) are inferior to GA-QPSO(Best) but superior to the other algorithms; that is, the searching performance of GA-QPSO is stable compared with GA or QPSO alone. Moreover, although its worst results are not as good as the best results of PSO and QPSO, it still performs better than GA, which converged too quickly.

5 Conclusion

In this paper, a compound QPSO and GA algorithm for text clustering is proposed. The method enhances the global search ability of the QPSO algorithm and has very high searching efficiency. To verify the validity of the proposed clustering algorithm, experiments were conducted on the 20Newsgroups dataset. The experimental results showed that the GA-QPSO algorithm surpasses QPSO and GA alone and has a better clustering effect in terms of F-measure. In other words, the new compound clustering algorithm combining QPSO and GA is effective and feasible.

References
[1] Kowalski G. Information retrieval systems: theory and implementation [J]. Journal of the American Society for Information Science.
