South China University of Technology
Advanced Artificial Intelligence Presentation Report

Title: Machine learning: Trends, perspectives, and prospects (Unsupervised learning and feature reduction)
School: Computer Science and Engineering
Major: Computer Science and Technology (All-English Innovation Class)
Student name:        Student ID:        Supervisor:
Start date: November 1, 2015

Topic overview
[Group paper topic] Machine learning: Trends, perspectives, and prospects.
[Individual part] Unsupervised learning and feature reduction

Paper study report
[Paper structure] After reading the paper closely three times, my understanding of its structure is as follows:
1. Definition and current state of machine learning
2. Core machine learning algorithms
   - Supervised learning: overview
   - Deep learning: overview
   - Unsupervised learning: overview, and feature reduction methods
   - Reinforcement learning
3. Pressing problems in machine learning: data privacy, data distribution
4. Opportunities and challenges for machine learning

[Content summary] Based on my understanding, the whole paper can be summarized as follows:

1. Definition and current state of machine learning
Machine learning concentrates on two questions: (1) Can machines improve their performance through experience? (2) What statistical and theoretical principles underlie all learning systems? Machine learning has made great progress over the past few decades.

This progress has come largely because machines that learn from experience are far more effective than purely hand-coded programs. A learning problem can be defined as using experience to improve some performance metric, for example the accuracy in supervised learning. Machine learning offers many algorithms; in general, each can be viewed as a search through a set of candidate solutions, driven by experience to optimize performance, and the algorithms can be grouped into many families according to how candidate solutions are represented and how the search is carried out. The principles behind these algorithms come from statistical optimization: machine learning is an interdisciplinary field of computer science and statistics, and this intersection is still in its infancy. Recently, with the rise of big data, issues such as algorithm scalability and data privacy have posed new challenges, and scalability in particular still has to address the question of data granularity.

2. Core machine learning algorithms

Supervised learning: Supervised learning has a wide range of applications, such as spam filtering and face recognition. It is a typical example of optimizing a performance metric: the task is to learn a mapping function f that maps any input sample x* to a class label y*. The most mature setting is binary classification, with a rich body of results also on multi-class classification, ranking, and structured prediction. Outstanding algorithms include decision trees, logistic regression, support vector machines, and neural networks. Some widely applicable learning approaches have gained broad acceptance, most notably ensemble learning, which combines the outputs of multiple classifiers and therefore discriminates data robustly. Real-world problems place diverse requirements on supervised learning, and many successful algorithms exist that trade off complexity against performance.

Deep learning: Deep learning is one of the hottest topics of recent years and is an advance over classic artificial neural networks. A deep network is composed of many layers of threshold units, and gradient-based optimization is used to adjust the network weights so as to minimize the error. Because of its strength it can handle millions of parameters and very large datasets, and thanks to this scalability it has been applied successfully in computer vision, natural language processing, and other fields. The area is still expanding, more and more researchers are moving into it, and its prospects are hard to overstate.

Unsupervised learning: Unsupervised learning is the process of learning from unlabeled data. Its most representative form is clustering, whose task is to organize the data into clusters so that similarity within a cluster is maximized and similarity between clusters is minimized. Each specific clustering algorithm makes certain assumptions about the data distribution and uses those assumptions to discover the intrinsic structure of the data.
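To make the clustering task concrete, here is a minimal sketch, assuming scikit-learn is available; the synthetic blob data and the choice of k = 3 are illustrative only and not taken from the paper.

```python
# Minimal sketch of the clustering task described above: group unlabeled data
# so that points within a cluster are similar and clusters are well separated.
# The synthetic dataset and k=3 are illustrative assumptions.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)   # unlabeled data
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print(labels[:10])   # cluster assignment of the first ten samples
```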

Like clustering, feature reduction is also based on assumptions about the data distribution, and some feature reduction algorithms can be regarded as unsupervised learning. Feature reduction is especially important when processing big data; well-known methods include PCA and LDA.

Reinforcement learning: Reinforcement learning is the learning, by an intelligent system, of a mapping from environment states to actions so as to maximize a reward (reinforcement) signal. It differs from supervised learning in connectionism mainly in the teacher signal: the reinforcement signal provided by the environment evaluates how good or bad a produced action is (usually a scalar), rather than telling the reinforcement learning system (RLS) how to produce the correct action. Because the external environment provides very little information, the RLS must learn from its own experience.

In this way the RLS acquires knowledge in an action-evaluation loop with the environment and improves its action policy to adapt to it.

Combining the three kinds of learning: All three kinds of learning have achieved a degree of success, and much recent research focuses on mixing them. One example is semi-supervised learning: during clustering, the class membership of some samples is known, and this supervised membership information is supplied as constraints to the unsupervised clustering. Other examples include model selection, active learning, and causal learning. Many issues influence the design of learning algorithms, including serial versus parallel execution and robustness.

3. Pressing problems in machine learning
Data privacy and data distribution are the two most urgent problems. Data privacy involves policy considerations, data ownership, and the degree of privacy required. Some data people are willing to share for research, while other data they want to keep strictly confidential.

Even for the same data, different people hold different views, and deciding how to measure and bound data privacy is an important ethical question. Because data are large-scale and distributed, gathering them in one place for processing is usually impossible; a nationwide chain, for example, has huge databases at branches all over the country, and one can only learn in a distributed way and then combine the results. Modern algorithms should therefore pay ever more attention to parallelism so that they work well in real environments.

4. Opportunities and challenges for machine learning
Although modern machine learning has developed enormously, it is still far from true intelligence. A never-ending, self-adaptive learning machine is its highest aspiration. As a next step, semi-automated, human-in-the-loop learning algorithms are a new topic: with human intervention, machines could analyze and process all kinds of complex data.

Extended reading
The paper above is a survey: it contains no formulas, algorithms, experiments, or proofs, it does not touch on the frontier topics of unsupervised learning, and it is also the shortest in length.

To study the theory more thoroughly, I therefore read two additional papers as extended learning:

Peng H., Long F., Ding C. (2005) Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans Pattern Anal Mach Intell 27: 1226-1238.

Fern X. Z., Brodley C. E. (2003) Random projection for high dimensional data clustering: A cluster ensemble approach. In Proceedings of the International Conference on Machine Learning.

The first paper is about feature selection; the second is about random projection (a feature reduction method) and a frontier direction of unsupervised learning, cluster ensembles. I originally prepared the slides and study report in English and only agreed to switch to Chinese when merging slides with my teammates, so I translated the slides and the content above into Chinese. The extended-learning part below took me a whole evening to write, and I did not want to spend another evening translating it, so it remains in English; I hope this is acceptable.

Feature selection

Feature selection was proposed around 1970, and a great amount of work has been devoted to it since. The objective of feature reduction is three-fold: improving classification accuracy, providing faster and more cost-effective predictors, and providing a better understanding of the underlying process that generated the data. Feature selection is different from feature extraction, as the figure below shows: the target of feature selection is to select a subset of the original features rather than to map them into a lower-dimensional space.

Given a set of features F = {f_1, ..., f_i, ..., f_n}, the feature selection problem is defined as finding a subset that maximizes the learner's ability to classify patterns. More formally, the selected subset F' should maximize some scoring function Θ: Γ → R, where Γ is the space of all possible feature subsets of F:

    F' = argmax_{G ⊆ F} Θ(G)

The framework of feature selection is given below; its two main parts are the generation step and the evaluation step.

For the generation step, the main task is to select a candidate feature subset for evaluation. There are three ways in which the feature space can be examined: (1) complete, (2) heuristic, (3) random.

Complete/exhaustive: Examine all possible feature subsets, of which there are on the order of 2^|F|. For example, for features f1, f2, f3 the candidate subsets are {f1}, {f2}, {f3}, {f1,f2}, {f1,f3}, {f2,f3}, {f1,f2,f3}. The optimal subset is guaranteed to be found if we search all possible solutions, but this is too expensive when the feature space is large.

Heuristic: Selection is directed by some guideline. Start with an empty feature set (or the full set) and select (or delete) one feature at each step until the target number of features is reached, for example the incremental generation of subsets {f1} → {f1,f3} → {f1,f3,f2} (a minimal sketch of this strategy is given after the three strategies).

Random: There is no predefined rule for picking feature candidates; features are picked at random. This requires more user-defined input parameters, such as the number of trials.
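Below is a minimal sketch of the heuristic (forward) generation strategy under my own assumptions: `score` stands for any evaluation function Θ(G) supplied by the caller, and the toy scoring rule at the bottom is purely illustrative, not something from the surveyed papers.

```python
# Minimal sketch of heuristic (forward) subset generation.
# `score(subset)` is a caller-supplied evaluation function Theta(G),
# e.g. a filter score or cross-validated accuracy (hypothetical here).

def forward_selection(features, score, target_size):
    """Greedily grow a subset: add the single feature that most improves
    the score until `target_size` features have been selected."""
    selected = []
    remaining = list(features)
    while remaining and len(selected) < target_size:
        best_feature = max(remaining, key=lambda f: score(selected + [f]))
        selected.append(best_feature)
        remaining.remove(best_feature)
    return selected

# Example with a toy score that prefers low feature indices (illustration only).
if __name__ == "__main__":
    toy_score = lambda subset: -sum(subset)
    print(forward_selection(range(10), toy_score, target_size=3))  # [0, 1, 2]
```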

According to whether the learning algorithm participates in the selection step, feature selection methods can be divided into three categories: filter, wrapper, and embedded, as illustrated below.

Filter approach: The filter approach is usually fast. It provides a generic selection of features that is not tuned to a given learning algorithm. However, it is tied to a specific statistical criterion and is not optimized for the classifier that will actually be used, so filter methods are sometimes employed only as a pre-processing step for other methods.

Wrapper approach: The learner is treated as a black box used to score subsets according to their predictive power. Accuracy is usually high, but the result varies across learners, so generality is lost. One needs to define how to search the space of all possible feature subsets and how to assess the prediction performance of a given subset. Finding the optimal subset is NP-hard, so a wide range of heuristic search strategies can be used: IDPT, branch-and-bound, simulated annealing, tabu search, genetic algorithms, forward selection (start with an empty feature set and add one feature at each step), and backward deletion (start with the full feature set and delete one feature at each step). Predictive power is usually measured on a validation set or by cross-validation. The drawbacks of the wrapper approach are that it requires a large amount of computation and carries a danger of overfitting.

Embedded approach: Selection is specific to a given learning machine. It combines the advantages of the two previous methods: it is less computationally expensive than a full wrapper and takes advantage of the learner's own variable selection mechanism, and it is usually implemented as a two-step or multi-step process.

For the evaluation step, the main task is to measure the goodness of the candidate subsets produced by the generation step. The five main types of evaluation functions are: distance (e.g. Euclidean distance measures), information (entropy, information gain, etc.), dependency (correlation coefficients), consistency (the min-features bias), and the classification error rate. The first four are used by filter methods and the last one by wrapper methods.
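As a concrete illustration of an information-based filter evaluation, here is a minimal sketch that ranks features by their estimated mutual information with the class label; the dataset and the choice of k are my own illustrative assumptions, not from the surveyed papers.

```python
# Minimal sketch of a filter-style evaluation: rank features by mutual
# information with the class label and keep the top k.
# Dataset and k are illustrative choices.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import mutual_info_classif

X, y = load_breast_cancer(return_X_y=True)

# Information-based filter score: estimated I(feature; class).
mi_scores = mutual_info_classif(X, y, random_state=0)

k = 10
top_k = np.argsort(mi_scores)[::-1][:k]   # indices of the k highest-scoring features
X_reduced = X[:, top_k]                   # filtered dataset

print("selected feature indices:", top_k)
print("reduced shape:", X_reduced.shape)  # (n_samples, k)
```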

An application of feature selection in supervised learning is given in the following, extracted from the paper "Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy".

The optimal characterization condition of feature selection in supervised learning is minimal classification error; in terms of mutual information this corresponds to maximal statistical dependency of the target class on the selected features (Max-Dependency). One of the most popular approaches to realize Max-Dependency is maximal relevance (Max-Relevance): select the features that are individually most relevant to the class. The problems of this method are that combinations of individually good features do not necessarily lead to good classification performance, i.e. "the m best features are not the best m features", and that relevance alone may introduce rich redundancy. Features with minimum redundancy should therefore also be considered, and the authors propose an algorithm that addresses exactly this problem.

The main work of the paper consists of three parts: (1) a theoretical analysis showing that mRMR (max-relevance and min-redundancy) is equivalent to Max-Dependency for first-order feature selection; (2) an investigation of how to combine mRMR with other feature selection methods into a two-stage selection algorithm; (3) a comparison of mRMR, Max-Relevance, Max-Dependency, and the two-stage feature selection algorithm through comprehensive experiments. Since the first part is unrelated to this course project, I skipped it, and only one experiment from the original paper will be mentioned.

The proposed algorithm is named mRMR (Max-Relevance and Min-Redundancy). Max-Relevance means selecting the features that are most relevant to the target class c, i.e. selecting the subset S satisfying

    max D(S, c),   D = (1/|S|) Σ_{x_i ∈ S} I(x_i; c)

where I(x; y) is the mutual information mentioned before. Min-Redundancy means selecting features that are not redundant with the features already selected, i.e. satisfying

    min R(S),   R = (1/|S|^2) Σ_{x_i, x_j ∈ S} I(x_i; x_j)

An operator Φ is then defined to turn this multi-objective optimization into a single criterion that combines D and R and optimizes them simultaneously:

    max Φ(D, R),   Φ = D − R

In practice, incremental search can be used to find near-optimal features: given the already selected set S_{m−1}, the m-th feature is chosen by

    max_{x_j ∈ X − S_{m−1}} [ I(x_j; c) − (1/(m−1)) Σ_{x_i ∈ S_{m−1}} I(x_j; x_i) ]

Up to this point we have only half of the algorithm, not the whole process.
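Here is a minimal sketch of the incremental mRMR selection, assuming the features have already been discretized so that mutual information between discrete variables can be estimated with scikit-learn's `mutual_info_score`; the function and variable names are my own illustration, not the paper's reference implementation.

```python
# Minimal sketch of mRMR incremental selection on discretized features.
import numpy as np
from sklearn.metrics import mutual_info_score

def mrmr_select(X, y, n_select):
    """Greedy mRMR: at each step add the feature maximizing relevance
    I(x_j; c) minus its mean redundancy with already selected features."""
    n_features = X.shape[1]
    relevance = np.array([mutual_info_score(X[:, j], y) for j in range(n_features)])

    selected = [int(np.argmax(relevance))]      # start from the most relevant feature
    while len(selected) < n_select:
        best_j, best_score = None, -np.inf
        for j in range(n_features):
            if j in selected:
                continue
            redundancy = np.mean([mutual_info_score(X[:, j], X[:, i]) for i in selected])
            score = relevance[j] - redundancy   # the mRMR incremental criterion
            if score > best_score:
                best_j, best_score = j, score
        selected.append(best_j)
    return selected
```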

The algorithm in the paper is a two-stage process: (1) find a candidate feature subset using the mRMR incremental selection method; (2) use a more sophisticated method (with the classifier involved) to search for a compact feature subset within that candidate subset. In this sense the two-stage algorithm is a case of the embedded method.

The first stage works as follows. Use mRMR incremental selection to produce a sequence of nested feature sets

    S_1 ⊂ S_2 ⊂ ... ⊂ S_{n−1} ⊂ S_n

Compare the classification error of all these subsets and find the range of k, called Ω, within which the respective error e_k is consistently small. Within Ω, find the smallest error e* = min e_k; the optimal subset size is the k corresponding to e*.

The second stage works as follows. For backward selection: at each step, exclude one redundant feature if the resulting error is smaller than e* (choosing the feature whose removal gives the greatest error reduction); terminate when no further error reduction can be obtained. For forward selection: at each step, select the one feature that leads to the greatest error reduction; terminate when the error begins to increase.
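A minimal sketch of the first stage is given below: it scores the nested subsets with cross-validated error and picks the size with the smallest error. The classifier, the cross-validation setup, and `max_k` are illustrative assumptions rather than the paper's exact protocol, and `ranked_features` could be the output of the `mrmr_select` sketch above.

```python
# Minimal sketch of the first stage: evaluate nested prefixes S_1 ⊂ ... ⊂ S_n
# of an mRMR ranking with cross-validated error and pick the best size k*.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

def pick_subset_size(X, y, ranked_features, max_k=30):
    """Return (best_k, error_curve) over nested prefixes of `ranked_features`."""
    errors = []
    for k in range(1, min(max_k, len(ranked_features)) + 1):
        cols = ranked_features[:k]                          # S_k: first k ranked features
        acc = cross_val_score(GaussianNB(), X[:, cols], y, cv=5).mean()
        errors.append(1.0 - acc)                            # classification error e_k
    best_k = int(np.argmin(errors)) + 1                     # k* with the smallest error
    return best_k, errors
```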

Now the algorithm of this paper is complete. To evaluate how effective and efficient the algorithm is, there is also the question of how to judge which algorithm is superior, so the authors define a measure called the RM-characteristic. Given two feature sets S^1 and S^2 that are generated sequentially,

    S^1: S^1_1 ⊂ S^1_2 ⊂ ... ⊂ S^1_n,    S^2: S^2_1 ⊂ S^2_2 ⊂ ... ⊂ S^2_n,

we say that S^1 is recursively more characteristic (RM-characteristic) than S^2 by p% if, for p% of the values of k, the error of S^1_k is smaller than the error of S^2_k.
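A minimal worked example of this comparison: given the error curves of the two sequentially generated sets, compute the percentage of subset sizes at which the first set wins. The error values below are made up purely for illustration.

```python
# Minimal sketch of the RM-characteristic comparison.
import numpy as np

def rm_characteristic(e1, e2):
    """Return p (in %) such that set 1 is RM-characteristic of set 2 by p%."""
    e1, e2 = np.asarray(e1), np.asarray(e2)
    return 100.0 * np.mean(e1 < e2)

# Set 1 has the lower error at 2 of the 3 subset sizes, so p ≈ 66.7.
print(rm_characteristic([0.10, 0.08, 0.07], [0.12, 0.09, 0.06]))
```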

[Figure from the paper: classification error rate versus number of selected features, one row per dataset and one column per classification algorithm.]

The figure above shows one of the experimental results given in the paper. Each row corresponds to a different dataset and each column to a different classification algorithm. In each graph the x-axis denotes the number of selected features and the y-axis the error rate; the curve marked with triangles is the proposed algorithm and the other one is the state-of-the-art algorithm of that time. As the result shows, classification accuracy can be significantly improved by mRMR feature selection.

There is also an experiment I carried out myself to verify that feature selection can improve accuracy. The experiment was run on the Spambase dataset with an SVM using a linear kernel. The x-axis denotes the number of selected features and the y-axis the accuracy; the red line is the proposed algorithm, and the others are the baseline, traditional, and random selections. The proposed algorithm performs best, so I am convinced that feature selection methods can improve the accuracy of learning algorithms.
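For reference, here is a minimal sketch of how such an experiment could be set up. The file name "spambase.data" is a placeholder for a local copy of the UCI Spambase file, the ranking uses plain mutual information (an mRMR ranking could be substituted), and the subset sizes are arbitrary; this is not the original experiment script.

```python
# Minimal sketch: accuracy of a linear-kernel SVM on Spambase as the
# number of selected features grows. Path, ranking criterion and subset
# sizes are illustrative assumptions.
import numpy as np
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

data = np.loadtxt("spambase.data", delimiter=",")   # last column is the spam label
X, y = data[:, :-1], data[:, -1].astype(int)

ranking = np.argsort(mutual_info_classif(X, y, random_state=0))[::-1]

for k in (5, 10, 20, 40, X.shape[1]):                # growing feature subsets
    acc = cross_val_score(SVC(kernel="linear"), X[:, ranking[:k]], y, cv=5).mean()
    print(f"{k:2d} features -> accuracy {acc:.3f}")
```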

Random projection

Random projection is a feature extraction algorithm. The most famous feature extraction algorithms include PCA, LDA, LLE, and so on. Random projection is sometimes described as an LSH (locality-sensitive hashing) method, and because it is highly unstable it is not as famous, but it is quite useful in some cases and much more efficient than the most famous algorithms such as PCA.

The main steps of random projection can be introduced briefly:
(1) Randomly select a set of high-dimensional unit vectors (not necessarily orthogonal).
(2) Project the high-dimensional data into the low dimension by taking products with these vectors.

These steps sound simple and somewhat unreliable, but there is in fact a lemma that guarantees their precision, the Johnson-Lindenstrauss Lemma. Its main idea is that it is possible to project n points in a space of arbitrarily high dimension onto an O(log n)-dimensional space such that the pairwise distances between the points are approximately preserved. More formally:

Johnson-Lindenstrauss Theorem. For any 0 < ε < 1/2 and any integer n, let k be a positive integer with k ≥ 4 (ε²/2 − ε³/3)⁻¹ ln n. Then for any set V of n points in R^d there is a map f: R^d → R^k such that, for all u, v ∈ V,

    (1 − ε) ‖u − v‖² ≤ ‖f(u) − f(v)‖² ≤ (1 + ε) ‖u − v‖²

Furthermore, this map can be found in expected polynomial time.

Here we use sample distance as the measure of goodness of a feature reduction method, because one objective of feature reduction is that the pairwise distances between points remain approximately the same as before. In the data mining area, a dataset has two ways of being represented: the data matrix and the discernibility (dissimilarity) matrix,

    0
    d(2,1)  0
    d(3,1)  d(3,2)  0
    ...
    d(n,1)  d(n,2)  ...  0

where d(i, j) is the distance between samples i and j. If the pairwise distances of the data points are preserved precisely, then the discernibility matrix retains most of the information of the original dataset, and we say that the feature reduction method is a good one.

There are several ways to construct a random projection f: R^d → R^k. We adopt the one in the original Johnson-Lindenstrauss paper: let A be a random k × d matrix that projects R^d onto a uniform random k-dimensional subspace, multiply A by the fixed scalar sqrt(d/k), and map every v ∈ R^d to f(v) = sqrt(d/k) · A v.

[Figure: the original data matrix is multiplied by the random projection matrix to obtain the k-dimensional representation.]
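Below is a minimal sketch of random projection with a check of pairwise-distance preservation. It uses scikit-learn's `GaussianRandomProjection`, which draws a Gaussian random matrix rather than the exact subspace construction above but gives the same kind of JL guarantee; the data size, dimensionality, and ε are illustrative assumptions.

```python
# Minimal sketch: random projection plus a pairwise-distance preservation check.
import numpy as np
from sklearn.metrics import pairwise_distances
from sklearn.random_projection import (GaussianRandomProjection,
                                        johnson_lindenstrauss_min_dim)

rng = np.random.RandomState(0)
X = rng.rand(200, 1000)                    # 200 points in a 1000-dimensional space

eps = 0.3
k = johnson_lindenstrauss_min_dim(n_samples=X.shape[0], eps=eps)   # JL bound on k
X_low = GaussianRandomProjection(n_components=k, random_state=0).fit_transform(X)

# The ratio of projected to original pairwise distances should stay close to 1.
d_orig = pairwise_distances(X)
d_proj = pairwise_distances(X_low)
mask = d_orig > 0
ratio = d_proj[mask] / d_orig[mask]
print(f"k = {k}, distance ratio in [{ratio.min():.2f}, {ratio.max():.2f}]")
```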

The advantage of random projection is that it does not rely on any predefined "interestingness" criterion the way PCA does, and high-dimensional distributions look more like Gaussians when projected to a low dimension. It is, however, a highly unstable algorithm. For example: the left picture shows the true distribution of a high-dimensional dataset (plotted using two of its features), while the middle and right pictures show two single runs of a clustering algorithm after random projection. The results of individual runs can differ greatly, but it is exactly this instability that provides multiple views of the same dataset, which is useful in ensemble learning.

Cluster ensemble

Ensemble learning has been a hot topic in recent years, and the cluster ensemble is one of the newest topics in unsupervised learning. The framework of a classification ensemble is as follows: given a dataset, we first generate different views of it, which can be done by bootstrap sampling, random feature subspaces, or other methods. We then use different learning algorithms, the same algorithm with different parameters, or even simply the same algorithm, to train several different classifiers. When a new sample arrives, the multiple classifiers each classify it, and the final result is obtained by a voting scheme or some other combination method.

A cluster ensemble works almost the same way as a classification ensemble:

[Figure: cluster ensemble framework. The original dataset goes through data transformations 1..B, each feeding a clustering algorithm that produces clustering solutions 1..B (the ensemble generator); a consensus function then combines the solutions into the final result.]

The main difference is that the clustering solutions of the individual runs may use different output labels and even different numbers of clusters, which makes it impossible to obtain the final result directly by voting. A consensus function therefore has to be defined to combine the results of the multiple runs.
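Here is a minimal sketch of a cluster ensemble in this spirit: several k-means runs on different random projections, combined through a co-association (evidence accumulation) consensus function. The consensus used here is one common choice rather than necessarily the one from the Fern & Brodley paper, and all sizes and parameters are illustrative assumptions.

```python
# Minimal sketch of a cluster ensemble: k-means on several random projections,
# combined via a co-association matrix and hierarchical clustering.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform
from sklearn.cluster import KMeans
from sklearn.random_projection import GaussianRandomProjection

def cluster_ensemble(X, n_clusters=3, n_runs=10, n_components=5, seed=0):
    rng = np.random.RandomState(seed)
    n = X.shape[0]
    coassoc = np.zeros((n, n))

    # Ensemble generator: each run clusters a different random projection.
    for r in range(n_runs):
        proj = GaussianRandomProjection(n_components=n_components,
                                        random_state=rng.randint(1_000_000))
        labels = KMeans(n_clusters=n_clusters, n_init=10,
                        random_state=r).fit_predict(proj.fit_transform(X))
        coassoc += (labels[:, None] == labels[None, :])   # 1 where i and j share a cluster

    # Consensus function: hierarchical clustering on co-association distances.
    dist = 1.0 - coassoc / n_runs
    np.fill_diagonal(dist, 0.0)
    Z = linkage(squareform(dist, checks=False), method="average")
    return fcluster(Z, t=n_clusters, criterion="maxclust")
```

A different base clusterer or consensus function could be dropped in without changing the overall two-part structure (ensemble generator plus consensus function).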

An application of the random projection method mentioned above, together with a case study of a cluster ensemble, is given in the paper "Random projection for high dimensional data clustering: A cluster ensemble approach".
