




Boosting for Transfer Learning

Wenyuan Dai, Department of Computer Science and Engineering, Shanghai Jiao Tong University, China
Qiang Yang (qyang@cse.ust.hk), Department of Computer Science, Hong Kong University of Science and Technology, Hong Kong
Gui-Rong Xue and Yong Yu, Department of Computer Science and Engineering, Shanghai Jiao Tong University, China

Appearing in Proceedings of the 24th International Conference on Machine Learning, Corvallis, OR, 2007. Copyright 2007 by the author(s)/owner(s).
Abstract

Traditional machine learning makes a basic assumption: the training and test data should be under the same distribution. However, in many cases, this identical-distribution assumption does not hold. The assumption may be violated when a task from a new domain arrives, while there are only labeled data from a similar old domain. Labeling the new data can be costly, and it would also be a waste to throw away all the old data. In this paper, we present a novel transfer learning framework called TrAdaBoost, which extends boosting-based learning algorithms (Freund & Schapire, 1997). TrAdaBoost allows users to utilize a small amount of newly labeled data to leverage the old data to construct a high-quality classification model for the new data. We show that this method can allow us to learn an accurate model using only a tiny amount of new data and a large amount of old data, even when the new data are not sufficient to train a model alone. We show that TrAdaBoost allows knowledge to be effectively transferred from the old data to the new. The effectiveness of our algorithm is analyzed theoretically and empirically to show that our iterative algorithm can converge well to an accurate model.

1. Introduction
A fundamental assumption in classification learning is that the data distributions of the training and test sets should be identical. When the assumption does not hold, traditional classification methods may perform worse. In practice, this assumption often fails. For example, in Web mining, the Web data used to train a Web-page classification model can easily become out-dated when applied to the Web some time later, because the topics on the Web change frequently. Often, new data are expensive to label, so their quantities are limited due to cost issues. How to accurately classify the new test data by making the maximum use of the old data becomes a critical problem.

Although the training data are more or less out-dated, certain parts of the data can still be reused. That is, knowledge learned from this part of the data can still be of use in training a classifier for the new data. To find out which those data are, we employ a small amount of labeled new data, called same-distribution training data, to help vote on the usefulness of each of the old data instances. Because some of the old training data might be out-dated and be under a different distribution from the new data, we call them diff-distribution training data. Our goal is to learn a high-quality classification model using both the same-distribution and diff-distribution data. This learning process corresponds to transferring knowledge.

In this paper, we try to develop a general framework for transfer learning based on (Freund & Schapire, 1997), and analyze the correctness of the general model using the Probably Approximately Correct (PAC) theory. Our key idea is to use boosting to filter out the diff-distribution training data that are very different from the same-distribution data by automatically adjusting the weights of training instances. The remaining diff-distribution data are treated as additional training data, which greatly boost the confidence of the learned model even when the same-distribution training data are scarce. Our experimental results support our conjecture that the boosting-based framework, which we implement in an algorithm known as TrAdaBoost, is a high-performance and simple transfer learning algorithm. Our theoretical analysis also confirms that our boosting learning method converges well to the desired model.

The rest of the paper is organized as follows. In Section 2, we discuss the related works. In Section 3, we present our formal definitions for the problem and the framework. We present a theoretical analysis in Section 4 to show the properties of our transfer learning framework. Our experimental results and discussion are shown in Section 5. Section 6 concludes the whole paper and gives some future works.
20、al results and discussionare shownection 5. Section 6 concludes the wholepaper and gives some future works.2. Related WorkTransfer learning is what happens when someone nds it much easier to learn to play chess having already learned to play checkers, or to recognize tables hav- ing already learned
21、to recognize chairs; or to learn Spanish having already learned Italian1. Recently, transfer learning has been recognized as an impor- tant topic in machine learning research. Several re- searchers have proposed new approaches to solve the problems of transfer learning. Early transfer learning works
22、 raised some important issues, such as learning how to learn (Schmidhuber, 1994), learning one more thing (Thrun & Mitchell, 1995), multi-task learning (Caruana, 1997). A related topic is multi-task learn- ing whose objective is to discover the common knowl- edge in multiple tasks. This common knowl
23、edge be- longs to almost all the tasks, and is helpful for solving a new task. Ben-David and Schuller (2003) provided a theoretical justication for multi-task learning. In contrast, we address on the single-task learning prob- lem, but the distributions of the training and test data dier from each o
3. Transfer Learning through TrAdaBoost

To enable transfer learning, we use part of the labeled training data that have the same distribution as the test data to play a role in building the classification model. We call these training data same-distribution training data. The quantity of these same-distribution training data is often inadequate to train a good classifier for the test data. The training data whose distribution may differ from that of the test data, perhaps because they are out-dated, are called diff-distribution training data. These data are assumed to be abundant, but the classifiers learned from them cannot classify the test data well due to the different data distributions.

More formally, let X_s be the same-distribution instance space, X_d be the diff-distribution instance space, and Y = {0, 1} be the set of category labels. A concept is a boolean function c mapping from X to Y, where X = X_s ∪ X_d. The test data set is denoted by S = {x_i^t}, where x_i^t ∈ X_s (i = 1, ..., k). Here, k is the size of the test set S, which is unlabeled. The training data set T ⊆ X × Y is partitioned into two labeled sets, T_d and T_s. T_d represents the diff-distribution training data, T_d = {(x_i^d, c(x_i^d))}, where x_i^d ∈ X_d (i = 1, ..., n). T_s represents the same-distribution training data, T_s = {(x_j^s, c(x_j^s))}, where x_j^s ∈ X_s (j = 1, ..., m). n and m are the sizes of T_d and T_s, respectively. c(x) returns the label of the instance x. The combined training set T = {(x_i, c(x_i))} is defined as follows:

\[ x_i = \begin{cases} x_i^d, & i = 1, \dots, n; \\ x_{i-n}^s, & i = n+1, \dots, n+m. \end{cases} \]

Here, T_d corresponds to some labeled data from an old domain that we try to reuse as much as we can; however, we do not know which part of T_d is useful to us. What we can do is to label a small amount of data from the new domain, call it T_s, and then use these data to find out the useful part of T_d. The problem that we are trying to solve is: given a small number of labeled same-distribution training data T_s, many diff-distribution training data T_d, and some unlabeled test data S, the objective is to train a classifier ĉ : X → Y that minimizes the prediction error on the unlabeled data set S.
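To fix this layout in code, a minimal sketch with synthetic stand-ins is given below; the array names, shapes, distributions, and labeling rule are illustrative assumptions, not part of the paper.

```python
# Synthetic stand-ins for the three data sets of the problem setting: T_d (abundant,
# out-dated diff-distribution data), T_s (scarce, newly labeled same-distribution
# data), and the unlabeled test set S drawn from the same distribution as T_s.
import numpy as np

rng = np.random.default_rng(0)
n, m, k, d = 1000, 50, 200, 10                     # |T_d|, |T_s|, |S|, feature dimension

Xd = rng.normal(loc=0.0, scale=1.0, size=(n, d))   # old-domain instances (X_d)
yd = (Xd[:, 0] > 0).astype(int)                    # labels c(x) on T_d

Xs = rng.normal(loc=0.5, scale=1.0, size=(m, d))   # shifted new-domain instances (X_s)
ys = (Xs[:, 0] > 0).astype(int)                    # labels c(x) on T_s

S = rng.normal(loc=0.5, scale=1.0, size=(k, d))    # unlabeled test set

# Combined training set T: the first n rows come from T_d, the last m from T_s.
X = np.vstack([Xd, Xs])
y = np.concatenate([yd, ys])
w1 = np.ones(n + m)                                # one choice of initial weight vector w^1
```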
33、beleddata set S.We now present our Transfer AdaBoost learning framework TrAdaBoost, which extends AdaBoost for transfer learning. AdaBoost (Freund & Schapire, 1997) is a learning framework which aims to boost the accuracy of a weak learner by carefully adjustingCall Learner, providing it the combine
34、d training set T with the distribution pt over T and the un- labeled data set S. Then, get back a hypothesis ht : X Y (or 0, 1 by condence).Calculate the error of h ton T :sn+m3.w t | ht(x i) c(x i)|i=.tn+mwti=n+1 ii=n+14.Set t=t /(1 )t and = 1/(1 +2 ln n/N).Note that, t is required to be less than
35、1/2.Update the new weight vector:5.the weights of trainingtances and learn a classi-wt|h (tx )i c(x )i| ,1 i nt+1ier accordingly. However, AdaBoost is similar to mostwi=|h (x )c(x )|w t ii, n +1 i n + mttraditional machine learning methods by assuming the distributions of the training and test data
36、to be iden- tical. In our extension to AdaBoost, AdaBoost is still applied to same-distribution training data to build the base of the model. But, for di-distribution train-i tOutput the hypothesis1Nh (x)N1,tt= N/2 t2h (x) =t= N/2 tf0, otherwiseingtances, when they are wrongly predicted dueto distri
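To make the procedure concrete, here is a minimal Python sketch of Algorithm 1. Everything outside the algorithm itself is an assumption of the sketch: NumPy feature matrices, binary labels in {0, 1}, a scikit-learn-style base learner that accepts per-instance sample weights (a depth-one decision tree is used purely as a placeholder weak learner), the clipping of ε_t, and the function name tradaboost. The unlabeled set S is omitted because Algorithm 1 only passes it through to the base learner, which matters only for transductive learners.

```python
# A minimal sketch of Algorithm 1 (TrAdaBoost). The base learner, the eps clipping,
# and the helper names are assumptions of this sketch, not part of the paper.
import numpy as np
from sklearn.tree import DecisionTreeClassifier


def tradaboost(Xd, yd, Xs, ys, N=20, make_learner=lambda: DecisionTreeClassifier(max_depth=1)):
    """Train on diff-distribution data (Xd, yd) plus same-distribution data (Xs, ys)
    and return a function that predicts {0, 1} labels for new instances."""
    n, m = len(Xd), len(Xs)
    X = np.vstack([Xd, Xs])                      # combined training set T
    y = np.concatenate([yd, ys])
    w = np.ones(n + m)                           # initial weight vector w^1 (uniform here)
    beta = 1.0 / (1.0 + np.sqrt(2.0 * np.log(n) / N))

    hypotheses, beta_ts = [], []
    for t in range(N):
        p = w / w.sum()                          # step 1: weights -> distribution p^t
        h = make_learner().fit(X, y, sample_weight=p)    # step 2: call Learner
        loss = np.abs(h.predict(X) - y)          # |h_t(x_i) - c(x_i)| in {0, 1}

        # step 3: weighted error of h_t on the same-distribution part T_s only
        eps = np.sum(w[n:] * loss[n:]) / np.sum(w[n:])
        eps = float(np.clip(eps, 1e-6, 0.499))   # the analysis needs eps_t < 1/2
        beta_t = eps / (1.0 - eps)               # step 4 (beta was fixed above)

        # step 5: shrink weights of misclassified diff-distribution instances,
        # grow weights of misclassified same-distribution instances
        w[:n] *= beta ** loss[:n]
        w[n:] *= beta_t ** (-loss[n:])

        hypotheses.append(h)
        beta_ts.append(beta_t)

    def predict(Xnew):
        # Output: vote only the hypotheses from iteration ceil(N/2) to N;
        # h_f(x) = 1 iff prod beta_t^{-h_t(x)} >= prod beta_t^{-1/2}.
        start = int(np.ceil(N / 2.0)) - 1
        score = np.zeros(len(Xnew))
        for h, bt in zip(hypotheses[start:], beta_ts[start:]):
            score += -np.log(bt) * (h.predict(Xnew) - 0.5)
        return (score >= 0).astype(int)

    return predict
```

With the synthetic arrays from the earlier sketch, one would obtain a classifier for the new domain as predict = tradaboost(Xd, yd, Xs, ys, N=50) and label the test set with predict(S).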
A formal description of the framework is given in Algorithm 1. As can be seen from the algorithm, in each iteration round, if a diff-distribution training instance is mistakenly predicted, the instance may likely conflict with the same-distribution training data. Then, we decrease its training weight to reduce its effect, through multiplying its weight by β^{|h_t(x_i) − c(x_i)|}. Note that β^{|h_t(x_i) − c(x_i)|} ∈ (0, 1]. Thus, in the next round, the misclassified diff-distribution training instances, which are dissimilar to the same-distribution ones, will affect the learning process less than in the current round. After several iterations, the diff-distribution training instances that fit the same-distribution data well will have larger training weights, while the diff-distribution training instances that are dissimilar to the same-distribution ones will have lower weights. The instances with large training weights will tend to help the learning algorithm to train better classifiers.
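To see what this update does numerically, the toy computation below (the numbers are ours, not from the paper) tracks a single unit weight through one round with n = 1000, N = 100, and ε_t = 0.25.

```python
# Toy numbers illustrating one application of step 5 of Algorithm 1.
import math

n, N = 1000, 100
beta = 1 / (1 + math.sqrt(2 * math.log(n) / N))   # ~0.73: fixed shrink factor
eps_t = 0.25                                      # assumed error of h_t on T_s
beta_t = eps_t / (1 - eps_t)                      # 1/3: per-round factor for T_s

w = 1.0                                           # current weight of an instance
print(w * beta ** 1)     # misclassified diff-distribution instance: weight shrinks to ~0.73
print(w * beta_t ** -1)  # misclassified same-distribution instance: weight grows to 3.0
print(w * beta ** 0)     # correctly classified instance (either set): weight is unchanged
```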
4. Theoretical Analysis of TrAdaBoost

In the last section, we presented the transfer classification learning framework. In this section, we theoretically analyze our framework in terms of its convergence property, and show why the framework is able to learn knowledge even when the domain distributions are not identical. Our analysis extends the theory of AdaBoost (Freund & Schapire, 1997). Due to the limitation of space, we only give a sketch of the analysis; the details are given in a longer version.

Let l_i^t = |h_t(x_i) − c(x_i)| (i = 1, ..., n and t = 1, ..., N) be the loss of the training instance x_i suffered by the hypothesis h_t. The distribution d^t is the training weight vector with respect to T_d in the t-th iteration, that is, d_i^t = w_i^t / (Σ_{j=1}^{n} w_j^t) (i = 1, ..., n). The training loss with respect to T_d suffered by TrAdaBoost through N iterations is L_d = Σ_{t=1}^{N} Σ_{i=1}^{n} d_i^t l_i^t. The training loss with respect to an instance x_i (i = 1, ..., n) suffered by TrAdaBoost through N iterations is L(x_i) = Σ_{t=1}^{N} l_i^t.

In the following, the theoretical conclusions with respect to TrAdaBoost are presented. Theorem 1 discusses the convergence property of TrAdaBoost. Theorem 2 analyzes the average weighted training loss on the diff-distribution data T_d. Theorem 3 shows the prediction error on the same-distribution data T_s. Note that TrAdaBoost optimizes the average weighted training loss on T_d and the prediction error on T_s simultaneously. Finally, Theorem 4 gives an alternative upper-bound on the generalization error on the same-distribution data.

Theorem 1 In TrAdaBoost, when the number of iterations is N, we have
\[ \frac{L_d}{N} \le \min_{1 \le i \le n} \frac{L(x_i)}{N} + \sqrt{\frac{2 \ln n}{N}} + \frac{\ln n}{N}. \]   (1)

Theorem 1 can also be found in (Freund & Schapire, 1997) together with its proof. It indicates that the average training loss (L_d / N) through N iterations on the diff-distribution training data T_d is not much (at most √(2 ln n / N) + ln n / N) larger than the average training loss of the instance whose average training loss is minimum (that is, min_{1≤i≤n} L(x_i)/N). From Equation (1), we can also find that the convergence rate of TrAdaBoost over the diff-distribution training data T_d is O(√(ln n / N)) in the worst case.
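To get a feel for the size of this slack term, the short computation below (the values of n and N are example choices, not from the paper) evaluates √(2 ln n / N) + ln n / N as N grows.

```python
# Evaluate the slack term of Equation (1), sqrt(2 ln n / N) + ln n / N,
# for an example diff-distribution set size n and increasing iteration counts N.
import math

n = 5000
for N in (10, 100, 1000, 10000):
    slack = math.sqrt(2 * math.log(n) / N) + math.log(n) / N
    print(N, round(slack, 3))   # shrinks roughly like O(sqrt(ln n / N))
```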
Theorem 2 In TrAdaBoost, p_i^t denotes the weight of the training instance x_i, which is defined as p_i^t = w_i^t / (Σ_{i=1}^{n+m} w_i^t). Then,
\[ \lim_{N \to \infty} \frac{\sum_{t=\lceil N/2 \rceil}^{N} \sum_{i=1}^{n} p_i^t l_i^t}{\lceil N/2 \rceil} = 0. \]   (2)

Theorem 2 can be derived from Theorem 1. Due to the limitation of space, we omit the proof of Theorem 2 here. Intuitively, Equation (2) indicates that the average weighted training loss suffered by TrAdaBoost on the diff-distribution training data T_d from the ⌈N/2⌉-th iteration to the N-th converges to zero. That is the reason why we vote the hypotheses h_t from the ⌈N/2⌉-th iteration to the N-th in Algorithm 1. Therefore, TrAdaBoost is able to reduce the average weighted training loss on the diff-distribution data.

Theorem 3 below shows that TrAdaBoost preserves a similar error-convergence property as the AdaBoost algorithm (Freund & Schapire, 1997).

Theorem 3 Let I = {i : h_f(x_i) ≠ c(x_i) and n+1 ≤ i ≤ n+m}. The prediction error on the same-distribution training data T_s suffered by the final hypothesis h_f is defined as ε = Pr_{x∈T_s}[h_f(x) ≠ c(x)] = |I| / m. Then,
\[ \epsilon \le 2^{\lceil N/2 \rceil} \prod_{t=\lceil N/2 \rceil}^{N} \sqrt{\epsilon_t (1 - \epsilon_t)}. \]   (3)
When ε_t < 0.5, the prediction error of the final hypothesis h_f on the same-distribution training data T_s becomes increasingly smaller after each iteration.
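As a rough numerical illustration of the bound in Equation (3), the snippet below (with assumed, constant per-round errors rather than measured ones) evaluates its right-hand side.

```python
# Evaluate the right-hand side of Equation (3),
# 2^{ceil(N/2)} * prod_{t=ceil(N/2)}^{N} sqrt(eps_t (1 - eps_t)),
# assuming a constant per-round error eps_t = eps on T_s.
import math

N = 40
half = math.ceil(N / 2)        # first voted iteration
voted = N - half + 1           # number of voted hypotheses
for eps in (0.45, 0.30, 0.10):
    bound = (2 ** half) * math.sqrt(eps * (1 - eps)) ** voted
    print(eps, bound)          # an upper bound on the training error of h_f on T_s
```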
From Theorems 2 and 3, two conclusions can be reached. First, based on Theorem 3, the final hypothesis h_f produced by TrAdaBoost increasingly reduces the error on the same-distribution training data. Second, based on Theorem 2, the weighted average training loss on the diff-distribution part gradually converges to zero. Therefore, TrAdaBoost minimizes both the error on the same-distribution training data and the weighted average loss on the diff-distribution training data simultaneously. Intuitively, it can be understood that TrAdaBoost first performs its learning on the same-distribution training data, and then chooses the most helpful diff-distribution training data, on condition that they do not result in more average training loss, as additional training data.

We also wish to understand why TrAdaBoost theoretically improves on the direct application of AdaBoost to the combined training set T. In fact, TrAdaBoost is not guaranteed to always improve on AdaBoost, since the quality of the diff-distribution training data is not certain. The goal of TrAdaBoost, shown by Theorems 2 and 3, is to reduce the weighted training error on the diff-distribution data while still preserving the properties of AdaBoost. The diff-distribution data after reweighting are used as additional data which might be helpful for training a good prediction model.

All the above analysis focused on the error on the training data and did not address any generalization issues. In Theorem 4, we give an alternative upper-bound on the generalization error on the same-distribution data.

Theorem 4 Let d_VC be the VC-dimension of the hypothesis space. The generalization error on the same-distribution data is, with high probability, at most
\[ \epsilon + O\!\left( \sqrt{\frac{N \, d_{VC}}{m}} \right). \]   (4)
Here, N is the number of iterations, m is the size of the same-distribution training data T_s, and ε is the error on the same-distribution training data from h_f.

Table 1. The descriptions of