Foundations of Machine Learning
Linear Model for Classification

Outline
- Logistic regression (also known as log-odds regression)
- Tuning models with grid search
- Multi-class classification
- The class-imbalance problem
- Multi-label classification

Logistic Regression

In the lesson on linear regression, we discussed simple linear regression, multiple linear regression, and polynomial regression. These models are special cases of the generalized linear model, a flexible framework that requires fewer assumptions than ordinary linear regression. In this lesson, we will discuss some of these assumptions as they relate to another special case of the generalized linear model called logistic regression.

Logistic regression is a method for classifying data into discrete outcomes. A classification problem is similar to a regression problem, except that the values to be predicted are discrete rather than continuous. For example, we can use logistic regression to classify an email as spam or not spam. Can we apply ordinary regression analysis directly to a classification problem?

Binary classification with logistic regression

In logistic regression, the response variable describes the probability that the outcome is the positive case. If the response variable equals or exceeds a discrimination threshold, the positive class is predicted; otherwise, the negative class is predicted. The response variable is modeled as a function of a linear combination of the explanatory variables using the logistic function.

Cost function

We cannot use the same cost function in logistic regression as in linear regression, because its output would oscillate with many local minima; in other words, it would not be a convex function. Because y ∈ {0, 1}, the objective function can be simplified into a single expression, as shown below.
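The cost-function formulas on the original slides are not included in the extracted text; for reference, a standard form of the logistic model and its simplified cross-entropy cost (with hypothesis h_theta and m training examples) is:

h_\theta(x) = \frac{1}{1 + e^{-\theta^{T}x}}

J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[\, y^{(i)}\log h_\theta\!\left(x^{(i)}\right) + \left(1 - y^{(i)}\right)\log\!\left(1 - h_\theta\!\left(x^{(i)}\right)\right) \right]

The single expression works because when y^{(i)} = 1 only the first term is non-zero and when y^{(i)} = 0 only the second term is; this J(\theta) is convex in \theta.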
Regularization to prevent overfitting

We can likewise apply regularization to logistic regression to address overfitting, by adding a penalty on the model weights to the cost function.

sklearn.linear_model.LogisticRegression

class sklearn.linear_model.LogisticRegression(penalty='l2', dual=False, tol=0.0001, C=1.0, fit_intercept=True, intercept_scaling=1, class_weight=None, random_state=None, solver='lbfgs', max_iter=100, multi_class='warn', verbose=0, warm_start=False, n_jobs=None)
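As a quick illustration of the constructor above, the following minimal sketch (not taken from the slides; the synthetic dataset from make_classification is used purely for illustration) fits LogisticRegression with several values of C, the inverse of the regularization strength mentioned above:

# Minimal sketch: smaller C means stronger L2 regularization on the weights.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for C in (0.01, 1.0, 100.0):
    clf = LogisticRegression(C=C, solver='lbfgs', max_iter=1000)
    clf.fit(X_train, y_train)
    print('C=%g  test accuracy=%.3f' % (C, clf.score(X_test, y_test)))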
"/../data/SMSSpamCollection.txt"X,y=[],[]withopen(file_name,'r',encoding='UTF-8')asfile:line=file.readline()whileline:d=line.split("\t")X.append(d[1])y.append(d[0])line=file.readline()
Multi-class classification

The goal of multi-class classification is to assign an instance to one of a set of classes. scikit-learn uses one-vs.-all (one-vs.-the-rest) or multinomial strategies to support multi-class classification. One-vs.-all classification uses one binary classifier for each of the possible classes; the class predicted with the greatest confidence is assigned to the instance. LogisticRegression supports multi-class classification using the one-versus-all strategy out of the box.

Decomposition strategies

The three classic strategies for decomposing a multi-class problem into binary problems are "one vs. one" (OvO), "one vs. rest" (OvR), and "many vs. many" (MvM).

OvO pairs the N classes two by two, producing N(N-1)/2 binary classification tasks. For example, to distinguish classes Ci and Cj, OvO trains a classifier that treats the Ci examples in the training set D as positive and the Cj examples as negative. At test time, a new sample is submitted to all of the classifiers, yielding N(N-1)/2 predictions, and the final result is decided by voting: the class predicted most often is taken as the final classification.

OvR trains N classifiers, each time treating the examples of one class as positive and the examples of all other classes as negative. At test time, if exactly one classifier predicts positive, the corresponding class label is the final result; if several classifiers predict positive, the prediction confidence of each classifier is usually considered and the class with the highest confidence is chosen.

MvM treats several classes as positive and several other classes as negative each time; OvO and OvR are clearly special cases of MvM. The positive and negative groupings in MvM must be specially designed and cannot be chosen arbitrarily. The most common MvM technique is Error-Correcting Output Codes (ECOC), which brings coding ideas into class decomposition and tries to make the decoding step tolerant to errors.

Multi-class support in sklearn.linear_model.LogisticRegression

multi_class : str, {'ovr', 'multinomial', 'auto'}, default: 'ovr'. If the option chosen is 'ovr', then a binary problem is fit for each label. For 'multinomial' the loss minimised is the multinomial loss fit across the entire probability distribution, even when the data is binary. 'multinomial' is unavailable when solver='liblinear'. 'auto' selects 'ovr' if the data is binary, or if solver='liblinear', and otherwise selects 'multinomial'.

sklearn.multiclass.OneVsRestClassifier(estimator, n_jobs=None)
One-vs-the-rest (OvR) multiclass/multilabel strategy. Also known as one-vs-all, this strategy consists in fitting one classifier per class. This strategy can also be used for multilabel learning, where a classifier is used to predict multiple labels per instance, by fitting on a 2-d matrix in which cell [i, j] is 1 if sample i has label j and 0 otherwise.
https://scikit-learn.org/stable/modules/generated/sklearn.multiclass.OneVsRestClassifier.html

sklearn.multiclass.OneVsOneClassifier(estimator, n_jobs=None)
One-vs-one multiclass strategy. This strategy consists in fitting one classifier per pair of classes. At prediction time, the class which received the most votes is selected. Since it requires fitting n_classes * (n_classes - 1) / 2 classifiers, this method is usually slower than one-vs-the-rest, due to its O(n_classes^2) complexity. However, it may be advantageous for algorithms such as kernel algorithms which do not scale well with n_samples, because each individual learning problem only involves a small subset of the data, whereas with one-vs-the-rest the complete dataset is used n_classes times.
https://scikit-learn.org/stable/modules/generated/sklearn.multiclass.OneVsOneClassifier.html

sklearn.multiclass.OutputCodeClassifier
(Error-correcting) output-code multiclass strategy. Output-code based strategies represent each class with a binary code (an array of 0s and 1s). At fitting time, one binary classifier per bit in the code book is fitted. At prediction time, the classifiers are used to project new points into the class space and the class closest to the points is chosen. The main advantage of these strategies is that the number of classifiers used can be controlled by the user, either to compress the model (0 < code_size < 1) or to make the model more robust to errors (code_size > 1). See the documentation for more details.
https://scikit-learn.org/stable/modules/generated/sklearn.multiclass.OutputCodeClassifier.html
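As an illustration of the three wrapper classes above, the following minimal sketch (not from the slides; the Iris dataset and the code_size value are chosen only as a convenient example) wraps a binary LogisticRegression with each strategy:

# Minimal sketch: OvR, OvO and ECOC strategies around a LogisticRegression base estimator.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multiclass import (OneVsRestClassifier, OneVsOneClassifier,
                                OutputCodeClassifier)

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

strategies = {
    'OvR': OneVsRestClassifier(LogisticRegression(max_iter=1000)),
    'OvO': OneVsOneClassifier(LogisticRegression(max_iter=1000)),
    'ECOC': OutputCodeClassifier(LogisticRegression(max_iter=1000),
                                 code_size=2, random_state=0),
}
for name, clf in strategies.items():
    clf.fit(X_train, y_train)
    print('%s accuracy: %.3f' % (name, clf.score(X_test, y_test)))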
The class-imbalance problem

Class imbalance refers to classification tasks in which the numbers of training examples for the different classes differ greatly. Without loss of generality, assume that positive examples are few and negative examples are many; in real-world classification tasks we encounter this situation frequently. There are three main remedies. The first is to "undersample" the negative examples in the training set, i.e. remove some negative examples so that the numbers of positive and negative examples become close, and then train. The second is to "oversample" the positive examples in the training set, i.e. add positive examples until the two classes are roughly balanced, and then train. The third is to train on the original training set directly, but to embed the rescaling rule shown below into the decision process when the trained classifier makes predictions; this is called "threshold-moving".
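The rescaling rule itself is not present in the extracted slide text; a commonly used form (with m^{+} positive and m^{-} negative training examples, and y the classifier's predicted probability of the positive class) is:

\frac{y'}{1 - y'} = \frac{y}{1 - y} \times \frac{m^{-}}{m^{+}}

The positive class is then predicted when the rescaled odds y'/(1 - y') exceed 1.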
In scikit-learn there are some imbalance-correction techniques, which vary according to which learning algorithm you are using. Some estimators, such as SVMs and logistic regression, have a class_weight parameter: if you instantiate an SVC with this parameter set to 'balanced' (formerly 'auto'), it will weight the examples of each class proportionally to the inverse of the class frequency. There is, however, no general preprocessing tool for this purpose in scikit-learn itself.

The imbalanced-learn package (https://github.com/scikit-learn-contrib/imbalanced-learn) contains many algorithms in the following categories, including SMOTE:
- Under-sampling the majority class(es).
- Over-sampling the minority class.
- Combining over- and under-sampling.
- Creating ensembles of balanced sets.

Synthetic Minority Oversampling Technique (SMOTE)

SMOTE is an improvement on random oversampling. Because random oversampling simply duplicates existing samples to enlarge the minority class, it easily causes the model to overfit: the information the model learns becomes too specific and not general enough. The SMOTE algorithm proceeds as follows. For each sample x in the minority class, compute its Euclidean distance to all samples in the minority set Smin to obtain its k nearest neighbours. Based on the imbalance ratio, set a sampling rate to determine the oversampling multiple N; for each minority sample x, randomly choose several samples from its k nearest neighbours, and let xn denote a chosen neighbour. For each randomly chosen neighbour xn, construct a new sample from the original sample according to
xnew = x + rand(0, 1) * |x - xn|

Multi-label classification and problem transformation

Problem transformation methods are techniques that cast the original multi-label problem as a set of single-label classification problems: either convert each set of labels encountered in the training data to a single label, or train one binary classifier for each of the labels in the training set. As noted above, sklearn.multiclass.OneVsRestClassifier also supports this multi-label setting by fitting on a 2-d indicator matrix in which cell [i, j] is 1 if sample i has label j and 0 otherwise.
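The following minimal sketch (not from the slides) applies that one-classifier-per-label transformation in practice; the synthetic dataset from make_multilabel_classification and the chosen parameter values are assumptions made purely for illustration:

# Minimal sketch: multi-label classification via OneVsRestClassifier.
# Each column of Y is one label; one LogisticRegression is fitted per label.
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

X, Y = make_multilabel_classification(n_samples=500, n_classes=5, random_state=0)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, random_state=0)

clf = OneVsRestClassifier(LogisticRegression(max_iter=1000))
clf.fit(X_train, Y_train)          # Y_train is a 2-d binary indicator matrix
Y_pred = clf.predict(X_test)       # predictions are also an indicator matrix
print('Micro-averaged F1: %.3f' % f1_score(Y_test, Y_pred, average='micro'))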