Machine Learning Basics
Lesson 4: Feature Extraction and Preprocessing

The examples discussed with linear regression used simple numeric explanatory variables, such as the diameter of a pizza. Many machine learning problems instead require learning from observations of categorical variables, text, or images. In this lesson you will learn basic techniques for preprocessing such data and creating feature representations of these observations. These techniques can be used with the regression models we have discussed, such as linear regression, as well as with the models we will discuss in the next lesson.

Outline:
- Extracting features from categorical variables
- Extracting features from text
- Extracting features from images
- Data normalization

Extracting features from categorical variables

Types of variables:
- Nominal: categories, states, or "names of things"; e.g., hair_color = {auburn, black, blond, brown, grey, red, white}; marital status, occupation, ID numbers, zip codes.
- Binary: a nominal attribute with only two states (0 and 1). Symmetric binary: both outcomes are equally important, e.g., gender. Asymmetric binary: the outcomes are not equally important, e.g., a medical test (positive vs. negative); by convention, the more important outcome (e.g., HIV positive) is coded as 1.
- Ordinal: the values have a meaningful order (ranking), but the magnitude between successive values is not known; e.g., size = {small, medium, large}, grades, army ranks.
- Interval: measured on a scale of equal-sized units, so the values have an order; e.g., temperature in degrees Celsius or Fahrenheit, calendar dates. There is no true zero point.
- Ratio: has an inherent zero point, so we can say that one value is a multiple of another (10 K is twice as high as 5 K); e.g., temperature in Kelvin, length, counts, monetary quantities.

One-of-K or one-hot encoding

Categorical variables are commonly encoded using one-of-K or one-hot encoding, in which the explanatory variable is represented with one binary feature for each of its possible values. For example, assume our model has a city explanatory variable that can take one of three values: New York, San Francisco, or Chapel Hill. One-hot encoding represents this variable with one binary feature for each of the three possible cities.

sklearn.feature_extraction: feature extraction

sklearn.feature_extraction.DictVectorizer transforms lists of feature-value mappings to vectors. This transformer turns lists of mappings (dict-like objects) of feature names to feature values into NumPy arrays or scipy.sparse matrices for use with scikit-learn estimators. When feature values are strings, this transformer will do a binary one-hot (aka one-of-K) coding: one boolean-valued feature is constructed for each of the possible string values that the feature can take on. For example, a feature "f" that can take on the values "ham" and "spam" will become two features in the output, one signifying "f=ham", the other "f=spam". Features that do not occur in a sample (mapping) will have a zero value in the resulting array/matrix.

Example use of DictVectorizer for extracting and vectorizing features:

>>> from sklearn.feature_extraction import DictVectorizer
>>> v = DictVectorizer(sparse=False)
>>> D = [{'foo': 1, 'bar': 2}, {'foo': 3, 'baz': 1}]
>>> X = v.fit_transform(D)
>>> X
array([[2., 0., 1.],
       [0., 1., 3.]])
>>> v.inverse_transform(X) == [{'bar': 2.0, 'foo': 1.0}, {'baz': 1.0, 'foo': 3.0}]
True
>>> v.transform({'foo': 4, 'unseen_feature': 3})
array([[0., 0., 4.]])

Note that the columns are ordered alphabetically by feature name ('bar', 'baz', 'foo'), and that a feature not seen during fitting is silently ignored by transform, as the last line shows.

Example use of DictVectorizer with a categorical variable:

>>> from sklearn.feature_extraction import DictVectorizer
>>> onehot_encoder = DictVectorizer(sparse=False)
>>> D = [{'city': 'New York'}, {'city': 'San Francisco'}, {'city': 'Chapel Hill'}]
>>> X = onehot_encoder.fit_transform(D)
>>> X
array([[0., 1., 0.],
       [0., 0., 1.],
       [1., 0., 0.]])
>>> onehot_encoder.feature_names_
['city=Chapel Hill', 'city=New York', 'city=San Francisco']

Could we represent the values of a categorical explanatory variable with a single integer feature instead? No: an encoding such as New York = 1, San Francisco = 2, Chapel Hill = 3 would impose an artificial order and magnitude on the cities, which the model would treat as meaningful.

Extracting features from text

Many machine learning problems use text as an explanatory variable. Text must be transformed to a different representation that encodes as much of its meaning as possible in a feature vector. In the following sections we will review variations of the most common representation of text that is used in machine learning: the bag-of-words model.

The bag-of-words representation

The most common representation of text is the bag-of-words model. This representation uses a multiset, or bag, that encodes the words that appear in a text; it does not encode any of the text's syntax, and it ignores the order of words and all grammar. The bag-of-words model can be viewed as an extension of one-hot encoding: it creates one feature for each word of interest in the text. It is motivated by the intuition that documents containing similar words often have similar meanings. Despite the limited information it encodes, the bag-of-words model can be used effectively for document classification and retrieval.

The sklearn.feature_extraction.text submodule gathers utilities to build feature vectors from text documents:
- feature_extraction.text.CountVectorizer: convert a collection of text documents to a matrix of token counts
- feature_extraction.text.HashingVectorizer: convert a collection of text documents to a matrix of token occurrences
- feature_extraction.text.TfidfTransformer: transform a count matrix to a normalized tf or tf-idf representation
- feature_extraction.text.TfidfVectorizer: convert a collection of raw documents to a matrix of TF-IDF features

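CountVectorizer and TfidfVectorizer are demonstrated in the examples that follow. As a hedged aside, the hashing alternative can be sketched minimally like this (the n_features value is an illustrative choice, not from the original slides; hashing trades a fixed, vocabulary-free representation for the inability to map features back to words):

>>> from sklearn.feature_extraction.text import HashingVectorizer
>>> hv = HashingVectorizer(n_features=8)                   # fixed output width, no stored vocabulary
>>> X = hv.transform(['UNC played Duke in basketball'])    # stateless, so no fit is needed
>>> X.shape
(1, 8)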
sklearn.feature_extraction.text.CountVectorizer converts a collection of text documents to a matrix of token counts. This implementation produces a sparse representation of the counts using scipy.sparse.csr_matrix. If you do not provide an a-priori dictionary and you do not use an analyzer that does some kind of feature selection, then the number of features will be equal to the vocabulary size found by analyzing the data.

>>> from sklearn.feature_extraction.text import CountVectorizer
>>> corpus = ['UNC played Duke in basketball', 'Duke lost the basketball game']
>>> vectorizer = CountVectorizer(binary=True)
>>> print(vectorizer.fit_transform(corpus).todense())
[[1 1 0 1 0 1 0 1]
 [1 1 1 0 1 0 1 0]]
>>> print(vectorizer.vocabulary_)
{'unc': 7, 'in': 3, 'the': 6, 'lost': 4, 'played': 5, 'basketball': 0, 'duke': 1, 'game': 2}

Adding a third document that shares no words with the first two:

>>> corpus = ['UNC played Duke in basketball', 'Duke lost the basketball game', 'I ate a sandwich']
>>> vectorizer = CountVectorizer(binary=True)
>>> print(vectorizer.fit_transform(corpus).todense())
[[0 1 1 0 1 0 1 0 0 1]
 [0 1 1 1 0 1 0 0 1 0]
 [1 0 0 0 0 0 0 1 0 0]]
>>> print(vectorizer.vocabulary_)
{'unc': 9, 'in': 4, 'the': 8, 'lost': 5, 'sandwich': 7, 'played': 6, 'basketball': 1, 'duke': 2, 'game': 3, 'ate': 0}

Now, our feature vectors are as follows:

UNC played Duke in basketball = [[0 1 1 0 1 0 1 0 0 1]]
Duke lost the basketball game = [[0 1 1 1 0 1 0 0 1 0]]
I ate a sandwich              = [[1 0 0 0 0 0 0 1 0 0]]

The meanings of the first two documents are more similar to each other than they are to the third document, and their corresponding feature vectors are more similar to each other than they are to the third document's feature vector when using a metric such as the Euclidean distance, d(x, y) = ||x - y||.

sklearn.metrics.pairwise.euclidean_distances (each vector is passed as a single-row 2D array, as current versions of scikit-learn expect):

>>> from sklearn.metrics.pairwise import euclidean_distances
>>> counts = [[0, 1, 1, 0, 0, 1, 0, 1], [0, 1, 1, 1, 1, 0, 0, 0], [1, 0, 0, 0, 0, 0, 1, 0]]
>>> print('Distance between 1st and 2nd documents:', euclidean_distances([counts[0]], [counts[1]]))
Distance between 1st and 2nd documents: [[2.]]
>>> print('Distance between 1st and 3rd documents:', euclidean_distances([counts[0]], [counts[2]]))
Distance between 1st and 3rd documents: [[2.44948974]]
>>> print('Distance between 2nd and 3rd documents:', euclidean_distances([counts[1]], [counts[2]]))
Distance between 2nd and 3rd documents: [[2.44948974]]

For real applications: high-dimensional feature vectors

The first problem is that high-dimensional vectors require more memory than smaller vectors. The second problem is known as the curse of dimensionality, or the Hughes effect. As the feature space's dimensionality increases, more training data is required to ensure that there are enough training instances with each combination of the feature's values. If there are insufficient training instances for a feature, the algorithm may overfit noise in the training data and fail to generalize.
Stop-word filtering

A simple dimensionality-reduction strategy is to remove words that are common to most of the documents in the corpus. These words, called stop words, include determiners such as "the", "a", and "an"; auxiliary verbs such as "do", "be", and "will"; and prepositions such as "on", "around", and "beneath". Stop words are often functional words that contribute to the document's meaning through grammar rather than through their denotations. The CountVectorizer class can filter stop words provided as the stop_words keyword argument, and it also includes a basic English stop list.

>>> from sklearn.feature_extraction.text import CountVectorizer
>>> corpus = ['UNC played Duke in basketball', 'Duke lost the basketball game', 'I ate a sandwich']
>>> vectorizer = CountVectorizer(binary=True, stop_words='english')
>>> print(vectorizer.fit_transform(corpus).todense())
[[0 1 1 0 0 1 0 1]
 [0 1 1 1 1 0 0 0]
 [1 0 0 0 0 0 1 0]]
>>> print(vectorizer.vocabulary_)
{'unc': 7, 'lost': 4, 'sandwich': 6, 'played': 5, 'basketball': 1, 'duke': 2, 'game': 3, 'ate': 0}

Stemming and lemmatization

While stop-word filtering is an easy strategy for dimensionality reduction, most stop lists contain only a few hundred words. A large corpus may still have hundreds of thousands of unique words after filtering. Two similar strategies for further reducing dimensionality are called stemming and lemmatization. Stemming extracts the stem or root form of a word, which is not necessarily a word that carries complete meaning on its own; lemmatization reduces any inflected form of a word to its general dictionary form, which does carry complete meaning. We can use the Natural Language Toolkit (NLTK) to stem and lemmatize the corpus; a short sketch follows below.

Extending bag-of-words with TF-IDF (term frequency-inverse document frequency) weights

Instead of using a binary value for each element in the feature vector, we will now use an integer that represents the number of times that the word appeared in the document. Several adjustments to the raw counts are common: the normalized term frequency, the logarithmically scaled term frequency, and the augmented term frequency (formulations are collected in the summary below).

Normalization, logarithmically scaled term frequencies, and augmented term frequencies can represent the frequencies of terms in a document while mitigating the effects of different document sizes. However, another problem remains with these representations. The feature vectors contain large weights for terms that occur frequently in a document, even if those terms occur frequently in most documents in the corpus.
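As mentioned above, a minimal NLTK sketch of stemming and lemmatization (an assumption on my part: the slides name NLTK but show no code, so the example word and the one-time WordNet download are illustrative):

>>> import nltk
>>> nltk.download('wordnet')                           # one-time download of the WordNet data
>>> from nltk.stem import PorterStemmer, WordNetLemmatizer
>>> PorterStemmer().stem('gathering')                  # a stem need not be a dictionary word
'gather'
>>> WordNetLemmatizer().lemmatize('gathering', 'v')    # lemma of the verb sense
'gather'
>>> WordNetLemmatizer().lemmatize('gathering', 'n')    # the noun sense is already its own lemma
'gathering'

Note that the lemmatizer needs the part of speech: the same surface form "gathering" yields different lemmas as a verb and as a noun, while the stemmer applies the same suffix-stripping rules regardless.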
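The formula images on the original slides did not survive extraction. As a hedged substitute, here is one standard set of definitions (common textbook formulations, not necessarily the exact variants shown in class), where f(t, d) is the raw count of term t in document d, x is d's count vector, N is the number of documents in corpus D, and df(t) is the number of documents that contain t:

\[
\begin{aligned}
\text{normalized tf:}\quad & \mathrm{tf}(t,d)=\frac{f(t,d)}{\lVert x\rVert}\\
\text{log-scaled tf:}\quad & \mathrm{tf}(t,d)=\log\bigl(f(t,d)+1\bigr)\\
\text{augmented tf:}\quad & \mathrm{tf}(t,d)=0.5+\frac{0.5\,f(t,d)}{\max\{f(w,d):w\in d\}}\\
\text{inverse document frequency:}\quad & \mathrm{idf}(t,D)=\log\frac{N}{1+\mathrm{df}(t)}\\
\text{TF-IDF:}\quad & \text{tf-idf}(t,d,D)=\mathrm{tf}(t,d)\cdot\mathrm{idf}(t,D)
\end{aligned}
\]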
Inverse document frequency (IDF)

The inverse document frequency measures how rare a term is across the corpus, so terms that appear in most documents receive small weights. A term's TF-IDF value is the product of its term frequency and its inverse document frequency (see the summary above).

The TfidfVectorizer class wraps CountVectorizer and TfidfTransformer:

>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> corpus = ['The dog ate a sandwich and I ate a sandwich', 'The wizard transfigured a sandwich']
>>> vectorizer = TfidfVectorizer(stop_words='english')
>>> print(vectorizer.fit_transform(corpus).todense())
[[0.75458397 0.37729199 0.53689271 0.         0.        ]
 [0.         0.         0.44943642 0.6316672  0.6316672 ]]
>>> print(vectorizer.vocabulary_)
{'sandwich': 2, 'dog': 1, 'transfigured': 3, 'ate': 0, 'wizard': 4}

TF-IDF features are typically fed to a conventional machine learning classifier. More recent text classification is instead based on deep learning and learned word representations; the main methods are described next, after a small sketch of the idea they share.
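A minimal numpy sketch of that shared idea, representing a document as the average of its word vectors and feeding the result to a softmax classifier (all embeddings, weights, and dimensions below are hypothetical toy values; real models learn them from data):

import numpy as np

# Hypothetical 4-dimensional word embeddings (learned in real models).
embeddings = {
    'duke':       np.array([0.9, 0.1, 0.0, 0.2]),
    'basketball': np.array([0.8, 0.2, 0.1, 0.0]),
    'sandwich':   np.array([0.0, 0.1, 0.9, 0.7]),
}

def doc_vector(tokens):
    """Average the embeddings of the document's known tokens."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    return np.mean(vecs, axis=0)

x = doc_vector(['duke', 'basketball'])   # one fixed-length document vector

# A softmax layer over the document vector. W and b are hypothetical here;
# FastText-style models learn them jointly with the embeddings.
W = np.random.randn(2, 4)
b = np.zeros(2)
logits = W @ x + b
probs = np.exp(logits) / np.sum(np.exp(logits))
print(probs)   # class probabilities for this toy two-class setup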

FastText: superimposes and averages the word and n-gram vectors of an entire document to obtain a document vector, then performs softmax multi-class classification on that vector. It involves two tricks: character-level n-gram features and hierarchical softmax classification.

Word2Vec: one of the word-embedding methods, proposed in 2013 by Mikolov's team at Google. Because Word2Vec takes context into account, it performs better than earlier embedding methods (though not as well as the methods that appeared after 2018).

BERT (Bidirectional Encoder Representations from Transformers): a word-vector model proposed by Google in October 2018 in the paper "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding"; it achieved state-of-the-art results on 11 different NLP tasks.

Extracting features from images

Computer vision is the study and design of computational artifacts that process and understand images. These artifacts sometimes employ machine learning. An overview of computer vision is far beyond the scope of this course, but in this section we will review some basic techniques used in computer vision to represent images in machine learning problems.

Extracting features from pixel intensities

The digits dataset included with scikit-learn contains grayscale images of more than 1,700 hand-written digits between zero and nine. Each image has eight pixels on a side. Each pixel is represented by an intensity value between zero and 16; white is the most intense and is indicated by zero, and black is the least intense and is indicated by 16. (The original slide showed an image of a hand-written digit from the dataset.)

A basic feature representation for an image can be constructed by reshaping the matrix into a vector by concatenating its rows together; a short sketch follows below. This approach has drawbacks: it produces large feature vectors, and it is sensitive to changes in the scale, rotation, and translation of images. Furthermore, learning from pixel intensities is itself problematic, as the model can become sensitive to changes in illumination. Modern computer vision applications frequently use either hand-engineered feature extraction methods that are applicable to many different problems, or automatically learn features without supervision using techniques such as deep learning.

Extracting points of interest as features

Humans can quickly recognize many objects without observing every attribute of the object. This intuition is motivation to create representations of only the most informative attributes of an image. These informative attributes, or points of interest, are points that are surrounded by rich textures and can be reproduced despite perturbing the image. Edges and corners are two common types of points of interest. We can use scikit-image to extract points of interest; a hedged sketch follows below. (The original slide applied this to an example figure, which is not reproduced here.)
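As referenced above, a minimal sketch of the row-concatenation representation on the scikit-learn digits dataset (the 8x8 intensity matrix becomes a 64-dimensional feature vector):

>>> from sklearn import datasets
>>> digits = datasets.load_digits()
>>> digits.images[0].shape              # each image is an 8x8 matrix of intensities
(8, 8)
>>> x = digits.images[0].reshape(-1)    # concatenate the rows into one vector
>>> x.shape
(64,)
>>> digits.target[0]                    # the digit this image depicts
0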
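Also as referenced above, a hedged scikit-image sketch of extracting corners as points of interest. Assumptions on my part: the Harris detector, its min_distance parameter, and the library's built-in camera() photograph stand in for the slide's missing code and figure.

>>> from skimage import data
>>> from skimage.feature import corner_harris, corner_peaks
>>> image = data.camera()                                  # built-in grayscale test image
>>> corners = corner_peaks(corner_harris(image), min_distance=5)
>>> corners.shape[1]                                       # (row, col) coordinates of detected corners
2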
Classic keypoint detectors:
- Harris - 1988 Harris Corner Detector
- Shi, Tomasi - 1996 Good Features to Track
- SIFT - 1999 Scale-Invariant Feature Transform (Lowe)
- SURF - 2006 Speeded-Up Robust Features

Modern keypoint detectors:
- FAST - 2006 Features from Accelerated Segment Test
- BRIEF - 2010 Binary Robust Independent Elementary Features
- ORB - 2011 Oriented FAST and Rotated BRIEF
- BRISK - 2011 Binary Robust Invariant Scalable Keypoints
- FREAK - 2012 Fast Retina Keypoint
- KAZE - 2012 KAZE

SIFT and SURF

Scale-Invariant Feature Transform (SIFT) is a method for extracting features from an image that is less sensitive to the scale, rotation, and illumination of the image than the extraction methods we have previously discussed. Each SIFT feature, or descriptor, is a vector that describes edges and corners in a region of an image. Unlike the points of interest in our previous example, SIFT also captures information about the composition of each point of interest and its surroundings.

Speeded-Up Robust Features (SURF) is another method of extracting interesting points of an image and creating descriptions that are invariant to the image's scale, orientation, and illumination. SURF can be computed more quickly than SIFT, and it is more effective at recognizing features across images that have been transformed in certain ways.

Like the extracted points of interest, the extracted SIFT (or SURF) descriptors are only the first step in creating a feature representation that could be used in a machine learning task. A hedged SIFT sketch follows below.
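A minimal sketch of computing SIFT descriptors. Assumptions on my part: the slides name no library, so OpenCV is used here (cv2.SIFT_create is available in opencv-python 4.4 and later), and the scikit-image test photograph again stands in for a real image.

>>> import cv2
>>> from skimage import data
>>> image = data.camera()                          # 8-bit grayscale test image
>>> sift = cv2.SIFT_create()
>>> keypoints, descriptors = sift.detectAndCompute(image, None)
>>> descriptors.shape[1]                           # each SIFT descriptor is a 128-dimensional vector
128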

Data normalization

Preprocessing data with sklearn: standardization / normalization

Standardization (Z-score), i.e., removing the mean and scaling to unit variance, uses the formula (X - mean) / std, computed independently for each attribute (each column). Each column has its mean subtracted and is divided by its standard deviation, so that for every attribute the data are centered around 0 with variance 1. There are two ways to do this in scikit-learn: the sklearn.preprocessing.scale() function standardizes the given data directly, while the sklearn.preprocessing.StandardScaler class additionally stores the parameters estimated on the training set (mean and variance), so the same object can later transform the test data.

>>> from sklearn import preprocessing
>>> import numpy as np
>>> X = np.array([[1., -1.,  2.],
...               [2.,  0.,  0.],
...               [0.,  1., -1.]])
>>> X_scaled = preprocessing.scale(X)
>>> X_scaled
array([[ 0.  ..., -1.22...,  1.33...],
       [ 1.22...,  0.  ..., -0.26...],
       [-1.22...,  1.22..., -1.06...]])
>>> # The mean and standard deviation of the scaled data:
>>> X_scaled.mean(axis=0)
array([0., 0., 0.])
>>> X_scaled.std(axis=0)
array([1., 1., 1.])
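The StandardScaler workflow described above, fitting on training data and then reusing the learned parameters on new data, can be sketched as follows, continuing the session (the "new data" row is illustrative):

>>> scaler = preprocessing.StandardScaler().fit(X)   # learn column means and variances
>>> scaler.mean_
array([1.        , 0.        , 0.33333333])
>>> scaler.transform(X)                              # same result as preprocessing.scale(X)
array([[ 0.  ..., -1.22...,  1.33...],
       [ 1.22...,  0.  ..., -0.26...],
       [-1.22...,  1.22..., -1.06...]])
>>> scaler.transform([[-1., 1., 0.]])                # apply the training-set parameters to new data
array([[-2.44...,  1.22..., -0.26...]])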
