Three Feature Selection Methods and Spark MLlib Invocation Examples (Scala/Java/Python)

VectorSlicer

Algorithm overview:

VectorSlicer is a transformer that takes a feature vector as input and outputs a new feature vector that is a subset of the original features. It accepts a vector column with specified indices and produces a new vector column whose values are selected through those indices. Two kinds of indices are accepted:

- Integer indices, set via setIndices().
- String indices representing feature names, set via setNames(). This requires the vector column to have an AttributeGroup, because the implementation matches on the name field of each Attribute.

Specification by integers or by strings is equally acceptable, and the two can even be used at the same time. Duplicate features are not allowed, so there can be no overlap between the selected indices and names. Note that if feature names are used, an exception is thrown when empty input attributes are encountered. The output vector orders the features with the selected indices first (in the order given), followed by the selected names (in the order given).

Example:

Suppose we have a DataFrame with a column userFeatures:

userFeatures
[0.0, 10.0, 0.5]

userFeatures is a vector column containing three user features. Suppose the first column is all zeros, so we want to drop it and keep only the last two features. We can select those two with setIndices(1, 2), producing a new features column:

userFeatures     | features
[0.0, 10.0, 0.5] | [10.0, 0.5]

Suppose the features also carry the attribute names ["f1", "f2", "f3"]; then we can select by name with setNames("f2", "f3"):

userFeatures       | features
[0.0, 10.0, 0.5]   | [10.0, 0.5]
["f1", "f2", "f3"] | ["f2", "f3"]

Invocation examples:

Scala:

import java.util.Arrays

import org.apache.spark.ml.attribute.{Attribute, AttributeGroup, NumericAttribute}
import org.apache.spark.ml.feature.VectorSlicer
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.StructType

val data = Arrays.asList(Row(Vectors.dense(-2.0, 2.3, 0.0)))

val defaultAttr = NumericAttribute.defaultAttr
val attrs = Array("f1", "f2", "f3").map(defaultAttr.withName)
val attrGroup = new AttributeGroup("userFeatures", attrs.asInstanceOf[Array[Attribute]])

val dataset = spark.createDataFrame(data, StructType(Array(attrGroup.toStructField())))

val slicer = new VectorSlicer().setInputCol("userFeatures").setOutputCol("features")

slicer.setIndices(Array(1)).setNames(Array("f3"))
// or slicer.setIndices(Array(1, 2)), or slicer.setNames(Array("f2", "f3"))

val output = slicer.transform(dataset)
println(output.select("userFeatures", "features").first())

Java:

import java.util.List;

import com.google.common.collect.Lists;
import org.apache.spark.ml.attribute.Attribute;
import org.apache.spark.ml.attribute.AttributeGroup;
import org.apache.spark.ml.attribute.NumericAttribute;
import org.apache.spark.ml.feature.VectorSlicer;
import org.apache.spark.ml.linalg.Vectors;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.types.*;

Attribute[] attrs = new Attribute[]{
  NumericAttribute.defaultAttr().withName("f1"),
  NumericAttribute.defaultAttr().withName("f2"),
  NumericAttribute.defaultAttr().withName("f3")
};
AttributeGroup group = new AttributeGroup("userFeatures", attrs);

List<Row> data = Lists.newArrayList(
  RowFactory.create(Vectors.sparse(3, new int[]{0, 1}, new double[]{-2.0, 2.3})),
  RowFactory.create(Vectors.dense(-2.0, 2.3, 0.0))
);

Dataset<Row> dataset =
  spark.createDataFrame(data, (new StructType()).add(group.toStructField()));

VectorSlicer vectorSlicer = new VectorSlicer()
  .setInputCol("userFeatures")
  .setOutputCol("features");

vectorSlicer.setIndices(new int[]{1}).setNames(new String[]{"f3"});
// or vectorSlicer.setIndices(new int[]{1, 2}), or vectorSlicer.setNames(new String[]{"f2", "f3"})

Dataset<Row> output = vectorSlicer.transform(dataset);
System.out.println(output.select("userFeatures", "features").first());

Python:

from pyspark.ml.feature import VectorSlicer
from pyspark.ml.linalg import Vectors
from pyspark.sql.types import Row

df = spark.createDataFrame([
    Row(userFeatures=Vectors.sparse(3, {0: -2.0, 1: 2.3}),),
    Row(userFeatures=Vectors.dense([-2.0, 2.3, 0.0]),)])

slicer = VectorSlicer(inputCol="userFeatures", outputCol="features", indices=[1])

output = slicer.transform(df)
output.select("userFeatures", "features").show()
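The Python example above slices by integer index only, because selecting by name requires attribute metadata on the vector column. A hedged sketch of one way to do that from PySpark: attach the metadata with Column.alias(..., metadata=...). The "ml_attr" dictionary below mirrors the internal layout an AttributeGroup serializes to; treat that layout as an assumption rather than a stable public API.

from pyspark.ml.feature import VectorSlicer
from pyspark.ml.linalg import Vectors
from pyspark.sql import Row
from pyspark.sql.functions import col

df = spark.createDataFrame([Row(userFeatures=Vectors.dense([-2.0, 2.3, 0.0]))])

# Assumption: the internal "ml_attr" metadata layout produced by an
# AttributeGroup with numeric attributes named f1, f2, f3.
meta = {"ml_attr": {
    "attrs": {"numeric": [{"idx": 0, "name": "f1"},
                          {"idx": 1, "name": "f2"},
                          {"idx": 2, "name": "f3"}]},
    "num_attrs": 3}}

# Attach the metadata to the vector column, then slice by feature name.
df = df.withColumn("userFeatures",
                   col("userFeatures").alias("userFeatures", metadata=meta))

slicer = VectorSlicer(inputCol="userFeatures", outputCol="features",
                      names=["f2", "f3"])
slicer.transform(df).select("userFeatures", "features").show()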
RFormula

Algorithm overview:

RFormula selects columns specified by an R model formula. It supports a subset of the R formula operators, including '~', '.', ':', '+' and '-'. The basic operators are:

~ separates target and terms
+ concatenates terms; "+ 0" means removing the intercept
- removes a term; "- 1" means removing the intercept
: interaction (multiplication for numeric values, or binarized categorical values)
. all columns except the target

Suppose a and b are two columns:

y ~ a + b means the model y ~ w0 + w1*a + w2*b, where w0 is the intercept and w1 and w2 are coefficients.
y ~ a + b + a:b - 1 means the model y ~ w1*a + w2*b + w3*a*b, where w1, w2 and w3 are coefficients.

RFormula produces a vector feature column and a double or string label column. If the label column is of string type, it is first transformed to double with StringIndexer. If the label column does not exist, an output label column is created from the response variable specified in the formula.

Example:

Suppose we have a DataFrame with the four columns id, country, hour and clicked:

id | country | hour | clicked
---|---------|------|--------
 7 | "US"    | 18   | 1.0
 8 | "CA"    | 12   | 0.0
 9 | "NZ"    | 15   | 0.0

If we use RFormula with the formula clicked ~ country + hour, indicating that we want to predict clicked based on country and hour, the transformation yields the following DataFrame:

id | country | hour | clicked | features         | label
---|---------|------|---------|------------------|------
 7 | "US"    | 18   | 1.0     | [0.0, 0.0, 18.0] | 1.0
 8 | "CA"    | 12   | 0.0     | [0.0, 1.0, 12.0] | 0.0
 9 | "NZ"    | 15   | 0.0     | [1.0, 0.0, 15.0] | 0.0

Invocation examples:

Scala:

import org.apache.spark.ml.feature.RFormula

val dataset = spark.createDataFrame(Seq(
  (7, "US", 18, 1.0),
  (8, "CA", 12, 0.0),
  (9, "NZ", 15, 0.0)
)).toDF("id", "country", "hour", "clicked")

val formula = new RFormula()
  .setFormula("clicked ~ country + hour")
  .setFeaturesCol("features")
  .setLabelCol("label")

val output = formula.fit(dataset).transform(dataset)
output.select("features", "label").show()

Java:

import java.util.Arrays;
import java.util.List;

import org.apache.spark.ml.feature.RFormula;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

import static org.apache.spark.sql.types.DataTypes.*;

StructType schema = createStructType(new StructField[]{
  createStructField("id", IntegerType, false),
  createStructField("country", StringType, false),
  createStructField("hour", IntegerType, false),
  createStructField("clicked", DoubleType, false)
});
List<Row> data = Arrays.asList(
  RowFactory.create(7, "US", 18, 1.0),
  RowFactory.create(8, "CA", 12, 0.0),
  RowFactory.create(9, "NZ", 15, 0.0)
);
Dataset<Row> dataset = spark.createDataFrame(data, schema);

RFormula formula = new RFormula()
  .setFormula("clicked ~ country + hour")
  .setFeaturesCol("features")
  .setLabelCol("label");

Dataset<Row> output = formula.fit(dataset).transform(dataset);
output.select("features", "label").show();

Python:

from pyspark.ml.feature import RFormula

dataset = spark.createDataFrame(
    [(7, "US", 18, 1.0),
     (8, "CA", 12, 0.0),
     (9, "NZ", 15, 0.0)],
    ["id", "country", "hour", "clicked"])

formula = RFormula(
    formula="clicked ~ country + hour",
    featuresCol="features",
    labelCol="label")

output = formula.fit(dataset).transform(dataset)
output.select("features", "label").show()
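The '.' and '-' operators listed above compose naturally. A minimal sketch, assuming they behave as described (same dataset and active SparkSession named spark as above): clicked ~ . - id expands '.' to every column except the target and then removes the id term, which here is equivalent to clicked ~ country + hour.

from pyspark.ml.feature import RFormula

dataset = spark.createDataFrame(
    [(7, "US", 18, 1.0), (8, "CA", 12, 0.0), (9, "NZ", 15, 0.0)],
    ["id", "country", "hour", "clicked"])

# "." selects all columns except the target; "- id" then drops the id
# term, so the features are built from country and hour only.
formula = RFormula(formula="clicked ~ . - id",
                   featuresCol="features", labelCol="label")
formula.fit(dataset).transform(dataset).select("features", "label").show()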
ChiSqSelector

Algorithm overview:

ChiSqSelector stands for chi-squared feature selection. It operates on labeled data with categorical features. ChiSqSelector orders features based on a chi-squared test of independence from the class, and then selects the top features the class label depends on most. This is akin to yielding the features with the most predictive power.

Example:

Suppose we have a DataFrame with the three columns id, features and clicked, where clicked is the target to be predicted:

id | features              | clicked
---|-----------------------|--------
 7 | [0.0, 0.0, 18.0, 1.0] | 1.0
 8 | [0.0, 1.0, 12.0, 0.0] | 0.0
 9 | [1.0, 0.0, 15.0, 0.1] | 0.0

If we use ChiSqSelector with numTopFeatures set to 1, then according to the label clicked the last column of features is selected as the most useful feature:

id | features              | clicked | selectedFeatures
---|-----------------------|---------|-----------------
 7 | [0.0, 0.0, 18.0, 1.0] | 1.0     | [1.0]
 8 | [0.0, 1.0, 12.0, 0.0] | 0.0     | [0.0]
 9 | [1.0, 0.0, 15.0, 0.1] | 0.0     | [0.1]

Invocation examples:

Scala:

import org.apache.spark.ml.feature.ChiSqSelector
import org.apache.spark.ml.linalg.Vectors

val data = Seq(
  (7, Vectors.dense(0.0, 0.0, 18.0, 1.0), 1.0),
  (8, Vectors.dense(0.0, 1.0, 12.0, 0.0), 0.0),
  (9, Vectors.dense(1.0, 0.0, 15.0, 0.1), 0.0)
)

val df = spark.createDataset(data).toDF("id", "features", "clicked")

val selector = new ChiSqSelector()
  .setNumTopFeatures(1)
  .setFeaturesCol("features")
  .setLabelCol("clicked")
  .setOutputCol("selectedFeatures")

val result = selector.fit(df).transform(df)
result.show()

Java:

import java.util.Arrays;
import java.util.List;

import org.apache.spark.ml.feature.ChiSqSelector;
import org.apache.spark.ml.linalg.VectorUDT;
import org.apache.spark.ml.linalg.Vectors;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

List<Row> data = Arrays.asList(
  RowFactory.create(7, Vectors.dense(0.0, 0.0, 18.0, 1.0), 1.0),
  RowFactory.create(8, Vectors.dense(0.0, 1.0, 12.0, 0.0), 0.0),
  RowFactory.create(9, Vectors.dense(1.0, 0.0, 15.0, 0.1), 0.0)
);
StructType schema = new StructType(new StructField[]{
  new StructField("id", DataTypes.IntegerType, false, Metadata.empty()),
  new StructField("features", new VectorUDT(), false, Metadata.empty()),
  new StructField("clicked", DataTypes.DoubleType, false, Metadata.empty())
});

Dataset<Row> df = spark.createDataFrame(data, schema);

ChiSqSelector selector = new ChiSqSelector()
  .setNumTopFeatures(1)
  .setFeaturesCol("features")
  .setLabelCol("clicked")
  .setOutputCol("selectedFeatures");

Dataset<Row> result = selector.fit(df).transform(df);
result.show();
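Python (a minimal counterpart mirroring the Scala example above, assuming the same active SparkSession named spark):

from pyspark.ml.feature import ChiSqSelector
from pyspark.ml.linalg import Vectors

df = spark.createDataFrame(
    [(7, Vectors.dense([0.0, 0.0, 18.0, 1.0]), 1.0),
     (8, Vectors.dense([0.0, 1.0, 12.0, 0.0]), 0.0),
     (9, Vectors.dense([1.0, 0.0, 15.0, 0.1]), 0.0)],
    ["id", "features", "clicked"])

# Keep the single feature with the highest chi-squared score against clicked.
selector = ChiSqSelector(numTopFeatures=1, featuresCol="features",
                         outputCol="selectedFeatures", labelCol="clicked")

result = selector.fit(df).transform(df)
result.show()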
