Big Data Fundamentals Course Design Report

I. Project Overview

This project uses Hive, MapReduce, and HBase on Hadoop to perform a fairly practical analysis of the Sogou 5-million-record dataset. The dataset is processed production data from the Sogou search engine; it is real and large, so it meets the data requirements of a course project on distributed computing application development. Each record has the format:

    access time \t user ID \t query keyword \t rank of the URL in the results \t sequence number of the user's click \t URL clicked by the user

The user ID is assigned automatically from the Cookie information of the browser visiting the search engine, so different queries entered in the same browser session share one user ID.

II. Task Requirements

1. Load the raw data onto the HDFS platform.
2. Split and recombine the time field of the raw data, adding year, month, day, and hour fields.
3. Load the processed data onto the HDFS platform.
4. Implement each of the following with both MapReduce and Hive:
   - total number of records
   - number of records with a non-empty query
   - number of records with duplicates removed
   - number of distinct UIDs
   - query-frequency ranking: the 50 most frequent query keywords
   - number of users who queried more than 2 times
   - proportion of users who queried more than 2 times
   - proportion of clicks whose rank is within the top 10
   - proportion of queries entered directly as URLs
   - UIDs that searched for 仙剑奇侠传, with more than 3 such queries
5. Save the result of each step of 4 to HDFS.
6. Import the files generated in 5 into an HBase table through the Java API.
7. Query the results imported in 6 with HBase shell commands.

III. Experimental Procedure

Load the raw data onto the HDFS platform.

Split and recombine the time field, adding year, month, day, and hour fields. Write a script sogou-log-extend.sh with the following content:

    #!/bin/bash
    infile=$1
    outfile=$2
    awk -F '\t' '{print $0"\t"substr($1,1,4)"年\t"substr($1,5,2)"月\t"substr($1,7,2)"日\t"substr($1,9,2)"hour"}' $infile > $outfile

Run the script on the log file. (Screenshots of the run and its output were not preserved.)

Load the processed data onto the HDFS platform:

    hadoop fs -put sogou.500w.utf8.ext /

Implement the queries below with both MR and Hive.

Hive implementation:

1. List databases: show databases;
2. Create a database: create database sogou;
3. Switch to it: use sogou;
4. List tables: show tables;
5. Create the sogou table:
       create table sogou(time string, uuid string, name string, num1 int, num2 int, url string)
       row format delimited fields terminated by '\t';
6. Load the local data into the Hive table:
       load data local inpath '/root/sogou.500w.utf8' into table sogou;
7. Inspect the table: desc sogou;

Total number of records:
    select count(*) from sogou;

Number of records with a non-empty query:
    select count(*) from sogou where name is not null and name != '';

Number of records with duplicates removed:
    select count(*) from (select * from sogou group by time, num1, num2, uuid, name, url having count(*) = 1) a;

Number of distinct UIDs:
    select count(distinct uuid) from sogou;

Query-frequency ranking, the 50 most frequent query keywords:
    select name, count(*) as pd from sogou group by name order by pd desc limit 50;

Number of users who queried more than 2 times:
    select count(a.uuid) from (select uuid, count(*) as t from sogou group by uuid having t > 2) a;

Proportion of users who queried more than 2 times:
    select count(*) from (select uuid, count(*) as t from sogou group by uuid having t > 2) a;

Proportion of clicks whose rank is within the top 10:
    select count(*) from

 sogou where num1 < 11;

MapReduce implementation (the various import statements are omitted throughout):

Total number of records:

public class MRCountAll {
    public static Integer i = 0;
    public static boolean flag = true;

    public static class CountAllMap extends Mapper<Object, Text, Text, Text> {
        @Override
        protected void map(Object key, Text value, Mapper.Context context)
                throws IOException, InterruptedException {
            i++;
        }
    }

    public static void runcount(String Inputpath, String Outpath) {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://0:9000");
        Job job = null;
        try {
            job = Job.getInstance(conf, "count");
        } catch (IOException e) { e.printStackTrace(); }
        job.setJarByClass(MRCountAll.class);
        job.setMapperClass(CountAllMap.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        try {
            FileInputFormat.addInputPath(job, new Path(Inputpath));
        } catch (IllegalArgumentException e) { e.printStackTrace(); }
          catch (IOException e) { e.printStackTrace(); }
        FileOutputFormat.setOutputPath(job, new Path(Outpath));
        try {
            job.waitForCompletion(true);
        } catch (ClassNotFoundException e) { e.printStackTrace(); }
          catch (IOException e) { e.printStackTrace(); }
          catch (InterruptedException e) { e.printStackTrace(); }
    }

    public static void main(String[] args) throws Exception {
        runcount("/sogou/data/sogou.500w.utf8", "/sogou/data/CountAll");
        System.out.println("总条数: " + i);
    }
}

Number of records with a non-empty query:

public class CountNotNull {
    public static String Str = "";
    public static int i = 0;
    public static boolean flag = true;

    public static class wyMap extends Mapper<Object, Text, Text, IntWritable> {
        @Override
        protected void map(Object key, Text value, Mapper.Context context)
                throws IOException, InterruptedException {
            String[] values = value.toString().split("\t");
            if (!values[2].equals(null) && values[2] != "") {
                context.write(new Text(values[1]), new IntWritable(1));
                i++;
            }
        }
    }

    public static void run(String inputPath, String outputPath) {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://0:9000");
        Job job = null;
        try {
            job = Job.getInstance(conf, "countnotnull");
        } catch (IOException e) { e.printStackTrace(); }
        assert job != null;
        job.setJarByClass(CountNotNull.class);
        job.setMapperClass(wyMap.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        try {
            FileInputFormat.addInputPath(job, new Path(inputPath));
        } catch (IllegalArgumentException e) { e.printStackTrace(); }
          catch (IOException e) { e.printStackTrace(); }
        try {
            FileOutputFormat.setOutputPath(job, new Path(outputPath));
            job.waitForCompletion(true);
        } catch (ClassNotFoundException e) { e.printStackTrace(); }
          catch (IOException e) { e.printStackTrace(); }
          catch (InterruptedException e)

{ e.printStackTrace(); }
    }

    public static void main(String[] args) {
        run("/sogou/data/sogou.500w.utf8", "/sogou/data/CountNotNull");
        System.out.println("非空条数: " + i);
    }
}

Number of records with duplicates removed:

public class CountNotRepeat {
    public static int i = 0;

    public static class NotRepeatMap extends Mapper<Object, Text, Text, Text> {
        @Override
        protected void map(Object key, Text value, Mapper.Context context)
                throws IOException, InterruptedException {
            String text = value.toString();
            String[] values = text.split("\t");
            String time = values[0];
            String uid = values[1];
            String name = values[2];
            String url = values[5];
            context.write(new Text(time + uid + name + url), new Text("1"));
        }
    }

    public static class NotRepeatReduc extends Reducer<Text, Text, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Reducer.Context context)
                throws IOException, InterruptedException {
            i++;
            context.write(new Text(key.toString()), new IntWritable(i));
        }
    }

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://0:9000");
        Job job = null;
        try {
            job = Job.getInstance(conf, "countnotnull");
        } catch (IOException e) { e.printStackTrace(); }
        assert job != null;
        job.setJarByClass(CountNotRepeat.class);
        job.setMapperClass(NotRepeatMap.class);
        job.setReducerClass(NotRepeatReduc.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        try {
            FileInputFormat.addInputPath(job, new Path("/sogou/data/sogou.500w.utf8"));
        } catch (IllegalArgumentException e) { e.printStackTrace(); }

          catch (IOException e) { e.printStackTrace(); }
        try {
            FileOutputFormat.setOutputPath(job, new Path("/sogou/data/CountNotRepeat"));
            job.waitForCompletion(true);
        } catch (ClassNotFoundException e) { e.printStackTrace(); }
          catch (IOException e) { e.printStackTrace(); }
          catch (InterruptedException e) { e.printStackTrace(); }
        System.out.println("无重复总条数为: " + i);
    }
}

Number of distinct UIDs:

public class CountNotMoreUid {
    public static int i = 0;

    public static class UidMap extends Mapper<Object, Text, Text, Text> {
        @Override
        protected void map(Object key, Text value, Mapper.Context context)
                throws IOException, InterruptedException {
            String text = value.toString();
            String[] values = text.split("\t");
            String uid = values[1];
            context.write(new Text(uid), new Text("1"));
        }
    }

    public static class UidReduc extends Reducer<Text, Text, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Reducer.Context context)
                throws IOException, InterruptedException {
            i++;
            context.write(new Text(key.toString()), new IntWritable(i));
        }
    }

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://0:9000");
        Job job = null;
        try {
            job = Job.getInstance(conf, "countnotnull");
        } catch (IOException e) { e.printStackTrace(); }
        assert job != null;
        job.setJarByClass(CountNotNull.class);
        job.setMapperClass(UidMap.class);
        job.setReducerClass(UidReduc.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        try {
            FileInputFormat.addInputPath(job, new Path("/sogou/data/sogou.500w.utf8"));
        } catch (IllegalArgumentException e) { e.printStackTrace(); }
          catch (IOException e) { e.printStackTrace(); }
        try {
            FileOutputFormat.setOutputPath(job, new Path("/sogou/data/CountNotMoreUid"));
            job.waitForCompletion(true);
        } catch (ClassNotFoundException e) { e.printStackTrace(); }
          catch (IOException e)

{ e.printStackTrace(); }
          catch (InterruptedException e) { e.printStackTrace(); }
        System.out.println("独立UID条数: " + i);
    }
}

Query-frequency ranking, the 50 most frequent query keywords:

public class CountTop50 {
    public static class TopMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        Text text = new Text();
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] line = value.toString().split("\t");
            String keys = line[2];
            text.set(keys);
            context.write(text, new LongWritable(1));
        }
    }

    public static class TopReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
        Text text = new Text();
        TreeMap<Integer, String> map = new TreeMap<Integer, String>();
        @Override
        protected void reduce(Text key, Iterable<LongWritable> value, Context context)
                throws IOException, InterruptedException {
            int sum = 0;  // number of occurrences of key
            for (LongWritable ltext : value) {
                sum += ltext.get();
            }
            map.put(sum, key.toString());
            // keep only the top 50 entries
            if (map.size() > 50) {
                map.remove(map.firstKey());
            }
        }
        @Override
        protected void cleanup(Context context)
                throws IOException, InterruptedException {
            for (Integer count : map.keySet()) {
                context.write(new Text(map.get(count)), new LongWritable(count));
            }
        }
    }

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://0:9000");
        Job job = Job.getInstance(conf, "count");
        job.setJarByClass(CountTop50.class);
        job.setJobName("Five");
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        job.setMapperClass(TopMapper.class);
        job.setReducerClass(TopReducer.class);
        FileInputFormat.addInputPath(job, new Path("/sogou/data/sogou.500w.utf8"));
        FileOutputFormat.setOutputPath(job, new Path("/sogou/data/CountTop50"));
        job.waitForCompletion(true);
    }
}

Number of users who queried more than 2 times:

public class CountQueriesGreater2 {
    public static int total = 0;

    public static class MyMaper extends Mapper<Object, Text, Text, IntWritable> {
        protected void map(Object key, Text value, Mapper.Context context)
                throws IOException, InterruptedException {
            String[] str = value.toString().split("\t");
            Text word;
            IntWritable one = new IntWritable(1);
            word = new Text(str[1]);
            context.write(word, one);
        }
    }

    public static class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text arg0, Iterable<IntWritable> arg1, Reducer.Context arg2)
                throws IOException, InterruptedException {
            // arg0 is a uid, arg1 the corresponding counts
            int sum = 0;
            for (IntWritable i : arg1) {
                sum += i.get();
            }
            if (sum > 2) {
                total = total + 1;
            }
            // arg2.write(arg0, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://0:9000");
        // 1. Instantiate a Job
        Job job = Job.getInstance(conf, "six");
        // 2. Set the mapper class
        job.setMapperClass(MyMaper.class);
        // 3. Set the combiner class (optional)
        // job.setCombinerClass(MyReducer.class);
        // 4. Set the reducer class
        job.setReducerClass(MyReducer.class);
        // 5. Set the output key type
        job.setOutputKeyClass(Text.class);
        // 6. Set the output value type
        job.setOutputValueClass(IntWritable.class);
        // Set the class used to locate the job's jar
        job.setJarByClass(CountQueriesGreater2.class);
        // 7. Set the input path
        FileInputFormat.addInputPath(job, new Path("/sogou/data/sogou.500w.utf8"));
        // 8. Set the output path
        FileOutputFormat.setOutputPath(job,

 new Path("/sogou/data/CountQueriesGreater2"));
        // 9. Run the job
        job.waitForCompletion(true);
        System.out.println("查询次数大于2次的用户总数: " + total + "条");
    }
}

Proportion of users who queried more than 2 times:

public class CountQueriesGreaterPro {
    public static int total1 = 0;
    public static int total2 = 0;

    public static class MyMaper extends Mapper<Object, Text, Text, IntWritable> {
        @Override
        protected void map(Object key, Text value, Mapper.Context context)
                throws IOException, InterruptedException {
            total2++;
            String[] str = value.toString().split("\t");
            Text word;
            IntWritable one = new IntWritable(1);
            word = new Text(str[1]);
            context.write(word, one);
            // after this phase, each uid is mapped to a value of 1
        }
    }

    public static class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text arg0, Iterable<IntWritable> arg1, Reducer.Context arg2)
                throws IOException, InterruptedException {
            // arg0 is a uid, arg1 the corresponding counts
            int sum = 0;
            for (IntWritable i : arg1) {
                sum += i.get();
            }
            if (sum > 2) {
                total1++;
            }
            arg2.write(arg0, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        System.out.println("seven begin");
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://0:9000");
        // 1. Instantiate a Job
        Job job = Job.getInstance(conf, "seven");
        // 2. Set the mapper class
        job.setMapperClass(MyMaper.class);

        // 3. Set the combiner class (optional)
        // job.setCombinerClass(MyReducer.class);
        // 4. Set the reducer class
        job.setReducerClass(MyReducer.class);
        // 5. Set the output key type
        job.setOutputKeyClass(Text.class);
        // 6. Set the output value type
        job.setOutputValueClass(IntWritable.class);
        // Set the class used to locate the job's jar
        job.setJarByClass(CountQueriesGreaterPro.class);
        // 7. Set the input path
        FileInputFormat.addInputPath(job, new Path("/sogou/data/sogou.500w.utf8"));
        // 8. Set the output path
        FileOutputFormat.setOutputPath(job, new Path("/sogou/data/CountQueriesGreaterPro"));
        // 9. Run the job
        job.waitForCompletion(true);
        System.out.println("total1=" + total1 + "\ttotal2=" + total2);
        float percentage = (float) total1 / (float) total2;
        System.out.println("查询次数大于2次的用户占比为: " + percentage * 100 + "%");
        System.out.println("over");
    }
}

Proportion of clicks whose rank is within the top 10:

public class CountRank {
    public static int sum1 = 0;
    public static int sum2 = 0;

    public static class MyMapper extends Mapper<Object, Text, Text, Text> {
        @Override
        protected void map(Object key, Text value, Mapper.Context context

) throws IOException, InterruptedException {
            sum2++;
            String[] str = value.toString().split("\t");
            int rank = Integer.parseInt(str[3]);
            if (rank < 11) {
                sum1 = sum1 + 1;
            }
        }
    }

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://0:9000");
        Job job = Job.getInstance(conf, "eight");
        job.setMapperClass(MyMapper.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.setJarByClass(CountRank.class);
        FileInputFormat.addInputPath(job, new Path("/sogou/data/sogou.500w.utf8"));
        FileOutputFormat.setOutputPath(job, new Path("/sogou/data/CountRank"));
        job.waitForCompletion(true);
        System.out.println("sum1=" + sum1 + "\tsum2=" + sum2);
        float percentage = (float) sum1 / (float) sum2;
        System.out.println("Rank在10以内的点击次数占比: " + percentage * 100 + "%");
    }
}

Proportion of queries entered directly as URLs:

public class CountURL {
    public static int sum1 = 0;
    public static int sum2 = 0;

    public static class MyMapper extends Mapper<Object, Text, Text, Text> {
        @Override
        protected void map(Object key, Text value, Mapper.Context context)
                throws IOException, InterruptedException {
            String[] str = value.toString().split("\t");
            // the URL-matching regex was lost from the source document
            Pattern p = Pattern.compile("");
            Matcher matcher = p.matcher(str[2]);
            matcher.find();
            try {
                if (matcher.group() != null) {
                    sum1++;
                }
                sum2++;
            } catch (Exception e) {
                sum2++;
            }
        }
    }

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://0:9000");
        Job job = Job.getInstance(conf

, "nine");
        job.setMapperClass(MyMapper.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.setJarByClass(CountURL.class);
        FileInputFormat.addInputPath(job, new Path("/sogou/data/sogou.500w.utf8"));
        FileOutputFormat.setOutputPath(job, new Path("/sogou/data/CountURL"));
        job.waitForCompletion(true);
        System.out.println("sum1=" + sum1 + "\tsum2=" + sum2);
        float percentage = (float) sum1 / (float) sum2;
        System.out.println("直接用url查询的用户占比: " + percentage * 100 + "%");
    }
}

UIDs that searched for 仙剑奇侠传, with more than 3 such queries:

public class CountUidGreater3 {
    public static String Str = "";
    public static int i = 0;

    public static class Map extends Mapper<Object, Text, Text, IntWritable> {
        @Override
        protected void map(Object key, Text value, Mapper.Context context)
                throws IOException, InterruptedException {
            String[] values = value.toString().split("\t");
            String pattern = "仙剑奇侠传";
            if (values[2].equals(pattern)) {
                context.write(new Text(values[1]), new IntWritable(1));
            }
        }
    }

    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> value, Reducer.Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : value) {
                sum = sum + v.get();
            }
            if (sum > 3) {
                Str = Str + key.toString() + "\n";
                i++;
            }
        }
    }

    public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://0:9000");
        Job job = null;
        try {
            job = Job.getInstance(conf, "count");
        } catch (IOException e) { e.printStackTrace(); }
        job.setJarByClass(CountUidGreater3.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        try {
            FileInputFormat.addInputPath(job, new Path("/sogou/data/sogou.500w.utf8"));
        } catch (IllegalArgumentException e) { e.printStackTrace(); }
          catch (IOException e) { e.printStackTrace(); }
        try {
            FileOutputFormat.setOutputPath(job, new Path("/sogou/data/CountUidGreater3"));
            job.waitForCompletion(true);
        } catch (ClassNotFoundException e) { e.printStackTrace(); }
          catch (IOException e) { e.printStackTrace(); }
          catch (InterruptedException e) { e.printStackTrace(); }
        System.out.println("i: " + i);
        System.out.println(Str);
    }
}

Save the result of each step of 4 to HDFS. This can be done with Hive's INSERT OVERWRITE DIRECTORY statement. (The example in the original is a screenshot and was not preserved.)

Import the files generated in 5 into an HBase table through the Java API:

public class HBaseImport {
    // name of the table the reducer writes to
    private static String tableName = "test";
    // initialize the connection
    static Configuration conf = null;
    static {
        conf = HBaseConfiguration.create();
        conf.set("hbase.rootdir", "hdfs://0:9000/hbase");
        conf.set("hbase.master", "hdfs://0:60000");
        conf.set("hbase.zookeeper.property.clientPort", "2181");
        conf.set("hbase.zookeeper.quorum", "master,slave1,slave2");
        conf.set(TableOutputFormat.OUTPUT_TABLE, tableName);
    }

    public static class BatchMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
        protected void map(LongWritable key, Text value, Mapper.Context context)
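The source text is cut off at this point, before the remainder of the HBaseImport job and the step-7 verification. As a sketch of step 7, assuming the results were imported into the table named test configured above (the exact row keys and column family are not known from this excerpt), an HBase shell session might look like:

```
hbase shell
list                        # confirm that the 'test' table exists
scan 'test', {LIMIT => 10}  # show the first 10 imported rows
count 'test'                # count the rows imported from the result files
```

These are standard HBase shell commands; scan with a LIMIT keeps the output manageable on a table of this size.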
