MapReduce Inverted Index Algorithm

Uploaded by: 灰*** | IP location: Ningxia | Uploaded: 2021-07-04 | Format: DOC | Pages: 7 | Size: 258.01 KB


Document summary

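MapReduce Programming Report

Name:
Student ID:
Topic: Inverted index algorithm for the collected works of Shakespeare

1. Experiment environment

Lenovo PC
Virtual machine: VM 10.0
Operating system: CentOS 6.4
Hadoop version: Hadoop 1.2.1
JDK version: jdk-7u25
Eclipse version: eclipse-SDK-4.2.2-linux-gtk-x86_64

2. Experiment design and source code

2.1 Experiment description

Build an inverted index over the documents of the collected works of Shakespeare and write the result to a specified file.

2.2 Experiment design

(1) The InvertedIndexMapper class

This class implements the map method of the Mapper interface. The input parameter value is one line of a text file. A regular expression replaces every character that is not a letter or digit with a space, and a StringTokenizer then splits the line into words. For each word that is not a stop word, the mapper emits outkey = word + name of the file containing the word, and outvalue = 1.

```java
import java.io.IOException;
import java.util.StringTokenizer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public static class InvertedIndexMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);

    @Override
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        // Get the file name and preprocess the line
        FileSplit fileSplit = (FileSplit) context.getInputSplit();
        String fileName = fileSplit.getPath().getName();
        String line = value.toString();
        String s;
        // Use a regular expression to strip symbols that are not letters or digits
        Pattern p = Pattern.compile("\\W+");
        Matcher m = p.matcher(line);
        String line2 = m.replaceAll(" ");
        // Split the string on whitespace
        StringTokenizer itr = new StringTokenizer(line2);
        while (itr.hasMoreTokens()) {
            s = itr.nextToken().toLowerCase();
            // ls is the stop-word list filled in main(); a static list is only
            // visible here when the map task runs in the same JVM as the driver
            if (!ls.contains(s)) {
                // Combine the word and the name of the file it occurs in
                Text filename_num = new Text(s + "," + fileName);
                context.write(filename_num, one);
            }
        }
    }
}
```

(2) The InvertedIndexPartitioner class

This class is a custom Partitioner. It overrides getPartition() to choose the key used for partitioning: the key is split on the separator and only the first part (the word) is used, so that all records with the same word are sent to the same reduce task. The original listing discarded the result of super.getPartition() and always returned 0; the version below returns the computed partition.

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;

public static class InvertedIndexPartitioner extends HashPartitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // Partition on the word only, not on "word,fileName"
        Text key1 = new Text(key.toString().split(",")[0]);
        return super.getPartition(key1, value, numReduceTasks);
    }
}
```

(3) The CombineReducer class

This class runs on the map output before it is passed to reduce. It is a small reduce operation that sums the counts for each key locally, which reduces the work of the reduce phase and improves performance.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public static class CombineReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
```

(4) The InvertedIndexReducer class

This class implements the reduce method of the Reducer interface. After the combine step, the map output enters reduce with key = word + name of the file containing the word and value = the frequency of the word. Because the index has to be inverted, the output key can only be the word. The reducer therefore takes the first part of the key (the word) as the new key and merges the second part (the file name) with the original value into the new value. The output is outkey = word and outvalue = file name + word frequency.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public static class InvertedIndexReducer extends Reducer<Text, IntWritable, Text, Text> {
    // p stores the file name and frequency of every occurrence of the current word
    private final List<Text> p = new ArrayList<Text>();
    // newKey is the word currently being processed (null until the first key arrives)
    private Text newKey = null;

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        String[] parts = key.toString().split(",");
        Text key1 = new Text(parts[0]); // the word
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        if (newKey == null || !newKey.equals(key1)) {
            if (newKey != null) {
                StringBuffer all = new StringBuffer();
                for (Text t : p) {
                    all.append(t.toString());
                    all.append(";");
                }
                context.write(newKey, new Text(all.toString()));
                p.clear(); // reset p once the result for a word has been written
            }
            newKey = new Text(key1); // switch to the next word and start counting again
        }
        Text filename_num = new Text(parts[1] + sum);
        p.add(filename_num);
    }

    // Cleanup for the reduce phase: writes the result for the last word
    @Override
    public void cleanup(Context context) throws IOException, InterruptedException {
        if (newKey == null) {
            return; // no input reached this reducer
        }
        StringBuffer all = new StringBuffer();
        for (Text t : p) {
            all.append(t);
            all.append(";");
        }
        context.write(newKey, new Text(all.toString()));
    }
}
```

(5) Main program

The main program defines a Job and applies the necessary settings. The listing breaks off in the source after the map output classes are set.

```java
public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    if (otherArgs.length != 2) {
        System.err.println("Usage: wordcount <in> <out>");
        System.exit(2);
    }
    // Read the stop words from HDFS
    String uri = "hdfs://localhost:8000/user/tzj/stop_words";
    FileSystem fs = FileSystem.get(URI.create(uri), conf);
    FSDataInputStream in = fs.open(new Path(uri));
    InputStreamReader lsr = new InputStreamReader(in);
    BufferedReader buf = new BufferedReader(lsr);
    String input;
    while ((input = buf.readLine()) != null) {
        ls.add(input);
    }
    buf.close();
    System.out.println("the stop_words are:");
    Iterator<String> it = ls.iterator();
    while (it.hasNext()) {
        System.out.print(it.next() + " ");
    }
    System.out.println();
    Job job = new Job(conf, "word count");
    FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
    FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
    job.setJarByClass(InvertedIndex.class);
    job.setMapperClass(InvertedIndexMapper.class);
    // Set the partitioner class
    job.setPartitionerClass(InvertedIndexPartitioner.class);
    // Set the combiner class
    job.setCombinerClass(CombineReducer.class);
    job.setReducerClass(InvertedIndexReducer.class);
    job.setInputFormatClass(TextInputFormat.class);
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(IntWritable.class);
    // job.setOutpu... (truncated in the source)
```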
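The data flow of the job (map to "word,fileName" pairs, sum the counts, regroup by word) can be traced without a cluster. The following is a minimal single-JVM sketch, assuming Java 8 or later. The class InvertedIndexSketch, its methods countPairs and buildIndex, and the sample file names and text are hypothetical names introduced for illustration; none of them come from the report, and no Hadoop classes are used.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.StringTokenizer;
import java.util.TreeMap;

/** Hypothetical single-JVM walk-through of the inverted-index job (no Hadoop required). */
class InvertedIndexSketch {

    /** Map + combine: count every "word,fileName" pair, skipping stop words. */
    static TreeMap<String, Integer> countPairs(Map<String, String> docs, Set<String> stopWords) {
        TreeMap<String, Integer> counts = new TreeMap<>(); // sorted by key, like shuffle output
        for (Map.Entry<String, String> doc : docs.entrySet()) {
            // Replace runs of non-word characters with a space, then split on whitespace.
            String cleaned = doc.getValue().replaceAll("\\W+", " ");
            StringTokenizer itr = new StringTokenizer(cleaned);
            while (itr.hasMoreTokens()) {
                String word = itr.nextToken().toLowerCase();
                if (!stopWords.contains(word)) {
                    counts.merge(word + "," + doc.getKey(), 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    /** Reduce: regroup the "word,fileName" counts into word -> "fileName+count;" lists. */
    static Map<String, String> buildIndex(Map<String, String> docs, Set<String> stopWords) {
        Map<String, String> index = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : countPairs(docs, stopWords).entrySet()) {
            String[] parts = e.getKey().split(",", 2); // [word, fileName]
            String posting = parts[1] + e.getValue() + ";";
            index.merge(parts[0], posting, String::concat);
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, String> docs = new LinkedHashMap<>();
        docs.put("hamlet.txt", "To be, or not to be");
        docs.put("macbeth.txt", "What's done cannot be undone");
        Set<String> stopWords = new HashSet<>(Arrays.asList("to", "or"));
        System.out.println(buildIndex(docs, stopWords));
    }
}
```

The sketch keeps the report's output convention of appending the count directly to the file name and ending each entry with a semicolon. Inserting a separator between file name and count would make the postings easier to parse, but that would be a change to the reported format.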

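The reason for partitioning on the word alone can be checked in isolation: two composite keys that share a word must map to the same reduce task even though their file names differ. The sketch below is an illustration under stated assumptions. WordPartitionSketch and partitionFor are hypothetical names, and the hash is String.hashCode() rather than the hash of a Hadoop Text object, so the partition numbers will not match those of a real job.

```java
/** Hypothetical stand-alone version of the word-based partitioning rule. */
class WordPartitionSketch {

    /** Returns the reduce-task index for a composite "word,fileName" key. */
    static int partitionFor(String compositeKey, int numReduceTasks) {
        // Keep only the word, so the file name has no influence on the partition.
        String word = compositeKey.split(",", 2)[0];
        // Same formula as Hadoop's HashPartitioner, applied to String.hashCode()
        // (an assumption that keeps the sketch free of Hadoop classes).
        return (word.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int reducers = 4;
        // Both keys share the word "love", so both lines print the same index.
        System.out.println(partitionFor("love,hamlet.txt", reducers));
        System.out.println(partitionFor("love,macbeth.txt", reducers));
    }
}
```

Masking with Integer.MAX_VALUE clears the sign bit, so a negative hash code cannot produce a negative partition index.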