Large-Scale Data Sorting with Hadoop: a Hadoop TeraSort Benchmark Experiment

Team lead: Wan Hu. Members: Niu Qingya, Song Simeng, Wen Tao, Hu Haishen.
Date: November 6, 2011, 23:14:21

An analysis of Hadoop TeraSort itself will be covered separately in another article, or by Han Xuhong's group. To better understand the sorting program in the Hadoop examples, we ran TeraSort experiments in a Hadoop environment. Because we were running inside a virtual machine, we settled on 100 MB of generated test data. We had first tried sorting 1 GB, twice, but each time the machine froze during the sort phase; the first run had still not finished by the time we got back from dinner. Sorting 100 MB ran to completion.

References:
- Hadoop TeraSort benchmark experiment: /zklth/article/details/6295517
- Hadoop from a tester's perspective: TeraSort: /leafy1980/article/details/6633828

Related material (not examined in detail):
- Scalability testing of Hadoop MapReduce: /a/20100901/278934.html
- Implementing Hadoop Map/Reduce TeraSort with MPI: /166546157.html
- Analysis of the TeraSort algorithm in Hadoop: /mapreduce/hadoop-terasort-analyse/
- Hadoop 1 TB sort (terasort): /dtzw/blog/item/cffc8e1830f908b94bedbc12.html
- Sort Benchmark: /
- Trie tree: /cherish_yimi/archive/2009/10/12/1581666.html

Runtime environment:
- VMware virtual machine, Ubuntu 10.10
- java version 1.7.0; Java(TM) SE Runtime Environment (build 1.7.0-b147); Java HotSpot(TM) Client VM (build 21.0-b17, mixed mode)
- Hadoop installed under /home/apple/hadoop-/

Below are the commands entered and the output produced over the whole TeraSort run. (In the original document, terminal prompts and typed commands were shown in orange, explanatory notes in blue, and Hadoop output in the default color.)

The complete command sequence was:

cd hadoop-/
bin/stop-all.sh
bin/hadoop namenode -format
bin/start-all.sh
bin/hadoop jar hadoop-examples-.jar teragen 100000 terasort/100000-input
bin/hadoop fs -ls /user/apple/terasort/100000-input
bin/hadoop jar hadoop-examples-.jar teragen 10 terasort/100000-input2
bin/hadoop jar hadoop-examples-.jar teragen 1000000 terasort/100M-input
bin/hadoop jar hadoop-examples-.jar terasort terasort/100M-input terasort/100M-output
bin/hadoop fs -ls terasort/100M-output
bin/hadoop jar hadoop-examples-.jar teravalidate terasort/100M-output terasort/100M-validate

The run, with notes:

Hadoop had already been running, so to avoid any leftover state causing odd problems, we first stopped it, reformatted the namenode, and started it again.

apple@ubuntu:~/hadoop-$ bin/stop-all.sh
no jobtracker to stop
localhost: no tasktracker to stop
no namenode to stop
localhost: no datanode to stop
localhost: no secondarynamenode to stop

apple@ubuntu:~/hadoop-$ bin/hadoop namenode -format
11/11/06 04:15:10 INFO namenode.NameNode: STARTUP_MSG: /*
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = ubuntu/
STARTUP_MSG: args = -format
STARTUP_MSG: version =
STARTUP_MSG: build = /repos/asf/hadoop/common/branches/branch-0.20-security-203 -r 1099333; compiled by oom on Wed May 4 07:57:50 PDT 2011
*/
11/11/06 04:15:10 INFO util.GSet: VM type = 32-bit
11/11/06 04:15:10 INFO util.GSet: 2% max memory = 19.33375 MB
11/11/06 04:15:10 INFO util.GSet: capacity = 2^22 = 4194304 entries
11/11/06 04:15:10 INFO util.GSet: recommended=4194304, actual=4194304
11/11/06 04:15:11 INFO namenode.FSNamesystem: fsOwner=apple
11/11/06 04:15:11 INFO namenode.FSNamesystem: supergroup=supergroup
11/11/06 04:15:11 INFO namenode.FSNamesystem: isPermissionEnabled=true
11/11/06 04:15:11 INFO namenode.FSNamesystem: dfs.block.invalidate.limit=100
11/11/06 04:15:11 INFO namenode.FSNamesystem: isAccessTokenEnabled=false accessKeyUpdateInterval=0 min(s), accessTokenLifetime=0 min(s)
11/11/06 04:15:11 INFO namenode.NameNode: Caching file names occuring more than 10 times
11/11/06 04:15:11 INFO common.Storage: Image file of size 111 saved in 0 seconds.
11/11/06 04:15:11 INFO common.Storage: Storage directory /tmp/hadoop-apple/dfs/name has been successfully formatted.
11/11/06 04:15:11 INFO namenode.NameNode: SHUTDOWN_MSG: /*
SHUTDOWN_MSG: Shutting down NameNode at ubuntu/
*/

apple@ubuntu:~/hadoop-$ bin/start-all.sh
starting namenode, logging to /home/apple/hadoop-/bin/./logs/hadoop-apple-namenode-ubuntu.out
localhost: starting datanode, logging to /home/apple/hadoop-/bin/./logs/hadoop-apple-datanode-ubuntu.out
localhost: starting secondarynamenode, logging to /home/apple/hadoop-/bin/./logs/hadoop-apple-secondarynamenode-ubuntu.out
starting jobtracker, logging to /home/apple/hadoop-/bin/./logs/hadoop-apple-jobtracker-ubuntu.out
localhost: starting tasktracker, logging to /home/apple/hadoop-/bin/./logs/hadoop-apple-tasktracker-ubuntu.out
Generating the sort input with TeraGen:

(1) The number after teragen is a row count. Each row is 100 bytes, so to generate 1 TB of data the argument would be 1 TB / 100 = 10000000000 (ten zeros). A true 100 MB input therefore needs 1000000 rows; the first run below passes 100000, which yields only 10 MB.

(2) The terasort path is a directory in the distributed file system; in our setup it is /user/apple/terasort, and Hadoop creates it automatically (thanks to Shen Yan for pointing this out). The 100000-input directory name can be anything; to keep the runs apart, we named our data directories 100000-input, 100000-input2, 100M-input, and 100M-output.

apple@ubuntu:~/hadoop-$ bin/hadoop jar hadoop-examples-.jar teragen 100000 terasort/100000-input
Generating 100000 using 2 maps with step of 50000
11/11/06 04:33:37 INFO mapred.JobClient: Running job: job_201111060257_0017
11/11/06 04:33:38 INFO mapred.JobClient: map 0% reduce 0%
11/11/06 04:34:32 INFO mapred.JobClient: map 50% reduce 0%
11/11/06 04:34:38 INFO mapred.JobClient: map 100% reduce 0%
11/11/06 04:34:47 INFO mapred.JobClient: Job complete: job_201111060257_0017
11/11/06 04:34:47 INFO mapred.JobClient: Counters: 15
11/11/06 04:34:47 INFO mapred.JobClient: Job Counters
11/11/06 04:34:47 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=72824
11/11/06 04:34:47 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
11/11/06 04:34:47 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
11/11/06 04:34:47 INFO mapred.JobClient: Launched map tasks=2
11/11/06 04:34:47 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=0
11/11/06 04:34:47 INFO mapred.JobClient: File Input Format Counters
11/11/06 04:34:47 INFO mapred.JobClient: Bytes Read=0
11/11/06 04:34:47 INFO mapred.JobClient: File Output Format Counters
11/11/06 04:34:47 INFO mapred.JobClient: Bytes Written=10000000
11/11/06 04:34:47 INFO mapred.JobClient: FileSystemCounters
11/11/06 04:34:47 INFO mapred.JobClient: HDFS_BYTES_READ=164
11/11/06 04:34:47 INFO mapred.JobClient: FILE_BYTES_WRITTEN=41782
11/11/06 04:34:47 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=10000000
11/11/06 04:34:47 INFO mapred.JobClient: Map-Reduce Framework
11/11/06 04:34:47 INFO mapred.JobClient: Map input records=100000
11/11/06 04:34:47 INFO mapred.JobClient: Spilled Records=0
11/11/06 04:34:47 INFO mapred.JobClient: Map input bytes=100000
11/11/06 04:34:47 INFO mapred.JobClient: Map output records=100000
11/11/06 04:34:47 INFO mapred.JobClient: SPLIT_RAW_BYTES=164
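The row-count arithmetic above can be checked with a short sketch. The 100-byte record size comes from the TeraSort record format described in the notes; the helper name is ours:

```python
# TeraGen arithmetic: each generated record is exactly 100 bytes,
# so the teragen row-count argument is just target_size // 100.
RECORD_BYTES = 100

def teragen_rows(target_bytes):
    """Rows to pass to teragen for a desired output size in bytes."""
    return target_bytes // RECORD_BYTES

# 1 TB needs 10,000,000,000 rows (ten zeros), as noted above.
print(teragen_rows(10**12))       # 10000000000
# A true 100 MB input needs 1,000,000 rows.
print(teragen_rows(100 * 10**6))  # 1000000
# The 100000-row run produced only 10 MB (HDFS_BYTES_WRITTEN=10000000).
print(100000 * RECORD_BYTES)      # 10000000
```

The last line matches the HDFS_BYTES_WRITTEN counter in the job output above.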
The next command lists the generated directory, using the distributed file system shell, to confirm the data really was produced; the path is /user/apple/terasort/ as noted above. The result: two data files, each 5000000 B = 5 MB.

apple@ubuntu:~/hadoop-$ bin/hadoop fs -ls /user/apple/terasort/100000-input
Found 4 items
-rw-r--r-- 1 apple supergroup 0 2011-11-06 04:34 /user/apple/terasort/100000-input/_SUCCESS
drwxr-xr-x - apple supergroup 0 2011-11-06 04:33 /user/apple/terasort/100000-input/_logs
-rw-r--r-- 1 apple supergroup 5000000 2011-11-06 04:34 /user/apple/terasort/100000-input/part-00000
-rw-r--r-- 1 apple supergroup 5000000 2011-11-06 04:34 /user/apple/terasort/100000-input/part-00001

Each generated row is 100 B, so the argument 10 below means 10 rows, 1000 B in total. TeraGen uses two map tasks and each map writes one file, so this run produces two 500 B files. By the same arithmetic, 100,000 rows are 10,000,000 B = 10 MB, i.e. the two 5 MB part files listed above.

apple@ubuntu:~/hadoop-$ bin/hadoop jar hadoop-examples-.jar teragen 10 terasort/100000-input2
Generating 10 using 2 maps with step of 5
11/11/06 04:37:59 INFO mapred.JobClient: Running job: job_201111060257_0018
11/11/06 04:38:00 INFO mapred.JobClient: map 0% reduce 0%
11/11/06 04:38:25 INFO mapred.JobClient: map 50% reduce 0%
11/11/06 04:38:32 INFO mapred.JobClient: map 100% reduce 0%
11/11/06 04:38:37 INFO mapred.JobClient: Job complete: job_201111060257_0018
11/11/06 04:38:37 INFO mapred.JobClient: Counters: 15
11/11/06 04:38:37 INFO mapred.JobClient: Job Counters
11/11/06 04:38:37 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=36091
11/11/06 04:38:37 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
11/11/06 04:38:37 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
11/11/06 04:38:37 INFO mapred.JobClient: Launched map tasks=2
11/11/06 04:38:37 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=0
11/11/06 04:38:37 INFO mapred.JobClient: File Input Format Counters
11/11/06 04:38:37 INFO mapred.JobClient: Bytes Read=0
11/11/06 04:38:37 INFO mapred.JobClient: File Output Format Counters
11/11/06 04:38:37 INFO mapred.JobClient: Bytes Written=1000
11/11/06 04:38:37 INFO mapred.JobClient: FileSystemCounters
11/11/06 04:38:37 INFO mapred.JobClient: HDFS_BYTES_READ=158
11/11/06 04:38:37 INFO mapred.JobClient: FILE_BYTES_WRITTEN=41776
11/11/06 04:38:37 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=1000
11/11/06 04:38:37 INFO mapred.JobClient: Map-Reduce Framework
11/11/06 04:38:37 INFO mapred.JobClient: Map input records=10
11/11/06 04:38:37 INFO mapred.JobClient: Spilled Records=0
11/11/06 04:38:37 INFO mapred.JobClient: Map input bytes=10
11/11/06 04:38:37 INFO mapred.JobClient: Map output records=10
11/11/06 04:38:37 INFO mapred.JobClient: SPLIT_RAW_BYTES=158
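The "Generating 100000 using 2 maps with step of 50000" lines hint at how TeraGen divides the work: each map task receives a contiguous range of row numbers of length "step". A rough sketch of that division (our own helper under that assumption, not Hadoop code; how TeraGen distributes any remainder rows is our guess):

```python
def teragen_splits(total_rows, num_maps):
    """Divide [0, total_rows) into per-map (start, count) ranges,
    mirroring TeraGen's 'using M maps with step of S' message."""
    step = total_rows // num_maps  # the 'step' printed in the job log
    splits = []
    for i in range(num_maps):
        start = i * step
        # assume the last map picks up any remainder rows
        count = total_rows - start if i == num_maps - 1 else step
        splits.append((start, count))
    return splits

print(teragen_splits(100000, 2))  # [(0, 50000), (50000, 50000)]
print(teragen_splits(10, 2))      # [(0, 5), (5, 5)]
```

Each map writes its range to its own part file, which is why both runs above produce two equal-sized output files.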
If we generated 1 GB of data, then with a 64 MB block size it would be stored as 16 blocks, and terasort would run one map task per block, i.e. 16 map tasks. We generated 100 MB instead, and the output below shows Launched map tasks=2.

apple@ubuntu:~/hadoop-$ bin/hadoop jar hadoop-examples-.jar teragen 1000000 terasort/100M-input
Generating 1000000 using 2 maps with step of 500000
11/11/06 04:41:11 INFO mapred.JobClient: Running job: job_201111060257_0019
11/11/06 04:41:12 INFO mapred.JobClient: map 0% reduce 0%
11/11/06 04:41:43 INFO mapred.JobClient: map 10% reduce 0%
11/11/06 04:41:58 INFO mapred.JobClient: map 11% reduce 0%
11/11/06 04:42:04 INFO mapred.JobClient: map 50% reduce 0%
11/11/06 04:42:27 INFO mapred.JobClient: map 96% reduce 0%
11/11/06 04:42:33 INFO mapred.JobClient: map 100% reduce 0%
11/11/06 04:42:45 INFO mapred.JobClient: Job complete: job_201111060257_0019
11/11/06 04:42:45 INFO mapred.JobClient: Counters: 15
11/11/06 04:42:45 INFO mapred.JobClient: Job Counters
11/11/06 04:42:45 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=121603
11/11/06 04:42:45 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
11/11/06 04:42:45 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
11/11/06 04:42:45 INFO mapred.JobClient: Launched map tasks=2
11/11/06 04:42:45 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=0
11/11/06 04:42:45 INFO mapred.JobClient: File Input Format Counters
11/11/06 04:42:45 INFO mapred.JobClient: Bytes Read=0
11/11/06 04:42:45 INFO mapred.JobClient: File Output Format Counters
11/11/06 04:42:45 INFO mapred.JobClient: Bytes Written=100000000
11/11/06 04:42:45 INFO mapred.JobClient: FileSystemCounters
11/11/06 04:42:45 INFO mapred.JobClient: HDFS_BYTES_READ=167
11/11/06 04:42:45 INFO mapred.JobClient: FILE_BYTES_WRITTEN=41780
11/11/06 04:42:45 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=100000000
11/11/06 04:42:45 INFO mapred.JobClient: Map-Reduce Framework
11/11/06 04:42:45 INFO mapred.JobClient: Map input records=1000000
11/11/06 04:42:45 INFO mapred.JobClient: Spilled Records=0
11/11/06 04:42:45 INFO mapred.JobClient: Map input bytes=1000000
11/11/06 04:42:45 INFO mapred.JobClient: Map output records=1000000
11/11/06 04:42:45 INFO mapred.JobClient: SPLIT_RAW_BYTES=167
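The block arithmetic above can be sketched as follows. The 64 MB default block size is from the text; the helper name is ours, and this ignores that the input is split across two part files (which does not change the totals here):

```python
BLOCK_BYTES = 64 * 1024 * 1024  # 64 MB, the default HDFS block size

def num_blocks(file_bytes):
    """Blocks used by a file, and hence the default number of
    terasort map tasks for that input (one map per block/split)."""
    return -(-file_bytes // BLOCK_BYTES)  # ceiling division

# 1 GB -> 16 blocks -> 16 map tasks when sorting it.
print(num_blocks(1024**3))        # 16
# Our input: two ~50 MB part files, one block each -> 2 map tasks,
# matching 'Launched map tasks=2' in the terasort job below.
print(2 * num_blocks(50_000_000)) # 2
```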
Now we run the terasort program itself. It executes 2 map tasks, and this is exactly where the machine is most likely to die. As the timestamps below show, sorting even 100 MB of data took about 15 minutes.

apple@ubuntu:~/hadoop-$ bin/hadoop jar hadoop-examples-.jar terasort terasort/100M-input terasort/100M-output
11/11/06 04:44:24 INFO terasort.TeraSort: starting
11/11/06 04:44:26 INFO mapred.FileInputFormat: Total input paths to process : 2
11/11/06 04:44:30 INFO util.NativeCodeLoader: Loaded the native-hadoop library
11/11/06 04:44:30 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
11/11/06 04:44:30 INFO compress.CodecPool: Got brand-new compressor
Making 1 from 100000 records
Step size is 100000.0
11/11/06 04:44:32 INFO mapred.FileInputFormat: Total input paths to process : 2
11/11/06 04:44:34 INFO mapred.JobClient: Running job: job_201111060257_0020
11/11/06 04:44:35 INFO mapred.JobClient: map 0% reduce 0%
11/11/06 04:49:00 INFO mapred.JobClient: map 1% reduce 0%
11/11/06 04:49:16 INFO mapred.JobClient: map 2% reduce 0%
11/11/06 04:49:19 INFO mapred.JobClient: map 4% reduce 0%
11/11/06 04:49:26 INFO mapred.JobClient: map 7% reduce 0%
11/11/06 04:49:27 INFO mapred.JobClient: map 8% reduce 0%
11/11/06 04:49:32 INFO mapred.JobClient: map 10% reduce 0%
11/11/06 04:49:39 INFO mapred.JobClient: map 11% reduce 0%
11/11/06 04:49:40 INFO mapred.JobClient: map 14% reduce 0%
11/11/06 04:49:49 INFO mapred.JobClient: map 16% reduce 0%
11/11/06 04:49:53 INFO mapred.JobClient: map 20% reduce 0%
11/11/06 04:49:55 INFO mapred.JobClient: map 23% reduce 0%
11/11/06 04:50:00 INFO mapred.JobClient: map 24% reduce 0%
11/11/06 04:50:01 INFO mapred.JobClient: map 26% reduce 0%
11/11/06 04:50:05 INFO mapred.JobClient: map 29% reduce 0%
11/11/06 04:50:08 INFO mapred.JobClient: map 30% reduce 0%
11/11/06 04:50:11 INFO mapred.JobClient: map 33% reduce 0%
11/11/06 04:50:16 INFO mapred.JobClient: map 35% reduce 0%
11/11/06 04:50:19 INFO mapred.JobClient: map 36% reduce 0%
11/11/06 04:50:22 INFO mapred.JobClient: map 38% reduce 0%
11/11/06 04:50:28 INFO mapred.JobClient: map 39% reduce 0%
11/11/06 04:51:31 INFO mapred.JobClient: map 41% reduce 0%
11/11/06 04:52:19 INFO mapred.JobClient: map 44% reduce 0%
11/11/06 04:52:27 INFO mapred.JobClient: map 51% reduce 0%
11/11/06 04:52:31 INFO mapred.JobClient: map 52% reduce 0%
11/11/06 04:52:34 INFO mapred.JobClient: map 55% reduce 0%
11/11/06 04:52:43 INFO mapred.JobClient: map 56% reduce 0%
11/11/06 04:53:01 INFO mapred.JobClient: map 57% reduce 0%
11/11/06 04:53:06 INFO mapred.JobClient: map 59% reduce 0%
11/11/06 04:53:10 INFO mapred.JobClient: map 60% reduce 0%
11/11/06 04:53:18 INFO mapred.JobClient: map 67% reduce 0%
11/11/06 04:54:59 INFO mapred.JobClient: map 69% reduce 0%
11/11/06 04:55:05 INFO mapred.JobClient: map 71% reduce 0%
11/11/06 04:55:30 INFO mapred.JobClient: map 86% reduce 0%
11/11/06 04:55:38 INFO mapred.JobClient: map 91% reduce 0%
11/11/06 04:55:48 INFO mapred.JobClient: map 92% reduce 0%
11/11/06 04:55:55 INFO mapred.JobClient: map 95% reduce 0%
11/11/06 04:56:00 INFO mapred.JobClient: map 96% reduce 0%
11/11/06 04:56:10 INFO mapred.JobClient: map 97% reduce 0%
11/11/06 04:56:19 INFO mapred.JobClient: map 99% reduce 0%
11/11/06 04:57:57 INFO mapred.JobClient: map 100% reduce 0%
11/11/06 04:58:36 INFO mapred.JobClient: map 100% reduce 16%
11/11/06 04:58:41 INFO mapred.JobClient: map 100% reduce 33%
11/11/06 04:58:47 INFO mapred.JobClient: map 100% reduce 66%
11/11/06 04:58:50 INFO mapred.JobClient: map 100% reduce 68%
11/11/06 04:58:54 INFO mapred.JobClient: map 100% reduce 78%
11/11/06 04:59:04 INFO mapred.JobClient: map 100% reduce 80%
11/11/06 04:59:10 INFO mapred.JobClient: map 100% reduce 82%
11/11/06 04:59:16 INFO mapred.JobClient: map 100% reduce 89%
11/11/06 04:59:19 INFO mapred.JobClient: map 100% reduce 96%
11/11/06 04:59:22 INFO mapred.JobClient: map 100% reduce 99%
11/11/06 04:59:28 INFO mapred.JobClient: map 100% reduce 100%
11/11/06 04:59:40 INFO mapred.JobClient: Job complete: job_201111060257_0020
11/11/06 04:59:42 INFO mapred.JobClient: Counters: 26
11/11/06 04:59:42 INFO mapred.JobClient: Job Counters
11/11/06 04:59:42 INFO mapred.JobClient: Launched reduce tasks=1
11/11/06 04:59:42 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=1485517
11/11/06 04:59:42 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
11/11/06 04:59:42 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
11/11/06 04:59:42 INFO mapred.JobClient: Launched map tasks=2
11/11/06 04:59:42 INFO mapred.JobClient: Data-local map tasks=2
11/11/06 04:59:42 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=143382
11/11/06 04:59:42 INFO mapred.JobClient: File Input Format Counters
11/11/06 04:59:42 INFO mapred.JobClient: Bytes Read=100000000
11/11/06 04:59:42 INFO mapred.JobClient: File Output Format Counters
11/11/06 04:59:42 INFO mapred.JobClient: Bytes Written=100000000
11/11/06 04:59:42 INFO mapred.JobClient: FileSystemCounters
11/11/06 04:59:42 INFO mapred.JobClient: FILE_BYTES_READ=204000294
11/11/06 04:59:42 INFO mapred.JobClient: HDFS_BYTES_READ=100000232
11/11/06 04:59:42 INFO mapred.JobClient: FILE_BYTES_WRITTEN=306065543
11/11/06 04:59:42 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=100000000
11/11/06 04:59:42 INFO mapred.JobClient: Map-Reduce Framework
11/11/06 04:59:42 INFO mapred.JobClient: Map output materialized bytes=102000012
11/11/06 04:59:42 INFO mapred.JobClient: Map input records=1000000
11/11/06 04:59:42 INFO mapred.JobClient: Reduce shuffle bytes=102000012
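The log lines "Making 1 from 100000 records" and "Step size is 100000.0" come from TeraSort's sampling step: it samples input keys and picks reducer cut points so that each reducer receives a non-overlapping key range; after each reducer sorts its range locally, concatenating the reducer outputs yields a globally sorted result. With a single reduce task, as here, there are no cut points at all. Below is a toy sketch of that idea; it is not the Hadoop implementation (which builds a trie over the sampled keys, per the Trie reference above), and all names are ours:

```python
import bisect
import random

def choose_cutpoints(sample_keys, num_reducers):
    """Pick num_reducers-1 boundary keys from a sorted sample, at
    evenly spaced positions (the 'step size' printed in the log)."""
    s = sorted(sample_keys)
    step = len(s) / num_reducers
    return [s[int(step * i)] for i in range(1, num_reducers)]

def toy_terasort(records, num_reducers, sample_size=1000):
    """Sample-partition-sort, the shape of TeraSort's pipeline."""
    sample = random.sample(records, min(sample_size, len(records)))
    cuts = choose_cutpoints(sample, num_reducers)
    # 'shuffle': route each key to the reducer owning its key range
    parts = [[] for _ in range(num_reducers)]
    for key in records:
        parts[bisect.bisect_right(cuts, key)].append(key)
    # each 'reducer' sorts locally; concatenation is globally sorted
    out = []
    for p in parts:
        out.extend(sorted(p))
    return out

data = [random.randrange(10**6) for _ in range(10000)]
# Global order is what teravalidate checks on the real output.
assert toy_terasort(data, 4) == sorted(data)
```

The final fs -ls and teravalidate commands in the sequence at the top verify exactly this property on the 100M-output directory, although their output falls outside the pages we have.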
