Basic Tutorial on Building Cloud Computing with Hadoop

Labels: HADOOP-TUTORIAL, HDFS

3 October 2013

Hadoop Tutorial: Part 1 - What is Hadoop? (an Overview)

Hadoop is an open source software framework that supports data-intensive distributed applications and is licensed under the Apache v2 license.

At least, this is what you will find as the first line of the definition of Hadoop on Wikipedia. So what are data-intensive distributed applications?

Well, data-intensive is nothing but Big Data (data that has outgrown in size), and distributed applications are applications that work over a network by communicating and coordinating with each other by passing messages (say, using RPC inter-process communication or through a message queue).

Hence Hadoop works in a distributed environment and is built to store, handle and process large amounts of data (petabytes, exabytes and more). Now, just because I am saying that Hadoop stores petabytes of data, this doesn't mean that Hadoop is a database. Again, remember that it is a framework that handles large amounts of data for processing. You will get to know the difference between Hadoop and databases (or NoSQL databases, which is what we call Big Data's databases) as you go down the line in the coming tutorials.

Hadoop was derived from the research papers published by Google on the Google File System (GFS) and Google's MapReduce. So there are two integral parts of Hadoop: the Hadoop Distributed File System (HDFS) and Hadoop MapReduce.

Hadoop Distributed File System (HDFS)

HDFS is a filesystem designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware.

Well, let's get into the details of the statement above:

Very large files:

When we say very large files, we mean files whose size is in the range of gigabytes, terabytes, petabytes or maybe more.

Streaming data access:

HDFS is built around the idea that the most efficient data processing pattern is a write-once, read-many-times pattern. A dataset is typically generated or copied from source, and then various analyses are performed on that dataset over time. Each analysis will involve a large proportion, if not all, of the dataset, so the time to read the whole dataset is more important than the latency in reading the first record.
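To see why whole-dataset read time outweighs first-record latency, here is a small back-of-the-envelope sketch. The dataset size, transfer rate and latency figures are illustrative assumptions, not measurements of any real cluster, and the function names are made up for this example.

```python
def full_scan_seconds(dataset_mb: float, transfer_rate_mb_s: float) -> float:
    """Time to stream the whole dataset at a sustained transfer rate."""
    if transfer_rate_mb_s <= 0:
        raise ValueError("transfer rate must be positive")
    return dataset_mb / transfer_rate_mb_s


def latency_share(dataset_mb: float, transfer_rate_mb_s: float,
                  first_record_latency_s: float) -> float:
    """Fraction of the total job time spent waiting for the first record."""
    scan = full_scan_seconds(dataset_mb, transfer_rate_mb_s)
    return first_record_latency_s / (first_record_latency_s + scan)


if __name__ == "__main__":
    # Assumed figures: 1 TB dataset (counted as 1,000,000 MB),
    # 100 MB/s sustained read rate, 10 ms wait before the first record.
    dataset_mb = 1_000_000.0
    rate = 100.0
    latency = 0.010
    print("full scan (s):", full_scan_seconds(dataset_mb, rate))
    print("latency share:", latency_share(dataset_mb, rate, latency))
```

Under these assumed numbers the wait for the first record is a vanishing fraction of the job, which is why a design that favours throughput over latency suits this access pattern.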

Commodity hardware:

Hadoop doesn't require expensive, highly reliable hardware. It's designed to run on clusters of commodity hardware (commonly available hardware that can be obtained from multiple vendors) for which the chance of node failure across the cluster is high, at least for large clusters. HDFS is designed to carry on working without a noticeable interruption to the user in the face of such failure.

Now here we are talking about a filesystem, the Hadoop Distributed File System. And we all know a few other filesystems, like the Linux and Windows filesystems. So the next question is...

What is the difference between a normal filesystem and the Hadoop Distributed File System?

The two major differences between HDFS and other filesystems are:

Block size:

Every disk has a block size, which is the minimum amount of data that can be written to or read from the disk. A filesystem also works in blocks, and each filesystem block is built from these disk blocks. Disk blocks are normally 512 bytes, and filesystem blocks are a few kilobytes.

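HDFS also has the concept of blocks, but here a block is 64 MB by default (in the Hadoop 1.x releases current when this was written; Hadoop 2 and later default to 128 MB). The block size is configurable and can be raised, for example to 128 MB, 256 MB, 512 MB or even into gigabytes. It all depends on the requirements and use cases.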
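To make the difference in scale concrete, the sketch below counts how many blocks a single file occupies at a few block sizes. It is plain arithmetic, not a call into Hadoop; the function name and the 1 GB example file are assumptions chosen for illustration.

```python
KB = 1024
MB = 1024 * KB
GB = 1024 * MB


def block_count(file_size_bytes: int, block_size_bytes: int) -> int:
    """Number of blocks needed to hold a file (the last block may be partial)."""
    if block_size_bytes <= 0:
        raise ValueError("block size must be positive")
    # Integer ceiling division avoids floating-point rounding.
    return (file_size_bytes + block_size_bytes - 1) // block_size_bytes


if __name__ == "__main__":
    file_size = 1 * GB  # hypothetical example file
    for label, size in (("4 KB", 4 * KB), ("64 MB", 64 * MB), ("128 MB", 128 * MB)):
        print(label, "blocks:", block_count(file_size, size))
```

The larger the block, the fewer blocks there are to keep track of for the same file.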

So why are these blocks so large in HDFS? Keep reading and you will get the answer in the next few tutorials :)

Metadata storage:

In a normal filesystem, metadata is stored hierarchically. Let's say there is a folder ABC, inside that folder there is another folder DEF, and inside that there is a file hello.txt. The information about hello.txt (i.e. its metadata) is kept with DEF, and the metadata of DEF is kept with ABC. This forms a hierarchy, and the hierarchy is maintained all the way up to the root of the filesystem. In HDFS the metadata is not spread across the storage in this way. All of the metadata resides on a single machine known as the Namenode (or master node) of the cluster. This node holds the information about all the files and folders, and lots of other information too, which we will learn about in the next few tutorials :)
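As a rough picture of "all the metadata in one place", here is a toy sketch in which a single object answers every metadata question for the ABC/DEF/hello.txt example. The class and method names are hypothetical and the flat table is a simplification for illustration; it is not how the real Namenode is implemented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    path: str
    is_dir: bool
    size_bytes: int = 0


class ToyNamenode:
    """Keeps the metadata of every file and folder in one in-memory table."""

    def __init__(self) -> None:
        self._entries = {"/": Entry("/", True)}

    def mkdir(self, path: str) -> None:
        self._entries[path] = Entry(path, True)

    def add_file(self, path: str, size_bytes: int) -> None:
        self._entries[path] = Entry(path, False, size_bytes)

    def stat(self, path: str) -> Entry:
        """Look up metadata for any path directly on the single master."""
        return self._entries[path]

    def children(self, directory: str) -> list:
        """Paths that sit directly inside the given directory."""
        prefix = directory.rstrip("/") + "/"
        return sorted(
            p for p in self._entries
            if p != directory
            and p.startswith(prefix)
            and "/" not in p[len(prefix):]
        )


if __name__ == "__main__":
    nn = ToyNamenode()
    nn.mkdir("/ABC")
    nn.mkdir("/ABC/DEF")
    nn.add_file("/ABC/DEF/hello.txt", 12)
    print(nn.stat("/ABC/DEF/hello.txt"))
    print(nn.children("/ABC"))
```

Every lookup goes to the same object, which mirrors the point that one machine holds the metadata for the whole cluster.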

Well, this was just an overview of Hadoop and the Hadoop Distributed File System. In the next part I will go into the depths of HDFS, and after that MapReduce, continuing from here...

Let me know in the comment section if you have any doubts in understanding anything, and I will be really glad to answer them :)


Comments

Romain Rigaux said... (October 03, 2013)

Nice summary!

pragya khare said... (October 05, 2013)

I know I'm a beginner and this question might be a silly one, but can you please explain to me how PARALLELISM is achieved via MapReduce at the processor level? If I have a dual-core processor, is it that only 2 jobs will run at a time in parallel?

Anonymous said... (October 05, 2013)

Hi, I am from a Mainframe background with little knowledge of core Java... Do you think Java is needed for learning Hadoop in addition to Hive/PIG? I even want to learn Java for MapReduce but couldn't find what all will be used in real time.. and the definitive guide books seem tough for learning MapReduce with Java.. is there any option where I can learn it step by step?

Sorry for the long comment.. but it would be helpful if you can guide me..

Deepak Kumar said... (October 05, 2013)

@Pragya Khare...

First thing, always remember the popular saying: no questions are foolish :) And by the way, it is a very good question.

Actually there are two things: one is what the best practice is, and the other is what happens by default.

Well, by default the number of map slots and reduce slots is set to 2 each for any TaskTracker, hence one sees a maximum of 2 maps and 2 reduces at a given instant on a TaskTracker (this is configurable). This doesn't depend only on the processor but on lots of other factors as well, like RAM, CPU, power, disk and others.

/blog/best-practices-for-selecting-apache-hadoop-hardware/

As for the other factor, i.e. best practices, it depends on your use case. You can go through the 3rd point of the link below to understand it more conceptually.

/blog/2009/12/7-tips-for-improving-mapreduce-performance/

Well, I will explain all of this when I reach the advanced MapReduce tutorials.. Till then keep reading!! :)

Deepak Kumar said... (October 05, 2013)

@Anonymous

As Hadoop is written in Java, most of its APIs are written in core Java... To learn about the Hadoop architecture you don't need Java... But to go to the API level and start programming in MapReduce you need to know core Java.

As for the Java requirement you asked about... you just need simple core Java concepts and programming for Hadoop and MapReduce.. And Hive/PIG are SQL-like data flow languages that are really easy to learn... And since you are from a programming background it won't be very difficult to learn Java :) You can also go through the link below for further details :)

/2013/09/What-are-the-Pre-requsites-for-getting-started-with-Big-Data-Technologies.html


ABOUT THE AUTHOR

DEEPAK KUMAR

Big Data / Hadoop Developer, Software Engineer, Thinker, Learner, Geek, Blogger, Coder

I love to play around Data. Big Data!


© 2013 All Rights Reserved BigData Planet. All articles on this website by Deepak Kumar are licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.


6 October 2013

Hadoop Tutorial: Part 2 - Hadoop Distributed File System (HDFS)

In the last tutorial, What is Hadoop?, I gave you a brief idea about Hadoop. The two integral parts of Hadoop are Hadoop HDFS and Hadoop MapReduce.

Let's go deeper into HDFS.

Hadoop Distributed File System (HDFS) Concepts:

First take a look at the following two terms that will be used while describing HDFS.

Cluster: A Hadoop cluster is made up of many machines in a network. Each machine is termed a node, and these nodes talk to each other over the network.

Block size: This is the size of one block in a filesystem, the smallest unit in which data is kept contiguously. The default size of a single block in HDFS is 64 MB.

In HDFS, data is kept by splitting it into chunks. Let's say you have a text file of 200 MB and you want to keep this file in a Hadoop cluster. The file is split into a number of chunks, where each chunk is equal to the block size set for the HDFS cluster (64 MB by default); only the last chunk can be smaller.

Hence a 200 MB file gets split into 4 parts, 3 parts of 64 MB and 1 part of 8 MB, and the parts are spread across different machines. Which machine holds which part is decided by the Namenode, which we will discuss in detail below.
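The arithmetic of the 200 MB example can be checked with a minimal sketch. The function name split_into_blocks is hypothetical; it only models the sizes of the parts, not how HDFS actually writes or places them.

```python
MB = 1024 * 1024


def split_into_blocks(file_size_bytes: int, block_size_bytes: int = 64 * MB) -> list:
    """Return the size of each part a file of the given size is split into."""
    if block_size_bytes <= 0:
        raise ValueError("block size must be positive")
    full_blocks, remainder = divmod(file_size_bytes, block_size_bytes)
    sizes = [block_size_bytes] * full_blocks
    if remainder:
        # The last part only holds what is left over.
        sizes.append(remainder)
    return sizes


if __name__ == "__main__":
    parts = split_into_blocks(200 * MB)
    print([size // MB for size in parts])
```

Passing a different block_size_bytes shows how the same file would be divided on a cluster configured with larger blocks.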

Now in an HDFS cluster there are two kinds of nodes, a master node and many worker nodes. These are known as the Namenode (master node) and the Datanodes (worker nodes).

Namenode:

The namenode manages the filesystem namespace. It maintains the filesystem tree and the metadata for all the files and directories in the tree. So it contains the information about all the files and directories and their hierarchy in the cluster, in the form of a namespace image and edit logs. Along with the filesystem information, it also knows the datanodes on which all the blocks of a file are kept.

A client accesses the filesystem on behalf of the user by communicating with the namenode and datanodes. The client presents a filesystem interface similar to the Portable Operating System Interface (POSIX), so user code does not need to know about the namenode and datanodes to function.

Datanode:

These are the workers that do the real work, and by real work we mean that the storage of the actual data is done by the datanodes. They store and retrieve blocks when they are told to (by clients or the namenode), and they report back to the namenode periodically with lists of the blocks that they are storing.
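The division of labour between the two kinds of node can be sketched as a toy model: the master remembers which blocks make up each file, and learns where those blocks live from the reports the workers send. All names here (BlockMap, the blk_ and dn identifiers) are hypothetical, and the sketch ignores everything a real cluster needs such as networking, replication policy and failure detection.

```python
from collections import defaultdict


class BlockMap:
    """Toy stand-in for the namenode's view of files, blocks and datanodes."""

    def __init__(self) -> None:
        self.file_blocks = {}                     # path -> ordered block ids
        self.block_locations = defaultdict(set)   # block id -> datanode names

    def register_file(self, path: str, block_ids: list) -> None:
        self.file_blocks[path] = list(block_ids)

    def receive_block_report(self, datanode: str, block_ids: list) -> None:
        """Replace what we believed about a datanode with its latest report."""
        for locations in self.block_locations.values():
            locations.discard(datanode)
        for block_id in block_ids:
            self.block_locations[block_id].add(datanode)

    def locate(self, path: str) -> list:
        """Blocks of a file, in order, with the datanodes that hold each one."""
        return [
            (block_id, sorted(self.block_locations.get(block_id, ())))
            for block_id in self.file_blocks[path]
        ]


if __name__ == "__main__":
    nn = BlockMap()
    nn.register_file("/data/log.txt", ["blk_1", "blk_2"])
    nn.receive_block_report("dn1", ["blk_1"])
    nn.receive_block_report("dn2", ["blk_1", "blk_2"])
    print(nn.locate("/data/log.txt"))
```

Only the object holding file_blocks knows the order in which blocks must be read to rebuild a file, which is the role the text assigns to the namenode.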

One important thing to note here: in one cluster there will be only one Namenode, and there can be any number N of datanodes.

The Namenode contains the metadata of all the files and directories and also knows the datanodes on which each split of a file is stored. So let's say the Namenode goes down; what do you think will happen?

Yes, if the Namenode is down we cannot access any of the files and directories in the cluster.

We will not even be able to connect to any of the datanodes to get any of the files.

Now think about it: we have kept our files by splitting them into chunks, and we have kept those chunks on different datanodes. It is the Namenode that keeps track of all the files' metadata, so only the Namenode knows how to reconstruct a file from all its splits. This is the reason that if the Namenode is down in a Hadoop cluster, everything is down.

This is also why the Namenode is known as the single point of failure of a Hadoop cluster.

Since the Namenode is so important, we have to make it resilient to failure, and for that Hadoop provides two mechanisms.

The first way is to back up the files that make up the persistent state of the filesystem metadata. Hadoop can be configured so that the namenode writes its persistent state to multiple filesystems. These writes are synchronous and atomic. The usual configuration choice is to write to local disk as well as a remote NFS mount.

The second way is running a Secondary Namenode.

Despite what the name suggests, it does not act like a Namenode. So if it doesn't act like a namenode, how does it protect against failure?

Well, the Secondary Namenode also holds a namespace image and edit logs, like the namenode. At a regular interval (one hour by default) it copies the namespace image from the namenode, merges this namespace image with the edit log and copies the result back to the namenode, so that the namenode has a fresh copy of the namespace image. Now suppose that at some instant the namenode goes down and becomes corrupt. We can then start some other machine with the namespace image and the edit log that we have on the secondary namenode, and hence a total failure can be prevented.
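The merge step can be pictured as replaying a list of recorded changes on top of a snapshot. The sketch below is a simplified model under stated assumptions: the image is a plain dictionary of path to file size, and the edit log is a list of made-up create, rename and delete records. The real on-disk formats are different.

```python
def apply_edits(image: dict, edits: list) -> dict:
    """Return a new image with every logged edit replayed in order."""
    merged = dict(image)  # leave the old image untouched
    for op, path, *args in edits:
        if op == "create":
            merged[path] = args[0]                # args[0] is the file size
        elif op == "rename":
            merged[args[0]] = merged.pop(path)    # args[0] is the new path
        elif op == "delete":
            merged.pop(path, None)
        else:
            raise ValueError("unknown edit operation: " + op)
    return merged


if __name__ == "__main__":
    image = {"/a.txt": 10}
    edit_log = [
        ("create", "/b.txt", 20),
        ("rename", "/a.txt", "/c.txt"),
        ("delete", "/b.txt"),
    ]
    fresh_image = apply_edits(image, edit_log)
    print(fresh_image)
```

After a merge like this the edit log can start again from empty, and a replacement machine only needs the latest image plus the edits made since.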

The Secondary Namenode needs almost the same amount of memory and CPU for its work as the Namenode, so it is also kept on a separate machine, like the namenode. Hence we see that in a single cluster we have one Namenode, one Secondary Namenode and many Datanodes, and HDFS consists of these three elements.

This was again an overview of the Hadoop Distributed File System (HDFS). In the next part of the tutorial we will look at the working of the Namenode and Datanodes in more detail, and at how reads and writes happen in HDFS.

Let me know in the comment section if you have any doubts in understanding anything, and I will be really glad to answer your questions :)


Comments

vishwash said... (October 07, 2013)

very informative...

Tushar Karande said... (October 08, 2013)

Thanks for such informative tutorials :)

Please keep posting.. waiting for more... :)

Anonymous said... (October 08, 2013)

Nice information, but I have one doubt: what is the advantage of keeping the file in chunks on different datanodes? What kind of benefit are we getting here?

Deepak Kumar said... (October 08, 2013)

@Anonymous: Well, there are lots of reasons... I will explain them in great detail in the next few articles...

But for now let us understand this: since we have split the file into two, we can now use the power of two processors (parallel processing) on two different nodes to do our analysis (like search, calculation, prediction and lots more).. Again, let's say my file size is in the petabytes... You won't find one hard disk that big.. and even if there were one, how do you think we would read and write on that hard disk? The latency would be really high and it would take lots of time... Again, there are more reasons for the same... I will explain this in more technical ways in the coming tutorials... Till then keep reading :)


