Appendix A: Original Text

Hive – A Petabyte Scale Data Warehouse Using Hadoop

Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad Chakka, Ning Zhang, Suresh Antony, Hao Liu and Raghotham Murthy
Facebook Data Infrastructure Team

The size of data sets being collected and analyzed in the industry for business intelligence is growing rapidly, making traditional warehousing solutions prohibitively expensive. Hadoop [1] is a popular open-source map-reduce implementation which is being used in companies like Yahoo, Facebook etc. to store and process extremely large data sets on commodity hardware. However, the map-reduce programming model is very low level and requires developers to write custom programs which are hard to maintain and reuse. In this paper, we present Hive, an open-source data warehousing solution built on top of Hadoop. Hive supports queries expressed in a SQL-like declarative language - HiveQL - which are compiled into map-reduce jobs that are executed using Hadoop. In addition, HiveQL enables users to plug custom map-reduce scripts into queries. The language includes a type system with support for tables containing primitive types, collections like arrays and maps, and nested compositions of the same. The underlying IO libraries can be extended to query data in custom formats. Hive also includes a system catalog - Metastore - that contains schemas and statistics, which are useful in data exploration, query optimization and query compilation. At Facebook, the Hive warehouse contains tens of thousands of tables and stores over 700TB of data, and is being used extensively for both reporting and ad-hoc analyses by more than 200 users per month.

Scalable analysis on large data sets has been core to the functions of a number of teams at Facebook - both engineering and non-engineering. Apart from ad hoc analysis and business intelligence applications used by analysts across the company, a number of Facebook products are also based on analytics. These products range from simple reporting applications like Insights for the Facebook Ad Network, to more advanced kinds such as Facebook's Lexicon product [2]. As a result, a flexible infrastructure that caters to the needs of these diverse applications and users, and that also scales up in a cost effective manner with the ever increasing amounts of data being generated on Facebook, is critical. Hive and Hadoop are the technologies that we have used to address these requirements at Facebook.

The entire data processing infrastructure in Facebook prior to 2008 was built around a data warehouse built using a commercial RDBMS. The data that we were generating was growing very fast - as an example we grew from a 15TB data set in 2007 to a 700TB data set today. The infrastructure at that time was so inadequate that some daily data processing jobs were taking more than a day to process, and the situation was just getting worse with every passing day. We had an urgent need for infrastructure that could scale along with our data. As a result we started exploring Hadoop as a technology to address our scaling needs. The fact that Hadoop was already an open source project that was being used at petabyte scale and provided scalability using commodity hardware was a very compelling proposition for us. The same jobs that had taken more than a day to complete could now be completed within a few hours using Hadoop.

However, using Hadoop was not easy for end users, especially for those users who were not familiar with map-reduce. End users had to write map-reduce programs for simple tasks like getting raw counts or averages. Hadoop lacked the expressiveness of popular query languages like SQL, and as a result users ended up spending hours (if not days) writing programs for even simple analyses. It was very clear to us that in order to really empower the company to analyze this data more productively, we had to improve the query capabilities of Hadoop. Bringing this data closer to users is what inspired us to build Hive in January 2007. Our vision was to bring the familiar concepts of tables, columns, partitions and a subset of SQL to the unstructured world of Hadoop, while still maintaining the extensibility and flexibility that Hadoop enjoyed. Hive was open sourced in August 2008 and since then has been used and explored by a number of Hadoop users for their data processing needs.

Right from the start, Hive was very popular with all users within Facebook. Today, we regularly run thousands of jobs on the Hadoop/Hive cluster with hundreds of users, for a wide variety of applications ranging from simple summarization jobs to business intelligence and machine learning applications, and also to support Facebook product features.

DATA STORAGE, SERDE AND FILE FORMATS
Data Storage

While the tables are logical data units in Hive, table metadata associates the data in a table to hdfs directories. The primary data units and their mappings in the hdfs namespace are as follows:

Tables - A table is stored in a directory in hdfs.
Partitions - A partition of the table is stored in a subdirectory within a table's directory.
Buckets - A bucket is stored in a file within the partition's or table's directory, depending on whether the table is a partitioned table or not.

As an example, a table test_table gets mapped to <warehouse_root_directory>/test_table in hdfs. The warehouse_root_directory is specified by the hive.metastore.warehouse.dir configuration parameter in hive-site.xml. By default this parameter's value is set to /user/hive/warehouse.

A table may be partitioned or non-partitioned. A partitioned table can be created by specifying the PARTITIONED BY clause in the CREATE TABLE statement as shown below.

CREATE TABLE test_part(c1 string, c2 int) PARTITIONED BY (ds string, hr int);

In the example shown above, the table partitions will be stored in the /user/hive/warehouse/test_part directory in hdfs. A partition exists for every distinct value of ds and hr specified by the user. Note that the partitioning columns are not part of the table data; the partition column values are encoded in the directory path of that partition (they are also stored in the table metadata). A new partition can be created through an INSERT statement or through an ALTER statement that adds a partition to the table. Both the following statements

INSERT OVERWRITE TABLE test_part PARTITION(ds='2009-01-01', hr=12) SELECT * FROM t;
ALTER TABLE test_part ADD PARTITION(ds='2009-02-02', hr=11);

add a new partition to the table test_part. The INSERT statement also populates the partition with data from table t, whereas the ALTER TABLE statement creates an empty partition. Both these statements end up creating the corresponding directories - /user/hive/warehouse/test_part/ds=2009-01-01/hr=12 and /user/hive/warehouse/test_part/ds=2009-02-02/hr=11 - in the table's hdfs directory. This approach does create some complications in case the partition value contains characters such as / or : that are used by hdfs to denote directory structure, but proper escaping of those characters does take care of producing an hdfs compatible directory name.
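The mapping from a partition specification to its hdfs directory can be sketched as follows. This is a minimal illustration: the percent-style escaping of reserved characters such as / and : is an assumed scheme for the purpose of the example, not necessarily the exact encoding Hive implements.

```python
# Sketch of encoding partition column values into an HDFS directory path.
# The percent-escaping of '/' and ':' below is an illustrative assumption.

def escape_partition_value(value: str) -> str:
    """Escape characters that HDFS reserves for path structure."""
    unsafe = {'/', ':', '=', '%'}
    return ''.join(
        '%{:02X}'.format(ord(ch)) if ch in unsafe else ch
        for ch in value
    )

def partition_path(table_dir: str, part_spec: dict) -> str:
    """Build a partition directory path from an ordered partition spec."""
    parts = [f"{col}={escape_partition_value(str(val))}"
             for col, val in part_spec.items()]
    return '/'.join([table_dir] + parts)

print(partition_path('/user/hive/warehouse/test_part',
                     {'ds': '2009-01-01', 'hr': 12}))
# -> /user/hive/warehouse/test_part/ds=2009-01-01/hr=12
```

A value such as '12:30' would be encoded as '12%3A30', keeping the directory name hdfs compatible while remaining recoverable from the path.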
The Hive compiler is able to use this information to prune the directories that need to be scanned for data in order to evaluate a query. In the case of the test_part table, the query

SELECT * FROM test_part WHERE ds='2009-01-01';

will only scan the files within the /user/hive/warehouse/test_part/ds=2009-01-01 directory, and the query

SELECT * FROM test_part WHERE ds='2009-02-02' AND hr=11;

will only scan the files within the /user/hive/warehouse/test_part/ds=2009-02-02/hr=11 directory. Pruning the data in this way has a significant impact on the time it takes to process the query. In many respects this partitioning scheme is similar to what has been referred to as list partitioning by many database vendors ([6]), but there are differences in that the values of the partition keys are stored with the metadata instead of the data.

The final storage unit concept that Hive uses is the concept of buckets. A bucket is a file within the leaf level directory of a table or a partition. At the time the table is created, the user can specify the number of buckets needed and the column on which to bucket the data. In the current implementation this information is used to prune the data in case the user runs a query on a sample of the data, e.g. a table that is bucketed into 32 buckets can quickly generate a 1/32 sample by choosing to look at only the first bucket of data. Similarly, the statement

SELECT * FROM t TABLESAMPLE(2 OUT OF 32);

would scan only the data present in the second bucket. Note that the onus of ensuring that the bucket files are properly created and named lies with the application; HiveQL DDL statements do not currently try to bucket the data in a way that is compatible with the table properties. Consequently, the bucketing information should be used with caution.

Though the data corresponding to a table always resides in the <warehouse_root_directory>/test_table location in hdfs, Hive also enables users to query data stored in other locations in hdfs. This can be achieved through the EXTERNAL TABLE clause, as shown in the following example.

CREATE EXTERNAL TABLE test_extern(c1 string, c2 int) LOCATION '/user/mytables/mydata';

With this statement, the user is able to specify that test_extern is an external table with each row comprising two columns - c1 and c2 - and that the data files are stored in the location /user/mytables/mydata in hdfs.
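The TABLESAMPLE pruning described above amounts to simple file selection: with rows hashed into n bucket files, sampling bucket k means reading only the k-th file. A minimal sketch (the function names and the modulo-hash bucket assignment are illustrative assumptions, not Hive's actual code):

```python
# Sketch of TABLESAMPLE(k OUT OF n) bucket pruning. The modulo-hash
# bucket assignment is an illustrative assumption.

def bucket_of(key, num_buckets: int) -> int:
    """Assign a row to a bucket by hashing its bucketing column."""
    return hash(key) % num_buckets

def tablesample_files(bucket_files: list, k: int, n: int) -> list:
    """Return the files to scan for TABLESAMPLE(k OUT OF n)."""
    if len(bucket_files) != n:
        raise ValueError("table is not bucketed into n files")
    return [bucket_files[k - 1]]   # buckets are 1-indexed in HiveQL

files = [f"/user/hive/warehouse/t/{i:06d}_0" for i in range(32)]
print(tablesample_files(files, 2, 32))   # scans only the second bucket file
```

This also makes concrete why the bucketing information must be used with caution: if the application wrote the files without honoring the declared bucket count or hash column, the selected file is not a valid 1/n sample.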
Note that as no custom SerDe has been defined, it is assumed that the data is in Hive's internal format. An external table differs from a normal table only in that a DROP TABLE command on an external table drops only the table metadata and does not delete any data. A DROP on a normal table, on the other hand, drops the data associated with the table as well.

Serialization/Deserialization (SerDe)

As mentioned previously, Hive can take an implementation of the SerDe java interface provided by the user and associate it to a table or partition. As a result, custom data formats can easily be interpreted and queried. The default SerDe implementation in Hive is called the LazySerDe - it deserializes rows into internal objects lazily, so that the cost of deserializing a column is incurred only if that column of the row is needed in some query expression. The LazySerDe assumes that the data is stored in the file such that the rows are delimited by a newline (ascii code 10) and the columns within a row are delimited by ctrl-A (ascii code 1). This SerDe can also be used to read data that uses any other delimiter character between columns. As an example, the statement

CREATE TABLE test_delimited(c1 string, c2 int) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\002' LINES TERMINATED BY '\012';

specifies that the data for table test_delimited uses ctrl-B (ascii code 2) as a column delimiter and uses ctrl-L (ascii code 12) as a row delimiter. In addition, delimiters can be specified for the serialized keys and values of maps, and different delimiters can also be specified to delimit the various elements of a list (collection). This is illustrated by the following statement.

CREATE TABLE test_delimited2(c1 string, c2 list<map<string, int>>) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\002' COLLECTION ITEMS TERMINATED BY '\003' MAP KEYS TERMINATED BY '\004';

Apart from LazySerDe, some other interesting SerDes are present in the hive_contrib.jar that is provided with the distribution. A particularly useful one is RegexSerDe, which enables the user to specify a regular expression to parse various columns out of a row. The following statement can be used, for example, to interpret apache logs.
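The laziness of the LazySerDe can be illustrated with a small sketch. This is a simplification of the idea, not Hive's actual implementation: the raw bytes of a row are kept unparsed, and the columns are split out only when one is first accessed.

```python
# Simplified sketch of lazy row deserialization in the spirit of
# LazySerDe (not Hive's actual code): the row stays as raw bytes and
# is split into columns only on first access.

class LazyRow:
    def __init__(self, raw: bytes, field_delim: bytes = b'\x01'):
        self._raw = raw               # ctrl-A delimited row, unparsed
        self._delim = field_delim
        self._fields = None           # parsed lazily
        self.parse_count = 0          # how many times we actually split

    def column(self, i: int) -> str:
        """Deserialize on first access only; later accesses reuse the split."""
        if self._fields is None:
            self.parse_count += 1
            self._fields = self._raw.split(self._delim)
        return self._fields[i].decode('utf-8')

row = LazyRow(b'www.example.com\x01200\x011024')
print(row.parse_count)   # 0 - no deserialization cost paid yet
print(row.column(1))     # 200
print(row.parse_count)   # 1 - a single split serves all later accesses
```

A query that never touches a row's columns (for example, one that is satisfied from other rows after partition pruning) thus never pays the deserialization cost for it.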
add jar 'hive_contrib.jar';
CREATE TABLE apachelog(host string, identity string, user string, time string, request string, status string, size string, referer string, agent string) ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe' WITH SERDEPROPERTIES('input.regex' = '([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\"))?', 'output.format.string' = '%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s');

The input.regex property is the regular expression applied to each record, and the output.format.string indicates how the column fields are constructed from the group matches in the regular expression. This example also illustrates how arbitrary key value pairs can be passed to a SerDe using the WITH SERDEPROPERTIES clause, a capability that can be very useful for passing arbitrary parameters to a custom SerDe.

File Formats

Hadoop files can be stored in different formats. A file format in Hadoop specifies how records are stored in a file. Text files, for example, are stored using TextInputFormat, and binary files can be stored using SequenceFileInputFormat. Users can also implement their own file formats. Hive does not impose any restrictions on the type of file input format that the data is stored in; the format can be specified when the table is created. Apart from the two formats mentioned above, Hive also provides an RCFileInputFormat, which stores the data in a column oriented manner. Such an organization can give important performance improvements, especially for queries that do not access all the columns of the table. Users can add their own file formats and associate them to a table as shown in the following statement.

CREATE TABLE dest1(key INT, value STRING) STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.SequenceFileInputFormat' OUTPUTFORMAT 'org.apache.hadoop.mapred.SequenceFileOutputFormat';

The STORED AS clause specifies the classes to be used to determine the input and output formats of the files in the table's or partition's directory. This can be any class that implements the FileInputFormat and FileOutputFormat java interfaces. The classes can be provided to Hadoop in a jar, in ways similar to those shown in the examples on adding custom SerDes.
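The benefit of a column oriented layout such as the one RCFileInputFormat provides can be seen in a toy sketch. This illustrates only the general idea, not the actual RCFile on-disk format: when values are grouped by column, a query touching one column reads only that column's bytes.

```python
# Toy illustration of row-oriented vs column-oriented storage (the idea
# behind RCFileInputFormat, not the actual RCFile layout).

rows = [("2009-01-01", 12, 1024),
        ("2009-01-01", 13, 2048),
        ("2009-01-02", 11, 512)]

# Row-oriented: each record holds all columns, so reading one column
# still touches every byte of every record.
row_layout = ['\x01'.join(str(v) for v in r) for r in rows]

# Column-oriented: each column's values are stored contiguously, so a
# query on the third column reads only the third group.
col_layout = [[str(r[i]) for r in rows] for i in range(3)]

bytes_row = sum(len(rec) for rec in row_layout)
bytes_col = sum(len(v) for v in col_layout[2])   # only the third column
print(bytes_row, bytes_col)   # 53 11
```

Even on this tiny example, the single-column scan touches roughly a fifth of the bytes; on wide tables the gap grows with the number of unused columns.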
CONCLUSIONS AND FUTURE WORK

Hive is a work in progress. It is an open-source project, and is being actively worked on by Facebook as well as several external contributors. HiveQL currently accepts only a subset of SQL as valid queries. We are working towards making HiveQL subsume SQL syntax. Hive currently has a naive rule-based optimizer with a small number of simple rules. We plan to build a cost-based optimizer and adaptive optimization techniques to come up with more efficient plans. We are exploring columnar storage and more intelligent data placement to improve scan performance. We are running performance benchmarks based on [9] to measure our progress as well as compare against other systems. In our preliminary experiments, we have been able to improve the performance of Hadoop itself by ~20% compared to [9]. The improvements involved using faster Hadoop data structures to process the data, for example, using Text instead of String. The same queries expressed easily in HiveQL had ~20% overhead compared to our optimized Hadoop implementation, i.e., Hive's performance is on par with the Hadoop code from [9]. We have also run the industry standard decision support benchmark - TPC-H [11]. Based on these experiments, we have identified several areas for performance improvement and have begun working on them. More details are available in [10] and [12]. We are enhancing the JDBC and ODBC drivers for Hive for integration with commercial BI tools that only work with traditional relational warehouses. We are also exploring multi-query optimization techniques and performing generic n-way joins in a single map-reduce job.

REFERENCES
[1] Apache Hadoop. Available at /hadoop.
[2] Facebook Lexicon at /lexicon.
[3] Hive wiki at /hadoop/hive.
[4] Hadoop Map-Reduce Tutorial at /common/docs/current/mapred_tutorial.html.
[5] Hadoop HDFS User Guide at /common/docs/current/hdfs_user_guide.html.
[6] MySQL list partitioning at /doc/refman/5.1/en/partitioning-list.html.
[7] Apache Thrift. Available at /thrift.
[8] DataNucleus.
[9] A. Pavlo et al. A Comparison of Approaches to Large-Scale Data Analysis. In Proc. of ACM SIGMOD, 2009.
[10] Hive Performance Benchmark. Available at /jira/browse/HIVE-396.
[11] TPC-H Benchmark. Available at /tpch.
[12] Running TPC-H queries on Hive. Available at /jira/browse/HIVE-600.
[13] Hadoop Pig. Available at /pig.
[14] R. Chaiken et al. SCOPE: Easy and Efficient Parallel Processing of Massive Data Sets. In Proc. of VLDB, 2008.
[15] HadoopDB Project. Available at /hadoopdb/hadoopdb.html.
[16] MicroStrategy.