Document information:
Title: A Study of Data Mining with Big Data
Authors: V. H. Shastri, V. Sreeprada
Source: International Journal of Emerging Trends and Technology in Computer Science, 2016, 38(2): 99-103
Length: 2,291 English words, 12,196 characters

A Study of Data Mining with Big Data

Abstract

Data has become an important part of every economy, industry, organization, business, function and individual. Big Data is a term used to identify large data sets whose size is typically greater than that of a conventional database. Big Data introduces unique computational and statistical challenges, and it is at present expanding in most domains of engineering and science. Because of the volume, variability and velocity of such data, data mining helps to extract useful data from huge data sets. This article presents the HACE theorem, which characterizes the features of the Big Data revolution, and proposes a Big Data processing model from the data mining perspective.

Keywords: Big Data, Data Mining, HACE theorem, structured and unstructured.

I. Introduction

Big Data refers to the enormous amount of structured and unstructured data that overflows an organization. If this data is properly used, it can lead to meaningful information. Big Data comprises large volumes of data that require a great deal of real-time processing. It provides room to discover new values, to gain in-depth knowledge from hidden values, and to manage the data effectively. A database is an organized collection of logically related data which can be easily managed, updated and accessed. Data mining is the process of discovering interesting knowledge, such as associations, patterns, changes, anomalies and significant structures, from large amounts of data stored in databases or other repositories.

Big Data has the 3 V's as its characteristics: volume, velocity and variety. Volume means the amount of data generated every second; it describes data at rest and is also known as the scale characteristic. Velocity is the speed with which the data is generated; the data generated from social media is an example of high-speed data. Variety means that different types of data can be involved, such as audio, video or documents; the data can be numerals, images, time series, arrays and so on.

Data mining analyses data from different perspectives and summarizes it into useful information that can be used for business solutions and for predicting future trends. Data mining (DM), also called Knowledge Discovery in Databases (KDD) or Knowledge Discovery and Data Mining, is the process of automatically searching large volumes of data for patterns such as association rules. It applies many computational techniques from statistics, information retrieval, machine learning and pattern recognition. Data mining extracts only the required patterns from the database in a short time span. Based on the type of patterns to be mined, data mining tasks can be classified into summarization, classification, clustering, association and trend analysis. Big Data is expanding in all domains, including science and engineering fields such as the physical, biological and biomedical sciences.

II. Big Data with Data Mining

Generally, Big Data refers to a collection of large volumes of data generated from various sources such as the internet, social media, business organizations and sensors. We can extract useful information from this data with the help of data mining, a technique for discovering patterns, as well as descriptive, understandable models, from large-scale data. Volume is the size of the data, which can run to terabytes and petabytes; this scale, and its continuing rise, makes the data difficult to store and analyse using traditional tools. Big Data mining should be able to mine these large amounts of data within a predefined period of time. Traditional database systems were designed to address small amounts of structured, consistent data, whereas Big Data includes a wide variety of data such as geospatial data, audio, video, unstructured text and so on. Big Data mining refers to the activity of going through big data sets to look for relevant information.

To process large volumes of data from different sources quickly, Hadoop is used. Hadoop is a free, Java-based programming framework that supports the processing of large data sets in a distributed computing environment. Its distributed file system supports fast data transfer rates among nodes and allows the system to continue operating uninterrupted when a node fails. It runs MapReduce for distributed data processing and works with both structured and unstructured data.

III. Big Data Characteristics: The HACE Theorem

We have a large volume of heterogeneous data. There exists a complex
relationship among the data. We need to discover useful information from this voluminous data. Imagine a scenario in which blind people are asked to draw an elephant. From the information each blind person collects, one may take the trunk for a wall, a leg for a tree, the body for a wall and the tail for a rope. The blind men can exchange information with each other.

Figure 1: Blind men and the giant elephant

Some of the characteristics of Big Data are:

i. Huge data with heterogeneous and diverse sources: one of the fundamental characteristics of Big Data is the large volume of data represented by heterogeneous and diverse dimensions. For example, in the biomedical world a single human being is represented by name, age, gender, family history and so on, while X-ray and CT scans produce images and videos. Heterogeneity refers to the different types of representation of the same individual, and diversity refers to the variety of features used to represent a single piece of information.

ii. Autonomous sources with distributed and decentralized control: the sources are autonomous, i.e., automatically generated; each generates information without any centralized control. We can compare this with the World Wide Web (WWW), where each server provides a certain amount of information without depending on other servers.

iii. Complex and evolving relationships: as the size of the data becomes extremely large, so does the number of relationships within it. In the early stages, when the data is small, there is little complexity in the relationships among the data. Data generated from social media and other sources has complex relationships.
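Section II notes that Hadoop runs MapReduce for distributed data processing. The pattern itself can be shown with a minimal single-process word-count sketch (plain Python standing in for a real Hadoop job; the function names and sample documents are illustrative, not part of any Hadoop API):

```python
from collections import defaultdict

def map_phase(document):
    # Emit (word, 1) pairs, as a word-count mapper would.
    for word in document.lower().split():
        yield word, 1

def shuffle(pairs):
    # Group values by key, mimicking the framework's shuffle/sort step.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    # Sum the counts for one key, as a reducer would.
    return key, sum(values)

documents = ["big data needs data mining", "data mining finds patterns"]
pairs = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts["data"])  # "data" appears three times across the documents
```

On a real cluster the shuffle step is performed by the framework between the map and reduce phases and the mappers and reducers run on different nodes; the sketch only mimics that grouping in memory.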
IV. Tools: The Open Source Revolution

Large companies such as Facebook, Yahoo, Twitter and LinkedIn benefit from and contribute to work on open source projects. In Big Data mining there are many open source initiatives. The most popular of them are:

Apache Mahout: scalable machine learning and data mining open source software based mainly on Hadoop. It has implementations of a wide range of machine learning and data mining algorithms: clustering, classification, collaborative filtering and frequent pattern mining.

R: an open source programming language and software environment designed for statistical computing and visualization. R was designed by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand, beginning in 1993, and is used for statistical analysis of very large data sets.

MOA: open source stream data mining software that performs data mining in real time. It has implementations of classification, regression, clustering, frequent itemset mining and frequent graph mining. It started as a project of the Machine Learning group of the University of Waikato, New Zealand, famous for the WEKA software. The streams framework provides an environment for defining and running stream processes using simple XML-based definitions, and it is able to use MOA, Android and Storm.

SAMOA: a new, upcoming software project for distributed stream mining that will combine S4 and Storm with MOA.

Vowpal Wabbit: an open source project started at Yahoo! Research and continued at Microsoft Research to design a fast, scalable, useful learning algorithm. VW is able to learn from terafeature data sets. When doing linear learning via parallel learning, it can exceed the throughput of any single machine's network interface.

V. Data Mining for Big Data

Data mining is the process by which data coming from different sources is analysed to discover useful information. Data mining comprises several algorithms, which fall into four categories:

1. Association rules
2. Clustering
3. Classification
4. Regression

Association is used to search for relationships between variables; it is applied, for example, in searching for frequently visited items. In short, it establishes relationships among objects. Clustering discovers groups and structures in the data. Classification deals with associating an unknown structure with a known structure. Regression finds a function to model the data. The different data mining algorithms are:

Category        Algorithm
Association     Apriori, FP-growth
Clustering      K-Means, Expectation Maximization
Classification  Decision trees, SVM
Regression      Multivariate linear regression

Table 1. Classification of algorithms

Data mining algorithms can be converted into big MapReduce algorithms on a parallel computing basis.

Big Data                                      Data Mining
It is everything in the world now.            It is "the old Big Data".
The size of the data is larger.               The size of the data is smaller.
Involves the storage and processing           Interesting patterns can be found
of large data sets.                           in the data.
Big Data is the term for a large data set.    Data mining is the activity of going
                                              through a big data set to look for
                                              relevant information.
Big Data is the asset.                        Data mining is the handler that
                                              provides beneficial results.
What counts as "Big Data" varies with         Data mining refers to operations that
the capabilities of the organization          involve relatively sophisticated
managing the set and of the applications      search.
traditionally used to process and
analyse the data.

Table 2. Differences between Data Mining and Big Data

VI. Challenges in Big Data

Meeting the challenges of Big Data is difficult: the volume is increasing every day, the velocity is increasing through internet-connected devices, the variety is expanding, and organizations' capability to capture and process the data is limited. The following are the challenges in handling Big Data:

1. Data capture and storage
2. Data transmission
3. Data curation
4. Data analysis
5. Data visualization

The challenges of Big Data mining can be divided into three tiers. The first tier is the setup of data mining algorithms. The second tier includes information sharing and data privacy, and domain and application knowledge. The third tier includes local learning and model fusion for multiple information sources, mining from sparse, uncertain and incomplete data, and mining complex and dynamic data.

Figure 2: Phases of Big Data challenges

Generally, mining data from different data sources is tedious because of the size of the data. Big Data is stored at different places, so collecting it is a tedious task, and applying basic data mining algorithms to it is an obstacle. Next, we need to consider data privacy. The third issue concerns the mining algorithms themselves: when we apply data mining algorithms to subsets of the data, the results may not be very accurate.

VII. Forecast of the Future

There are some challenges that researchers and practitioners will have to deal with during the next years:

Analytics architecture: It is not yet clear what an optimal architecture for analytics systems should look like in order to deal with historic data and with real-time data at the same time. An interesting proposal is the Lambda architecture of Nathan Marz. The Lambda architecture solves the problem of computing arbitrary functions on arbitrary data in real time by decomposing the problem into three layers: the batch layer, the serving layer and the speed layer. It combines in the same system Hadoop for the batch layer and Storm for the speed layer. The properties of the system are: robust and fault tolerant, scalable, general and extensible, allowing ad hoc queries, requiring minimal maintenance, and debuggable.

Statistical significance: It is important to achieve statistically significant results and not be fooled by randomness. As Efron explains in his book on large-scale inference, it is easy to go wrong with huge data sets and thousands of questions to answer at once.

Distributed mining: Many data mining techniques are not trivial to parallelize. To obtain distributed versions of some methods, a lot of research is needed, with practical and theoretical analysis, to provide new methods.

Time-evolving data: Data may be evolving over time, so Big Data mining techniques should be able to adapt and, in some cases, to detect change first. The data stream mining field, for example, has very powerful techniques for this task.

Compression: When dealing with Big Data, the quantity of space needed to store it is very relevant. There are two main approaches: compression, where we do not lose anything, and sampling, where we choose the data that is most representative. Using compression, we may take more time and less space, so we can consider it a transformation from time to space. Using sampling, we lose information, but the gains in space may be orders of magnitude. For example, Feldman et al. use coresets to reduce the complexity of Big Data problems; coresets are small sets that provably approximate the original data for a given problem. Using merge-reduce, the small sets can then be used for solving hard machine learning problems in parallel.

Visualization: A main task of Big Data analysis is how to visualize the results. As the data is so big, it is very difficult to find user-friendly visualizations. New techniques and frameworks to tell and show stories will be needed, such as, for example, the photographs, infographics and essays in the beautiful book "The Human Face of Big Data".

Hidden Big Data: Large quantities of useful data are getting lost, since much new data is untagged, file-based and unstructured. The 2012 IDC study on Big Data explains that in 2012, 23% (643 exabytes) of the digital universe would have been useful for Big Data if tagged and analyzed; however, currently only 3% of the potentially useful data is tagged, and even less is analyzed.

VIII. Conclusion

The amount of data is growing exponentially due to social networking sites, search and retrieval engines, media sharing sites, stock trading sites, news sources and so on. Big Data is becoming the new frontier for scientific data research and for business applications. Data mining techniques can be applied to Big Data to acquire useful information from large data sets; used together, they can produce a useful picture of the data. Big Data analysis tools such as MapReduce over Hadoop and HDFS help organizations.
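As an appendix-style illustration of the association category in Table 1, the frequent-itemset stage of Apriori can be sketched in a few lines (plain Python; the baskets and the support threshold are made-up examples):

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return frequent itemsets with their support counts, level by level,
    using the Apriori property: a (k+1)-itemset can only be frequent if
    every one of its k-subsets is frequent."""
    transactions = [frozenset(t) for t in transactions]
    items = {i for t in transactions for i in t}
    frequent = {}
    candidates = [frozenset([i]) for i in sorted(items)]
    k = 1
    while candidates:
        # Count support for each candidate itemset.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        # Generate (k+1)-candidates and prune those with an infrequent k-subset.
        candidates = set()
        for a, b in combinations(list(level), 2):
            union = a | b
            if len(union) == k + 1 and all(
                frozenset(s) in level for s in combinations(union, k)
            ):
                candidates.add(union)
        candidates = list(candidates)
        k += 1
    return frequent

baskets = [{"milk", "bread"}, {"milk", "bread", "eggs"},
           {"bread", "eggs"}, {"milk", "eggs"}]
freq = apriori(baskets, min_support=2)
print(freq[frozenset({"milk", "bread"})])  # {milk, bread} appears in 2 baskets
```

A production implementation (as in Apache Mahout) would add rule generation from these itemsets and distribute the counting step; this sketch only shows the candidate-generation-and-pruning idea.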
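The sampling alternative discussed under "Compression" in Section VII can likewise be sketched with reservoir sampling, a standard one-pass technique for keeping a fixed-size uniform sample of a stream of unknown length (not a method the paper itself names; the stream and sample size here are made-up examples):

```python
import random

def reservoir_sample(stream, k, rng=None):
    """Keep a uniform random sample of k items from a stream in one pass,
    using O(k) memory regardless of the stream's length."""
    rng = rng or random.Random(42)  # seeded so the sketch is reproducible
    sample = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            sample.append(item)      # fill the reservoir first
        else:
            j = rng.randrange(n)     # item n is kept with probability k/n
            if j < k:
                sample[j] = item
    return sample

# Sample 5 readings from a simulated stream of 10,000 sensor values.
stream = (x * x % 977 for x in range(10_000))
sample = reservoir_sample(stream, k=5)
print(len(sample))  # always 5, however long the stream
```

This trades information for space, as the section describes: the full stream is never stored, yet every element has the same chance of ending up in the sample.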
