
Document summary

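[...] traffic analysis. At present, classification relies mainly on selected packet header fields (e.g. port numbers) or on application-layer protocol decoding. [...] Flows are classified automatically based on their statistical characteristics. The approach is evaluated using network data from several different locations [...].

[...] or ftp, new network applications have grown rapidly (e.g. streaming, gaming or peer-to-peer (P2P) applications). [...] trend analysis (estimating the origins and size of demand trends for network planning), adaptive [...], lawful interception (permitting minimally invasive warrants and wire-taps on the basis of statistical summaries of traffic details) and intrusion detection (detecting suspicious security-related activity caused by users or worms).

The most common identification technique, based on well-known ports, is no longer reliable because many applications no longer use fixed, predictable ports. The Internet Assigned Numbers Authority (IANA) [1] assigns ports 0-1023 and registers ports 1024-49151, but many applications have no IANA-assigned or registered ports [...]. Many current industry products apply a more reliable technique that reconstructs session and application information from packet content [3] [...]. These methods are more efficient than stateful reconstruction and classify better than the port-based approach, but they still depend on the protocol [...].

[...] automatically builds a classifier by learning the inherent structure [...]. Over the past decade ML has moved out of the laboratory and produced significant commercial value [9]. Our approach includes determining the optimal set of flow attributes that minimises the processing cost [...]. The paper is organised as follows. Section 2 gives an overview of related work. Section 3 describes ML-based application identification [...].

Related work. The idea of using ML techniques for flow classification was first introduced in the field of intrusion detection [10]. [...] The attributes are aggregated from traffic over 24-hour periods. [...] it is not clear what influence the attributes and EM parameters have, nor how good the classification actually is. In [14] an EM algorithm classifies flows according to their size (e.g. mice and elephants). The authors of [16] use a similar approach based on a [...] classifier and a large number of flow attributes. They use only [...].

ML-based application identification. NetMate [18] parses packets into bidirectional flows and computes their feature vectors. [...] [15] [...] the attribute model is used for learning (1). Once the classes have been learned, new flows can be identified (2). (Figure 1: ML-based flow classification model.) The classification results can be used for QoS mapping, trend analysis and so on. We use the autoclass approach [19]. autoclass is an unsupervised classifier that automatically learns the natural classes (also called clusters) in a training dataset of unclassified instances, based on some attributes of the instances. All instances in the autoclass model must be conditionally independent [...]. The interclass probability depends only on J and the known number of instances assigned to Cj, not on Xi. The intraclass probability is [...] the k attributes [...] (see [19] for the more general approach). [...] posterior probabilities for [...] the data into J non-empty subsets, but the number of possible partitions makes [...] for large numbers of instances and/or [...]. autoclass [...] repeats EM searches starting from pseudo-random points in the parameter space. The parameter set with the highest probability for the current dataset is taken as the best. When the number of classes is unknown, autoclass [...] Jstart. As long as entries remain in Jstart, the initial number of classes for an EM search is set to the next entry. If some classes drop out of convergence, the number of classes at the end of an EM search can be smaller. Once the start list is exhausted, autoclass randomly chooses J from a lognormal distribution [...] 10 classifications.

[...] for instance, web traffic is used for different purposes (e.g. bulk transfers, interactive interfaces, streaming) [...] compared with methods that are independent of the learning algorithm, such as correlation-based feature selection (CFS) [21]. [...] a subset, (2) learning the classes and (3) evaluating the class structure. Because an exhaustive search is not feasible, we implemented [...]. The attribute that produces the best result is placed in the attribute list SEL(1). Then all [...] SEL(1) and a second [...] no further improvement. SFS is only one (simple) way of identifying the most useful feature set; there are others such as sequential backward elimination (see [22]). The overall homogeneity H of a set of classes is the mean of the class homogeneities. Our goal is to separate the different applications as far as possible. We use H as the evaluation metric instead of standard metrics such as accuracy, precision and recall because we use an unsupervised machine learning technique.

For the evaluation we use traces from NLANR [23] captured at different locations in different years. [...] destination port. In rare cases the source port was the IANA-defined port, so we swapped both directions of the flow (including IP addresses, ports and flow attributes). [...] traffic on port 80 but not on port 81. We assume there is no strong correlation between the server port and the application characteristics. 'Wrong' ports would introduce an unknown bias, so the server port is not used as an attribute. [...] TCP [...] 6 packets. However, valid UDP flows can consist of only two packets, so excluding small UDP flows may have biased [...]. This strategy clearly leaves aside a large number of flows, but in this work our goal is to separate [...] port scans and the real application. However, especially from a security viewpoint, [...] in general. [...] (FTP data, Telnet, SMTP, DNS, HTTP, AOL Messenger, Napster, Half-Life) [...] randomly sampled [...] SFS.

[Figure] 2 shows an example result of a single run with a fixed attribute set. It shows how the applications [...]. The homogeneity is the largest fraction of flows of one application, e.g. H(leftmost) = 0.52 and H(rightmost) = 1. The overall homogeneity H is the mean over all classes, in this case H = 0.86. (Figure 2: Distribution of applications across the classes.) [Figure] 3 shows the overall mean H depending on the number of attributes. It shows that the efficiency of identifying applications [...] 0.85-0.89 [...] a noticeable homogeneity improvement (values between 0.85 and 0.90), but performance decreases noticeably. (Figure 3: Influence of the attributes and traces on homogeneity.) Figure 4 shows which features were selected for the best feature set of each trace. The y-axis shows [...]. We believe packet length is preferred over inter-arrival time because none of the applications we [...] has characteristic inter-arrival times (e.g. one packet every 20 ms); in that case inter-arrival time statistics would be more useful. (Figure 4: Best feature sets for different traces.) Half-Life traffic does have clearly characteristic inter-arrival times. However, this only applies to game [...]. [...] 0 (no influence) [...] 1 (maximum influence). [Table] 1 shows the mean values across the different traces (based on the learning results obtained with the best attribute sets). [...] the results of [Figure] 4, packet length and volume [...]. (Table 1: Attribute influence.)

We computed the homogeneity of the classes for each application. Figure 5 shows [...] each application across the different traces. As the figure shows, the classes of some applications are quite homogeneous (Half-Life), while those of others are less so (e.g. Telnet, Web). The higher the homogeneity, the more likely an application can be separated from the others. Some applications [...] Half-Life [...] FTP has characteristics very close to those of other applications. (Figure 5: Mean homogeneity of applications across different traces.) [...] This shows that some applications have a very high accuracy, but there are some problems: for Napster there is one trace [...] higher, and for some applications it is close to 95%. The mean accuracy across all traces is 86.5%. (Figure 6: Accuracy of applications across different traces.) [...] false positive rate. FTP, Telnet and Web traffic have the highest false positive rates. Because of [...] login [...], Telnet [...]. (Figure 7: False positive rate of applications across different traces.) [Figure] 8 shows [...] classes in which an application has at least one flow [...] most diverse [...] 100. (Figure 8: Spread of applications over the classes across different traces.)

Conclusions and future work. [...] 86.5%. Some applications can be separated well using the attributes and features employed, whereas other traffic blends together and is hard [...]. We plan to evaluate our approach with a larger number of flows and [...] applications. [...] a comparison based on [...] algorithms; we used the simple [...] classifier mentioned in [16]. In addition, the precision, [...] and classification performance of the classifier still need to be evaluated. [...] [24] [...] packets [...] a reliable classification result (so that classification is as fast as possible in most scenarios) [...]. The approach in this paper might also be applied to security scenarios such as port scan or [...] traffic detection, but we currently [...] and how to trade off the accuracy of the classification method against its computational resource cost.

Automated Traffic Classification and Application Identification using Machine Learning

Centre for Advanced Internet Architectures
Swinburne University of Technology, Melbourne, Australia

The authors thank Cisco Systems, Inc. for supporting this work with a University Research Project grant.

The dynamic classification and identification of network applications responsible for network traffic flows offers substantial benefits to a number of key areas in IP network engineering, management and surveillance. Currently such classifications rely on selected packet header fields (e.g. port numbers) or application layer protocol decoding. These methods have a number of shortfalls: many applications can use unpredictable port numbers, and protocol decoding requires a high amount of computing resources or is simply infeasible when protocols are unknown or encrypted. We propose a novel method for traffic classification and application identification using an unsupervised machine learning technique. Flows are automatically classified based on statistical flow characteristics. We evaluate the efficiency of our approach using data from several traffic traces collected at different locations of the Internet. We use feature selection to find an optimal feature set and determine the influence of different features.

Recent years have seen a dramatic increase in the variety of applications using the Internet. In addition to 'traditional' applications (e.g. email, web or ftp), new applications have gained strong momentum (e.g. streaming, gaming or peer-to-peer (P2P)). The ability to dynamically identify and classify flows according to their network applications is highly beneficial for:

- Trend analyses (estimating the size and origins of capacity demand trends for network planning)
- Adaptive, network-based marking of traffic requiring specific QoS without direct application or end-host involvement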

- Dynamic access control (adaptive firewalls that can detect forbidden applications, Denial of Service (DoS) attacks or other unwanted traffic)
- Lawful Interception (enabling minimally invasive warrants and wire-taps based on statistical summaries of traffic details)
- Intrusion detection (detecting suspicious activities related to security breaches due to malicious users or worms)

The most common identification technique, based on the inspection of 'known port numbers', is no longer accurate because many applications no longer use fixed, predictable port numbers. The Internet Assigned Numbers Authority (IANA) [1] assigns the well-known ports from 0-1023 and registers port numbers in the range from 1024-49151. But many applications have no IANA assigned or registered ports and only utilise 'well known' default ports. Often these ports overlap with IANA ports and an unambiguous identification is no longer possible [2]. Even applications with well-known or registered ports can end up using different port numbers because (i) non-privileged users often have to use ports above 1023, (ii) users may be deliberately [...] or (iii) multiple servers are sharing a single IP address (host). Furthermore, some applications (e.g. passive FTP or video/voice communication) use dynamic ports unknowable in advance.

A more reliable technique used in many current industry products involves stateful reconstruction of session and application information from packet content (e.g. [3]). Although this technique avoids reliance on fixed port numbers, it imposes significant complexity and processing load on the traffic identification device. The device must be kept up-to-date with extensive knowledge of application semantics and network-level syntax, and must be powerful enough to perform concurrent analysis of a potentially large number of flows. This approach can be difficult or impossible when dealing with proprietary protocols or encrypted traffic. Another problem is that direct analysis of session and application layer content may represent an explicit breach of organisational privacy policies or a violation of relevant privacy legislation.

Proceedings of the IEEE Conference on Local Computer Networks 30th Anniversary (LCN'05), 0-7695-2421-4/05 $20.00 © 2005 IEEE

The authors of [4] propose signature-based methods to classify P2P traffic. Although these approaches are more efficient than stateful reconstruction and provide better classification than the port-based approach, they are still protocol dependent. The authors of [5] describe a method to bypass protocol-based detectors.

Previous work used a number of different parameters to describe network traffic (e.g. [4], [6], [7]), including the size and duration of flows, packet length and inter-arrival time distributions, flow idle times etc. We propose to use Machine Learning (ML) [8] to automatically classify and identify network applications based on these parameters. An ML algorithm automatically builds a classifier by learning the inherent structure of a dataset depending on the characteristics. Over the past decade, ML has evolved from a field of laboratory demonstrations to a field of significant commercial value [9].

Our approach includes the task of identifying the optimal set of flow attributes that minimizes the processing cost while maximizing the classification accuracy. We evaluate the effectiveness of our approach using traffic traces collected at different locations of the Internet.

The rest of the paper is organized as follows. Section 2 presents an overview of related work. Section 3 describes our approach for ML-based application identification. Section 4 evaluates our approach using traffic traces. Section 5 concludes and outlines future work.

Related Work

The idea of using ML techniques for flow classification was first introduced in the context of intrusion detection [10]. The authors of [11] use principal component analysis (PCA) and density estimation to classify traffic into different applications. They use the distributions of two flow attributes from a fairly small dataset and study a few well-known ports. In [12] the authors use nearest neighbour (NN) and linear discriminant analysis (LDA) to successfully map applications to different QoS classes using up to four attributes. With this approach, the number of classes is small and known a priori. The attributes used for the classification are aggregated from traffic over 24-hour periods.

The Expectation Maximization (EM) algorithm is used in [13] to cluster flows into different application types using a fixed set of attributes. The authors find that the algorithm separates the traffic into a few basic classes, but from their evaluation it is not clear what influence different attributes and EM parameters have and how good the clustering actually is. In [14] the authors use a simulated annealing EM algorithm to classify flows according to their size (e.g. mice and elephants). The authors conclude that their approach produces more meaningful results than previous threshold-based methods.

We have proposed an ML-based approach for identifying different network applications in [15]. In this paper we evaluate the approach using a number of traffic traces collected at different locations in the Internet. The authors of [16] use a similar approach based on the Naïve Bayes classifier and a large number of flow attributes. They only use one dataset, but the flows in this set have been hand-classified, allowing a very accurate evaluation. Research on combining different non-ML techniques to identify network applications is presented in [17].

ML-based Application Identification

First we classify packets into bidirectional flows and compute the flow characteristics using NetMate [18]. Sampling can be used to select only a subset of the flow data to improve the performance of the learning process (see [15] for more details). The flow characteristics and a model of the flow attributes are then used to learn the classes (1). Once the classes have been learned, new flows can be classified (2).

Figure 1: ML-based flow classification (labels: Data Source, ML Learning & Classification)

Note that we use the term learning for the initial process of constructing the classifier to differentiate it from the latter process of classification. However, in other work this is sometimes called classification or clustering. The results of the learning and classification can be used for, e.g., QoS mapping and trend analysis.
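The flow characteristics step can be illustrated with a small sketch. It is not NetMate's implementation. It assumes the attribute set the paper lists in its evaluation (packet length and inter-arrival time mean and variance, flow volume per direction, plus duration) and the rule that flows need at least three packets in each direction. The function name `flow_attributes`, the `(timestamp, length, direction)` packet tuples and the use of the population variance are hypothetical choices made for illustration.

```python
from statistics import mean, pvariance


def flow_attributes(packets, min_packets=3):
    """Compute per-flow attributes from (timestamp_s, length_bytes, direction) tuples.

    direction is "fwd" or "bwd". Returns None when either direction has fewer
    than min_packets packets, because the statistics would be incomplete.
    """
    packets = sorted(packets)  # order by timestamp
    if not packets:
        return None
    # Duration is the only attribute that is not computed per direction.
    attrs = {"duration": packets[-1][0] - packets[0][0]}
    for direction in ("fwd", "bwd"):
        times = [t for t, _, d in packets if d == direction]
        lengths = [n for _, n, d in packets if d == direction]
        if len(lengths) < min_packets:
            return None
        # Inter-arrival times between consecutive packets of one direction.
        gaps = [later - earlier for earlier, later in zip(times, times[1:])]
        attrs[f"{direction}_pkt_len_mean"] = mean(lengths)
        attrs[f"{direction}_pkt_len_var"] = pvariance(lengths)  # assumption: population variance
        attrs[f"{direction}_iat_mean"] = mean(gaps)
        attrs[f"{direction}_iat_var"] = pvariance(gaps)
        attrs[f"{direction}_volume"] = sum(lengths)
    return attrs
```

Returning `None` for short flows keeps the filtering decision next to the attribute computation, so a caller can drop such flows before learning.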

ML Algorithm

For the machine learning we use the autoclass approach [19]. autoclass is an unsupervised Bayesian classifier that automatically learns the 'natural' classes (also called clustering) inherent in a training dataset with unclassified instances, based on some attributes of the instances. The resulting classifier can then be used to classify new unseen instances. In this section we only provide an overview and the reader is referred to [19] for a detailed description of the algorithm.

In the autoclass model all instances must be conditionally independent, and therefore any similarity between two instances is accounted for by their class membership only. The class an instance is a member of is an unknown or hidden attribute of each instance. The probability that an instance $X_i$ of a set of $I$ instances (the database) with attribute values $\vec{X}_i$ is a member of a particular class $C_j$ of a set of $J$ classes consists of two parts: the interclass probability and the probability density function of each class (intraclass probability). Because the classes constitute a discrete partitioning of the data, the appropriate interclass probability density function is a Bernoulli distribution characterised by a set $\vec{V}_c$ of probabilities $\{\pi_1, \dots, \pi_J\}$, constrained such that $0 \le \pi_j \le 1$ and $\sum_j \pi_j = 1$:

$$P(X_i \in C_j \mid \vec{V}_c) \equiv \pi_j$$

The interclass probability depends only on $J$ and the known number of instances assigned to $C_j$, but not on $\vec{X}_i$. The intraclass probability is the product of the conditionally independent probability distributions of the $k$ attributes:

$$P(\vec{X}_i \mid X_i \in C_j, \vec{V}_j) = \prod_k P(X_{ik} \mid X_i \in C_j, \vec{V}_{jk})$$

autoclass supports discrete and real attribute models for the individual $P(X_{ik})$. However, we only use real attributes, which are modelled with lognormal distributions. Therefore we assume there is only one functional form for the attribute probability density functions and have omitted this parameter from all the equations (see [19] for the more general approach). Combining the interclass and intraclass probabilities we get the direct probability that instance $X_i$ with attribute values $\vec{X}_i$ is a member of class $C_j$:

$$P(\vec{X}_i, X_i \in C_j \mid \vec{V}_j, \vec{V}_c) = \pi_j \prod_k P(X_{ik} \mid X_i \in C_j, \vec{V}_{jk})$$

By introducing priors on the parameter set this can be converted into a Bayesian model, obtaining the joint probability of the parameter set $\vec{V}$ and the current database $X$:

$$P(X, \vec{V}) = P(\vec{V})\, P(X \mid \vec{V})$$

The goal is to find maximum posterior parameter values obtained from the parameters' posterior probability density function:

$$P(\vec{V} \mid X) = \frac{P(X, \vec{V})}{P(X)} = \frac{P(X, \vec{V})}{\int d\vec{V}\, P(X, \vec{V})}$$

One could use this equation directly, computing the posterior probabilities for every partitioning of the data into $J$ non-empty subsets. But the number of possible partitions approaches $J^I$ for small $J$, making the approach computationally infeasible for large sets of instances and/or classes. Therefore autoclass uses an approximation based on the EM algorithm [20]. autoclass cycles between estimating the class assignments for all instances and estimating the parameters $\vec{V}$. The EM algorithm is guaranteed to converge to a local maximum. In an attempt to find the global maximum, autoclass performs repeated EM searches starting from pseudo-random points in the parameter space. The model with the parameter set scoring the highest probability given the current database is chosen as the best.

autoclass can be preconfigured with the number of classes (if known) or it can try to estimate the number of classes itself. For our problem the exact number of classes is unknown in advance. One could argue that there should be exactly one class per application. However, we found that the distributions of flow attributes, even for a single application, can be quite complex. Because we are using simple attribute models (lognormal distributions) it is not possible to model each application with a single class.

When the number of classes is unknown, autoclass is configured with a start list of class numbers $J_{start}$. Then for each EM search the initial number of classes is taken from the next entry in $J_{start}$ as long as there are entries left. The number of classes at the end of an EM search can be smaller if some of the classes drop out of convergence. For all further iterations after the start list is exhausted, autoclass randomly chooses $J$ from a lognormal distribution fitted to the number of classes of the best 10 classifications found so far.
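The repeated EM search can be illustrated with a deliberately simplified sketch. It is not autoclass: it fits a one-dimensional mixture of lognormals by maximum likelihood (no priors, a fixed number of classes, a single attribute), restarts from pseudo-random points and keeps the best-scoring run. The function names, the iteration count and the initialisation scheme are assumptions made for illustration.

```python
import numpy as np


def _log_joint(y, pi, mu, sigma):
    # log(pi_j) + log N(y_i | mu_j, sigma_j) for every instance i and class j.
    return (np.log(pi) - np.log(sigma) - 0.5 * np.log(2 * np.pi)
            - 0.5 * ((y[:, None] - mu) / sigma) ** 2)


def em_lognormal_mixture(x, n_classes, n_iter=200, rng=None):
    """One EM search; a lognormal mixture on x is a Gaussian mixture on log(x)."""
    rng = np.random.default_rng() if rng is None else rng
    y = np.log(np.asarray(x, dtype=float))
    # Pseudo-random starting point: class means drawn from the data.
    mu = rng.choice(y, size=n_classes, replace=False)
    sigma = np.full(n_classes, y.std() + 1e-6)
    pi = np.full(n_classes, 1.0 / n_classes)
    for _ in range(n_iter):
        # E-step: class membership probabilities of every instance.
        log_p = _log_joint(y, pi, mu, sigma)
        resp = np.exp(log_p - np.logaddexp.reduce(log_p, axis=1, keepdims=True))
        # M-step: re-estimate class weights, means and spreads.
        weight = resp.sum(axis=0) + 1e-12
        pi = weight / weight.sum()
        mu = (resp * y[:, None]).sum(axis=0) / weight
        sigma = np.sqrt((resp * (y[:, None] - mu) ** 2).sum(axis=0) / weight) + 1e-6
    # Log-likelihood on the log scale (constant Jacobian term omitted).
    score = np.logaddexp.reduce(_log_joint(y, pi, mu, sigma), axis=1).sum()
    return pi, mu, sigma, score


def repeated_em(x, n_classes, n_searches=10, seed=0):
    """Run several EM searches and keep the one with the highest score."""
    rng = np.random.default_rng(seed)
    runs = [em_lognormal_mixture(x, n_classes, rng=rng) for _ in range(n_searches)]
    return max(runs, key=lambda run: run[3])
```

Each single search only reaches a local maximum, which is why the wrapper compares the scores of several restarts rather than trusting one run.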

Having multiple classes per application provides the advantage of a more fine-grained view into an application. For instance, web traffic is used for different purposes (e.g. bulk transfers, interactive interfaces, streaming, etc.) and for detailed analysis it would be beneficial to differentiate between them. On the downside, it affects the performance of the approach in terms of runtime and memory.

Feature Selection

Our feature selection technique is based on the actual performance of the learning algorithm. This method generally achieves the highest accuracy because it 'tailors' the feature set to the algorithm. On the downside it is much more computationally expensive (especially when the learning algorithm is not very fast) than algorithm-independent methods such as correlation-based feature selection (CFS) [21].

Finding the combination of attributes that provides the most contrasting application classes is a repeated process of (i) selecting a subset of attributes, (ii) learning the classes and (iii) evaluating the class structure. We implemented sequential forward selection (SFS) to find the best attribute set because an exhaustive search is not feasible. The algorithm starts with every single attribute. The attribute that produces the best result is placed in a list of selected attributes SEL(1). Then all combinations of SEL(1) and a second attribute not in SEL(1) are tried. The combination that produces the best result becomes SEL(2). The process is repeated until no further improvement is achieved. SFS is only one (simple) approach to identify the most useful feature set and there are other approaches such as sequential backward elimination (see [22]).

To assess the quality of the resulting classes we have developed a metric termed intra-class homogeneity H. We define A and C as the sets of applications and classes found during the learning, respectively. We also define a function count(a, c) that counts the number of flows that application $a \in A$ has in class $c \in C$. Then the homogeneity H(c) of a class c is defined as the largest fraction of flows of one application in the class:

$$H(c) = \frac{\max_{a \in A} \mathrm{count}(a, c)}{\sum_{a \in A} \mathrm{count}(a, c)}$$

The overall homogeneity H of a set of classes is the mean of the class homogeneities:

$$H = \frac{\sum_{c \in C} H(c)}{|C|}$$
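The sequential forward selection procedure described above can be sketched as follows. This is a generic illustration, not the authors' code: `evaluate` is a hypothetical callback that would learn the classes for a given attribute subset and return a quality score (for example the overall homogeneity H), and the attribute names in the usage example are made up.

```python
def sequential_forward_selection(attributes, evaluate):
    """Greedy SFS: grow the selected set one attribute at a time.

    attributes: iterable of attribute names.
    evaluate:   callable taking a tuple of attribute names and returning a
                score where higher is better (hypothetical; e.g. learn the
                classes and return the overall homogeneity H).
    Returns (selected_attributes, best_score).
    """
    selected = []
    remaining = list(attributes)
    best_score = float("-inf")
    while remaining:
        # Try every combination of the current selection plus one unused attribute.
        scored = [(evaluate(tuple(selected + [a])), a) for a in remaining]
        score, attribute = max(scored, key=lambda pair: pair[0])
        if score <= best_score:
            break  # no further improvement, stop searching
        selected.append(attribute)
        remaining.remove(attribute)
        best_score = score
    return selected, best_score


# Toy usage: the score is just a sum of made-up per-attribute contributions.
toy_gain = {"fwd_pkt_len_mean": 0.5, "bwd_pkt_len_var": 0.25,
            "duration": 0.0, "fwd_iat_mean": -0.125}
best_set, best = sequential_forward_selection(
    toy_gain, lambda subset: sum(toy_gain[a] for a in subset))
```

With n attributes the greedy search needs at most n(n+1)/2 evaluations instead of the 2^n - 1 subsets of an exhaustive search, at the price of possibly missing the globally best subset.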

The goal is to maximize H to achieve a good separation between different applications. The reason why we use H as an evaluation metric, instead of standard metrics like accuracy, precision and recall, is that we are using an unsupervised learning technique. The number of classes and the class-to-application mapping are not known before the learning. Afterwards each class can be assigned to the application that has the most flows in it. However, if more than one application contributes a significant number of flows to a class, the mapping can be difficult. High homogeneity values are required in order to unambiguously map a class to an application.

Trace Files

For the evaluation we use the Auckland-VI, NZIX-II and Leipzig-II traces from NLANR [23], captured in different years at different locations in the Internet. Because we are limited to publicly available anonymised traces, we are unable to verify the true applications that created the flows. In our evaluation we therefore assume a flow's IANA defined or registered server port identifies the application. In our case the server port is usually the destination port of the bidirectional flows. In rare cases where the source port was the IANA defined port we have swapped both directions of the flow (including IP addresses, ports and flow attributes).

We admit that assuming the server port always identifies an application is not correct. However, we assume that for the ports we use in this study the majority of the traffic is from the expected application. Then it is most likely that the few 'wrong' flows would decrease the homogeneity of the learned classes. Therefore our evaluation results can be treated as a lower bound of the effectiveness. We also do not consider traffic of the selected applications on other than the standard server ports, e.g. we only consider web traffic on port 80 but not on port 81. Assuming there is no strong correlation between the used server port and the application characteristics, this does not introduce any additional bias because it can be viewed as random sampling.

Flow Attributes

Our attribute set includes packet inter-arrival time and packet length mean and variance, flow size (bytes) and duration. Aside from duration, all attributes are bidirectional, meaning they are separately computed for both directions of a flow. Our goal is to minimise the number of attributes and we only use 'basic' attributes that can be easily computed. We are not using the server port as an attribute because 'wrong' ports could introduce an unknown bias.

In our analysis we exclude flows that have fewer than three packets in each direction because for very short flows only some attributes could be computed, e.g. flows containing only a single packet would provide no inter-arrival time statistics and packet length statistics can only be computed in the forward direction. This would more than halve the number of available attributes, making it difficult to separate different applications, and would most likely bias the attribute influence results. Furthermore, any valid TCP flow should have at least six packets. However, valid UDP flows can consist of just two packets and therefore excluding small UDP flows may have biased the results for DNS and Half-Life traffic.

This strategy clearly leaves aside a substantial number of flows, but in this work we aim to separate different applications and we are not interested in 'strange' flows or anomalies. In fact, using these flows would be dangerous because we rely on the server port for validating our approach. For instance, if we used one-packet flows we might confuse port scans with the real application. However, in general small flows can provide interesting insights and should not be ignored, especially from a security viewpoint.

Identifying Network Applications

For performance reasons we use a subset of 8,000 flows from each trace file. For each application (FTP data, Telnet, SMTP, DNS, HTTP, AOL Messenger, Napster, Half-Life) we randomly sample 1,000 flows out of all flows of that particular application. We derive the flow samples and perform the SFS for each of the four traces using the same parameters for the learning algorithm.

Figure 2 shows an example result of a single run of the algorithm with a fixed attribute set. It shows how the applications are distributed among the classes that have been found. The classes are ordered with decreasing class size from left to right (increasing class number). For each of the classes the homogeneity is the maximum fraction of flows of one application, e.g. H(leftmost) = 0.52 and H(rightmost) = 1. The overall homogeneity H is the mean of all class homogeneities and in this case H = 0.86.

Figure 2: Example distribution of applications across the classes

Figure 3 shows the overall mean H depending on the number of attributes. It shows that the overall efficiency in identifying the different applications increases with the number of attributes until it reaches a maximum between 0.85 and 0.89 depending on the trace. By the definition of H, this means that on average the dominant application accounts for 85% to 89% of the flows in a class.

Figure 3: Homogeneity depending on the number of flow attributes and different traces (y-axis: intra-class homogeneity H; x-axis: number of attributes)

Figure 4 shows what features have been selected for the best sets for each of the traces. The y-axis shows the percentage of traces in which a feature made it into the best set. Although the best feature sets are different for all the traces, there is a clear trend towards some of the features. Packet length statistics seem to be preferred over inter-arrival time statistics, and duration and backward volume also seem to be of limited value.

Figure 4: Features selected for the best feature sets of different traces

We believe the reason why the packet length is preferred over inter-arrival times is that none of the applications we investigate has very characteristic inter-arrival time distributions. If, for instance, we had chosen voice communication, where applications have very characteristic inter-arrival times (e.g. one packet every 20 ms), we would expect inter-arrival times to be much more useful. Game traffic such as Half-Life traffic is known to have very characteristic inter-arrival times. However, this is only true for the traffic that exchanges game state information during a game. Most Half-Life traffic flows in our dataset are actually caused by players just querying information from the server, such as the number of active players. A potential problem with inter-arrival times is that packet queuing in routers can change their distributions, especially in case of congestion. In contrast, packet lengths are usually constant provided there is no intermediate fragmentation or encryption.

We also estimate the influence of the different attributes on the outcome of the learning. The influence of an attribute in a class is determined [...] w.r.t. the global distribution of a single-class classification. The total influence of an attribute is the class probability weighted average of the influence in each of the classes. The influence values range from 0 (no influence) to 1 (maximum influence). Table 1 shows the mean values (based on the learning results when using the best attribute sets) across the different traces. The results are similar to Figure 4 in that packet length and volume statistics are most influential while inter-arrival times and duration have less influence.

We computed the homogeneity of the classes for each of the different applications. Figure 5 shows the per-application homogeneity distribution across the different traces. The per-application homogeneity is defined as the mean homogeneity of all classes where an application has the largest fraction. The distributions are shown as boxplots. The lower end, middle and upper end of the box are the 1st quartile, median and 3rd quartile, respectively.

The figure shows that the classes of some applications are quite homogeneous (e.g. Half-Life) but for others they are less homogeneous (e.g. Telnet, Web). The higher the homogeneity, the more likely it is to separate an application from all the others. Some applications such as Half-Life can be well separated from the rest, but others such as FTP seem to have characteristics very similar to other applications.
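The homogeneity measures used in this evaluation can be computed from a list of (application, class) pairs. The sketch below follows the definitions given in the text: H(c) is the largest fraction of one application in a class, H is the mean of the class homogeneities, and the per-application homogeneity is the mean over the classes an application dominates. The function names and the input layout are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def class_counts(flows):
    """Build count(a, c) from (application, class_id) pairs."""
    counts = defaultdict(Counter)
    for application, class_id in flows:
        counts[class_id][application] += 1
    return counts


def class_homogeneity(counts):
    """H(c): largest fraction of flows of one application in each class."""
    return {c: max(apps.values()) / sum(apps.values()) for c, apps in counts.items()}


def overall_homogeneity(counts):
    """H: mean of the class homogeneities."""
    h = class_homogeneity(counts)
    return sum(h.values()) / len(h)


def per_application_homogeneity(counts):
    """Mean homogeneity of all classes in which an application has the largest fraction."""
    h = class_homogeneity(counts)
    dominated = defaultdict(list)
    for c, apps in counts.items():
        # Ties are broken by first occurrence, an arbitrary choice in this sketch.
        dominant = apps.most_common(1)[0][0]
        dominated[dominant].append(h[c])
    return {app: sum(values) / len(values) for app, values in dominated.items()}
```

Applications that dominate no class do not appear in the per-application result, which mirrors the definition: their flows are always a minority in the classes they fall into.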

Figure 5: Mean homogeneity per application across different traces (applications: FTP, Telnet, SMTP, DNS, Web, AOL, Napster, H-Life)

Figure 6 shows the percentage of flows in the 'correct' classes for each application and all traces (this is usually called accuracy). To compute the accuracy we map each class to the application that is dominating the class (by having the largest fraction of flows in that class). The figure indicates the expected classification accuracy. It shows that some applications have a very high accuracy, but there are some problems, e.g. for Napster there is one trace [...]. For some applications the accuracy is close to 95%. The mean accuracy across all traces is 86.5%.

Figure 6: Accuracy per application across different traces

While the accuracy gives the percentage of correctly classified flows, it does not provide a measure of the applications into which flows are likely to be misclassified. To address this issue we also compute the false positive rate per application, which is defined as the number of misclassified flows divided by the total number of flows in all classes assigned to the application. Figure 7 shows the percentage of false positives for each application. FTP, Telnet and Web traffic have the highest percentage of false positives. [...] has the problem that it [...]

Figure 7: False positives per application across different traces
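The two measures above can be written down directly from their definitions. The following sketch assumes a mapping from each class to a `Counter` of flows per application. The function names and the toy numbers are hypothetical and are not taken from the traces.

```python
from collections import Counter


def dominant_application(counts):
    """Map each class to the application with the most flows in it."""
    return {c: apps.most_common(1)[0][0] for c, apps in counts.items()}


def accuracy_per_application(counts):
    """Fraction of an application's flows that fall into classes mapped to it."""
    mapping = dominant_application(counts)
    total, correct = Counter(), Counter()
    for c, apps in counts.items():
        for app, n in apps.items():
            total[app] += n
            if mapping[c] == app:
                correct[app] += n
    return {app: correct[app] / total[app] for app in total}


def false_positive_rate_per_application(counts):
    """Misclassified flows divided by all flows in the classes assigned to an application."""
    mapping = dominant_application(counts)
    assigned, wrong = Counter(), Counter()
    for c, apps in counts.items():
        app = mapping[c]
        flows_in_class = sum(apps.values())
        assigned[app] += flows_in_class
        wrong[app] += flows_in_class - apps[app]
    return {app: wrong[app] / assigned[app] for app in assigned}


# Toy example with three classes (made-up numbers).
toy_counts = {
    0: Counter(web=3, ftp=1),
    1: Counter(dns=2),
    2: Counter(smtp=2, web=1, ftp=1),
}
toy_accuracy = accuracy_per_application(toy_counts)
toy_false_positives = false_positive_rate_per_application(toy_counts)
```

In the toy data FTP never dominates a class, so it has an accuracy of zero and no false positive rate at all, which illustrates why the two measures are reported side by side.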

To visualize which applications have the most diverse characteristics, the percentage of classes in which an application has at least one flow is shown in Figure 8 (spread of an application among the different classes). Not surprisingly, web traffic is the most distributed application (similar results were found in [13]) whereas game traffic is the least distributed. On average the total number of classes found was 100.

Figure 8: Spread of the applications over the classes across different traces

Although the main focus is on achieving a g[...]
