
Machine Learning For Absolute Beginners

Oliver Theobald

Second Edition

Copyright © 2017 by Oliver Theobald

All rights reserved. No part of this publication may be reproduced, distributed, or transmitted in any form or by any means, including photocopying, recording, or other electronic or mechanical methods, without the prior written permission of the publisher, except in the case of brief quotations embodied in critical reviews and certain other non-commercial uses permitted by copyright law.

Contents

INTRODUCTION
WHAT IS MACHINE LEARNING?
ML CATEGORIES
THE ML TOOLBOX
DATA SCRUBBING
SETTING UP YOUR DATA
REGRESSION ANALYSIS
CLUSTERING
BIAS & VARIANCE
ARTIFICIAL NEURAL NETWORKS
DECISION TREES
ENSEMBLE MODELING
BUILDING A MODEL IN PYTHON
MODEL OPTIMIZATION
FURTHER RESOURCES
DOWNLOADING DATASETS
FINAL WORD

INTRODUCTION

Machines have come a long way since the Industrial Revolution. They continue to fill factory floors and manufacturing plants, but now their capabilities extend beyond manual activities to cognitive tasks that, until recently, only humans were capable of performing. Judging song competitions, driving automobiles, and mopping the floor with professional chess players are three examples of the specific complex tasks machines are now capable of simulating.

But their remarkable feats trigger fear among some observers. Part of this fear nestles on the neck of survivalist insecurities, where it provokes the deep-seated question of what if? What if intelligent machines turn on us in a struggle of the fittest? What if intelligent machines produce offspring with capabilities that humans never intended to impart to machines? What if the legend of the singularity is true?

The other notable fear is the threat to job security, and if you’re a truck driver or an accountant, there is a valid reason to be worried. According to the British Broadcasting Corporation’s (BBC) interactive online resource Will a robot take my job?, professions such as bar worker (77%), waiter (90%), chartered accountant (95%), receptionist (96%), and taxi driver (57%) each have a high chance of becoming automated by the year 2035.[1]

But research on planned job automation and crystal ball gazing with respect to the future evolution of machines and artificial intelligence (AI) should be read with a pinch of skepticism. AI technology is moving fast, but broad adoption is still an uncharted path fraught with known and unforeseen challenges. Delays and other obstacles are inevitable.

Nor is machine learning a simple case of flicking a switch and asking the machine to predict the outcome of the Super Bowl and serve you a delicious martini. Machine learning is far from what you would call an out-of-the-box solution.

Machines operate based on statistical algorithms managed and overseen by skilled individuals, known as data scientists and machine learning engineers. This is one labor market where job opportunities are destined for growth but where, currently, supply is struggling to meet demand. Industry experts lament that one of the biggest obstacles delaying the progress of AI is the inadequate supply of professionals with the necessary expertise and training.

According to Charles Green, the Director of Thought Leadership at Belatrix Software:

“It’s a huge challenge to find data scientists, people with machine learning experience, or people with the skills to analyze and use the data, as well as those who can create the algorithms required for machine learning. Secondly, while the technology is still emerging, there are many ongoing developments. It’s clear that AI is a long way from how we might imagine it.”[2]

Perhaps your own path to becoming an expert in the field of machine learning starts here, or maybe a baseline understanding is sufficient to satisfy your curiosity for now. In any case, let’s proceed with the assumption that you are receptive to the idea of training to become a successful data scientist or machine learning engineer.

To build and program intelligent machines, you must first understand classical statistics. Algorithms derived from classical statistics contribute the metaphorical blood cells and oxygen that power machine learning. Layer upon layer of linear regression, k-nearest neighbors, and random forests surge through the machine and drive its cognitive abilities. Classical statistics is at the heart of machine learning, and many of these algorithms are based on the same statistical equations you studied in high school. Indeed, statistical algorithms were worked out on paper well before machines ever took on the title of artificial intelligence.

Computer programming is another indispensable part of machine learning. There isn’t a click-and-drag or Web 2.0 solution to perform advanced machine learning in the way one can conveniently build a website nowadays with WordPress or Strikingly. Programming skills are therefore vital to manage data and design statistical models that run on machines.

Some students of machine learning will have years of programming experience but haven’t touched classical statistics since high school. Others, perhaps, never even attempted statistics in their high school years. But not to worry: many of the machine learning algorithms we discuss in this book have working implementations in your programming language of choice, with no equation writing necessary. You can use code to execute the actual number crunching for you.

If you have not learned to code before, you will need to if you wish to make further progress in this field. But for the purpose of this compact starter’s course, the curriculum can be completed without any background in computer programming. This book focuses on the high-level fundamentals of machine learning as well as the mathematical and statistical underpinnings of designing machine learning models.

For those who do wish to look at the programming aspect of machine learning, Chapter 13 walks you through the entire process of setting up a supervised learning model using the popular programming language Python.

WHAT IS MACHINE LEARNING?

In 1959, IBM published a paper in the IBM Journal of Research and Development with an, at the time, obscure and curious title. Authored by IBM’s Arthur Samuel, the paper investigated the use of machine learning in the game of checkers “to verify the fact that a computer can be programmed so that it will learn to play a better game of checkers than can be played by the person who wrote the program.”[3]

Although it was not the first publication to use the term “machine learning” per se, Arthur Samuel is widely considered the first person to coin and define machine learning in the form we know today. Samuel’s landmark journal submission, Some Studies in Machine Learning Using the Game of Checkers, is also an early indication of Homo sapiens’ determination to impart our own system of learning to man-made machines.

Figure 1: Historical mentions of “machine learning” in published books. Source: Google Ngram Viewer, 2017

Arthur Samuel introduces machine learning in his paper as a subfield of computer science that gives computers the ability to learn without being explicitly programmed.[4] Almost six decades later, this definition remains widely accepted.

Although not directly mentioned in Arthur Samuel’s definition, a key feature of machine learning is the concept of self-learning. This refers to the application of statistical modeling to detect patterns and improve performance based on data and empirical information, all without direct programming commands. This is what Arthur Samuel described as the ability to learn without being explicitly programmed. But he doesn’t imply that machines formulate decisions with no upfront programming. On the contrary, machine learning is heavily dependent on computer programming. Instead, Samuel observed that machines don’t require a direct input command to perform a set task but rather input data.

Figure 2: Comparison of Input Command vs Input Data

An example of an input command is typing “2+2” into a programming language such as Python and hitting “Enter.”

>>> 2+2
4
>>>

This represents a direct command with a direct answer.

Input data, however, is different. Data is fed to the machine, an algorithm is selected, hyperparameters (settings) are configured and adjusted, and the machine is instructed to conduct its analysis. The machine proceeds to decipher patterns found in the data through the process of trial and error. The machine’s data model, formed from analyzing data patterns, can then be used to predict future values.
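To make the contrast with a direct command concrete, here is a minimal sketch of the input-data workflow. It assumes a made-up dataset in which y happens to be twice x; the learning rate and the number of passes stand in for hyperparameters, and the single weight `w` plays the role of the model. All names and values are illustrative assumptions, not something prescribed by this book.

```python
# Hypothetical input data: the machine is never told that y = 2 * x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

# Hyperparameters: settings chosen by the programmer, not learned from data.
learning_rate = 0.01
passes = 200

w = 0.0  # the model: predicted y = w * x
for _ in range(passes):
    # Average gradient of the squared error with respect to w.
    gradient = sum(2 * x * (w * x - y) for x, y in zip(xs, ys)) / len(xs)
    # Trial and error: nudge w in the direction that reduces the error.
    w -= learning_rate * gradient

prediction = w * 5.0  # use the learned model on an unseen input
print(round(w, 3), round(prediction, 3))
```

Nothing in the code states the rule that links x to y. The weight is adjusted repeatedly until the predictions match the data, and the resulting model is then applied to a value it has not seen.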

Although there is a relationship between the programmer and the machine, they operate a layer apart in comparison to traditional computer programming. This is because the machine is formulating decisions based on experience and mimicking the process of human-based decision-making.

As an example, let’s say that after examining the YouTube viewing habits of data scientists your machine identifies a strong relationship between data scientists and cat videos. Later, your machine identifies patterns among the physical traits of baseball players and their likelihood of winning the season’s Most Valuable Player (MVP) award. In the first scenario, the machine analyzed what videos data scientists enjoy watching on YouTube based on user engagement, measured in likes, subscribes, and repeat viewing. In the second scenario, the machine assessed the physical features of previous baseball MVPs among various other features such as age and education. However, in neither of these two scenarios was your machine explicitly programmed to produce a direct outcome. You fed the input data and configured the nominated algorithms, but the final prediction was determined by the machine through self-learning and data modeling.

You can think of building a data model as similar to training a guide dog. Through specialized training, guide dogs learn how to respond in various situations. For example, the dog will learn to heel at a red light or to safely lead its master around obstacles. If the dog has been properly trained, then, eventually, the trainer will no longer be required; the guide dog will be able to apply its training in various unsupervised situations. Similarly, machine learning models can be trained to form decisions based on past experience.

A simple example is creating a model that detects spam email messages. The model is trained to block emails with suspicious subject lines and body text containing three or more flagged keywords: dear friend, free, invoice, PayPal, Viagra, casino, payment, bankruptcy, and winner. At this stage, though, we are not yet performing machine learning. If we recall the visual representation of input command vs input data, we can see that this process consists of only two steps: Command > Action.

Machine learning entails a three-step process: Data > Model > Action.

Thus, to incorporate machine learning into our spam detection system, we need to switch out “command” for “data” and add “model” in order to produce an action (output). In this example, the data comprises sample emails and the model consists of statistics-based rules. The parameters of the model include the same keywords from our original negative list. The model is then trained and tested against the data.

Once the data is fed into the model, there is a strong chance that assumptions contained in the model will lead to some inaccurate predictions. For example, under the rules of this model, the following email subject line would automatically be classified as spam: “PayPal has received your payment for Casino Royale purchased on eBay.”
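The fixed keyword rule can be sketched in a few lines. This is a hypothetical illustration: the function name `is_spam` and the use of simple substring matching are assumptions, while the keyword list and the three-keyword threshold come from the example in the text.

```python
# Negative list of flagged keywords taken from the example in the text.
FLAGGED_KEYWORDS = [
    "dear friend", "free", "invoice", "paypal", "viagra",
    "casino", "payment", "bankruptcy", "winner",
]

def is_spam(text, threshold=3):
    """Command > Action: flag a message containing `threshold` or more keywords."""
    lowered = text.lower()
    hits = [keyword for keyword in FLAGGED_KEYWORDS if keyword in lowered]
    return len(hits) >= threshold

subject = "PayPal has received your payment for Casino Royale purchased on eBay"
print(is_spam(subject))             # the genuine PayPal notice is flagged
print(is_spam("Lunch on Friday?"))  # no flagged keywords, so not flagged
```

The subject line contains “PayPal,” “payment,” and “casino,” so it reaches the threshold even though the message is legitimate.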

As this is a genuine email sent from a PayPal auto-responder, the spam detection system is lured into producing a false positive based on the negative list of keywords contained in the model. Traditional programming is highly susceptible to such cases because there is no built-in mechanism to test assumptions and modify the rules of the model. Machine learning, on the other hand, can adapt and modify assumptions through its three-step process and by reacting to errors.
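As a rough sketch of how Data > Model > Action might adapt where a fixed rule cannot, the example below estimates from labeled sample emails how often each keyword actually appears in spam, then averages those rates for a new message. The four sample emails, the function names `train` and `classify`, and the 0.5 cutoff are all assumptions made for illustration; this is a toy frequency estimate, not the method of any particular spam filter.

```python
from collections import Counter

KEYWORDS = ["dear friend", "free", "invoice", "paypal", "viagra",
            "casino", "payment", "bankruptcy", "winner"]

# Hypothetical labeled sample emails: (subject line, is_spam).
SAMPLES = [
    ("Dear friend, you are the winner of a free casino trip", True),
    ("Free Viagra! Winner, claim your payment now", True),
    ("PayPal has received your payment", False),
    ("Invoice attached for your PayPal payment", False),
]

def train(samples):
    """Data > Model: estimate how often each keyword appears in spam."""
    seen, seen_in_spam = Counter(), Counter()
    for text, spam in samples:
        lowered = text.lower()
        for keyword in KEYWORDS:
            if keyword in lowered:
                seen[keyword] += 1
                seen_in_spam[keyword] += spam  # True counts as 1
    return {keyword: seen_in_spam[keyword] / seen[keyword] for keyword in seen}

def classify(text, spam_rates, cutoff=0.5):
    """Model > Action: average the learned spam rates of the keywords present."""
    lowered = text.lower()
    rates = [rate for keyword, rate in spam_rates.items() if keyword in lowered]
    return bool(rates) and sum(rates) / len(rates) > cutoff

spam_rates = train(SAMPLES)
subject = "PayPal has received your payment for Casino Royale purchased on eBay"
print(classify(subject, spam_rates))
```

Because the sample data shows “PayPal” and “payment” appearing mostly in legitimate mail, their learned rates are low and the genuine PayPal notice is no longer flagged, while a message made up of keywords seen only in spam still is.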

Training & Test Data

In machine learning, data is split into training data and test data. The first split of data, i.e. the initial reserve of data you use to develop your model, provides the training data. In the spam email detection example, false positives similar to the PayPal auto-response might be detected from the training data. New rules or modifications must then be added, e.g., email notifications issued from the sending address “payments@” should be excluded from spam filtering.

After you have successfully developed a model based on the training data and are satisfied with its accuracy, you can then test the model on the remaining data, known as the test data. Once you are satisfied with the results of both the training data and test data, the machine learning model is ready to filter incoming emails and generate decisions on how to categorize those incoming messages.
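The split itself is simple to sketch. The helper below is a hypothetical, standard-library-only illustration; the 70/30 ratio, the fixed seed, and the placeholder email names are assumptions rather than values given in the text.

```python
import random

def train_test_split(rows, test_fraction=0.3, seed=42):
    """Shuffle a copy of the rows, then reserve a fraction as test data."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

emails = [f"email_{i}" for i in range(10)]  # placeholder dataset
training_data, test_data = train_test_split(emails)
print(len(training_data), len(test_data))
```

Shuffling before splitting helps keep both portions representative of the whole dataset, and every row ends up in exactly one of the two portions.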

The difference between machine learning and traditional programming may seem trivial at first, but it will become clear as you run through further examples and witness the special power of self-learning in more nuanced situations.

The second important point to take away from this chapter is how machine learning fits into the broader landscape of data science and computer science. This means understanding how machine learning interrelates with parent fields and sister disciplines. This is important, as you will encounter these related terms when searching for relevant study materials, and you will hear them mentioned ad nauseam in introductory machine learning courses. Related disciplines, such as “machine learning” and “data mining,” can also be difficult to tell apart at first glance.

Let’s begin with a high-level introduction. Machine learning, data mining, computer programming, and most relevant fields (excluding classical statistics) derive first from computer science, which encompasses everything related to the design and use of computers. Within the all-encompassing space of computer science is the next broad field: data science. Narrower than computer science, data science comprises methods and systems to extract knowledge and insights from data through the use of computers.

Figure 3: The lineage of machine learning represented by a row of Russian matryoshka dolls

Popping out from computer science and data science as the third matryoshka doll is artificial intelligence. Artificial intelligence, or AI, encompasses the ability of machines to perform intelligent and cognitive tasks. Comparable to the way the Industrial Revolution gave birth to an era of machines that could simulate physical tasks, AI is driving the development of machines capable of simulating cognitive abilities.

Although still broad, AI is dramatically more focused than computer science and data science, and it contains numerous subfields that are popular today. These subfields include search and planning, reasoning and knowledge representation, perception, natural language processing (NLP), and of course, machine learning. Machine learning bleeds into other fields of AI, including NLP and perception, through the shared use of self-learning algorithms.

Figure 4: Visual representation of the relationship between data-related fields

For students with an interest in AI, machine learning provides an excellent starting point in that it offers a narrower and more practical lens of study compared to the conceptual ambiguity of AI. Algorithms found in machine learning can also be applied across other disciplines, including perception and natural language processing. In addition, a Master’s degree is adequate to develop a certain level of expertise in machine learning, but you may need a PhD to make any true progress in AI.

As mentioned, machine learning also overlaps with data mining, a sister discipline that focuses on discovering and unearthing patterns in large datasets. Popular algorithms, such as k-means clustering, association analysis, and regression analysis, are applied in both data mining and machine learning to analyze data. But where machine learning focuses on the incremental process of self-learning and data modeling to form predictions about the future, data mining narrows in on cleaning large datasets to glean valuable insight from the past.

The difference between data mining and machine learning can be explained through an analogy of two teams of archaeologists. The first team is made up of archaeologists who focus their efforts on removing debris that lies in the way of valuable items, hiding them from direct sight. Their primary goals are to excavate the area, find new valuable discoveries, and then pack up their equipment and move on. A day later, they will fly to another exotic destination to start a new project with no relationship to the site they excavated the day before.

The second team is also in the business of excavating historical sites, but these archaeologists use a different methodology. They deliberately refrain from excavating the main pit for several weeks. In that time, they visit other relevant archaeological sites in the area and examine how each site was excavated. After returning to the site of their own project, they apply this knowledge to excavate smaller pits surrounding the main pit.

The archaeologists then analyze the results. After reflecting on their experience excavating one pit, they optimize their efforts to excavate the next. This includes predicting the amount of time it takes to excavate a pit, understanding variance and patterns found in the local terrain, and developing new strategies to reduce error and improve the accuracy of their work. From this experience, they are able to optimize their approach to form a strategic model to excavate the main pit.

If it is not already clear, the first team subscribes to data mining and the second team to machine learning. At a micro-level, data mining and machine learning appear similar, and they do use many of the same tools. Both teams make a living excavating historical sites to discover valuable items. But in practice, their methodology is different. The machine learning team focuses on dividing their dataset into training data and test data to create a model, and improving future predictions based on previous experience. Meanwhile, the data mining team concentrates on excavating the target area as effectively as possible, without the use of a self-learning model, before moving on to the next cleanup job.

ML CATEGORIES

Machine learning incorporates several hundred statistics-based algorithms, and choosing the right algorithm or combination of algorithms for the job is a constant challenge for anyone working in this field. But before we examine specific algorithms, it is important to understand the three overarching categories of machine learning. These three categories are supervised, unsupervised, and reinforcement.

Supervised Learning

As the first branch of machine learning, supervised learning concentrates on learning patterns by connecting variables to known outcomes, working with labeled datasets.

Supervised learning works by feeding the machine sample data with various features (represented as “X”) and the correct value output of the data (represented as “y”). The fact that the output and feature values are known qualifies the dataset as “labeled.” The algorithm then deciphers patterns that exist in the data and creates a model that can reproduce the same underlying rules with new data.

For instance, to predict the market rate for the purchase of a used car, a supervised algorithm can formulate predictions by analyzing the relationship between car attributes (including the year of make, car brand, mileage, etc.) and the selling price of other cars sold based on historical data. Given that the supervised algorithm knows the final price of other cars sold, it can then work backward to determine the relationship between the characteristics of the car and its value.

Figure 1: Car value prediction model

After the machine deciphers the rules and patterns of the data, it creates what is known as a model: an algorithmic equation for producing an outcome with new data based on the rules derived from the training data. Once the model is prepared, it can be applied to new data and tested for accuracy. After the model has passed both the training and test data stages, it is ready to be applied and used in the real world.
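To illustrate what such an equation can look like, the sketch below fits a straight line to a handful of made-up used-car records using ordinary least squares with a single feature. The mileage and price figures are invented for illustration, and a realistic model would draw on more attributes than mileage alone.

```python
# Hypothetical labeled data: X = mileage in thousands of km, y = sale price.
mileage = [20, 40, 60, 80, 100]
price = [18000, 16000, 14000, 12000, 10000]

def fit_line(xs, ys):
    """Ordinary least squares for a single feature: y = intercept + slope * x."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return intercept, slope

intercept, slope = fit_line(mileage, price)

def predict_price(km_thousands):
    """Apply the learned equation to a car the model has not seen."""
    return intercept + slope * km_thousands

print(predict_price(50))
```

Here the model is nothing more than the pair `intercept` and `slope`. Once both have been derived from the labeled data, predicting the price of a new car means plugging its mileage into the equation.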

In Chapter 13, we will create a model for predicting house values where y is the actual house price and X are the variables that impact y, such as land size, location, and the number of rooms. Through supervised learning, we will create a rule to predict y (house value) based on the given values of various variables (X).

Examples of supervised learning algorithms include regression analysis, decision trees, k-nearest neighbors, neural networks, and support vector machines. Each of these techniques will be introduced later in the book.

Unsupervised Learning

In the case of unsupervised learning, not all variables and data patterns are classified. Instead, the machine must uncover hidden patterns and create labels through the use of unsupervised learning algorithms. The k-means clustering algorithm is a popular example of unsupervised learning. This simple algorithm groups data points that are found to possess similar features, as shown in Figure 1.

Figure 1: Example of k-means clustering, a popular unsupervised learning technique

If you group data points based on the purchasing behavior of SME (Small and Medium-sized Enterprises) and large enterprise customers, for example, you are likely to see two clusters emerge. This is because SMEs and large enterprises tend to have disparate buying habits. When it comes to purchasing cloud infrastructure, for instance, basic cloud hosting resources and a Content Delivery Network (CDN) may prove sufficient for most SME customers. Large enterprise customers, though, are more likely to purchase a wider array of cloud products and entire solutions that include advanced security and networking products like WAF (Web Application Firewall), a dedicated private connection, and VPC (Virtual Private Cloud). By analyzing customer purchasing habits, unsupervised learning is capable of identifying these two groups of customers without specific labels that classify the company as small, medium, or large.
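A bare-bones version of k-means shows how two such groups can emerge without labels. The customer figures below are invented for illustration (number of cloud products purchased and monthly spend in thousands of dollars), and the function is a simplified sketch that uses a seeded random start, not a production implementation.

```python
import math
import random

def k_means(points, k, iterations=20, seed=0):
    """Simplified k-means: assign points to the nearest centroid, then re-average."""
    centroids = random.Random(seed).sample(points, k)  # random starting centroids
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for point in points:
            nearest = min(range(k), key=lambda i: math.dist(point, centroids[i]))
            clusters[nearest].append(point)
        # Move each centroid to the mean of its cluster (keep it if the cluster is empty).
        centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return clusters, centroids

# Hypothetical unlabeled customers: (products purchased, monthly spend in $1,000s).
customers = [(1, 1), (2, 1), (1, 2), (2, 2), (8, 9), (9, 8), (8, 8), (9, 9)]
clusters, centroids = k_means(customers, k=2)
for cluster in clusters:
    print(cluster)
```

The algorithm is never told which customers are SMEs and which are large enterprises. The two groups fall out of the distances between the data points alone.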

The advantage of unsupervised learning is that it enables you to discover patterns in the data that you were unaware existed, such as the presence of two major customer types. Clustering techniques such as k-means clustering can also provide the springboard for conducting further analysis after discrete groups have been discovered.

In industry, unsupervised learning is particularly powerful in fraud detection, where the most dangerous attacks are often those yet to be classified. One real-world example is DataVisor, which essentially built its business model on unsupervised learning.

Founded in 2013 in California, DataVisor protects customers from fraudulent online activities, including spam, fake reviews, fake app installs, and fraudulent transactions. Whereas traditional fraud protection services draw on supervised learning models and rule engines, DataVisor uses unsupervised learning, which enables it to detect unclassified categories of attacks in their early stages.

On their website, DataVisor explains that “to detect attacks, existing solutions rely on human experience to create rules or labeled training data to tune models. This means they are unable to detect new attacks that haven’t already been identified by humans or labeled in training data.”[5]

This means that traditional solutions analyze the chain of activity for a particular attack and then create rules to predict a repeat attack. Under this scenario, the dependent variable (y) is the event of an attack and the independent variables (X) are the common predictor variables of an attack. Examples of independent variables could be:

A sudden large order from an unknown user. For example, established customers generally spend less than $100 per order, but a new user spends $8,000 in one order immediately upon registering their account.

A sudden surge of user ratings. For example, as a typical author and bookseller on Amazon, it’s uncommon for my first published work to receive more than one book review within the space of one to two days. In general, approximately 1 in 200 Amazon readers leave a book review and most books go weeks or months without a review. However, I commonly see competitors in this category (data science) attracting 20-50 reviews in one day! (Unsurprisingly, I also see Amazon removing these suspicious reviews weeks or months later.)

Identical or similar user reviews from different users. Following the same Amazon example, I often see user reviews of my book appear on other books several months later (sometimes with a reference to my name as the author still included in the review!). Again, Amazon eventually removes these fake reviews and suspends these accounts for breaking its terms of service.

Suspicious shipping address. For example, for small businesses that routinely ship products to local customers, an order from a distant location (where they don’t advertise their products) can in rare cases be an indicator of fraudulent or malicious activity.

Standalone activities such as a sudden large order or a distant shipping address may provide too little information to predict sophisticated cybercriminal activity and are more likely to lead to many false positives. But a model that monitors combinations of independent variables, such as a sudden large purchase order from the other side of the globe or a landslide of book reviews that reuse existing content, will generally lead to more accurate predictions. A supervised learning-based model could deconstruct and classify what these common independent variables are and design a detection system to identify and prevent repeat offenses.

Sophisticated cybercriminals, though, learn to evade classification-based rule engines by modifying their tactics. In addition, leading up to an attack, attackers often register and operate single or multiple accounts and incubate these accounts with activities that mimic legitimate users. They then utilize their established account history to evade detection systems, which are trigger-heavy against recently registered accounts. Supervised learning-based solutions struggle to detect sleeper cells until the actual damage has been done, especially with regard to new categories of attacks.

DataVisor and other anti-fraud solution providers therefore leverage unsupervised learning to address the limitations of supervised learning by analyzing patterns across hundreds of millions of accounts and identifyi
