Portland State University
PDXScholar
Business Faculty Publications and Presentations
The School of Business
8-2021
Unboxing the Algorithm: A Process Model of an Algorithmic Solution
Marta Stelmaszak Rosa
Portland State University, stmarta@
Follow this and additional works at: /busadmin_fac
Part of the Business Commons
Let us know how access to this document benefits you.
Citation Details
Stelmaszak, M. (2021) Unboxing the Algorithm: A Process Model of an Algorithmic Solution. Americas Conference on Information Systems 2021, 9-13 August 2021.
This Conference Proceeding is brought to you for free and open access. It has been accepted for inclusion in Business Faculty Publications and Presentations by an authorized administrator of PDXScholar. Please contact us if we can make this document more accessible: pdxscholar@.
Unboxing the Algorithm: A Process Model
Twenty-Seventh Americas Conference on Information Systems, Montreal, 2021
Unboxing the Algorithm: A Process Model of an Algorithmic Solution
Completed Research
Marta Stelmaszak
Portland State University
stmarta@
Abstract
With the explosion of data, analytics and artificial intelligence, information systems research focuses on the use, management and consequences of algorithms. Thus far, only a handful of papers offer insights into how algorithmic solutions work. To address this gap, we studied the code making up 45 public data science Jupyter notebooks containing algorithmic solutions developed to predict customer churn in a credit card dataset on a data science platform K. We synthesized a process model of an algorithmic solution: preparing the environment, reading in data, cleaning data, exploratory data analysis, pre-processing the dataset, building and training the model, and testing and validating the model. Unboxing the algorithm and investigating the process offers a more fine-tuned understanding and language to better conceptualize the use, management and consequences of algorithmic solutions. It also provides a scaffolding for research into the development of algorithmic solutions, highlighting their variability, experimentation and data scientist decisions.
Keywords
Algorithms, algorithmic solutions, data science, information systems development, process model
Introduction
Algorithms have, without a doubt, attracted research attention across a number of fields, from media studies, through sociology, to computer science. Management and information systems (IS) researchers study algorithmic solutions primarily in terms of their use, management and consequences for individuals in work contexts, in organizations, and in the wider society (Galliers et al. 2017; Markus 2017; Newell and Marabelli 2015). However, this research can greatly benefit from an improved understanding of how algorithmic solutions are developed, and thus there have been calls to focus more on theorizing their development (van den Broek et al. 2021). Thus far, only a handful of papers in IS offer insights into how algorithmic solutions work, which is an essential link between understanding their use and their development. Against this background, this study aims to answer a simple question: what is the process of making an algorithmic solution work?
To uncover the building blocks and propose a process model, we studied 45 public data science Jupyter notebooks containing algorithmic solutions developed to predict customer churn in a credit card dataset on a popular data science and machine learning platform K (Dissanayake et al. 2015; Mangal and Kumar 2016). Referring to a common problem faced by many companies and often tackled by algorithmic solutions, the credit card dataset attracted over 200 notebooks with code and comments describing attempts to best predict customer churn. We selected 45 of the best-regarded notebooks, downloaded them and coded them using a grounded theory approach (Charmaz 2006; Glaser and Strauss 1967; Urquhart and Fernandez 2006). We then grouped the themes to identify the elements that made up each proposed algorithmic solution and distilled a process model of how they were developed.
Based on our findings, we propose a process model of making an algorithmic solution work encompassing: preparing the environment, reading in data, cleaning data, exploratory data analysis, pre-processing the dataset, building and training the model, and testing and validating the model. We contribute to information systems and management literature by developing a process model of an algorithmic solution that not only offers a more fine-tuned language to investigate the use, management and consequences of algorithmic solutions on individuals, organizations and societies, but also enables further study of the design and development of such solutions from a socio-technical perspective.
Making Algorithmic Solutions Work
Recent technological (processing capabilities, big data, machine learning), societal (use of smartphones, attitudes towards data, social media) and organizational (platformization, networks) developments contributed to the growth in use of various algorithms (Baptista et al. 2017; Berente et al. 2019). IS research in the socio-technical tradition has thus focused on the study of the use, management and consequences of algorithms on individual, organizational and societal levels (Galliers et al. 2017; Markus 2017; Newell and Marabelli 2015). However, far less attention has been paid so far to the understanding of how algorithms and algorithmic solutions based on them are developed (van den Broek et al. 2021). First papers begin to uncover how data scientists and subject matter experts need to work together in the development process (van den Broek et al. 2021), how the practices of data scientists in the banking industry rely on both subjectivity and objectivity in the production of information (Joshi 2020), and how data scientists engage in the practices of knowledge hiding (Ghasemaghaei and Turel 2021). In other words, while focusing predominantly on what happens after the algorithms are put to work, current literature offers little insight into how algorithms are made to work, that is, what steps need to be in place for an algorithmic solution to work effectively. Such understanding is essential because the process of making an algorithmic solution work, as we show below, determines what kinds of insights and predictions it offers, thus influencing decisions.
Most researchers who in broad strokes describe what goes into making algorithmic solutions work in their papers refer to certain aspects with varying consistency: the fact that algorithms process data (Balasubramanian and Ye 2021; van den Broek et al. 2020; Galliers et al. 2017; Gregory et al. 2020; Grønsund and Aanestad 2020; Lebovitz 2020; Lycett 2013; Newell and Marabelli 2015; Pachidi et al. 2021; Shrestha et al. 2019) in an automated or preprogrammed way (Galliers et al. 2017; Grønsund and Aanestad 2020; Günther et al. 2017; Shrestha et al. 2019) to learn models (Balasubramanian and Ye 2021; Li et al. 2019; Shrestha et al. 2019) leading to new insights (Günther et al. 2017; Günther and Joshi 2020; Pachidi et al. 2021), decisions (Balasubramanian and Ye 2021; van den Broek et al. 2020; Galliers et al. 2017; Newell and Marabelli 2015) or predictions (Lebovitz 2020; Li et al. 2019; Shrestha et al. 2019). This offers a punctuated and incomplete picture of the elements involved in developing algorithms that can be subsequently used in business settings.
A handful of papers offer insights into the essential elements of what happens inside algorithmic solutions. Pachidi et al. (2021) provide a detailed description, covering various elements that are at play in a predictive model:
“The model combined a number of internal and external data sources, such as time series of customer transactions, Nielsen market data, Gartner ICT spending predictions, financial data, and usage data. The output of the model was represented in a spreadsheet format that contained a list of all medium-sized customers and predictions regarding potential sales opportunities. The CLM model allocated customers to different customer segments (A, B, C, D) based on their historical and predicted sales with TelCo. For each TelCo product line (e.g., business telephone systems, mobile phone packages, fixed line setups), the CLM model assigned a position in the customer sales lifecycle (inform, specify, sell, maintain), each of which entailed a different contact strategy. Thus, the model output consisted of a ranking of opportunities, with a prioritized action list for account managers.”
Grønsund and Aanestad (2020, p. 7) are similarly detailed:
“The algorithm-supported analysis system was designed to automate both data acquisition and the processing of data for subsequent analysis. Acquisition of data was automated by the system pulling streams of data on ship activity from the satellite-AIS data provider, along with additional data such as vessel descriptions and geospatial data, into a Hadoop-based data warehouse repository. Here the data were extracted and consolidated, then classified using rule-based NLP (Natural Language Processing) classification, and finally presented in BI tools that allowed human interpretation of the output.”
While the descriptions both point to obtaining, compiling and processing of data, further analysis and classification, they reveal differences in how the solutions work, and do not offer a complete picture. Taking a more general view, Orlikowski and Scott define an algorithm as “a set of step-by-step instructions to achieve a desired result in a finite number of moves” (2015, p. 210). Acknowledging this more traditional definition of an algorithm (a program containing a fixed sequence of instructions executed until a solution is reached), rooted in computer science (Hopcroft and Ullman 1983), Faraj et al. (2018) ‘update’ and broaden the scope of this definition by conceptualizing learning algorithms as “an emergent family of technologies that build on machine learning, computation, and statistical techniques, as well as rely on large data sets to generate responses, classifications, or dynamic predictions that resemble those of a knowledge worker” (p. 62). A similar definition of artificial intelligence algorithms is put forward by Tarafdar et al. (2020, p. 1): “We define AI algorithms as those that extract insights and knowledge from big data sources; computational and statistical techniques such as machine learning (ML) and deep learning embedded in such algorithms, aim to ‘teach’ computers the ability to detect patterns in big data”.
While these definitions offer a good starting point and an initial overview of the elements in the process of making algorithmic solutions work, they are partial and divergent in their focus. These differences in the definition, understanding, scope and scale of the steps and elements required to make algorithms work hamper the development of the understanding of the use, management and consequences of algorithms, and at the same time make uncovering their development more difficult. For IS research to systematically progress in this area it is thus fundamental to ask: what is the process of making an algorithmic solution work?
Research Setting and Methods
To answer this question, we studied 45 public data science notebooks containing algorithmic solutions developed to predict customer churn in a credit card dataset on a popular data science and machine learning platform K. Below we describe the research setting, as well as data collection and analysis methods.
Research Setting
K is a popular platform for data scientists and machine learning engineers where they can develop and improve their skills, as well as participate in corporate-sponsored competitions by addressing a variety of problems related to datasets published on the platform. K, part of Alphabet Inc, allows users to upload datasets, set specific tasks for them and create interactive Jupyter notebooks where users can develop their algorithmic solutions. K was selected as a setting because of its public availability and openness in sharing notebooks, which allows unprecedented access to the inner workings of algorithmic solutions. Others have used K for research purposes as well (Dissanayake et al. 2015; Mangal and Kumar 2016).
The dataset we selected for this study is a well-regarded and popular dataset with high usability. It contains the details of around 10,000 credit card customers of a bank, whereby a portion of customers churned. The goal is to identify, based on 18 variables such as age, salary, credit card limit and similar, what makes a customer churn (give up a credit card) to be able to predict customers at risk of churning in the future, as well as to identify the variables that are most predictive of the risk of churn (“Kaggle.Com” 2021). When the dataset was investigated for the purposes of this research project, there were around 210 notebooks submitted that contained algorithmic solutions pertaining to this dataset, with constant daily activity in existing notebooks and new notebooks being added.
We selected an open and public dataset rather than a competition because the majority of notebooks submitted for competitions are private and thus visible only to sponsor companies, and competitions are usually very specific and limit the number of potential algorithmic solutions applied. In contrast, public notebooks allow good access to a variety of notebooks containing fairly unrestricted solutions and allow for much more experimentation on the part of users. From the many datasets available on K, we selected the credit card customers dataset because it is related to a common problem that many companies and businesses face, and it is a problem that is often tackled by developing algorithmic solutions; thus it is a good representative sample of what researchers in information systems and management would consider of interest.
Data collection
In January and February 2021, we collected 57 Jupyter notebooks that were created using the credit card customer dataset in Python as the set programming language. The notebooks were arranged from the ‘hottest’ (a measure used on K to define notebooks with most activity, edits and highest votes by the community, Kaggle.Com 2021) to the least hot, and thus those that we collected were considered among the ‘hottest’ at the time. We decided to select the ‘hottest’ notebooks as these were assessed as high quality by the community, and thus were likely to contain well-developed algorithmic solutions. We discarded notebooks in R to eliminate differences in programming languages, and notebooks that contained only partial solutions, for example those that only analyzed data without building actual models. We ended up with 45 suitable notebooks. Using a feature available on K, we downloaded all of the selected notebooks and converted them to PDF documents to analyze them in NVivo.
Data analysis
Since our study is rooted in grounded theory (Charmaz 2006; Glaser and Strauss 1967; Urquhart and Fernandez 2006), we proceeded by inductively coding the notebooks to identify the different elements of code they contained by what these elements of code did. We coded each segment of code in each notebook to identify its function. Verbal descriptions of data scientists sometimes provided additional information as to the role of each code segment, so these were coded too. However, the descriptions were mostly useful in the second stage of data analysis, where we grouped the codes we obtained into higher-level elements of the process, as they explained the flow of the process. For example, in the notebooks data scientists would sometimes indicate they were proceeding to exploratory data analysis, and we used these comments to group elements of code identified under the element ‘Exploratory Data Analysis’.
Because of the inductive nature of our study, we oscillated between data analysis and further data collection. After coding the first 30 notebooks, we began to group the codes to start building the model. We then proceeded with coding and analyzing notebooks one by one to supplement and verify the model that was emerging from our analysis. When we reached notebook number 35, the subsequent 10 notebooks did not add any new codes to the codebook and at this point we decided to stop coding and analyzing the notebooks as we reached the point of saturation.
Unboxing the Algorithm
In this section, we present the elements of the process of making an algorithmic solution work that we identified in the data. Each element is discussed in turn by showing what kinds of operations were performed in every element.
Preparing the Environment
Notebooks begin with setting the environment in which the development of the algorithmic solution takes place: programming language, acceleration and connection to the internet. The notebooks we observed were all set up in a Python 3 environment, which “comes with many helpful analytics libraries installed” (Notebook 002) and allows users to write up to 20GB to the working directory. Notebooks give the possibility to turn on an accelerator, such as a GPU, for faster processing, and to connect to the internet for access to external files. In some notebooks, data scientists use verbal comments to identify and restate the problem.
After this initial setup, various necessary libraries are imported, that is, pre-packaged functions designed for specific purposes that can be deployed by data scientists without the need to code such functions from scratch. Invariably, the notebooks featured “numpy” (Notebook 005), a Python library for linear algebra, and “pandas” (Notebook 007), allowing for data processing and, for example, reading in CSV files, among others. These two libraries are essential to develop the algorithmic solution. Other libraries imported include data visualization packages, such as “seaborn” or “matplotlib” (Notebook 029), which are fairly standard and popular libraries for this purpose. In some notebooks, all required packages are imported in the beginning of the notebook, including “sklearn” and “keras” (Notebook 014) that are used for building models, while other notebooks import additional libraries as and when needed. Libraries are imported with simple code: “import numpy as np” (Notebook 001), for example. Importing libraries is a standard procedure and there are no substantial comments regarding this step. A variety of widely used libraries exists for developing algorithmic solutions, and they encapsulate and abstract out the complexity behind tasks such as training a specific model, as explained below.
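As a minimal sketch of this step, the snippet below checks which of the libraries named above can be imported in the current environment before a notebook relies on them. The helper name `available_libraries` is hypothetical and introduced only for illustration; the notebooks themselves simply use statements such as `import numpy as np`.

```python
import importlib.util

# Import names of the libraries mentioned in the text
# ("sklearn" is the import name of scikit-learn).
LIBRARIES = ["numpy", "pandas", "seaborn", "matplotlib", "sklearn", "keras"]


def available_libraries(names):
    """Map each top-level package name to True if it can be imported here."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


status = available_libraries(LIBRARIES)
for name, found in status.items():
    print(f"{name:<12}{'available' if found else 'missing'}")
```

Which libraries are reported as available depends entirely on the environment in which the sketch is run.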
Reading in Data
The next element in the process in an algorithmic solution is to read in the required data. The first step here, quite logically, includes loading data in. Because the dataset that the notebooks use is uploaded to Kaggle, it can be attached to each notebook with a simple search within the interface, and then imported by executing the “read_csv” command from the “pandas” library (Notebook 001).
Inspecting the data follows, usually through the function “head”, displaying the first five (by default) rows of the dataset and corresponding columns with column headers, and sometimes the function “shape”, displaying the dimensions of the dataset (number of rows and number of columns), as well as the function “columns”, giving the names of columns in the dataset. In just one notebook, we observed explicitly looking for duplicate entries in the dataset. Commands to perform these functions are pre-packaged and take the forms of “df.head()”, “df.shape” or “df.columns” (Notebook 003). This stage of the process also involves checking data types present in the dataset, performed by using functions “info” or “dtypes” that indicate which columns contain integer (whole numbers), float (fractions with decimal points) or object (text or mixed numeric and non-numeric values) data types. This is important as most algorithmic solutions work only with numerical values. As part of reading in data, simple descriptive statistics of the data are obtained through the function “describe”, resulting in displaying the number of rows, mean, standard deviation, minimum value, quartiles, and maximum value for each column.
Conducting these steps is essential to load the dataset and obtain basic information about the data needed to confirm that the data is loaded correctly, contains the expected columns and rows, and to gain initial familiarity with the dataset.
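The inspection steps described in this section can be sketched as follows. This is an illustration under stated assumptions rather than code from any notebook: the CSV content is an invented stand-in read from memory, and only the column name `CLIENTNUM` is taken from the text.

```python
import io

import pandas as pd

# Hypothetical stand-in for the uploaded CSV file. Only CLIENTNUM is a column
# name taken from the text; the other columns and all values are invented.
csv_text = """CLIENTNUM,Age,Gender,Credit_Limit,Churned
1001,45,M,12691.0,0
1002,49,F,8256.0,0
1003,51,M,3418.0,1
1004,40,F,3313.0,0
1005,40,M,4716.0,0
1006,44,M,4010.0,1
"""

df = pd.read_csv(io.StringIO(csv_text))

print(df.head())              # first five rows by default
print(df.shape)               # (number of rows, number of columns)
print(list(df.columns))       # column names
print(df.dtypes)              # data type of each column
print(df.duplicated().sum())  # number of fully duplicated rows
print(df.describe())          # count, mean, std, min, quartiles, max
```

In a notebook the argument to `read_csv` would be the path of the attached dataset file instead of an in-memory string.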
Cleaning Data
After reading in the dataset, data is cleaned to prepare it for further processing. This is essential before any analysis can take place. Steps at this stage tend to be taken in various orders across the notebooks, and are reported here in no particular order.
Missing values are identified and dealt with: that is, NULL values in the dataset have to be resolved before any analysis can take place. This is done by using the function “isnull”, listing all columns with the number of missing values (Notebook 001). The customer churn dataset contained no null values, so in this case there was no need to deploy solutions to solve this problem. Missing values have to be resolved as the majority of algorithms cannot deal with datasets containing missing values. One of the ways to solve this problem that is presented in the notebooks is the method of imputation, that is, replacing the missing or null value with an existing value from the dataset. In the solution proposed in the notebook this is done based on the nearest neighbor of the missing value, but since no missing values were detected, the solution is not implemented.
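The following sketch illustrates both steps on invented toy data: counting missing values with “isnull”, then filling a gap with the value from the nearest fully observed row. The function `impute_nearest` is a hypothetical, simplified single-neighbor version written for illustration; the text does not state which function the notebook proposed, and library implementations typically average over several neighbors.

```python
import numpy as np
import pandas as pd

# Invented toy data with one missing value; the churn dataset itself had none.
df = pd.DataFrame({
    "Age": [30.0, 31.0, 60.0, 62.0],
    "Months_On_Book": [24.0, np.nan, 50.0, 52.0],
})

missing_per_column = df.isnull().sum()  # number of missing values per column
print(missing_per_column)


def impute_nearest(frame):
    """Fill each gap with the value from the closest fully observed row."""
    filled = frame.copy()
    complete = frame.dropna()
    for idx, row in frame[frame.isnull().any(axis=1)].iterrows():
        known_cols = row.index[row.notnull().to_numpy()]
        missing_cols = row.index[row.isnull().to_numpy()]
        # Euclidean distance over the columns this row actually has.
        diffs = complete[known_cols].sub(row[known_cols], axis=1)
        distances = (diffs ** 2).sum(axis=1) ** 0.5
        donor = complete.loc[distances.idxmin()]
        filled.loc[idx, missing_cols] = donor[missing_cols].to_numpy()
    return filled


imputed = impute_nearest(df)
print(imputed)
```

Here the row with age 31 is closest to the row with age 30, so it receives that row's value of 24 months.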
In the notebooks, we found that columns are sometimes renamed if their names are not intuitive enough or simply too long. Certain columns containing variables that are not needed for the analysis are removed. For example, the customer churn dataset contains two columns with Naïve Bayes Classifier by default, and the author of the dataset suggests removing these columns before proceeding with analysis. At different points in data cleaning, exploratory data analysis or pre-processing the dataset, various columns are also removed if they are not contributing to the model (for example, removing customer ID: “data = data.drop(columns=[‘CLIENTNUM’])”, Notebook 015). In some notebooks, outliers are removed from the data using a common statistical measure of z-score, indicating how far from the mean a given data point is. In the only notebook we observed that removed outliers, this resulted in removing 810 rows from the dataset.
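A minimal sketch of dropping an identifier column and filtering outliers by z-score is given below. The data are invented, the column name `Credit_Limit` is illustrative, and the cut-off of 3 standard deviations is a common convention assumed here; the text does not say which threshold the notebook used.

```python
import numpy as np
import pandas as pd

# Invented values; CLIENTNUM is the identifier column named in the text.
data = pd.DataFrame({
    "CLIENTNUM": range(1, 13),
    "Credit_Limit": [5.0, 6.0, 5.5, 6.5, 5.0, 6.0, 5.5, 6.5, 5.0, 6.0, 5.5, 60.0],
})

# Remove the customer ID, which carries no predictive information.
data = data.drop(columns=["CLIENTNUM"])

# z-score: distance from the column mean in units of standard deviation.
column = data["Credit_Limit"]
z_scores = (column - column.mean()) / column.std(ddof=0)

# Keep rows within 3 standard deviations of the mean (assumed threshold).
cleaned = data[np.abs(z_scores) < 3]
print(len(data) - len(cleaned), "row(s) removed as outliers")
```

Because extreme values inflate the standard deviation themselves, a z-score filter only flags points that are far from the bulk of a reasonably large sample.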
All notebooks we studied transform data types as part of cleaning data. This step, sometimes referred to as feature engineering, is required when the dataset contains object data types, which are categorical variables typical in many datasets, such as marital status, level of education or gender. These data types have to be transformed into numerical variables in order to be analyzed. This is conducted by using pre-existing functions to encode these variables as integers (e.g. primary education as 1, secondary as 2, tertiary as 3) or using popular one-hot encoding where there is no natural ordinal relationship between categories and dummy variables are created (e.g. male is 0, female is 1). Cleaned data is an essential element of any algorithmic solution, as without the steps taken in this element, data either results in erroneous analysis and model training, or simply cannot be used to train models.
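The two encoding strategies can be sketched with pandas as follows. The rows and column names are invented for illustration, and the category labels follow the examples given in the text; the notebooks may have used other pre-existing functions for the same purpose.

```python
import pandas as pd

# Invented rows; the education levels follow the example given in the text.
df = pd.DataFrame({
    "Education": ["primary", "tertiary", "secondary", "primary"],
    "Marital_Status": ["single", "married", "divorced", "married"],
})

# Ordinal encoding: the categories have a natural order, so map them to integers.
education_order = {"primary": 1, "secondary": 2, "tertiary": 3}
df["Education"] = df["Education"].map(education_order)

# One-hot encoding: no natural order, so create one 0/1 dummy column per category.
encoded = pd.get_dummies(df, columns=["Marital_Status"], dtype=int)
print(encoded)
```

Integer codes imply an ordering that a model may exploit, which is why one-hot encoding is the usual choice for unordered categories.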
Exploratory Data Analysis
The next step in the algorithmic solution process is exploratory data analysis, whereby actions are taken to learn about the relationship between the dependent variable of interest (here: customer churn or attrition) and independent variables that may help build the predictive model. This step is essential to uncover what model will be the most appropriate for the dataset and which variables can be potentially of interest.
The first step is to identify the dependent variable (a trivial matter in the given dataset), and to analyze independent variables. This is very often performed by visualizing them independently, in relation to each other, or in relation to the dependent variable. In most cases, such visualizations were implemented using functions from visualization libraries, such as “seaborn”, “matplotlib” or rarely “plotly”. Visualizing data is the part that takes up the most code in nearly all notebooks we analyzed. Various visualizations are produced, such as box plots, pie charts, histograms, in order to help identify which variables may be useful in building the model. Visualizations are often accompanied by comments such as “Females are slightly more likely to churn with 17% compared to males with 15%, we’ll convert this 9 feature to 1-0” (Notebook 013). Some notebooks contain more comprehensive comments on the learnings from visualizations.
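The kind of comparison behind such a comment can be sketched numerically, without a plotting library. The data below are invented and do not reproduce the percentages quoted above, and the column names `Gender` and `Churned` are assumptions; in the notebooks the same proportions would typically be shown as a bar or pie chart.

```python
import pandas as pd

# Invented toy data: 1 marks a churned customer, 0 a retained one.
df = pd.DataFrame({
    "Gender": ["F", "F", "F", "F", "M", "M", "M", "M", "M", "M"],
    "Churned": [1, 0, 0, 0, 0, 0, 1, 0, 0, 0],
})

# The mean of a 0/1 column within each group is the share of churned customers.
churn_rate = df.groupby("Gender")["Churned"].mean()
print(churn_rate)

# The same information as a table of row-wise proportions.
table = pd.crosstab(df["Gender"], df["Churned"], normalize="index")
print(table)
```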
The next step in exploratory data analysis is to identify correlations between variables. Identifying correlations is an important step in exploratory data analysis, as from this decisions can be made as to which features to include in pre-processing the dataset for model building, as described below. For example, Notebook 022 based on the identification of correlations decides to “# Drop some features which have less than 0.01 correlation and greater than -0.01 correlation”. Exploratory data analysis is a required step of building an algorithmic solution as it provides the necessary insight into the dataset for the purposes of model building. It is at this stage that the importance of variables with respect to the target variable is assessed.
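A decision of the kind quoted from Notebook 022 could be implemented roughly as below. This is a sketch on invented data with illustrative column names, not the notebook's own code: it computes the Pearson correlation of each feature with the target and drops features whose correlation lies between -0.01 and 0.01.

```python
import pandas as pd

# Invented toy data: "Noise" is constructed to be uncorrelated with the target.
df = pd.DataFrame({
    "Transactions": [10, 20, 30, 40, 50, 60],
    "Noise": [1, 2, 3, 1, 2, 3],
    "Churned": [1, 1, 1, 0, 0, 0],
})

# Pearson correlation of every feature with the target variable.
corr_with_target = df.corr()["Churned"].drop("Churned")
print(corr_with_target)

# Features with correlation below 0.01 in absolute value are treated as uninformative.
weak_features = corr_with_target[corr_with_target.abs() < 0.01].index.tolist()
reduced = df.drop(columns=weak_features)
print("dropped:", weak_features)
```

A linear correlation near zero does not rule out a non-linear relationship, so such a filter is a heuristic rather than a guarantee that the dropped feature is useless.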
Pre-processing the Dataset
The following step in the process is to pre-process the dataset, which involves preparing the dataset according to the requirements of model building. First, data need to be scaled, which may involve actual scaling, that is, changing the range of variables to a common range, e.g. between 0 and 1, or normalizing the variables following a normal distribution. Scaling is performed to ensure that no variable is interpreted as more predictive than it actually is just because its numerical values are on a different scale from other variables. Scaling is routinely performed using standard pre-packaged functions, such as “StandardScaler” from the popular “sklearn” library (Notebook 026).
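Both variants of scaling can be written out by hand to show what the pre-packaged functions compute. The sketch below uses pandas only, with invented values and illustrative column names; it is not the notebooks' code, which relies on “StandardScaler”. The standardization uses the population standard deviation (`ddof=0`), which is what “StandardScaler” uses as well.

```python
import pandas as pd

# Invented values on very different numerical scales.
df = pd.DataFrame({
    "Age": [20.0, 30.0, 40.0, 50.0, 60.0],
    "Credit_Limit": [1000.0, 2000.0, 3000.0, 4000.0, 9000.0],
})

# Min-max scaling: map every column onto the common range [0, 1].
min_max = (df - df.min()) / (df.max() - df.min())

# Standardization: zero mean and unit variance per column.
standardized = (df - df.mean()) / df.std(ddof=0)

print(min_max)
print(standardized)
```

After either transformation, age and credit limit contribute on comparable numerical scales, so neither dominates a distance-based or gradient-based model purely because of its units.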
The dataset should be resampled if it is not balanced, that is, if one category is present much more frequently than another. In the case of the dataset investigated, customers who attrited occurred much less frequently, as identified in exploratory data analysis, so resampling was required. This is usually done by oversampling from the group of attrited customers, most frequently using a pre-packaged function ‘SMOTE’ (Synthetic Minority Oversampling Technique) which creates additional data poi