




Journal Pre-proofs
Artificial Intelligence in Pharmaceutical Sciences
Mingkun Lu, Jiayi Yin, Qi Zhu, Gaole Lin, Minjie Mou, Fuyao Liu, Ziqi Pan, Nanxin You, Xichen Lian, Fengcheng Li, Hongning Zhang, Lingyan Zheng, Wei Zhang, Hanyu Zhang, Zihao Shen, Zhen Gu, Honglin Li, Feng Zhu
PII: S2095-8099(23)00164-9
DOI: 10.1016/j.eng.2023.01.014
Reference: ENG 1255
To appear in: Engineering
Received Date: 30 September 2022
Revised Date: 11 December 2022
Accepted Date: 6 January 2023
Please cite this article as: M. Lu, J. Yin, Q. Zhu, G. Lin, M. Mou, F. Liu, Z. Pan, N. You, X. Lian, F. Li, H. Zhang, L. Zheng, W. Zhang, H. Zhang, Z. Shen, Z. Gu, H. Li, F. Zhu, Artificial Intelligence in Pharmaceutical Sciences, Engineering (2023), doi: 10.1016/j.eng.2023.01.014
This is a PDF file of an article that has undergone enhancements after acceptance, such as the addition of a cover page and metadata, and formatting for readability, but it is not yet the definitive version of record. This version will undergo additional copyediting, typesetting and review before it is published in its final form, but we are providing this version to give early visibility of the article. Please note that, during the production process, errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.
© 2023 Published by Elsevier Ltd. on behalf of Chinese Academy of Engineering.
Research
Smart Process Manufacturing: Review
Artificial Intelligence in Pharmaceutical Sciences
Mingkun Lu a,c, Jiayi Yin a, Qi Zhu a, Gaole Lin a, Minjie Mou a, Fuyao Liu a, Ziqi Pan a, Nanxin You a, Xichen Lian a, Fengcheng Li a, Hongning Zhang a, Lingyan Zheng a,c, Wei Zhang a, Hanyu Zhang a, Zihao Shen b,d, Zhen Gu a, Honglin Li b,d,e,*, Feng Zhu a,c,*
a The Second Affiliated Hospital, Zhejiang University School of Medicine & College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, China
b Shanghai Key Laboratory of New Drug Design, East China University of Science and Technology, Shanghai 200237, China
c Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, Alibaba–Zhejiang University Joint Research Center of Future Digital Healthcare, Hangzhou 330110, China
d Innovation Center for AI and Drug Discovery, East China Normal University, Shanghai 200062, China
e Lingang Laboratory, Shanghai 200031, China
* Corresponding authors.
E-mail addresses: hlli@ (H. Li), zhufeng@ (F. Zhu).
ARTICLE INFO
Article history:
Received
Revised
Accepted
Available online
Keywords:
Artificial intelligence
Machine learning
Deep learning
Target identification
Target discovery
Drug design
Drug discovery
ABSTRACT
Drug discovery and development affects various aspects of human health and dramatically impacts the pharmaceutical market. However, investments in a new drug often go unrewarded due to the long and complex process of drug research and development (R&D). With the advancement of experimental technology and computer hardware, artificial intelligence (AI) has recently emerged as a leading tool in analyzing abundant and high-dimensional data. Explosive growth in the size of biomedical data provides advantages in applying AI in all stages of drug R&D. Driven by big data in biomedicine, AI has led to a revolution in drug R&D, due to its ability to discover new drugs more efficiently and at lower cost. This review begins with a brief overview of common AI models in the field of drug discovery; then, it summarizes and discusses in depth their specific applications in various stages of drug R&D, such as target discovery, drug discovery and design, preclinical research, automated drug synthesis, and influences in the pharmaceutical market. Finally, the major limitations of AI in drug R&D are fully discussed and possible solutions are proposed.
1. Introduction
In the past few decades, the pharmaceutical industry has been limited by the extent of cutting-edge research in pharmaceutical sciences, because the development of new drugs is a long and complex process accompanied by high risks and high costs [1,2]. In other words, the current field of drug research and development (R&D) requires significant productivity improvements to shorten the cycle time and cost of drug development [3]. Technologies such as network pharmacology, RNA-sequencing (RNA-seq), high-throughput screening (HTS), or virtual screening (VS) have all accelerated the discovery of new targets, as well as new drugs to some extent [4–9]. Nevertheless, these technologies have rarely been significant contributors to the current process of new drug discovery. Thus, there is an urgent need for new technology to drive the development of new drugs.
As the computing power of devices grows, artificial intelligence (AI) has been used in many real cases, such as in image classification and speech recognition, due to its ability to learn, process, and predict massive amounts of information [10–12]. At present, after a long period of data accumulation, in combination with the development of high-throughput RNA-seq technology, massive amounts of biomedical data have been collected [13–18]. Biomedical data, which has a high level of heterogeneity and complexity, comes from a variety of sources, including omics data from different platforms, experimental data from biological or chemical laboratories, data generated by pharmaceutical companies, publicly disclosed textual information, and manually collated data from publicly available databases [19–22]. AI can be used to learn the potential patterns in these vast amounts of biomedical data, thereby bringing new opportunities and challenges to the pharmaceutical sciences and industries.
The AlphaFold2 system used AI in the Critical Assessment of protein Structure Prediction 14 (CASP14) competition and outperformed others in accurately predicting the three-dimensional (3D) structures of proteins [23]. Similarly, in the Open Graph Benchmark Large-Scale Challenge (OGB-LSC) competition, a graph neural network (GNN) combined with a transformer model won the top rank in predicting the molecular properties calculated by means of density functional theory (DFT), which is difficult and highly time-consuming using traditional methods [24]. These competitions demonstrated the strong ability of AI to analyze biological or chemical data. Due to its powerful capability to utilize related biomedical data to understand complex biological systems and chemical reaction spaces [25,26], AI has had a revolutionary impact on all stages of drug R&D, including not only research on proteins and small molecules but also the assisted design of clinical trials and post-market surveillance [27]. Furthermore, in pharmaceutical companies, many state-of-the-art (SOTA) AI models have been adopted in diverse pipelines to shorten the R&D cycle time and decrease costs [28–30].
AI techniques in this context mainly involve machine learning (ML) and deep learning (DL). Both ML and DL algorithms are involved in target discovery and validation [31], drug discovery and design [32], and preclinical drug research [33], where they are used to analyze different data characteristics in different formats. After a drug candidate is enrolled in a clinical trial [34], DL plays a pivotal role in assisting in the design of the clinical trial and in supervising and analyzing data from the clinical phase IV [33]. Approved drugs have a strong impact on manufacturing [35] and the market economy, and DL can play a part in these areas as well. Therefore, in this review, we present a comprehensive overview of most aspects of the use of AI in the pharmaceutical sciences. We focus on how AI can be used to promote target discovery and drug discovery (as shown in Fig. 1) and reflect on how to further accelerate the development of this field.
Fig. 1. Summary of AI applications in the pharmaceutical sciences. ADMET: absorption, distribution, metabolism, excretion, and toxicity.
2. Basic concepts of AI and its scope of application
AI was first proposed at the Dartmouth Conference in 1956 and was defined as an algorithm that gives machines the ability to reason and perform functions [36]. From perceptual machines to support vector machines (SVMs) and artificial neural networks (ANNs), the development of AI has gone through several ups and downs, and is currently flourishing thanks to the hardware support that is now available. Both ML and DL fall under the category of AI; strictly speaking, DL can be placed within the category of ML. However, our discussion of ML in this review only concentrates on traditional ML methods, such as random forest (RF) and SVMs.
2.1. The big data era
In the current big data era, gigantic amounts of biological and clinical data have laid a foundation for the application of AI in the field of medical and pharmaceutical research. Although AI has been successfully and effectively applied in multiple aspects of the drug R&D process, the quantity and quality of medical data have become one of the main obstacles to the development of AI in the pharmaceutical sciences. Thus far, pharmaceutical databases with detailed and structured big data proposed by medicinal researchers worldwide are playing a key role in promoting AI applications in medical and pharmaceutical research.
For example, the therapeutic target database (TTD) includes the most comprehensive information about known and explored therapeutic protein and nucleic acid targets, the targeted disease, pathway information, and the corresponding drugs directed at each of these targets. It provides detailed knowledge of the functions of targets, as well as their sequence, 3D structures, ligand-binding properties, relevant enzymes, and corresponding drug information [37]. PubChem [17] provides collective information of chemical molecules and their activities in response to biological assays, including molecular structure, identifiers, physicochemical properties, patent information, and molecular toxicity. Some popular databases aimed at various pharmaceutical issues have been proposed and are frequently used; these play significant roles in promoting the application of AI in medical and pharmaceutical research [38–42]. Summarizing various popular pharmaceutical databases, Table 1 [17,18,37,43–62] provides brief information on popular pharmaceutical databases, categorized into protein-related, gene-related, drug-related, and disease-related databases.

Table 1
Pharmaceutical databases focusing on proteins, genes, drugs/drug targets, and diseases.

| Focus | Database | Description | Refs. |
|---|---|---|---|
| Proteins | RCSB PDB | PDB contains 3D structural data of large biological molecules, such as proteins and nucleic acids | [43] |
| Proteins | PRIDE | PRIDE is a public data repository for proteomics, including protein and peptide identifications, post-translational modifications and supporting spectral evidence | [44] |
| Proteins | UniProt | UniProt is a protein database containing protein sequences, functional information, and an index of research papers | [18] |
| Proteins | InterPro | InterPro provides functional analysis of proteins by classifying them into families and predicting domains and important sites | [45] |
| Proteins | VARIDT | VARIDT provides comprehensive data on all aspects of drug transporters’ variability | [46,47] |
| Genes | Ensembl | Ensembl provides centralized genomic data and powerful functionalities such as gene annotation and regulatory function predictions | [48] |
| Genes | UCSC Genome | The UCSC Genome browser offers access to genome sequence data from a variety of vertebrate and invertebrate species and major model organisms | [49] |
| Genes | GEO | The GEO is a database repository of high-throughput gene expression data and hybridization arrays, chips, and microarrays | [50] |
| Genes | GenBank | GenBank is an annotated collection of all publicly available DNA sequences | [51] |
| Genes | RefSeq | RefSeq provides separate and linked records for the genomic DNA, gene transcripts, and corresponding proteins for multiple organisms | [52] |
| Genes | EA | EA collects baseline gene expression data for different species and contexts, and contains differential studies reporting expression changes under two different conditions | [53] |
| Drugs/drug targets | TTD | TTD includes the most comprehensive information about known and explored therapeutic protein and nucleic acid targets | [37] |
| Drugs/drug targets | ChEMBL | ChEMBL is a manually curated library of bioactive compounds with drug-like properties | [54] |
| Drugs/drug targets | PubChem | PubChem covers collective information on chemical molecules and their activities in response to biological assays | [17] |
| Drugs/drug targets | DrugBank | DrugBank combines comprehensive drug target information with specific drug data | [55] |
| Drugs/drug targets | DrugMAP | DrugMAP provides a comprehensive list of interacting molecules for drugs/drug candidates, including information on differential expression patterns | [56] |
| Drugs/drug targets | DTC | DTC enables the exploration of bioactivity data, the processing of new bioactivity data, and data curation in order to improve the understanding of DTIs | [57] |
| Drugs/drug targets | PHAROS | PHAROS provides a comprehensive, integrated knowledge base for the druggable genome | [58] |
| Diseases | TCGA | TCGA has over 2.5 petabytes of genomic, epigenomic, transcriptomic, and proteomic data related to the cancer genome | [59] |
| Diseases | DisGenNET | DisGenNET contains large, publicly available collections of genes and variants associated with human diseases | [60] |
| Diseases | ClinVar | ClinVar is a public archive of reports on relationships among human variations and phenotypes, with supporting evidence | [61] |
| Diseases | OMIM | OMIM is an online catalog of human genes and genetic disorders | [62] |

PDB: protein data bank; PRIDE: proteomics identification database; GEO: gene expression omnibus; EA: expression atlas; DTC: drug target commons; DTIs: drug–target interactions; TCGA: the cancer genome atlas; OMIM: online mendelian inheritance in man.
2.2. ML and DL
Unlike traditional computer programming calculations, ML and DL can learn potential patterns from the input data without explicit programming. They are not limited by the format of the input data, which is broad and can include text, images, sound, and more (all types of data that can be encoded) [63]. Similar to the human learning model, ML and DL can gradually recognize different features of the data, infer the patterns lying within, and update their model parameters through continuous iterations until a valid model is formed.
According to the application scenarios, the models can be categorized into regression models and classification models. The difference between classification and regression tasks lies mainly in whether the type of output variable is continuous or discrete. Cheng and Ng [64] applied ML approaches to predict the biological activity of per- and polyfluorinated alkyl substances (PFAS) with an output of continuous values, and this study is a typical regression task. Hong et al. [65] built a DL model to predict whether a protein in a bacterium is of the T4SE type, with an output of discrete values (e.g., 0/1), and this study is a typical classification task.
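The distinction between continuous and discrete outputs can be illustrated with a minimal Python sketch. It is not taken from the studies cited above: the weights, bias, threshold, and input features are hypothetical values chosen only for illustration, and both toy models share the same linear score so that the only difference is the type of output.

```python
import math

def linear_score(weights, bias, features):
    """Weighted sum of the input features, shared by both toy models."""
    return sum(w * x for w, x in zip(weights, features)) + bias

def predict_regression(weights, bias, features):
    """Regression: the continuous score itself is the prediction."""
    return linear_score(weights, bias, features)

def predict_classification(weights, bias, features, threshold=0.5):
    """Classification: squash the score to a probability, then emit 0 or 1."""
    probability = 1.0 / (1.0 + math.exp(-linear_score(weights, bias, features)))
    return 1 if probability >= threshold else 0

# Hypothetical parameters and a hypothetical two-feature sample.
weights, bias = [0.8, -0.3], 0.1
sample = [2.0, 1.0]

activity = predict_regression(weights, bias, sample)    # a real number
label = predict_classification(weights, bias, sample)   # either 0 or 1
```

In this sketch the regression output can take any real value, whereas the classification output is restricted to the two labels 0 and 1.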
Depending on the type of learning algorithm required to solve the problem, models are conceptualized into three categories: supervised learning, unsupervised learning, and reinforcement learning. Supervised learning is a labeled-data-driven process that trains a model on the relationship between input and its prespecified output in order to predict the categories or continuous variables of future input. In comparison, unsupervised methods are used for identifying patterns in unlabeled datasets and exploring a dataset’s potential structures to allow clustering of the data for further analysis. In addition, semi-supervised learning is part-way between supervised and unsupervised learning; it trains a model on data of which only a part is labeled and is used as a potential solution for problems that lack high-quality data [66]. Reinforcement learning performs model construction through constant interactive learning, relying on penalties for failure or rewards for success.
2.3. Introduction to different types of ML/DL-based algorithms
ML and DL methods have been successfully applied to solve relevant biomedical problems, with the adopted modeling approach varying for different problems or even for the same problem. For example, small molecules used to be characterized as engineered features that were loaded directly into several ML methods to predict their properties; however, more recently, GNNs can also be utilized to describe small molecules for predictions of properties [67]. Determining the function annotations of proteins is essential for the selection of druggable proteins as potential targets. Maxat et al. [68] used a convolutional neural network (CNN) to predict the gene ontology annotation (GOA) of proteins. Nadav et al. [69] built a recurrent neural network (RNN) for protein function annotations, and Xia et al. [70] combined both a CNN and RNN to predict the gene ontology (GO) label of proteins.
ML builds a special algorithm (not a specific algorithm) that focuses on the features of the data and transforms them into knowledge that machines can read to provide humans with new insights. Various common algorithms exist for researchers to choose from. The naïve Bayes (NB) algorithm is a probabilistic classifier based on Bayes’ theorem and independence assumptions between features; it is a simple and intuitive algorithm [71]. An RF algorithm constructs a set of unrelated decision trees that form a whole hierarchical structure; under model construction, each tree is individually responsible for a corresponding problem [72]. The final decision is based on the majority votes of the decision trees. Models that make decisions based on this approach are also commonly referred to as ensemble models. eXtreme gradient boosting (XGBoost) is a scalable ML algorithm based on gradient boosting, which is also an ensemble model [73]. A multi-layer perceptron (MLP) can be viewed as a directed graph consisting of multiple node layers, each fully connected to the next layer, so that it maps a set of input vectors to a set of output vectors. SVM is one of the most widely applied ML algorithms. Samples are classified by an optimal hyperplane, which is obtained by maximizing the margins between different classes in a specific dimensional space, with the dimensionality being determined by the number of features [74]. K-nearest neighbor (KNN) is regarded as “lazy learning” that classifies the sample according to only a few neighboring samples when distinguishing between categories [75]. In addition to the above methods, several other ML methods such as principal component analysis (PCA), partial least-squares (PLS), linear discriminant analysis (LDA), and logistic regression (LR) have been applied in biomedical data processes [76,77].
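As a minimal illustration of the KNN idea described above, the following Python sketch classifies a query point by majority vote among its k nearest training points. The two-dimensional points and the "active"/"inactive" labels are hypothetical, and Euclidean distance is an assumed choice of metric; practical implementations usually rely on an established ML library.

```python
from collections import Counter
import math

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Pair every training point's Euclidean distance to the query with its label.
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    # The most frequent label among the k neighbors is the prediction.
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical two-feature descriptors for six compounds.
train_points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
                (1.0, 1.0), (0.9, 1.1), (1.1, 0.9)]
train_labels = ["inactive", "inactive", "inactive",
                "active", "active", "active"]

prediction = knn_predict(train_points, train_labels, query=(0.95, 1.0), k=3)
```

No model is fitted in advance; all work happens at prediction time, which is why the method is described as lazy learning. The same majority-vote step is what an ensemble of decision trees applies to the outputs of its individual trees.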
DL is popular due to its powerful generalization and feature-extraction capabilities; its learning and prediction process is end-to-end. Unlike the traditional ML process (which often consists of multiple independent modules), DL obtains the output data (output-end) directly from the input data (input-end) during the model training process and continuously adjusts and optimizes the model based on the error between the output and the true value, until it meets the expected result. A deep neural network (DNN) is a feed-forward neural network consisting of densely connected input, hidden, and output layers. It achieves the feature learning of input data by simulating nonlinear transformations between neurons, with each layer consisting of various neurons [78]. A CNN is a feed-forward neural network that consists of convolutional (feature extraction) and pooling (dimensionality reduction) layers. The convolutional and pooling layers help to extract all the information in a dataset without consuming too much time and computational resources [79]. An RNN is a class of ANN in which linked nodes form a directed or undirected graph along a temporal sequence. An RNN includes a feedback component that allows signals from one layer to be fed back to the previous layer. It is the only neural network with internal memory, which helps to address the difficulty of learning and storing long-term information [80]. A GNN is a connectivity model that derives the dependencies in a graph by means of information transfer between nodes in the network [81,82]. A GNN updates the state of a node according to neighbors of the node at any depth from the node; this state is able to represent the node information. The neural network architectures of the four networks described above are shown in Fig. 2.
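The information transfer between nodes in a GNN can be sketched with one deliberately simplified aggregation round in plain Python. The sketch assumes scalar node states and a plain average over a node and its direct neighbors; the learned weights and nonlinearities of a real GNN layer are omitted, and the three-node graph is hypothetical.

```python
def message_passing_round(adjacency, states):
    """Return new node states: each node averages its own state with its neighbors'."""
    new_states = {}
    for node, neighbors in adjacency.items():
        gathered = [states[node]] + [states[n] for n in neighbors]
        new_states[node] = sum(gathered) / len(gathered)
    return new_states

# Hypothetical 3-node path graph: 0 - 1 - 2. Only node 0 starts with a signal.
adjacency = {0: [1], 1: [0, 2], 2: [1]}
states = {0: 1.0, 1: 0.0, 2: 0.0}

after_one = message_passing_round(adjacency, states)
after_two = message_passing_round(adjacency, after_one)
```

After one round, node 2 has received nothing from node 0 because the two are not directly connected; after the second round its state becomes nonzero. Stacking rounds is how a node's state comes to reflect neighbors at greater depth.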
An autoencoder (AE), which consists of an encoder and a decoder, is used to learn efficient encodings of input data. The encoding, which is generated by feeding input to the encoder, is used by the decoder to regenerate the input. An AE is usually used for data compression and dimensionality reduction through the representation (i.e., the encoding) of a set of data [83]. A generative adversarial network (GAN) is composed of two underlying neural networks: a generator neural network and a discriminator neural network. The former is used to generate content, while the latter is used to discriminate the generated content [84]. Models can also be used in combination to solve a wider range of problems. For example, a graph convolution network (GCN) extends convolutional operations from traditional data (e.g., images) to graph data [85].
Fig. 2. Schematic network architectures for a DNN, GNN, CNN, and RNN.
When a model fails to learn the underlying patterns in data features effectively and loses the ability to generalize to new data, such a problem is called model underfitting [86]. In contrast, overfitting occurs when, during training, noise in the data is fitted as a representative feature, resulting in poor predictions for new data [87]. Compared with underfitting, model overfitting is more difficult to deal with. Models often become overfitted due to being overly complex or because of an underrepresentation of data. A dataset used for a model is often divided into a training set, validation set, and test set. These sets are respectively used for model training, model adjustment, and model evaluation. To put it simply, a model that works badly on both the training and test sets is an underfitted model, while a model that works well on the training set but badly on the test set is an overfitted model. Typical ways to suppress overfitting include regularization, data augmentation [88], dropout [89], early stopping, and ensemble learning, among other methods.
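The three-way split and the rule of thumb for telling underfitting from overfitting can be sketched in Python as follows. The split fractions (70/15/15) and the error thresholds are hypothetical values chosen for illustration; the text above does not prescribe specific numbers, and suitable thresholds depend on the task and the error measure.

```python
import random

def split_dataset(samples, train_frac=0.7, val_frac=0.15, seed=0):
    """Shuffle the samples and split them into training, validation, and test sets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]          # whatever remains is the test set
    return train, val, test

def diagnose_fit(train_error, test_error, high_error=0.3, gap=0.1):
    """Heuristic diagnosis from error rates; both thresholds are assumptions."""
    if train_error >= high_error:
        return "underfitted"    # poor even on the data it was trained on
    if test_error - train_error >= gap:
        return "overfitted"     # good on training data, poor on unseen data
    return "acceptable"

train_set, val_set, test_set = split_dataset(range(100))
```

For example, under these assumed thresholds `diagnose_fit(0.45, 0.50)` reports an underfitted model, while `diagnose_fit(0.02, 0.35)` reports an overfitted one.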
Researchers encountered underfitting and overfitting problems when predicting the long-term trends of the coronavirus disease 2019 (COVID-19) pandemic using only a single traditional epidemic model or a single ML model. To address these issues, Sun et al. [90] proposed a new model called dynamic-susceptible-exposed-infective-quarantined (D-SEIQ). The D-SEIQ model can accurately predict the long-term trends of COVID-19 outbreaks by appropriately modifying the susceptible-exposed-infective-recovered (SEIR) model and integrating ML-based parameter optimization under reasonable epidemiology constraints.
Different models have different evaluation criteria. In regression models, commonly used evaluation criteria include mean squared error (MSE), root MSE (RMSE), and R squared. In classification models, the more commonly used criteria are recall, precision, and F1 score. The receiver operating characteristic (ROC) curve and precision-recall curve (PRC) are the most commonly used evaluation criteria in classification models, with ROC curves taking into account both positive and negative cases to assess the overall performance of the model, while PRCs focus more on positive cases [91].
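The regression and classification criteria named above follow standard definitions, which the Python sketch below spells out. The example values are hypothetical, the positive class is assumed to be encoded as 1, and established ML libraries provide tested implementations of the same metrics.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return MSE, RMSE, and R squared for paired true and predicted values."""
    n = len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # total sum of squares
    mse = ss_res / n
    return {"mse": mse, "rmse": math.sqrt(mse), "r2": 1.0 - ss_res / ss_tot}

def classification_metrics(y_true, y_pred):
    """Return precision, recall, and F1 score, treating label 1 as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}

# Hypothetical predictions for illustration only.
reg = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 5.0])
clf = classification_metrics([1, 1, 1, 0, 0], [1, 1, 0, 1, 0])
```

Precision, recall, and F1 are computed from positive cases only (true negatives never appear in them), which is the sense in which PRC-style criteria focus more on the positive class than ROC-based ones.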
2.4. A brief description of molecule representation as model input
Over time, the accumulation of data on small molecules and proteins has resulted in an extremely large data resource. Databases of molecular sequences, structures, physicochemical properties, and so forth have been collected and organized by different organizations and contain a great deal of knowledge and information. However, the different sources and formats of the data make it difficult to integrate the correlated data from multiple heterogeneous sources. Therefore, it is particularly important to adopt suitable methods to represent molecules in an appropriate way and to mine the crucial information in the data on molecules by means of AI [92]. Current AI algorithms are highly dependent on the quality of the data; thus, when performing model construction, it is necessary to unify the input format of molecules, such as by representing small molecules and proteins as model-readable vectors or matrices.
At present, the representation of small molecules is generally done using one of four main approaches. The first approach involves knowledge-based representation. Molecular descriptors and molecular fingerprints based on human a priori knowledge are widely used in various ML or DL algorithms [93]. The second approach involves direct representation based on images. CNNs have now been used to learn rules from two-dimensional (2D) digital images. A 2D chemical digital grid of a molecule can be directly used as input to allow a CNN model to learn the properties of the molecule [94]. The third approach is string-based representation. For example, a typical canonical simplified molecular-input line-entry system (SMILES) represents small molecules in the form of strings. Thus, CNNs and RNNs can be further used to learn molecular embeddings from the string representations of chemical structures [95–97]. The fourth approach involves graph-based feature representation. Representation methods based on graph convolution or graph attention have been widely used to explore the feature representation of small molecules. In these methods, atoms and bonds are considered to be nodes and edges, respectively, while new molecular representations are obtained during the continuous updating of information at individual nodes. Graph-based representations have achieved outstanding performance in a variety of pharmaceutical learning tasks [98,99].
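To make the string-based and graph-based approaches concrete, the following Python sketch encodes ethanol (SMILES string "CCO") in both ways. It rests on simplifying assumptions: the tokenization is character-level (multi-character atoms such as Cl would need a proper SMILES tokenizer), the three-symbol vocabulary is hypothetical, and the atom and bond lists are entered by hand rather than parsed from the string, which in practice is done with a cheminformatics toolkit.

```python
def one_hot_encode(smiles, vocabulary):
    """Encode a SMILES string as one one-hot row per character."""
    index = {symbol: i for i, symbol in enumerate(vocabulary)}
    return [
        [1 if index[char] == i else 0 for i in range(len(vocabulary))]
        for char in smiles
    ]

def adjacency_matrix(n_atoms, bonds):
    """Build a symmetric adjacency matrix: atoms are nodes, bonds are edges."""
    matrix = [[0] * n_atoms for _ in range(n_atoms)]
    for i, j in bonds:
        matrix[i][j] = 1
        matrix[j][i] = 1
    return matrix

# String-based view: ethanol as SMILES with a small hypothetical vocabulary.
smiles = "CCO"
vocabulary = ["C", "O", "N"]
string_matrix = one_hot_encode(smiles, vocabulary)

# Graph-based view of the same molecule (heavy atoms only, entered by hand).
atoms = ["C", "C", "O"]
bonds = [(0, 1), (1, 2)]
graph_matrix = adjacency_matrix(len(atoms), bonds)
```

The one-hot matrix is the kind of input a CNN or RNN can consume, while the adjacency matrix together with per-atom features is the kind of input a graph-based model updates node by node.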
Protein representation methods can be basically classified into four categories: representation based on intrinsic properties of sequences, representation based on phy