
Assessing Opportunities for LLMs in Software Engineering and Acquisition

Authors

Stephany Bellomo, Shen Zhang, James Ivers, Julie Cohen, and Ipek Ozkaya

NOVEMBER 2023


LARGE LANGUAGE MODELS (LLMS) ARE GENERATIVE ARTIFICIAL INTELLIGENCE (AI) MODELS that have been trained on massive corpuses of text data and can be prompted to generate new, plausible content. LLMs are seeing rapid advances, and they promise to improve productivity in many fields. OpenAI's GPT-4¹ and Google's LaMDA² are the underlying LLMs of services like ChatGPT³, CoPilot⁴, and Bard⁵. These services can perform a range of tasks, including generating human-like text responses to questions, summarizing artifacts, and generating working code. These models and services are the focus of extensive research efforts across industry, government, and academia to improve their capabilities and relevance, and organizations in many domains are rigorously exploring their use to uncover potential applications.

The idea of harnessing LLMs to enhance the efficiency of software engineering and acquisition activities holds special allure for organizations with large software operations, such as the Department of Defense (DoD), as doing so offers the promise of substantial resource optimization. Potential use cases for LLMs are plentiful, but knowing how to assess the benefits and risks associated with their use is nontrivial. Notably, to gain access to the latest advances, organizations may need to share proprietary data (e.g., source code) with service providers. Understanding such implications is central to intentional and responsible use of LLMs, especially for organizations managing sensitive information.

In this document, we examine how decision makers, such as technical leads and program managers, can assess the fitness of LLMs to address software engineering and acquisition needs [Ozkaya 2023]. We first introduce exemplar scenarios in software engineering and software acquisition and identify common archetypes. We describe common concerns involving the use of LLMs and enumerate tactics for mitigating those concerns. Using these common concerns and tactics, we demonstrate how decision makers can assess the fitness of LLMs for their own use cases through two examples.

Capabilities of LLMs, risks concerning their use, and our collective understanding of emerging services and models are evolving rapidly [Brundage et al. 2022]. While this document is not meant to be comprehensive in covering all software engineering and acquisition use cases, their concerns, and mitigation tactics, it demonstrates an approach that decision makers can use to think through their own LLM use cases as this space evolves.

¹ /research/gpt-4
² https://blog.google/technology/ai/lamda/
³
⁴ /features/copilot
⁵

What Is an LLM?

An LLM is a deep neural network model trained on an extensive corpus of diverse documents (e.g., websites and books) to learn language patterns, grammar rules, facts, and even some reasoning abilities [Wolfram 2023]. LLMs can generate responses to inputs ("prompts") by iteratively determining the next word or phrase appearing after others based on the prompt and the patterns and associations learned from their training corpus, using probabilistic and randomized selection [White et al. 2023]. This capability allows LLMs to generate human-like text that can be surprisingly coherent and contextually relevant, even if it may not always be semantically correct.
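The iterative, randomized selection described above can be illustrated with a toy next-token sampler. This is a simplified sketch, not how production LLMs tokenize or score text: the candidate words and scores are invented for illustration, and the `temperature` parameter merely mirrors the randomness setting that many LLM services expose.

```python
import math
import random

def sample_next_token(scores, temperature=1.0, seed=None):
    """Pick the next token from a score distribution.

    Temperature divides the scores before a softmax: low values make the
    highest-scoring token dominate (near-greedy decoding); high values
    flatten the distribution, increasing randomness.
    """
    rng = random.Random(seed)
    scaled = {tok: s / temperature for tok, s in scores.items()}
    m = max(scaled.values())  # subtract max for numerical stability
    weights = {tok: math.exp(s - m) for tok, s in scaled.items()}
    total = sum(weights.values())
    r = rng.random()
    cumulative = 0.0
    for tok, w in weights.items():
        cumulative += w / total
        if r < cumulative:
            return tok
    return tok  # fallback for floating-point rounding

# Invented scores for candidate continuations of a prompt.
scores = {"design": 5.0, "behavioral": 4.0, "banana": 0.1}

# Near-zero temperature is effectively greedy: the top-scored token wins.
print(sample_next_token(scores, temperature=0.01, seed=0))  # design
```

Repeated calls at a higher temperature (e.g., 2.0) with different seeds would occasionally return the lower-scored tokens, which is the randomness that makes LLM output vary between runs.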

While LLMs can perform complex tasks using their trained knowledge, they lack true understanding. Rather, they are sophisticated pattern matching tools. Moreover, due to their probabilistic reasoning, they can generate inaccurate results (often referred to as "hallucinations"), such as citations to non-existent references or method calls to non-existent application programming interfaces (APIs). While LLMs can perform analysis and inferencing on new data they have been prompted with, the data on which LLMs have been trained can limit their accuracy. However, the technology is rapidly advancing, with new models having increasing complexity and parameters, and benchmarks have already emerged for comparing their performance [Imsys 2023]. In addition, LLM service providers are working on ways to use more recent data [D'Cruze 2023]. Despite these limitations, there are productive uses of LLMs today.

Choosing an LLM

There are already dozens of LLMs and services built using LLMs, and more emerge every day. These models vary in many dimensions, from technical to contractual, and the details of these differences can be difficult to keep straight. The following distinctions are a good starting point when choosing an LLM for use.

Model or Service. ChatGPT is a chatbot built on OpenAI's GPT family of LLMs [OpenAI 2023]. The difference is important, as services built on LLMs can add additional capabilities (e.g., specialized chatbot features, specialized training beyond the core LLM, or non-LLM features that can improve results from an LLM). A service like ChatGPT is typically hosted by a service provider, meaning that it manages the computing resources (and associated costs) and that users are typically required to send their prompts (and potentially sensitive data) to the service provider to use the service. A model, like Meta's Llama 2⁶, can be fine-tuned with domain- or organization-specific data to improve accuracy, but it typically lacks the added features and resources of a commercially supported service.

⁶ /llama/


General or Specialized. LLMs are pre-trained on a corpus, and the composition of that corpus is a significant factor affecting an LLM's performance. General LLMs are trained on text sources like Wikipedia that are available to the public. Specialized LLMs fine-tune those models by adding training material from specific domains like healthcare and finance [Zhou et al. 2022; Wu et al. 2023]. LLMs like CodeGen⁷ have been specialized with large corpuses of source code for use in software engineering.

Open Source or Proprietary. Open source LLMs provide a platform for researchers and developers to freely access, use, and even contribute to the model's development. Proprietary LLMs are subject to varying restrictions on use, making them less open to experimentation or potential deployment. Some providers (e.g., Meta) use a license that is largely, but not completely, open [Hull 2023]. OpenAI offers a different compromise: While the GPT series of LLMs is not open source, OpenAI does permit fine-tuning (for a fee) as a means of specialization and limited experimentation with their proprietary model.

The field of LLMs is a fast-moving space. Moreover, the ethics and regulations surrounding their use are also in a state of flux, as society grapples with the challenges and opportunities these powerful models present. Keeping apprised of these developments is crucial for taking advantage of the potential offered by LLMs.

⁷ /salesforce/CodeGen

Use Cases

The ability of LLMs to generate plausible content for text and code applications has sparked the imaginations of many. A recent literature review examines 229 research papers written since 2017 on the application of LLMs to software engineering problems [Hou et al. 2023]. Application areas span requirements, design, development, testing, maintenance, and management activities, with development and testing being the most common.

Our team, which works with government organizations daily, took a broader perspective and brainstormed several dozen ideas for using LLMs in common software engineering and acquisition activities (see Table 1 for examples). Two important observations quickly emerged from this activity. First, most use cases represent human-AI partnerships in which an LLM or LLM-based service could be used to help humans (as opposed to replace humans) complete tasks more quickly. Second, deciding which use cases would be most feasible, beneficial, or affordable is not a trivial decision for those just getting started with LLMs.


Table 1: Sample Acquisition and Software Engineering Use Cases

ACQUISITION USE CASES

A1. A new acquisition specialist uses an LLM to generate an overview of relevant federal regulations for an upcoming request for proposal (RFP) review, expecting the summary to save time in background reading.

A2. A chief engineer uses an LLM to generate a comparison of alternatives from multiple proposals, expecting it to use the budget and schedule formulas from previous similar proposal reviews and generate accurate itemized comparisons.

A3. A contract specialist uses an LLM to generate ideas for a request for information (RFI) solicitation given a set of concerns and a vague problem description, expecting it to generate a draft RFI that is at least 75% aligned with their needs.

A4. A CTO uses an LLM to create a report summarizing all uses of digital engineering technologies in the organization based on internal documents, expecting it can quickly produce a clear summary that is at least 90% correct.

A5. A program office lead uses an LLM to evaluate a contractor's code delivery for compliance with required design patterns, expecting that it will identify any instances in which the code fails to use required patterns.

A6. A program manager uses an LLM to summarize a set of historical artifacts from the past six months in preparation for a high-visibility program review and provides specific retrieval criteria (e.g., delivery tempo, status of open defects, and schedule), expecting it to generate an accurate summary of program status that complies with the retrieval criteria.

A7. A program manager uses an LLM to generate a revised draft of a statement of work, given a short starting description and a list of concerns (e.g., cybersecurity, software delivery tempo, and interoperability goals). The program manager expects it to generate a structure that can be quickly refined and that includes topics drawn from best practices they may not think to request explicitly.

A8. A requirements engineer uses an LLM to generate draft requirements statements for a program upgrade based on past similar capabilities, expecting them to be a good starting point.

A9. A contract officer is seeking funding to conduct research on a high-priority topic they are not familiar with. The contract officer uses an LLM to create example project descriptions for their context, expecting it to produce reasonable descriptions.

SOFTWARE ENGINEERING USE CASES

SE1. A developer uses an LLM to find vulnerabilities in existing code, hoping that the exercise will catch additional issues not already found by static analysis tools.

SE2. A developer uses an LLM to generate code that parses structured input files and performs specified numerical analysis on its inputs, expecting it to generate code with the desired capabilities.

SE3. A tester uses an LLM to create functional test cases, expecting it to produce a set of text test cases from a provided requirements document.

SE4. A developer uses an LLM to generate software documentation from code to be maintained, expecting it to summarize its functionality and interface.

SE5. A software engineer who is unfamiliar with SQL uses an LLM to generate an SQL query from a natural language description, expecting it to generate a correct query that can be tested immediately.

SE6. A software architect uses an LLM to validate whether code that is ready for deployment is consistent with the system's architecture, expecting that it will reliably catch deviations from the intended architecture.

SE7. A developer uses an LLM to translate several classes from C++ to Rust, expecting that the translated code will pass the same tests and be more secure and memory safe.

SE8. A developer uses an LLM to generate synthetic test data for a new feature being developed, expecting that it will quickly generate syntactically correct and representative data.

SE9. A developer provides an LLM with code that is failing in production and a description of the failures, expecting it to help the developer diagnose the root cause and propose a fix.
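SE5 notes that a generated query "can be tested immediately." One low-cost way to do that is to run the candidate query against an in-memory database seeded with representative data before anyone trusts it. The query string, schema, and expected rows below are invented stand-ins for actual LLM output:

```python
import sqlite3

# A candidate query as an LLM might return it for the prompt
# "total order amount per customer, highest first" (invented example).
candidate_sql = """
SELECT customer, SUM(amount) AS total
FROM orders
GROUP BY customer
ORDER BY total DESC
"""

# Build a small in-memory database with representative data so the
# generated query can be exercised without touching real systems.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("acme", 10.0), ("acme", 5.0), ("globex", 7.5)],
)

# Compare the query's output against results the user computed by hand.
rows = conn.execute(candidate_sql).fetchall()
assert rows == [("acme", 15.0), ("globex", 7.5)], rows
print("generated query matches expected results")
```

This kind of immediate, mechanical check is what places a use case like SE5 in the favorable "mistakes are easy for users to find" region discussed later in this document.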

Archetypes

Commonalities among the use cases lend themselves to abstracting the set into a manageable number of archetypes. Two dimensions are helpful in this regard: the nature of the activity an LLM is performing and the nature of the data that the LLM is acting on. Taking the cross-product of these dimensions, these use cases fall into the archetypes depicted in Table 2.

Table 2: Use Case Archetypes

ACTIVITY TYPE         | DATA TYPE: Text | Code          | Model          | Images
Retrieve Information  | retrieve-text   | retrieve-code | retrieve-model | retrieve-images
Generate Artifact     | generate-text   | generate-code | generate-model | generate-images
Modify Artifact       | modify-text     | modify-code   | modify-model   | modify-images
Analyze Artifact      | analyze-text    | analyze-code  | analyze-model  | analyze-images
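The cross-product construction behind Table 2 can be stated compactly in code. This sketch just enumerates the sixteen archetype names from the two dimensions; the string forms mirror the hyphenated cell labels in the table:

```python
from itertools import product

# The two dimensions from Table 2.
activities = ["retrieve", "generate", "modify", "analyze"]
data_types = ["text", "code", "model", "images"]

# Each archetype is one (activity, data type) pair from the cross-product.
archetypes = [f"{a}-{d}" for a, d in product(activities, data_types)]

assert len(archetypes) == 16
assert "generate-code" in archetypes
print(archetypes)
```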


Matching a specific use to an archetype helps identify common concerns among similar use cases and known solutions commonly applied for similar use cases. Archetypes can be a tool that organizations use to group successes, gaps, and lessons learned in a structured way.

Activity Type captures differences in associations that an LLM would need to make to support a use case, with some asking an LLM to do things that a language model was not designed to do:

• Retrieve Information asks an LLM to construct a response to a question (e.g., what's the Observer pattern?) for which a known answer is likely found in the training corpus, directly or across related elements.

• Generate Artifact asks an LLM to create a new artifact (e.g., a summary of a topic or a Python script that performs a statistical analysis) that likely bears similarity with existing examples in the corpus.

• Modify Artifact asks an LLM to modify an existing artifact to improve it in some way (e.g., translate Python code to Java or remove a described bug) that resembles analogous improvements among artifacts in the training corpus.

• Analyze Artifact asks an LLM to draw a conclusion about provided information (e.g., what vulnerabilities are in this code, or will this architecture scale adequately?) that likely requires semantic reasoning about data.

Data Type captures differences in the kind of data that an LLM operates on or generates, such as the differences in semantic rules that make data (e.g., code) well formed:

• Text inputs vary widely in formality and structure (e.g., informal chat versus structured text captured in templates).

• Code is text with formal rules for structure and semantics, and a growing number of LLMs are being specialized to take advantage of this structure and semantics.

• Models are abstractions (e.g., from software design or architecture) that often use simple terms (e.g., publisher) that imply deep semantics.

• Images are used to communicate many software artifacts (e.g., class diagrams) and often employ visual conventions that, much like models, imply specific semantics. While LLMs operate on text, multimodal LLMs (e.g., GPT-4) are growing in their ability to ingest and generate image data.

Figure 1 shows an example of using the archetypes to generate ideas for LLM use cases in a particular domain. This example focuses on independent verification and validation (IV&V), a resource-intensive activity within the DoD that involves many different activities that might benefit from the use of LLMs. More complex use cases for IV&V could also be generated that involve integration of multiple archetypes into a larger workflow.

Figure 1: Using Archetypes to Help Brainstorm Potential Use Cases

The figure overlays six brainstormed IV&V use cases on the archetype grid from Table 2, spanning the generate-text, generate-code, generate-images, analyze-text, and analyze-code cells:

• An IV&V evaluator uses an LLM to create a verification checklist from a set of certification regulations and a system description, expecting it to produce a context-sensitive checklist they can tailor.

• A tester uses an LLM to create integration test descriptions from a set of APIs and integration scenarios, expecting it to produce a set of test case descriptions that can be used to implement tests.

• An IV&V evaluator uses an LLM to analyze software design documents against a specific set of certification criteria and to generate a certification report, expecting it to describe certification violations that they will review to confirm.

• A new developer uses an LLM as a pair programmer to write code, expecting it to help create vulnerability-free code.

• A developer uses an LLM to find vulnerabilities in existing code, hoping that the exercise will catch additional issues not already found by static analysis tools.

• A developer uses an LLM to create a network view for authorization to operate (ATO) certification from a description of the architecture, expecting it to produce a rough network diagram they can refine.


Figure 2: Two Ways to Look at Concerns with the Generation of Incorrect Results (A: Acquisition Use Cases, SE: Software Engineering Use Cases [Table 1])

The figure plots the use cases from Table 1 along two axes: whether mistakes have small or large consequences, and whether mistakes are easy or hard for users to find.

Concerns and How to Address Them

Recognizing concerns around applications of LLMs to software engineering and acquisition, and deciding how to address each, will help decision makers make more informed choices. There are multiple perspectives one should consider before going forward with an LLM use case. An important reality is that the results generated by LLMs are in fact sometimes wrong.

Figure 2 illustrates this perspective based on two questions:

• How significant would it be to act on an incorrect result in a given use case?

• How easy would it be for a user in the use case to recognize that a result from an LLM is incorrect?

This figure shows a notional placement of the use cases from Table 1 (actual placement would be reliant on refinement of these use cases). The green quadrant is ideal from this perspective: Mistakes are not particularly consequential and are relatively easy to spot. Use cases in this quadrant can be a great place for organizations to start LLM experimentation. The red quadrant, on the other hand, represents the least favorable cases for LLM use: Mistakes create real problems and are hard for users to recognize.

The consequences of mistakes and the ease of spotting them are only one perspective of evaluation. Another perspective is the expected significance of improvements or efficiencies achievable with LLMs. Among many concerns, we discuss five categories in further detail in this document (correctness, disclosure, usability, performance, and trust), as they are relevant to all use cases.

Correctness: The significance of correctness as a concern depends on factors such as how the results will be used, the safeguards used in workflows, and the expertise of users. Correctness refers to the overall accuracy and precision of output relative to some known truth or expectation. Accuracy hinges greatly on whether an LLM was trained or fine-tuned with data that is sufficiently representative to support the specific use case. Even with rich training corpuses, some inaccuracy can be expected [Ouyang et al. 2023]. For example, a recent study on code translation found GPT-4 to perform better than other LLMs, even though more than 80% of translations on a pair of open source projects contained some errors. Advances are likely to improve, but not eliminate, these numbers [Pan et al. 2023].


Disclosure: When users interact with LLMs, some use cases may require disclosing proprietary or sensitive information to a service provider to complete a task (e.g., sharing source code to help debug it). The disclosure concern is therefore related to the amount of proprietary information that must be exposed during use. If users share confidential data, trade secrets, or personal information, there is a risk that such data could be stored, misused, or accessed by unauthorized individuals. Moreover, it might become part of the training data corpus and be disseminated without users having any means to track its origin. For example, GSA CIO IL-23-01 (the U.S. General Services Administration instructional letter Security Policy for Generative Artificial Intelligence [AI] Large Language Models [LLMs]) bans disclosure of federal nonpublic information as inputs in prompts to third-party LLM endpoints [GSA 2023].

Usability: LLM users have vastly different backgrounds, expectations, and technical abilities. Usability captures the ability of LLM users with different expertise to complete tasks. Users may need expertise on both the input (crafting appropriate prompts) and output (judging the correctness of results) sides of LLM use [Zamfirescu-Pereira et al. 2023]. The significance of usability as a concern depends on the degree to which getting to acceptable results is sensitive to the expertise of users. A study of developers' early experiences using CoPilot reflects that there is a shift from writing code to understanding code when using LLMs on coding tasks [Bird et al. 2023]. This observation hints at the need for different usability techniques for interaction mechanisms, as well as the need to account for expertise.

Performance: While using an LLM requires much less computing power than training an LLM, responsiveness can still be a factor in LLM use, especially if sophisticated prompting approaches are incorporated into an LLM-based service. For the purposes of concerns related to use cases, performance expresses the time required to arrive at an appropriate response. Model size, underlying compute power, and where the model runs and is accessed from are among the factors that influence responsiveness [Patterson et al. 2022]. Services built on LLMs may introduce additional performance overhead due to the way in which other capabilities are integrated with the LLM.

Trust: To employ the technology with the requisite level of trust, users must grasp the limitations of LLMs. Trust reflects the user's confidence in the output. Overreliance on an LLM without understanding its potential for error or bias can lead to undesirable consequences [Rastogi et al. 2023]. As a result, several other concerns (e.g., explainability, bias, privacy, security, and ethics) are often considered in relationship to trust [Schwartz et al. 2023]. For example, the DoD published ethical AI principles to advance trustworthy AI systems [DoD 2020].

How significant these and other concerns are for each use case will vary by context and use. The questions provided in Table 3 can help organizations assess how relevant each concern is for a specific use case. A starting point could be to categorize the significance of each concern as High, Medium, or Low. This information can help organizations decide whether an LLM is fit for purpose and what concerns need to be mitigated to avoid unacceptable outcomes.

Table 3: Example Questions to Help Determine the Significance of Common Concerns for a Specific Use Case

Correctness
• What is the risk or impact of using an incorrect result in the use case?
• How difficult is it for the expected user to determine whether a result is correct?
• Are there gaps in the data used to train the LLM that could adversely impact results (e.g., the data is not current with recent technology releases or contains little data for an esoteric programming language)?

Disclosure
• Can an LLM be prompted without disclosing proprietary information (e.g., using generic questions or abstracting proprietary details)?
• What is the risk or impact of a third party being able to observe your prompts?
• Are there existing data disclosure constraints that strictly need to be observed?

Usability
• How adept are expected users at prompting an LLM?
• How familiar are expected users with approaches for determining whether results are inaccurate?
• How familiar are expected users with approaches for determining whether results are incomplete?

Performance
• How quickly must a user or machine be able to act on a result?
• Are there significant computing resource limitations?
• Are there intermediate steps in the interaction with the LLM that may affect end-to-end performance?

Trust
• Are your expected users predisposed to accept generated results (automation bias) or reject them?
• Is the data the LLM was trained on free of bias and ethical concerns?
• Has the LLM been trained on data that is appropriate for use?
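As a sketch of the suggested High/Medium/Low categorization, the worksheet below rates the five common concerns for one use case (SE5 from Table 1). Both the ratings and the screening rule are invented for illustration; they are not part of this document's method, just one way an organization might record and act on its answers to the Table 3 questions:

```python
# Hypothetical concern ratings for SE5 (natural language to SQL),
# using the High/Medium/Low scheme suggested above.
ratings = {
    "correctness": "M",   # wrong SQL is fairly easy to catch by testing it
    "disclosure": "L",    # schema details can be abstracted before prompting
    "usability": "M",     # the user is unfamiliar with SQL
    "performance": "L",   # no real-time constraint on getting a query
    "trust": "M",         # the user cannot judge query quality directly
}

def screen(ratings):
    """Invented screening rule: flag for mitigation planning if any
    concern is rated High; otherwise treat the use case as a reasonable
    candidate for early experimentation."""
    if "H" in ratings.values():
        return "mitigate before use"
    return "candidate for early experimentation"

print(screen(ratings))  # candidate for early experimentation
```

In practice the ratings would come from answering the Table 3 questions with the actual users and data involved, and a High rating would point to the mitigation tactics in Table 4 rather than to a simple pass/fail rule.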


These common concerns, and questions to determine their significance, enable identification of common tactics for addressing each concern. A tactic is a course of action that can be taken to reduce the occurrence or impact of a concern. Table 4 summarizes a collection of tactics that can help mitigate each concern, along with a rough estimate (High [H], Medium [M], or Low [L]) of the relative potential cost of using each tactic. Typically, the more resources (human and computational) a tactic requires, the higher the cost. For example, prompt engineering and model training both address correctness, but prompt engineering is typically much less expensive. Of note, some tactics focus on technical interventions, others focus on human-centered actions, and the rest could employ technical or human-centered interventions.

Table 4: Tactics That Can Be Used to Address Common Concerns with LLM Use

Correctness
• Prompt engineering (cost: L). Educate users on prompt engineering techniques and patterns to generate better results.
• Validate manually (cost: M). Dedicate time to allow users to carefully validate interim and final results.
• Adjust settings (cost: L). Change settings of exposed model parameters like temperature (randomness of the model's output) and the maximum number of tokens.
• Adopt newer model (cost: M). Use newer models that integrate technical advances or improved training corpuses that can produce better results.
• Fine-tune model (cost: M). Tailor a pretrained model using organization- or domain-specific data to improve results.
• Train new model (cost: H). Use a custom training corpus or proprietary data to train a new model.

Disclosure
• Open disclosure policy. Establish a policy that allows users to share as much deta
