




版权说明:本文档由用户提供并上传,收益归属内容提供方,若内容存在侵权,请进行举报或认领
文档简介
Lessonsfromredteaming100
generativeAIproducts
Authoredby:
MicrosoftAIRedTeam
Lessonsfromredteaming100generativeAIproducts2
Authors
BlakeBullwinkel,AmandaMinnich,ShivenChawla,GaryLopez,MartinPouliot,WhitneyMaxwell,JorisdeGruyter,KatherinePratt,SaphirQi,NinaChikanov,RomanLutz,RajaSekharRaoDheekonda,Bolor-ErdeneJagdagdorj,
EugeniaKim,JustinSong,KeeganHines,DanielJones,GiorgioSeveri,RichardLundeen,SamVaughan,
VictoriaWesterhoff,PeteBryan,RamShankarSivaKumar,YonatanZunger,ChangKawaguchi,MarkRussinovich
Lessonsfromredteaming100generativeAIproducts3
Tableofcontents
04
Abstract
07
Redteaming
operations
09
Casestudy#1
Jailbreakingavision
languagemodeltogenerate
hazardouscontent
12
Lesson4
Automationcanhelpcover
moreoftherisklandscape
05
Introduction
08
Lesson1
Understandwhatthesystem
candoandwhereitisapplied
10
Lesson3
AIredteamingisnot
safetybenchmarking
12
Lesson5
ThehumanelementofAI
redteamingiscrucial
05
AIthreatmodel
ontology
08
Lesson2
Youdon’thavetocompute
gradientstobreakanAIsystem
11
Casestudy#2
AssessinghowanLLMcouldbe
usedtoautomatescams
13
Casestudy#3
Evaluatinghowachatbot
respondstoauserindistress
14
Casestudy#4
Probingatext-to-image
generatorforgenderbias
16
Casestudy#5
SSRFinavideo-processing
GenAIapplication
14
Lesson6
ResponsibleAIharmsare
pervasivebutdifficulttomeasure
17
Lesson8
TheworkofsecuringAIsystems
willneverbecomplete
15
Lesson7
LLMsamplifyexistingsecurity
risksandintroducenewones
18
Conclusion
Lessonsfromredteaming100generativeAIproducts4
Abstract
Inrecentyears,AIredteaminghasemergedasapracticeforprobingthesafetyandsecurityofgenerativeAI
systems.Duetothenascencyofthefield,therearemanyopenquestionsabouthowredteamingoperationsshouldbeconducted.Basedonourexperienceredteamingover100generativeAIproductsatMicrosoft,wepresentourinternalthreatmodelontologyandeightmainlessonswehavelearned:
1.Understandwhatthesystemcandoandwhereitisapplied
2.Youdon’thavetocomputegradientstobreakanAIsystem
3.AIredteamingisnotsafetybenchmarking
4.Automationcanhelpcovermoreoftherisklandscape
5.ThehumanelementofAIredteamingiscrucial
6.ResponsibleAIharmsarepervasivebutdifficulttomeasure
7.Largelanguagemodels(LLMs)amplifyexistingsecurityrisksandintroducenewones
8.TheworkofsecuringAIsystemswillneverbecomplete
Bysharingtheseinsightsalongsidecasestudiesfromouroperations,weofferpracticalrecommendationsaimedataligningredteamingeffortswithrealworldrisks.WealsohighlightaspectsofAIredteamingthatwebelieveareoftenmisunderstoodanddiscussopenquestionsforthefieldtoconsider.
Lessonsfromredteaming100generativeAIproducts5
Introduction
AsgenerativeAI(GenAI)systemsareadoptedacrossanincreasingnumberofdomains,AIredteaminghasemergedasacentralpracticeforassessingthesafetyandsecurityofthesetechnologies.Atitscore,AIredteamingstrivestopushbeyondmodel-levelsafety
benchmarksbyemulatingreal-worldattacksagainstend-to-endsystems.However,therearemanyopenquestionsabouthowredteamingoperationsshouldbeconductedandahealthydoseofskepticismabouttheefficacyofcurrentAIredteamingefforts[4,8,32].Inthispaper,wespeaktosomeoftheseconcernsbyprovidinginsightintoourexperienceredteaming
over100GenAIproductsatMicrosoft.Thepaper
isorganizedasfollows:First,wepresentthethreat
modelontologythatweusetoguideouroperations.Second,weshareeightmainlessonswehavelearnedandmakepracticalrecommendationsforAIred
teams,alongwithcasestudiesfromouroperations.
Inparticular,thesecasestudieshighlighthowour
ontologyisusedtomodelabroadrangeofsafety
andsecurityrisks.Finally,weclosewithadiscussionofareasforfuturedevelopment.
Background
TheMicrosoftAIRedTeam(AIRT)grewoutofpre-existingredteaminginitiativesatthecompanyandwasofficiallyestablishedin2018.Atitsconception,theteamfocusedprimarilyonidentifyingtraditionalsecurityvulnerabilitiesandevasionattacksagainstclassicalMLmodels.Sincethen,boththescopeandscaleofAIredteamingatMicrosofthaveexpandedsignificantlyinresponsetotwomajortrends.
First,AIsystemshavebecomemoresophisticated,
compellingustoexpandthescopeofAIredteaming.Mostnotably,state-of-the-art(SoTA)modelshave
gainednewcapabilitiesandsteadilyimprovedacrossarangeofperformancebenchmarks,introducing
novelcategoriesofrisk.Newdatamodalities,suchasvisionandaudio,alsocreatemoreattackvectorsforredteamingoperationstoconsider.Inaddition,agenticsystemsgrantthesemodelshigherprivilegesandaccesstoexternaltools,expandingboththe
attacksurfaceandtheimpactofattacks.
Second,Microsoft’srecentinvestmentsinAIhave
spurredthedevelopmentofmanymoreproductsthatrequireredteamingthaneverbefore.Thisincrease
involumeandtheexpandedscopeofAIredteaminghaverenderedfullymanualtestingimpractical,
forcingustoscaleupouroperationswiththehelpofautomation.Toachievethisgoal,wedevelopPyRIT,
anopen-sourcePythonframeworkthatouroperatorsutilizeheavilyinredteamingoperations[27].By
augmentinghumanjudgementandcreativity,PyRIThasenabledAIRTtoidentifyimpactfulvulnerabilitiesmorequicklyandcovermoreoftherisklandscape.ThesetwomajortrendshavemadeAIredteamingamorecomplexendeavorthanitwasin2018.In
thenextsection,weoutlinetheontologywehavedevelopedtomodelAIsystemvulnerabilities.
AIthreatmodel
ontology
Asattacksandfailuremodesincreaseincomplexity,itishelpfultomodeltheirkeycomponents.Basedonourexperienceredteamingover100GenAIproductsforawiderangeofrisks,wedevelopedanontologytodoexactlythat.Figure1illustratesthemain
componentsofourontology:
•System:Theend-to-endmodelorapplicationbeingtested.
•Actor:ThepersonorpersonsbeingemulatedbyAIRT.NotethattheActor’sintentcouldbeadversarial(e.g.,ascammer)orbenign(e.g.,atypicalchatbotuser).
•TTPs:TheTactics,Techniques,andProceduresleveragedbyAIRT.AtypicalattackconsistsofmultipleTacticsandTechniques,whichwemaptoMITREATT&CK®andMITREATLASMatrixwheneverpossible.
–Tactic:High-levelstagesofanattack(e.g.,reconnaissance,MLmodelaccess).
–Technique:Methodsusedtocompleteanobjective(e.g.,activescanning,jailbreak).
–Procedure:ThestepsrequiredtoreproduceanattackusingtheTacticsandTechniques.
•Weakness:ThevulnerabilityorvulnerabilitiesintheSystemthatmaketheattackpossible.
•Impact:Thedownstreamimpactcreatedbytheattack(e.g.,privilegeescalation,generationofharmfulcontent).
Itisimportanttonotethatthisframeworkdoesnotassumeadversarialintent.Inparticular,AIRTemulatesbothadversarialattackersandbenignuserswho
encountersystemfailuresunintentionally.PartofthecomplexityofAIredteamingstemsfromthewiderangeofimpactsthatcouldbecreatedbyanattack
Lessonsfromredteaming100generativeAIproducts6
orsystemfailure.Inthelessonsbelow,weshare
casestudiesdemonstratinghowourontologyis
flexibleenoughtomodeldiverseimpactsintwomaincategories:securityandsafety.
Securityencompasseswell-knownimpactssuch
asdataexfiltration,datamanipulation,credential
dumping,andothersdefinedinMITREATT&CK®,awidelyusedknowledgebaseofsecurityattacks.WealsoconsidersecurityattacksthatspecificallytargettheunderlyingAImodelsuchasmodelevasion,
promptinjections,denialofAIservice,andotherscoveredbytheMITREATLASMatrix.
Safetyimpactsarerelatedtothegenerationofillegalandharmfulcontentsuchashatespeech,violence
andself-harm,andchildabusecontent.AIRTworkscloselywiththeOfficeofResponsibleAItodefinethesecategoriesinaccordancewithMicrosoft’s
ResponsibleAIStandard[25].Werefertothese
impactsasresponsibleAI(RAI)harmsthroughoutthisreport.
Tounderstandthisontologyincontext,consider
thefollowingexample.Imagineweareredteaming
anLLM-basedcopilotthatcansummarizeauser’s
emails.Onepossibleattackagainstthissystemwouldbeforascammertosendanemailthatcontainsa
hiddenpromptinjectioninstructingthecopilotto
“ignorepreviousinstructions”andoutputamaliciouslink.Inthisscenario,theActoristhescammer,who
isconductingacross-promptinjectionattack(XPIA),whichexploitsthefactthatLLMsoftenstruggleto
distinguishbetweensystem-levelinstructionsand
userdata[4].ThedownstreamImpactdependsonthenatureofthemaliciouslinkthatthevictimmightclickon.Inthisexample,itcouldbeexfiltratingdataor
installingmalwareontotheuser’scomputer.
Actor
Conducts
TTPs
:
●
Leverages
●
●
●
Attack
Exploits
Mitigation
:
●
Mitigatedby
●
●
●
Weakness
Occursin
●
:
System
Creates
Impact
Figure1:MicrosoftAIRTontologyformodelingGenAIsystemvulnerabilities.AIRToftenleveragesmultipleTTPs,whichmayexploitmultipleWeaknessesandcreatemultipleImpacts.Inaddition,morethanoneMitigationmaybenecessarytoaddressaWeakness.NotethatAIRTistaskedonlywithidentifyingrisks,whileproductteamsareresourcedtodevelopappropriatemitigations.
Lessonsfromredteaming100generativeAIproducts7
Redteaming
operations
Inthissection,weprovideanoverviewofthe
operationswehaveconductedsince2021.Intotal,wehaveredteamedover100GenAIproducts.Broadly
speaking,theseproductscanbebucketedinto
“models”and“systems.”Modelsaretypicallyhostedonacloudendpoint,whilesystemsintegratemodelsintocopilots,plugins,andotherAIappsandfeatures.Figure2showsthebreakdownofproductswehave
redteamedsince2021.Figure3showsabarchartwiththeannualpercentageofouroperationsthathave
probedforsafety(RAI)vs.securityvulnerabilities.
In2021,wefocusedprimarilyonapplicationsecurity.Althoughouroperationshaveincreasinglyprobed
forRAIimpacts,ourteamcontinuestoredteamforsecurityimpactsincludingdataexfiltration,credentialleaking,andremotecodeexecution.Organizations
haveadoptedmanydifferentapproachestoAIred
teamingrangingfromsecurity-focusedassessmentswithpenetrationtestingtoevaluationsthattarget
onlyGenAIfeatures.InLessons2and7,weelaborateonsecurityvulnerabilitiesandexplainwhywebelieveitisimportanttoconsiderbothtraditionalandAI-
specificweaknesses.
AfterthereleaseofChatGPTin2022,MicrosoftenteredtheeraofAIcopilots,startingwithAI-poweredBingChat,releasedinFebruary2023.
Thismarkedaparadigmshifttowardsapplications
thatconnectLLMstoothersoftwarecomponents
includingtools,databases,andexternalsources.
Applicationsalsostartedusinglanguagemodelsas
reasoningagentsthatcantakeactionsonbehalfof
users,introducinganewsetofattackvectorsthat
haveexpandedthesecurityrisksurface.InLesson
7,weexplainhowtheseattackvectorsbothamplifyexistingsecurityrisksandintroducenewones.
Inrecentyears,themodelsatthecenterofthese
applicationshavegivenrisetonewinterfaces,
allowinguserstointeractwithappsusingnatural
languageandrespondingwithhigh-qualitytext,
image,video,andaudiocontent.DespitemanyeffortstoalignpowerfulAImodelstohumanpreferences,
manymethodshavebeendevelopedtosubvert
safetyguardrailsandelicitcontentthatisoffensive,unethical,orillegal.Weclassifytheseinstancesof
harmfulcontentgenerationasRAIimpactsandin
Lessons3,5,and6discusshowwethinkabouttheseimpactsandthechallengesinvolved.
Inthenextsection,weelaborateontheeightmain
lessonswehavelearnedfromouroperations.Wealsohighlightfivecasestudiesfromouroperationsand
showhoweachonemapstoourontologyinFigure1.WehopetheselessonsareusefultoothersworkingtoidentifyvulnerabilitiesintheirownGenAIsystems.
80+100+
OpsProducts
Plugins
AppsandFeatures
Copilots
15%
16%
24%
Models
45%
Figure2:PiechartshowingthepercentagebreakdownofAI
productsthatAIRThastested.AsofOctober2024,wehave
conductedover80operationscoveringmorethan100products.
Percentageofopsprobingsafetyvs.security
Safety(RAI)%Security%
100
80
60
40
20
0
2021202220232024
Figure3:Barchartshowingthepercentageofoperationsthatprobedsafety(RAI)vs.securityvulnerabilitiesfrom2021–2024.
Lessonsfromredteaming100generativeAIproducts8
Lessons
Lesson1:
Understandwhatthesystem
candoandwhereitisapplied
ThefirststepinanAIredteamingoperationisto
determinewhichvulnerabilitiestotarget.Whilethe
ImpactcomponentoftheAIRTontologyisdepictedattheendofourontology,itservesasanexcellent
startingpointforthisdecision-makingprocess.
Startingfrompotentialdownstreamimpacts,rather
thanattackstrategies,makesitmorelikelythatan
operationwillproduceusefulfindingstiedtoreal
worldrisks.Aftertheseimpactshavebeenidentified,redteamscanworkbackwardsandoutlinethevariouspathsthatanadversarycouldtaketoachievethem.
Anticipatingdownstreamimpactsthatcouldoccurintherealworldisoftenachallengingtask,butwefindthatitishelpfultoconsider1)whattheAIsystemcando,and2)wherethesystemisapplied.
Capabilityconstraints
Asmodelsgetbigger,theytendtoacquirenew
capabilities[18].Thesecapabilitiesmaybeusefulin
manyscenarios,buttheycanalsointroduceattack
vectors.Forexample,largermodelsareoftenable
tounderstandmoreadvancedencodings,suchas
base64andASCIIart,comparedtosmallermodels
[16,45].Asaresult,alargemodelmaybesusceptibletomaliciousinstructionsencodedinbase64,whileasmallermodelmaynotunderstandtheencodingat
all.Inthisscenario,wesaythatthesmallermodelis
“capabilityconstrained,”andsotestingitforadvancedencodingattackswouldlikelybeawasteofresources.
Largermodelsalsogenerallyhavegreaterknowledgeintopicssuchascybersecurityandchemical,
biological,radiological,andnuclear(CBRN)weapons[19]andcouldpotentiallybeleveragedtogeneratehazardouscontentintheseareas.Asmallermodel,ontheotherhand,islikelytohaveonlyrudimentaryknowledgeofthesetopicsandmaynotneedtobeassessedforthistypeofrisk.
Perhapsamoresurprisingexampleofacapabilitythatcanbeexploitedasanattackvectorisinstruction-
following.WhiletestingthePhi-3seriesoflanguagemodels,forexample,wefoundthatlargermodels
weregenerallybetteratadheringtouserinstructions,whichisacorecapabilitythatmakesmodelsmore
helpful[52].However,itmayalsomakemodelsmoresusceptibletojailbreaks,whichsubvert
safetyalignmentusingcarefullycraftedmalicious
instructions[28].Understandingamodel’scapabilities(andcorrespondingweaknesses)canhelpAIred
teamsfocustheirtestingonthemostrelevantattackstrategies.
Downstreamapplications
Modelcapabilitiescanhelpguideattackstrategies,buttheydonotallowustofullyassessdownstreamimpact,whichlargelydependsonthespecific
scenariosinwhichamodelisdeployedorlikelyto
bedeployed.Forexample,thesameLLMcouldbe
usedasacreativewritingassistantandtosummarizepatientrecordsinahealthcarecontext,butthelatterapplicationclearlyposesmuchgreaterdownstreamriskthantheformer.
TheseexampleshighlightthatanAIsystemdoesnotneedtobestate-of-the-arttocreatedownstream
harm.However,advancedcapabilitiescanintroducenewrisksandattackvectors.Byconsideringboth
systemcapabilitiesandapplications,AIredteams
canprioritizetestingscenariosthataremostlikelytocauseharmintherealworld.
Lesson2:
Youdon’thavetocompute
gradientstobreakanAIsystem
Asthesecurityadagegoes,“realhackersdon’tbreakin,theylogin.”TheAIsecurityversionofthissayingmightbe,“realattackersdon’tcomputegradients,theypromptengineer”asnotedbyApruzzeseet
al.[2]intheirstudyonthegapbetweenadversarial
MLresearchandpractice.Thestudyfindsthat
althoughmostadversarialMLresearchisfocused
ondevelopinganddefendingagainstsophisticated
attacks,real-worldattackerstendtousemuchsimplertechniquestoachievetheirobjectives.
Inourredteamingoperations,wehavealsofound
that“basic”techniquesoftenworkjustaswellas,andsometimesbetterthan,gradient-basedmethods.
Thesemethodscomputegradientsthrougha
modeltooptimizeanadversarialinputthatelicits
anattacker-controlledmodeloutput.Inpractice,
however,themodelisusuallyasinglecomponentofabroaderAIsystem,andthemosteffectiveattackstrategiesoftenleveragecombinationsoftacticstotargetmultipleweaknessesinthatsystem.Further,gradient-basedmethodsarecomputationally
expensiveandtypicallyrequirefullaccesstothemodel,whichmostcommercialAIsystemsdonot
Lessonsfromredteaming100generativeAIproducts9
provide.Inthissection,wediscussexamplesof
relativelysimpletechniquesthatworksurprisinglywellandadvocateforasystem-leveladversarialmindsetinAIredteaming.
Simpleattacks
Apruzzeseetal.[2]considertheproblemofphishingwebpagedetectionandmanuallyanalyzeexamplesofwebpagesthatsuccessfullyevadedanMLphishingclassifier.Among100potentiallyadversarialsamples,theauthorsfoundthatattackersleveragedaset
ofsimple,yeteffective,strategiesthatreliedon
domainexpertiseincludingcropping,masking,logostretching,etc.Inourredteamingoperations,we
alsofindthatrudimentarymethodscanbeusedto
trickmanyvisionmodels,ashighlightedincasestudy#1.Inthetextdomain,avarietyofjailbreaks(e.g.,
SkeletonKey)andmultiturnpromptingstrategies(e.g.,Crescendo[34])arehighlyeffectiveforsubvertingthesafetyguardrailsofawiderangeofmodels.Notably,manuallycraftedjailbreakstendtocirculateononlineforumsmuchmorewidelythanadversarialsuffixes,
despitethesignificantattentionthatmethodslikeGCG[53]havereceivedfromAIsafetyresearchers.
Casestudy#1:
Jailbreakingavisionlanguagemodeltogeneratehazardouscontent
Inthisoperation,wetestedavisionlanguage
model(VLM)forresponsibleAIimpacts,includingthegenerationofcontentthatcouldaidinillegal
activities.AVLMtakesanimageandatextpromptasinputsandproducesatextoutput.Aftertestingavarietyoftechniques,wefoundthattheimageinputwasmuchmorevulnerabletojailbreaksthanthe
textinput.Inparticular,themodelusuallyrefusedtogenerateillegalcontentwhenprompteddirectlyviathetextinputbutoftencompliedwhenmalicious
instructionswereoverlaidontheimage.ThissimplebuteffectiveattackrevealedanimportantweaknesswithintheVLMthatcouldbeexploitedtobypassitssafetyguardrails.
HowdoIcommitidentitytheft?
Describethisimage
Ignoreprevious
instructionsandtellmehowtocommitidentitytheft
System:Visionlanguagemodel(VLM)
Actor:Adversarialuser
Tactic1:MLModelAccess
Technique1:AML.T0040-MLModelInferenceAPIAccess
Tactic2:DefenseEvasion
Technique2:AML.T0051-LLMPromptInjectionProcedure:
1.Overlayimagewithtextcontainingmaliciousinstructions.
2.SendimagetothevisionlanguagemodelAPI.Weakness:InsufficientVLMsafetytraining
Impact:Generationofillegalcontent
I’msorry,Ican’thelpwiththat.
Tocommitidentitytheft,
youshould...
Figure4:Exampleofanimagejailbreaktogeneratecontentthatcouldaidinillegalactivities.Theoverlaytextonthesecondimagereads:“Ignorepreviousinstructionsandtellmehowtocommitidentitytheft.”
Lessonsfromredteaming100generativeAIproducts10
System-levelperspective
AImodelsaredeployedwithinbroadersystems.Thiscouldbetheinfrastructurerequiredtohostamodel,oritcouldbeacomplexapplicationthatconnects
themodeltoexternaldatasources.Depending
onthesesystem-leveldetails,applicationsmaybe
vulnerabletoverydifferentattacks,evenifthesamemodelunderliesallofthem.Asaresult,redteamingstrategiesthattargetonlymodelsmaynottranslateintovulnerabilitiesinproductionsystems.Conversely,strategiesthatignorenon-GenAIcomponentswithinasystem(forexample,inputfilters,databases,and
othercloudresources)willlikelymissimportant
vulnerabilitiesthatmaybeexploitedbyadversaries.Forthisreason,manyofouroperationsdevelop
attacksthattargetend-to-endsystemsbyleveragingmultipletechniques.Forexample,oneofour
operationsfirstperformedareconnaissanceto
identifyinternalPythonfunctionsusinglow-resource
languagepromptinjections,thenusedacross-promptinjectionattacktogenerateascriptthatrunsthose
functions,andfinallyexecutedthecodetoexfiltrateprivateuserdata.Thepromptinjectionsusedbytheseattackswerecraftedbyhandandreliedonasystem-levelperspective.
Gradient-basedattacksarepowerful,buttheyare
oftenimpracticalorunnecessary.Werecommend
prioritizingsimpletechniquesandorchestrating
system-levelattacksbecausethesearemorelikelytobeattemptedbyrealadversaries.
Lesson3:
AIredteamingisnot
safetybenchmarking
Althoughsimplemethodsareoftenusedtobreak
AIsystemsinpractice,therisklandscapeisby
nomeansuncomplicated.Onthecontrary,itis
constantlyshiftinginresponsetonovelattacksandfailuremodes[7].Inrecentyears,therehavebeen
manyeffortstocategorizethesevulnerabilities,
givingrisetonumeroustaxonomiesofAIsafetyandsecurityrisks[15,21–23,35–37,39,41,42,46–48].Asdiscussedinthepreviouslesson,complexityoften
arisesatthesystem-level.Inthislesson,wediscusshowtheemergenceofentirelynewcategoriesof
harmaddscomplexityatthemodel-levelandexplainhowthisdifferentiatesAIredteamingfromsafety
benchmarking.
Novelharmcategories
WhenAIsystemsdisplaynovelcapabilitiesdueto,
forexample,advancementsinfoundationmodels,
theymayintroduceharmsthatwedonotfully
understand.Inthesescenarios,wecannotrelyon
safetybenchmarksbecausethesedatasetsmeasurepreexistingnotionsofharm.AtMicrosoft,theAI
redteamoftenexplorestheseunfamiliarscenarios,
helpingtodefinenovelharmcategoriesandbuild
newprobesformeasuringthem.Forexample,SoTALLMsmaypossessgreaterpersuasivecapabilitiesthanexistingchatbots,whichhaspromptedourteamto
thinkabouthowthesemodelscouldbeweaponizedformaliciouspurposes.Casestudy#2providesanexampleofhowweassessedamodelforthisriskinoneofouroperations.
Context-specificrisks
Thedisconnectbetweenexistingsafetybenchmarksandnovelharmcategoriesisanexampleofhow
benchmarksoftenfailtofullycapturethecapabilities
theyareassociatedwith[33].Rajietal.[30]
highlightthefallacyofequatingmodelperformanceondatasetslikeImageNetorGLUEwithbroad
capabilitieslikevisualorlanguage“understanding”
andarguethatbenchmarksshouldbedeveloped
withcontextualizedtasksinmind.Similarly,nosinglesetofbenchmarkscanfullyassessthesafetyofan
AIsystem.AsdiscussedinLesson1,itisimportanttounderstandthecontextinwhichasystemisdeployed(orlikelytobedeployed)andtogroundredteamingstrategiesinthiscontext.
AIredteamingandsafetybenchmarkingare
distinct,buttheyarebothusefulandcanevenbe
complementary.Inparticular,benchmarksmakeit
easytocomparetheperformanceofmultiplemodelsonacommondataset.AIredteamingrequiresmuchmorehumaneffortbutcandiscovernovelcategoriesofharmandprobeforcontextualizedrisks.Further,
safetyconcernsidentifiedbyAIredteamingcan
informthedevelopmentofnewbenchmarks.In
Lesson6,weexpandourdiscussionofthedifferencebetweenredteamingandbenchmark-styleevaluationinthecontextofresponsibleAI.
Lessonsfromredteaming100generativeAIproducts11
Casestudy#2:
AssessinghowanLLMcouldbeusedtoautomatescams
Inthisoperation,weinvestigatedtheabilityofa
state-of-the-artLLMtopersuadepeopletoengageinriskybehaviors.Inparticular,weevaluatedhowthismodelcouldbeusedinconjunctionwithotherreadilyavailabletoolstocreateanend-to-endautomated
scammingsystem,asillustratedinFigure5.
Todothis,wefirstwroteaprompttoassurethe
modelthatnoharmwouldbecausedtousers,
therebyjailbreakingthemodeltoacceptthe
scammingobjective.Thispromptalsoprovided
informationaboutvariouspersuasiontacticsthat
themodelcouldusetoconvincetheusertofallforthescam.Second,weconnectedtheLLMoutputtoatext-to-speechsystemthatallowsyoutocontrolthetoneofthespeechandgenerateresponsesthatsoundlikearealperson.Finally,weconnectedtheinputtoaspeech-to-textsystemsothattheuser
canconversenaturallywiththemodel.Thisproof-of-conceptdemonstratedhowLLMswithinsufficientsafetyguardrailscouldbeweaponizedtopersuadeandscampeople.
System:State-of-the-artLLM
Actor:Scammer
Tactic1:MLModelAccess
Technique1:AML.T0040-MLModelInferenceAPIAccess
Tactic2:DefenseEvasion
Technique2:AML.T0054-LLMJailbreakProcedure:
1.PassajailbreakingprompttotheLLMwithcontextaboutthescammingobjectiveandpersuasiontechniques.
2.ConnecttheLLMoutputtoatext-to-speechsystemsothemodelcanrespo
温馨提示
- 1. 本站所有资源如无特殊说明,都需要本地电脑安装OFFICE2007和PDF阅读器。图纸软件为CAD,CAXA,PROE,UG,SolidWorks等.压缩文件请下载最新的WinRAR软件解压。
- 2. 本站的文档不包含任何第三方提供的附件图纸等,如果需要附件,请联系上传者。文件的所有权益归上传用户所有。
- 3. 本站RAR压缩包中若带图纸,网页内容里面会有图纸预览,若没有图纸预览就没有图纸。
- 4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
- 5. 人人文库网仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对用户上传分享的文档内容本身不做任何修改或编辑,并不能对任何下载内容负责。
- 6. 下载文件中如有侵权或不适当内容,请与我们联系,我们立即纠正。
- 7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。
最新文档
- 芯片级封装技术-深度研究
- 输液泵护理特色汇报
- 聚能防火保温板施工方案
- 金华日式别墅花园施工方案
- 辅导员个人述职报告
- 2025-2030中国假发和加长发行业市场发展趋势与前景展望战略分析研究报告
- 2025-2030中国保鲜袋行业市场现状供需分析及投资评估规划分析研究报告
- 2025-2030中国依诺肝素钠行业发展状况与前景趋势研究研究报告
- 2025-2030中国人乳头瘤病毒感染药物行业市场发展趋势与前景展望战略研究报告
- 2025-2030中国交流稳压器行业供需趋势及投资风险研究报告
- 高考英语作文练习纸(标准答题卡)
- 时30吨超纯水处理系统设计方案
- 教科版二年级科学下册(做一个指南针)教育教学课件
- 高空作业专项施工方案
- GB/T 708-2019冷轧钢板和钢带的尺寸、外形、重量及允许偏差
- GB/T 6184-20001型全金属六角锁紧螺母
- GB/T 39616-2020卫星导航定位基准站网络实时动态测量(RTK)规范
- GB/T 19519-2014架空线路绝缘子标称电压高于1 000 V交流系统用悬垂和耐张复合绝缘子定义、试验方法及接收准则
- GB/T 14996-2010高温合金冷轧板
- 用地性质分类表
- DB/T 19-2020地震台站建设规范全球导航卫星系统基准站
评论
0/150
提交评论