生成式AI红队百次测试经验白皮书

上传人：策*** IP属地：山西上传时间：2025-02-11 格式：DOCX 页数：38 大小：1.86MB 积分：19.9 举报 版权申诉

已阅读5页，还剩33页未读，继续免费阅读

版权说明：本文档由用户提供并上传，收益归属内容提供方，若内容存在侵权，请进行举报或认领

文档简介

Lessonsfromredteaming100

generativeAIproducts

Authoredby:

MicrosoftAIRedTeam

Lessonsfromredteaming100generativeAIproducts2

Authors

BlakeBullwinkel,AmandaMinnich,ShivenChawla,GaryLopez,MartinPouliot,WhitneyMaxwell,JorisdeGruyter,KatherinePratt,SaphirQi,NinaChikanov,RomanLutz,RajaSekharRaoDheekonda,Bolor-ErdeneJagdagdorj,

EugeniaKim,JustinSong,KeeganHines,DanielJones,GiorgioSeveri,RichardLundeen,SamVaughan,

VictoriaWesterhoff,PeteBryan,RamShankarSivaKumar,YonatanZunger,ChangKawaguchi,MarkRussinovich

Lessonsfromredteaming100generativeAIproducts3

Tableofcontents

Abstract

Redteaming

operations

Casestudy#1

Jailbreakingavision

languagemodeltogenerate

hazardouscontent

Lesson4

Automationcanhelpcover

moreoftherisklandscape

Introduction

Lesson1

Understandwhatthesystem

candoandwhereitisapplied

Lesson3

AIredteamingisnot

safetybenchmarking

Lesson5

ThehumanelementofAI

redteamingiscrucial

AIthreatmodel

ontology

Lesson2

Youdon’thavetocompute

gradientstobreakanAIsystem

Casestudy#2

AssessinghowanLLMcouldbe

usedtoautomatescams

Casestudy#3

Evaluatinghowachatbot

respondstoauserindistress

Casestudy#4

Probingatext-to-image

generatorforgenderbias

Casestudy#5

SSRFinavideo-processing

GenAIapplication

Lesson6

ResponsibleAIharmsare

pervasivebutdifficulttomeasure

Lesson8

TheworkofsecuringAIsystems

willneverbecomplete

Lesson7

LLMsamplifyexistingsecurity

risksandintroducenewones

Conclusion

Lessonsfromredteaming100generativeAIproducts4

Abstract

Inrecentyears,AIredteaminghasemergedasapracticeforprobingthesafetyandsecurityofgenerativeAI

systems.Duetothenascencyofthefield,therearemanyopenquestionsabouthowredteamingoperationsshouldbeconducted.Basedonourexperienceredteamingover100generativeAIproductsatMicrosoft,wepresentourinternalthreatmodelontologyandeightmainlessonswehavelearned:

1.Understandwhatthesystemcandoandwhereitisapplied

2.Youdon’thavetocomputegradientstobreakanAIsystem

3.AIredteamingisnotsafetybenchmarking

4.Automationcanhelpcovermoreoftherisklandscape

5.ThehumanelementofAIredteamingiscrucial

6.ResponsibleAIharmsarepervasivebutdifficulttomeasure

7.Largelanguagemodels(LLMs)amplifyexistingsecurityrisksandintroducenewones

8.TheworkofsecuringAIsystemswillneverbecomplete

Bysharingtheseinsightsalongsidecasestudiesfromouroperations,weofferpracticalrecommendationsaimedataligningredteamingeffortswithrealworldrisks.WealsohighlightaspectsofAIredteamingthatwebelieveareoftenmisunderstoodanddiscussopenquestionsforthefieldtoconsider.

Lessonsfromredteaming100generativeAIproducts5

Introduction

AsgenerativeAI(GenAI)systemsareadoptedacrossanincreasingnumberofdomains,AIredteaminghasemergedasacentralpracticeforassessingthesafetyandsecurityofthesetechnologies.Atitscore,AIredteamingstrivestopushbeyondmodel-levelsafety

benchmarksbyemulatingreal-worldattacksagainstend-to-endsystems.However,therearemanyopenquestionsabouthowredteamingoperationsshouldbeconductedandahealthydoseofskepticismabouttheefficacyofcurrentAIredteamingefforts[4,8,32].Inthispaper,wespeaktosomeoftheseconcernsbyprovidinginsightintoourexperienceredteaming

over100GenAIproductsatMicrosoft.Thepaper

isorganizedasfollows:First,wepresentthethreat

modelontologythatweusetoguideouroperations.Second,weshareeightmainlessonswehavelearnedandmakepracticalrecommendationsforAIred

teams,alongwithcasestudiesfromouroperations.

Inparticular,thesecasestudieshighlighthowour

ontologyisusedtomodelabroadrangeofsafety

andsecurityrisks.Finally,weclosewithadiscussionofareasforfuturedevelopment.

Background

TheMicrosoftAIRedTeam(AIRT)grewoutofpre-existingredteaminginitiativesatthecompanyandwasofficiallyestablishedin2018.Atitsconception,theteamfocusedprimarilyonidentifyingtraditionalsecurityvulnerabilitiesandevasionattacksagainstclassicalMLmodels.Sincethen,boththescopeandscaleofAIredteamingatMicrosofthaveexpandedsignificantlyinresponsetotwomajortrends.

First,AIsystemshavebecomemoresophisticated,

compellingustoexpandthescopeofAIredteaming.Mostnotably,state-of-the-art(SoTA)modelshave

gainednewcapabilitiesandsteadilyimprovedacrossarangeofperformancebenchmarks,introducing

novelcategoriesofrisk.Newdatamodalities,suchasvisionandaudio,alsocreatemoreattackvectorsforredteamingoperationstoconsider.Inaddition,agenticsystemsgrantthesemodelshigherprivilegesandaccesstoexternaltools,expandingboththe

attacksurfaceandtheimpactofattacks.

Second,Microsoft’srecentinvestmentsinAIhave

spurredthedevelopmentofmanymoreproductsthatrequireredteamingthaneverbefore.Thisincrease

involumeandtheexpandedscopeofAIredteaminghaverenderedfullymanualtestingimpractical,

forcingustoscaleupouroperationswiththehelpofautomation.Toachievethisgoal,wedevelopPyRIT,

anopen-sourcePythonframeworkthatouroperatorsutilizeheavilyinredteamingoperations[27].By

augmentinghumanjudgementandcreativity,PyRIThasenabledAIRTtoidentifyimpactfulvulnerabilitiesmorequicklyandcovermoreoftherisklandscape.ThesetwomajortrendshavemadeAIredteamingamorecomplexendeavorthanitwasin2018.In

thenextsection,weoutlinetheontologywehavedevelopedtomodelAIsystemvulnerabilities.

AIthreatmodel

ontology

Asattacksandfailuremodesincreaseincomplexity,itishelpfultomodeltheirkeycomponents.Basedonourexperienceredteamingover100GenAIproductsforawiderangeofrisks,wedevelopedanontologytodoexactlythat.Figure1illustratesthemain

componentsofourontology:

•System:Theend-to-endmodelorapplicationbeingtested.

•Actor:ThepersonorpersonsbeingemulatedbyAIRT.NotethattheActor’sintentcouldbeadversarial(e.g.,ascammer)orbenign(e.g.,atypicalchatbotuser).

•TTPs:TheTactics,Techniques,andProceduresleveragedbyAIRT.AtypicalattackconsistsofmultipleTacticsandTechniques,whichwemaptoMITREATT&CK®andMITREATLASMatrixwheneverpossible.

–Tactic:High-levelstagesofanattack(e.g.,reconnaissance,MLmodelaccess).

–Technique:Methodsusedtocompleteanobjective(e.g.,activescanning,jailbreak).

–Procedure:ThestepsrequiredtoreproduceanattackusingtheTacticsandTechniques.

•Weakness:ThevulnerabilityorvulnerabilitiesintheSystemthatmaketheattackpossible.

•Impact:Thedownstreamimpactcreatedbytheattack(e.g.,privilegeescalation,generationofharmfulcontent).

Itisimportanttonotethatthisframeworkdoesnotassumeadversarialintent.Inparticular,AIRTemulatesbothadversarialattackersandbenignuserswho

encountersystemfailuresunintentionally.PartofthecomplexityofAIredteamingstemsfromthewiderangeofimpactsthatcouldbecreatedbyanattack

Lessonsfromredteaming100generativeAIproducts6

orsystemfailure.Inthelessonsbelow,weshare

casestudiesdemonstratinghowourontologyis

flexibleenoughtomodeldiverseimpactsintwomaincategories:securityandsafety.

Securityencompasseswell-knownimpactssuch

asdataexfiltration,datamanipulation,credential

dumping,andothersdefinedinMITREATT&CK®,awidelyusedknowledgebaseofsecurityattacks.WealsoconsidersecurityattacksthatspecificallytargettheunderlyingAImodelsuchasmodelevasion,

promptinjections,denialofAIservice,andotherscoveredbytheMITREATLASMatrix.

Safetyimpactsarerelatedtothegenerationofillegalandharmfulcontentsuchashatespeech,violence

andself-harm,andchildabusecontent.AIRTworkscloselywiththeOfficeofResponsibleAItodefinethesecategoriesinaccordancewithMicrosoft’s

ResponsibleAIStandard[25].Werefertothese

impactsasresponsibleAI(RAI)harmsthroughoutthisreport.

Tounderstandthisontologyincontext,consider

thefollowingexample.Imagineweareredteaming

anLLM-basedcopilotthatcansummarizeauser’s

emails.Onepossibleattackagainstthissystemwouldbeforascammertosendanemailthatcontainsa

hiddenpromptinjectioninstructingthecopilotto

“ignorepreviousinstructions”andoutputamaliciouslink.Inthisscenario,theActoristhescammer,who

isconductingacross-promptinjectionattack(XPIA),whichexploitsthefactthatLLMsoftenstruggleto

distinguishbetweensystem-levelinstructionsand

userdata[4].ThedownstreamImpactdependsonthenatureofthemaliciouslinkthatthevictimmightclickon.Inthisexample,itcouldbeexfiltratingdataor

installingmalwareontotheuser’scomputer.

Actor

Conducts

TTPs

●

Leverages

●

Attack

Exploits

Mitigation

●

Mitigatedby

●

Weakness

Occursin

●

System

Creates

Impact

Figure1:MicrosoftAIRTontologyformodelingGenAIsystemvulnerabilities.AIRToftenleveragesmultipleTTPs,whichmayexploitmultipleWeaknessesandcreatemultipleImpacts.Inaddition,morethanoneMitigationmaybenecessarytoaddressaWeakness.NotethatAIRTistaskedonlywithidentifyingrisks,whileproductteamsareresourcedtodevelopappropriatemitigations.

Lessonsfromredteaming100generativeAIproducts7

Redteaming

operations

Inthissection,weprovideanoverviewofthe

operationswehaveconductedsince2021.Intotal,wehaveredteamedover100GenAIproducts.Broadly

speaking,theseproductscanbebucketedinto

“models”and“systems.”Modelsaretypicallyhostedonacloudendpoint,whilesystemsintegratemodelsintocopilots,plugins,andotherAIappsandfeatures.Figure2showsthebreakdownofproductswehave

redteamedsince2021.Figure3showsabarchartwiththeannualpercentageofouroperationsthathave

probedforsafety(RAI)vs.securityvulnerabilities.

In2021,wefocusedprimarilyonapplicationsecurity.Althoughouroperationshaveincreasinglyprobed

forRAIimpacts,ourteamcontinuestoredteamforsecurityimpactsincludingdataexfiltration,credentialleaking,andremotecodeexecution.Organizations

haveadoptedmanydifferentapproachestoAIred

teamingrangingfromsecurity-focusedassessmentswithpenetrationtestingtoevaluationsthattarget

onlyGenAIfeatures.InLessons2and7,weelaborateonsecurityvulnerabilitiesandexplainwhywebelieveitisimportanttoconsiderbothtraditionalandAI-

specificweaknesses.

AfterthereleaseofChatGPTin2022,MicrosoftenteredtheeraofAIcopilots,startingwithAI-poweredBingChat,releasedinFebruary2023.

Thismarkedaparadigmshifttowardsapplications

thatconnectLLMstoothersoftwarecomponents

includingtools,databases,andexternalsources.

Applicationsalsostartedusinglanguagemodelsas

reasoningagentsthatcantakeactionsonbehalfof

users,introducinganewsetofattackvectorsthat

haveexpandedthesecurityrisksurface.InLesson

7,weexplainhowtheseattackvectorsbothamplifyexistingsecurityrisksandintroducenewones.

Inrecentyears,themodelsatthecenterofthese

applicationshavegivenrisetonewinterfaces,

allowinguserstointeractwithappsusingnatural

languageandrespondingwithhigh-qualitytext,

image,video,andaudiocontent.DespitemanyeffortstoalignpowerfulAImodelstohumanpreferences,

manymethodshavebeendevelopedtosubvert

safetyguardrailsandelicitcontentthatisoffensive,unethical,orillegal.Weclassifytheseinstancesof

harmfulcontentgenerationasRAIimpactsandin

Lessons3,5,and6discusshowwethinkabouttheseimpactsandthechallengesinvolved.

Inthenextsection,weelaborateontheeightmain

lessonswehavelearnedfromouroperations.Wealsohighlightfivecasestudiesfromouroperationsand

showhoweachonemapstoourontologyinFigure1.WehopetheselessonsareusefultoothersworkingtoidentifyvulnerabilitiesintheirownGenAIsystems.

80+100+

OpsProducts

Plugins

AppsandFeatures

Copilots

15%

16%

24%

Models

45%

Figure2:PiechartshowingthepercentagebreakdownofAI

productsthatAIRThastested.AsofOctober2024,wehave

conductedover80operationscoveringmorethan100products.

Percentageofopsprobingsafetyvs.security

Safety(RAI)%Security%

100

2021202220232024

Figure3:Barchartshowingthepercentageofoperationsthatprobedsafety(RAI)vs.securityvulnerabilitiesfrom2021–2024.

Lessonsfromredteaming100generativeAIproducts8

Lessons

Lesson1:

Understandwhatthesystem

candoandwhereitisapplied

ThefirststepinanAIredteamingoperationisto

determinewhichvulnerabilitiestotarget.Whilethe

ImpactcomponentoftheAIRTontologyisdepictedattheendofourontology,itservesasanexcellent

startingpointforthisdecision-makingprocess.

Startingfrompotentialdownstreamimpacts,rather

thanattackstrategies,makesitmorelikelythatan

operationwillproduceusefulfindingstiedtoreal

worldrisks.Aftertheseimpactshavebeenidentified,redteamscanworkbackwardsandoutlinethevariouspathsthatanadversarycouldtaketoachievethem.

Anticipatingdownstreamimpactsthatcouldoccurintherealworldisoftenachallengingtask,butwefindthatitishelpfultoconsider1)whattheAIsystemcando,and2)wherethesystemisapplied.

Capabilityconstraints

Asmodelsgetbigger,theytendtoacquirenew

capabilities[18].Thesecapabilitiesmaybeusefulin

manyscenarios,buttheycanalsointroduceattack

vectors.Forexample,largermodelsareoftenable

tounderstandmoreadvancedencodings,suchas

base64andASCIIart,comparedtosmallermodels

[16,45].Asaresult,alargemodelmaybesusceptibletomaliciousinstructionsencodedinbase64,whileasmallermodelmaynotunderstandtheencodingat

all.Inthisscenario,wesaythatthesmallermodelis

“capabilityconstrained,”andsotestingitforadvancedencodingattackswouldlikelybeawasteofresources.

Largermodelsalsogenerallyhavegreaterknowledgeintopicssuchascybersecurityandchemical,

biological,radiological,andnuclear(CBRN)weapons[19]andcouldpotentiallybeleveragedtogeneratehazardouscontentintheseareas.Asmallermodel,ontheotherhand,islikelytohaveonlyrudimentaryknowledgeofthesetopicsandmaynotneedtobeassessedforthistypeofrisk.

Perhapsamoresurprisingexampleofacapabilitythatcanbeexploitedasanattackvectorisinstruction-

following.WhiletestingthePhi-3seriesoflanguagemodels,forexample,wefoundthatlargermodels

weregenerallybetteratadheringtouserinstructions,whichisacorecapabilitythatmakesmodelsmore

helpful[52].However,itmayalsomakemodelsmoresusceptibletojailbreaks,whichsubvert

safetyalignmentusingcarefullycraftedmalicious

instructions[28].Understandingamodel’scapabilities(andcorrespondingweaknesses)canhelpAIred

teamsfocustheirtestingonthemostrelevantattackstrategies.

Downstreamapplications

Modelcapabilitiescanhelpguideattackstrategies,buttheydonotallowustofullyassessdownstreamimpact,whichlargelydependsonthespecific

scenariosinwhichamodelisdeployedorlikelyto

bedeployed.Forexample,thesameLLMcouldbe

usedasacreativewritingassistantandtosummarizepatientrecordsinahealthcarecontext,butthelatterapplicationclearlyposesmuchgreaterdownstreamriskthantheformer.

TheseexampleshighlightthatanAIsystemdoesnotneedtobestate-of-the-arttocreatedownstream

harm.However,advancedcapabilitiescanintroducenewrisksandattackvectors.Byconsideringboth

systemcapabilitiesandapplications,AIredteams

canprioritizetestingscenariosthataremostlikelytocauseharmintherealworld.

Lesson2:

Youdon’thavetocompute

gradientstobreakanAIsystem

Asthesecurityadagegoes,“realhackersdon’tbreakin,theylogin.”TheAIsecurityversionofthissayingmightbe,“realattackersdon’tcomputegradients,theypromptengineer”asnotedbyApruzzeseet

al.[2]intheirstudyonthegapbetweenadversarial

MLresearchandpractice.Thestudyfindsthat

althoughmostadversarialMLresearchisfocused

ondevelopinganddefendingagainstsophisticated

attacks,real-worldattackerstendtousemuchsimplertechniquestoachievetheirobjectives.

Inourredteamingoperations,wehavealsofound

that“basic”techniquesoftenworkjustaswellas,andsometimesbetterthan,gradient-basedmethods.

Thesemethodscomputegradientsthrougha

modeltooptimizeanadversarialinputthatelicits

anattacker-controlledmodeloutput.Inpractice,

however,themodelisusuallyasinglecomponentofabroaderAIsystem,andthemosteffectiveattackstrategiesoftenleveragecombinationsoftacticstotargetmultipleweaknessesinthatsystem.Further,gradient-basedmethodsarecomputationally

expensiveandtypicallyrequirefullaccesstothemodel,whichmostcommercialAIsystemsdonot

Lessonsfromredteaming100generativeAIproducts9

provide.Inthissection,wediscussexamplesof

relativelysimpletechniquesthatworksurprisinglywellandadvocateforasystem-leveladversarialmindsetinAIredteaming.

Simpleattacks

Apruzzeseetal.[2]considertheproblemofphishingwebpagedetectionandmanuallyanalyzeexamplesofwebpagesthatsuccessfullyevadedanMLphishingclassifier.Among100potentiallyadversarialsamples,theauthorsfoundthatattackersleveragedaset

ofsimple,yeteffective,strategiesthatreliedon

domainexpertiseincludingcropping,masking,logostretching,etc.Inourredteamingoperations,we

alsofindthatrudimentarymethodscanbeusedto

trickmanyvisionmodels,ashighlightedincasestudy#1.Inthetextdomain,avarietyofjailbreaks(e.g.,

SkeletonKey)andmultiturnpromptingstrategies(e.g.,Crescendo[34])arehighlyeffectiveforsubvertingthesafetyguardrailsofawiderangeofmodels.Notably,manuallycraftedjailbreakstendtocirculateononlineforumsmuchmorewidelythanadversarialsuffixes,

despitethesignificantattentionthatmethodslikeGCG[53]havereceivedfromAIsafetyresearchers.

Casestudy#1:

Jailbreakingavisionlanguagemodeltogeneratehazardouscontent

Inthisoperation,wetestedavisionlanguage

model(VLM)forresponsibleAIimpacts,includingthegenerationofcontentthatcouldaidinillegal

activities.AVLMtakesanimageandatextpromptasinputsandproducesatextoutput.Aftertestingavarietyoftechniques,wefoundthattheimageinputwasmuchmorevulnerabletojailbreaksthanthe

textinput.Inparticular,themodelusuallyrefusedtogenerateillegalcontentwhenprompteddirectlyviathetextinputbutoftencompliedwhenmalicious

instructionswereoverlaidontheimage.ThissimplebuteffectiveattackrevealedanimportantweaknesswithintheVLMthatcouldbeexploitedtobypassitssafetyguardrails.

HowdoIcommitidentitytheft?

Describethisimage

Ignoreprevious

instructionsandtellmehowtocommitidentitytheft

System:Visionlanguagemodel(VLM)

Actor:Adversarialuser

Tactic1:MLModelAccess

Technique1:AML.T0040-MLModelInferenceAPIAccess

Tactic2:DefenseEvasion

Technique2:AML.T0051-LLMPromptInjectionProcedure:

1.Overlayimagewithtextcontainingmaliciousinstructions.

2.SendimagetothevisionlanguagemodelAPI.Weakness:InsufficientVLMsafetytraining

Impact:Generationofillegalcontent

I’msorry,Ican’thelpwiththat.

Tocommitidentitytheft,

youshould...

Figure4:Exampleofanimagejailbreaktogeneratecontentthatcouldaidinillegalactivities.Theoverlaytextonthesecondimagereads:“Ignorepreviousinstructionsandtellmehowtocommitidentitytheft.”

Lessonsfromredteaming100generativeAIproducts10

System-levelperspective

AImodelsaredeployedwithinbroadersystems.Thiscouldbetheinfrastructurerequiredtohostamodel,oritcouldbeacomplexapplicationthatconnects

themodeltoexternaldatasources.Depending

onthesesystem-leveldetails,applicationsmaybe

vulnerabletoverydifferentattacks,evenifthesamemodelunderliesallofthem.Asaresult,redteamingstrategiesthattargetonlymodelsmaynottranslateintovulnerabilitiesinproductionsystems.Conversely,strategiesthatignorenon-GenAIcomponentswithinasystem(forexample,inputfilters,databases,and

othercloudresources)willlikelymissimportant

vulnerabilitiesthatmaybeexploitedbyadversaries.Forthisreason,manyofouroperationsdevelop

attacksthattargetend-to-endsystemsbyleveragingmultipletechniques.Forexample,oneofour

operationsfirstperformedareconnaissanceto

identifyinternalPythonfunctionsusinglow-resource

languagepromptinjections,thenusedacross-promptinjectionattacktogenerateascriptthatrunsthose

functions,andfinallyexecutedthecodetoexfiltrateprivateuserdata.Thepromptinjectionsusedbytheseattackswerecraftedbyhandandreliedonasystem-levelperspective.

Gradient-basedattacksarepowerful,buttheyare

oftenimpracticalorunnecessary.Werecommend

prioritizingsimpletechniquesandorchestrating

system-levelattacksbecausethesearemorelikelytobeattemptedbyrealadversaries.

Lesson3:

AIredteamingisnot

safetybenchmarking

Althoughsimplemethodsareoftenusedtobreak

AIsystemsinpractice,therisklandscapeisby

nomeansuncomplicated.Onthecontrary,itis

constantlyshiftinginresponsetonovelattacksandfailuremodes[7].Inrecentyears,therehavebeen

manyeffortstocategorizethesevulnerabilities,

givingrisetonumeroustaxonomiesofAIsafetyandsecurityrisks[15,21–23,35–37,39,41,42,46–48].Asdiscussedinthepreviouslesson,complexityoften

arisesatthesystem-level.Inthislesson,wediscusshowtheemergenceofentirelynewcategoriesof

harmaddscomplexityatthemodel-levelandexplainhowthisdifferentiatesAIredteamingfromsafety

benchmarking.

Novelharmcategories

WhenAIsystemsdisplaynovelcapabilitiesdueto,

forexample,advancementsinfoundationmodels,

theymayintroduceharmsthatwedonotfully

understand.Inthesescenarios,wecannotrelyon

safetybenchmarksbecausethesedatasetsmeasurepreexistingnotionsofharm.AtMicrosoft,theAI

redteamoftenexplorestheseunfamiliarscenarios,

helpingtodefinenovelharmcategoriesandbuild

newprobesformeasuringthem.Forexample,SoTALLMsmaypossessgreaterpersuasivecapabilitiesthanexistingchatbots,whichhaspromptedourteamto

thinkabouthowthesemodelscouldbeweaponizedformaliciouspurposes.Casestudy#2providesanexampleofhowweassessedamodelforthisriskinoneofouroperations.

Context-specificrisks

Thedisconnectbetweenexistingsafetybenchmarksandnovelharmcategoriesisanexampleofhow

benchmarksoftenfailtofullycapturethecapabilities

theyareassociatedwith[33].Rajietal.[30]

highlightthefallacyofequatingmodelperformanceondatasetslikeImageNetorGLUEwithbroad

capabilitieslikevisualorlanguage“understanding”

andarguethatbenchmarksshouldbedeveloped

withcontextualizedtasksinmind.Similarly,nosinglesetofbenchmarkscanfullyassessthesafetyofan

AIsystem.AsdiscussedinLesson1,itisimportanttounderstandthecontextinwhichasystemisdeployed(orlikelytobedeployed)andtogroundredteamingstrategiesinthiscontext.

AIredteamingandsafetybenchmarkingare

distinct,buttheyarebothusefulandcanevenbe

complementary.Inparticular,benchmarksmakeit

easytocomparetheperformanceofmultiplemodelsonacommondataset.AIredteamingrequiresmuchmorehumaneffortbutcandiscovernovelcategoriesofharmandprobeforcontextualizedrisks.Further,

safetyconcernsidentifiedbyAIredteamingcan

informthedevelopmentofnewbenchmarks.In

Lesson6,weexpandourdiscussionofthedifferencebetweenredteamingandbenchmark-styleevaluationinthecontextofresponsibleAI.

Lessonsfromredteaming100generativeAIproducts11

Casestudy#2:

AssessinghowanLLMcouldbeusedtoautomatescams

Inthisoperation,weinvestigatedtheabilityofa

state-of-the-artLLMtopersuadepeopletoengageinriskybehaviors.Inparticular,weevaluatedhowthismodelcouldbeusedinconjunctionwithotherreadilyavailabletoolstocreateanend-to-endautomated

scammingsystem,asillustratedinFigure5.

Todothis,wefirstwroteaprompttoassurethe

modelthatnoharmwouldbecausedtousers,

therebyjailbreakingthemodeltoacceptthe

scammingobjective.Thispromptalsoprovided

informationaboutvariouspersuasiontacticsthat

themodelcouldusetoconvincetheusertofallforthescam.Second,weconnectedtheLLMoutputtoatext-to-speechsystemthatallowsyoutocontrolthetoneofthespeechandgenerateresponsesthatsoundlikearealperson.Finally,weconnectedtheinputtoaspeech-to-textsystemsothattheuser

canconversenaturallywiththemodel.Thisproof-of-conceptdemonstratedhowLLMswithinsufficientsafetyguardrailscouldbeweaponizedtopersuadeandscampeople.

System:State-of-the-artLLM

Actor:Scammer

Tactic1:MLModelAccess

Technique1:AML.T0040-MLModelInferenceAPIAccess

Tactic2:DefenseEvasion

Technique2:AML.T0054-LLMJailbreakProcedure:

1.PassajailbreakingprompttotheLLMwithcontextaboutthescammingobjectiveandpersuasiontechniques.

2.ConnecttheLLMoutputtoatext-to-speechsystemsothemodelcanrespo

人人文库> 全部分类> 应用文书 > 研究报告

温馨提示

1. 本站所有资源如无特殊说明，都需要本地电脑安装OFFICE2007和PDF阅读器。图纸软件为CAD,CAXA,PROE,UG,SolidWorks等.压缩文件请下载最新的WinRAR软件解压。
2. 本站的文档不包含任何第三方提供的附件图纸等，如果需要附件，请联系上传者。文件的所有权益归上传用户所有。
3. 本站RAR压缩包中若带图纸，网页内容里面会有图纸预览，若没有图纸预览就没有图纸。
4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
5. 人人文库网仅提供信息存储空间，仅对用户上传内容的表现方式做保护处理，对用户上传分享的文档内容本身不做任何修改或编辑，并不能对任何下载内容负责。
6. 下载文件中如有侵权或不适当内容，请与我们联系，我们立即纠正。
7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。

生成式AI红队百次测试经验白皮书

文档简介

温馨提示

最新文档

评论

生成式AI红队百次测试经验白皮书

文档简介

温馨提示

最新文档

评论

相关文档