GPU-Based Retrieval-Augmented Generation (RAG): NVIDIA LLM Day
NVIDIA Senior Solutions Engineer
January 9, 2023

Driving the Future of Enterprise Work

- AI assistants will drive increased productivity for every job function; intelligent chatbots are the next killer enterprise application.
- Human work will shift from doing manual look-ups and information gathering to directing teams of LLMs and pulling together the results.
- Enterprises will have 100s to 1000s of these AI assistants across every job function.
- IT spend is increasing to adopt these new copilot features because they drive increased productivity, product differentiation, and improved experience.
- These chatbots will have intelligence as well as access to proprietary information.

LLMs Are Powerful Tools but Not Accurate Enough for Enterprise

Without a connection to enterprise data sources, LLMs cannot provide accurate information (Prompt -> Foundation Model -> Response):
- Lacking proprietary knowledge
- Risk of outdated information
- Hallucinations

Agenda

- Retrieval-augmented generation introduction
- Key techniques in RAG
- Solutions from NVIDIA
- AI copilot demo: RAG copilot

What Is Retrieval-Augmented Generation (RAG)?

RAG is to LLMs what an open-book exam is to humans.

- Patrick Lewis et al., "Retrieval-Augmented Generation for Knowledge-Intensive NLP": a general-purpose fine-tuning recipe that combines pre-trained parametric and non-parametric memory for generation.
- A technique for enhancing the accuracy and reliability of generative AI models with facts, in three steps: (1) Retrieve, (2) Augment, (3) Generate.
- This approach constructs a comprehensive prompt enriched with context, historical data, and recent or relevant knowledge.

References:
- Generative AI Knowledge Base Chatbot | NVIDIA
- Retrieval-Augmented Generation (RAG): From Theory to LangChain Implementation
- Lewis, P., et al. (2020). Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems, 33, 9459-9474.

Next Generation of Enterprise Applications: Connect LLMs to Enterprise Data

Retrieval-Augmented Generation Improves LLM Performance and Efficiency

- Improved accuracy: models can answer questions about information without having been trained on that data.
- Natural language interface: human-readable output texts are easier for people to understand, raising user trust.
- Contextual understanding: AI models better understand context when generating text or other outputs.
- Reduced computational costs: lower costs from retraining and from model size at inference.
- Improved efficiency: models can produce diverse outputs without sacrificing accuracy or efficiency.

Key Techniques in Retrieval-Augmented Generation (RAG)

- Non-parametric memory (knowledge source): documents loader, embedding model, vector database, database search.
- Pre-trained parametric memory (LLM): foundation LLM, LLM deployment.

Ingestion: encoding the knowledge base (offline)
Documents -> Chunking -> Document chunks -> Embedding Model -> Document embeddings -> Vector Database (knowledge base)

Retrieval: retrieval from the vector database based on the user's query
User -> Query -> Embedding Model -> Query embedding -> Vector Database -> Top-K relevant chunks

(Reference: Build Enterprise Retrieval-Augmented Generation Apps with NVIDIA Retrieval QA Embedding Model)

Steps to prepare the database [2][3][4][5]:
1. Load documents of different types: PDF, HTML, C++, Python, etc.
2. Split documents into chunks.
3. Convert the text chunks into vectors via an embedding model.
4. Store document texts, vectors, and metadata in a vector database.

Step 1: Load the documents (tip: LangChain document loaders)
- Know what your text types are (structured data, semi-structured data).
- Clean the data.

Step 2: Split documents into chunks
- LLMs have a limited "window" of text input length, so documents are split into smaller pieces of text.
- Fixed-size chunking, or variable-size chunking (some marker is used to split the text).
- Overlap between chunks.
- The trick here is to find a size that works for you.
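As a concrete illustration of step 2, fixed-size chunking with overlap can be sketched in a few lines of plain Python. The chunk size and overlap values below are illustrative assumptions, not recommendations from the slides:

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    """Fixed-size chunking with overlap between consecutive chunks."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # avoid emitting a tiny trailing fragment
    return chunks

doc = "word " * 100  # stand-in for a loaded document (500 characters)
chunks = chunk_text(doc, chunk_size=120, overlap=20)
print(len(chunks), len(chunks[0]))  # 5 120
```

Each chunk repeats the last `overlap` characters of its predecessor, which preserves context across chunk boundaries; in practice a library splitter (e.g. LangChain's text splitters) would be used instead.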
Step 3: Convert the text chunks into vectors via an embedding model [3][4][5]
- A general encoder-architecture model converts text, images, etc. into multi-dimensional vectors.
- The model is trained to embed similar inputs close together.

Step 4: Store document texts, vectors, and metadata in a vector database [2]
- Options include Chroma, FAISS, Milvus, Redis, and Weaviate.
- [6] Optimizing RAG: A Guide to Choosing the Right Vector Database

Steps when searching the database [5]:
1. Convert the input query to a vector via the same embedding model.
2. Similarity search in the vector database (precision / recall trade-off).

Embeddings and the Vector Database: Searching via Semantic Similarity

[Figure: 2D representation of a 768-dimension embedding space, with clusters for scientific computing, high-performance computing, specialized medical topics, and speech AI.]

- Embeddings are data (text, images, or other data) represented as numerical vectors: input text goes into the embedding model, a vector comes out.
- Part of semantic search; the model is trained to embed similar inputs close together.
- Other use cases: classification, clustering, topic discovery.
- Many pretrained and trainable embedding models are available; modern ones are often deep neural networks.

Example:
- Query: "Who will lead the construction team?"
- Chunk 1: "The construction team found lead in the paint."
- Chunk 2: "Ozzy has been picked to lead the group."
Chunk 1 shares more keywords with the query, but semantic search can differentiate the meanings of "lead" and understand that "team" and "group" are similar, so Chunk 2 may be more helpful for the query.

Techniques for improving similarity search [5]:
- Query routing: LlamaIndex RouterQueryEngine; LangChain MultiIndex and Router
- Query transformations [8][9]: rephrasing, HyDE [10], sub-queries
- Sentence-window retrieval [11]
- Auto-merge retrieval [11]
- Different index types: hybrid with keyword and embedding
- Re-ranker [7]
- Metadata filtering
- Prompt compression

Query routing: directing user queries to the appropriate index (LlamaIndex RouterQueryEngine; LangChain MultiIndex and Router).
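The "lead"/"team" example above can be made concrete with a toy similarity search. The 3-dimensional vectors below are hand-made stand-ins for a real embedding model's output (which would typically be hundreds of dimensions), chosen only to illustrate cosine-similarity ranking:

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy "embeddings": axis 0 ~ metal/paint sense of "lead", axis 1 ~ leadership sense.
chunks = {
    "The construction team found lead in the paint.": [0.9, 0.1, 0.1],
    "Ozzy has been picked to lead the group.":        [0.1, 0.9, 0.2],
}
query_vec = [0.2, 0.8, 0.3]  # "Who will lead the construction team?" (leadership sense)

top_k = sorted(chunks, key=lambda c: cosine(query_vec, chunks[c]), reverse=True)
print(top_k[0])  # Ozzy has been picked to lead the group.
```

Despite sharing fewer keywords with the query, the second chunk ranks first because its vector points in a similar direction, which is exactly the behavior semantic search relies on.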
Query transformations [8][9]: change the input query to improve retrieved contexts.
- Rephrasing
- HyDE [10]
- Sub-queries / question decomposition

Sentence-window retrieval [11]: use a small chunk size, then retrieve the window of sentences before and after the retrieved one.

Auto-merge retrieval [11]: small chunks organized in a tree-like structure; merge smaller chunks into a larger context for the LLM.

Different index types: hybrid with keyword and embedding.
- Keyword-based index for queries relating to a specific product.
- Embeddings for general customer support.

Re-ranker [7]: use a re-ranker model to re-rank retrieved documents, addressing the discrepancy between similarity and relevance.

Metadata filtering: add metadata to your chunks and use metadata filtering to help process results.

Prompt compression [17]: compress irrelevant context, highlight pivotal paragraphs, and reduce the overall context length.

Summary of retrieval techniques:

Method | Mechanism
Query routing | Directs user queries to the appropriate index.
Query transformations | Changes the input query to improve retrieved contexts.
Sentence-window retrieval | Small chunk size; retrieves the window of sentences before and after the retrieved one.
Auto-merge retrieval | Small chunk size, organized in a tree-like structure; merges smaller chunks into a larger context for the LLM.
Different index types | Hybrid keyword and embedding: keyword-based index for product-specific queries, embeddings for general customer support.
Re-ranker | Uses a re-ranker model to re-rank retrieved documents; addresses the discrepancy between similarity and relevance.
Metadata filtering | Adds metadata to chunks; uses metadata filtering to help process results.
Prompt compression | Uses a small language model to compress irrelevant context, highlight pivotal paragraphs, and reduce the overall context length.

LLM Generation
- Foundation models
- Prompt engineering
- Customized LLMs
- Easy to deploy
- Low latency, high throughput

Evaluating the RAG Pipeline

RAGAS: an evaluation framework for your retrieval-augmented generation (RAG) pipelines. The ragas score covers retrieval and generation:
- Context precision (retrieval): the signal-to-noise ratio of the retrieved context.
- Context recall (retrieval): can it retrieve all the relevant information required to answer the question?
- Faithfulness (generation): how factually accurate is the generated answer?
- Answer relevance (generation): how relevant is the generated answer to the question?

TruLens: evaluation and tracking for LLM experiments. The RAG Triad:
- Answer relevance: is the answer relevant to the query?
- Context relevance: is the retrieved context relevant to the query?
- Groundedness: is the response supported by the context?

NVIDIA Solutions for RAG
- Retrieval: RAPIDS RAFT to accelerate vector database search
- Foundation models
- Model deployment
- Reference samples

Best Commercial-Grade Embedding Model for LLMs

Part of NeMo Retriever, the NVIDIA Retrieval QA Embedding model performs 20 points better than commercial offerings out of the box:
- Higher accuracy results
- Reduced fine-tuning requirements
- Lower occurrence of hallucinations

[Figure: Recall@5 benchmark, 300-token chunk size, averaged across representative customer datasets from telco, IT, consulting, and energy, comparing Lexical Search (BM-25), E5 Unsupervised (best non-commercial), and the NVIDIA Retrieval QA Embedding model; reported values include 56%, 63%, 76%, and 79%.]

(Try it: Experience the NVIDIA Retrieval QA Embedding Model)

Vector Databases Are Becoming Essential

Embeddings -> Indexing -> Vector Database -> Querying -> Retrieving -> Apps

Announcing new partners leveraging RAFT, including Redis.

RAFT Turbocharges Vector Search
- Vector search engines allow users to query massive datasets of embeddings for approximate matches.
- Vector search typically uses Nearest Neighbor (NN) or Approximate Nearest Neighbor (ANN) methods.
- The RAFT library offers very fast NN and ANN primitives on GPU.
- It accelerates indexing, loading, and retrieving a batch of neighbors for a single query.
- Use cases: large language models, RecSys, computer vision.

RAPIDS RAFT: GPU-Accelerated Vector Search for Large Language Models
- Brute-force search
- Algorithms for ANN search: IVF-Flat, IVF-PQ, CAGRA

Materials: RAFT documentation; Accelerating Vector Search: Using GPU-Powered Indexes with RAPIDS RAFT; Accelerating Vector Search: Fine-Tuning GPU Index Algorithms; Accelerated Vector Search: Approximating with RAPIDS RAFT IVF-Flat; RAPIDS/RAFT on GitHub.

NVIDIA Foundation Models

A suite of generative foundation language models built for enterprise hyper-personalization; explore Foundation Models in NGC:
- Fastest responses: Nemotron-3 8B (GPT-8B with 3.5T tokens, plus SFT and SteerLM; 53 languages; I/O: 4K tokens)
- For complex tasks: Nemotron-2 43B (GPT-43B with 1.1T tokens, plus SFT private mix; 50 languages; I/O: 4K tokens)
- Balance of accuracy and latency: Nemotron-2 22B (GPT-22B with 1.1T tokens, plus SFT private mix; 50 languages; I/O: 4K tokens)

Enterprise-Grade Foundation Models with NVIDIA Nemotron-3 8B

Designed for production-ready generative AI that can be customized and deployed at scale:
- Enterprise-ready foundation models: trained on responsibly sourced data, with high accuracy, optimized for smooth enterprise integration.
- One model for all major languages: trained on 53 languages and 37 coding languages; Nemotron-3 offers the best openly available multilingual LLM.
- Advanced and flexible customization: a base for customization, including PEFT and continuous pre-training for domain-adapted LLMs.
  - Chat-SFT is a building block for instruction-tuning custom models or user-defined alignment.
  - Chat-RLHF for the best out-of-the-box chat model performance.
  - Chat-SteerLM for the best out-of-the-box chat model with flexible alignment at inference time.
- Question & answer LLMs customized on knowledge bases (SFT, RLHF, SteerLM).

(Reference: NVIDIA AI Foundation Models: Build Custom Enterprise Chatbots and Co-Pilots with Production-Ready LLMs)

TensorRT-LLM: SoTA Performance for Large Language Models in Production Deployments

Challenges: LLM performance is crucial for real-time, cost-effective production deployments, and rapid evolution in the LLM ecosystem, with new models and techniques released regularly, requires a performant, flexible solution for optimizing models.

TensorRT-LLM is an open-source library for optimizing inference performance of the latest large language models on NVIDIA GPUs. It is built on FasterTransformer and TensorRT, with a simple Python API for defining, optimizing, and executing LLMs for inference in production.
- SoTA performance: leverages TensorRT compilation and kernels from FasterTransformer, CUTLASS, OpenAI Triton, and more.
- Ease of extension: add new operators or models in Python to quickly support new LLMs with optimized performance.
- LLM batching with Triton: maximizes throughput and GPU utilization through new scheduling techniques for LLMs.

    # define a new activation
    def silu(input: Tensor) -> Tensor:
        return input * sigmoid(input)

    # implement models like in DL frameworks
    class LlamaModel(Module):
        def __init__(…):
            self.layers = ModuleList([…])

        def forward(…):
            hidden = self.embedding(…)
            for layer in self.layers:
                hidden_states = layer(hidden)

TensorRT-LLM key features (the examples implement the following; in this release, some features are not enabled for all the models listed in the examples folder):
- Multi-head attention (MHA), multi-query attention (MQA), group-query attention (GQA)
- In-flight batching
- Paged KV cache for attention
- Tensor parallelism and pipeline parallelism
- INT4/INT8 weight-only quantization (W4A16 and W8A16), SmoothQuant, GPTQ, AWQ, FP8
- Greedy search and beam search
- RoPE

Supported models: Baichuan, BART, BERT, BLIP-2, BLOOM, ChatGLM, FairSeq NMT, Falcon, Flan-T5, GPT, GPT-J, GPT-Nemo, GPT-NeoX, InternLM, LLaMA, LLaMA-v2, mBART, Mistral, MPT, mT5, OPT, Qwen, Replit Code, SantaCoder, StarCoder, T5, Whisper.

NVIDIA Triton backend for TensorRT-LLM: available now. Key resources for TensorRT-LLM:
- TensorRT-LLM GitHub: source for TensorRT-LLM and the Triton backend
- Getting-started blog: learn to optimize and deploy TensorRT-LLM with Triton Server
- TensorRT-LLM documentation: API docs, architecture overviews, and performance data

Retrieval-Augmented Generation (RAG) with Guardrails

RAG is to LLMs what an open-book exam is to humans; guardrails can be applied around each step: (1) Retrieve, (2) Augment, (3) Generate.

NeMo Guardrails: open-source software for developing safe and trustworthy LLM-powered chatbots (open source on GitHub: /NVIDIA/NeMo-Guardrails). It sits between the enterprise application and LLMs, third-party apps, and LLM app toolkits; it is integrated into the NVIDIA NeMo framework and is part of the NVIDIA AI Enterprise software suite.

Evaluating the RAG Pipeline with RAGAS

Three components are needed for evaluating the performance of a RAG pipeline:
1. Data for testing.
2. Automated metrics to measure the performance of both context retrieval and response generation.
3. Human-like evaluation of the generated response from the end-to-end pipeline.

RAG Pipeline Samples in NVIDIA Generative AI Examples

Generative AI reference workflows optimized for accelerated infrastructure and microservice architecture, including an out-of-the-box RAG sample:
- Linux developer RAG: LangChain + LlamaIndex; LLM: Llama2-13B; embedding model: e5-large-v2; deployment: TensorRT-LLM and Triton; DB: Milvus.
- Windows developer RAG: LangChain + LlamaIndex; LLM: Llama2-13B; embedding model: all-MiniLM-L6-v2; deployment: TensorRT-LLM; DB: FAISS.

Case Study

Example: RAG copilots. A question-answering chatbot and interactive code generation with a VS Code extension.

Example: ChipNeMo. Custom tokenizers | domain-adaptive continued pretraining | supervised fine-tuning (SFT) with domain-specific instructions | domain-adapted retrieval models.
- Engineering assistant chatbot
- EDA script generation
- Bug summarization
References: ChipNeMo: Domain-Adapted LLMs for Chip Design; Silicon Volley: Designers Tap Generative AI for a Chip Assist.

Retrieval-Augmented Generation and Fine-Tuning

- RAG: short-term memory; a pipeline of retrieval plus generation.
- Fine-tuning: long-term memory; modifies the base model; teaches the model how to follow user-specified instructions; replicates specific structures, styles, or formats.
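The retrieve-augment-generate loop described throughout the deck can be sketched end to end. Everything here is a minimal stand-in: the tiny in-memory index, the hand-made 2-d query vector, the prompt template, and the placeholder `generate` function, which a real system would replace with a deployed LLM (for example behind TensorRT-LLM and Triton):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# Toy index of (chunk text, embedding); a real one would live in a vector database.
index = [
    ("RAFT accelerates ANN search on GPU.", [0.9, 0.1]),
    ("Chunk overlap preserves context at boundaries.", [0.1, 0.9]),
    ("TensorRT-LLM optimizes LLM inference.", [0.8, 0.3]),
]

def retrieve(query_vec, k=2):
    # (1) Retrieve: top-k chunks by cosine similarity to the query embedding.
    scored = sorted(index, key=lambda it: cosine(query_vec, it[1]), reverse=True)
    return [text for text, _ in scored[:k]]

def augment(query, contexts):
    # (2) Augment: build a prompt enriched with the retrieved context.
    ctx = "\n".join(f"- {c}" for c in contexts)
    return (f"Answer using only the context below.\n"
            f"Context:\n{ctx}\nQuestion: {query}\nAnswer:")

def generate(prompt):
    # (3) Generate: placeholder for a deployed LLM call.
    return "(model output)"

q_vec = [0.85, 0.2]  # pretend embedding of the user's question
prompt = augment("What does RAFT do?", retrieve(q_vec))
print(generate(prompt))
```

The three functions map one-to-one onto the deck's (1) Retrieve, (2) Augment, (3) Generate steps; guardrails, query transformations, and re-ranking would each slot in between these stages.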
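For intuition about the retrieval-side RAGAS metrics, here is a toy computation of context precision and context recall against a known set of relevant chunks. These simplified set-based formulas are an assumption for illustration only; RAGAS itself scores these metrics with LLM-based judgments rather than exact membership tests:

```python
def context_precision(retrieved, relevant):
    # Fraction of retrieved chunks that are actually relevant ("signal-to-noise").
    if not retrieved:
        return 0.0
    return sum(c in relevant for c in retrieved) / len(retrieved)

def context_recall(retrieved, relevant):
    # Fraction of all relevant chunks that the retriever managed to return.
    if not relevant:
        return 0.0
    return sum(c in retrieved for c in relevant) / len(relevant)

retrieved = ["c1", "c2", "c3", "c4"]   # what the vector search returned
relevant = {"c1", "c3", "c5"}          # ground-truth chunks needed for the answer
print(context_precision(retrieved, relevant))  # 0.5
print(context_recall(retrieved, relevant))
```

A high precision with low recall suggests the retriever is clean but misses needed evidence (here chunk "c5"); the generation-side metrics (faithfulness, answer relevance) require judging model output and have no such simple closed form.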
