Deep Learning / AI Lifecycle with Dell EMC and Bitfusion

Bhavesh Patel
Dell EMC Server Advanced Engineering

Abstract

This talk gives an overview of the end-to-end application life cycle of deep learning in the enterprise, along with numerous use cases, and summarizes studies done by Bitfusion and Dell on a high-performance, heterogeneous, elastic rack of Dell EMC PowerEdge C4130s with Nvidia GPUs. The use cases discussed in detail include bringing on-demand GPU acceleration beyond the rack and across the enterprise with easily attachable elastic GPUs for deep learning development, and creating a cost-effective, software-defined, high-performance elastic multi-GPU system that combines multiple Dell EMC C4130 servers at runtime for deep learning training.

Deep Learning and AI are being adopted across a wide range of market segments

Industry/Function    AI Revolution
ROBOTICS             Computer Vision & Speech, Drones, Droids
ENTERTAINMENT        Interactive Virtual & Mixed Reality
AUTOMOTIVE           Self-Driving Cars, Co-Pilot Advisor
FINANCE              Predictive Price Analysis, Dynamic Decision Support
PHARMA               Drug Discovery, Protein Simulation
HEALTHCARE           Predictive Diagnosis, Wearable Intelligence
ENERGY               Geo-Seismic Resource Discovery
EDUCATION            Adaptive Learning Courses
SALES                Adaptive Product Recommendations
SUPPLY CHAIN         Dynamic Routing Optimization
CUSTOMER SERVICE     Bots And Fully-Automated Service
MAINTENANCE          Dynamic Risk Mitigation And Yield Optimization

...but few people have the time, knowledge, resources to even get started

PROBLEM 1: HARDWARE INFRASTRUCTURE LIMITATIONS
• Increased cost with dense servers
• TOR bottleneck, limited scalability
• Limited multi-tenancy on GPU servers (limited CPU and memory per user)
• Limited to 8-GPU applications
• Does not support GPU apps with high storage, CPU, memory requirements

PROBLEM 2: SOFTWARE COMPLEXITY OVERLOAD

Software Management
• GPU Driver Management
• Framework & Library Installation
• Deep Learning Framework Configuration
• Package Manager
• Jupyter Server or IDE Setup

Data Management
• Data Uploader
• Shared Local File System
• Data Volume Management
• Data Integrations & Pipelining

Model Management
• Code Version Management
• Hyperparameter Optimization
• Experiment Tracking
• Deployment Automation
• Deployment Continuous Integration

Workload Management
• Job Scheduler
• Log Management
• User & Group Management
• Inference Autoscaling

Infrastructure Management
• Cloud or Server Orchestration
• GPU Hardware Setup
• GPU Resource Allocation
• Container Orchestration
• Networking Direct Bypass
• MPI / RDMA / RPI / gRPC
• Monitoring

Need to Simplify and Scale

SOLUTION 1/2: CONVERGED RACK SOLUTION

Composable compute bundle
• Up to 64 GPUs per application
• GPU applications with varied storage, memory, CPU requirements
• 30-50% less cost per GPU
• > {cores, memory} / GPU
• >> intra-rack networking bandwidth
• Less inter-rack load
• Composable - Add-as-you-go

SOLUTION 2/2: COMPLETE, STREAMLINED AI DEVELOPMENT

Develop on pre-installed, quickstart deep learning containers.
• Get to work quickly with workspaces with optimized pre-configured drivers, frameworks, libraries, and notebooks.
• Start with CPUs, and attach Elastic GPUs on-demand.
• All your code and data are saved automatically and sharable with others.

Transition from development to training with multiple GPUs.
• Seamlessly scale out to more GPUs on a shared training cluster to train larger models quickly and cost-effectively.
• Support and manage multiple users, teams, and projects.
• Train multiple models in parallel for massive productivity improvements.

Push trained, finalized models into production.
• Deploy a trained neural network into production and perform real-time inference across different hardware.
• Manage multiple AI applications and inference endpoints corresponding to different trained models.

Dell EMC Deep Learning Optimized servers

[Figure: solution stack. Layer labels: Vertical Segment Applications, Open Source Frameworks, Optimized Libraries, Operating System, Processor/Accelerator, Compute Platform. Hardware labels: C4130, R730, C6320P in C6300, GPU, KNL Phi in C6320P Sled, NvLink-GPU]

C4130 DEEP LEARNING Server

[Figure: C4130 chassis. Labels: Front (optional), Redundant Power Supplies, Dual SSD boot drives, Back, IDRAC NIC, 2x 1Gb NIC, Front, Power Supplies, GPU accelerators (4), CPU sockets (under heatsinks), 8 fans]

GPU DEEP LEARNING RACK SOLUTION

Configuration Details

Features        R730                     C4130
CPU             E5-2669 v3 @ 2.1 GHz     E5-2630 v3 @ 2.4 GHz
Memory          4GB                      1TB/node; 64G DIMM
Storage         Intel PCIe NVMe          Intel PCIe NVMe
Networking IO   CX3 FDR InfiniBand       CX3 FDR InfiniBand
GPU             NA                       M40-24GB
TOR Switch      Mellanox SX6036 FDR Switch
Cables          FDR 56G DCA Cables

Software layers on the R730 and C4130 nodes
• Pre-Built App Containers
• GPU and Workspace Management
• Elastic GPUs across the Datacenter
• Software defined Scaled out GPU Servers

End to End Deep Learning Application Life Cycle
1 Develop
2 Train
3 Deploy

[Figure: rack layout. GPU Nodes: C4130 #1, C4130 #2, C4130 #3, C4130 #4. CPU Nodes: R730 #1, R730 #2. Infiniband Switch]

…but wait, 'converged compute' requires network attached GPUs...

BITFUSION CORE VIRTUALIZATION

GPU Device Virtualization
• Allows dynamic GPU attach on a per-application basis

Features
• APIs: CUDA, OpenCL
• Distribution: scale-out to remote GPUs
• Pooling: Oversubscribe GPUs
• Resource Provisioning: Fractional vGPUs
• High Availability: Automatic DMR
• Manageability: Remote nvidia-smi
• Distributed CUDA Unified Memory
• Native support for IB, GPUDirect RDMA
• Feature complete with CUDA 8.0
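The pooling and fractional vGPU features listed above can be pictured with a toy allocator. The following is only a minimal sketch under stated assumptions: `GpuPool`, `attach`, `detach`, the first-fit policy, and the GPU and application names are all hypothetical and are not Bitfusion's API or its actual scheduling logic.

```python
from dataclasses import dataclass, field


@dataclass
class GpuPool:
    """Toy model of a shared GPU pool (hypothetical, not Bitfusion's API)."""
    capacity: dict                               # gpu id -> free fraction (1.0 = whole GPU)
    leases: dict = field(default_factory=dict)   # app name -> [(gpu id, fraction), ...]

    def attach(self, app, fraction):
        """Give `app` a fraction of the first GPU with enough room; None if the pool is full."""
        if not 0 < fraction <= 1:
            raise ValueError("fraction must be in (0, 1]")
        for gpu, free in self.capacity.items():
            if free + 1e-9 >= fraction:          # small tolerance for float rounding
                self.capacity[gpu] = free - fraction
                self.leases.setdefault(app, []).append((gpu, fraction))
                return gpu
        return None

    def detach(self, app):
        """Return everything held by `app` to the pool."""
        for gpu, fraction in self.leases.pop(app, []):
            self.capacity[gpu] += fraction


# Two GPUs on two (hypothetical) nodes shared by several applications.
pool = GpuPool({"node0:gpu0": 1.0, "node1:gpu0": 1.0})
a = pool.attach("notebook-a", 0.5)    # half of the first GPU
b = pool.attach("notebook-b", 0.5)    # the other half of the same GPU
c = pool.attach("training-c", 1.0)    # a whole GPU on the second node
d = pool.attach("late-d", 0.25)       # pool exhausted, so no GPU is attached
pool.detach("notebook-a")             # frees half of the first GPU
e = pool.attach("late-d", 0.25)       # now succeeds
print(a, b, c, d, e)
```

The point of the sketch is the per-application attach and release cycle: an application holds GPU capacity only while it needs it, so two development notebooks can share one device that a dense, statically assigned server would dedicate to a single user.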

PUTTING IT ALL TOGETHER

[Figure: one CLIENT SERVER and three GPU SERVERs. Component labels: Bitfusion Flex, managed containers; Bitfusion Service Daemon; Bitfusion Client Library]

NATIVE VS. REMOTE GPUs

[Figure: native case: CPU, GPU 0, GPU 1 on PCIe. Remote case: CPU, GPU 0, HCA on PCIe in one node; CPU, HCA, GPU 1 on PCIe in the other]

Completely transparent: All CUDA Apps see local and remote GPUs as if directly connected

Results

REMOTE GPUs - LATENCY AND BANDWIDTH
• Data movement overhead is the primary scaling limiter
• Measurements done at application level: cudaMemcpy

[Figure annotations: Fast Local GPU copies; PCIe Intranode copies]

16 GPU virtual system: Naive implementation w/ TCP/IP (C4130)
• Fast local GPU copies
• Intranode copies via PCIe
• Low BW, High Latency remote copies
• OS Bypass needed to avoid primary TCP/IP overheads
• AI apps are very latency sensitive

[Figure axis: node 0, node 1, node 2, node 3]

16 GPU virtual system: Bitfusion optimized transport and runtime
• Same FDR x4 transport, but drop IPoIB
• Replace remote calls with native IB verbs
• Runtime selection of intranode RDMA vs. cudaMemcpy
• Multi-rail communications where available
• Runtime optimizations: pipelining, speculative execution, distributed caching & event coalescing, …

[Figure annotations: Remote =~ Native Local GPUs; Minimal NUMA effects]
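Why latency matters so much for remote copies can be sketched with the common latency-plus-bandwidth cost model, where the time of one copy is a fixed per-copy latency plus size divided by bandwidth. This is an illustration only: the model is a standard simplification rather than the method used for the measurements above, and the two link parameter sets below are hypothetical values, not results from the slides.

```python
def transfer_time(nbytes, latency_s, bandwidth_bytes_per_s):
    """Seconds to move nbytes: fixed per-copy latency plus size over bandwidth."""
    return latency_s + nbytes / bandwidth_bytes_per_s


def effective_bandwidth(nbytes, latency_s, bandwidth_bytes_per_s):
    """Bytes per second actually achieved by a single copy of nbytes."""
    return nbytes / transfer_time(nbytes, latency_s, bandwidth_bytes_per_s)


# Hypothetical link parameters: (latency in seconds, peak bandwidth in bytes/s).
LINKS = {
    "high-latency link": (50e-6, 1e9),
    "low-latency link": (2e-6, 5e9),
}

for name, (latency, peak) in LINKS.items():
    for size in (4 * 1024, 1024 * 1024, 64 * 1024 * 1024):
        fraction = effective_bandwidth(size, latency, peak) / peak
        print(f"{name}: {size:>9} bytes reaches {fraction:.0%} of peak bandwidth")
```

Under this model small copies are dominated by the latency term, so a transport that cuts per-copy latency helps most exactly where many small transfers are issued, while large copies approach peak bandwidth on either link.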

SLICE & DICE - MORE THAN ONE WAY TO GET 4 GPUs
• Native GPU performance with network attached GPUs
• Run time comparison (lower is better) →
• Multiple ways to create a virtual 4 GPU node, with native efficiency

[Figure: run time charts for Caffe GoogleNet and TensorFlow Pixel-CNN on R730 and C4130 configurations (secs to train Caffe GoogleNet, batch size: 128)]

TRAINING PERFORMANCE
• Continued Strong Scaling: Caffe GoogleNet
• Weak-scaling: Accelerate Hyper parameter Optimization: Caffe GoogleNet, TensorFlow 1.0 with Pixel-CNN

[Figure: scaling charts. GPU counts: 1, 2, 4, 8, 16. Series: native, remote. Percentage labels as printed: 74%, 73%, 55%, 53%, 86%. Annotation: PCIe host bridge limit. Node labels: R730, C4130]
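The slide does not say how its percentages were computed. As a reference point, the textbook definitions of strong-scaling and weak-scaling efficiency can be sketched as below; the function names and the timings in the example are hypothetical and are not taken from the charts.

```python
def strong_scaling_efficiency(t1, n, tn):
    """Fixed total problem size: speedup (t1 / tn) divided by the number of GPUs n."""
    if n < 1 or t1 <= 0 or tn <= 0:
        raise ValueError("need n >= 1 and positive run times")
    return (t1 / tn) / n


def weak_scaling_efficiency(rate1, n, rate_n):
    """Work grows with n: aggregate throughput on n GPUs over n times the 1-GPU rate."""
    if n < 1 or rate1 <= 0 or rate_n <= 0:
        raise ValueError("need n >= 1 and positive throughputs")
    return rate_n / (n * rate1)


# Hypothetical run times in seconds for one training job on 1, 2, 4, 8 and 16 GPUs.
run_times = {1: 1600.0, 2: 840.0, 4: 450.0, 8: 250.0, 16: 160.0}
for gpus, seconds in run_times.items():
    eff = strong_scaling_efficiency(run_times[1], gpus, seconds)
    print(f"{gpus:>2} GPUs: {eff:.0%} strong-scaling efficiency")
```

Strong scaling asks how much faster one fixed job finishes with more GPUs, while weak scaling fits workloads such as hyperparameter search, where each added GPU takes on an additional independent model and aggregate throughput is what counts.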

Other PCIe GPU Configurations Available

Currently Testing: Config 'G'

Further reading:
/techcenter/high-performance-computing/b/general_hpc/archive/2016/11/11/deep-learning-performance-with-p100-gpus
http:///techcenter/high-performance-computing/b/general_hpc/archive/2017/03/22/deep-learning-inference-on-p40-gpus

NvLink Configuration: Config 'K'
• 4 P100-16GB SXM2 GPU
• 2 CPU
• PCIe switch
• 1 PCIe slot EDR IB

[Figure: SXM2 #1, SXM2 #2, SXM2 #3, SXM2 #4]

NvLink Configuration: Config 'L'
• 4 P100-16GB SXM2 GPU
• 2 CPU
• PCIe switch
• 1 PCIe slot EDR IB
• Memory: 256GB w/16GB @ 2133
• OS: Ubuntu 16.04
• CUDA: 8.1

[Figure: SXM2 #1, SXM2 #2, SXM2 #3, SXM2 #4, PCIe Switch]

Software Solutions

Overview Bright ML

Dell EMC has partnered with Bright Computing to offer their Bright ML package as the software stack on the Dell EMC deep learning hardware solution.

Bright ML Overview

Machine Learning in Seismic Imaging Using KNL + FPGA - Project #1

Bhavesh Patel, Server Advanced Engineering
Robert Dildy - Product Technologist Sr. Consultant, Engineering Solutions

Abstract

This paper focuses on how to apply machine learning to seismic imaging with the use of an FPGA as a co-accelerator. It covers two hardware technologies, 1) Intel KNL Phi and 2) FPGA, and also addresses how to use machine learning for seismic imaging. There are different types of accelerators, such as GPUs and Intel Phi, but we are choosing to study how we can use the i-ABRA platform on KNL + FPGA to train the neural network using seismic imaging data and then do the inference. Machine learning in a broader sense can be divided into two parts, namely training and inference.

Background

Seismic imaging is a standard data processing technique used in creating an image of subsurface structures of the Earth from measurements recorded at the surface via seismic wave propagations captured from various sound energy sources. There are certain challenges with seismic data interpretation, such as 3D starting to replace 2D for seismic interpretation. There has been rapid growth in the use of computer vision technology, with several companies developing image recognition platforms. This technology is being used for automatic photo tagging and classificatio
