Deep Learning / AI Lifecycle with Dell EMC and Bitfusion

Bhavesh Patel, Dell EMC Server Advanced Engineering

Abstract

This talk gives an overview of the end-to-end application life cycle of deep
learning in the enterprise, along with numerous use cases, and summarizes
studies done by Bitfusion and Dell on a high-performance, heterogeneous,
elastic rack of Dell EMC PowerEdge C4130s with Nvidia GPUs. The use cases
discussed in detail include the ability to bring on-demand GPU acceleration
beyond the rack and across the enterprise with easily attachable elastic GPUs
for deep learning development, as well as the creation of a cost-effective,
software-defined, high-performance elastic multi-GPU system that combines
multiple Dell EMC C4130 servers at runtime for deep learning
training.

Deep Learning and AI are being adopted across a wide range of market segments

Industry/Function    AI Revolution
ROBOTICS             Computer Vision & Speech, Drones, Droids
ENTERTAINMENT        Interactive Virtual & Mixed Reality
AUTOMOTIVE           Self-Driving Cars, Co-Pilot Advisor
FINANCE              Predictive Price Analysis, Dynamic Decision Support
PHARMA               Drug Discovery, Protein Simulation
HEALTHCARE           Predictive Diagnosis, Wearable Intelligence
ENERGY               Geo-Seismic Resource Discovery
EDUCATION            Adaptive Learning Courses
SALES                Adaptive Product Recommendations
SUPPLY CHAIN         Dynamic Routing Optimization
CUSTOMER SERVICE     Bots and Fully-Automated Service
MAINTENANCE          Dynamic Risk Mitigation and Yield Optimization

...but few people have the time, knowledge, resources to even get
started

PROBLEM 1: HARDWARE INFRASTRUCTURE LIMITATIONS

• Increased cost with dense servers
• TOR bottleneck, limited scalability
• Limited multi-tenancy on GPU servers (limited CPU and memory per user)
• Limited to 8-GPU applications
• Does not support GPU apps with: High storage, CPU, Memory requirements

PROBLEM 2: SOFTWARE COMPLEXITY OVERLOAD

Software Management
• GPU Driver Management
• Framework & Library Installation
• Deep Learning Framework Configuration
• Package Manager
• Jupyter Server or IDE Setup

Data Management
• Data Uploader
• Shared Local File System
• Data Volume Management
• Data Integrations & Pipelining

Model Management
• Code Version Management
• Hyperparameter Optimization
• Experiment Tracking
• Deployment Automation
• Deployment Continuous Integration

Workload Management
• Job Scheduler
• Log Management
• User & Group Management
• Inference Autoscaling

Infrastructure Management
• Cloud or Server Orchestration
• GPU Hardware Setup
• GPU Resource Allocation
• Container Orchestration
• Networking Direct Bypass
• MPI / RDMA / RPI / gRPC
• Monitoring

Need to Simplify
and Scale

SOLUTION 1/2: CONVERGED RACK SOLUTION

Composable compute bundle
• Up to 64 GPUs per application
• GPU applications with varied storage, memory, CPU requirements
• 30-50% less cost per GPU
• > {cores, memory} / GPU
• >> intra-rack networking bandwidth
• Less inter-rack load
• Composable - Add-as-you-go

SOLUTION 2/2: COMPLETE, STREAMLINED AI DEVELOPMENT

Develop on pre-installed, quickstart deep learning containers.
• Get to work quickly with workspaces with optimized pre-configured drivers,
  frameworks, libraries, and notebooks.
• Start with CPUs, and attach Elastic GPUs on-demand.
• All your code and data are saved automatically and sharable with others.

Transition from development to training with multiple GPUs.
• Seamlessly scale out to more GPUs on a shared training cluster to train
  larger models quickly and cost-effectively.
• Support and manage multiple users, teams, and projects.
• Train multiple models in parallel for massive productivity improvements.

Push trained, finalized models into production.
• Deploy a trained neural network into production and perform real-time
  inference across different hardware.
• Manage multiple AI applications and inference endpoints corresponding to
  different trained models.
Dell EMC Deep Learning Optimized servers

[Figure: solution stack. Layers: Vertical Segment Applications, Open Source
Frameworks, Optimized Libraries, Operating System, Processor/Accelerator,
Compute Platform. Hardware shown: C4130, R730, C6320P in C6300, GPU, KNL Phi
in C6320P Sled, NvLink-GPU]

C4130 DEEP LEARNING Server

[Figure: C4130 front and back views. Labels: (optional) redundant power
supplies, dual SSD boot drives, IDRAC NIC, 2x 1Gb NIC, power supplies, GPU
accelerators (4), CPU sockets (under heat sinks), 8 fans]

GPU DEEP LEARNING RACK SOLUTION

Configuration Details

Features        R730                   C4130
CPU             E5-2669 v3 @ 2.1GHz    E5-2630 v3 @ 2.4GHz
Memory          4GB                    1TB/node; 64G DIMM
Storage         Intel PCIe NVME        Intel PCIe NVME
Networking IO   CX3 FDR InfiniBand     CX3 FDR InfiniBand
GPU             NA                     M40-24GB
TOR Switch      Mellanox SX6036 - FDR Switch
Cables          FDR 56G DCA Cables

• Pre-Built App Containers
• GPU and Workspace Management
• Elastic GPUs across the Datacenter
• Software defined Scaled out GPU Servers

End to End Deep Learning Application Life Cycle: 1 Develop, 2 Train, 3 Deploy
[Figure: rack layout. GPU Nodes: C4130 #1, C4130 #2, C4130 #3, C4130 #4.
CPU Nodes: R730 #1, R730 #2. Infiniband Switch]

...but wait, 'converged compute' requires network attached GPUs...

BITFUSION CORE VIRTUALIZATION

GPU Device Virtualization
• Allows dynamic GPU attach on a per-application basis

Features
• APIs: CUDA, OpenCL
• Distribution: scale-out to remote GPUs
• Pooling: Oversubscribe GPUs
• Resource Provisioning: Fractional vGPUs
• High Availability: Automatic DMR
• Manageability: Remote nvidia-smi
• Distributed CUDA Unified Memory
• Native support for IB, GPUDirect RDMA
• Feature complete with CUDA 8.0

PUTTING IT ALL TOGETHER

[Figure: one CLIENT SERVER and three GPU SERVERs. Labels: Bitfusion Flex,
managed containers; Bitfusion Service Daemon; Bitfusion Client Library]

NATIVE VS. REMOTE GPUs

[Figure: native (one node: CPU, GPU 0 and GPU 1 on PCIe) versus remote (two
nodes, each with a CPU, one GPU and an HCA on PCIe)]

Completely transparent: All CUDA Apps see local and remote GPUs as if directly
connected

Results

REMOTE GPUs - LATENCY AND BANDWIDTH

• Data movement overhead is the primary scaling limiter
• Measurements done at application level: cudaMemcpy

[Figure labels: Fast Local GPU copies; PCIe Intranode copies]

16 GPU virtual system: Naive implementation w/ TCP/IP

• Fast local GPU copies
• Intranode copies via PCIe
• Low BW, High Latency remote copies
• OS Bypass needed to avoid primary TCP/IP overheads
• AI apps are very latency sensitive

[Figure: C4130 nodes, node 0 to node 3]

16 GPU virtual system: Bitfusion optimized transport and runtime

• Same FDR x4 transport, but drop IPoIB
• Replace remote calls with native IB verbs
• Runtime selection of intranode RDMA vs. cudaMemcpy
• Multi-rail communications where available
• Runtime optimizations: pipelining, speculative execution, distributed
  caching & event coalescing, …

[Figure annotations: Remote =~ Native Local GPUs; Minimal NUMA effects]

SLICE & DICE - MORE THAN ONE WAY TO GET 4 GPUs

[Charts: Caffe GoogleNet; TensorFlow Pixel-CNN. Legend: R730, C4130]

Native GPU performance with network attached GPUs

Run time comparison (lower is better)
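
As a rough illustration of the two quantities behind these results (the cost of a copy and the run time relative to native), here is a minimal Python sketch. It assumes a first-order model in which one copy costs latency plus size divided by bandwidth. The function names and every number in LINKS are hypothetical placeholders, not measurements from the talk.

```python
def copy_time_s(size_bytes: float, latency_s: float,
                bandwidth_bytes_per_s: float) -> float:
    """First-order estimate of one copy: latency + size / bandwidth."""
    if bandwidth_bytes_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    return latency_s + size_bytes / bandwidth_bytes_per_s


def native_efficiency(native_secs: float, remote_secs: float) -> float:
    """Native run time divided by remote run time (1.0 means parity)."""
    if remote_secs <= 0:
        raise ValueError("remote run time must be positive")
    return native_secs / remote_secs


# Hypothetical (latency in seconds, bandwidth in bytes/s) per path.
# These are illustrative placeholders, NOT numbers from the slides.
LINKS = {
    "local PCIe": (10e-6, 12e9),
    "remote TCP/IP": (100e-6, 1e9),
    "remote IB verbs": (15e-6, 6e9),
}

if __name__ == "__main__":
    # Compare a small (4 KiB) and a large (64 MiB) copy on each path.
    for size in (4 * 1024, 64 * 1024 * 1024):
        for name, (latency, bandwidth) in LINKS.items():
            micros = copy_time_s(size, latency, bandwidth) * 1e6
            print(f"{size:>10} B over {name:<16}: {micros:12.1f} us")
    # Placeholder run times in seconds for a native and a remote 4-GPU node.
    print(f"native efficiency: {native_efficiency(100.0, 105.0):.2f}")
```

In this model small copies are dominated by the latency term and large copies by the bandwidth term, which is why both figures matter when GPUs are attached over the network.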
Multiple ways to create a virtual 4 GPU node, with native efficiency (secs to
train Caffe GoogleNet, batch size: 128)

TRAINING PERFORMANCE

[Charts: Continued Strong Scaling (Caffe GoogleNet); Weak-scaling: Accelerate
Hyperparameter Optimization (Caffe GoogleNet; TensorFlow 1.0 with Pixel-CNN).
Axis ticks: 1, 2, 4, 8, 16. Series: native, remote. Legend: R730, C4130.
Annotations: 74%, 73%, 55%, 53%, 86%; PCIe host bridge limit]

Other PCIe GPU Configurations Available

Currently Testing

Config 'G'

Further reading:
/techcenter/high-performance-computing/b/general_hpc/archive/2016/11/11/deep-learning-performance-with-p100-gpus
http:///techcenter/high-performance-computing/b/general_hpc/archive/2017/03/22/deep-learning-inference-on-p40-gpus

NvLink Configuration

• 4 P100-16GB SXM2 GPU
• 2 CPU
• PCIe switch
• 1 PCIe slot - EDR IB

[Figure: Config 'K'. Labels: SXM2 #1, SXM2 #2, SXM2 #3, SXM2 #4]

NvLink Configuration

• 4 P100-16GB SXM2 GPU
• 2 CPU
• PCIe switch
• 1 PCIe slot - EDR IB
• Memory: 256GB w/ 16GB @ 2133
• OS: Ubuntu 16.04
• CUDA: 8.1

[Figure: Config 'L'. Labels: SXM2 #1, SXM2 #2, SXM2 #3, SXM2 #4, PCIe Switch]

Software Solutions

Overview - Bright ML

Dell EMC has partnered with Bright Computing to offer their Bright ML package
as the software stack on the Dell EMC deep learning hardware solution.

Bright ML Overview

Machine Learning in Seismic Imaging Using KNL + FPGA - Project #1

Bhavesh Patel - Server Advanced Engineering
Robert Dildy - Product Technologist Sr. Consultant, Engineering
Solutions

Abstract

This paper focuses on how to apply machine learning to seismic imaging using an
FPGA as a co-accelerator. It covers two hardware technologies, 1) Intel KNL Phi
and 2) FPGA, and also addresses how to use machine learning for seismic
imaging. There are different types of accelerators, such as GPUs and Intel Phi,
but we choose to study how the i-ABRA platform on KNL + FPGA can be used to
train the neural network on seismic imaging data and then perform inference.
Machine learning in a broader sense can be divided into two parts: training and
inference.

Background

Seismic imaging is a standard data processing technique used to create an image
of subsurface structures of the Earth from measurements recorded at the surface
via seismic wave propagations captured from various sound energy sources.

There are certain challenges with seismic data interpretation; for example, 3D
is starting to replace 2D for seismic interpretation.

There has been rapid growth in the use of computer vision technology, with
several companies developing image recognition platforms. This technology is
being used for automatic photo tagging and classification