High-Performance Ethernet Network Testing for AI East-West Traffic (2024)

[No. ODCC-2024-05004]
Open Data Center Committee (ODCC), published September 2024

Copyright notice
Anyone who reproduces, excerpts, or otherwise uses text or viewpoints from ODCC deliverables shall indicate ...

Authoring team ...

High throughput: AI training tasks typically involve large volumes of data transfer, so the network needs to have ...
Low latency: during AI model training, latency directly affects training speed and model ...
Low jitter: jitter is the variation in packet transmission time; for AI training tasks ...
High availability: AI training tasks usually need to run continuously for long periods; any network failure ...

This document aims to describe in detail the key networking technologies in AI clusters ... technologies and methods, readers can learn how to build and optimize a high-performance AI communication network to meet ...

... traditional IP routing techniques such as (ECMP) have proven unable to cope effectively, because in AI ...

The switch selects the queues that match the packet's traffic class for load evaluation.

... A common cause of this type of congestion is many-to-one congestion, also called incast congestion, which is characterized by ...

... The scenarios targeted by the congestion control algorithm mechanisms differ greatly from AI application scenarios, and their response latency and granularity cannot meet the needs of high-traffic AI scenarios ...

... the accuracy and repeatability of the results. The testbed uses a non-blocking fat-tree network ...

4 Leaf switches: each Leaf switch connects multiple GPU servers, ...
Server-to-Leaf connections: each server connects through 8 high-speed links ...

III. Baseline Performance Tests

One-to-one mapping: ensure that all GPUs and NICs are mapped according to the host topology ...
Traffic forwarding scenarios: ensure that both same-Leaf and cross-Leaf traffic ...
Data analysis: compare the same-Leaf and cross-Leaf test results, ...

To comprehensively evaluate and optimize the network performance of the AI cluster, the following NCCL collective communication test plan is defined. The plan aims, through multiple NCCL ...

Configure all servers in the cluster as a single NCCL All...

Configure all servers in the cluster as a single NCCL ReduceScatter workload and measure the cluster's average bandwidth during the reduce-scatter operation. The ReduceScatter operation reduces the data from all nodes and ... the result ...

... nodes are all distributed under different Leaf switches.

(2) AllReduce Performance Isolation Test

... optimal deployment scenario, typical deployment scenario, and dispersed deployment scenario ...

Collect and analyze the target workload's NCCL AlltoAll and NCCL AllReduce ...

Run an AI model on 4 nodes under 2 Leaf switches to construct the AI workload ... common mixed-workload scenarios, in order to evaluate the impact of network noise on the AI workload. Specifically, the AI workload includes complex deep learning model training tasks that require frequent collective communication operations (AllReduce, Al... tasks' consumption of network bandwidth, creating interference. In this case, the test records the iteration time of AI model training and the bandwidth utilization of network communication to evaluate the specific impact of network noise on the AI workload, keeping the test environment as close as possible to real usage scenarios, ...

V. Congestion Scenario Tests

(1) NCCL AllReduce Test in the Many-to-One Scenario

In this test, two groups of load are constructed to evaluate NCCL ... in the many-to-one scenario ... load scenario. The second group of load is constructed on the remaining two GPU servers: on No. 16 ... sends data to 1 GPU on ... server, forming the background load, so that ... and record its initial bandwidth performance data. Next, start the ib_write_bw tool on the two background-load servers to create a many-to-one congestion scenario, and continuously monitor the All-Reduce workload ...

NCCL AllReduce average bandwidth (GB/s).

In this test, two groups of load are constructed to evaluate NCCL ... in the many-to-one scenario ... group of load is constructed on the remaining two GPU servers: the 8 GPUs on GPU server No. 16 use the ib_write_bw tool to ... 1 GPU on GPU server No. 1 ... we can observe the AlltoAll workload under congestion conditions ... record its initial bandwidth performance data. Next, start the ib_write_bw tool on the two background-load servers to create a many-to-one congestion scenario, and continuously ...

NCCL AlltoAll average bandwidth (GB/s).

This document has described in detail the test methods and results for high-performance Ethernet networks carrying AI east-west traffic. With the rapid development of large-scale AI models, traditional network architectures can no longer meet ... and congestion control technologies, we can effectively optimize network performance and ensure that AI training tasks ... uses a non-blocking fat-tree architecture, configured with high-performance switches and GPU servers. Through baseline ...

Appendix A: Performance Test Reference Data

AllReduce

The host-side IP address allocation plan is as follows:

Host     eth1        eth2         eth3         eth4         eth5          eth6          eth7          eth8
host-01  172.0.0.2   172.32.0.2   172.64.0.2   172.96.0.2   172.128.0.2   172.160.0.2   172.192.0.2   172.224.0.2
host-02  172.0.0.4   172.32.0.4   172.64.0.4   172.96.0.4   172.128.0.4   172.160.0.4   172.192.0.4   172.224.0.4
host-03  172.0.0.6   172.32.0.6   172.64.0.6   172.96.0.6   172.128.0.6   172.160.0.6   172.192.0.6   172.224.0.6
host-04  172.0.0.8   172.32.0.8   172.64.0.8   172.96.0.8   172.128.0.8   172.160.0.8   172.192.0.8   172.224.0.8
host-05  172.0.0.10  172.32.0.10  172.64.0.10  172.96.0.10  172.128.0.10  172.160.0.10  172.192.0.10  172.224.0.10
host-06  172.0.0.12  172.32.0.12  172.64.0.12  172.96.0.12  172.128.0.12  172.160.0.12  172.192.0.12  172.224.0.12
host-07  172.0.0.14  172.32.0.14  172.64.0.14  172.96.0.14  172.128.0.14  172.160.0.14  172.192.0.14  172.224.0.14
host-08  172.0.0.16  172.32.0.16  172.64.0.16  172.96.0.16  172.128.0.16  172.160.0.16  172.192.0.16  172.224.0.16
host-09  172.0.0.18  172.32.0.18  172.64.0.18  172.96.0.18  172.128.0.18  172.160.0.18  172.192.0.18  172.224.0.18
host-10  172.0.0.20  172.32.0.20  172.64.0.20  172.96.0.20  172.128.0.20  172.160.0.20  172.192.0.20  172.224.0.20
host-11  172.0.0.22  172.32.0.22  172.64.0.22  172.96.0.22  172.128.0.22  172.160.0.22  172.192.0.22  172.224.0.22
host-12  172.0.0.24  172.32.0.24  172.64.0.24  172.96.0.24  172.128.0.24  172.160.0.24  172.192.0.24  172.224.0.24
host-13  172.0.0.26  172.32.0.26  172.64.0.26  172.96.0.26  172.128.0.26  172.160.0.26  172.192.0.26  172.224.0.26
host-14  172.0.0.28  172.32.0.28  172.64.0.28  172.96.0.28  172.128.0.28  172.160.0.28  172.192.0.28  172.224.0.28
host-15  172.0.0.30  172.32.0.30  172.64.0.30  172.96.0.30  172.128.0.30  172.160.0.30  172.192.0.30  172.224.0.30
host-16  172.0.0.32  172.32.0.32  172.64.0.32  172.96.0.32  172.128.0.32  172.160.0.32  172.192.0.32  172.224.0.32

NCCL_IB_ADAPTIVE_ROUTING=1
NCCL_IB_QPS_PER_CONNECTION=4
NCCL_TESTS_SPLIT_MASK=0x7

--run_infinitely -q 1 --report_gbits --connection=RC

Message size range: minbytes = (as low as possible, 1 KB) / maxbytes = ...

    host-01:8,host-02:8,host-03:8,host-04:8,host-05:8,host-06:8,host-07:8,host-08:8,host-09:8,host-10:8,host-11:8,host-12:8,host-13:8,hos5:8,host-16:8
    UCX_IB_TRAFFIC_CLASS=96 -x NCCL_DEBUG=warn -x NCCL_P2P_DISABLE=0 -x NCCL_SHM_DISABLE=0
    -x UCX_TLS=rc,sm -x CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 -x UCX_NET_DEVICES=mlx5_0:1
    -x NCCL_IB_HCA=mlx5_0,mlx5_1,mlx5_2,_8,mlx5_9 -x NCCL_COLLNET_ENABLE=0 UCX_IB_GID_INDEX=3
    -x NCCL_IB_GID_INDEX=3 -x NCCL_IB_TC=96 -x NCCL_BUFFSIZE=16777216
    -x NCCL_IB_ADAPTIVE_ROUTING=1 NCCL_SOCKET_IFNAME=enp226s0 -x MELLANOX_VISIBLE_DEVICES=
    --gmcabtltcp,self ./nccl-tests/build/all_reduce_perf --minbytes8G

    host-01:8,host-02:8,host-03:8,host-04:8,host-05:8,host-06:8,host-07:8,host-08:8,host-09:8,host-10:8,host-11:8,host-12:8,host-13:8,hos5:8,host-16:8
    UCX_IB_TRAFFIC_CLASS=96 -x NCCL_DEBUG=warn -x NCCL_P2P_DISABLE=0 -x NCCL_SHM_DISABLE=0
    -x UCX_TLS=rc,sm -x CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 -x UCX_NET_DEVICES=mlx5_0:1
    -x NCCL_IB_HCA=mlx5_0,mlx5_1,mlx5_2,_8,mlx5_9 -x NCCL_COLLNET_ENABLE=0 UCX_IB_GID_INDEX=3
    -x NCCL_IB_GID_INDEX=3 -x NCCL_IB_TC=96 -x NCCL_BUFFSIZE=16777216
    -x NCCL_IB_ADAPTIVE_ROUTING=1 NCCL_SOCKET_IFNAME=enp226s0 -x
    --gmcabtltcp,self ./nccl-tests/build/alltoall_perf --minbytes8G

    host-01:8,host-02:8,host-03:8,host-04:8,host-05:8,host-06:8,host-07:8,host-08:8,host-09:8,host-10:8,host-11:8,host-12:8,host-13:8,hos5:8,host-16:8
    UCX_IB_TRAFFIC_CLASS=96 -x NCCL_DEBUG=warn -x NCCL_P2P_DISABLE=0 -x NCCL_SHM_DISABLE=0
    -x UCX_TLS=rc,sm -x CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 -x UCX_NET_DEVICES=mlx5_0:1
    -x NCCL_IB_HCA=mlx5_0,mlx5_1,mlx5_2,_8,mlx5_9 NCCL_COLLNET_ENABLE=0
    -x NCCL_IB_GID_INDEX=3 NCCL_IB_TC=96 -x NCCL_BUFFSIZE=16777216
    -x NCCL_IB_ADAPTIVE_ROUTING=1 NCCL_SOCKET_IFNAME=enp226s0 -x
    --gmcabtltcp,self ./nccl-tests/build/reduce_scatter_perf --minbytes8G

    host-01:8,host-02:8,host-03:8,host-04:8,host-05:8,host-06:8,host-07:8,host-08:8,host-09:8,host-10:8,host-11:8,host-12:8,host-13:8,hos5:8,host-16:8
    UCX_IB_TRAFFIC_CLASS=96 -x NCCL_DEBUG=warn -x NCCL_P2P_DISABLE=0 -x NCCL_SHM_DISABLE=0
    -x UCX_TLS=rc,sm -x CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 -x UCX_NET_DEVICES=mlx5_0:1
    -x NCCL_IB_HCA=mlx5_0,mlx5_1,mlx5_2,_8,mlx5_9 -x NCCL_COLLNET_ENABLE=0 UCX_IB_GID_INDEX=3
    -x NCCL_IB_GID_INDEX=3 -x NCCL_IB_TC=96 -x NCCL_BUFFSIZE=16777216
    -x NCCL_IB_ADAPTIVE_ROUTING=1 NCCL_SOCKET_IFNAME=enp226s0 -x
    --gmcabtltcp,self ./nccl-tests/build/all_gather_perf --minbytes8G

Of the 4 workload groups, 1 is the observed workload and 3 are background noise; increasing the noise ...

    -H host-01:8,host-05:8,host-09:8,host-13
    UCX_IB_TRAFFIC_CLASS=96 -x NCCL_DEBUG=warn -x NCCL_P2P_DISABLE=0 -x NCCL_SHM_DISABLE=0
    -x UCX_TLS=rc,sm -x CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 -x UCX_NET_DEVICES=mlx5_0:1
    -x NCCL_IB_HCA=mlx5_0,mlx5_1,mlx5_2,_8,mlx5_9 NCCL_COLLNET_ENABLE=0
    -x NCCL_IB_GID_INDEX=3 NCCL_IB_TC=96 -x NCCL_BUFFSIZE=16777216
    -x NCCL_IB_ADAPTIVE_ROUTING=1 NCCL_SOCKET_IFNAME=enp226s0 -x MELLANOX_VISIBLE_DEVICES=
    --gmcabtltcp,self /nccl-tests/build/all_reduce_perf --minbytes4G

    -H host-02:8,host-06:8,host-10:8,host-14
    UCX_IB_TRAFFIC_CLASS=96 -x NCCL_DEBUG=warn -x NCCL_P2P_DISABLE=0 -x NCCL_SHM_DISABLE=0
    -x UCX_TLS=rc,sm -x CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 -x UCX_NET_DEVICES=mlx5_0:1
    -x NCCL_IB_HCA=mlx5_0,mlx5_1,mlx5_2,_8,mlx5_9 -x NCCL_COLLNET_ENABLE=0 UCX_IB_GID_INDEX=3
    -x NCCL_IB_GID_INDEX=3 -x NCCL_IB_TC=96 -x NCCL_BUFFSIZE=16777216
    -x NCCL_IB_ADAPTIVE_ROUTING=1 NCCL_SOCKET_IFNAME=enp226s0 -x MELLANOX_VISIBLE_DEVICES=
    --gmcabtltcp,self /nccl-tests/build/all_reduce_perf --minbytes4G

    -H host-03:8,host-07:8,hos
    UCX_IB_TRAFFIC_CLASS=96 -x NCCL_DEBUG=warn -x NCCL_P2P_DISABLE=0 -x NCCL_SHM_DISABLE=0
    -x UCX_TLS=rc,sm -x CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 -x UCX_NET_DEVICES=mlx5_0:1
    -x NCCL_IB_HCA=mlx5_0,mlx5_1,mlx5_2,_8,mlx5_9 -x NCCL_COLLNET_ENABLE=0 UCX_IB_GID_INDEX=3
    -x NCCL_IB_GID_INDEX=3 -x NCCL_IB_TC=96 -x NCCL_BUFFSIZE=16777216
    -x NCCL_IB_ADAPTIVE_ROUTING=1 NCCL_SOCKET_IFNAME=enp226s0 -x MELLANOX_VISIBLE_DEVICES=
    --gmcabtltcp,self /nccl-tests/build/all_reduce_perf --minbytes4G

    -H host-04:8,host-08:8,host-12:8,host-16:8
    UCX_IB_TRAFFIC_CLASS -x NCCL_DEBUG=warn NCCL_SHM_DISABLE=0 UCX_TLS=rc,sm
    -x CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 -x UCX_NET_DEVICES=mlx5_0:1
    -x NCCL_IB_HCA=mlx5_0,mlx5_1,mlx5_2,_8,mlx5_9 -x NCCL_COLLNET_ENABLE=0 UCX_IB_GID_INDEX=3
    -x NCCL_IB_GID_INDEX=3 -x NCCL_IB_TC=96 -x NCCL_BUFFSIZE=16777216
    -x NCCL_IB_ADAPTIVE_ROUTING=1 NCCL_SOCKET_IFNAME=enp226s0 -x MELLANOX_VISIBLE_DEVICES=
    --gmcabtltcp,self /nccl-tests/build/all_reduce_perf --minbytes4G

    -H host-01:8,host-05:8,host-09:8,host-13
    UCX_IB_TRAFFIC_CLASS=96 -x NCCL_DEBUG=warn -x NCCL_P2P_DISABLE=0 -x NCCL_SHM_DISABLE=0
    -x UCX_TLS=rc,sm -x CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 -x UCX_NET_DEVICES=mlx5_0:1
    -x NCCL_IB_HCA=mlx5_0,mlx5_1,mlx5_2,_8,mlx5_9 -x NCCL_COLLNET_ENABLE=0 UCX_IB_GID_INDEX=3
    -x NCCL_IB_GID_INDEX=3 -x NCCL_IB_TC=96 -x NCCL_BUFFSIZE=16777216
    -x NCCL_IB_ADAPTIVE_ROUTING=1 NCCL_SOCKET_IFNAME=enp226s0 -x
    --gmcabtltcp,self ./nccl-tests/build/alltoall_perf --minbytes4G

    -H host-02:8,host-06:8,host-10:8,host-14
    UCX_IB_TRAFFIC_CLASS=96 -x NCCL_DEBUG=warn -x NCCL_P2P_DISABLE=0 -x NCCL_SHM_DISABLE=0
    -x UCX_TLS=rc,sm -x CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 -x UCX_NET_DEVICES=mlx5_0:1
    -x NCCL_IB_HCA=mlx5_0,mlx5_1,mlx5_2,_8,mlx5_9 -x NCCL_COLLNET_ENABLE=0 UCX_IB_GID_INDEX=3
    -x NCCL_IB_GID_INDEX=3 -x NCCL_IB_TC=96 -x NCCL_BUFFSIZE=16777216
    -x NCCL_IB_ADAPTIVE_ROUTING=1 NCCL_SOCKET_IFNAME=enp226s0 -x
    --gmcabtltcp,self ./nccl-tests/build/alltoall_perf --minbytes4G

    -H host-03:8,host-07:8,hos
    UCX_IB_TRAFFIC_CLASS=96 -x NCCL_DEBUG=warn -x NCCL_P2P_DISABLE=0 -x NCCL_SHM_DISABLE=0
    -x UCX_TLS=rc,sm -x CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 -x UCX_NET_DEVICES=mlx5_0:1
    -x NCCL_IB_HCA=mlx5_0,mlx5_1,mlx5_2,_8,mlx5_9 -x NCCL_COLLNET_ENABLE=0 UCX_IB_GID_INDEX=3
    -x NCCL_IB_GID_INDEX=3 -x NCCL_IB_TC=96 -x NCCL_BUFFSIZE=16777216
    -x NCCL_IB_ADAPTIVE_ROUTING=1 NCCL_SOCKET_IFNAME=enp226s0 -x MELLANOX_VISIBLE_DEVICES=
    --gmcabtltcp,self ./nccl-tests/build/alltoall_perf --minbytes4G

    -H host-04:8,host-08:8,host-12:8,host-16
    UCX_IB_TRAFFIC_CLASS=96 -x NCCL_DEBUG=warn -x NCCL_P2P_DISABLE=0 -x NCCL_SHM_DISABLE=0
    -x UCX_TLS=rc,sm -x CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 -x UCX_NET_DEVICES=mlx5_0:1
    -x NCCL_IB_HCA=mlx5_0,mlx5_1,mlx5_2,_8,mlx5_9 -x NCCL_COLLNET_ENABLE=0 UCX_IB_GID_INDEX=3
    -x NCCL_IB_GID_INDEX=3 -x NCCL_IB_TC=96 -x NCCL_BUFFSIZE=16777216
    -x NCCL_IB_ADAPTIVE_ROUTING=1 NCCL_SOCKET_IFNAME=enp226s0 -x
    --gmcabtltcp,self ./nccl-tests/build/alltoall_perf --minbytes4G

    -H host-01:8,host-05:8,host-09:8,host-13:8
    UCX_IB_TRAFFIC_CLASS=96 -x NCCL_DEBUG=warn -x NCCL_P2P_DISABLE=0 -x NCCL_SHM_DISABLE=0
    -x UCX_TLS=rc,sm -x CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 -x UCX_NET_DEVICES=mlx5_0:1
    -x NCCL_IB_HCA=mlx5_0,mlx5_1,mlx5_2,_8,mlx5_9 -x NCCL_COLLNET_ENABLE=0 UCX_IB_GID_INDEX=3
    -x NCCL_IB_GID_INDEX=3 -x NCCL_IB_TC=96 -x NCCL_BUFFSIZE=16777216
    -x NCCL_IB_ADAPTIVE_ROUTING=1 NCCL_SOCKET_IFNAME=enp226s0 -x MELLANOX_VISIBLE_DEVICES=
    --gmcabtltcp,self ./nccl-tests/build/all_reduce_perf --minbytes4G

    -H host-02:8,host-06:8,host-10:8,host-14
    UCX_IB_TRAFFIC_CLASS=96 -x NCCL_DEBUG=warn -x NCCL_P2P_DISABLE=0 -x NCCL_SHM_DISABLE=0
    -x UCX_TLS=rc,sm -x CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 -x UCX_NET_DEVICES=mlx5_0:1
    -x NCCL_IB_HCA=mlx5_0,mlx5_1,mlx5_2,_8,mlx5_9 -x NCCL_COLLNET_ENABLE=0 UCX_IB_GID_INDEX=3
    -x NCCL_IB_GID_INDEX=3 -x NCCL_IB_TC=96 -x NCCL_BUFFSIZE=16777216
    -x NCCL_IB_ADAPTIVE_ROUTING=1 NCCL_SOCKET_IFNAME=enp226s0 -x
    --gmcabtltcp,self ./nccl-tests/build/alltoall_perf --minbytes4G

    -H host-03:8,host-07:8,hos
    UCX_IB_TRAFFIC_CLASS=96 -x NCCL_DEBUG=warn -x NCCL_P2P_DISABLE=0 -x NCCL_SHM_DISABLE=0
    -x UCX_TLS=rc,sm -x CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 -x UCX_NET_DEVICES=mlx5_0:1
    -x NCCL_IB_HCA=mlx5_0,mlx5_1,mlx5_2,_8,mlx5_9 -x NCCL_COLLNET_ENABLE=0 UCX_IB_GID_INDEX=3
    -x NCCL_IB_GID_INDEX=3 -x NCCL_IB_TC=96 -x NCCL_BUFFSIZE=16777216
    -x NCCL_IB_ADAPTIVE_ROUTING=1 NCCL_SOCKET_IFNAME=enp226s0 -x MELLANOX_VISIBLE_DEVICES=
    --gmcabtltcp,self ./nccl-tests/build/alltoall_perf --minbytes4G

    -H host-04:8,host-08:8,host-12:8,host-16
    UCX_IB_TRAFFIC_CLASS=96 -x NCCL_DEBUG=warn -x NCCL_P2P_DISABLE=0 -x NCCL_SHM_DISABLE=0
    -x UCX_TLS=rc,sm -x CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 -x UCX_NET_DEVICES=mlx5_0:1
    -x NCCL_IB_HCA=mlx5_0,mlx5_1,mlx5_2,_8,mlx5_9 -x NCCL_COLLNET_ENABLE=0 UCX_IB_GID_INDEX=3
    -x NCCL_IB_GID_INDEX=3 -x NCCL_IB_TC=96 -x NCCL_BUFFSIZE=16777216
    -x NCCL_IB_ADAPTIVE_ROUTING=1 NCCL_SOCKET_IFNAME=enp226s0 -x
    --gmcabtltcp,self ./nccl-tests/build/alltoall_perf
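The host-side IP address allocation plan in Appendix A follows a regular pattern: the second octet advances by 32 per interface and the last octet is twice the host number. The sketch below regenerates the plan from that pattern. The function names `host_ip` and `allocation_plan` are hypothetical, and pairing the k-th address of each host with interface `eth<k>` is an assumption based on the order of the interface names in the appendix.

```python
from ipaddress import IPv4Address

NUM_HOSTS = 16  # host-01 .. host-16, as listed in Appendix A
NUM_PORTS = 8   # eth1 .. eth8, one address per interface


def host_ip(host, port):
    """Return the address of interface eth<port> on host-<host> (both 1-based).

    Pattern read off the appendix: 172.<32 * (port - 1)>.0.<2 * host>.
    """
    if not 1 <= host <= NUM_HOSTS:
        raise ValueError("host must be in 1..%d" % NUM_HOSTS)
    if not 1 <= port <= NUM_PORTS:
        raise ValueError("port must be in 1..%d" % NUM_PORTS)
    second_octet = 32 * (port - 1)
    last_octet = 2 * host
    # IPv4Address validates that every octet stays within 0..255.
    return str(IPv4Address("172.%d.0.%d" % (second_octet, last_octet)))


def allocation_plan():
    """Map each host name to its eight addresses, ordered eth1..eth8."""
    return {
        "host-%02d" % h: [host_ip(h, p) for p in range(1, NUM_PORTS + 1)]
        for h in range(1, NUM_HOSTS + 1)
    }


if __name__ == "__main__":
    for name, addrs in allocation_plan().items():
        print(name, *addrs)
```

Running the script prints one line per host in the same order as the appendix, which makes it easy to compare a deployed configuration against the plan.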
