DAY12: Reading CUDA C Runtime: Multi-GPU Programming

Posted on 2018-5-15 13:57:31
3.2.6. Multi-Device System


3.2.6.1. Device Enumeration
A host system can have multiple devices. The following code sample shows how to enumerate these devices, query their properties, and determine the number of CUDA-enabled devices.
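The original post embedded this sample as an image that no longer renders. The following is a sketch in the spirit of the CUDA C Programming Guide's standard enumeration example (error checking omitted):

```cuda
#include <cstdio>

int main() {
    int deviceCount;
    cudaGetDeviceCount(&deviceCount);  // number of CUDA-enabled devices
    for (int device = 0; device < deviceCount; ++device) {
        cudaDeviceProp deviceProp;
        cudaGetDeviceProperties(&deviceProp, device);
        printf("Device %d has compute capability %d.%d.\n",
               device, deviceProp.major, deviceProp.minor);
    }
    return 0;
}
```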
3.2.6.2. Device Selection
A host thread can set the device it operates on at any time by calling cudaSetDevice(). Device memory allocations and kernel launches are made on the currently set device; streams and events are created in association with the currently set device. If no call to cudaSetDevice() is made, the current device is device 0.
The following code sample illustrates how setting the current device affects memory allocation and kernel execution.
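The sample was attached as an image that is now missing. A sketch consistent with the guide's device-selection example, where `MyKernel` stands in for any `__global__` function:

```cuda
// MyKernel is a placeholder for any __global__ kernel.
__global__ void MyKernel(float* p) { p[threadIdx.x] = 0.0f; }

int main() {
    size_t size = 1024 * sizeof(float);

    cudaSetDevice(0);             // set device 0 as current
    float* p0;
    cudaMalloc(&p0, size);        // allocation lands on device 0
    MyKernel<<<1000, 128>>>(p0);  // kernel launches on device 0

    cudaSetDevice(1);             // set device 1 as current
    float* p1;
    cudaMalloc(&p1, size);        // allocation lands on device 1
    MyKernel<<<1000, 128>>>(p1);  // kernel launches on device 1
    return 0;
}
```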
3.2.6.3. Stream and Event Behavior
A kernel launch will fail if it is issued to a stream that is not associated with the current device, as illustrated in the following code sample.
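The screenshot of this sample is gone; a sketch of the failing case, assuming `MyKernel` is any `__global__` kernel:

```cuda
__global__ void MyKernel() { }

int main() {
    cudaSetDevice(0);                // set device 0 as current
    cudaStream_t s0;
    cudaStreamCreate(&s0);           // stream s0 belongs to device 0
    MyKernel<<<100, 64, 0, s0>>>();  // OK: launch on device 0 into s0

    cudaSetDevice(1);                // set device 1 as current
    MyKernel<<<100, 64, 0, s0>>>();  // FAILS: s0 is not associated
                                     // with the current device (1)
    return 0;
}
```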
A memory copy will succeed even if it is issued to a stream that is not associated with the current device.
cudaEventRecord() will fail if the input event and input stream are associated with different devices.
cudaEventElapsedTime() will fail if the two input events are associated with different devices.
cudaEventSynchronize() and cudaEventQuery() will succeed even if the input event is associated with a device that is different from the current device.
cudaStreamWaitEvent() will succeed even if the input stream and input event are associated with different devices. cudaStreamWaitEvent() can therefore be used to synchronize multiple devices with each other.
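A minimal sketch of that cross-device synchronization pattern (my own construction, not a guide sample; `Produce` and `Consume` are placeholder kernels):

```cuda
__global__ void Produce() { }  // placeholder: work done on device 0
__global__ void Consume() { }  // placeholder: dependent work on device 1

int main() {
    cudaSetDevice(0);
    cudaStream_t s0; cudaStreamCreate(&s0);   // stream on device 0
    cudaEvent_t done0; cudaEventCreate(&done0);

    cudaSetDevice(1);
    cudaStream_t s1; cudaStreamCreate(&s1);   // stream on device 1

    cudaSetDevice(0);
    Produce<<<100, 64, 0, s0>>>();   // enqueue work on device 0
    cudaEventRecord(done0, s0);      // event and stream on same device: OK

    cudaSetDevice(1);
    cudaStreamWaitEvent(s1, done0, 0);  // s1 (device 1) waits on device 0's event
    Consume<<<100, 64, 0, s1>>>();      // starts only after Produce completes
    return 0;
}
```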
Each device has its own default stream (see Default Stream), so commands issued to the default stream of a device may execute out of order or concurrently with respect to commands issued to the default stream of any other device.
3.2.6.4. Peer-to-Peer Memory Access
When the application is run as a 64-bit process, devices of compute capability 2.0 and higher from the Tesla series may address each other's memory (i.e., a kernel executing on one device can dereference a pointer to the memory of the other device). This peer-to-peer memory access feature is supported between two devices if cudaDeviceCanAccessPeer() returns true for these two devices.
Peer-to-peer memory access must be enabled between two devices by calling cudaDeviceEnablePeerAccess() as illustrated in the following code sample. Each device can support a system-wide maximum of eight peer connections.
A unified address space is used for both devices (see Unified Virtual Address Space), so the same pointer can be used to address memory from both devices as shown in the code sample below.
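The embedded sample image is missing; a sketch in line with the guide's peer-to-peer access example (`MyKernel` is a placeholder kernel):

```cuda
__global__ void MyKernel(float* p) { p[threadIdx.x] += 1.0f; }

int main() {
    size_t size = 1024 * sizeof(float);

    cudaSetDevice(0);                  // set device 0 as current
    float* p0;
    cudaMalloc(&p0, size);             // allocate memory on device 0
    MyKernel<<<1000, 128>>>(p0);       // launch kernel on device 0

    cudaSetDevice(1);                  // set device 1 as current
    cudaDeviceEnablePeerAccess(0, 0);  // enable peer access to device 0

    // Thanks to the unified address space, the same pointer p0 is valid here:
    MyKernel<<<1000, 128>>>(p0);       // device 1 dereferences device 0 memory
    return 0;
}
```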
3.2.6.5. Peer-to-Peer Memory Copy
Memory copies can be performed between the memories of two different devices.
When a unified address space is used for both devices (see Unified Virtual Address Space), this is done using the regular memory copy functions mentioned in Device Memory.
Otherwise, this is done using cudaMemcpyPeer(), cudaMemcpyPeerAsync(), cudaMemcpy3DPeer(), or cudaMemcpy3DPeerAsync() as illustrated in the following code sample.
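This sample was also a screenshot, now lost; a sketch of the peer-to-peer copy pattern, consistent with the guide's example (`MyKernel` is a placeholder):

```cuda
__global__ void MyKernel(float* p) { p[threadIdx.x] += 1.0f; }

int main() {
    size_t size = 1024 * sizeof(float);

    cudaSetDevice(0);
    float* p0;
    cudaMalloc(&p0, size);               // allocate on device 0
    cudaSetDevice(1);
    float* p1;
    cudaMalloc(&p1, size);               // allocate on device 1

    cudaSetDevice(0);
    MyKernel<<<1000, 128>>>(p0);         // produce data on device 0
    cudaSetDevice(1);
    cudaMemcpyPeer(p1, 1, p0, 0, size);  // copy device 0 -> device 1
    MyKernel<<<1000, 128>>>(p1);         // consume the copy on device 1
    return 0;
}
```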
A copy (in the implicit NULL stream) between the memories of two different devices:
  • does not start until all commands previously issued to either device have completed and
  • runs to completion before any commands (see Asynchronous Concurrent Execution) issued after the copy to either device can start.

Consistent with the normal behavior of streams, an asynchronous copy between the memories of two devices may overlap with copies or kernels in another stream.
Note that if peer-to-peer access is enabled between two devices via cudaDeviceEnablePeerAccess() as described in Peer-to-Peer Memory Access, peer-to-peer memory copy between these two devices no longer needs to be staged through the host and is therefore faster.

OP | Posted on 2018-5-15 16:31:16
streams and events are created in association with the currently set device. If no call to cudaSetDevice() is made, the current device is device 0
Once you set a device, e.g. cudaSetDevice(3) to select card 3, every subsequent device-memory allocation (cudaMalloc), stream creation, and kernel launch happens on that card. In other words, if you have 4 cards and want to allocate 1 GB of device memory on each, you must cudaSetDevice to 0, 1, 2, 3 in turn and call cudaMalloc after each switch. Likewise, to run your kernel on all 4 cards you need 4 separate launches, one after each cudaSetDevice, not a single launch that covers all 4 cards at once. Put simply, multi-GPU programming is manual, not automatic. If you never call cudaSetDevice, everything defaults to device 0 and the remaining cards just sit idle.

A few additional notes. CUDA 9 introduced cooperative groups, including a rather special API that can launch a kernel on several cards at once, but it comes with many restrictions, including on the use of C++; more on that later. Also, many libraries (e.g. the bundled cuBLAS) can use multiple cards automatically, but that too is a topic for later. What you should know is that even a library like cuBLAS, which appears to be automatically multi-GPU, drives the cards manually in exactly this way internally; it merely hides that from the user, so it only looks automatic.
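The manual per-card pattern described above can be sketched as follows (a sketch, not from the guide; `MyKernel` is a placeholder kernel and error checking is omitted):

```cuda
__global__ void MyKernel(float* p) { p[threadIdx.x] = 0.0f; }

int main() {
    int deviceCount;
    cudaGetDeviceCount(&deviceCount);
    float* buf[8];                          // assume at most 8 cards here
    size_t size = 1u << 30;                 // 1 GB per card
    for (int dev = 0; dev < deviceCount && dev < 8; ++dev) {
        cudaSetDevice(dev);                 // make card `dev` current
        cudaMalloc(&buf[dev], size);        // this allocation lands on card `dev`
        MyKernel<<<1000, 128>>>(buf[dev]);  // a separate launch per card
    }
    return 0;
}
```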

A kernel launch will fail if it is issued to a stream that is not associated with the current device. The stream and the current card must match: if you switch to another card (say, cudaSetDevice to card 2) and then try to launch a kernel into a stream belonging to the previous card (say, a stream created on card 1), the launch fails in that combination. In other words, you cannot launch a kernel on card 2 through a stream that does not belong to that card. (How do streams relate to kernels? A kernel can only be launched into a stream, and all operations within a stream execute in order; the equivalent concept in OpenCL is the CommandQueue.)

Each device has its own default stream (see Default Stream), so commands issued to the default stream of a device may execute out of order or concurrently with respect to commands issued to the default stream of any other device.
In a multi-GPU environment, because each card has its own default stream, commands issued to the default streams of different cards execute out of order with respect to each other. Frankly, this sentence states the obvious: "out of order" already says it all. Kernel 1 may run before kernel 2, kernel 2 may run before kernel 1, or the two may start and finish at the same time. All of these are possible.

Peer-to-peer memory access must be enabled between two devices by calling cudaDeviceEnablePeerAccess() as illustrated in the following code sample. Each device can support a system-wide maximum of eight peer connections. That is, P2P memory access between two devices has to be switched on via cudaDeviceEnablePeerAccess(), and within one system each card can establish P2P access with at most 8 other cards.

What is the difference between Peer-to-Peer Memory Access and Peer-to-Peer Memory Copy?

With the former, card B can use card A's device memory directly, as if it were its own. The latter is a P2P copy: card B must first copy the contents of card A's memory into its own memory before (a kernel on) card B can use it. The former is used in place; the latter requires the copy first. Prefer Access over Copy whenever you can, except when:
(1) the motherboard does not support it (e.g. the two cards sit under the PCI-E lanes of the two different CPUs in a dual-socket system);
(2) the system does not support it (e.g. trying it on Windows with consumer cards, which cannot run the TCC driver);
(3) everything is supported and direct access would work, but the buffer is accessed repeatedly, and going across PCI-E to the other card's memory every time is slow, so you copy it over manually once and work on the local replica.

Note that on Windows, P2P Copy is always available, whereas P2P Access requires a professional card plus TCC. When a direct copy is impossible (e.g. case (1) above), P2P Copy automatically stages through host memory, while P2P Access simply fails. P2P Access has the advantage that one card can use the memory of 2, 3, 4, even 8 cards, which is very convenient for applications that need a lot of memory capacity. And there is a supercharged version of P2P Access: the one on DGX. There, inter-card P2P Access runs not only over PCI-E but also over NVLink at much higher bandwidth, so the memory of nearly all the cards in a DGX can be pooled together, which suits applications with huge memory footprints. Ordinary P2P Access, when the motherboard, system, and cards all support it, is slower than on DGX but still solves the not-enough-memory problem. P2P Copy, by contrast, duplicates one card's data into your own memory, so it cannot enlarge the effective memory capacity and is no help for that.