找回密码
 立即注册

QQ登录

只需一步,快速开始

查看: 62|回复: 1

DAY9:阅读CUDA异步并发执行中的Streams

[复制链接]
发表于 2018-5-11 20:06:55 | 显示全部楼层 |阅读模式
QQ图片20180503112158.jpg

                               
登录/注册后可看大图



我们正带领大家开始阅读英文的《CUDA C Programming Guide》,今天是第9天,我们用几天时间来学习CUDA 的编程接口,其中最重要的部分就是CUDA C runtime.希望在接下来的91天里,您可以学习到原汁原味的CUDA,同时能养成英文阅读的习惯。

本文共计582字,阅读时间15分钟



                               
登录/注册后可看大图


前情回顾:DAY5:阅读 CUDA C编程接口之CUDA C runtime

今天继续讲解异步并发执行中的Streams:

3.2.5.5.4. Implicit Synchronization【隐式同步】
Two commands from different streams cannot run concurrently【同时地】 if any one of the following operations is issued in-between them by the host thread
【 两个不同流中的命令不能同时执行,如果host线程在这两个命令中间发布了下面任意操作】:
· a page-locked host memory allocation,【分配page-locked内存】
· a device memory allocation,【分配显存】
· a device memory set,【指普通的memset()函数的cuda版本: cudaMemset,一般用来初始化或者显存清零之类的用途】
· a memory copy between two addresses to the same device memory,【从两个其他地址到相同显存地址的复制操作】
· any CUDA command to the NULL stream,【任何对默认流发布的命令】
· a switch between the L1/shared memory configurations described in Compute Capability 3.x and Compute Capability 7.x.【这计算能力3.X和7.x上进行L1 / shared memory的大小切换配置】
For devices that support concurrent kernel execution【内核并发执行】 and are of compute capability 3.0 or lower, any operation that requires a dependency check to see if a streamed kernel launch is complete:
· Can start executing only when all thread blocks of all prior kernel launches from any stream in the CUDA context have started executing;
· Blocks all later kernel launches from any stream in the CUDA context until the kernel launch being checked is complete.
Operations that require a dependency check include any other commands within the same stream as the launch being checked and any call to cudaStreamQuery() on that stream. Therefore, applications should follow these guidelines to improve their potential for concurrent kernel execution:
· All independent operations should be issued before dependent operations,
· Synchronization of any kind should be delayed as long as possible.
3.2.5.5.5. Overlapping Behavior【重叠行为】
The amount of execution overlap between two streams depends on the order in which the commands are issued to each stream and whether or not the device supports overlap of data transfer and kernel execution , concurrent kernel execution , and/or concurrent data transfers.
For example, on devices that do not support concurrent data transfers, the two streams of the code sample of Creation and Destruction do not overlap at all because the memory copy from host to device is issued to stream[1] after the memory copy from device to host is issued to stream[0], so it can only start once the memory copy from device to host issued to stream[0] has completed. If the code is rewritten the following way (and assuming the device supports overlap of data transfer and kernel execution)
QQ图片20180511200536.png
then the memory copy from host to device issued to stream[1] overlaps with the kernel launch issued to stream[0].
On devices that do support concurrent data transfers, the two streams of the code sample of Creation and Destruction do overlap: The memory copy from host to device issued to stream[1] overlaps with the memory copy from device to host issued to stream[0] and even with the kernel launch issued to stream[0] (assuming the device supports overlap of data transfer and kernel execution). However, for devices of compute capability 3.0 or lower, the kernel executions cannot possibly overlap because the second kernel launch is issued to stream[1] after the memory copy from device to host is issued to stream[0], so it is blocked until the first kernel launch issued to stream[0] is complete as per Implicit Synchronization. If the code is rewritten as above, the kernel executions overlap (assuming the device supports concurrent kernel execution) since the second kernel launch is issued to stream[1] before the memory copy from device to host is issued to stream[0]. In that case however, the memory copy from device to host issued to stream[0] only overlaps with the last thread blocks of the kernel launch issued to stream[1] as per Implicit Synchronization, which can represent only a small portion of the total execution time of the kernel.

回复

使用道具 举报

 楼主| 发表于 2018-5-11 20:09:28 | 显示全部楼层
Two commands from different streams cannot run concurrently【同时地】 if any one of the following operations is issued in-between them by the host thread:
  • 这下面列的这几点,我做个说明: a device memory set是指普通的memset()函数的cuda版本: cudaMemset,这函数可以对一段显存进行清零或者填充上特定的数据(例如0xff),一般用于初始化之类的。主要是,一般的配置型工作或者初始化都不能同时和其他操作进行。不过这种操作一般在程序开头,实际应用中不影响的。
  • 至于默认流那个,那个默认不能同时进行的。除非你建立流的时候明确带有允许异步执行的标志。
  • 至于3.X和7.X上的问题, 这是因为2.X(已经从9.0开始被放弃支持)和3.X,7.X都是L1 cache和shared memory大小可调的,而Maxwell改成了不可调,Pascal又延续了,然后7.X又改回来了,也就是:2.X可调 -> 3.X可调 -> 5.X不可调 -> 6.X不可 -> 7.x可。
  • a memory copy between two addresses to the same device memory 这个可能不是CUDA的限制,而是如果都是从a -> c和b -> c的复制操作,如果同时进行了,可能会造成逻辑上的混乱。例如用户原本从a->c复制了10MB, 然后从b -> c也复制了10MB,正常情况下先后进行是b覆盖了a的结果。但如果同时进行,结果可能是未知的。(例如c最后有一部分是从a传输来的,另外一部分是从b传输来的)这样结果可能是混乱的。如果从这个角度说,倒是说的过去。但是应当没有人会这样写吧。正常写的人心里应当知道这样是不安全的。不知道CUDA为何要单独说一下。就像一个人去餐厅可能先喝茶,再喝酒,但是一般没有人同时喝酒一口,再喝茶一口的。只不过餐厅现在贴出了通知:本餐厅不支持同时喝酒+喝茶,您每次可以只要一种。


For devices that support concurrent kernel execution and are of compute capability 3.0 or lower, any operation that requires a dependency check to see if a streamed kernel launch is complete: 字面意思是:对于支持并发kernel执行的,同时计算能力小于等于3.0的设备(即Fermi和初代Kepler---请注意这CUDA 9个时候已经放弃了Fermi支持了,这里应该改成,仅对于初代Kepler(3.0)才好),需要查询或者等待(依赖)某流中的之前的某kernel完成状态的任何操作:
(1)该操作必须等待之前的CUDA Context中的所有流中的所有操作都开始执行后,才能开始执行;
(2)该操作将阻止之后的当前Context中的所有流中的所有操作执行,直到该操作如前所说的,所依赖的某kernel完成执行,或者查询结果返回(操作未完成)。
但是实际中,老卡上的第二点是不对的。主要是老卡只有一个物理上的Kernel Execution Queue, 和2个DMA Queues(Device -> Host 和 Host -> Device),导致了很多情况下原本能并发执行的操作不能并发执行。但是什么操作是所谓的“需要查询或者等待(依赖)某流中的之前的某kernel完成状态”的操作?
显然常见的只有Async结尾的cudaMemcpy*()函数,
以及,应当附加上cudaStreamQuery()
广义的说还有cudaMemcpy*()无async的同步版本和各种分配函数之类的,但这种就包含的广了。
和这里的这段英文说的不同的是,根据实际经验,在老卡(Fermi和计算能力3.0)上使用cudaStreamQuery,非但不像手册这段说的,会可能阻止多种操作的并发性,反而可能会增加老卡上的并发执行效果。(从老卡+Profiler的时间轴上能很容易看到这点)。 好在从计算能力3.5开始(例如K40?),Maxwell, Pascal这些,都具有Hyper-Q了。不存在这些种种限制了。用户也不用学习各种命令发布技巧了。新点的卡任何一种(无论深度,广度,还是用户自己随心所欲的任何一种发布方式),只要逻辑上能并行的,资源也允许的,卡就能给你并行,非常给力。

The amount of execution overlap between two streams depends on the order in which the commands are issued to each stream and whether or not the device supports overlap of data transfer and kernel execution , concurrent kernel execution , and/or concurrent data transfers.
overlap指的是 执行的操作在时间上重叠(同时执行),比如这个图:

QQ图片20180511200834.png
这个重叠比较多。 一共启动了6个kernel
两个流之间的执行重叠程度,取决于每个流中的命令发布顺序(特别对于无Hyper-Q的卡,这个很重要。例如手册说过的深度优先和广度优先这两种顺序),取决于是否设备支持数据传输和kernel执行重叠,取决于(设备是否支持)并发kernel执行,和/或(取决于)并发数据传输。(计算能力5.0(包含)一下的双向传输需要专业卡,计算能力5.2(包含)家用卡也支持数据双向传输(双Copy Engines)。双向原本是专业卡的特性,现在都开放),类似的一些TCC才能用远程桌面或者服务中使用CUDA,现在家用卡也可以了。很多以前的特性需要专业卡,现在都开放了。 类似的,以前NVENC需要买license才能用。现在NV家用卡开放编码能力,限两路同时编码。目前NV还有的常见限制是专业卡的double,ECC,编码以及虚拟化。(Titan系列算是准专业卡,连Jetson Tx2也有ECC哟)

However, for devices of compute capability 3.0 or lower, the kernel executions cannot possibly overlap because the second kernel launch is issued to stream[1] after the memory copy from device to host is issued to stream[0], so it is blocked until the first kernel launch issued to stream[0] is complete as per Implicit Synchronization.  然后,因计算能力3.0或者更低的设备上的隐式同步问题,(多个)kernel之间的执行可能不能重叠,因为第二个流stream[1]中的kernel启动命令,是在第一个流中stream[0]中的D->H传输命令发布以后,这样它将阻塞,直到第一个流stream[0]中的第一个kernel执行完成以后(才能开始执行)。老卡有很多限制的。发布命令给多个流,需要注意顺序。多种问题。3.5+的卡无任何问题,只要是多流,逻辑上应该并发的,资源允许的情况下就会并发。而不管一些隐晦的限制条件。不过现在的GPU卡都至少5.0以上了

有不明白的地方,请在本文后留言
回复 支持 反对

使用道具 举报

您需要登录后才可以回帖 登录 | 立即注册

本版积分规则

关闭

站长推荐上一条 /1 下一条

快速回复 返回顶部 返回列表