To your two questions:
(1) When one operand of an instruction uses the c[XX][XX] form, there is no concrete public documentation on how the hardware handles it differently. But keep in mind that the GPU is a latency-hiding processor, designed to cover latency: reading a c[XX][XX] operand may indeed take slightly longer than reading a register directly, but that small difference will usually not hurt overall GPU performance at all.
(The "Dissecting XXX via Microbenchmarking" series of papers, where XXX is a GPU architecture name, covers constant-memory latency under various conditions; take a look if you are interested.)
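
As a minimal sketch of where the c[XX][XX] operand form comes from (the kernel, variable names, and coefficient values below are made up for illustration): reads of user __constant__ data are typically folded straight into arithmetic instructions as constant-bank operands, e.g. FFMA R0, R2, c[0x3][0x8], R4, rather than issued as separate loads. The bank number is an implementation detail that varies across architectures.

    #include <cuda_runtime.h>

    __constant__ float coeffs[4];   // user constant data, placed in a constant bank

    __global__ void poly(const float* x, float* y, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            float v = x[i];
            // Each coeffs[k] read shows up in SASS as a c[bank][offset]
            // operand of the FFMA, not as a separate load instruction.
            y[i] = coeffs[0] + v * (coeffs[1] + v * (coeffs[2] + v * coeffs[3]));
        }
    }

    int main()
    {
        float h[4] = {1.f, 2.f, 3.f, 4.f};
        cudaMemcpyToSymbol(coeffs, h, sizeof(h));
        // launch poly<<<...>>> as usual; inspect the SASS with: cuobjdump -sass
        return 0;
    }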
(2) About the L1/Tex cache behavior on Maxwell/Pascal (compute capability 5.x and 6.x), briefly: on 5.0, 5.2, and 6.1 hardware, ordinary global reads and writes often cannot be cached in L1, but read-only loads issued through __ldg() can be cached in L1 effectively, sidestepping the penalty of having to go to L2 in many situations and often improving performance. On these two generations, qualifying a pointer with const __restrict__ hints the compiler to do this automatically, but I do not know whether the compiler can discover this fully automatically the way Lao Fan's book supposedly says (you did not paste the original passage from his book here).
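
A minimal sketch of the two ways to reach the read-only load path on these chips (kernel names are mine, for illustration only):

    // Variant 1: const __restrict__ asserts the data is read-only and
    // non-aliased within this kernel, so the compiler MAY emit LDG.E.CI.
    __global__ void scale_hint(const float* __restrict__ in,
                               float* __restrict__ out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = 2.0f * in[i];
    }

    // Variant 2: __ldg() forces the read-only (texture/L1) load path,
    // whether or not the compiler could prove it safe on its own.
    __global__ void scale_forced(const float* in, float* out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = 2.0f * __ldg(&in[i]);
    }

Compile for sm_50/sm_61 and dump the SASS (cuobjdump -sass) to check which load instruction was actually generated.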
Further reading: look up __ldg, const __restrict__, and LDG.E.CI for more information. Also read the compute capability 5.x and 6.x chapters of the manual, plus the bundled Pascal tuning/compatibility guide, in particular the section on L1 being automatically disabled on Pascal under certain circumstances (or the section on the automatic workaround for the occupancy-drops-to-0 bug).
Some explanation of the LDG instruction can be found on the official forum at
https://forums.developer.nvidia.com/t/ldg-versus-textures/31383 :
The LDG instruction is a global memory load that uses the texture path. It has the advantage that it does not require the explicit use of textures. Explicit use of textures causes a certain amount of code clutter and overhead (e.g. for API calls to bind textures), and textures are objects unfamiliar to many programmers new to CUDA. The introduction of LDG therefore increases the ease of use.
Whether the use of LDG results in higher or lower performance compared to the use of classical textures depends on the use case, I am not aware of any hard-and-fast rule about that. When comparing the performance with LDG vs regular global loads, I have found that use of LDG results in higher performance in most cases, but recall at least one real-life use case where this was not the case.
As eyalhir74 points out, declaring pointers as “const restrict” facilitates, but does not guarantee, the generation of LDG instructions on Kepler-class GPUs. The reason for this is that the use of “const restrict” pointers makes assertions about local read-only behavior, whereas use of LDG requires the data to be read-only for the lifetime of a kernel. Only the use of the __ldg() device function ensures that LDG is generated.
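
To make that last point concrete (the kernel name is mine): without __restrict__, the compiler must assume that writes through out could alias in, so in's data is not provably read-only for the kernel's lifetime and the compiler will typically fall back to an ordinary global load. The scale_forced kernel above, with its explicit __ldg(), still gets the read-only path, because there the programmer has taken responsibility for the promise.

    // const alone is only a local, per-pointer assertion: out[] might
    // alias in[], so in[] is not provably read-only for the kernel's
    // lifetime, and the compiler typically emits a plain global load here.
    __global__ void no_promise(const float* in, float* out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i];
    }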