CUDA / Re: My GPU kernel never performs the computation
« Last post by 2017012835 on March 09, 2023, 08:23:59 pm »
Try checking memory with cuda-gdb or cuda-memcheck. I ran into the same situation before: an array on the device had been allocated with the wrong size, and none of the conditions in the kernel's if statements ever evaluated to true.
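A minimal command-line sketch of that advice (file names `my_kernel.cu` / `my_app` are placeholders; `compute-sanitizer` supersedes `cuda-memcheck` in newer CUDA toolkits):

```shell
# Build with device-side debug info so cuda-gdb and the sanitizer
# can map errors back to source lines.
nvcc -g -G my_kernel.cu -o my_app

# Report out-of-bounds and misaligned device memory accesses.
compute-sanitizer --tool memcheck ./my_app   # CUDA 11.6 and newer
# cuda-memcheck ./my_app                     # older toolkits

# Step through the kernel interactively.
cuda-gdb ./my_app
```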
BrandonH | Edge AI: From Model Development to Deployment [S52125] |
晓得薛定谔的喵 | From Tortoise to Hare: How AI Can Turn Any Driver into a Race Car Driver [S51328] |
魏武卒 | Accelerating Generative AI in Biology and Healthcare [S51257] |
生生 | Advances in Accelerated Computing for Scientific Computing [S52137] |
geo-sig | How to Write a CUDA Program [S51210] |
The Galaxy Rabbit | Accelerating Transformer-Based Encoder-Decoder Language Models for Sequence-to-Sequence Tasks [S51158] |
丙乙 | Portable Acceleration of HPC Applications using ISO C++ — Part 1: Fundamentals* [DLIT51169] |
白碧哲 | How to Build a Real-time Path Tracer [S51871] |
Brain Compiler | Watch Party: How to Design and Optimize CUDA Programs [WP51210] |
咖啡逗 | Accelerate AI Innovation with Unmatched Cloud Scale and Performance (Presented by Microsoft) [S52469] |
00101010 | Creating and Executing an Effective Cyberdefense Strategy in an AI-Driven Business [S51723] |
PixelPassion | Deep Reinforcement Learning with Real-World Data [S51826] |
Christy | Watch Party: New CUDA Features and Directions [WP51225] |
16 | Accelerated AI Logistics and Route Optimization 101* [DLIT51886] |
Ryan | Developing Robust Multi-Task Models for AV Perception [SE50006] |
RP Cai | Addressing the AI Skills Gap [SE52129] |
蔡欣 | Speech-To-Speech Translation System for a Real-World Unwritten Language [S51780] |
亚克西 | Scaling Deep Learning Training: Fast Inter-GPU Communication with NCCL [S51111] |
Ranoe:) | Jetson Edge AI Developer Days: Accelerate Edge AI With NVIDIA Jetson Software [SE52433] |
d2real | Connect with the Experts: GPU-Accelerated Quantum Chemistry and Molecular Dynamics [CWES52130] |
slam | Connect with the Experts: A Deep-Dive Q&A with Jetson Embedded Platform Engineers [CWES52132] |
挠你个痒痒 | Development and Opportunities for AI Startups in the Chinese Market: Exploring China's AI Startup Force [SE52131] |
。 | Building Trust in AI for Autonomous Vehicles [S51934] |
位位位 | Introduction to Autonomous Vehicles [S51168] |
Tian_DY | A Comprehensive Low-Code Solution for Large-Scale 3D Medical Image Segmentation with MONAI Core and Auto3DSeg* [DLIT51974] |
Permanent Maniac | Watch Party: An End-to-End Subgraph Optimization Framework Based on TensorRT [WP51416] |
王威 | 3D by AI: Using Generative AI and NeRFs for Building Virtual Worlds [S52163] |
Hi, is your first problem an out-of-bounds array index? In the for loop you access S_A[threadIdx.x * 12 + j], where threadIdx.x can be as large as 127. At its maximum (and well before it), threadIdx.x * 12 + j exceeds the 128 elements you declared with S_A[128].

Hi, I modified the code for the out-of-bounds index as you suggested. With the definition __shared__ float S_A[1536] the program runs correctly, but it also runs correctly with __shared__ float S_A[1280], so I am still unsure how large the shared-memory array needs to be.
__global__ void kernel2(float* E, float* A, float* M, float* Y, float* X)  // E added to the parameter list; the posted snippet used it without declaring it
{
    const int n = blockDim.x * blockIdx.x + threadIdx.x;
    E[n] = 0;
    // Note: the element type is already float, so sizeof(float) * 320
    // declares 1280 elements, not 320; with blockDim.x = 128 the loop
    // below needs 128 * 12 = 1536 elements in S_A.
    __shared__ float S_E[sizeof(float) * 320];
    __shared__ float S_A[sizeof(float) * 320];
    S_E[threadIdx.x] = E[n];
    __syncthreads();
    for (int j = 0; j < 12; j++)
    {
        S_A[threadIdx.x * 12 + j] = A[n * 12 + j];
        // This overwrites the value just loaded from A, so the load above is dead.
        S_A[threadIdx.x * 12 + j] = M[(int)Y[n]] + X[n * 12 + j];  // Y holds indices, so cast to int
        // atomicAdd is unnecessary here: each thread only updates its own S_E slot.
        atomicAdd(&S_E[threadIdx.x], S_A[threadIdx.x * 12 + j] / 2);
    }
    __syncthreads();
}
How can bank conflicts on the shared-memory arrays S_A and S_E be reduced as much as possible?
__global__ void kernel2(float* A, float* M, float* Y, float* X)
{
    const int n = blockDim.x * blockIdx.x + threadIdx.x;
    __shared__ float S_A[sizeof(float) * 320];
    for (int j = 0; j < 12; j++)
    {
        S_A[threadIdx.x * 12 + j] = A[n * 12 + j];
        S_A[threadIdx.x * 12 + j] = M[(int)Y[n]] + X[n * 12 + j];  // Y holds indices, so cast to int
    }
    __syncthreads();
}