So, I've been tracking a performance issue in a library I'm writing, and after about two days of nonstop hunting I've managed to reproduce it in a simplified form (so hopefully this form is still relevant to the issue I actually care about):
a) I allocate a buffer with calResAlloc*D() and then call calCtxGetMem(), WITHOUT calling calResMap() followed by calResUnmap(). (I know the data allocated on the GPU is meaningless if you don't copy anything into it...)
b) I do the same as a, except that I do call calResMap() and calResUnmap(). (Both variants are sketched after the buffer list below.)
In both a and b, I have:
3*32 1D buffers of size 8192 (inputs)
1 1D buffer of size 640 (input)
32 2D buffers of size 640x8192 (outputs)
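
For reference, the allocation path in a/b looks roughly like the sketch below. It's heavily simplified: error checking is stripped, the helper names are just for this post, the format constant differs across CAL versions, and the calInit()/calDeviceOpen()/calCtxCreate() setup is assumed to have already happened.

#include "cal.h"
#include <string.h>

/* Variant (a): allocate on the GPU and wrap the resource in a CALmem handle,
   without ever mapping it. */
static CALmem allocBuffer1D(CALdevice dev, CALcontext ctx, CALuint width)
{
    CALresource res = 0;
    CALmem      mem = 0;
    calResAllocLocal1D(&res, dev, width, CAL_FORMAT_FLOAT_1, 0); /* format name varies by CAL version */
    calCtxGetMem(&mem, ctx, res);
    return mem;
}

/* Variant (b): same as (a), but touch the resource once with map/unmap. */
static CALmem allocBuffer1DMapped(CALdevice dev, CALcontext ctx, CALuint width)
{
    CALresource res   = 0;
    CALmem      mem   = 0;
    CALvoid    *ptr   = NULL;
    CALuint     pitch = 0;

    calResAllocLocal1D(&res, dev, width, CAL_FORMAT_FLOAT_1, 0);
    calResMap(&ptr, &pitch, res, 0);        /* host-visible pointer */
    memset(ptr, 0, width * sizeof(float));  /* clear (or upload real data) */
    calResUnmap(res);
    calCtxGetMem(&mem, ctx, res);
    return mem;
}

/* The 2D outputs are the same story with calResAllocLocal2D(), e.g.
   calResAllocLocal2D(&res, dev, 640, 8192, CAL_FORMAT_FLOAT_1, 0); */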
Naturally, I have to run 32 instances of the kernel with different parameter mappings to fill all 32 outputs. So, I time the wall time starting from the first group of these calls:
calModuleGetName()
calCtxSetMem()
calCtxRunProgram(&event, ctx, kernelEntry, &domain);
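
For concreteness, the timed loop looks something like the sketch below. It assumes the timing window runs from the first bind to completion of the last kernel, and the IL names ("i0", "i1", "i2", "cb0", "o0") are just stand-ins for whatever the kernel actually declares.

#include "cal.h"
#include <stdio.h>
#include <sys/time.h>

static double wallSeconds(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec * 1e-6;
}

static void runAllInstances(CALcontext ctx, CALmodule module, CALfunc kernelEntry,
                            CALmem inputs[32][3], CALmem params, CALmem outputs[32])
{
    CALdomain domain = { 0, 0, 640, 8192 };   /* matches the 2D output size */
    CALevent  event  = 0;
    double    t0     = wallSeconds();
    int       k;

    for (k = 0; k < 32; ++k) {
        CALname name;

        /* Bind this instance's buffers to the kernel's IL names. */
        calModuleGetName(&name, ctx, module, "i0");  calCtxSetMem(ctx, name, inputs[k][0]);
        calModuleGetName(&name, ctx, module, "i1");  calCtxSetMem(ctx, name, inputs[k][1]);
        calModuleGetName(&name, ctx, module, "i2");  calCtxSetMem(ctx, name, inputs[k][2]);
        calModuleGetName(&name, ctx, module, "cb0"); calCtxSetMem(ctx, name, params);
        calModuleGetName(&name, ctx, module, "o0");  calCtxSetMem(ctx, name, outputs[k]);

        calCtxRunProgram(&event, ctx, kernelEntry, &domain);
    }

    /* Spin until the last kernel reports done, then stop the clock. */
    while (calCtxIsEventDone(ctx, event) == CAL_RESULT_PENDING)
        ;

    printf("wall time: %.3f s\n", wallSeconds() - t0);
}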
So, I figured out my main issue as well. I wasn't calling calResMap() and calResUnmap() on the output buffers before using them; I was only calling calResMap()/calResUnmap() to read the buffers back from the GPU after the kernel had completed. However, I was still getting the correct answer. I'm still puzzled why this would affect the performance of a kernel, other than maybe some weird caching issues.
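
For anyone hitting the same thing, the fix amounts to something like this, run once before the kernels are launched (outputRes[] here is just my name for the CALresource handles behind the output CALmem bindings):

/* Touch each output resource with a map/unmap pair before the kernels run
   (previously I only mapped them afterwards to read results back). */
for (int k = 0; k < 32; ++k) {
    CALvoid *ptr   = NULL;
    CALuint  pitch = 0;
    if (calResMap(&ptr, &pitch, outputRes[k], 0) == CAL_RESULT_OK) {
        /* optionally clear the buffer here before the kernel writes it */
        calResUnmap(outputRes[k]);
    }
}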