Very strange and aggravating kernel performance issue related to calResMap()/calResUnmap()

Discussion created by rick.weber on Jan 23, 2009
Latest reply on Jan 23, 2009 by rick.weber

So, I've been tracking a performance issue in a library I'm writing, and after about 2 days of nonstop hunting, I've been able to reproduce it in a different form (hopefully this reproduction is still relevant to the form of the issue I actually care about):

a) I have a buffer, and I call calResAlloc*D() and then calCtxGetMem() WITHOUT calling calResMap() followed by calResUnmap(). (I know the data allocated on the GPU is meaningless if you don't copy anything into it...)

b) I do the same as a), except that I do call calResMap() and calResUnmap().

In both a and b, I have:

3*32 1D buffers of size 8192 (inputs)

1 1D buffer of size 640 (input)

32 2D buffers of size 640x8192 (outputs)

Naturally, I have to run 32 instances of the kernel, each with a different parameter mapping, to fill all 32 outputs. So, I measure the wall time between the first call to

calCtxRunProgram(&event, ctx, kernelEntry, &domain);

and the time when none of the 32 events is still reporting CAL_RESULT_PENDING (i.e. all 32 kernels have completed). Cases a) and b) are run with the same domain and the same shader.
However, a) takes 0.14 s to complete while b) takes only 0.05 s.
What could cause this?
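
For concreteness, the measurement described above can be sketched roughly as follows. This is only a minimal sketch under stated assumptions, not the actual library code: launch_kernel() and is_event_done() are hypothetical stand-ins for calCtxRunProgram() and the event status query, stubbed out so the loop runs without the CAL runtime, and the RESULT_* values merely mimic CAL_RESULT_OK / CAL_RESULT_PENDING.

```c
#include <assert.h>
#include <time.h>

#define NUM_KERNELS 32

/* Hypothetical stand-ins for CAL result codes and event handles. */
typedef enum { RESULT_OK, RESULT_PENDING } result_t;
typedef unsigned int event_t;

/* Stub state: how many more polls each fake kernel stays "pending". */
static int polls_left[NUM_KERNELS];

/* Hypothetical stub for calCtxRunProgram(): records an event for kernel i. */
static result_t launch_kernel(event_t *event, int i)
{
    *event = (event_t)i;
    polls_left[i] = (i % 4) + 1;   /* pretend kernels finish at different times */
    return RESULT_OK;
}

/* Hypothetical stub for the event status query. */
static result_t is_event_done(event_t event)
{
    if (polls_left[event] > 0) {
        polls_left[event]--;
        return RESULT_PENDING;
    }
    return RESULT_OK;
}

/* Wall-clock time in seconds (C11 timespec_get). */
static double now_seconds(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Launch all kernels, then poll until no event is pending.
 * Returns the elapsed wall time; *completed_out receives the
 * number of kernels observed to finish. */
double time_all_kernels(int *completed_out)
{
    event_t events[NUM_KERNELS];
    int done[NUM_KERNELS] = {0};
    int completed = 0;

    double start = now_seconds();

    for (int i = 0; i < NUM_KERNELS; i++) {
        result_t r = launch_kernel(&events[i], i);
        assert(r == RESULT_OK);
        (void)r;
    }

    /* Keep polling every unfinished event until all have completed. */
    while (completed < NUM_KERNELS) {
        for (int i = 0; i < NUM_KERNELS; i++) {
            if (!done[i] && is_event_done(events[i]) != RESULT_PENDING) {
                done[i] = 1;
                completed++;
            }
        }
    }

    double elapsed = now_seconds() - start;
    *completed_out = completed;
    return elapsed;
}
```

Calling time_all_kernels(&completed) returns the elapsed seconds for one run; with the stubs the time is close to zero, and in the real code the only difference between a) and b) is whether the buffers were mapped and unmapped before this loop runs.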