Very strange and aggravating kernel performance issue related to calResMap()/Unmap

So, I've been tracking a performance issue in a library I'm writing, and after about two days of nonstop hunting I've managed to reproduce it in a reduced form (so hopefully this form is still representative of the issue I actually care about):

a) I allocate a buffer with calResAlloc*D() and then call calCtxGetMem(), WITHOUT calling calResMap() followed by calResUnmap(). (I know the data allocated on the GPU is meaningless if you never copy anything into it...)

b) I do the same as (a), except that I also call calResMap() followed by calResUnmap().
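For concreteness, the two cases look roughly like this. This is a non-runnable sketch against the CAL C API; `device`, `ctx`, the 1D width, and the format are placeholders standing in for my real setup, and all error checking is omitted:

```c
CALresource res;
CALmem mem;

/* Case (a): allocate and bind, never touching the memory from the CPU. */
calResAllocLocal1D(&res, device, 8192, CAL_FORMAT_FLOAT_1, 0);
calCtxGetMem(&mem, ctx, res);

/* Case (b): identical, except the resource is mapped and unmapped once. */
CALvoid *ptr;
CALuint pitch;
calResAllocLocal1D(&res, device, 8192, CAL_FORMAT_FLOAT_1, 0);
calResMap(&ptr, &pitch, res, 0);
/* (the mapped pointer could be initialized here, but in my test it isn't) */
calResUnmap(res);
calCtxGetMem(&mem, ctx, res);
```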

In both a and b, I have:

3*32 1D buffers of size 8192 (inputs)

1 1D buffer of size 640 (input)

32 2D buffers of size 640x8192 (outputs)

Naturally, I have to run 32 instances of the kernel with different parameter mappings to fill all 32 outputs. So, I measure the wall time between calling the first of the

calCtxRunProgram(&event, ctx, kernelEntry, &domain);

calls and the time when all 32 events are no longer CAL_RESULT_PENDING (i.e., all 32 kernels have completed). Cases (a) and (b) are run with the same domain and the same shader.
However, (a) takes 0.14s to complete while (b) takes only 0.05s.
What could cause this?
1 Reply

So, I figured out my main issue as well. I wasn't calling calResMap() and calResUnmap() on the output buffers before using them; I was only calling calResMap()/calResUnmap() to read the buffers back from the GPU after the kernel had completed. I was still getting the correct answer, though, so I'm still puzzled why this would affect kernel performance, other than maybe some weird caching issue.
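In sketch form, the change that fixed it was just touching each output resource once before the first launch (non-runnable CAL sketch; `outRes` is a hypothetical array holding the 32 output resources, and error checking is omitted):

```c
/* Map and immediately unmap each output resource once, before any
   calCtxRunProgram() call that writes to it. */
CALvoid *ptr;
CALuint pitch;
for (int i = 0; i < 32; ++i) {
    calResMap(&ptr, &pitch, outRes[i], 0);
    calResUnmap(outRes[i]);
}
```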