
sgratton
Adept I

Do calResMap/Unmap block and are different kernels overlapping on the DPP?

Hi there,

I wonder if anybody could help me understand a possible synchronization issue in a global-memory-based matrix program I've written (the performance is unfortunately poor, but it's been a good learning exercise...). The main matrix is stored in a global buffer, and multiple kernels (each in its own module, but all in the same context) operate on it repeatedly, controlled by a parameter in a constant buffer that the CPU updates each time. The reason for using multiple kernels is to enforce synchronization: each kernel needs the update that the previous one makes to the global buffer in order to function correctly.

The program structure is:

compile and link kernels
allocate global and constant resources
map/setup/unmap global and constant resources
get the resources into the context
set up a module in the context for each kernel
associate the memory resources with the modules

while
{
change const buffer using calResMap/Unmap
kernel1
optional wait1 using calCtxIsEventDone
kernel2
optional wait2 using calCtxIsEventDone
kernel3
optional wait3 using calCtxIsEventDone
kernel4
optional wait4 using calCtxIsEventDone
}
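For concreteness, the loop body above might look roughly like the following in CAL calls. This is a sketch only, not compiled code: the call signatures are from the CAL 1.x API as I understand it, and `ctx`, `constRes`, `func1`, and `domain` are placeholders for the handles created during setup.

```
/* Sketch only -- treat as pseudocode. ctx, constRes, func1 and
   domain stand in for the handles created during setup. */
CALvoid  *ptr;
CALuint   pitch;
CALevent  ev;

/* calResMap blocks until the resource is safe to touch on the CPU;
   a mapped resource must not be used by a kernel. */
calResMap(&ptr, &pitch, constRes, 0);
/* ... write the new iteration parameter into *ptr ... */
calResUnmap(constRes);

/* Kernel launches are asynchronous: calCtxRunProgram only queues
   the work and hands back an event. */
calCtxRunProgram(&ev, ctx, func1, &domain);

/* optional wait1: spin until kernel1 has finished */
while (calCtxIsEventDone(ctx, ev) == CAL_RESULT_PENDING)
    ;

/* ... likewise for kernels 2-4 ... */
```

The question below is essentially whether the blocking behaviour of calResMap/calResUnmap, plus the optional spin-waits, is enough to order the kernels correctly.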

shutdown


My first question is: Is it valid to change a resource in the loop in this way, and expect succeeding kernels to see the new values?

It does seem to work. I am relying on the documented statement that a mapped resource cannot be used by a CAL kernel, which suggests kernel1 cannot run until after calResUnmap returns, and on the assumption that the memory is actually updated before calResUnmap returns. Are these two assumptions true?

(If this is incorrect, I'm thinking of trying a local and a remote constant buffer instead: updating the remote buffer and using calMemCopy followed by a wait to refresh the local one, as an alternative to calResMap/Unmap on the single buffer. Is this any better?)

By adding and removing wait4 and wait1 and comparing answers, I seem to find that a kernel will only start after calResUnmap completes, but that calResMap won't necessarily wait until a kernel completes.

Assuming this is okay, however, I only get (what appear to be more correct...) answers with an additional wait between kernels 2 and 3.

This suggests either

A: kernel2 isn't finishing its global writes before kernel3 starts reading, or
B: kernel2 and kernel3 are actually overlapping their execution on the device

Is either of these the case?


Any advice much appreciated!

Thanks a lot,
Steven.

Edit: In case it helps, I've put the code up (cpp & IL) here. I plan to tidy it up shortly...






2 Replies
sgratton
Adept I


Hi there,

I've now revised my code to use local buffers, and a similar issue with correctness persists in the absence of waits.

After rereading the CAL documentation, I am now wondering if the issue has to do with (not) flushing and invalidating caches on the GPU. This is mentioned in passing on p. 7 of the CAL programming guide, but I can't find a clear statement of when this happens, or of whether waiting on calCtxIsEventDone() or calling calCtxFlush() is guaranteed to do it. (My program seems to function correctly with either, but is slightly faster using the latter.) Perhaps somebody could clarify this?

Thanks a lot,
Steven.


Steven,
1) It is valid to change a constant buffer between kernel calls. A kernel waits for blocking memory operations to finish; only with asynchronous calls does the programmer have to make the wait explicit.
2) What you are seeing in the interaction between a kernel call and calResMap is correct. You must wait for the kernel call to finish before mapping, because the kernel call is asynchronous and calResMap is not.
3) As for kernels 2 and 3 possibly overlapping, this is a possibility. It depends on whether you are using the same buffer for input and output, and on whether you are doing uncached or cached reads/writes. Since the GPU does not have a unified memory hierarchy between input and output, this needs to be considered when designing an algorithm.
4) calCtxIsEventDone forces the GPU to execute the kernel if it has not started yet, and, when called in a loop, waits for the kernel to finish, which is similar to what calCtxFlush does. I'm not familiar enough with what calCtxFlush does to fully clarify the differences between the two.
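To illustrate point 4, the usual CAL wait idiom combines a flush with a spin on the event. Again this is a sketch rather than compiled code; `ctx`, `func`, and `domain` are assumed to be valid handles from setup.

```
/* Sketch only -- treat as pseudocode. Queue the kernel, push the
   queue to the GPU, then spin until the event reports completion. */
CALevent ev;
calCtxRunProgram(&ev, ctx, func, &domain);
calCtxFlush(ctx);   /* submits queued work without waiting for it */
while (calCtxIsEventDone(ctx, ev) == CAL_RESULT_PENDING)
    ;               /* busy-wait; returns CAL_RESULT_OK when done */
```

The practical difference is that calCtxFlush only submits work and returns immediately, while looping on calCtxIsEventDone both forces submission (if needed) and acts as a completion wait.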