Hi there,
I wonder if anybody could help me understand a possible synchronization issue I'm having with a global-memory-based matrix program I've written (the performance is poor, unfortunately, but it's been a good learning exercise...). Basically, the main matrix is stored in a global buffer, and multiple kernels (each in its own module, but all in the same context) operate on it repeatedly, controlled by a parameter in a constant buffer that the CPU updates on each pass. The reason for using multiple kernels is to enforce synchronization: each kernel needs the update that the previous one made to the global buffer in order to function correctly.
The program structure is:
compile and link kernels
allocate global and constant resources
map/setup/unmap global and constant resources
get the resources into the context
set up a module in the context for each kernel
associate the memory resources with the modules
while
{
    change const buffer using calResMap/calResUnmap
    kernel1
    optional wait1 using calCtxIsEventDone
    kernel2
    optional wait2 using calCtxIsEventDone
    kernel3
    optional wait3 using calCtxIsEventDone
    kernel4
    optional wait4 using calCtxIsEventDone
}
shutdown
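In case it makes the structure more concrete, here is a minimal C++ sketch of the loop with every wait enabled. It does not call CAL at all: `MockContext`, `runKernel`, `isEventDone`, `waitFor` and `runLoop` are made-up stand-ins so the snippet compiles on its own, and the four lambdas are placeholders for my real kernels. The mock device also finishes work strictly in submission order, which is an assumption and exactly the property I'm unsure about on the real hardware.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical stand-ins for the runtime; none of these are real CAL calls.
enum MockResult { MOCK_OK, MOCK_PENDING };
using MockEvent = std::size_t;  // 1-based submission index

struct MockContext {
    std::vector<std::function<void()>> queue;  // kernels in submission order
    std::size_t completed = 0;                 // how many have finished
};

// Queue a kernel and return an event identifying it (stands in for a run call).
MockEvent runKernel(MockContext& ctx, std::function<void()> kernel) {
    ctx.queue.push_back(std::move(kernel));
    return ctx.queue.size();
}

// Stands in for an event query. Assumption: each poll lets the mock device
// finish at most one queued kernel, so work completes strictly in order.
MockResult isEventDone(MockContext& ctx, MockEvent ev) {
    if (ctx.completed < ctx.queue.size()) {
        ctx.queue[ctx.completed]();
        ++ctx.completed;
    }
    return ctx.completed >= ev ? MOCK_OK : MOCK_PENDING;
}

// Blocking wait built from the polling call (wait1..wait4 in the outline).
void waitFor(MockContext& ctx, MockEvent ev) {
    assert(ev <= ctx.queue.size());  // the event must refer to a submitted kernel
    while (isEventDone(ctx, ev) == MOCK_PENDING) {
    }
}

// Runs the loop for `iterations` passes and returns the final "global buffer".
std::vector<int> runLoop(int iterations) {
    MockContext ctx;
    std::vector<int> global(4, 0);  // stands in for the global matrix
    int constParam = 0;             // stands in for the constant buffer

    for (int i = 0; i < iterations; ++i) {
        constParam = i + 1;  // host-side update (map/write/unmap in the real code)

        // Each kernel reads what the previous one wrote, so order matters.
        waitFor(ctx, runKernel(ctx, [&] { global[0] += constParam; }));
        waitFor(ctx, runKernel(ctx, [&] { global[1] = global[0] * 2; }));
        waitFor(ctx, runKernel(ctx, [&] { global[2] = global[1] + 1; }));
        waitFor(ctx, runKernel(ctx, [&] { global[3] += global[2]; }));
    }
    return global;
}
```

Calling `runLoop(2)` from a `main` runs two passes of the four placeholder kernels. Removing individual `waitFor` calls corresponds to dropping wait1..wait4 in the outline, although this mock cannot reproduce any overlap because it never runs two kernels at once.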
My first question is: is it valid to change a resource inside the loop in this way and expect subsequent kernels to see the new values?
It does basically seem to work; I am relying on the statement that a mapped resource cannot be used in a CAL kernel, suggesting kernel1 cannot run until after calResUnmap returns, and on the assumption that the memory is actually altered before calResUnmap returns. Are these two assumptions true?
(If this is incorrect, I'm thinking of introducing a local and a remote constant buffer: updating the remote buffer, then using calMemCopy followed by a wait to update the local buffer, as an alternative to using calResMap/calResUnmap on the single buffer alone. Is this any better?)
By adding and removing wait4 and wait1 and comparing answers, I seem to be finding that a kernel will only start after calResUnmap completes, but that calResMap won't necessarily wait until a kernel completes.
Assuming this is okay, however, I only get what appear to be more correct answers when I add an additional wait between kernels 2 and 3.
This suggests either:
A: kernel2 isn't finishing its global writes before kernel3 starts reading, or
B: kernel2 and kernel3 are actually overlapping their execution on the device
Is either of these the case?
Any advice much appreciated!
Thanks a lot,
Steven.
Edit: In case it helps, I've put the code up (cpp & IL) here. I plan to tidy it up shortly...