OK, while we're at it, could you explain all the barrier types? The documentation is nothing if not confusing about these instructions.
More specifically:
And regarding LDS memory, what is the intended use case for absolute thread memory addressing? How can you know how many groups will be scheduled per SIMD? My understanding is that all groups/wavefronts on a single SIMD will share the same section of LDS on that SIMD, but that each SIMD will have an independent LDS.
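For comparison, this is my mental model of the analogous scoping on the CUDA side (a sketch only: CUDA's __shared__ memory corresponds roughly to LDS and a thread block roughly to a group; the kernel name and sizes are invented for illustration):

```cuda
// Sketch: in CUDA, __shared__ memory is logically private to each thread
// block (group), regardless of how many blocks the hardware co-schedules
// on one multiprocessor -- the programmer never addresses another block's
// allocation. Kernel name and the tile size of 256 are invented here.
__global__ void stage_and_shift(const float *in, float *out)
{
    __shared__ float tile[256];          // one logical allocation per block

    int lid = threadIdx.x;               // index within the group
    int gid = blockIdx.x * blockDim.x + lid;

    tile[lid] = in[gid];                 // each thread stages one element
    __syncthreads();                     // wait for all writes in this block

    // Safe neighbouring read: the barrier guarantees every thread's write
    // to tile[] is visible within the block.
    out[gid] = tile[lid] + tile[(lid + 1) % 256];
}
```

The contrast with the IL question is that CUDA never exposes absolute addressing into another block's shared memory, which is exactly what makes the IL facility puzzling to me.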
Edit: Another question while I'm at it. Does calCtxRunProgramGridArray reduce the kernel invocation overhead? (i.e. what is the reason to use this rather than multiple independent invocations?)
Thanks, Malcolm
Thanks Micah for the quick reply.
My confusion arises from the following line in the IL documentation:
"Use at least one of the selections _lds, _memory, or _sr. The fence_threads itself is meaningless and illegal." (s. 6-32, IL spec)
From this (and the rest of that documentation) I infer that the _threads modifier causes synchronisation across all threads within a group, and that something else (unknown to me) happens otherwise (perhaps synchronisation across the wavefront only?).
The documentation indicates that these instructions do not behave like a CUDA __syncthreads() barrier, and I don't see how they can behave like usual CPU memory barriers. This is why I'm hoping for some more information...
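To make the distinction I mean concrete, here is a CUDA sketch (kernel and variable names invented) of the difference between a full barrier like __syncthreads() and an instruction that is only a memory fence:

```cuda
// Sketch of the barrier-vs-fence distinction (names invented for
// illustration). __syncthreads() is an execution barrier AND a memory
// fence: no thread in the block proceeds until all have arrived, and
// prior shared/global writes are visible afterwards.
// __threadfence_block() only orders the calling thread's own memory
// operations; it does NOT make any other thread wait.
__global__ void barrier_vs_fence(int *flag, int *data, int *result)
{
    if (threadIdx.x == 0) {
        data[0] = 42;
        __threadfence_block();   // order my writes: data[0] before flag[0]
        flag[0] = 1;             // but no other thread is forced to wait
    }

    __syncthreads();             // every thread in the block stops here

    if (threadIdx.x == 1)
        result[0] = data[0];     // safe: the barrier made the write visible
}
```

My question is essentially which of these two behaviours (if either) the IL fence variants provide.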
Thanks, Malcolm