Further questions:
Can someone shed light on compute shaders? This is what I understood :
a) Only RV770 supports "compute shaders" specified in IL using il_cs_2_0.
b) One cannot access the render backends o0 to o7 from compute shaders.
c) Moreover the threading model in compute shaders is one dimensional. One can access the absolute thread id in a compute shader using vaTid0.x but cannot access vWinCoord0.xy in compute shaders. Similarly one cannot access vaTid0.x in pixel shaders.
d) LDS can only be accessed from compute shaders. Pixel shaders cannot use LDS. Compute shaders can write to LDS or to global memory but not to o0 to o7.
e) To run a compute shader based kernel, one needs to use the extension calCtxRunProgramGrid.
a) Any HD4XXX series card should support compute shaders; however, there are hardware constraints that cause the behavior to differ between cards. Your best bet is the HD4850, HD4870, HD4870x2, or FireStream 9250, as they all have the same hardware constraints.
b) This is correct; the only method of writing data out is to use scatter. If you look at export_burst_perf and compare it to the performance of a similar pixel shader using color buffers, you will see that the peak throughput is better in the compute shader.
c) This is correct; however, a similar index called vObjIndex0.x is also available in pixel shaders, which is almost equivalent to vaTid0.x.
d&e) These are both correct.
If you have any more LDS/Compute shader specific questions let me know.
I wonder if I could ask for a bit more information on compute shaders and LDS. (I am trying to understand which wavefronts see what in "absolute" mode, in order to decide whether or not to try and exploit this.)
What is the reason for the splitting of threads into groups, when e.g. LDS seems to operate on a "wavefront" basis?
Is it that groups are introduced simply for ease of indexing threads, or say is each wavefront in a group guaranteed to operate on the same SIMD?
Does only one group operate per SIMD, or can multiple groups operate on a SIMD at the same time? If the latter, does the LDS share across groups in absolute mode?
Does "absolute" sharing mode share across SIMDs?
calCtxRunProgramGrid() takes two "domain" parameters. The first, "block", is effectively 1D and often seems to be set to 64, presumably one wavefront. What happens if you set this to less than 64? Do you waste processing, or do multiple blocks run in one wavefront?
The second, "grid" parameter, seems often to be set as effectively 1D too. Is this compulsory?
calCtxRunProgramGridArray() seems to run multiple kernels at a time. Are they guaranteed to execute in any particular order, with, say, the results of the former being visible to the latter?
Thanks a lot,
Threads are arranged in groups in a sequential manner. For example, if you request that 1024 threads be executed in groups of 512, threads 0-511 are in the first group, 512-1023 are in the second, etc. The one-dimensional nature of the thread information is a hardware constraint, but the IDs can be translated to either 2D or 3D addresses fairly easily using basic math, with constant buffers holding the arguments.
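As a sketch of that translation (the function and parameter names here are illustrative, not CAL/IL names), the constant buffer would simply hold the domain's width and height:

```python
# Sketch: translating a 1D absolute thread id (vaTid0.x) into 2D/3D
# coordinates, given domain dimensions passed in via a constant buffer.
# Names (width, height) are assumptions for illustration.

def tid_to_2d(tid, width):
    """Map a flat thread id to (x, y) in a width-wide domain."""
    return tid % width, tid // width

def tid_to_3d(tid, width, height):
    """Map a flat thread id to (x, y, z) in a width x height slice."""
    x = tid % width
    y = (tid // width) % height
    z = tid // (width * height)
    return x, y, z
```

In IL this is the same mod/div math (or shifts and masks when the dimensions are powers of two) applied to vaTid0.x against the constant-buffer values.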
Your understanding of LDS is correct, it is a write-private read-anywhere model.
There are two examples in the SDK of LDS being used. One uses it to share filter data between all the threads in a group (NLM_Denoise_Compute), and the other uses it to do a block-matrix transpose (LDS_Transpose). There are also some basic bandwidth tests in the samples/runtime directory.
Groups are used to represent the threads in your program that can communicate with each other on the same SIMD. All threads in a group are sequentially allocated and guaranteed to run on the same SIMD, so each wavefront in the group runs on the same SIMD. This gives you rudimentary control over scheduling. In most cases, multiple groups will run on a SIMD in parallel as long as there are enough resources to allow it; the main constraint here is the amount of LDS memory used per group.

When addressing into the LDS, the addresses you specify are translated into memory addresses by the hardware. The easiest way to think of it is through the formula: groupId * offset_1 + threadId * offset_2 + offset_3. offset_3 is what is specified by the kernel: a hardcoded literal in writes and the y component in reads. In relative addressing mode, offset_1 is the group size and offset_2 is the lds_size_per_thread. In absolute addressing mode, offset_1 is set to 0 and threadId is replaced by threadId % wavefrontSize. So, absolute addressing mode limits all threads in a group to the data of the first wavefront. This is a per-SIMD setting.
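The address formula above can be written out directly. This is a sketch following the description as stated (a wavefront size of 64 is assumed for RV770), not actual hardware code:

```python
WAVEFRONT_SIZE = 64  # assumption: RV770 wavefront size

def lds_address(group_id, thread_id, offset_3, group_size,
                lds_size_per_thread, absolute=False):
    """Translate an LDS access into a memory address per the formula
    groupId * offset_1 + threadId * offset_2 + offset_3, as described
    in the reply above."""
    if absolute:
        # Absolute mode: offset_1 = 0, and the thread id wraps at the
        # wavefront size, so every wavefront in the group aliases the
        # first wavefront's data.
        offset_1 = 0
        thread_id = thread_id % WAVEFRONT_SIZE
    else:
        # Relative mode, per the description: offset_1 is the group size,
        # offset_2 is the lds_size_per_thread.
        offset_1 = group_size
    offset_2 = lds_size_per_thread
    return group_id * offset_1 + thread_id * offset_2 + offset_3
```

Note how in absolute mode two threads exactly one wavefront apart in the same group resolve to the same address, which is the aliasing Micah describes.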
For the HD4XXX series of graphics cards, the SDK currently only supports 1D groups. The block parameter must be equal to your group size, and the total number of threads launched must be an integer multiple of the block size. So, if you have a block size of 64 and you want to run 100 threads, you have to specify a grid size of 2 blocks. If the block size is not a multiple of the wavefront size, you waste resources.
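That block/grid arithmetic can be sketched as a ceiling division (the helper name is mine, not a CAL API):

```python
def grid_size_for(num_threads, block_size):
    """Smallest grid (counted in blocks) that covers num_threads,
    since the total launched is always a multiple of the block size."""
    return (num_threads + block_size - 1) // block_size  # ceiling division
```

For 100 threads with a block size of 64, this gives a grid of 2 blocks, i.e. 128 threads launched with 28 of them idle.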
The order in which you pass kernels to calCtxRunProgramGridArray is the order that they are executed.
Hope this helps
OK, while we are at it, could you explain all the barrier types. The documentation is nothing if not confusing about these instructions.
- If I don't specify _threads, does this mean that all threads in a warp are synchronised?
- Do they behave like thread barriers? (ie. all threads must meet the barrier before continuing).
And regarding LDS memory, what is the imagined usage case for absolute thread memory addressing? How can you know how many groups will be scheduled per SIMD? My understanding is that all groups/wavefronts on a single SIMD will share the same section of LDS on that SIMD, while each SIMD has an independent LDS.
Edit: Another question while I'm at it. Does calCtxRunProgramGridArray reduce the kernel invocation overhead? (i.e. what is the reason to use this rather than multiple independent invocations?)
For the fence instruction, there must be a minimum of one flag used, and using up to all of them is OK; the ordering does not matter. The _lds flag inserts a barrier such that no thread in the group can execute an LDS instruction past the barrier until all threads reach the fence instruction. _threads makes sure that no threads execute beyond that point and no instructions are scheduled around it. _memory is the same as _threads but for memory operations, and _sr is for shared-register operations.
fence instructions are barriers and the flags are barriers for specific types. So fence_memory would be a memory barrier, fence_threads would be a thread barrier, etc...
calCtxRunProgramGridArray doesn't reduce kernel invocation overhead, but it does guarantee that all the kernels execute sequentially with no interruption from the OS. This allows for persistent data across kernel calls via SR/LDS.
The use case for absolute addressing is matrix transpose within a group using _neighborExch, so you can write and read without requiring a barrier. There are issues with this approach, however, as the compiler does not guarantee that writes and reads happen sequentially.
The number of groups scheduled per SIMD is the max threads per group divided by the threads per group. On the HD48XX series, there is a max group size of 1024.
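As a sketch of that calculation (1024 is the HD48XX maximum group size quoted above):

```python
MAX_GROUP_SIZE = 1024  # HD48XX max group size, per the reply above

def groups_per_simd(threads_per_group):
    """Groups scheduled per SIMD = max threads per group divided by
    the actual group size."""
    return MAX_GROUP_SIZE // threads_per_group
```

So a group size of 64 would allow up to 16 groups per SIMD, while a group size of 1024 pins exactly one group to a SIMD.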
Thanks Micah for the quick reply.
My confusion arises from the following line within the IL documentation:
Use at least one of the selections _lds, _memory, or _sr. The fence_threads itself is meaningless and illegal. (see section 6-32 of the IL spec)
From this (and the rest of that documentation) I infer that _threads causes synchronisation across all threads within a group, with something else (unknown to me) happening otherwise (perhaps across warp only?).
The documentation indicates that these instructions do not behave like a CUDA __syncthreads() barrier, and I don't see how they can behave like usual CPU memory barriers. This is why I'm hoping for some more information...
Thanks a lot for your helpful reply; much appreciated. Malcolm3141 has raised some interesting questions about synchronization related to what I've been thinking about too:
Just to confirm, in absolute mode, is LDS memory definitely per SIMD and not per group? (i.e. could one imagine inter-group communication with it?)
Do fence instructions apply per group, or per SIMD? (In particular say the _lds one in the case when there might be more than one group per SIMD?)
The usage I had in mind for absolute addressing mode is as follows. Imagine I have, e.g., a 4x4 matrix I want to multiply into a 4x8192 matrix. Using an "if", can I get the 4x4 matrix loaded once per SIMD into the LDS (say by only one wavefront) and then get each of 8192/4 = 2048 threads to multiply it into the corresponding 4x4 submatrix of the 4x8192 one? This could reduce memory bandwidth by 1/3, since each thread would only need 2 4x4-matrix reads or writes rather than 3 (if each thread also had to read the original 4x4 matrix itself). Does this seem sensible?
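The claimed saving is just a count of 4x4-block transfers per thread; a toy check of that arithmetic (not a measurement of anything):

```python
# Toy count of 4x4-block memory transfers per thread for the scheme above.

def blocks_moved_per_thread(shared_4x4_in_lds):
    """Each thread reads its 4x4 submatrix and writes its 4x4 result
    (2 transfers); add a third read of the shared 4x4 matrix when it
    is not held in LDS."""
    return 2 if shared_4x4_in_lds else 3

saving = 1 - blocks_moved_per_thread(True) / blocks_moved_per_thread(False)
```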
Of course, 16x16 might be better than 4x4 if using LDS. If I really were using 4x4 blocks, could I use shared registers in a similar manner? Are they per SIMD or per group, and are the fences per SIMD or per group?
I'm looking into the fence instruction and am going to get a better explanation written up, so that will have to wait.
For what you are describing, it would probably be better to use shared registers and calCtxRunProgramGridArray. You would run two passes: in the first pass you run one group per SIMD and write the 4x4 matrix to the SR registers, and in the second pass you do the matrix multiply. Alternatively, if your LDS access is read-only, you can store the data in LDS. However, SR registers are much faster than LDS.
LDS in either addressing mode is per SIMD. Inter-group communication is currently not possible because writes are based on the addressing mode and are not indexable, and reads are constrained by the compiler. The compiler also makes no guarantees on how many groups will run per SIMD; that is based on some simple heuristics that could change in future revisions.
So, to run one group per SIMD to set up the shared registers, do I have to run at least as many thread groups as there are SIMDs and then either demand 1024 threads per group or 16 kB of relative LDS per group (to stop more than one group running on the same SIMD), or is there an easier way?
For calCtxRunProgramGridArray to work, the setup kernel has to have exactly the same group size, shared-register allocation size, and LDS size. The only difference should be what the kernel does; the setup code should be equivalent, and the number of threads being executed should be enough to completely fill up the card once. So, if your main kernel has a group size of 64, uses 3 shared registers, and has an LDS size of 16, then you have to run 40 groups: 1 group fills up 1/4th of the LDS and you want to fill the full LDS, so 64/16 = 4 groups per SIMD, and with 10 SIMDs, 10 * 4 = 40 groups.
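The worked example above (group size 64, LDS size 16, 10 SIMDs) can be checked with a little arithmetic. The "group_size / lds_size groups fill the LDS" relation is taken from the reply, not from a datasheet:

```python
NUM_SIMDS = 10  # RV770 SIMD count, per the reply above

def groups_to_fill_card(group_size, lds_size):
    """Per the worked example: group_size / lds_size groups fill one
    SIMD's LDS; multiply by the SIMD count to fill the card once."""
    groups_per_simd = group_size // lds_size
    return groups_per_simd * NUM_SIMDS
```

With group size 64 and LDS size 16 this reproduces the 40-group figure from the reply.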
I'm just wondering what the absolute thread id register contains. Is it the index of the thread on the SIMD? (For instance, 0 for the thread with vTid 0 of the first group running on a SIMD, groupsize for the thread with vTid 0 of the second group, and so on.)
vaTid, the absolute thread id, contains the ID of the thread within the execution domain. For example, if you are running 1024x1024 threads, the value is between 0 and (1024*1024 - 1). This is an auto-incremented value, so thread 0 is the first thread to be spawned, followed by thread 1, etc., with thread (1024*1024 - 1) being the last one spawned.
vTid, the group thread id, contains the ID of the thread within the group.
vTgroupid, the group id, contains the ID of the group.
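In other words, for the 1D groups currently supported, the three registers are related by simple arithmetic; a sketch (Python, not IL):

```python
def split_tid(vaTid, group_size):
    """Recover (vTgroupid, vTid) from the absolute thread id, assuming
    the 1D sequential group allocation described above."""
    return vaTid // group_size, vaTid % group_size

def merge_tid(vTgroupid, vTid, group_size):
    """Inverse relation: vaTid = vTgroupid * group_size + vTid."""
    return vTgroupid * group_size + vTid
```

For example, with groups of 512, absolute thread 530 is thread 18 of group 1.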