I would like to generate a separate stream of random numbers for each stream core on each SIMD engine. To do this, a thread needs to know which core and engine it is executing on. Alternatively, I would settle for being able to generate a global address per thread that is guaranteed conflict-free, but with the total number of globals being close to the total number of cores.
With one workgroup per SIMD, the SIMD ID is obviously the workgroup ID. If the workgroup size is larger than the number of thread cores, is the core ID the work-item ID mod 16? I.e., is it assured that threads 0, 16, 32, and 48 are all executed on the same thread core (assuming 16 thread cores per SIMD)?
Originally posted by: drstrip With one workgroup per SIMD, the SIMD ID is obviously the workgroup ID. If the workgroup size is larger than the number of thread cores, is the core ID the work-item ID mod 16? I.e., is it assured that threads 0, 16, 32, and 48 are all executed on the same thread core (assuming 16 thread cores per SIMD)?
Assuming you proceed in the way shown by Micah, you can get the core ID through the "get_group_id" function:
http://www.khronos.org/registry/cl/sdk/1.0/docs/man/xhtml/get_group_id.html
Originally posted by: Fr4nz
Assuming you proceed in the way shown by Micah, you can get the core ID through the "get_group_id" function:
http://www.khronos.org/registry/cl/sdk/1.0/docs/man/xhtml/get_group_id.html
I'm familiar with the function and was planning to use that to identify the thread within the work group. However, it is my understanding that I should make the workgroup size a multiple of the number of cores for maximum efficiency. In that case, the return value of this function will span a range larger than the number of thread cores, which brings us back to my original question. Let's say I make the workgroup size 64 and have 16 stream cores per SIMD. Do work_items 0-15 execute together, 16-31 together, etc.? Do work_items 0, 16, 32, 48 execute on the same core, while 1, 17, 33, 49 execute on another, etc.?
Originally posted by: drstrip
I'm familiar with the function and was planning to use that to identify the thread within the work group. However, it is my understanding that I should make the workgroup size a multiple of the number of cores for maximum efficiency. In that case, the return value of this function will span a range larger than the number of thread cores, which brings us back to my original question. Let's say I make the workgroup size 64 and have 16 stream cores per SIMD. Do work_items 0-15 execute together, 16-31 together, etc.? Do work_items 0, 16, 32, 48 execute on the same core, while 1, 17, 33, 49 execute on another, etc.?
Correct. The important thing is to use a work-group size equal to the size of a wavefront (which executes on a single SIMD engine and is 64 threads on the 5xxx series): that way you can be sure of what you're doing.
Originally posted by: Fr4nz
Correct. The important thing is to use a work-group size equal to the size of a wavefront (which executes on a single SIMD engine and is 64 threads on the 5xxx series): that way you can be sure of what you're doing.
And am I correct that the RV770 has a wavefront size of 64, as do the new Cypress chips?
And what if I need more work_items than number_of_SIMD_engines * max_work_group_size? Will work_group n be executed on SIMD engine (n mod number_of_SIMD_engines)?
E.g., for a 10-SIMD-engine chip,
workgroups 1, 11, 21, ... will execute on the same engine,
workgroups 2, 12, 22, ... will execute on the same engine, etc.
I haven't been able to come up with an experiment to test this conjecture, so if you have ideas ...
In the context of the code I'm working on, the kernel has no branches, so each execution should take the same time, modulo memory contention.
Also, my workgroup size is equal to wavefront size, so presumably that means a workgroup would "fill" a SIMD, right?
Why don't you add a get_compute_unit_id() to the OpenCL 1.1 spec? That would be fantastic, especially for RNG and also for debugging!
Originally posted by: MicahVillmow Up to ~24.8 wavefronts can fit on a single SIMD depending on resource constraints, so it depends on the scheduling mode how the SIMDs receive wavefronts.
In the case of round-robin assignment, am I correct in interpreting your statement to mean that more than one wavefront can be assigned to a SIMD at the same time? If so, is there any way to predict how execution is interleaved among the wavefronts? If a thread in one wavefront does a read-op-write sequence to some global location based on its SIMD and local_id, can this sequence conflict with another wavefront on the SIMD with the same local_id?
In the case of schedule-till-filled, we have the question above, plus the question of how we tell how many work-groups have been assigned to the SIMD.
As bubu writes, a get_compute_unit_id() function would be great, though it will almost certainly take a long time to appear, even if agreed on tomorrow. It also still requires answers to the questions above about conflicts between wavefronts assigned to the same SIMD.