Due to local memory limitations, I need to use global memory as a cache for my work items.
Suppose I have 1000 work groups with 64 work items each. Each work item needs 4K of cache. The cache doesn't need to persist after the work item completes.
I will allocate one single global memory buffer and assign one chunk of size 4K to each work item.
(I am targeting AMD GPUs)
What is the minimum size I would need to guarantee that there would not be any concurrency issues between work items?
Since AMD has <= 64 CUs, my guess is 64 * 128 * 4000 bytes, using (global work item ID % (64*128)) to assign a cache chunk to each work item.
Thanks,
Aaron
Hi Aaron,
Sorry, I don't understand the above calculation. I think the exact number depends on the device capability as well as on the application itself. You have to consider the maximum number of in-flight wavefronts to avoid any kind of concurrency issue.
Regards,
Thanks, Dipak. What would be a safe max for in-flight wavefronts on current hardware? Also, can I guarantee that work items are scheduled in order of their global work item id? If I can guarantee this, then if, let's say, M is the max number of in-flight wavefronts and each wavefront has 64 work items, then
(global work item id) % (M*64)
would give me a unique id into my cache that would prevent concurrency issues.
What would be a safe max for in-flight wavefronts on current hardware?
On a GCN device, the theoretical maximum number of in-flight wavefronts is numOfCU * 4 * 10: each CU has 4 SIMD units, and each SIMD unit has an instruction buffer for 10 wavefronts.
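As a rough sketch of the arithmetic above, the following C helpers turn the formula into a scratch-buffer size. The function names and the constants are illustrative assumptions, not part of any AMD or OpenCL API; in a real host program the CU count would typically come from `clGetDeviceInfo` with `CL_DEVICE_MAX_COMPUTE_UNITS`, which is not shown here.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed GCN figures, taken from the formula quoted in this reply. */
enum {
    SIMDS_PER_CU   = 4,   /* SIMD units per compute unit */
    WAVES_PER_SIMD = 10,  /* wavefronts each SIMD can keep in flight */
    WAVEFRONT_SIZE = 64   /* work-items per wavefront */
};

/* Theoretical upper bound on wavefronts in flight at once (hypothetical helper). */
static size_t max_inflight_wavefronts(size_t num_cu)
{
    assert(num_cu > 0);
    return num_cu * SIMDS_PER_CU * WAVES_PER_SIMD;
}

/* Bytes needed so that every potentially in-flight work-item has its own chunk. */
static size_t scratch_bytes(size_t num_cu, size_t bytes_per_item)
{
    return max_inflight_wavefronts(num_cu) * WAVEFRONT_SIZE * bytes_per_item;
}
```

With the hypothetical values `num_cu = 64` and `bytes_per_item = 4096`, `max_inflight_wavefronts` gives 2560 and `scratch_bytes` gives 671,088,640 bytes. That is more than the 1000 * 64 * 4096 = 262,144,000 bytes needed to give every work item in the question its own chunk, so for that launch size the theoretical bound does not save memory.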
can I guarantee that work items are scheduled in order of their global work item id?
I don't think you can assume in-order execution unless you create your own chain of dependencies. If you do that, your program may under-utilise the capability of the device it runs on.
Generally, an ND-range consists of multiple work-groups, and the work-groups are assigned to multiple CUs. All work-items in a work-group are executed on the same CU; however, a CU can process multiple work-groups at a time. CUs operate independently of each other. It is also possible for different SIMDs within a CU to execute different instructions. As a result, work-items should not be expected to be processed in order of their global id.
If I can guarantee this, then if let's say M is max in-flight wavefronts, and each wavefront has 64 work items, then (global work item id) % (M*64) would give me a unique id into my cache that would prevent concurrency issues.
I think the answer is the same as for the previous question.
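To illustrate why the modulo scheme depends on ordering, here is a small host-side C sketch that checks whether a given set of simultaneously in-flight global ids would share a cache slot. The helper names are hypothetical, and the set of in-flight ids is an input supplied by the caller, not something the runtime reports.

```c
#include <assert.h>
#include <stddef.h>

/* Slot chosen by the modulo scheme from the question: id % (M * 64). */
static size_t slot_for(size_t global_id, size_t num_slots)
{
    assert(num_slots > 0);
    return global_id % num_slots;
}

/* Returns 1 if any two in-flight work-items map to the same slot, else 0. */
static int has_collision(const size_t *inflight_ids, size_t count,
                         size_t num_slots)
{
    for (size_t i = 0; i < count; ++i) {
        for (size_t j = i + 1; j < count; ++j) {
            if (slot_for(inflight_ids[i], num_slots) ==
                slot_for(inflight_ids[j], num_slots)) {
                return 1;
            }
        }
    }
    return 0;
}
```

For example, with 163,840 slots (M = 2560 wavefronts of 64 work items), ids 0 and 163840 map to the same slot, so `has_collision` returns 1 if both happen to be in flight together. A contiguous run such as 163839, 163840, 163841 maps to slots 163839, 0 and 1 and does not collide, which is why the scheme would only be safe if the in-flight ids always formed such a window.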
Regards,
Thanks, Dipak. That clears things up for me.