Archives Discussions

boxerab · ‎03-09-2017

Due to local memory limitations, I need to use global memory as a cache for my work items.

Suppose I have 1000 work groups with 64 work items each. Each work item needs a 4K cache, which doesn't need to persist after the work item completes.

I will allocate a single global memory buffer and assign one 4K chunk to each work item.

(I am targeting AMD GPUs)

What is the minimum size I would need to guarantee that there would not be any concurrency issues between work items?

Since AMD GPUs have <= 64 CUs, my guess is

64 * 128 * 4000 bytes, using (global work item ID % (64*128)) to assign a cache chunk to each work item.
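
A minimal sketch of the indexing proposed above, written as plain C so the arithmetic can be checked on the host. The names `NUM_CHUNKS`, `CHUNK_BYTES`, `cache_buffer_bytes` and `chunk_offset` are hypothetical and not part of the post. The sketch only shows the mapping; it does not establish that the mapping is free of collisions.

```c
#include <stddef.h>

/* Values taken from the post: 64 * 128 chunks of 4000 bytes each. */
#define NUM_CHUNKS  ((size_t)64 * 128)
#define CHUNK_BYTES ((size_t)4000)

/* Total size of the single global buffer, in bytes. */
static size_t cache_buffer_bytes(void) {
    return NUM_CHUNKS * CHUNK_BYTES;
}

/* Byte offset of the chunk that a given global work-item ID maps to. */
static size_t chunk_offset(size_t global_id) {
    return (global_id % NUM_CHUNKS) * CHUNK_BYTES;
}
```

With these values the buffer is 32,768,000 bytes, and work items 0 and 8192 map to the same chunk, so the scheme relies on those two never being in flight at the same time.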

Thanks,

Aaron

dipak · ‎03-10-2017

Hi Aaron,

Sorry, I don't understand the above calculation. I think the exact number depends on the device's capabilities as well as on the application itself. To avoid concurrency issues, you have to consider the maximum number of wavefronts that can be in flight at once.

Regards,

boxerab · ‎03-10-2017

Thanks, Dipak. What would be a safe max for in-flight wavefronts on current hardware? Also, can I guarantee that work items are scheduled in order of their global work item ID? If I can guarantee this, then if, let's say, M is the max in-flight wavefronts and each wavefront has 64 work items, then

(global work item id) % (M*64)

would give me a unique id into my cache that would prevent concurrency issues.

dipak · ‎03-13-2017

> What would be a safe max for in-flight wavefronts on current hardware?

On GCN devices, the theoretical maximum number of in-flight wavefronts is numOfCU * 4 * 10: each CU has 4 SIMD units, and each SIMD unit has an instruction buffer for 10 wavefronts.
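
The formula above can be turned into a quick host-side calculation. This is only a sketch under stated assumptions: the function names are hypothetical, the wavefront size of 64 comes from the question, and the limit is the theoretical one, so real occupancy may well be lower depending on the kernel's register and local memory usage. It bounds how many chunks could ever be needed at once; it does not say how to hand them out uniquely.

```c
#include <stddef.h>

#define SIMDS_PER_CU    4   /* GCN figure quoted in the reply */
#define WAVES_PER_SIMD  10  /* GCN figure quoted in the reply */
#define WAVEFRONT_SIZE  64  /* work items per wavefront, from the question */

/* Theoretical maximum number of wavefronts resident on the device. */
static size_t max_inflight_wavefronts(size_t num_cus) {
    return num_cus * SIMDS_PER_CU * WAVES_PER_SIMD;
}

/* Theoretical maximum number of work items resident on the device. */
static size_t max_inflight_work_items(size_t num_cus) {
    return max_inflight_wavefronts(num_cus) * WAVEFRONT_SIZE;
}

/* Upper bound on scratch bytes: one chunk per resident work item,
   but never more than one chunk per work item in the whole ND-range. */
static size_t scratch_upper_bound_bytes(size_t num_cus,
                                        size_t total_work_items,
                                        size_t chunk_bytes) {
    size_t resident = max_inflight_work_items(num_cus);
    size_t n = resident < total_work_items ? resident : total_work_items;
    return n * chunk_bytes;
}
```

For a 64-CU device this gives 2560 wavefronts, or 163,840 work items. The ND-range in the question has only 64,000 work items, which is fewer than that, so (assuming 4K means 4096 bytes) one chunk per work item indexed directly by global ID would need 262,144,000 bytes and would not share chunks at all.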

> Can I guarantee that work items are scheduled in order of their global work item ID?

I don't think you can assume in-order execution unless you create your own chain of dependencies, in which case your program may under-utilise the device.

Generally, an ND-range consists of multiple work-groups, which are distributed across multiple CUs. All work-items in a work-group execute on the same CU, but a CU can process multiple work-groups at a time. CUs operate independently of each other, and different SIMDs within a CU can execute different instructions. As a result, work-items should not be expected to be processed in order of their global ID.

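> If I can guarantee this, then if let's say M is max in-flight wavefronts, and each wavefront has 64 work items, then (global work item id) % (M*64) would give me a unique id into my cache that would prevent concurrency issues.

I think the answer is the same as above: since in-order scheduling cannot be assumed, (global work item id) % (M*64) is not guaranteed to give each in-flight work item a unique chunk.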
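
To illustrate why the modulo mapping depends on ordering, here is a small host-side check in C. It is a toy model, not a description of how the hardware schedules work: `modulo_slots_collide` is a hypothetical helper that takes a set of global IDs assumed to be in flight together and reports whether any two of them land on the same slot.

```c
#include <stdbool.h>
#include <stddef.h>

/* Returns true if any two of the in-flight global IDs map to the
   same slot under the mapping (id % num_slots). */
static bool modulo_slots_collide(const size_t *inflight_ids,
                                 size_t count,
                                 size_t num_slots) {
    for (size_t i = 0; i < count; ++i) {
        for (size_t j = i + 1; j < count; ++j) {
            if (inflight_ids[i] % num_slots == inflight_ids[j] % num_slots) {
                return true;
            }
        }
    }
    return false;
}
```

With 4 slots, the in-flight set {0, 1, 2, 3} has no collision. If item 1 finishes first and item 4 starts while item 0 is still running, the set becomes {0, 2, 3, 4}, and items 0 and 4 both map to slot 0.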

Regards,

boxerab · ‎03-13-2017

Thanks, Dipak. That clears things up for me.

Global memory for work item cache