Due to local memory limitations, I need to use global memory as a cache for my work items.
Suppose I have 1000 work groups with 64 work items each. Each work item needs 4K of cache. The cache doesn't need to persist after the work item completes.
I will allocate one single global memory buffer and assign one chunk of size 4K to each work item.
(I am targeting AMD GPUs)
What is the minimum buffer size that guarantees no two concurrently running work items are ever assigned the same chunk?
Since AMD GPUs have <= 64 CUs, my guess is
64 * 128 * 4000 bytes, using (global work item ID % (64*128)) to assign a cache chunk to each work item.
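
To make the arithmetic of that guess concrete, here is a minimal sketch in plain C (host-side arithmetic only, not an OpenCL kernel). The constant and function names are hypothetical, the values 64, 128 and 4000 are taken straight from the guess and are not confirmed hardware limits, and the sketch does not show that the scheme is safe.

```c
#include <assert.h>
#include <stddef.h>

/* All three values come from the guess in the question; they are assumptions. */
#define NUM_CUS      64u    /* assumed upper bound on compute units */
#define ITEMS_PER_CU 128u   /* assumed work items in flight per compute unit */
#define CHUNK_BYTES  4000u  /* the formula uses 4000 for "4K"; use 4096 for 4 KiB chunks */

/* Number of cache chunks in the pool: 64 * 128. */
static size_t pool_slots(void)
{
    return (size_t)NUM_CUS * ITEMS_PER_CU;
}

/* Total size of the single global buffer, in bytes. */
static size_t pool_bytes(void)
{
    return pool_slots() * CHUNK_BYTES;
}

/* Byte offset of the chunk assigned to a work item: (global ID % slots) * chunk size. */
static size_t chunk_offset(size_t global_id)
{
    size_t slots = pool_slots();
    assert(slots > 0);
    return (global_id % slots) * CHUNK_BYTES;
}
```

With these values the pool has 8192 chunks and the buffer is 32,768,000 bytes. Global IDs 0 and 8192 map to the same chunk, so the scheme only works if two such work items can never be in flight at the same time, which is exactly what I am unsure about.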