I'm facing the following problem: I have to use LDS for a relatively long time, and I also need to gather/scatter data across all of the LDS memory.
Scheduling more than one kernel is not an option, because I'd have to run 1024 [parallel LDS jobs] interleaved with 1024 [LDS gather operations]. In the final version I'm going to need 192K [parallel LDS jobs] per second, so that really isn't a job for clEnqueue calls.
WorkGroupSize = 64, total WorkItems = 4x the number of GPU streams; it is guaranteed that all WorkItems fit in LDS.
I tried it this way:
int dstCnt = LoopIdx * cb->GroupCnt; // counter value to wait for once all workgroups have done their atomic_inc
atomic_inc(&(out->globalCntr));      // increment for this workgroup
But I don't really trust this (I don't know whether caching can interfere with it), and it's kind of slow.
Is there a way to use GDS for this?
Also, as a side question: the gather operation will sum up float values. Is it a good idea to convert the floats to integers and sum them with atomic_add? Or is there a way to sum floats atomically?
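To make the side question concrete, here is a minimal sketch of the two options as I understand them. It is written as portable C11 so it can be run on the host; it is not my kernel code. All names (atomic_add_float, load_float, atomic_add_fixed, fixed_to_float, FIXED_SCALE) are made up for illustration, and the 16 fractional bits are just an assumed scale. In a kernel I assume the first option would map to a loop around atomic_cmpxchg with as_uint/as_float, and the second to a plain atomic_add on an int.

```c
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

/* Option 1: float add built from a 32-bit compare-and-swap loop.
 * The float is stored as its raw bit pattern in an atomic uint32_t.
 * Returns the value that was stored before the add. */
static float atomic_add_float(_Atomic uint32_t *target, float value)
{
    uint32_t old_bits = atomic_load(target);
    for (;;) {
        float old_val, new_val;
        uint32_t new_bits;
        memcpy(&old_val, &old_bits, sizeof old_val);   /* bits -> float */
        new_val = old_val + value;
        memcpy(&new_bits, &new_val, sizeof new_bits);  /* float -> bits */
        /* On failure old_bits is refreshed with the current contents,
         * so the loop retries against the latest value. */
        if (atomic_compare_exchange_weak(target, &old_bits, new_bits))
            return old_val;
    }
}

/* Reads the accumulated float back out of its bit pattern. */
static float load_float(_Atomic uint32_t *target)
{
    uint32_t bits = atomic_load(target);
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Option 2: convert to fixed point and use an integer atomic add.
 * FIXED_SCALE is a hypothetical choice (16 fractional bits); the real
 * scale depends on the value range and on how many terms are summed. */
#define FIXED_SCALE 65536.0f

static void atomic_add_fixed(_Atomic int32_t *target, float value)
{
    /* Round to nearest before truncating to an integer. */
    float scaled = value * FIXED_SCALE + (value >= 0.0f ? 0.5f : -0.5f);
    atomic_fetch_add(target, (int32_t)scaled);
}

/* Converts the fixed-point accumulator back to a float. */
static float fixed_to_float(_Atomic int32_t *target)
{
    return (float)atomic_load(target) / FIXED_SCALE;
}
```

As far as I can tell, the fixed-point sum does not depend on the order of the adds, but it can overflow and loses precision below 1/FIXED_SCALE. The compare-and-swap version keeps full float precision per add, but it may retry under contention and its result depends on the order in which the adds land.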
Thanks in advance!