Global synchronization inside the kernel

Question asked by realhet on Apr 16, 2013



I'm facing the following problem: I have to use LDS for a relatively long time, and I also need to gather/scatter data across all of the LDS memory.

Scheduling more than one kernel is not an option, because I'll have to do 1024 [parallel LDS jobs] interleaved with 1024 [LDS gather operations]. In the final version I'm going to need 192K [parallel LDS jobs] per second, so that really isn't a job for clEnqueue calls.

WorkGroupSize=64, total WorkItems=4x GPU streams; all WorkItems are guaranteed to fit in LDS.


I tried it this way:


    int dstCnt = LoopIdx * cb->GroupCnt;   // value to wait for after all workgroups are done with the atomic_incs
    atomic_inc(&(out->globalCntr));        // inc for this workgroup
    while (out->globalCntr != dstCnt) {}   // wait


But I don't trust this at all (because I don't know whether caching can interfere with it), and it's rather slow.
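
One way to make the spin-wait above less fragile, sketched here as a host-side C11 model rather than kernel code. It assumes every workgroup is resident on the GPU at the same time (the post says this is guaranteed); otherwise a spinning barrier can deadlock. The names `global_barrier` and `global_barrier_wait` are hypothetical. There are two changes versus the snippet above: the counter is read with an atomic load, so the compiler cannot keep a stale copy in a register, and the wait tests `<` instead of `!=`, so a workgroup that checks late still gets through after a faster one has already incremented the counter for the next iteration. In an OpenCL 1.x kernel the rough counterparts would be `atomic_inc` on a `volatile __global int*` and a read through something like `atomic_add(ptr, 0)`, done by one work-item per group and bracketed by `barrier()` calls; that mapping is an assumption and has not been tested on hardware.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical host-side model of the counter barrier from the post.
   `counter` plays the role of out->globalCntr and must start at 0. */
typedef struct {
    atomic_int counter;
} global_barrier;

/* Called once per workgroup per loop iteration.
   loop_idx starts at 1; group_count plays the role of cb->GroupCnt. */
static void global_barrier_wait(global_barrier *b, int loop_idx, int group_count)
{
    assert(loop_idx >= 1 && group_count >= 1);
    int target = loop_idx * group_count;

    /* Arrive: release ordering publishes this group's earlier writes. */
    atomic_fetch_add_explicit(&b->counter, 1, memory_order_release);

    /* Wait: the atomic load is re-executed on every pass, and `<` also
       exits if other groups have already arrived at the next barrier. */
    while (atomic_load_explicit(&b->counter, memory_order_acquire) < target) {
        /* spin */
    }
}
```

Because the counter only ever grows, it never needs to be reset between iterations, which avoids a second synchronization point per loop.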


Is there a way to use GDS for this?


Also, as a side question: the gather operation will sum up float values. Is it a good idea to convert the floats to integers and sum them with atomic_adds? Or is there a way to sum floats atomically?
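
For the side question, both options can be sketched in a few lines. This is a host-side C11 model, not kernel code, and all names (`atomic_add_float`, `to_fixed`, `from_fixed`, `FIXED_SCALE`) are hypothetical. The first part is the usual compare-and-swap loop on the float's bit pattern; in OpenCL 1.x, where `atomic_add` only takes `int`/`unsigned int`, the counterparts would be `atomic_cmpxchg` together with `as_uint()`/`as_float()`. The second part is the fixed-point route, with a 16.16 scale chosen purely for illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

/* Bit reinterpretation; OpenCL C would use as_uint()/as_float(). */
static uint32_t float_bits(float f) { uint32_t u; memcpy(&u, &f, sizeof u); return u; }
static float bits_float(uint32_t u) { float f; memcpy(&f, &u, sizeof f); return f; }

/* Option 1: add `value` to the float stored in *cell via a CAS loop.
   Returns the value that was stored before the add. */
static float atomic_add_float(_Atomic uint32_t *cell, float value)
{
    uint32_t old_bits = atomic_load(cell);
    uint32_t new_bits;
    do {
        new_bits = float_bits(bits_float(old_bits) + value);
        /* On failure, old_bits is refreshed with the current contents. */
    } while (!atomic_compare_exchange_weak(cell, &old_bits, new_bits));
    return bits_float(old_bits);
}

/* Option 2: fixed point. The 16.16 scale is an assumption; pick one
   that covers the real value range and the number of summands. */
#define FIXED_SCALE 65536.0f

static int32_t to_fixed(float f)
{
    float scaled = f * FIXED_SCALE;
    assert(scaled > -2147483648.0f && scaled < 2147483648.0f);
    return (int32_t)(scaled + (scaled >= 0.0f ? 0.5f : -0.5f)); /* round to nearest */
}

static float from_fixed(int32_t v) { return (float)v / FIXED_SCALE; }
```

The trade-off: the CAS loop keeps full float range but the result depends on the order in which the adds land, so it is not bit-reproducible between runs. The integer sum is order-independent and uses the native atomic, but it can overflow and loses precision below the chosen scale.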


Thanks in advance!