I'm currently working on the OpenCL Parallel Primitive library, so I have been reading a lot of literature from NVIDIA.
Many of the optimizations described there rely on the warp concept (32 threads executing together as SIMT).
So I would like to know whether there is an equivalent on ATI hardware and whether I can benefit from it.
Also, someone told me that there is no "shared" memory on ATI hardware and that "__local" memory is emulated with global memory. Is that right?
If it is true, is there some way to optimize my code to avoid global memory accesses?
Thanks for your help.