Discussion created by spectral on Jun 6, 2011
Latest reply on Jun 8, 2011 by spectral


I'm currently working on the OpenCL Parallel Primitive library, so I have been reading a lot of literature from NVIDIA.

Many of their optimizations rely on the warp concept (32 threads executing in SIMT lockstep).

So I would like to know whether there is an equivalent on ATI hardware, and whether I can benefit from it.

Also, someone told me that there is no "shared" memory on ATI hardware and that "__local" memory is emulated with global memory. Is that right?

If that is true, is there some way to optimize my code to avoid "global memory" accesses?


Thanks for your help