1. Are "wavefront" and "wavefront granularity" on AMD equivalent to "warp size" and "warp size granularity" on NVIDIA?
2. When I declare a new variable in a kernel without explicitly using an address space qualifier (private/local/global/constant/...), for example "float newVar;", in which memory is it created, and what is the priority? Is it automatically global?
3. Is there any way to estimate how much private memory is available on my GPUs (NVIDIA GTX 470 and ATI HD5850)?
4. Is there any particular reason to use 2D or 3D work groups, other than that it might be easier/prettier to map the threads to the work space? A performance gain, for example?
5. Let's say that I want to operate on many small vectors of length 64, and the optimal work group size for my platform is 256. Is it a bad idea (performance-wise) to set the group size to 32 or 64? Is it very important not to go too far below 256, and should I instead try to spread one work group across several vectors? I ask because splitting the work group up like that could be problematic for some aspects of my implementation.
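
To make question 5 concrete, here is a rough sketch in plain C of the index arithmetic I have in mind when one work group of 256 covers four vectors of length 64. The names `vector_index` and `element_index` are hypothetical helpers made up for illustration, and the sizes are just the ones from the question, so treat this as an assumption about the layout rather than a finished design.

```c
#include <stddef.h>

#define VECTOR_LENGTH 64   /* length of each small vector (from question 5) */
#define GROUP_SIZE    256  /* assumed optimal work group size (from question 5) */

/* Hypothetical helper: which vector a work-item handles when one
 * work group is spread across GROUP_SIZE / VECTOR_LENGTH vectors. */
size_t vector_index(size_t group_id, size_t local_id)
{
    size_t vectors_per_group = GROUP_SIZE / VECTOR_LENGTH; /* 4 here */
    return group_id * vectors_per_group + local_id / VECTOR_LENGTH;
}

/* Hypothetical helper: which element inside that vector the
 * work-item handles. */
size_t element_index(size_t local_id)
{
    return local_id % VECTOR_LENGTH;
}
```

In an actual kernel, `group_id` and `local_id` would come from `get_group_id(0)` and `get_local_id(0)`. The alternative I am asking about is a group size of 64, where the vector index is simply the group id and no such split is needed.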