Is there a good way to know a priori a good size for the workgroup? Is bigger always better?
In CUDA, we have an Excel spreadsheet (the occupancy calculator) where you can see the occupancy of the multiprocessors and the shared memory usage. Currently, I have to test my kernel with several values (32, 64, 128, 256, 512) for the workgroup size and choose the one that runs fastest.
Do you have a tool that shows how many cycles, memory stalls, SIMD branch divergences, cache hits/misses, etc. a specific kernel incurs? That would be useful too.
It would also be useful if the documentation described how memory is cached, the cache sizes, bank conflicts, etc., with diagrams like those in the CUDA SDK.