2 Replies Latest reply on Aug 11, 2010 12:02 AM by Fuxianjun

    How can different execution paths in the same warp be explained?


      The passage at the end of this post is quoted from http://www.cmsoft.com.br/index.php?option=com_content&view=category&layout=blog&id=92&Itemid=144

      Can anyone tell me the exact reason why different execution paths are bad?

      Another problem:

      I need to combine two algorithms into one kernel. The optimal work-item count for one is 100, but for the other it is 2. How do I find the optimal work-item count for the combined kernel?

      It may be easier to use an example. The worst thing that can happen in a kernel, concerning execution paths, is:

          kernel void myKernel()
          {
              if (condition) { do work }
          }

      As you can see, some kernels will be launched and do nothing at all. This is not good. As a rule of thumb, have in mind that throwing a worker is an expensive task and you want each worker to effectively work. Remember that vector sum kernel almost every OpenCL tutorial posts as an example? It is not very effective because each worker only executes one sum.

      Another thing to avoid is:

          kernel void myKernel()
          {
              if (condition) { do work }
              else { do something completely different }
          }

      You would prefer something like:

          kernel void myKernel()
          {
              if (condition) { do work }
              else { do something with the exact same operations and order with different data }
          }

      I don't work for AMD or NVidia in order to know implementation details and explain exactly why this is bad. What I do know is that it messes with the parallel operations that the hardware can handle.
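      As an illustration of the "same operations and order with different data" advice in the quote, here is a minimal sketch in plain C (not OpenCL). The function names `branchy` and `uniform`, and the arithmetic itself, are made up for this example; the point is only that the second version runs one multiply and one add whatever the condition is.

```c
/* Divergent version: the two sides run different instruction sequences. */
float branchy(float x, int condition)
{
    if (condition) {
        return x * 2.0f + 1.0f;   /* hypothetical "do work" */
    } else {
        return x - 4.0f;          /* hypothetical "something different" */
    }
}

/* Uniform version: every call performs one multiply and one add.
   Only the coefficients (the data) depend on the condition. */
float uniform(float x, int condition)
{
    static const float scale[2]  = { 1.0f, 2.0f };
    static const float offset[2] = { -4.0f, 1.0f };
    int i = (condition != 0);     /* 0 or 1, used as a table index */
    return x * scale[i] + offset[i];
}
```

      Both functions return the same values for the same inputs. Whether the table lookup is actually faster on a given GPU is not something this sketch can show.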

        • How can different execution paths in the same warp be explained?

          The reason is that work items are not independent execution paths in the hardware. What the hardware actually does is more like a sophisticated version of SSE: it executes one (VLIW) instruction across 64 SIMD lanes in lockstep. The programming model, for historical and probably beneficial reasons, exposes this on a per-lane basis, but that risks giving the wrong idea about how the hardware works. It is not combining separate threads for efficiency; it is executing one 64-wide thread that you can program lane by lane, because that is easier than writing 64-wide SSE-style gather and scatter instructions. So when lanes in the same wavefront take different branches, the hardware has to run both paths, with the lanes that did not take a given path masked off.
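          To make the cost concrete, here is a small plain-C model of that lockstep behaviour. It is only a sketch: `wavefront_steps` and its step counts are hypothetical, and it assumes a 64-lane wavefront that pays for a branch side whenever at least one lane takes it.

```c
#define WAVEFRONT_SIZE 64  /* wavefront width mentioned in the post */

/* Steps one wavefront spends on an if/else when all lanes run in lockstep.
   takes_if[lane] is nonzero if that lane takes the "if" side.
   if_steps and else_steps are illustrative instruction counts. */
int wavefront_steps(const int takes_if[WAVEFRONT_SIZE],
                    int if_steps, int else_steps)
{
    int any_if = 0;
    int any_else = 0;
    for (int lane = 0; lane < WAVEFRONT_SIZE; ++lane) {
        if (takes_if[lane]) {
            any_if = 1;
        } else {
            any_else = 1;
        }
    }
    /* A side is executed if any lane needs it; the other lanes are masked. */
    return any_if * if_steps + any_else * else_steps;
}
```

          With every lane on the same side, the wavefront pays for one side only. As soon as a single lane differs, it pays for both.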

          I don't know if I can give you an optimal group size. Based on past experience, though, I find it hard to believe that the second algorithm really has an optimal count of 2. Presumably you have more than one group of size 2, yes? Can you not combine them into a single group? That way you can have the first 512 work items run one algorithm and the next 64 run the other, or something like that, rather than branching within the same wave. There are cases where that doesn't work, but usually you just need to slightly rethink the way you are mapping to the hardware.
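          The 512/64 split can be checked with a short plain-C sketch. The helper names are hypothetical, and the code assumes that work items are packed into 64-wide wavefronts in ID order, which is a simplification of how a real device assigns them.

```c
#define WAVEFRONT_SIZE 64  /* assumed wavefront width */

/* Items [0, boundary) run algorithm 0; the rest run algorithm 1. */
int algorithm_for_item(int id, int boundary)
{
    return id < boundary ? 0 : 1;
}

/* Count wavefronts that contain items of both algorithms.
   The assignment is monotonic in id, so comparing the first and last
   item of each wavefront is enough to detect a mix. */
int mixed_wavefronts(int total_items, int boundary)
{
    int mixed = 0;
    for (int start = 0; start < total_items; start += WAVEFRONT_SIZE) {
        int end = start + WAVEFRONT_SIZE;
        if (end > total_items) {
            end = total_items;
        }
        if (algorithm_for_item(start, boundary) !=
            algorithm_for_item(end - 1, boundary)) {
            ++mixed;
        }
    }
    return mixed;
}
```

          With 576 items and the boundary at 512 (a multiple of 64), no wavefront mixes the two algorithms. With 102 items and the boundary at 100, the second wavefront contains both and would branch internally.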

          If I were you, I'd forget "threads", work items and work groups completely and redesign your entire algorithm in terms of wavefronts. That's how the hardware works, and that's how I would program it. Don't ask "how many work items do I need to achieve this?"; ask "how many wavefronts do I need to achieve this?" Then design your data layouts, execution patterns and so on around that assumption, the way you would if you were developing for SSE.
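          Counting in wavefronts rather than work items might look like the plain-C sketch below. The function names are hypothetical and the width of 64 is the figure used in this thread; other hardware uses other widths (NVIDIA warps are 32 wide).

```c
#define WAVEFRONT_SIZE 64  /* assumed wavefront width */

/* Number of whole wavefronts needed to cover `elements` items of work. */
int wavefronts_needed(int elements)
{
    return (elements + WAVEFRONT_SIZE - 1) / WAVEFRONT_SIZE;
}

/* Work-item count rounded up so that no wavefront is partially filled. */
int padded_work_items(int elements)
{
    return wavefronts_needed(elements) * WAVEFRONT_SIZE;
}
```

          Under this model, an algorithm that wants 100 work items occupies two wavefronts (128 lanes), and one that wants 2 work items still occupies a full wavefront, which is one way to see why a count of 2 is unlikely to be optimal.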