3 Replies Latest reply on Sep 28, 2010 7:57 PM by LeeHowes

    Vector type execution?



      If I have a kernel looking something like this being enqueued as one work-item, will the four vector components be assigned in parallel on different processing elements, or will one processing element do it serially? I.e., will it use all four cores on a four-core CPU, or just some SIMD instructions? And are similar instructions available on ATI GPUs?

      __kernel void swizzle(__global char4* buf){

            *buf = (*buf).yzwx;
      }


      If I want it to be done in parallel, would it be better to instantiate four work-items looking like below?

      __kernel void swizzle(__global char* buf){

          int ID = get_local_id(0);

          if(ID == 3) buf[ID] = buf[0];

          else buf[ID] = buf[ID+1];
      }



      Finally: will the Stream Kernel Analyzer and Stream Profiler be available on Linux? Or any similar tool?


        • Vector type execution?

          This is not really a suitable task for parallelization.

          @Version 1: it's only memory access and no calculations.

          @Version 2: Will not work, as the work-items will overwrite each other's values!

          buf[0] = buf[1];  // work-item 0
          buf[3] = buf[0];  // work-item 3: whoops, buf[0] may already have been overwritten with buf[1]

          so you would need a temp variable and synchronization, which will be slow.

            • Vector type execution?

              I think it is a task very suitable for parallelization. Setting many vector elements with a single instruction is better than setting them one at a time. And yes, one might need another buffer to avoid overwriting values.


              My question still remains: how many instructions will [buf = buf.yzwx;] take? And will that be faster than executing four [buf[ID] = buf[otherID];]?

              What about [buf = buf.s12306745b89acdef;] when buf is char16, how many instructions is that?

                • Vector type execution?

                  On a CPU it will use a single core. It's not going to break up the work item across multiple cores. It may be able to use SSE.

                  It won't break a single work item on a GPU across multiple cores either, but if you use multiple work items it will use the SIMD engine to execute them across multiple SIMD lanes. On the other hand, in a single work-item the swizzle will execute on a single SIMD lane's five VLIW ALUs, as four operations packed into a single VLIW instruction (plus possible overhead for using chars).