Vector type execution?

Discussion created by eklund.n on Sep 27, 2010
Latest reply on Sep 28, 2010 by LeeHowes


If I have a kernel like the one below enqueued as a single work-item, will the four components of the char4 be assigned in parallel on different processing elements, or will one processing element handle them serially? I.e., will it use all four cores on a four-core CPU, or just some SIMD instructions? And are similar instructions available on ATI GPUs?

__kernel void swizzle(__global char4* buf){

      // Rotate the four components of the first element left by one.
      buf[0] = buf[0].yzwx;

}


If I want this done in parallel, would it be better to launch four work-items, each running a kernel like the one below?

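__kernel void swizzle(__global char* buf){

    int id = get_local_id(0);

    char v;

    // Read the source element first; assumes all four work-items are in one work-group.
    if(id == 3) v = buf[0];

    else v = buf[id+1];

    // Wait until every work-item has read before any of them writes.
    barrier(CLK_GLOBAL_MEM_FENCE);

    buf[id] = v;

}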


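To make the intended result concrete: both versions are meant to rotate the four bytes left by one. Below is a minimal host-side sketch of that in plain C, not OpenCL. The function names are made up for illustration and are not part of any API.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: rotates four bytes left by one, which is the
   result expected from the .yzwx swizzle (x<-y, y<-z, z<-w, w<-x). */
static void rotate_left4(char v[4]) {
    char first = v[0];   /* save x before it is overwritten */
    v[0] = v[1];
    v[1] = v[2];
    v[2] = v[3];
    v[3] = first;
}

/* Hypothetical helper mirroring the four work-item version: each "id"
   reads its source element into a temporary before anything is written. */
static void rotate_left4_per_item(char v[4]) {
    char tmp[4];
    for (int id = 0; id < 4; ++id)
        tmp[id] = (id == 3) ? v[0] : v[id + 1];
    memcpy(v, tmp, 4);
}

/* Asserts that both forms produce the same bytes for a given input. */
static void check_both_agree(const char in[4]) {
    char a[4], b[4];
    memcpy(a, in, 4);
    memcpy(b, in, 4);
    rotate_left4(a);
    rotate_left4_per_item(b);
    assert(memcmp(a, b, 4) == 0);
}
```

Calling rotate_left4 on the bytes 'a','b','c','d' should leave 'b','c','d','a', and check_both_agree confirms the per-item form matches it.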

Finally: will the Stream Kernel Analyzer and Stream Profiler be available on Linux? Or any similar tool?