Hi.
If I enqueue a kernel like the one below as a single work-item, will the four components of the char4 be assigned in parallel on different processing elements, or will one processing element handle them serially? I.e., will it use all four cores of a four-core CPU, or just some SIMD instructions on one core? And are similar instructions available on ATI GPUs?
__kernel void swizzle(__global char4* buf){
    // Rotate the four components left by one: (x,y,z,w) -> (y,z,w,x).
    *buf = (*buf).yzwx;
}
If I want it to be done in parallel, would it be better to launch four work-items, each running a kernel like the one below?
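__kernel void swizzle(__global char* buf){
    size_t id = get_local_id(0);
    // Read the source element first; writing right away would race
    // with the neighbouring work-item that still needs the old value.
    char next = (id == 3) ? buf[0] : buf[id + 1];
    barrier(CLK_GLOBAL_MEM_FENCE);
    buf[id] = next;
}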
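To make clear what result I expect from either version, here is a minimal sketch in plain host-side C (no OpenCL involved). The function names swizzle_vector, swizzle_per_item and swizzle_selfcheck are made up for illustration, and the two loops in the per-item version only stand in for "all work-items read, then all work-items write"; it says nothing about how a real device would schedule the work.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical model of the single work-item version:
   all four components are rotated at once, like (*buf).yzwx. */
static void swizzle_vector(char buf[4]) {
    char out[4] = { buf[1], buf[2], buf[3], buf[0] };
    memcpy(buf, out, 4);
}

/* Hypothetical model of the four work-item version.
   Phase 1: every "work-item" reads its source element.
   Phase 2: every "work-item" writes its own slot.
   The split between the loops plays the role of a barrier. */
static void swizzle_per_item(char buf[4]) {
    char next[4];
    for (int id = 0; id < 4; ++id)
        next[id] = (id == 3) ? buf[0] : buf[id + 1];
    for (int id = 0; id < 4; ++id)
        buf[id] = next[id];
}

/* Checks that both models turn "abcd" into the same rotated result. */
static void swizzle_selfcheck(void) {
    char a[4] = { 'a', 'b', 'c', 'd' };
    char b[4] = { 'a', 'b', 'c', 'd' };
    swizzle_vector(a);
    swizzle_per_item(b);
    assert(memcmp(a, b, 4) == 0);
}
```

Calling swizzle_selfcheck() from a main function should pass silently if the two versions agree, which is the behaviour I am after in the real kernels.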
Finally, will the Stream Kernel Analyzer and Stream Profiler be available on Linux, or is there any similar tool?
Thanks.