Archives Discussions

eklund_n · ‎09-27-2010

Hi.

If I have a kernel looking something like this being enqueued as one work-item, will the four vector components be assigned in parallel on different processing elements, or will one processing element do it serially? I.e., will it use all four cores on a four-core CPU, or just some SIMD instructions? And are similar instructions available on ATI GPUs?

__kernel void swizzle(__global char4* buf){

// buf is a pointer, so swizzle the char4 it points to
buf[0] = buf[0].yzwx;

}

If I want it to be done in parallel, would it be better to instantiate four work-items looking like below?

__kernel void swizzle(__global char* buf){

// index of this work-item within its work-group (0..3)
int id = get_local_id(0);

if(id == 3) buf[id] = buf[0];

else buf[id] = buf[id+1];

}

Finally: will the Stream Kernel Analyzer and Stream Profiler be available on Linux? Or any similar tool?

Thanks.

Tasp · ‎09-28-2010

This is not really a suitable task for parallelization.

@Version 1: it's only memory access and no calculations.

@Version 2: Will not work as you will overwrite your values!

e.g.:
buf[3] = buf[0];
buf[2] = buf[3]; // whoops, buf[3] was already overwritten with buf[0]

so you would need a temp variable and synchronization, which will be slow.
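To make the hazard and the fix concrete, here is a minimal plain-C sketch that emulates the four work-items with loops. It is an illustration, not OpenCL code, and the function names are hypothetical. In a real kernel the gap between the two loops of the second function would correspond to a barrier such as `barrier(CLK_LOCAL_MEM_FENCE)` placed between each work-item's read and its write, and the temporary would be a private variable per work-item.

```c
/* Hypothetical helper: emulates four work-items that each run
   buf[id] = buf[(id + 1) % 4], executed one after another in
   descending id order. Work-item 3 overwrites buf[3] before
   work-item 2 has read it, so the result is wrong. */
void rotate_naive_descending(char buf[4])
{
    for (int id = 3; id >= 0; --id) {
        buf[id] = buf[(id + 1) % 4];
    }
}

/* Hypothetical helper: two-phase version. Every work-item first
   reads its source element into a temporary, and only after all
   reads are done does any work-item write. */
void rotate_two_phase(char buf[4])
{
    char tmp[4];

    /* phase 1: all reads */
    for (int id = 0; id < 4; ++id) {
        tmp[id] = buf[(id + 1) % 4];
    }

    /* a work-group barrier would sit here in a real kernel */

    /* phase 2: all writes */
    for (int id = 0; id < 4; ++id) {
        buf[id] = tmp[id];
    }
}
```

The two-phase version gives the same result as the `.yzwx` swizzle regardless of the order in which the emulated work-items run; the naive version only works if no work-item happens to write an element before its neighbour has read it.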

eklund_n · ‎09-28-2010

I think it is a task very suitable for parallelization. Setting many vector elements in the same instruction is better than setting them one at a time. And yes, one might need another buffer to avoid overwriting values.

My question still remains: how many instructions will [buf = buf.yzwx;] take? And will that be faster than executing four [buf[ID] = buf[otherID];]?

What about [buf = buf.s12306745b89acdef;] when buf is char16, how many instructions is that?
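For reference, the swizzle suffix is read one hex digit per result element: element i of the result is the source element named by the i-th digit. A minimal plain-C sketch of that mapping follows. The function name `swizzle16` is hypothetical, and the sketch only illustrates the data movement; it says nothing about how many instructions a particular compiler emits.

```c
/* Emulates buf = buf.s12306745b89acdef for a 16-element char vector.
   sel[i] is the index of the source element that lands in slot i. */
void swizzle16(char v[16])
{
    static const unsigned char sel[16] = {
        0x1, 0x2, 0x3, 0x0, 0x6, 0x7, 0x4, 0x5,
        0xb, 0x8, 0x9, 0xa, 0xc, 0xd, 0xe, 0xf
    };
    char out[16];

    /* gather into a temporary so no source element is overwritten early */
    for (int i = 0; i < 16; ++i) {
        out[i] = v[sel[i]];
    }
    for (int i = 0; i < 16; ++i) {
        v[i] = out[i];
    }
}
```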

LeeHowes · ‎09-28-2010

On a CPU it will use a single core. It's not going to break up the work item across multiple cores. It may be able to use SSE.

It won't break a single work-item on a GPU across multiple cores either, but if you use multiple work-items it will use the SIMD engine to execute them across multiple SIMD lanes. On the other hand, a single work-item will execute on a single SIMD lane's five VLIW ALUs, as a set of 4 VLIW instructions in a single instruction packet (plus possible overhead for using chars).
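A minimal sketch of the multiple-work-item layout described above, written as plain C that emulates the NDRange with a loop. The names `char4_emul`, `rotate_char4` and `run_swizzle` are hypothetical. Each loop iteration stands for one work-item whose index would come from `get_global_id(0)` in a real kernel. Every work-item rotates its own char4, so no two work-items touch the same element and no synchronization is needed.

```c
#include <stddef.h>

/* Stand-in for the OpenCL char4 type (assumption for this sketch). */
typedef struct { char x, y, z, w; } char4_emul;

/* Equivalent of v.yzwx on one vector. */
static char4_emul rotate_char4(char4_emul v)
{
    char4_emul r = { v.y, v.z, v.w, v.x };
    return r;
}

/* Emulates enqueueing n work-items: iteration gid plays the role
   of the work-item with global ID gid. */
void run_swizzle(char4_emul *buf, size_t n)
{
    for (size_t gid = 0; gid < n; ++gid) {
        buf[gid] = rotate_char4(buf[gid]);
    }
}
```

With this layout the parallelism comes from the number of char4 elements in the buffer, not from splitting one char4 across work-items.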

Vector type execution?