d0.xy=d0.xz+d0.yw; d0.zw=d1.xz+d1.yw; d1.xy=d2.xz+d2.yw; d1.zw=d3.xz+d3.yw;
d2.xy=d4.xz+d4.yw; d2.zw=d5.xz+d5.yw; d3.xy=d6.xz+d6.yw; d3.zw=d7.xz+d7.yw;
d4.xy=d8.xz+d8.yw; d4.zw=d9.xz+d9.yw; d5.xy=d10.xz+d10.yw; d5.zw=d11.xz+d11.yw;
d6.xy=d12.xz+d12.yw; d6.zw=d13.xz+d13.yw; d7.xy=d14.xz+d14.yw; d7.zw=d15.xz+d15.yw;
if( (d0.x>t.y)||(d0.y>t.y)||(d0.z>t.y)||(d0.w>t.y)||(d1.x>t.y)||(d1.y>t.y)||(d1.z>t.y)||(d1.w>t.y)||
    (d2.x>t.y)||(d2.y>t.y)||(d2.z>t.y)||(d2.w>t.y)||(d3.x>t.y)||(d3.y>t.y)||(d3.z>t.y)||(d3.w>t.y)||
    (d4.x>t.y)||(d4.y>t.y)||(d4.z>t.y)||(d4.w>t.y)||(d5.x>t.y)||(d5.y>t.y)||(d5.z>t.y)||(d5.w>t.y)||
    (d6.x>t.y)||(d6.y>t.y)||(d6.z>t.y)||(d6.w>t.y)||(d7.x>t.y)||(d7.y>t.y)||(d7.z>t.y)||(d7.w>t.y) ) {
    was_pulse.y=1;
}
Difficult to say without the full code (I can't look at the ISA for the code you've given directly)...
...however, generally speaking: reduce CF clauses (the ISA doesn't allow packing across clauses) and reduce data dependencies (I have no idea how large the compiler's scheduling window is for this, but I imagine it's fairly large). Is it packing the d2.xywz > t.yyyy comparison into a vector (for example)?
This should increase ALU Packing into VLIW –
int4 temp = (d0 - t.y) || (d1 - t.y) || (d2 - t.y) || (d3 - t.y) ||
            (d4 - t.y) || (d5 - t.y) || (d6 - t.y) || (d7 - t.y) ||
            (d8 - t.y) || (d9 - t.y) || (d10 - t.y) || (d11 - t.y) ||
            (d12 - t.y) || (d13 - t.y) || (d14 - t.y) || (d15 - t.y);
if(temp.x > 0 || temp.y > 0 || temp.z > 0 || temp.w > 0) {
    was_pulse.y=1;
}
First of all, I think that omkaranathan's code won't compile. You can't do d0 - t.y (vector - scalar).
Now a few pointers:
1. Whether you do
a = b - c;
or
a.x = b.x - c.x;
a.y = b.y - c.y;
...
a.w = b.w - c.w;
doesn't matter. The IL compiler will pack it into 5-wide instructions anyway (so each version will output only 1 ISA instruction).
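To illustrate the equivalence, here is a quick host-side C sketch (not kernel code; the float4 type and function names are my own, for illustration). Both forms compute the same thing, which is why the IL compiler is free to emit the same packed VLIW instruction for either:

```c
#include <assert.h>

/* Host-side stand-in for the GPU float4 type (illustration only). */
typedef struct { float x, y, z, w; } float4;

/* a = b - c; written as a whole-vector operation */
static float4 sub_vector(float4 b, float4 c) {
    float4 a = { b.x - c.x, b.y - c.y, b.z - c.z, b.w - c.w };
    return a;
}

/* the same subtraction written component by component */
static float4 sub_componentwise(float4 b, float4 c) {
    float4 a;
    a.x = b.x - c.x;
    a.y = b.y - c.y;
    a.z = b.z - c.z;
    a.w = b.w - c.w;
    return a;
}
```

Since the four component operations are independent, the compiler can schedule them into one wide instruction regardless of how the source is written.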
2. If possible, don't use 'if'. It has a huge overhead on the GPU!
Instead use select:
was_pulse.y = select( test, 1, 0 ( or was_pulse.y ) );
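A minimal host-side C sketch of the branchless pattern (the names `test` and `was_pulse_y` are assumed from the thread, not real API). Note that OpenCL's select(a, b, c) returns b where c is true and a otherwise; here it is emulated with a ternary:

```c
#include <assert.h>

/* Emulates OpenCL select(a, b, c) for scalar ints:
 * returns b where c is nonzero, else a. */
static int select_int(int a, int b, int c) {
    return c ? b : a;
}

/* Branchless update: was_pulse.y = select(was_pulse.y, 1, test);
 * keeps the old value when test is false, so a pulse detected in an
 * earlier iteration is not overwritten. */
static int update_pulse(int was_pulse_y, int test) {
    return select_int(was_pulse_y, 1, test);
}
```

On the GPU this maps to a conditional-move instruction instead of a branch, so all lanes of the wavefront execute the same code path.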
3. There is a standard trick to increase ALU packing in a kernel:
in 1 work item, do the work of n work items (n = 2..5).
So let's assume that you had a workspace (1..5). Each kernel had totally sequential work, so ALU packing was 1. Now, instead of 5 work items, you create a workspace with 1 work item. And the kernel is
old_kernel_for_item_0
old_kernel_for_item_1
...
old_kernel_for_item_4
Now the IL compiler (at the bottom of the compilation stack) will be able to achieve maximal ALU packing. It will take the first instruction from kernel_for_item_0 for slot x, the first instruction from kernel_for_item_1 for slot y, ... and so on.
Of course, this technique can't be used absentmindedly. There are some exceptions (like loops) which require a little bit smarter kernel interleaving.
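A host-side C sketch of the folding trick (the chain `v = v*2 + 1` is made up as a stand-in for any fully sequential per-item work). One kernel body carrying four independent chains gives the compiler four data-independent streams to fill the x, y, z, w slots with:

```c
#include <assert.h>

#define STEPS 4

/* Old kernel for one work item: each step depends on the previous
 * result, so ALU packing is 1. */
static float old_kernel(float v) {
    for (int s = 0; s < STEPS; s++)
        v = v * 2.0f + 1.0f;
    return v;
}

/* Merged kernel doing the work of 4 items. The four chains are
 * independent, so per iteration the compiler can pack the four
 * multiply-adds into one VLIW instruction. */
static void merged_kernel(float v[4]) {
    for (int s = 0; s < STEPS; s++) {
        v[0] = v[0] * 2.0f + 1.0f;
        v[1] = v[1] * 2.0f + 1.0f;
        v[2] = v[2] * 2.0f + 1.0f;
        v[3] = v[3] * 2.0f + 1.0f;
    }
}
```

The merged version produces exactly the same per-item results; only the scheduling opportunity changes.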
float4 tt = (float4)(t.x);
pulse |= (d0>tt)|(d1>tt)|(d2>tt)|(d3>tt);
and at the end of the loop:
was_pulse.x = 1 & (pulse.x|pulse.y|pulse.z|pulse.w);
I haven't built the app with the new code yet, but I hope it will improve ALU packing and reduce kernel code size too.
Originally posted by: Raistmer Thanks all for the answers! I can't really use work-item packing because there are only 32*6 work items.
Then why do you bother optimizing this? The kernel start overhead will be much higher (probably >1000x) than the kernel execution time.
With 32*6 work items you can use only 3 SIMDs - does it even make sense to use the GPU for your task?
Originally posted by: Raistmer 32x10 gives 10 wavefronts, not 5 (according to the OpenCL profiler). I suppose 32x6 will give 6 wavefronts -> 6 SIMDs.
Wavefront size is 32 only on some older cards (below the 4600 series). If you have a card with wavefront size 64 you can still specify a workgroup size of 32. The effect will be that for workspace size 32*10, 10 SIMDs will be used - but each will be doing 64 operations (and the results from 32 of them will be discarded), so there is no speed advantage.
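The arithmetic above can be checked with a back-of-envelope C sketch (the function names are mine; this is just counting, not a real occupancy model):

```c
#include <assert.h>

/* Number of wavefronts when each workgroup fits in one wavefront. */
static int wavefronts(int global_size, int group_size) {
    return global_size / group_size;
}

/* Lanes issued but discarded per wavefront when the workgroup is
 * smaller than the hardware wavefront. */
static int wasted_lanes(int group_size, int wavefront_size) {
    return wavefront_size > group_size ? wavefront_size - group_size : 0;
}
```

So a 32*10 workspace with workgroup size 32 on wavefront-64 hardware launches 10 wavefronts, each throwing away half its lanes.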