
Raistmer
Adept II

How to increase the ALU packing value?

in a particular kernel

The kernel consists mostly of blocks like the one in the listing below:

Power values are loaded into 16 float4 registers and compared with a threshold; then the array size is halved and the process repeats.
ALU packing for this kernel is very low, only 45.5. Other kernels have >80.

What prevents those instructions from being packed into VLIW bundles?

d0.xy=d0.xz+d0.yw; d0.zw=d1.xz+d1.yw; d1.xy=d2.xz+d2.yw; d1.zw=d3.xz+d3.yw;
d2.xy=d4.xz+d4.yw; d2.zw=d5.xz+d5.yw; d3.xy=d6.xz+d6.yw; d3.zw=d7.xz+d7.yw;
d4.xy=d8.xz+d8.yw; d4.zw=d9.xz+d9.yw; d5.xy=d10.xz+d10.yw; d5.zw=d11.xz+d11.yw;
d6.xy=d12.xz+d12.yw; d6.zw=d13.xz+d13.yw; d7.xy=d14.xz+d14.yw; d7.zw=d15.xz+d15.yw;

if ( (d0.x>t.y)||(d0.y>t.y)||(d0.z>t.y)||(d0.w>t.y)||(d1.x>t.y)||(d1.y>t.y)||(d1.z>t.y)||(d1.w>t.y)||
     (d2.x>t.y)||(d2.y>t.y)||(d2.z>t.y)||(d2.w>t.y)||(d3.x>t.y)||(d3.y>t.y)||(d3.z>t.y)||(d3.w>t.y)||
     (d4.x>t.y)||(d4.y>t.y)||(d4.z>t.y)||(d4.w>t.y)||(d5.x>t.y)||(d5.y>t.y)||(d5.z>t.y)||(d5.w>t.y)||
     (d6.x>t.y)||(d6.y>t.y)||(d6.z>t.y)||(d6.w>t.y)||(d7.x>t.y)||(d7.y>t.y)||(d7.z>t.y)||(d7.w>t.y) ) {
    was_pulse.y = 1;
}

0 Likes
7 Replies
ryta1203
Journeyman III

Difficult to say without the full code (I can't look at the ISA from just the snippet you've given)...

...however, generally speaking, reduce CF clauses (the ISA doesn't allow packing across clauses) and reduce data dependencies (I have no idea how large the compiler's window is for this, though I imagine it's fairly large). Is it packing the d2.xywz > t.yyyy into a vector (for example)?
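For example, something along these lines (just an untested sketch; d0, t and was_pulse are taken from the listing above) lets the compiler emit one 4-wide compare instead of four scalar compares chained with ||:

int4 hit = d0 > (float4)(t.y);   // component-wise compare: each lane is -1 (true) or 0 (false)
if (any(hit))                    // any() returns 1 if the sign bit of any component is set
    was_pulse.y = 1;

The same pattern would apply to the other d registers.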

0 Likes

 

This should increase ALU packing into VLIW:

 

int4 temp = (d0 - t.y) || (d1 - t.y) || (d2 - t.y) || (d3 - t.y) ||
            (d4 - t.y) || (d5 - t.y) || (d6 - t.y) || (d7 - t.y) ||
            (d8 - t.y) || (d9 - t.y) || (d10 - t.y) || (d11 - t.y) ||
            (d12 - t.y) || (d13 - t.y) || (d14 - t.y) || (d15 - t.y);

if (temp.x > 0 || temp.y > 0 || temp.z > 0 || temp.w > 0) {
    was_pulse.y = 1;
}

0 Likes

First of all, I think omkaranathan's code won't compile. You can't do d0 - t.y (vector - scalar).

Now, a few pointers:

1. Whether you do

    a = b - c;

or

   a.x = b.x - c.x;

   a.y = b.y - c.y;

   ...

   a.w = b.w - c.w

doesn't matter. The IL compiler will pack it into 5-wide instructions anyway (so each version will produce only 1 ISA instruction).

2. If possible, don't use 'if'. It has a huge overhead on the GPU!

Instead, use select:

was_pulse.y = select( test, 1, 0 ( or was_pulse.y ) );
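For reference, OpenCL's select takes its arguments as (value_if_false, value_if_true, condition), so a concrete version of the line above could look like this (sketch only; 'test' is a placeholder for whatever integer holds the threshold comparison result, and was_pulse is assumed to be an int vector):

int test = any((d0 > (float4)(t.y)) | (d1 > (float4)(t.y)));  // 1 if any lane exceeded the threshold
was_pulse.y = select(was_pulse.y, 1, test);                   // branch-free: 1 on a hit, old value otherwise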

3. There is a standard trick to increase ALU packing in a kernel.

In one work item, do the work of n work items (n = 2..5).

So let's assume you had a workspace of 5 work items (1..5). Each kernel instance did totally sequential work, so ALU packing was 1. Now, instead of 5 work items, you create a workspace with 1 work item, and the kernel is:

old_kernel_for_item_0

old_kernel_for_item_1

...

old_kernel_for_item_4

Now the IL compiler (at the bottom of the compilation stack) will be able to achieve maximal ALU packing: it will take the first instruction from old_kernel_for_item_0 for slot x, the first instruction from old_kernel_for_item_1 for slot y, ... and so on.
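A toy sketch of the idea (all names made up for illustration; the real kernel would carry its own arguments, and the global size is divided by 4 when launching the packed version):

// Before: one dependent chain per work item -> little to pack into VLIW slots.
__kernel void chain(__global float *buf)
{
    int i = get_global_id(0);
    float v = buf[i];
    v = v * v + 1.0f;   // each step depends on the previous one
    v = v * v + 1.0f;
    v = v * v + 1.0f;
    buf[i] = v;
}

// After: one work item does the work of 4 former work items. The four values
// are independent, so the compiler can spread each step over the x/y/z/w slots.
__kernel void chain4(__global float *buf)
{
    int i = get_global_id(0);
    float4 v = vload4(i, buf);   // 4 independent values
    v = v * v + 1.0f;            // same math, now 4-wide per step
    v = v * v + 1.0f;
    v = v * v + 1.0f;
    vstore4(v, i, buf);
}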

Of course, this technique can't be applied blindly. There are some exceptions (like loops) which require a little smarter kernel interleaving.

0 Likes

Thanks all for the answers!

I can't really use work-item packing because there are only 32*6 work items. The SIMDs are underloaded even now.

It seems that in OpenCL a vector comparison works like the SIMD version on the CPU, not as in Brook: all 4 values are compared.
So I completely removed the if statements and went with boolean logic instead, like:

float4 tt = (float4)(t.x);
pulse |= (d0 > tt) | (d1 > tt) | (d2 > tt) | (d3 > tt);

and at the end of the loop:

was_pulse.x = 1 & (pulse.x | pulse.y | pulse.z | pulse.w);

I haven't built the app with the new code yet, but I hope it will improve ALU packing and reduce kernel code size too.
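For what it's worth, a version with the types spelled out might look like this (sketch only; it assumes pulse and was_pulse are int4, which matches how the compare results come out):

int4 pulse = (int4)(0);
float4 tt = (float4)(t.x);
pulse |= (d0 > tt) | (d1 > tt) | (d2 > tt) | (d3 > tt);   // each lane becomes 0 or -1

and after the loop, either the bitwise collapse above or simply:

was_pulse.x = any(pulse);   // any() already returns 0 or 1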

0 Likes

Originally posted by: Raistmer Thanks all for the answers! I can't really use work-item packing because there are only 32*6 work items.


Then why do you bother optimizing this? The kernel launch overhead will be much higher (probably >1000x) than the kernel execution time.

With 32*6 work items you can use only 3 SIMDs - does it even make sense to use the GPU for your task?

0 Likes
Raistmer
Adept II

It does make sense. This kernel should replace bringing the whole data array back to CPU memory, and benchmarks clearly show that even with such a slow kernel the app works much faster than when it moves big data chunks and does the processing on the CPU.
The kernel execution time is rather high, ~18 ms on my hardware. It is called many times, so every saving will give a performance boost.
32x10 gives 10 wavefronts, not 5 (according to the OpenCL profiler). I suppose 32x6 will give 6 wavefronts -> 6 SIMDs.
0 Likes

Originally posted by: Raistmer 32x10 gives 10 wavefronts, not 5 (according to the OpenCL profiler). I suppose 32x6 will give 6 wavefronts -> 6 SIMDs.


The wavefront size is 32 only on some older cards (<4600). If you have a card with wavefront size 64, you can specify a workgroup size of 32. The effect will be that for a workspace size of 32*10, 10 SIMDs will be used, but each will be doing 64 operations (and the results of 32 of them will be discarded) - so there is no speed advantage.
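If you want to check rather than guess, newer OpenCL runtimes (1.1+) let the host query it; this is only a sketch, where 'kernel' and 'device' stand for an already-built cl_kernel and its cl_device_id, and on AMD GPUs the reported value is typically the wavefront size:

size_t wavefront = 0;   /* requires <CL/cl.h> from an OpenCL 1.1 SDK and <stdio.h> */
clGetKernelWorkGroupInfo(kernel, device,
                         CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE,
                         sizeof(wavefront), &wavefront, NULL);
printf("wavefront size: %lu\n", (unsigned long)wavefront);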

0 Likes