aazmp
Journeyman III

VGPRs as intermediate storage

How can I prevent the compiler from using VGPRs to store intermediate data that it decides will be useful later?

example:

In my program each kernel stores an 8x8 block of 8-bit pixels to local memory; the pixels are packed four at a time into uchar4.

The position of each quad-pixel is array_in_LDS[get_local_size(0)*qpixel_num + get_local_id(0)].

At one point I store them to local memory, and some time later I load them back.

Between these two points the compiler keeps each index in a VGPR, so 16 VGPRs are used in this example.

There is a workaround:

If at store time I calculate the index differently, for example as mad24(get_local_size(0), qpixel_num, get_local_id(0)), then no additional VGPRs are used.

Recalculating the index costs ~5 cycles, or even 1 with mad24, while VGPRs are always expensive.
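
A minimal sketch of the two indexing styles (the kernel shape, the buffer names and the dst write are made up for illustration; only the index expressions matter):

#define GR_SIZE 256

__kernel void idx_demo(__global const uchar4 *src, __global uchar4 *dst, int qpixel_num)
{
    __local uchar4 array_in_LDS[16 * GR_SIZE];   // room for 16 quad-pixels per work-item (qpixel_num is 0..15)
    const int lid = (int)get_local_id(0);
    const int lsz = (int)get_local_size(0);
    const uchar4 qpix = src[get_global_id(0)];

    // style A: the index is computed once and stays alive in a VGPR until the load
    int idx = lsz * qpixel_num + lid;
    array_in_LDS[idx] = qpix;
    barrier(CLK_LOCAL_MEM_FENCE);
    uchar4 a = array_in_LDS[idx];                // ...lots of other work would sit in between...

    // style B: recompute the index with mad24 at each use, so nothing has to stay live
    array_in_LDS[mad24(lsz, qpixel_num, lid)] = qpix;
    barrier(CLK_LOCAL_MEM_FENCE);
    uchar4 b = array_in_LDS[mad24(lsz, qpixel_num, lid)];

    dst[get_global_id(0)] = a + b;               // keep both results live
}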

Is there any way to prevent such easily recalculated data from being stored, without fooling the compiler? Maybe some directive, or an option to clBuildProgram?

realhet
Miniboss

I also had to use this trick, and I'm quite sure this behaviour is uncontrollable at the OpenCL level: all these register allocations and optimizations are handled by the AMD_IL compiler, and there is no way in AMD_IL to say that a particular intermediate value may be kept in registers or must be recalculated every time.

The mul24 trick is cool. In my problem I had to OR the data with a 0 that was kept in a constant buffer.
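
Roughly what that looked like (an illustrative sketch only; the kernel and buffer names are made up, and the point is just that the 0 comes from a __constant buffer, so the compiler cannot fold the OR away):

__kernel void or_zero_demo(__global uint *data, __constant uint *zero_buf)
{
    const size_t gid = get_global_id(0);
    uint v = data[gid];
    // ... heavy code here would otherwise tempt the compiler to precompute and
    //     hold a derived value in a VGPR ...
    v |= zero_buf[0];   // zero_buf[0] is 0 at run time, so the value is unchanged,
                        // but the extra dependency blocks the "helpful" caching
    data[gid] = v;
}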


Thanks!

P.S.

VGPR usage is strange: for every piece of code it jumps up and down unpredictably.

With this fragment in the code I get 104 VGPRs used:

x = x0/4; y = y0/4;

//some code between lines using x,y

x = (x0 + 16)/4; y = y0/4;

x = x0/4; y = (y0 + 16)/4;

x = (x0 + 16)/4; y = (y0 + 16)/4;

With this, which is simpler in every way, I get 110 VGPRs:

x = x0/4; y = y0/4;

x += 4;

x -= 4; y += 4;

x += 4;

And with a mix of the two, 102 VGPRs:

x = x0/4; y = y0/4;

x += 4;

x = x0/4; y = (y0 + 16)/4;

x = (x0 + 16)/4; y = (y0 + 16)/4;

I think there is no point in even trying to optimize this.

aazmp
Journeyman III

I)

Just a story.

I found a place in my kernel where I could store data to __local memory to avoid reading it from __global later, and started to implement it in small steps.

Kernel launch options: GlobalWorkSize={14592,1,1}, WorkGroupSize={256,1,1}, HD7850.

Before the optimization: 123 VGPRs, 34 SGPRs, 0 scratch.

step 1) Added local storage for 6 uchar4 (6x4 bytes) per work-item, without using it:

#define GR_SIZE 256

__local uchar4 prefetched[6*GR_SIZE];

VGPR usage went from 123 to 121

(adding more unused variables and allocations did nothing)

step 2) Added one store line (but never loaded it back):

prefetched[(int)lid] = (*lap4p2).s0123; //vector passed into function by pointer

VGPRs went from 121 to 128

SGPRs from 34 to 57

Scratch regs from 0 to 1320

Execution time from 23ms to 208ms

step 3) Added all 6 store lines:

prefetched[(int)lid] = (*lap4p2).s0123;

prefetched[(int)(lsz+lid)] = (*lap4p2).s4567;

prefetched[(int)mad24(lsz,2,lid)] = (*l).s0123;

prefetched[(int)mad24(lsz,3,lid)] = (*l).s4567;

prefetched[(int)mad24(lsz,4,lid)] = (*l).s89AB;

prefetched[(int)mad24(lsz,5,lid)] = (*l).sCDEF;

VGPRs: 128 -> 256

SGPRs: 57 -> 44

Scratch: 1320 -> 384

time: 208ms -> 28ms
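
Putting the pieces together, the final shape is roughly this (an illustrative sketch: the real kernel receives the vectors through a helper-function pointer, and the read-back index here is made up):

#define GR_SIZE 256

__kernel void prefetch_demo(__global const uchar8 *g8, __global const uchar16 *g16,
                            __global uchar4 *out)
{
    __local uchar4 prefetched[6 * GR_SIZE];
    const int lid = (int)get_local_id(0);
    const int lsz = (int)get_local_size(0);
    const uchar8  v8  = g8[get_global_id(0)];     // stands in for *lap4p2
    const uchar16 v16 = g16[get_global_id(0)];    // stands in for *l

    prefetched[lid]                = v8.s0123;
    prefetched[lsz + lid]          = v8.s4567;
    prefetched[mad24(lsz, 2, lid)] = v16.s0123;
    prefetched[mad24(lsz, 3, lid)] = v16.s4567;
    prefetched[mad24(lsz, 4, lid)] = v16.s89AB;
    prefetched[mad24(lsz, 5, lid)] = v16.sCDEF;

    barrier(CLK_LOCAL_MEM_FENCE);                 // make the stores visible to the whole work-group

    // later: read (possibly a neighbour's) data from LDS instead of going back to __global
    out[get_global_id(0)] = prefetched[mad24(lsz, 2, (lid + 1) % lsz)];
}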

...

II)

Another observation:

for(i = 0; i < 3; i+=1)

{

     if (i%2==0) very_large_func_1(i);

     else very_large_func_2(i);

}

works faster (by 10% in my case) and produces about half as many VGPRs as

for (i = 0; i < 3; i+=2)

{

     j = i;

     very_large_func_1(j);

     j = i + 1;

     very_large_func_2(j);

}

It seems the compiler tries to parallelize code by renaming variables, and you can't force serial execution by writing

a = b; do1(a);

a = b+1; do2(a);

You have to fight it with ifs, fors and #pragma unroll.
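
For example, something along these lines (assuming the compiler honors #pragma unroll; a factor of 1 asks it not to unroll, which is what keeps the calls serial):

// loop + branch instead of two renamed variables, plus an anti-unroll hint
#pragma unroll 1
for (int i = 0; i < 3; i += 1)
{
    if (i % 2 == 0) very_large_func_1(i);
    else            very_large_func_2(i);
}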

III)

And another example. I had code like this:

do(a[0]); do(a[1]); do(a[2]); do(a[3]); do(a[4]); do(a[5]);

It used 137 VGPRs.

I tried this:

for(i = 0; i < 6; ++i) do(a[i]);

It produced wrong results (I could have made a mistake somewhere),

but then I just combined the two parts (the unrolled code rewrites the looped results):

for(i = 0; i < 6; ++i) do(a[i]);

do(a[0]); do(a[1]); do(a[2]); do(a[3]); do(a[4]); do(a[5]);

The output became correct and VGPR usage dropped from 137 to 114!


Hi,

II) Maybe I'm seeing it wrong, but those two loops would only be equivalent if the first one were for(i = 0; i < 4; i+=1).

The first version, which is 10% faster for you, only issues 3 very_large_func calls, while the second version issues 4.

That alone should make it about 33% faster, though.

Just some thoughts: your card has 1024 stream processors, so that global work size is not so good.

If you go for 128 VGPRs, then you should set the global work size to a multiple of 8*1024 in order to keep the CUs filled with work all the time.

256 VGPRs -> a multiple of 4096.

So unless your program is memory bound or heavily divergent, you could simply raise the global work size to 16K at no cost.

Low VGPR usage only matters when you need to hide memory I/O latency: it lets you put not 4 but 8 or even 10 wavefronts on each of the 16 CUs, so the 4 vector ALUs always have work while another 4 or 6 wavefronts are waiting on I/O.
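
Where those multiples come from, as a rough back-of-the-envelope calculation (assuming a 256-entry VGPR file per SIMD, 4 SIMDs per CU, 64-wide wavefronts and the HD7850's 16 CUs):

128 VGPRs per work-item -> 256/128 = 2 wavefronts per SIMD
2 wavefronts x 4 SIMDs = 8 wavefronts per CU
8 wavefronts x 64 work-items = 512 work-items per CU
512 x 16 CUs = 8192 = 8*1024 work-items to fill the card once

256 VGPRs per work-item -> 1 x 4 x 64 x 16 = 4096 work-items.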


Thanks.

The 3 vs. 4 thing was my mistake in the message; in the real kernel the loop does 4 steps.

I tried setting the global size to a multiple of 1024, but it made no difference. Originally it was a multiple of the work group size (256).

P.S.

In the end I managed to decrease VGPRs to 103 and gain 20% in speed (but that is mostly due to the LDS usage, I think).

I also found that these things lower VGPR usage:

1) pack data into preferred-size vectors (not always)

2) use scalar operations even on vectors: a.x + b.x ... a.z + b.z instead of a + b (see the small sketch after this list)

3) avoid manually unrolled code, even if inside the loop you have to write something like

for (i = 0; i < 4; ++i) {
     // do something with D
     A = (i == 0) ? D : A;
     B = (i == 1) ? D : B;
     C = (i == 2) ? D : C;
}

it works a little faster (and uses fewer VGPRs)

than

do something with A

do something with B

do something with C

do something with D
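
A tiny sketch of what I mean in (2) (nothing AMD-specific, just the component-wise form written out by hand):

// instead of the vector form  r = a + b;  write the components out:
float4 add_by_components(float4 a, float4 b)
{
    float4 r;
    r.x = a.x + b.x;
    r.y = a.y + b.y;
    r.z = a.z + b.z;
    r.w = a.w + b.w;
    return r;
}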
