
Meteorhead
Challenger

reducing scratch register usage

Hi!

My first question is how scratch registers are allocated by the HW. The scratch register usage that SKA indicates seems really high (58 GPR, 24 SR). Is the SR number indicated here the highest number allocated at any time during execution? Below is all the memory allocation in the code (plus global constants, which in my mind should be compiled into the code and not take up registers). I don't think this much memory allocation justifies the use of 58 registers (that is, 58 vector4s of any 32-bit datatype).

 

__constant uint stateMask = 1812433253u;
__constant uint thirty = 30u;
__constant uint4 mask4 = (uint4)(1812433253u);
__constant uint4 thirty4 = (uint4)(30u);
__constant uint4 one4 = (uint4)(1u);
__constant uint4 two4 = (uint4)(2u);
__constant uint4 three4 = (uint4)(3u);
__constant uint4 four4 = (uint4)(4u);
__constant uint4 zero4 = (uint4)(0u);
__constant unsigned int thirteen = 13u;
__constant unsigned int fifteen = 15u;
__constant unsigned int shift = 8u * 3u;
__constant unsigned int mask11 = 0xfdff37ffu;
__constant unsigned int mask12 = 0xef7f3f7du;
__constant unsigned int mask13 = 0xff777b7du;
__constant unsigned int mask14 = 0x7ff7fb2fu;
__constant float one = 1.0f;
__constant float intMax = 4294967296.0f;

// Chainlength is the number of vectors
__kernel void main(__global uint4* seedArray,
                   __global uint4* initial,
                   const uint chainlength,
                   __local uint4* share,
                   __global uint4* partial,
                   const uint timestep,
                   const uint vectors_per_thread,
                   const float4 P, const float4 Q,
                   const float4 p, const float4 q,
                   __global uint4* r_p, __global uint4* r_q,
                   __local uint4* rp, __local uint4* rq,
                   const uint resolution,
                   __global int4* height,
                   __global int4* T,
                   __global float4* S,
                   __global float* W)
{
    // Global init
    int gidX = get_global_id(0);
    int tidX = get_local_id(0);
    int lsiX = get_local_size(0);
    int GIDX = get_group_id(0);

    // KPZ init
    uint4 XOR = (uint4)(0);
    uint4 factor = (uint4)(0);
    uint4 particle;
    uint4 shifted;
    uint4 redp;
    uint4 redq;
    uint4 reducedp;
    uint4 reducedq;
    uint my_share;
    uint survive;

    // PRNG init
    uint4 temp[8];
    uint4 state1 = seedArray[GIDX * lsiX + tidX];
    uint4 state2 = (uint4)(0);
    uint4 state3 = (uint4)(0);
    uint4 state4 = (uint4)(0);
    uint4 state5 = (uint4)(0);
    uint4 r1 = (uint4)(0);
    uint4 r2 = (uint4)(0);
    uint4 a = (uint4)(0);
    uint4 b = (uint4)(0);
    uint4 e = (uint4)(0);
    uint4 f = (uint4)(0);
    float4 random[16];
} // main

0 Likes
20 Replies
himanshu_gautam
Grandmaster

Hi meteorhead,

Scratch space is a variable-sized region of off-chip memory that stores some of the "GPRs".

AFAIK, the scratch registers shown are the maximum that would be required at any particular time.

I am not sure about the optimization of constant variables. There might also be some extra space reserved for the sake of alignment.

 

0 Likes

The reason why I am confused is that Stream Kernel Analyzer states the kernel requires 58 GPR and 24 SR. I counted only 24 allocated vectors (counting four separate, preallocated integers as a single register, hoping the compiler is smart enough to pack them into a vector). That would mean partial results of calculations double the required GPR count, which seems very irrational, not to mention the huge SR count.

Where is this off-chip memory exactly? Does one need to worry about some of these registers being slower than others?

Also, can someone tell me how to create "variables" that facilitate programming but do not reside in registers and are instead compiled into the code? The variables I declare at the beginning never change. I declared them __constant because the compiler told me I cannot declare global variables that are not part of a memory namespace. Since they are truly static const variables, __constant seemed right. But could someone tell me how these variables could be compiled into the code? (Please, not #define; that is ugly.)

Cheers,

Máté

0 Likes

From the OpenCL Programming Guide:

"The compiler generates spill code (shuffling values to, and from, memory) if it cannot fit all the live values into registers. Spill code uses long-latency global memory and can have a large impact on performance. The ATI Stream Profiler reports the static number of register spills in the ScratchReg field".

In my experiments there are some optimizations performed for __constant variables, but I have also seen the GPR count increase in some situations.
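As a toy illustration (my own sketch, not from the guide): a kernel whose live private data cannot fit in the register file forces the compiler to generate such spill code, e.g.

    // Indexed private arrays like this typically exceed the register budget,
    // so the compiler spills them to scratch (off-chip) memory.
    __kernel void spilly(__global float* out)
    {
        float big[256];                  // far more private data than fits in GPRs
        for (int i = 0; i < 256; ++i)
            big[i] = (float)i * 2.0f;
        float acc = 0.0f;
        for (int i = 0; i < 256; ++i)    // all elements stay live until here...
            acc += big[i];               // ...so values shuffle to and from scratch
        out[get_global_id(0)] = acc;
    }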



0 Likes

Can some AMD employee (or someone who has achieved this particular feat) explain how the compiler can be told that something is a compile-time constant, so it can be inserted into the code and not take up register space?

Edit: and thank you for the info, himanshu. I seem to have overlooked that in the guide.

This spill memory sounds very good, given that it effectively gives us an infinite amount of private memory (at the cost of speed and maximum work-group size). However, I have a feeling that the excessive GPR/SR usage is due to another problem someone mentioned in another topic, namely that there is no compiler option to enforce optimizations even on long codes at the cost of compile time. (The person mentioning this said he clearly saw that once a kernel became too long, the compiler would "give up" and use GPRs without trying to find out which variables could be eliminated.)

It is true that such a compiler option would be useful. The main argument for having such an option, letting the compiler optimize as long as it likes, is that there are applications that run for days or even weeks. The particular problem I am working on has runtimes of 4 hours to 7 days. My group is starting to work on the same problem in 2D, and later most likely 3D. Runtimes will be long, and I do not care if the program compiles for an hour if that brings a 10% speedup. If compilation takes that long, I'll learn to load kernels from binary.
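For the record, the binary route is not much code; a minimal sketch, assuming a program already built once with clBuildProgram for a single device (error checking omitted):

    /* First run: fish the compiled binary out of the program object. */
    size_t binSize;
    clGetProgramInfo(program, CL_PROGRAM_BINARY_SIZES, sizeof(size_t), &binSize, NULL);
    unsigned char* binary = (unsigned char*)malloc(binSize);
    clGetProgramInfo(program, CL_PROGRAM_BINARIES, sizeof(unsigned char*), &binary, NULL);
    /* ...write 'binary' to disk. Later runs: read it back and rebuild from it. */
    cl_int status;
    cl_program cached = clCreateProgramWithBinary(context, 1, &device, &binSize,
                                                  (const unsigned char**)&binary,
                                                  NULL, &status);
    clBuildProgram(cached, 1, &device, NULL, NULL, NULL); /* no lengthy recompile */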

0 Likes

Try specifying a smaller work-group size with a kernel attribute. That way the compiler will allocate more GPRs per work-item, so you can get rid of the scratch registers.
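A minimal sketch of what I mean, using the standard reqd_work_group_size kernel attribute (the kernel name and body are just illustrative):

    // Promising the compiler a fixed work-group size of 128 lets it budget
    // more GPRs per work-item than a default 256-wide launch would allow.
    __attribute__((reqd_work_group_size(128, 1, 1)))
    __kernel void test(__global uint4* data)
    {
        int gid = get_global_id(0);
        data[gid] ^= (uint4)(0xdeadbeefu); // placeholder work
    }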

0 Likes

Meteorhead,

I suspect the compiler will not be able to fold all constant variables under all conditions. In some cases memory needs to be allocated for them.

I suggest you send in a test case that shows this problem. Also send your system information: CPU, GPU, SDK, driver, OS.

0 Likes

Another technique is to manually limit the scope of variables with {}. It is absolutely legal to enclose any code in {}; it limits the scope of the variables, which can help reduce register usage if not all variables are needed all the time.

__kernel void aaaa()
{
    int a, b, c, d;
    // some code working with a, b
    // some code working with c, d
}

transforms into

__kernel void aaaa()
{
    {
        int a, b;
        // some code working with a, b
    }
    {
        int c, d;
        // some code working with c, d
    }
}

0 Likes

I have tried out several things.

First, what nou suggested: limiting the work-group size. By default the kernel would run in a group of 256; I limited this to 128. The same total number of registers was required, but this time no scratch registers were needed: (58 GPR + 24 SR) became (82 GPR + 0 SR).

The second attempt was to use defines instead of __constant variables, but this also seemed futile: it appears that memory is allocated for every single compile-time constant, and none could be folded into the code. The same register usage resulted.
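For concreteness, the define variant looked something like this (a sketch showing only a few of the constants):

    #define STATE_MASK 1812433253u
    #define MASK4      ((uint4)(1812433253u))
    #define THIRTY4    ((uint4)(30u))
    #define ONE4       ((uint4)(1u))
    // used exactly where the __constant variables were, e.g.:
    // state2 = MASK4 * (state1 ^ (state1 >> THIRTY4)) + ONE4;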

I read through the code and could only account for roughly 50-56 registers, including variables coming in as constant kernel arguments. Given that some operations require additional registers to hold temporary values, I could imagine 10-12 more. However, 82 is still 130% of the register usage I would expect for a kernel that looks like this.

I am using Win7 with SDK 2.3, Catalyst 11.1, and a Mobility 5870 as the device. See the attached code for the particular test case.

Edit: nou, unfortunately adding brackets will not help in my case, because all the __constant variables are used by the PRNG throughout the program. Or to be precise, I need them to be global because I wanted the rndm() function to be defined outside main. Declaring all those variables inside the PRNG would make it possible to free them from time to time, but rndm() is called very often, and allocating and freeing memory that many times would be inefficient. I am much more worried about the extra 30% of GPR usage that I cannot seem to account for.

// Vector type shifting with carryover between elements
void lshift(uint4 input, uint shift, uint4* output)
{
    unsigned int invshift = 32u - shift;
    uint4 temp;
    temp.w = input.w << shift;
    temp.z = (input.z << shift) | (input.w >> invshift);
    temp.y = (input.y << shift) | (input.z >> invshift);
    temp.x = (input.x << shift) | (input.y >> invshift);
    *output = temp;
}

void rshift(uint4 input, uint shift, uint4* output)
{
    unsigned int invshift = 32u - shift;
    uint4 temp;
    temp.x = input.x >> shift;
    temp.y = (input.y >> shift) | (input.x << invshift);
    temp.z = (input.z >> shift) | (input.y << invshift);
    temp.w = (input.w >> shift) | (input.z << invshift);
    *output = temp;
}

void lshift128(uint4 input, uint shift, uint4* output)
{
    unsigned int invshift = 32u - shift;
    uint4 temp;
    temp.x = input.x << shift;
    temp.y = (input.y << shift) | (input.x >> invshift);
    temp.z = (input.z << shift) | (input.y >> invshift);
    temp.w = (input.w << shift) | (input.z >> invshift);
    *output = temp;
}

void rshift128(uint4 input, uint shift, uint4* output)
{
    unsigned int invshift = 32u - shift;
    uint4 temp;
    temp.w = input.w >> shift;
    temp.z = (input.z >> shift) | (input.w << invshift);
    temp.y = (input.y >> shift) | (input.z << invshift);
    temp.x = (input.x >> shift) | (input.y << invshift);
    *output = temp;
}

__constant uint stateMask = 1812433253u;
__constant uint thirty = 30u;
__constant uint4 mask4 = (uint4)(1812433253u);
__constant uint4 thirty4 = (uint4)(30u);
__constant uint4 one4 = (uint4)(1u);
__constant uint4 two4 = (uint4)(2u);
__constant uint4 three4 = (uint4)(3u);
__constant uint4 four4 = (uint4)(4u);
__constant unsigned int thirteen = 13u;
__constant unsigned int fifteen = 15u;
__constant unsigned int shift = 8u * 3u;
__constant unsigned int mask11 = 0xfdff37ffu;
__constant unsigned int mask12 = 0xef7f3f7du;
__constant unsigned int mask13 = 0xff777b7du;
__constant unsigned int mask14 = 0x7ff7fb2fu;
__constant float one = 1.0f;
__constant float intMax = 4294967296.0f;

void rand(uint4* r1, uint4* r2, uint4* a, uint4* b, uint4* e, uint4* f,
          uint4* state1, uint4* state2, uint4* state3, uint4* state4, uint4* state5,
          uint4* temp, float4* out)
{
    uint i = 0;
    for (i = 0; i < 8u; ++i)
    {
        switch (i)
        {
            case 0: *r1 = *state4; *r2 = *state5; *a = *state1; *b = *state3; break;
            case 1: *r1 = *r2; *r2 = temp[0]; *a = *state2; *b = *state4; break;
            case 2: *r1 = *r2; *r2 = temp[1]; *a = *state3; *b = *state5; break;
            case 3: *r1 = *r2; *r2 = temp[2]; *a = *state4; *b = *state1; break;
            case 4: *r1 = *r2; *r2 = temp[3]; *a = *state5; *b = *state2; break;
            case 5: *r1 = *r2; *r2 = temp[4]; *a = temp[0]; *b = temp[2]; break;
            case 6: *r1 = *r2; *r2 = temp[5]; *a = temp[1]; *b = temp[3]; break;
            case 7: *r1 = *r2; *r2 = temp[6]; *a = temp[2]; *b = temp[4]; break;
            default: break;
        }
        lshift128(*a, shift, e);
        rshift128(*r1, shift, f);
        // temp and out are arrays; they must be indexed with i here
        // (the flattened original read "temp.x" and "out =", which does not compile)
        temp[i].x = (*a).x ^ (*e).x ^ (((*b).x >> thirteen) & mask11) ^ (*f).x ^ ((*r2).x << fifteen);
        temp[i].y = (*a).y ^ (*e).y ^ (((*b).y >> thirteen) & mask12) ^ (*f).y ^ ((*r2).y << fifteen);
        temp[i].z = (*a).z ^ (*e).z ^ (((*b).z >> thirteen) & mask13) ^ (*f).z ^ ((*r2).z << fifteen);
        temp[i].w = (*a).w ^ (*e).w ^ (((*b).w >> thirteen) & mask14) ^ (*f).w ^ ((*r2).w << fifteen);
        out[i] = convert_float4(temp[i]) * one / intMax;
    }
    // Re-initialize states
    *state1 = temp[6];
    *state2 = mask4 * (*state1 ^ (*state1 >> thirty4)) + one4;
    *state3 = mask4 * (*state2 ^ (*state2 >> thirty4)) + two4;
    *state4 = mask4 * (*state3 ^ (*state3 >> thirty4)) + three4;
    *state5 = mask4 * (*state4 ^ (*state4 >> thirty4)) + four4;

    for (i = 8; i < 16u; ++i)
    {
        switch (i)
        {
            case 8:  *r1 = *state4; *r2 = *state5; *a = *state1; *b = *state3; break;
            case 9:  *r1 = *r2; *r2 = temp[0]; *a = *state2; *b = *state4; break;
            case 10: *r1 = *r2; *r2 = temp[1]; *a = *state3; *b = *state5; break;
            case 11: *r1 = *r2; *r2 = temp[2]; *a = *state4; *b = *state1; break;
            case 12: *r1 = *r2; *r2 = temp[3]; *a = *state5; *b = *state2; break;
            case 13: *r1 = *r2; *r2 = temp[4]; *a = temp[0]; *b = temp[2]; break;
            case 14: *r1 = *r2; *r2 = temp[5]; *a = temp[1]; *b = temp[3]; break;
            case 15: *r1 = *r2; *r2 = temp[6]; *a = temp[2]; *b = temp[4]; break;
            default: break;
        }
        lshift128(*a, shift, e);
        rshift128(*r1, shift, f);
        temp[i % 8].x = (*a).x ^ (*e).x ^ (((*b).x >> thirteen) & mask11) ^ (*f).x ^ ((*r2).x << fifteen);
        temp[i % 8].y = (*a).y ^ (*e).y ^ (((*b).y >> thirteen) & mask12) ^ (*f).y ^ ((*r2).y << fifteen);
        temp[i % 8].z = (*a).z ^ (*e).z ^ (((*b).z >> thirteen) & mask13) ^ (*f).z ^ ((*r2).z << fifteen);
        temp[i % 8].w = (*a).w ^ (*e).w ^ (((*b).w >> thirteen) & mask14) ^ (*f).w ^ ((*r2).w << fifteen);
        out[i] = convert_float4(temp[i % 8]) * one / intMax;
    }
    // Re-initialize states
    *state1 = temp[6];
    *state2 = mask4 * (*state1 ^ (*state1 >> thirty4)) + one4;
    *state3 = mask4 * (*state2 ^ (*state2 >> thirty4)) + two4;
    *state4 = mask4 * (*state3 ^ (*state3 >> thirty4)) + three4;
    *state5 = mask4 * (*state4 ^ (*state4 >> thirty4)) + four4;
} // rand

// Chainlength is the number of vectors
__kernel void test(__global uint4* seedArray,
                   __global uint4* initial,
                   const uint chainlength,
                   __local uint4* share,
                   __global uint4* partial,
                   const uint timestep,
                   const uint vectors_per_thread,
                   const float4 P, const float4 Q,
                   const float4 p, const float4 q,
                   __global uint4* r_p, __global uint4* r_q,
                   __local uint4* rp, __local uint4* rq,
                   const uint resolution,
                   __global int4* height,
                   __global int4* T,
                   __global float4* S,
                   __global float* W)
{
    // Global init
    int gidX = get_global_id(0);
    int tidX = get_local_id(0);
    int lsiX = get_local_size(0);
    int GIDX = get_group_id(0);

    // KPZ init
    uint4 XOR = (uint4)(0);
    uint4 factor = (uint4)(0);
    uint4 zero4 = (uint4)(0);
    uint4 particle;
    uint4 shifted;
    uint4 redp;
    uint4 redq;
    uint4 reducedp;
    uint4 reducedq;
    uint my_share;
    uint survive;

    // PRNG init
    uint4 temp[8];
    uint4 state1 = seedArray[GIDX * lsiX + tidX];
    uint4 state2 = (uint4)(0);
    uint4 state3 = (uint4)(0);
    uint4 state4 = (uint4)(0);
    uint4 state5 = (uint4)(0);
    uint4 r1 = (uint4)(0);
    uint4 r2 = (uint4)(0);
    uint4 a = (uint4)(0);
    uint4 b = (uint4)(0);
    uint4 e = (uint4)(0);
    uint4 f = (uint4)(0);

    // Initializing states.
    state2 = mask4 * (state1 ^ (state1 >> thirty4)) + one4;
    state3 = mask4 * (state2 ^ (state2 >> thirty4)) + two4;
    state4 = mask4 * (state3 ^ (state3 >> thirty4)) + three4;
    state5 = mask4 * (state4 ^ (state4 >> thirty4)) + four4;

    float4 random[16];

    barrier(CLK_GLOBAL_MEM_FENCE); // All threads wait for each other after init

    for (uint t = 0; t < timestep; ++t)
    {
        for (uint v = 0; v < vectors_per_thread; ++v)
        {
            my_share = tidX * vectors_per_thread + v;
            prefetch(&partial[GIDX * chainlength + my_share + 1], sizeof(uint4));
            prefetch(&r_p[my_share + 1], sizeof(uint4));
            prefetch(&r_q[my_share + 1], sizeof(uint4)); // the flattened original prefetched r_p twice
            particle = partial[GIDX * chainlength + my_share];
            redp = r_p[my_share];
            redq = r_q[my_share];
            factor = zero4;
            XOR = zero4;
            rand(&r1, &r2, &a, &b, &e, &f, &state1, &state2, &state3, &state4, &state5, temp, random);
            // Odd timestep
            for (uint i = 0; i < 16u; ++i)
            {
                shifted = ((particle << (uint4)(i*2u)) >> thirty4);
                reducedp = (redp << (uint4)(i*2u)) >> (uint4)(31u);
                reducedq = (redq << (uint4)(i*2u+1)) >> (uint4)(31u);
                // random must be indexed with i (the flattened original compared 'random' unindexed)
                factor = select(zero4, select(zero4, three4, random[i] < select(P, p, reducedp == one4)), shifted == two4);
                factor |= select(zero4, select(zero4, three4, random[i] < select(Q, q, reducedq == one4)), shifted == one4);
                XOR = XOR << two4;
                XOR |= factor;
            }
            particle ^= XOR;
            partial[GIDX * chainlength + my_share] = particle;
        }
        barrier(CLK_GLOBAL_MEM_FENCE);

        for (uint v = 0; v < vectors_per_thread; ++v)
        {
            // Even timestep
            my_share = tidX * vectors_per_thread + v;
            prefetch(&partial[GIDX * chainlength + my_share + 1], 2 * sizeof(uint4));
            prefetch(&r_p[my_share + 1], 2 * sizeof(uint4));
            prefetch(&r_q[my_share + 1], 2 * sizeof(uint4));
            particle = partial[GIDX * chainlength + my_share];
            survive = particle.x; // survive used to save leftmost bit that is shifted out
            lshift(particle, 1u, &particle);
            factor = partial[GIDX * chainlength + ((my_share + 1) % chainlength)]; // factor used as temp
            particle.w |= factor.x >> 31u;
            redp = r_p[my_share];
            lshift(redp, 1u, &redp);
            redp.w |= r_p[(my_share + 1) % chainlength].x >> 31u;
            redq = r_q[my_share];
            lshift(redq, 1u, &redq);
            redq.w |= r_q[(my_share + 1) % chainlength].x >> 31u;
            mem_fence(CLK_GLOBAL_MEM_FENCE);
            factor = zero4;
            XOR = zero4;
            rand(&r1, &r2, &a, &b, &e, &f, &state1, &state2, &state3, &state4, &state5, temp, random);
            for (uint i = 0; i < 16u; ++i)
            {
                shifted = ((particle << (uint4)(i*2u)) >> thirty4);
                reducedp = (redp << (uint4)(i*2u)) >> (uint4)(31u);
                reducedq = (redq << (uint4)(i*2u+1)) >> (uint4)(31u);
                factor = select(zero4, select(zero4, three4, random[i] < select(P, p, reducedp == one4)), shifted == two4);
                factor |= select(zero4, select(zero4, three4, random[i] < select(Q, q, reducedq == one4)), shifted == one4);
                XOR = XOR << two4;
                XOR |= factor;
            }
            particle ^= XOR;
            factor = particle; // factor being used as temp to save rightmost bit
            rshift(particle, 1u, &particle);
            particle.x |= (survive >> 31u) << 31u; // keep first bit as we found it, survive used
            barrier(CLK_GLOBAL_MEM_FENCE);
            partial[GIDX * chainlength + my_share] = particle;
            barrier(CLK_GLOBAL_MEM_FENCE);
            survive = partial[GIDX * chainlength + ((my_share + 1) % chainlength)].x; // survive used as temp again for another purpose
            survive = ((survive << 1u) >> 1u) | (factor.w << 31u); // merge neighbour's first vector component with the bit to be written into it
            partial[GIDX * chainlength + ((my_share + 1) % chainlength)].x = survive; // make the actual copy into the neighbour
            barrier(CLK_GLOBAL_MEM_FENCE);
        } // vectors_per_thread
        barrier(CLK_GLOBAL_MEM_FENCE);
    } // timestep
    barrier(CLK_GLOBAL_MEM_FENCE);
    seedArray[GIDX * lsiX + tidX] = state1; // Save state of PRNG for next kernel call
} // test

0 Likes

How can one dump the ISA code after compilation? I set GPU_DUMP_DEVICE_KERNEL=2 but nothing happens. I tried setting it to 0, 1, 2, and 3, but nothing happens. This is under Ubuntu.

0 Likes

Micah,

I have copied the ISA code from the runtime into SKA to see the register usage, but it remains the same. (By the way, SKA under Linux would be nice indeed.) I still have the feeling that the kernel uses about 30% more GPRs than needed.

This is under Ubuntu 10.04, Catalyst 10.2 and HD5970.

0 Likes

meteorhead,
Are you dumping the ISA from the actual OpenCL runtime to find out if scratch registers are used or not? SKA does not always give the same results as the OpenCL runtime.
0 Likes

Micah, can you explain the circumstances in which SKA does not produce the same results as the OpenCL runtime?

The only scenario I'm aware of is when the installed Catalyst version is not the same as the Catalyst versions that are "hard-coded" into SKA. SKA 1.7 is hard-coded with Catalyst 10.12, which means that if you have 11.2 installed on your system, the ISA produced by the runtime and by SKA can differ.

Is there another scenario?

Why doesn't SKA have an option to use the Catalyst that is installed?

 

0 Likes

Jawed,

Apart from the reason you provided, the other reason is that SKA is a static tool while the Profiler is a runtime tool. With the Profiler, runtime optimizations are also taken into account, so the results obtained are more accurate. SKA never executes the kernel and reports information purely from compilation.

Thanks

0 Likes

I've thought of another scenario where ISA varies.

In fact this variation can be seen from within SKA, so the problems for developers are even worse.

Take the attached code. Put it in SKA 1.7.

Now compile. The ISA will be for the kernel "test", as that's the only kernel of the four that can be "seen".

Now change the position of the "/*" so that the first two kernels are available for compilation. Now press the Compile button. DO NOT change the name of the "function" to be compiled.

The ISA is now different. Play some more by moving the "/*" and "*/" around. You will find that other combinations of these four kernels all result in different ISA, even though all you are doing is compiling the same kernel, "test".

I have seen the same behaviour with other kernels I've worked on. The fact I'm using the double-precision extension here is irrelevant.

#pragma OPENCL EXTENSION cl_amd_fp64 : enable

kernel void test(global double *A, global double *B)
{
    int pos = get_global_id(0);
    B[pos] = pow(A[pos], A[pos + 1]);
}

/*
kernel void test_ln_exp(global double *A, global double *B)
{
    int pos = get_global_id(0);
    double C = log(A[pos]);
    B[pos] = exp(C * A[pos + 1]);
}

kernel void test2(global double2 *A, global double2 *B)
{
    int pos = get_global_id(0);
    B[pos] = pow(A[pos], A[pos + 1]);
}

kernel void test_ln_exp2(global double2 *A, global double2 *B)
{
    int pos = get_global_id(0);
    double2 C = log(A[pos]);
    B[pos] = exp(C * A[pos + 1]);
}
*/

0 Likes

I don't understand what you mean.

All the kernels visible are compiled when we use clBuildProgram.

So the ISA is bound to differ when different kernels are exposed. Commented-out kernels are ignored and never compiled.
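A minimal host-side sketch of what I mean (context, device, and source assumed to already exist; error checking omitted):

    /* clBuildProgram compiles every kernel visible in the source string;
       which kernel you later extract with clCreateKernel changes nothing. */
    cl_program prog = clCreateProgramWithSource(context, 1, &source, NULL, NULL);
    clBuildProgram(prog, 1, &device, NULL, NULL, NULL);
    cl_kernel k = clCreateKernel(prog, "test", NULL); /* "test" is just one of them */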

0 Likes

Why would kernel "test" result in different ISA depending on the presence or absence of other, unrelated, kernels in the OpenCL source? Those other kernels have nothing to do with "test".

0 Likes

Maybe there are per-program global resources that are being allocated to each kernel.

When there is only one kernel, it gets them all.

Maybe this is related to concurrent kernels, i.e. all of the kernels to be run concurrently need to be within the same program so that resources can be preallocated amongst them.

 

0 Likes

This is a scenario that applies to graphics, because a SIMD core can support multiple kernels concurrently (e.g. a vertex shader and a pixel shader). And there is a theoretical basis for altering the compilation of one shader based upon the properties of another shader which executes in a "pairing". In reality this is problematic because the "pairing" is not necessarily fixed. Vertex shader A doesn't necessarily always pair with pixel shader Z, and B with Y. So, in the end, I'm unsure if this kind of optimisation in compilation ever actually occurs.

Cayman GPU theoretically supports this concept of multiple kernels per SIMD core for compute kernels in addition to graphics kernels. (It's advertised, but no idea how many years it'll be before it's implemented.)

I have this compilation problem with just HD5870, which doesn't support multiple kernels per core for compute.

Kernels can overlap on the GPU as a whole, rather than within individual SIMD cores. As one kernel finishes, it occupies fewer and fewer SIMD cores, e.g. 20, 19, 16, 5, 1, and the kernel that follows in the execution queue can take over the SIMD cores that become idle. This, as far as I know, is supported on HD5870. But the compiler has no concept of the sequence of kernel calls, so it has no basis for making compilation optimisations when presented with a set of kernels that may or may not run back-to-back.

So I cannot find any logic in this situation.

0 Likes

Meteorhead,
I've looked into the ISA generation with the upcoming SDK release and there is no scratch generation anymore, but 62 registers are still used.
0 Likes

My guess was 50-56 registers, as I explained in my first post. 62 sounds better and more reasonable. I suppose I won't be able to predict all the compiler magic that takes up a few extra registers.

Thanks for the help.

0 Likes