
Raistmer
Adept II

Scratch registers - how to prevent their usage?

I use a fixed-size array in registers to reduce the fetch size required by the kernel.

At a certain size (11 elements) kernel performance dropped considerably (a 3x slowdown) and 22 scratch registers were used.

Kernel occupancy is 25%, which corresponds to 8 waves per CU.

That is, instead of using only a single workgroup of 4 waves and no scratch registers, the compiler decided to keep 8 waves (2 workgroups) per CU but introduce 22 scratch registers.

Since performance dropped greatly, that is obviously a bad choice.

At an array size of 10 there are no scratch registers at all, 8 waves and 31 VGPRs used (I'm profiling the kernel on a Loveland GPU).

At an array size of 11 there are 22 scratch registers, 31 VGPRs, and 8 waves as well.

Is it possible to somehow tell the compiler not to use scratch registers and to decrease the number of waves in flight instead?

I expect much better performance with more register space used per workitem, even if the number of waves in flight is decreased to only 4.

Here is the ISA for a length of 10:

; -------- Disassembly -------------------- 00 ALU_PUSH_BEFORE: ADDR(64) C - Pastebin.com

And here for a length of 11:

; -------- Disassembly -------------------- 00 ALU_PUSH_BEFORE: ADDR(224 - Pastebin.com

0 Likes
9 Replies
gopal
Staff

Hi Raistmer,

Yes, there is a way to reduce the scratch register usage to 0. You can either tune the work-group size specified at launch time to expose more registers per kernel to the compiler, or re-write the kernel code. But it is limited by the maximum number of GPRs available in your hardware. You cannot decrease the number of waves in flight; that is decided by the scheduler. The scheduler tries to launch as many waves as possible to hide memory latency, which is essential for performance.

Further, the compiler generates scratch regs if it cannot fit all the live values into registers. But it should not do so for 25-35 VGPRs. What is the maximum number of VGPRs available on your card? Can you post your kernel code? What is the array's element size?

0 Likes

Well, I can't reduce the workgroup size (it would decrease performance) or change the kernel code (the longer the register array, the smaller the total fetch amount).

You list recommendations regarding kernel optimization, but I don't need kernel optimization; I need the compiler's decisions optimized.

How can one affect the choice the compiler makes? You mention a threshold of 35 registers. That is where the root of the issue is hidden. Many more registers can be allocated to a single workitem while still having 256 workitems in a workgroup.

All the compiler should do is allocate more than 35 VGPRs per workitem. But instead it refuses to allocate more and uses scratch registers.

@You cannot decrease the number of waves in flight; that is decided by the scheduler. The scheduler tries to launch as many waves as possible to hide memory latency, which is essential for performance.@

Well, one can limit it.


Actually, I tried this experiment:

I added a shared (local) memory array allocation of 32 KB inside the kernel. That makes only a single workgroup per CU possible (due to shared memory limits).

I expected that the compiler would realize that the scheduler will not be able to allocate 8 waves per CU anyway and would allow more VGPRs per workitem... no, the compiler turned out to be not that clever. Even with 4 waves per CU it uses scratch registers.
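A minimal sketch of the LDS-blocking trick used in that experiment (assuming 32 KB of LDS per CU on this hardware, so one such array prevents a second workgroup from being resident; the kernel name and argument are illustrative only, not my real kernel):

// hypothetical stripped-down kernel just to show the LDS-blocking trick
__kernel void lds_blocker_test(__global float* dst)
{
        // 8192 floats * 4 bytes = 32 KB of LDS; if a CU has 32 KB of LDS,
        // only one workgroup (4 waves) can be resident at a time
        __local float lds_blocker[8192];
        lds_blocker[get_local_id(0)] = 0.f;   // touch it so it is not optimized away
        barrier(CLK_LOCAL_MEM_FENCE);
        dst[get_global_id(0)] = lds_blocker[get_local_id(0)];
}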

But I have another kernel, written back when the ATi SDK was not able to create an array in registers at all. That kernel emulates the array with separate variables x1, x2, x3, ... instead of an array x[].

Well, that kernel has 50 VGPRs allocated per workitem, runs only 4 wavefronts per CU (obviously), but uses no scratch registers.

So the real issue is that compiler heuristic: never exceed 35 VGPRs if arrays are used. Is it possible to circumvent this with some pragma or environment variable? Or should I, as a few years ago, do a manual kernel unroll and use x1, ... x11 instead of x[11] (see the sketch below)?
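Roughly, the difference between the two styles (a sketch of the accumulator declarations only, not the full kernel; acc1..acc11 here stand in for the x1..x11 of that old kernel):

// array form: the compiler refuses to go past ~10 elements without spilling
ftype acc[11];
for (size_t j = 0; j < 11; j++) acc[j] = 0.f;

// manually unrolled form: each accumulator is an individually named variable,
// so the compiler simply assigns a VGPR to each of them (no array indexing)
ftype acc1 = 0.f, acc2 = 0.f, acc3 = 0.f, acc4 = 0.f, acc5 = 0.f, acc6 = 0.f,
      acc7 = 0.f, acc8 = 0.f, acc9 = 0.f, acc10 = 0.f, acc11 = 0.f;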

EDIT: it looks quite silly to be able to let the compiler unroll small arrays of 2-5 elements, where the code could be unrolled manually anyway, but not to be able to let it do long unrolls, where a manual unroll would bloat the kernel code considerably... just because the compiler refuses to allocate long enough arrays in registers even when there are enough registers for such arrays.

EDIT2: kernel:

//R: these 3 defines set vector size of fetch kernel typedef float4 ftype; #d - Pastebin.com

0 Likes

Hi Raistmer,

Yes, developers have some control to limit the number of active waves per CU, but it is bounded by resource usage. That includes the number of registers and the LDS used by the kernel. There are also hardware limits locally (per CU) as well as globally (across all the CUs) which restrict the number of waves in flight. The scheduler tries to pick the optimum of these numbers and launch it.

"You list threshold of 35 registers."

There is no threshold of 35 registers. From your post, it seems that your kernel code uses 31 VGPRs, and hence I took a rough estimate around that number to better explain the issue. As per table 7.7 from this book http://developer.amd.com/wordpress/media/2013/07/AMD_Accelerated_Parallel_Processing_OpenCL_Programm..., more than 35 VGPRs can be allocated. And for this family of GPUs, if the number of registers per work-item exceeds 124, then scratch regs would occur.

"How can one affect the choice the compiler makes?"

As of now there is no way to directly tell the compiler to use some fixed number of registers per work-item. The number of registers used by a work-item is determined when the kernel is compiled. Ideally the OpenCL compiler knows the size of the work-group at compile time and hence makes an optimal register allocation. Without knowing the work-group size, the compiler assumes an upper-bound size so that it does not allocate more registers per work-item than the hardware actually contains. So in your case, make sure that you use the upper-bound work-group size (256).
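For reference, the standard way to give the compiler the work-group size at compile time is the reqd_work_group_size kernel attribute (a minimal sketch with a hypothetical kernel; whether it actually changes register allocation in your case is worth verifying with the profiler):

// the attribute fixes the launch-time local size to 256x1x1, so the compiler
// can allocate registers for exactly that size instead of an assumed upper bound
__attribute__((reqd_work_group_size(256, 1, 1)))
__kernel void fetch_kernel_fixed_wg(__global float* dst)
{
        dst[get_global_id(0)] = 0.f;
}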

Vectorization: it can lead to greater efficiency, but it also increases the number of VGPRs needed for storage. So while using vectorization, check how well the kernel code is using the five-wide (or four-wide, depending on the GPU) VLIW unit. This can be checked using the ALUPacking counter provided by the profiler. If it is below 70%, it indicates that dependencies are preventing full use of the processor.

I am not able to access your link; is it possible for you to share the kernel code so I can have a look at it?

0 Likes

@And for this family of GPUs, if the number of registers per work-item exceeds 124, then scratch regs would occur@.


This contradicts my experience. As one can see, the maximum possible register-array size for the kernel listed below is 10 (without scratch regs). That corresponds to 31 VGPRs allocated.

Moving to 11 adds 22 scratch regs instantly, though the total number of regs is well below 124. Maybe one should qualify this as a bug report then? Because 124 VGPRs being available is definitely not the case for this kernel.


@So in your case, make sure that you use the upper-bound work-group size (256).@

That is, the compiler doesn't know that 256 is the max allowed workgroup size? One should specify 256 explicitly? Hm... I would understand direct specification of a workgroup size less than the max to give a hint to the compiler, but the max... Well, I will try it and see if it changes anything.


Can't access the pastebin site? Well, here is the code re-copied from there:

//R: these 3 defines set vector size of fetch kernel
typedef float4 ftype;
#define FETCH vload4
#define FSHIFT 2 // 0 - scalar; 2 - float4; 3 - float8; 4 - float16
#define FD (480) //(480<<2) - scalar/float; (480) - float4; (480>>1) - float8; (480>>2) - float16
#define FARRAY_SIZE 10
__kernel void GPU_fetch_array_kernel_twin_1D_cl(__global float* src, __global float* src_twin,
                                                __global int* offsets, __global int* f_int,
                                                __global ftype* dest, __global ftype* dest_twin,
                                                const uint offset, const uint stride, const uint max_per_int){
        size_t threadID=get_global_id(0)+offset;
//      size_t j=get_global_id(1);
//      if(j>=fd)return; //R:matrix bounds guard
        size_t l;
        ftype acc[FARRAY_SIZE], acc_twin[FARRAY_SIZE];
        //__local src_cache[1024];
        //__local src_twin_cache[1024];
        size_t n_per=f_int[threadID];
        for(size_t outer_loop=0;outer_loop<max_per_int;outer_loop+=FARRAY_SIZE){
                for(size_t j=0;j<FARRAY_SIZE;j++){acc[j]=0.f;acc_twin[j]=0.f;}
                for(size_t k=0;k<n_per;k++){
                        l=offsets[k*stride+threadID];
                        for(size_t j=0;j<FARRAY_SIZE;j++){
                                __global float* s=src+l+((outer_loop+j)<<FSHIFT);
                                acc[j]+=FETCH(0,s);
                                s=src_twin+l+((outer_loop+j)<<FSHIFT);
                                acc_twin[j]+=FETCH(0,s);
                        }
                }
                for(size_t j=0;j<FARRAY_SIZE;j++){
                        size_t index=threadID*FD+(outer_loop+j);
                        dest[index]=acc[j];
                        dest_twin[index]=acc_twin[j];
                }
        }
}
0 Likes

Hi Raistmer,

Thanks for sharing the kernel code.

@That is, the compiler doesn't know that 256 is the max allowed workgroup size?@

The compiler knows the max work-group size. Here is an example of how the compiler uses this max work-group size. "Assume the compiler allocates 75 registers per work-item; as per table 7.7, only three waves (192 work-items) are then supported. If the user later launches the kernel with a work-group size of four waves (256 work-items), the launch fails because the work-group requires 75*256 = 19,200 registers, which is more than the hardware allows. To prevent this from happening, the compiler performs register allocation with the conservative assumption that the kernel is launched with the max work-group size (i.e. 256 work-items). The compiler guarantees that the kernel does not use more than 62 registers (the maximum number of registers which supports a work-group of four waves), and generates low-performance spill code if necessary."
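As a side note, you can check what limits the compiler actually settled on for a particular build by querying the kernel after compilation (a host-side sketch; kernel and device are assumed to be already created, includes and error checks omitted):

size_t max_wg = 0;
cl_ulong private_bytes = 0;
/* largest work-group this compiled kernel can be launched with */
clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE,
                         sizeof(max_wg), &max_wg, NULL);
/* private (scratch) memory per work-item; a non-zero value usually indicates spilling */
clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_PRIVATE_MEM_SIZE,
                         sizeof(private_bytes), &private_bytes, NULL);
printf("max work-group size: %zu, private mem per work-item: %llu bytes\n",
       max_wg, (unsigned long long)private_bytes);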


@One should specify 256 explicitly?@

It is not strictly necessary, but I think it is recommended for the following reasons:

1. A work-group size of 256 is recommended to ensure that each core is being used.

2. From the LDS point of view, larger work-groups enable more work-items to efficiently share data, which can reduce the amount of slower global communication.

3. Assume a device where each CU has 16,384 GP registers and each work-item requires 260 registers. If the user launches the kernel with a work-group size of 64, the compiler guarantees that the kernel does not use more than 256 registers and may generate spill code (because 256*64 = 16,384 registers is the maximum limit). If the user launches the same kernel with a work-group size of 256, the compiler again guarantees that the kernel does not use more than 256 registers and may generate spill code if necessary. In both cases spill code may occur, so which case should the user select? It might seem that the second case would cause more spill code, but it does not, because in both cases each wave has the same number of VGPRs available (256*64 = 16,384). Hence register spilling is no greater a problem with four waves per work-group than it is with one wave per work-group.


@Hm... I would understand direct specification of a workgroup size less than the max to give a hint to the compiler, but the max... Well, I will try it and see if it changes anything.@

That is right, but depending on the user scenario it may not be bad to explicitly specify the max work-group size.


@ Maybe one should qualify this as bug report then?@

I will try to reproduce the issue and get back to you.

0 Likes

Thanks. I would appreciate this.

Thanks for the explanations regarding register allocation; your description coincides with my views on the topic.

In my tests I deliberately created a situation with a NULL workgroup size (hence the compiler should assume 256) and fully allocated LDS, which means 1 workgroup per CU and hence 4 waves per CU.

In the usual regime (with a 10-element register array, no LDS allocated) there are 8 waves per CU (2 workgroups of 4 waves each).

Indeed, I see some speed degradation for the 4-waves-per-CU variant vs 8 waves per CU. I intended to compensate for this slowdown with other optimizations later. [EDIT: according to this thread: Re: Why only 256 workitems per workgroup for ATi GPU? it's not possible to hide memory access latency by wave swapping if only 1 workgroup is in flight per CU.] But this slowdown is MUCH smaller than the 3-fold speed decrease due to register spilling. So for now I should limit the register array size to 10 and use only part of the total VGPRs available on the CU. Currently my kernel uses 31*64*4 = 7936 < 16k registers available per CU. And my kernel doesn't need 2 or more workgroups per CU (I plan to use a persistent-threads approach with further kernel modifications), so the aim is only 4 waves in a single workgroup with all available registers per CU in use.

EDIT: why direct specification of the workgroup size can be worse than NULL:

The global size of my kernel depends on external factors and is not guaranteed to be divisible by 256. With NULL as the workgroup size I allow the OpenCL runtime to use the maximum possible size when it can, and still be able to execute the kernel with fewer than 256 workitems per workgroup when needed.

If I were to require a workgroup size of 256, such calls would fail.
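For clarity, the two launch variants look roughly like this on the host side (a sketch; variable names are illustrative, error handling omitted):

size_t global = total_work;   /* set by external factors, not necessarily a multiple of 256 */

/* what I do now: local size NULL, the runtime picks the work-group size */
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL, 0, NULL, NULL);

/* explicit 256: on OpenCL 1.x this fails with CL_INVALID_WORK_GROUP_SIZE
   whenever global is not divisible by 256 */
size_t local = 256;
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, &local, 0, NULL, NULL);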

0 Likes

Hi Raistmer,

Could you share the host code as well? I want to have a look at this.

Thanks,

0 Likes

Were you able to figure out acceptable workarounds?  Do send us the host code so that we can investigate this.

0 Likes
ndv
Journeyman III

I think this inappropriate usage of scratch registers has something to do with thread divergence.

I have two kernels: one allocates float[20] and uses scratch memory, and the other allocates float[128] and doesn't use scratch memory. Both run with the same workgroup size. The difference is that the first one uses a divergent algorithm (quicksort), and the other uses a vectorizable algorithm (sequential bitonic sort).

My guess is that the compiler splits private memory into several partitions so that threads can keep running after they diverge, each using its own partition. Probably because the minimum addressable unit is a x16 vector, not a float. This splitting would only make sense if the compiler still tries to run those threads in parallel after divergence, though I am not sure how that would be possible.

Another strange thing about the compiler is that when I replace float[20] with uchar[20], the usage of scratch registers increases.
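For anyone who wants to reproduce the effect, the pattern is roughly this (a minimal sketch, not my real kernels; switching the element type from float to uchar is enough to see the scratch usage change in the profiler):

__kernel void private_array_test(__global float* data)
{
        float tmp[20];                         /* try: uchar tmp[20]; */
        for (int i = 0; i < 20; i++)
                tmp[i] = (float)(get_global_id(0) + i);
        /* data-dependent index keeps the private array addressable at run time */
        int idx = ((int)data[get_global_id(0)]) % 20;
        if (idx < 0) idx = -idx;
        data[get_global_id(0)] = tmp[idx];
}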

0 Likes