I am developing a kernel for signal processing which uses a lot of registers and I stumbled over something I am not able to explain.
My kernel uses local variables like
#define NDATA 64
#define NORDER 5
__kernel void myKernel( __global float2* inputData, __global float2* outputCoeff )
{
    float2 ef[NDATA];          // 128 * 32 bit
    float2 eb[NDATA];          // 128 * 32 bit
    float2 parcor[NORDER];     // 10 * 32 bit
    float2 arc[NORDER+1];      // 12 * 32 bit
    float2 arc_tmp[NORDER+1];  // 12 * 32 bit
    // ...
}
During the execution of the kernel some temporary variables are created, but not many.
Now when I calculate how many 32-bit registers the kernel uses, I get 290 plus some temporaries. Let's go with 300 32-bit regs.
I am running the kernel on a 6450, which has 16384 vector registers per CU, and each vector register is 128 bits wide.
This would mean that each work-item needs 75 (= 300 / 4) vector registers. So when I run the kernel with a work-group size of, say, 5, I shouldn't see any scratch-register usage, because the CU has enough vector registers for far more than 5 work-items at once.
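As a quick sanity check, here is that back-of-envelope count written out (the sizes come from the kernel above; the rounding to 300 for temporaries is the original poster's assumption):

```python
# Back-of-envelope register count for the kernel above.
# Each float2 holds two 32-bit floats.
NDATA, NORDER = 64, 5

scalar_regs = (
    2 * NDATA           # ef:      128 x 32-bit
    + 2 * NDATA         # eb:      128 x 32-bit
    + 2 * NORDER        # parcor:   10 x 32-bit
    + 2 * (NORDER + 1)  # arc:      12 x 32-bit
    + 2 * (NORDER + 1)  # arc_tmp:  12 x 32-bit
)
print(scalar_regs)   # 290 32-bit values per work-item

# Round up to ~300 for temporaries, then pack four 32-bit values
# into each 128-bit vector register:
vector_regs = 300 // 4
print(vector_regs)   # 75 vector registers per work-item
```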
When I analyze this kernel with the Kernel-Analyzer the results are somewhat obscure to me:
Scratch-Regs used: 70
This is what confuses me.
Did I get something wrong from the documentation, or am I so deep in the details that I can't see the obvious flaw in my calculations?
I hope someone has some advice on this.
Thanks in advance
Using arrays and then indexing into them dynamically might be the problem: in many cases the compiler is not able to promote those memory accesses into register accesses.
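A minimal OpenCL C sketch of the difference (a hypothetical kernel, not from the original code; `#pragma unroll` support varies by compiler):

```c
__kernel void example(__global float2* in, __global int* idx, __global float2* out)
{
    float2 parcor[5];
    /* ... fill parcor ... */

    /* Dynamic index: idx[0] is unknown at compile time, so the array
       generally has to stay addressable in memory (scratch). */
    out[0] = parcor[idx[0]];

    /* Compile-time indices: after full unrolling every access names a
       known element, so each element can live in its own register. */
    float2 acc = (float2)(0.0f, 0.0f);
    #pragma unroll
    for (int i = 0; i < 5; ++i)
        acc += parcor[i];
    out[1] = acc;
}
```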
The granularity of registers is not as fine as that. There are 16384 vector registers, but those are shared across a minimum of 64 threads. That means at most 256 registers per thread. To complicate things, a thread can only access at most 128 registers. So in reality, we are limited to 128 registers and if you want to use all of the registers, then you need to have at least two wavefronts in flight per CU.
Since we are limited to 128 registers per thread, there's no benefit (as far as register allocation goes) to reducing the work group size below 64 threads.
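The arithmetic above, spelled out (numbers taken from this discussion, for this class of AMD GPU):

```python
# Per-thread register limits, using the figures from the thread.
total_vgprs_per_cu = 16384
min_threads_sharing = 64   # registers are shared by at least one wavefront (64 threads)
hw_addressable_limit = 128 # a single thread can address at most 128 registers

per_thread = total_vgprs_per_cu // min_threads_sharing
print(per_thread)          # 256 registers per thread at minimum occupancy

usable = min(per_thread, hw_addressable_limit)
print(usable)              # 128 -> you need >= 2 wavefronts in flight to use them all

# Worst case the compiler assumes when the workgroup size is unknown:
worst_case = total_vgprs_per_cu // 256
print(worst_case)          # 64 registers per thread
```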
In your case, the compiler doesn't know what your workgroup size will be, so it assumes the worst case (256 threads). That means you have at most 64 registers per thread. If you want to avoid spilling, you can add "__attribute__((reqd_work_group_size(64,1,1)))" to your kernel to tell the compiler what your workgroup size will be. In this case we would be able to allocate 128 registers per thread, and that would avoid spilling in your case (9 + 70 = 79, which is less than 128). Note that you will be reducing the number of wavefronts in flight per CU and may see worse performance in some cases.
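Applied to the kernel from the original post, the attribute would look roughly like this (OpenCL C; the body is elided):

```c
#define NDATA 64

/* Pinning the workgroup size to one wavefront (64 work-items) lets the
   compiler allocate up to 128 registers per work-item instead of
   assuming the 256-thread worst case. */
__attribute__((reqd_work_group_size(64, 1, 1)))
__kernel void myKernel( __global float2* inputData, __global float2* outputCoeff )
{
    float2 ef[NDATA];
    /* ... */
}
```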
Also, all compilers use registers as a cache: there is never a 1:1 mapping between variable declarations and register assignments. This doesn't necessarily mean register spillage either: registers only need to be available while a variable is in scope *and* in use. But obviously that depends on the code, and as Micah said, dynamically addressed arrays (i.e. those whose index is not knowable at compile time) cannot be registerised in the first place.
If you need dynamically addressed private arrays, use local memory, and then index them in multiples of 32 offset by localid.x to avoid bank conflicts.
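A sketch of that layout (a hypothetical kernel; it assumes a 64-work-item workgroup and uses the workgroup size as the stride, which serves the same purpose as the suggested multiples of 32: adjacent work-items land in adjacent banks):

```c
#define WG_SIZE 64
#define NORDER  5

__kernel __attribute__((reqd_work_group_size(WG_SIZE, 1, 1)))
void example(__global float2* in, __global float2* out)
{
    /* One NORDER-element "private" array per work-item, laid out so that
       element i of work-item lid sits at [i * WG_SIZE + lid]. */
    __local float2 parcor[NORDER * WG_SIZE];
    int lid = get_local_id(0);

    for (int i = 0; i < NORDER; ++i)
        parcor[i * WG_SIZE + lid] = in[i];   /* dynamic indexing is fine here */

    out[get_global_id(0)] = parcor[(lid % NORDER) * WG_SIZE + lid];
}
```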
BTW, if you're aiming for performance (and let's face it, if you're using OpenCL and counting registers, you are), a workgroup size of 5 isn't any use: you'll need to add some parallelism, and if multiple threads are working on the same problem, that will mean local memory too.
Hey, thank you so much for your input. I really appreciate it. Also, the skill level in this forum seems to be reasonably high - I like it.
Regarding my problem, I will try to change some things in the kernel like using local memory and then increasing the work-group size.
Perhaps it would also be a good idea to put some of the information you gave me (like the 128-register limit per thread) into the documentation, because that's where I first looked for information about my issue.
thanks a lot so far.
Table 4.11 in the AMD_Accelerated_Parallel_Processing_OpenCL_Programming_Guide.pdf shows you how many wavefronts can be allocated based on register usage. Workgroups smaller than a wavefront will still allocate as many registers as a whole wavefront. This dramatically reduces the complexity of addressing registers in the GPU.
It seems that I am not able to view the kernel occupancy panel.
In my Performance-Counter-Panel there is no column named "...Occupancy..." and if I open the Kernel-Occupancy-Viewer the panel is blank.
I have installed APP Profiler 2.4 and I am using VS2010 (64 bit).
If we are limited to 128 registers per thread - 128 vector registers, right? - and I need only 75 of those, the compiler should be able to allocate them the right way, or not?
I'm just thinking that it's less than 60% of the allowed register usage per thread, so it should be doable. I know it obviously isn't, but I'm having difficulty understanding why.
Can you double check that you have the option enabled to generate occupancy information? In the APP Profiler Settings dialog, make sure that "Generate occupancy information for each OpenCL kernel profiled" is checked.
If you are profiling from the command line, make sure that you pass the --occupancy option to sprofile.exe.
Also, are you able to see Occupancy info if you profile one of the SDK sample applications? Just wondering if there is something specific with your application that prevents the occupancy feature from working.