I am developing a signal-processing kernel that uses a lot of registers, and I stumbled over something I am not able to explain.
My kernel uses private (per-work-item) variables like:
#define NDATA 64
#define NORDER 5

__kernel void myKernel(__global float2* inputData, __global float2* outputCoeff)
{
    float2 ef[NDATA];          // 128 * 32 bit
    float2 eb[NDATA];          // 128 * 32 bit
    float2 parcor[NORDER];     // 10 * 32 bit
    float2 arc[NORDER+1];      // 12 * 32 bit
    float2 arc_tmp[NORDER+1];  // 12 * 32 bit
    // ...
}
During execution the kernel creates a few temporary variables, but not many.
Now, when I calculate how many 32-bit registers the kernel uses, I get 290 plus some temporaries. Let's go with 300 32-bit regs.
I am running the kernel on a 6450, which has 16384 vector registers available per CU, and each vector register is 128 bits wide.
This would mean that each work-item needs 75 (= 300 / 4) vector registers. So when I run the kernel with a work-group size of, say, 5, I shouldn't see any scratch register usage, because the CU has vector registers for many more than 5 work-items at once.
When I analyze this kernel with the Kernel Analyzer, the results are somewhat obscure to me:
Scratch-Regs used: 70
This is what confuses me.
Did I get something wrong from the documentation, or am I so deep in the details that I'm not able to see the obvious flaw in my calculations?
I hope someone has some advice on this.
Thanks in advance
Using arrays and then dynamically indexing into them might be the problem as in many cases the compiler is not able to promote the memory accesses into register accesses.
The granularity of registers is not as fine as that. There are 16384 vector registers, but those are shared across a minimum of 64 threads. That means at most 256 registers per thread. To complicate things, a thread can only access at most 128 registers. So in reality, we are limited to 128 registers and if you want to use all of the registers, then you need to have at least two wavefronts in flight per CU.
Since we are limited to 128 registers per thread, there's no benefit (as far as register allocation goes) to reducing the work group size below 64 threads.
In your case, the compiler doesn't know what your workgroup size will be, so it assumes the worst case (256 threads). That means that you have at most 64 registers per thread. If you want to avoid spilling, you can add "__attribute__((reqd_work_group_size(64,1,1)))" to your kernel to tell the compiler what your workgroup size will be. In this case, we would be able to allocate 128 registers per thread and that would avoid spilling in your case (9 + 70 = 79, which is less than 128). Note that you will be reducing the number of wavefronts in flight per CU and may see worse performance in some cases.
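For reference, the attribute goes on the kernel declaration like this (a sketch reusing the kernel signature from above; the body is omitted):

```c
/* Telling the compiler the exact workgroup size lets it budget
   128 registers per thread instead of the worst-case 64: */
__kernel __attribute__((reqd_work_group_size(64, 1, 1)))
void myKernel(__global float2* inputData, __global float2* outputCoeff)
{
    /* ... kernel body unchanged ... */
}
```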
If we are limited to 128 registers per thread - 128 vector registers, right? - and I need only 7 of those, the compiler should be able to allocate those the right way, or not?
I'm just thinking that it's not even 6% of the allowed register usage per thread, so it should be doable. I know it obviously isn't, but I'm having difficulty understanding why.
No. Your program needs ~77 registers as the compiler allocated 7 GPRs and 70 "scratch" registers. Since you are using the default workgroup size of 256, you can only allocate 64 registers per thread.
As I mentioned earlier, try specifying the "__attribute__((reqd_work_group_size(64,1,1)))". Just note that this may limit how many wavefronts that can be scheduled.
Alright, I got that. But I still need some clarification on the register width. The registers we are talking about here (GPRs) - are those the vector registers?
If so, the documentation says that each vector register is 128 bits wide, so each register could hold 4 floats.
Is that the case or not?
If it is, then I'm wondering why my program needs so many registers.
Yes, that's 7 vector registers, so 28 scalar registers, plus the 70 vector regs that got spilled, which is an additional 280 scalar registers. Naively, your code seems to allocate 128 + 128 + 10 + 12 + 12 = 290 scalar regs in private arrays, which would equal ~73 vector registers if you didn't use any other temps.
This all assumes the compiler can't move the arrays into temps.
Also, all compilers use registers as a cache: there is never a 1:1 mapping between variable declarations and register assignments. This doesn't necessarily mean register spillage either: registers only need to be available while they're in scope *and* in use. But obviously that depends on the code, and as Micah said, dynamically addressed arrays (i.e. those whose index is not knowable at compile time) cannot be registerised in the first place.
If you need dynamically addressed private arrays, use local memory, and then index them in multiples of 32 offset by localid.x to avoid bank conflicts.
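A sketch of that layout (the names and the workgroup size of 64 are my assumptions, and note that moving the big ef/eb arrays this way would take 32 KB of LDS, so this is shown for the indexing pattern only):

```c
#define NORDER 5
#define WGSIZE 64  /* assumed workgroup size, also the LDS stride */

__kernel __attribute__((reqd_work_group_size(WGSIZE, 1, 1)))
void myKernel(__global float2* inputData, __global float2* outputCoeff)
{
    /* One (NORDER+1)-element array per work-item, interleaved so that
       consecutive work-items fall into consecutive LDS banks: */
    __local float2 arc[(NORDER + 1) * WGSIZE];
    int lid = get_local_id(0);

    /* Element i of this work-item's array lives at arc[i * WGSIZE + lid],
       so a runtime index j is fine: arc[j * WGSIZE + lid]. */
    arc[0 * WGSIZE + lid] = inputData[get_global_id(0)];
    /* ... */
}
```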
BTW, if you're aiming for performance (and let's face it, if you're using OpenCL and counting registers, you are), a workgroup size of 5 isn't any use: you'll need to add some parallelism, and if there are multiple threads working on the same problem, that'll mean local memory too.
Hey, thank you so much for your input. I really appreciate it. Also, the skill in this forum seems to be reasonably high - I like it.
Regarding my problem, I will try to change some things in the kernel like using local memory and then increasing the work-group size.
Perhaps it would also be a good idea to put some of the information you gave me (like the 128-register limit per thread) into the documentation, because I tried to find information on my issue there first.
thanks a lot so far.
Table 4.11 in the AMD_Accelerated_Parallel_Processing_OpenCL_Programming_Guide.pdf shows you how many wavefronts can be allocated based on register usage. Workgroups smaller than a wavefront will still allocate as many registers as a whole wavefront. This dramatically reduces the complexity of addressing registers in the GPU.
You could also use APP Profiler to see the estimated number of in-flight wavefronts based on register usage:
It seems that I am not able to view the kernel occupancy panel.
In my Performance-Counter-Panel there is no column named "...Occupancy..." and if I open the Kernel-Occupancy-Viewer the panel is blank.
I have installed APP Profiler 2.4 and I am using VS2010 (64 bit).
Can you double check that you have the option enabled to generate occupancy information? In the APP Profiler Settings dialog, make sure that "Generate occupancy information for each OpenCL kernel profiled" is checked.
If you are profiling from the command line, make sure that you pass the --occupancy option to sprofile.exe.
Also, are you able to see Occupancy info if you profile one of the SDK sample applications? Just wondering if there is something specific with your application that prevents the occupancy feature from working.
I have checked that the option is enabled. It is! But still no kernel occupancy information is being generated.
I have not yet checked the samples.
Will do soon and provide more information.
One other thing to check: is the profiler creating a .occupancy file? If so, it would be in the same location as the .csv file that contains the performance counter data. From Visual Studio, you can quickly get to this location by right clicking the profile session in the "APP Profiler Session Explorer" window and then selecting the "Open Containing Folder" menu item. If there is an .occupancy file created, can you post its contents here so we can see if there is an obvious reason that the profiler client is not able to display the occupancy data?
If there is not an .occupancy file, we'll have to figure out why -- if you get a chance, please try to profile one of the SDK samples so we can see if the lack of occupancy data is a specific problem with your application or if it looks like a general problem on your machine.
I can confirm that no occupancy information is generated with the sprofile that comes with SDK 2.6, Linux, x86_64. The value is always zero. The occupancy file is generated. GPRs, local memory usage and limits are correct, though.
Besides, I got strange results in the "normal" csv profiler output with some heavily ALU-bound kernels (RAR and WPA cracking ones) that involve loops, where typical kernel execution time exceeds several seconds. Some kernel invocations report wildly varying values like these:
There is no early termination if some condition is met, the inputs are similar (in that testcase they were actually all the same), and all kernel invocations should have the same wavefronts, ALU operations, fetch operations and so on. That's very strange, I think.
Thanks for reporting this issue.
What linux driver do you use? We have confirmed that with Catalyst 12.1, the occupancy number is incorrect. It will be fixed in the next version of Catalyst.
What profiler version did you use? Did you use the profiler that comes with SDK 2.6? If so, there is a newer version on the APP Profiler webpage that has addressed some of the performance counter issues. Can you also make sure that no other application uses the GPU while you are profiling?
I am using the one that comes with SDK 2.6. I didn't know there was an update, thanks. I used Catalyst 12.1. There was no other GPGPU application running; however, the system is not headless - it has KDE running on it.
Now, finally - more information.
I've profiled a sample application (the FFT sample) and the profiler does not generate the kernel occupancy information either. The .occupancy file is created (you'll find it attached to this post).
The details of my system should be in that file, I think.
It looks like you have the Catalyst 12.1 driver. As you can see in the occupancy file, the maximum LDS size is reported as 0. This issue will be fixed in an upcoming driver release.