No. Your program needs ~77 registers as the compiler allocated 7 GPRs and 70 "scratch" registers. Since you are using the default workgroup size of 256, you can only allocate 64 registers per thread.
As I mentioned earlier, try specifying "__attribute__((reqd_work_group_size(64,1,1)))" on the kernel. Just note that this may limit how many wavefronts can be scheduled.
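The relationship between workgroup size and the per-thread register limit can be sketched as below. This is a rough model, not the compiler's actual allocator: the wavefront size of 64 and the budget of 256 vector GPRs per thread slot are assumptions, chosen so that a 256-thread workgroup reproduces the 64-register limit quoted above. The function names are made up for illustration. Real hardware may impose a lower ceiling for small workgroups.

```c
#include <assert.h>

/* Assumed figures, not taken from the thread: adjust for your device. */
enum {
    WAVEFRONT_SIZE       = 64,  /* threads (work-items) per wavefront */
    GPR_BUDGET_PER_SLOT  = 256  /* vector GPRs shared by the wavefronts of one workgroup */
};

/* Number of wavefronts needed to hold one workgroup, rounded up. */
static int wavefronts_per_group(int work_group_size)
{
    assert(work_group_size > 0);
    return (work_group_size + WAVEFRONT_SIZE - 1) / WAVEFRONT_SIZE;
}

/* Upper bound on vector GPRs per thread under this model: every wavefront
 * of the workgroup must be resident at once, so they split the budget. */
static int max_gprs_per_thread(int work_group_size)
{
    return GPR_BUDGET_PER_SLOT / wavefronts_per_group(work_group_size);
}
```

Under these assumptions, max_gprs_per_thread(256) evaluates to 64, which is below the ~77 registers the kernel needs. A smaller required workgroup size raises the bound.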
Alright, I got that. But I still need some clarification on the register width. The registers we are talking about here (GPRs): are those the vector registers?
If so, the documentation says that each vector register is 128 bits wide, so each register could hold 4 floats.
Is that the case or not?
If it is the case, then I'm wondering why my program needs so many registers.
Yes. That's 7 vector registers, i.e. 28 scalar registers, plus the 70 vector registers that got spilled, which is an additional 280 scalar registers. Naively, your code seems to allocate 128 + 128 + 10 + 12 + 12 = 290 scalar registers in private arrays, which would equal ~73 vector registers if you didn't use any other temps.
This all assumes the compiler can't move the arrays into temps.
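The arithmetic above can be checked with a short sketch. It only assumes that one 128-bit vector register holds four 32-bit scalars, as discussed earlier in the thread. The helper names are hypothetical, and the array sizes are the ones listed in the post.

```c
#include <assert.h>

/* One 128-bit vector register holds four 32-bit scalar values. */
enum { SCALARS_PER_VECTOR_GPR = 4 };

/* Scalar slots provided by a number of vector registers. */
static int scalar_slots(int vector_gprs)
{
    assert(vector_gprs >= 0);
    return vector_gprs * SCALARS_PER_VECTOR_GPR;
}

/* Vector registers needed to hold a number of scalars, rounded up. */
static int vector_gprs_needed(int scalars)
{
    assert(scalars >= 0);
    return (scalars + SCALARS_PER_VECTOR_GPR - 1) / SCALARS_PER_VECTOR_GPR;
}

/* Total scalars in the private arrays named in the post. */
static int private_array_scalars(void)
{
    const int sizes[] = { 128, 128, 10, 12, 12 };
    int total = 0;
    for (unsigned i = 0; i < sizeof sizes / sizeof sizes[0]; ++i)
        total += sizes[i];
    return total;
}
```

Here scalar_slots(7) is 28, scalar_slots(70) is 280, and vector_gprs_needed(private_array_scalars()) is 73, close to the 77 registers the compiler reported.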
I have checked that the option is enabled. It is! But still no kernel occupancy information is being generated.
I have not yet checked the samples.
I will do that next so I can provide more information.
One other thing to check: is the profiler creating a .occupancy file? If so, it would be in the same location as the .csv file that contains the performance counter data. From Visual Studio, you can quickly get to this location by right clicking the profile session in the "APP Profiler Session Explorer" window and then selecting the "Open Containing Folder" menu item. If there is an .occupancy file created, can you post its contents here so we can see if there is an obvious reason that the profiler client is not able to display the occupancy data?
If there is not an .occupancy file, we'll have to figure out why. If you get a chance, please try to profile one of the SDK samples so we can see whether the lack of occupancy data is a specific problem with your application or a general problem on your machine.
I can confirm that no occupancy information is generated with the sprofile that comes with SDK 2.6 (Linux, x86_64). The value is always zero. The occupancy file is generated. GPRs, local memory usage and limits are correct, though.
Besides, I got strange results in the "normal" csv profiler output with some heavily ALU-bound kernels (RAR and WPA cracking ones) that involve loops and whose typical execution time exceeds several seconds. I've got some kernel invocations reporting wildly varying values like those:
There is no "early termination if some condition is met", the inputs are similar (actually in that test case they were all the same), and all kernel invocations should have the same wavefronts, ALU operations, fetch operations and so on. That's very strange, I think.
Thanks for reporting this issue.
What Linux driver do you use? We have confirmed that with Catalyst 12.1, the occupancy number is incorrect. It will be fixed in the next version of Catalyst.
What profiler version did you use? Did you use the profiler that comes with SDK 2.6? If so, there is a newer version on the APP Profiler webpage that has addressed some of the performance counter issues. Can you also make sure there is no other application using the GPU while you are profiling?
It looks like you have the Catalyst 12.1 driver. As you can see in the occupancy file, the maximum LDS size is reported as 0. This issue will be fixed in an upcoming driver release.
I am using the one that comes with SDK 2.6. I didn't know there was an update, thanks. I used Catalyst 12.1. There was no other GPGPU application running; however, the system is not headless, it has KDE running on it.