Hi,
I am developing a signal-processing kernel that uses a lot of registers, and I stumbled over something I cannot explain.
My kernel declares private (per-work-item) arrays like this:
#define NDATA 64
#define NORDER 5
__kernel void myKernel( __global float2* inputData, __global float2* outputCoeff)
{
    float2 ef[NDATA];          // 128 * 32bit
    float2 eb[NDATA];          // 128 * 32bit
    float2 parcor[NORDER];     // 10 * 32bit
    float2 arc[NORDER+1];      // 12 * 32bit
    float2 arc_tmp[NORDER+1];  // 12 * 32bit
...
During the execution of the kernel some temporary variables are created, but not many.
When I estimate how many 32-bit registers the kernel uses, I get 290 (128 + 128 + 10 + 12 + 12) plus some temporaries. Let's go with 300 32-bit registers.
I am running the kernel on a 6450, which has 16384 vector registers per CU, and each vector register is 128 bits wide.
This would mean that each work-item needs 75 ( = 300 / 4) vector registers. So when I run the kernel with a work-group size of, say, 5, I shouldn't see
any scratch register usage, because the CU has enough vector registers for many more than 5 work-items at once.
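To make the estimate easy to check, here is the same arithmetic as a small host-side C sketch. It is not part of the kernel. It assumes that every array element would live in a register and that one 128-bit vector register holds four 32-bit values for a work-item; the function names are made up for illustration and nothing here reflects how the compiler actually allocates registers.

```c
#include <assert.h>
#include <stddef.h>

#define NDATA  64
#define NORDER 5

/* Number of 32-bit values held by the kernel's private arrays.
   Each float2 element consists of two 32-bit floats. */
size_t private_array_dwords(void)
{
    const size_t dwords_per_float2 = 2;
    size_t elements = NDATA          /* ef      */
                    + NDATA          /* eb      */
                    + NORDER         /* parcor  */
                    + (NORDER + 1)   /* arc     */
                    + (NORDER + 1);  /* arc_tmp */
    return elements * dwords_per_float2;
}

/* Round a count of 32-bit values up to whole 128-bit registers
   (assumption: 4 x 32-bit values per vector register). */
size_t vector_regs_needed(size_t dwords)
{
    const size_t dwords_per_vector_reg = 4;
    return (dwords + dwords_per_vector_reg - 1) / dwords_per_vector_reg;
}

/* Naive upper bound on work-items whose registers fit in one CU,
   ignoring wavefront granularity and any compiler limits. */
size_t work_items_that_fit(size_t regs_per_cu, size_t regs_per_work_item)
{
    assert(regs_per_work_item > 0);
    return regs_per_cu / regs_per_work_item;
}
```

With these assumptions, private_array_dwords() returns 290, vector_regs_needed(300) returns 75, and work_items_that_fit(16384, 75) returns 218, so by this naive count a work-group of 5 should fit easily.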
When I analyze this kernel with the Kernel-Analyzer, the results puzzle me:
GPRs-Used: 9
Scratch-Regs used: 70
This is what confuses me.
Did I get something wrong from the documentation, or am I just so deeply stuck in the details that I can't see the obvious flaw in my calculations?
I hope someone has some advice on this.
Thanks in advance