In a kernel, when I use uint4 instead of uint, I get more ALU instructions...
Dear people,
I am seeing strange behaviour in a kernel and am looking for advice on why I am losing performance.
I have a kernel that operates on 32-bit integer values. Its input is 2 arrays of 64k data items each (uint32). The kernel uses logical operations (and/or) and shifts (<< / >>) to generate the output data (a 64k uint32 array).
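For illustration only, the two variants being compared can be sketched on the host in plain C. The operation `combine` is hypothetical (the actual kernel body is not shown in this post), and the loops merely stand in for work-items: one element per work-item in the uint case, four elements per work-item in the uint4 case.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-element operation built from and/or and shifts.
   The real kernel's logic is not shown in the post. */
static uint32_t combine(uint32_t a, uint32_t b)
{
    return ((a << 3) | (b >> 5)) & (a | b);
}

/* Scalar variant: each iteration plays the role of one work-item
   handling a single uint32 element. */
void run_scalar(const uint32_t *a, const uint32_t *b, uint32_t *out, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        out[i] = combine(a[i], b[i]);
}

/* Vector variant: each outer iteration plays the role of one work-item
   handling four consecutive elements, as a uint4 kernel would.
   Assumes n is a multiple of 4. */
void run_vec4(const uint32_t *a, const uint32_t *b, uint32_t *out, size_t n)
{
    for (size_t i = 0; i + 3 < n; i += 4)
        for (size_t lane = 0; lane < 4; ++lane)
            out[i + lane] = combine(a[i + lane], b[i + lane]);
}
```

Both functions produce the same output; the only difference is how much work each "work-item" does, which is why the uint4 run needs a quarter of the global work size.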
When I use a kernel with uint as the data type, I get this profile:
GlobalWorkSize: (65535;1;1)
GroupWorkSize: (64;1;1)
KernelTime: 2.1xxxx
LocalMem: 0
ALU: 314
Fetch: 2
Write: 1
WaveFront: 2048
ALUBusy: 99.75
ALUFetchRatio: 157
ALUPacking: 94.14
FetchUnitBusy: 1.27
FetchUnitStalled: 0
WriteUnitStalled: 0
(I am using an HD4350, so there is no LDS.)
When I use a kernel with uint4 as the data type, I get this profile:
GlobalWorkSize: (16384;1;1)
GroupWorkSize: (64;1;1)
KernelTime: 2.3xxxxx
LocalMem: 0
ALU: 1329
Fetch: 2
Write: 1
WaveFront: 512
ALUBusy: 99.66
ALUFetchRatio: 664.5
ALUPacking: 98.72
FetchUnitBusy: 1.26
FetchUnitStalled: 0.06
WriteUnitStalled: 0
So the uint4 kernel executes 73 more ALU instructions than four times the uint count (1329 vs. 314 x 4 = 1256), and the fetch unit starts to stall (FetchUnitStalled goes from 0 to 0.06). This also shows up in the kernel time and, of course, in the total time of the application. I have to process many sets of 64k items, and with uint4 data the processing takes about 1.05 times as long as with uint.
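The arithmetic behind that comparison can be written out as a small C sketch. The helper names are made up for illustration, and it assumes the profiler's ALU counter is a per-work-item instruction count, so that a uint4 work-item is expected to cost four times a uint work-item.

```c
#include <stdio.h>

/* ALU instructions a uint4 work-item would need if it simply did the
   scalar work four times (assumption: the counter is per work-item). */
unsigned expected_vec4_alu(unsigned scalar_alu)
{
    return scalar_alu * 4u;
}

/* Extra ALU instructions in the uint4 kernel beyond that expectation. */
int alu_overhead(unsigned scalar_alu, unsigned vec4_alu)
{
    return (int)vec4_alu - (int)expected_vec4_alu(scalar_alu);
}

/* ALU instructions per processed uint32 element in the uint4 kernel. */
double alu_per_element(unsigned vec4_alu)
{
    return vec4_alu / 4.0;
}

/* ALUFetchRatio as reported by the profiler: ALU count over fetch count. */
double alu_fetch_ratio(unsigned alu, unsigned fetch)
{
    return (double)alu / (double)fetch;
}

/* Prints the figures for the two profiles quoted in the post. */
void print_summary(void)
{
    printf("expected uint4 ALU: %u\n", expected_vec4_alu(314u));
    printf("overhead:           %d\n", alu_overhead(314u, 1329u));
    printf("ALU per element:    %.2f vs 314\n", alu_per_element(1329u));
    printf("ALUFetchRatio:      %.1f vs %.1f\n",
           alu_fetch_ratio(314u, 2u), alu_fetch_ratio(1329u, 2u));
}
```

With the numbers above, the uint4 kernel spends 332.25 ALU instructions per element instead of 314, which is roughly 5.8% more ALU work for the same output.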
So my question is: what can I do to get better performance with uint4 data? Working with uint4 is supposed to be faster, but I am seeing a loss of performance instead. Any advice or insight is welcome. Thanks in advance for your help.
Best regards,
Alfonso