I am seeing strange behaviour in a kernel and am looking for advice on finding the cause of a performance loss.
The kernel operates on 32-bit integer values. Its input is 2 arrays of 64k data items each (uint32). It uses logical operations (and/or) and shifts (<< / >>) to generate the output data (a 64k uint32 array).
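To make the comparison concrete, here is a minimal host-side C sketch of the two shapes of kernel described above. It is only an illustration: `combine`, `run_uint`, `run_uint4` and the `u32x4` struct are hypothetical names, the particular and/or/shift mix is made up and is not the real kernel body, and plain loops stand in for the work-items.

```c
#include <stdint.h>
#include <stddef.h>

#define N 65536u  /* 64k items per array, as in the post */

/* Hypothetical per-element operation (NOT the actual kernel body):
   just some mix of and/or and shifts. */
static uint32_t combine(uint32_t a, uint32_t b) {
    return ((a & b) << 3) | ((a | b) >> 5);
}

/* Scalar variant: each "work-item" (loop iteration) handles one uint. */
static void run_uint(const uint32_t *a, const uint32_t *b,
                     uint32_t *out, size_t n) {
    for (size_t gid = 0; gid < n; ++gid)
        out[gid] = combine(a[gid], b[gid]);
}

/* Stand-in for OpenCL's uint4: four 32-bit lanes. */
typedef struct { uint32_t x, y, z, w; } u32x4;

/* uint4-style variant: each "work-item" loads four consecutive uints,
   applies the same operation per lane, and stores four results.
   Assumes n is a multiple of 4 (true for 64k). */
static void run_uint4(const uint32_t *a, const uint32_t *b,
                      uint32_t *out, size_t n) {
    for (size_t gid = 0; gid < n / 4; ++gid) {
        const uint32_t *pa = a + 4 * gid;
        const uint32_t *pb = b + 4 * gid;
        u32x4 r;
        r.x = combine(pa[0], pb[0]);
        r.y = combine(pa[1], pb[1]);
        r.z = combine(pa[2], pb[2]);
        r.w = combine(pa[3], pb[3]);
        out[4 * gid + 0] = r.x;
        out[4 * gid + 1] = r.y;
        out[4 * gid + 2] = r.z;
        out[4 * gid + 3] = r.w;
    }
}
```

Both variants produce the same 64k output; the uint4 one simply runs a quarter as many work-items with four times the work in each.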
When I use a kernel with uint as the data type, I get this profile:
(I use a HD4350, so no LDS...)
When I use a kernel with uint4 as the data type, I get this profile:
So the uint4 kernel generates 73 more ALU instructions than four times the uint kernel (1329 vs 1256 = 314 x 4, about 5.8% more) and starts to generate stalls in the fetch unit. This also shows up in the kernel time and, of course, in the total time of the application. I have to process a lot of 64k sets, and with uint4 data the processing time is about 1.05 times the uint time.
So my question is: what can I do to get better performance with uint4 data? Working with uint4 is supposed to be faster, but I am losing performance with it. Any advice or insight is welcome. Thanks in advance for your help.