
uint4 has less performance than uint?

Discussion created by afo on Feb 4, 2010
Latest reply on Feb 4, 2010 by Fr4nz
In a kernel, when I use uint4 instead of uint, I get more ALU instructions...

Dear people,

I am seeing strange behaviour in a kernel and am looking for advice on why I am losing performance.

I have a kernel that operates on 32-bit integer values. Its input is 2 arrays of 64k data items each (uint32). The kernel uses logical operations (and/or) and shifts (<< / >>) to generate the output data (a 64k uint32 array).
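The kernel source is not included in this post. As a rough illustration only, the two variants can be sketched in plain C (not OpenCL C). The `combine` operation is a placeholder made up for this sketch, and `uint4_t` is a hypothetical struct standing in for OpenCL's `uint4`; each loop iteration plays the role of one work-item.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for OpenCL's uint4: four 32-bit lanes. */
typedef struct { uint32_t x, y, z, w; } uint4_t;

/* Placeholder per-element operation (assumption: the real kernel's
   and/or/shift sequence is not shown in the post). */
static uint32_t combine(uint32_t a, uint32_t b)
{
    return ((a & b) << 3) | ((a | b) >> 5);
}

/* Scalar variant: one "work-item" per element, n iterations. */
void run_scalar(const uint32_t *a, const uint32_t *b, uint32_t *out, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        out[i] = combine(a[i], b[i]);
}

/* Vector variant: one "work-item" per four elements, n4 = n / 4 iterations. */
void run_vec4(const uint4_t *a, const uint4_t *b, uint4_t *out, size_t n4)
{
    for (size_t i = 0; i < n4; ++i) {
        out[i].x = combine(a[i].x, b[i].x);
        out[i].y = combine(a[i].y, b[i].y);
        out[i].z = combine(a[i].z, b[i].z);
        out[i].w = combine(a[i].w, b[i].w);
    }
}
```

Both variants do the same work per element, so in this simplified picture the 4-wide version would be expected to cost about four times the scalar version per work-item while needing a quarter of the work-items.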

When I use a kernel with uint as the data type, I get this profile:

GlobalWorkSize: (65535;1;1)

GroupWorkSize: (64;1;1)

KernelTime: 2.1xxxx

LocalMem: 0

ALU: 314

Fetch: 2

Write: 1

WaveFront: 2048

ALUBusy: 99.75

ALUFetchRatio: 157

ALUPacking: 94.14

FetchUnitBusy: 1.27

FetchUnitStalled: 0

WriteUnitStalled: 0

(I use an HD4350, so there is no LDS.)

When I use a kernel with uint4 as the data type, I get this profile:

GlobalWorkSize: (16384;1;1)

GroupWorkSize: (64;1;1)

KernelTime: 2.3xxxxx

LocalMem: 0

ALU: 1329

Fetch: 2

Write: 1

WaveFront: 512

ALUBusy: 99.66

ALUFetchRatio: 664.5

ALUPacking: 98.72

FetchUnitBusy: 1.26

FetchUnitStalled: 0.06

WriteUnitStalled: 0

So the uint4 kernel executes 73 more ALU instructions than four times the uint kernel (1329 vs. 1256 = 314 x 4), and it starts to show stalls in the fetch unit. This is also reflected in the kernel time and, of course, in the total time of the application. I have to process many sets of 64k items, and with uint4 data the processing time is about 1.05 times the uint processing time.
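The comparison of the two counter dumps can be reproduced with a few lines of C. This is only a sketch of the arithmetic; the function names are invented here, and it assumes the ALU counter is a per-work-item instruction count and that ALUFetchRatio is simply ALU divided by Fetch, which matches both profiles.

```c
/* ALU count the uint4 kernel would have if it cost exactly 4x the uint kernel. */
int expected_vec4_alu(int scalar_alu)
{
    return 4 * scalar_alu;
}

/* ALU instructions above that 4x baseline. */
int extra_alu(int scalar_alu, int vec4_alu)
{
    return vec4_alu - expected_vec4_alu(scalar_alu);
}

/* ALU instructions per processed element in the uint4 kernel. */
double vec4_alu_per_element(int vec4_alu)
{
    return vec4_alu / 4.0;
}

/* Assumed definition of the profiler's ALUFetchRatio counter. */
double alu_fetch_ratio(int alu, int fetch)
{
    return (double)alu / (double)fetch;
}
```

With the numbers above, `expected_vec4_alu(314)` is 1256, `extra_alu(314, 1329)` is 73, `vec4_alu_per_element(1329)` is 332.25 (against 314 for the uint kernel), and `alu_fetch_ratio` gives 157 and 664.5 for the two kernels.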

So my question is: what can I do to get better performance with uint4 data? Working with uint4 is supposed to be better, but I lose performance with it. Any advice or insight is welcome. Thanks in advance for your help.

 

best regards,

Alfonso
