
eugenek
Journeyman III

Possible compiler bug

See attached code.

Logically, the first version should run about as fast as the second version.

Instead, I'm seeing that the first version can do 20M kernels/second and the second version can do 32M kernels/second on a 6970. This seems to be a compiler optimization bug, because the assembly code for the first version is a LOT bigger than for the second version, much more than 4x bigger (according to Kernel Analyzer). To be specific, the second version compiles into 76 VLIW instructions per loop iteration, and the first version compiles into 520 instructions per loop iteration. Hence the performance difference: 32/20 = 1.6, and 520 / (76*4) ~ 1.7.

 

__kernel void test_kernel_1(__global long* out, __global ulong4* const d_p, ulong seed)
{
    ulong sum = 0;
    seed += get_global_id(0);
    for (int i = 0; i < 128; i += 4) {
        ulong4 p = d_p[i];
        sum += mul_hi(seed, p.x);
        sum += mul_hi(seed, p.y);
        sum += mul_hi(seed, p.z);
        sum += mul_hi(seed, p.w);
        p = d_p[i+1];
        sum += mul_hi(seed, p.x);
        sum += mul_hi(seed, p.y);
        sum += mul_hi(seed, p.z);
        sum += mul_hi(seed, p.w);
        p = d_p[i+2];
        sum += mul_hi(seed, p.x);
        sum += mul_hi(seed, p.y);
        sum += mul_hi(seed, p.z);
        sum += mul_hi(seed, p.w);
        p = d_p[i+3];
        sum += mul_hi(seed, p.x);
        sum += mul_hi(seed, p.y);
        sum += mul_hi(seed, p.z);
        sum += mul_hi(seed, p.w);
    }
    out[get_global_id(0)] = sum;
}

__kernel void test_kernel_2(__global long* out, __global ulong4* const d_p, ulong seed)
{
    ulong sum = 0;
    seed += get_global_id(0);
    for (int i = 0; i < 128; i++) {
        ulong4 p = d_p[i];
        sum += mul_hi(seed, p.x);
        sum += mul_hi(seed, p.y);
        sum += mul_hi(seed, p.z);
        sum += mul_hi(seed, p.w);
    }
    out[get_global_id(0)] = sum;
}

himanshu_gautam
Grandmaster

eugenek,

Shouldn't the first kernel be taking more time? It is doing more work inside the loop.


Originally posted by: himanshu.gautam eugenek,

Shouldn't the first kernel be taking more time? It is doing more work inside the loop.

It's just an unrolled version of the second. Notice the for() condition (i+=4 vs. i++).


eugenek,
Instruction count alone is not the only way to judge kernel performance. The unrolled kernel uses more registers, decreasing the number of work-groups that can execute in parallel. Your unrolled kernel uses 20 registers per work-item, limiting the number of work-groups per SIMD to 3, but the non-unrolled version uses 10 registers per work-item, allowing 6 work-groups per SIMD to execute. Unrolling a loop is not guaranteed to improve performance, because it increases register pressure and reduces the amount of parallelism during execution.
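As a rough back-of-the-envelope check of those numbers, assuming a 256 KB register file per SIMD (16384 vec4 GPRs) and 256 work-items per work-group (both assumptions, not figures reported anywhere in this thread):

#include <stdio.h>

int main(void)
{
    const int gprs_per_simd  = 16384;        /* assumed: 256 KB register file / 16-byte vec4 GPRs */
    const int workgroup_size = 256;          /* assumed work-group size */
    const int regs_per_item[2] = { 20, 10 }; /* unrolled vs. non-unrolled kernel */

    for (int k = 0; k < 2; k++) {
        /* work-groups that fit in one SIMD's register file */
        int wgs = gprs_per_simd / (regs_per_item[k] * workgroup_size);
        printf("%2d regs/work-item -> %d work-group(s) per SIMD\n",
               regs_per_item[k], wgs);
    }
    return 0;
}

With those assumptions the division comes out to 3 and 6 work-groups per SIMD, matching the figures above.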

Originally posted by: MicahVillmow eugenek, Instruction count alone is not the only way to judge kernel performance. The unrolled kernel uses more registers, decreasing the number of work-groups that can execute in parallel. Your unrolled kernel uses 20 registers per work-item, limiting the number of work-groups per SIMD to 3, but the non-unrolled version uses 10 registers per work-item, allowing 6 work-groups per SIMD to execute.


Understood. 3 work-groups is 12 wavefronts per unit, which looks to me like an ample number to hide memory latencies (I could be wrong, of course).

But that's beside the point. Even setting aside the effect of unrolling on the number of registers, there are simply too many instructions in the first version. It shouldn't go from 76 instructions to 520 instructions just because I unroll the loop four times.

It looks like the code takes a big hit in packing efficiency because of unrolling. The short version is very well packed (3.75 operations per VLIW). The unrolled version is poorly packed (2.2 operations per VLIW). Maybe the packing optimization shuts off for kernels above a certain length?
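One experiment I could run is to hand the compiler four explicitly independent lanes per iteration, e.g. with a vector accumulator and the vector form of mul_hi. This is just a sketch I haven't put through Kernel Analyzer:

__kernel void test_kernel_vec(__global long* out, __global ulong4* const d_p, ulong seed)
{
    /* four independent accumulation chains, one per vector component */
    ulong4 sum4 = (ulong4)(0);
    ulong4 s = (ulong4)(seed + get_global_id(0));

    for (int i = 0; i < 128; i++) {
        sum4 += mul_hi(s, d_p[i]); /* component-wise mul_hi */
    }

    out[get_global_id(0)] = sum4.x + sum4.y + sum4.z + sum4.w;
}

If packing really does degrade only on long kernels, this short but explicitly 4-wide form should still pack well.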


Oops, sorry, I missed that 4.

I think Micah is right that unrolling may not always result in a performance boost.

As far as hiding memory latencies is concerned, I think that heavily depends on the access pattern and caching. But one thing I would like to look at is why the ALU packing should decrease with loop unrolling, as you mentioned.

Can you post the complete code?


There's no complete code. This is just a simplified test case that I made while tweaking an actual kernel, when I noticed some weirdness in the performance numbers. (Among other things, the original kernel was using my own faster implementation of mul_hi.)
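For reference, the plain schoolbook decomposition of a 64-bit mul_hi into 32-bit halves looks roughly like the sketch below. This is not the faster variant I mentioned, just an illustration of why a single mul_hi expands into a pile of low-level multiplies:

ulong mul_hi64(ulong a, ulong b)
{
    ulong a_lo = a & 0xFFFFFFFFUL, a_hi = a >> 32;
    ulong b_lo = b & 0xFFFFFFFFUL, b_hi = b >> 32;

    /* four 32x32 -> 64-bit partial products */
    ulong lo_lo = a_lo * b_lo;
    ulong hi_lo = a_hi * b_lo;
    ulong lo_hi = a_lo * b_hi;
    ulong hi_hi = a_hi * b_hi;

    /* carry out of the low 64 bits of the full 128-bit product */
    ulong cross = (lo_lo >> 32) + (hi_lo & 0xFFFFFFFFUL) + lo_hi;

    return hi_hi + (hi_lo >> 32) + (cross >> 32);
}

Each of those 64-bit products in turn presumably lowers to 32-bit MULLO/MULHI-style instructions on this hardware.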

In the code there's one 32-byte memory access per 4 mul_hi's. Each mul_hi translates into 11 low-level multiplication instructions plus some other stuff (so 4 mul_hi's take at least 50 ns to execute). Even if there's no caching, I'd expect that 3 or 4 wavefronts per compute unit, each with 50 ns of ALU work per memory access, would hide all the memory latency there is. I could try moving all the accesses up to the beginning of the loop and see what happens.
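Something along these lines, with all four loads hoisted to the top of each iteration (an untested sketch, otherwise identical to the unrolled kernel):

__kernel void test_kernel_1_hoisted(__global long* out, __global ulong4* const d_p, ulong seed)
{
    ulong sum = 0;
    seed += get_global_id(0);

    for (int i = 0; i < 128; i += 4) {
        /* issue all four loads before any of the ALU work */
        ulong4 p0 = d_p[i];
        ulong4 p1 = d_p[i+1];
        ulong4 p2 = d_p[i+2];
        ulong4 p3 = d_p[i+3];

        sum += mul_hi(seed, p0.x) + mul_hi(seed, p0.y) + mul_hi(seed, p0.z) + mul_hi(seed, p0.w);
        sum += mul_hi(seed, p1.x) + mul_hi(seed, p1.y) + mul_hi(seed, p1.z) + mul_hi(seed, p1.w);
        sum += mul_hi(seed, p2.x) + mul_hi(seed, p2.y) + mul_hi(seed, p2.z) + mul_hi(seed, p2.w);
        sum += mul_hi(seed, p3.x) + mul_hi(seed, p3.y) + mul_hi(seed, p3.z) + mul_hi(seed, p3.w);
    }

    out[get_global_id(0)] = sum;
}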

One other thing I don't understand is why the unrolled version should use double the number of registers. It's the same work, just repeated. If the short version requires 10 registers, I'm quite sure you could get the long version working with 11 or 12. That too looks like a compiler bug to me.


Do I need to learn the low-level IL language to extract proper performance from this architecture?
