Archives Discussions

foomanchoo · ‎12-14-2012

good morning.

thanks for listening. here's the riddle:

i have written an opencl kernel which is 99% alu bound, using only bit-operation instructions (v_xor, v_not, v_or, v_bfi). there is no access to memory (including lds). it uses only v_ instructions, with the exception of s_cmp, s_add and s_branch for the for loop. there are 100 alu instructions in the loop body, but the measurements (valu utilization, aka VALUBusy) do not change if there are 1000.

the loop is executed 10000 times per kernel invocation.

there are minimal read-after-write conflicts, and all read-after-write accesses to registers happen in non-adjacent instructions.

the loop body is a mix of ~70 VOP3 (v_bfi) instructions and ~30 VOP2 (and, or, xor).

the whole kernel program size is a mere 3k (fits instruction cache).

there is no thread divergence.

i cannot determine whether the 8 wavefronts per CU are executed concurrently or sequentially (4 then 4), because i don't see how i can access s_memtime from opencl.

yet sprofile only measures a VALUBusy of 60%. doing the math myself (number of instructions vs. kernel time), i come to the same conclusion.

i am curious what a kernel with more than 90% VALUBusy looks like. any examples?
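
The "number of instructions vs. kernel time" check can be sketched roughly as below. This is only a sketch under stated assumptions, not the profiler's actual formula: it assumes a VALU instruction occupies a SIMD for 4 clocks per wavefront (64 work-items on 16 lanes) and that waves sharing a SIMD run back to back. The function name, the clock rate, the wave count and the measured time are hypothetical illustration values, not measurements from this thread.

```python
def valu_busy_estimate(valu_insts_per_wave, waves_on_simd, clock_hz,
                       measured_seconds, cycles_per_valu_inst=4):
    """Estimate VALU utilization of one SIMD as ideal_time / measured_time.

    Assumption: every VALU instruction keeps the SIMD busy for
    `cycles_per_valu_inst` clocks per wavefront, and `waves_on_simd` is the
    total number of waves that SIMD executes during the kernel.
    """
    ideal_cycles = valu_insts_per_wave * waves_on_simd * cycles_per_valu_inst
    ideal_seconds = ideal_cycles / clock_hz
    return ideal_seconds / measured_seconds


# Loop from the post: 100 VALU instructions, executed 10000 times.
insts = 100 * 10000

# Hypothetical inputs: 2 waves on the SIMD, 1 GHz clock, 13.3 ms kernel time.
busy = valu_busy_estimate(insts, waves_on_simd=2, clock_hz=1.0e9,
                          measured_seconds=13.3e-3)
print(f"estimated VALUBusy: {busy:.0%}")
```

If the estimate and the profiler counter agree, as they do in the post, the shortfall is in instruction issue rather than in the measurement.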

binying · ‎12-14-2012

Try an example in the SDK 2.8?

realhet · ‎12-14-2012

I bet your loop burns the alu even when vgprs usage is as high as 256. Did you measure the complete kernel time and compare it to the estimated performance of the loop? Was that also 60% of the estimate?

foomanchoo · ‎12-14-2012

yes, the number of instructions and the kernel time reported by the profiler lead to the same conclusion.

i have also understood now that only 4 wavefronts per CU are concurrently active (16k LDS per wavefront, but that is accessed outside the 10000-iteration, 100-instruction alu loop body).

you are saying simple bit-alu should not need more than 4 WF per CU, if no memory access is involved. i would agree.

s_memtime should help solve the riddle. i assume that i would have to get familiar with editing the ELF file? is there an s_memtime in the IL? or can i edit the ISA in an ELF file too?

drallan · ‎12-15-2012

I think the answer to the riddle is that VOP3 instructions are 64-bit and can take longer to issue when grouped together.

However, when waves run in parallel, i.e., 8 waves/CU, then GCN uses magic and can multi-issue certain instruction types/patterns from parallel waves. Dual issue of s_ and v_ instructions is the best example, but this also reduces the latency of issuing multiple 64-bit instructions. None of this happens with only 4 waves: no magic! The AMD presentation GCN_final.pdf, available on the net, describes all this briefly.

I have a program that can measure this using s_memtime. Below are issue and average execution times for VOP3 vs. VOP2 instructions. Times are in clock ticks. With 75% VOP3 instructions and 4 waves, the average execution time is 1.5 clocks, or about 66% efficiency. With 8 waves it is closer to 90%.

Pattern              Waves          4      8      12     16
VOP2.VOP3.VOP3.VOP3  Issue time     6.00   6.00   6.00   6.00
VOP2.VOP3.VOP3.VOP3  Execute time   1.50   1.14   1.13   1.08
VOP2.VOP2.VOP2.VOP2  Issue time     4.10   4.10   4.10   4.10
VOP2.VOP2.VOP2.VOP2  Execute time   1.04   1.01   1.00   1.00

8 waves is the real sweet spot for gcn.
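
To make the efficiency figures explicit, here is a small sketch that converts the measured execute times into a percentage of peak. It assumes, as the "1.5 clocks or about 66%" remark suggests, that 1 clock per instruction counts as 100%; the dictionary layout and the function name are just illustration.

```python
# Average execute time per instruction (clocks), copied from the table above.
# Keys of the inner dicts are the wave counts from the table header.
execute_time = {
    "VOP2.VOP3.VOP3.VOP3": {4: 1.50, 8: 1.14, 12: 1.13, 16: 1.08},
    "VOP2.VOP2.VOP2.VOP2": {4: 1.04, 8: 1.01, 12: 1.00, 16: 1.00},
}


def efficiency(clocks_per_instruction):
    """Fraction of peak issue rate, taking 1 clock per instruction as 100%."""
    return 1.0 / clocks_per_instruction


for pattern, by_waves in execute_time.items():
    for waves, clocks in by_waves.items():
        print(f"{pattern}  {waves:2d} waves  {efficiency(clocks):.1%}")
```

For the VOP3-heavy pattern the biggest jump is between 4 and 8 waves; beyond that the gain is small.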

realhet · ‎12-15-2012

In the past I did an s/v pattern test, but with zillions of threads. It turned out that VVVV (all 64-bit v_) is easy for the GPU even if I use 256 numvgprs (and reduce the wavefronts to 1/CU).

But as you say it needs 8/CU from the 'outside' and 4/CU in the 'inside' to work on 100%.

Anyway, I still can't understand why it needs 256KB register memory and not 64KB (256regs*4bytes*64threads) (maybe for 4x pipelining, I guess).

That Instruction Arbitrator thing is a big mystery. It can handle a 4*64bit v_ and 4*64bit s_ pattern, but only with 64 vregs and tens of thousands of threads.

That would make an interesting 4-dimensional diagram of these parameters: [s/v pattern] x [register usage(64,84,128,256)] x [lds_usage] x [total threads]

drallan · ‎12-16-2012

< 4x pipelining > That's how I think of it. One CU has 64 ALUs in 4x16 vectors. Each ALU has a 4-clock latency with 4 time slots for instruction issue, so the CU needs 4 waves to issue into each time slot to run at full speed. Thus 4 waves can run with 256 vgprs each (the 256K bytes). I guess these are 'parallel' but do not overlap and cannot dual issue. With 8 or more waves there are overlapping waves that can dual issue, but the vgprs are cut down to 128/wave, etc.
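
The register arithmetic behind this can be sketched as follows. The 256 KB per CU, 4 SIMDs, 64 threads and 4 bytes per register come from the posts above; the even spread of waves across SIMDs and the allocation granularity of 4 registers are assumptions made for this sketch, and the names are only illustrative.

```python
VGPR_BYTES_PER_CU = 256 * 1024   # the 256 KB discussed above
SIMDS_PER_CU = 4                 # 4x16 ALUs, as described above
LANES_PER_WAVE = 64
BYTES_PER_REGISTER = 4
MAX_VGPRS_PER_WAVE = 256         # per-wave limit mentioned in the thread
ALLOC_GRANULARITY = 4            # assumption: VGPRs are handed out in blocks of 4


def max_vgprs_per_wave(waves_per_cu):
    """Largest per-wave VGPR count that keeps `waves_per_cu` waves resident.

    Assumes waves are spread evenly over the SIMDs of the CU.
    """
    waves_per_simd = -(-waves_per_cu // SIMDS_PER_CU)  # ceiling division
    bytes_per_simd = VGPR_BYTES_PER_CU // SIMDS_PER_CU
    regs_per_simd = bytes_per_simd // (LANES_PER_WAVE * BYTES_PER_REGISTER)
    budget = regs_per_simd // waves_per_simd
    budget -= budget % ALLOC_GRANULARITY  # round down to the allocation block
    return min(budget, MAX_VGPRS_PER_WAVE)


for waves in (4, 8, 12, 16):
    print(waves, "waves/CU ->", max_vgprs_per_wave(waves), "VGPRs per wave")
```

Under these assumptions the results line up with the 64, 84, 128 and 256 register steps mentioned earlier in the thread.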

realhet · ‎12-15-2012

Does your inner loop contain any s_ instructions (except loop management)?

What's the NumVGPRS?

For s_memtime let me show you my assembler -> http://devgurus.amd.com/message/1285450

There's a GCN_CAL_latency_test.hpas example; you can put your "v_xor, v_not, v_or, v_bfi" inner loop into that and see what happens.

low valu utilization without memory accesses