realhet
Miniboss

Small temporary arrays in OpenCL

Hi,

Does OpenCL take advantage of the following techniques when using small local arrays?

- On VLIW -> indexed_temp_arrays (x0) (aka. R55[A0.x] indirect register addressing in ISA)

- On GCN -> v_movrel_b32 instruction

Or if OpenCL always uses LDS memory for local arrays, is there an extension to enable those faster techniques?
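
To make it concrete, something like this toy kernel is what I mean (the names and the size 16 are just placeholders):

  // Small per-work-item array, indexed with a value not known at compile time.
  // Does this become indexed temps / movrel, or LDS, or scratch?
  __kernel void small_private_array(__global const uint *in, __global uint *out)
  {
      uint gid = get_global_id(0);
      uint tmp[16];                      // small "local" (private) array

      for (uint i = 0; i < 16; ++i)      // fill it
          tmp[i] = in[gid * 16 + i];

      uint idx = in[gid] & 15u;          // runtime index
      out[gid] = tmp[idx];
  }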

Thanks in advance.

14 Replies
binying
Challenger

Re: Small temporary arrays in OpenCL

To find out the answer, I think you can write a simple kernel that uses a small local array, compile it with the Kernel Analyzer, and then check the result in the output window...

realhet
Miniboss

Re: Small temporary arrays in OpenCL

I'm not that lazy, but all I have right now is an HD4850, and OpenCL on it is terribly beta-ish.

So now I need 160 dwords of this kind of fast 'memory' for my project (I'm implementing it with amd_il + indexed_temp_array), and I just wonder whether OpenCL can do it.

I know that for GCN I'll have to use some hybrid LDS + register-array scheme to stay inside the 128-VGPR limit. But on VLIW this register-array thing is just awesome (the limit there is 128 x 4-dword registers).

hazeman
Adept II

Re: Small temporary arrays in OpenCL

I've tested this feature on a 58xx card. The OpenCL compiler generates indexed_temp_array in the IL.

The problem is what happens in the IL compiler. Almost randomly (it depends slightly on the size of the array and on whether you use parts (.x, .y, ...) of the indexed 4-vector), it either uses the A indexing register or implements the array in scratch memory (painfully slow).

Unfortunately, in my kernel I couldn't trick the IL compiler into reliably using A-register indexing, and I had to change the kernel design so I wouldn't use this feature.

realhet
Miniboss

Re: Small temporary arrays in OpenCL

Hi,

That's cool that OpenCL can use the A0 index.

I've played with it a little and found that on the HD4850 it always uses A0 indexing when the total NumGPRs <= 118. If NumGPRs would exceed 118 with the array included, it uses scratch instead. And I only used the .x part; I think it doesn't check which parts we use, since it only addresses 128-bit array elements. Maybe your kernel is around that NumGPRs limit.

That's approximately 400 instantly accessible dwords...

But on GCN I think we can address only about 100 dwords (depending on other VGPR usage) without running out of the 128 VGPRs. I need 160 dwords total, and that fits into neither LDS (it would be 41KB for a wavefront) nor VGPRs. I'm afraid I'll have to mix the two if I want to avoid slow memory access.
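
Roughly what I mean by mixing the two, as a sketch (the 96/64 split and the 64-work-item group size are placeholders I picked, nothing fixed):

  #define PRIV_DWORDS 96
  #define LDS_DWORDS  64                        // 96 + 64 = 160 dwords total

  __kernel void hybrid_array(__global const uint *in, __global uint *out)
  {
      uint lid = get_local_id(0);               // assumes a 64-work-item group
      uint gid = get_global_id(0);

      uint priv[PRIV_DWORDS];                   // hopefully ends up in VGPRs
      __local uint lds[LDS_DWORDS * 64];        // 64 * 64 dwords = 16 KB of LDS

      for (uint i = 0; i < PRIV_DWORDS; ++i)    // toy init so nothing gets optimized away
          priv[i] = in[gid] + i;
      for (uint i = 0; i < LDS_DWORDS; ++i)     // each work-item owns one LDS "column",
          lds[i * 64 + lid] = in[gid] ^ i;      // so no barrier is needed for this pattern

      uint idx = in[gid] % 160u;                // runtime index into the virtual 160-dword array
      out[gid] = (idx < PRIV_DWORDS)
                   ? priv[idx]                              // low part: registers
                   : lds[(idx - PRIV_DWORDS) * 64 + lid];   // high part: LDS
  }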

drallan
Challenger

Re: Small temporary arrays in OpenCL (accepted solution)

GCN can use up to 256 VGPRs per thread while still running 4 waves per CU (one per SIMD).

The maximum VGPR array size then depends on what other VGPRs the compiler needs to use.

In one case, I saw the compiler waste 74 VGPRs on temporaries by loading blocks of data before writing them to the array.

Even in that case, int array[160] was not a problem.

But beware the devil. When the array indices are not known at compile time, both GCN and VLIW access the registers serially, in the worst case one thread at a time.

GCN scans the wavefront's lanes looking for an index, then reads/writes all threads that have the same index in parallel using v_movreld/v_movrels, and repeats until all threads are processed. The worst case is all 64 indices in a wave being different = 64 read/write loops (yes, with branching too). The best case is all indices being the same, needing only one read/write (actually, that's pretty cool). Although VLIW uses the A0 register, it does something similar to serially access different indices.

LDS might be faster, but there's not enough of it.
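
To make the access pattern concrete, here is a plain-C model of that waterfall loop (purely illustrative, not the generated ISA; the array size is only an example):

  /* Per-wavefront "waterfall" read: all lanes that share the same runtime
     index are served in one pass; every distinct index costs another pass. */
  #include <stdint.h>

  #define WAVE 64
  #define ARRAY_DWORDS 160

  void waterfall_read(const int idx[WAVE],                /* per-lane runtime index  */
                      const int regs[WAVE][ARRAY_DWORDS], /* per-lane register array */
                      int out[WAVE])
  {
      uint64_t pending = ~0ull;                  /* execution mask: lanes not yet served */
      while (pending) {
          int lane = 0;                          /* pick the first still-pending lane    */
          while (!((pending >> lane) & 1)) lane++;
          int current = idx[lane];               /* broadcast its index (readfirstlane)  */
          for (int l = 0; l < WAVE; ++l) {       /* one movrel pass for that index       */
              if (((pending >> l) & 1) && idx[l] == current) {
                  out[l] = regs[l][current];
                  pending &= ~(1ull << l);
              }
          }
      }
  }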

realhet
Miniboss

Re: Small temporary arrays in OpenCL

(Unfortunately I can't accept two answers, even though each of them answered part of my question (VLIW & GCN).)

"But beware the devil."

Where the h3ll did I get the idea that v_movreld and A0 can access all 64 lanes of register memory individually in ONE cycle?! That would be so many wires and transistors just for these rare instructions. Thanks for opening my eyes, haha!

Btw, my case would of course be the one where every lane accesses different registers.

"GCN can use up to 256 VGPRs per thread while still running 4 waves per CU (one per SIMD)."

That's only true when the instruction stream is not too dense. Please take a look at these charts:

http://x.pgy.hu/~worm/het/7970_isa_test/7970_SV_timings_4-12dwords.png

http://x.pgy.hu/~worm/het/7970_isa_test/7970_SV_timings_8-16dwords.png

If your GCN code [uses a few S instructions and there are also some big 64-bit instructions] AND [you're using more than 128 registers], your kernel can end up running twice as slow as its estimated ideal performance. That's why I try to avoid 128+ registers (right now I can't) and aim for under 84, or even 64.
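
For reference, the register/occupancy arithmetic behind those targets, assuming a 256-entry VGPR file per SIMD, allocation in blocks of 4 VGPRs, and a 10-wave cap (a back-of-the-envelope sketch, nothing more):

  #include <stdio.h>

  int waves_per_simd(int vgprs_per_thread)
  {
      int granule = (vgprs_per_thread + 3) / 4 * 4;  /* VGPRs allocated in blocks of 4 */
      int waves = 256 / granule;                     /* 256-entry VGPR file per SIMD   */
      return waves > 10 ? 10 : waves;                /* hardware cap of 10 waves/SIMD  */
  }

  int main(void)
  {
      int counts[] = { 64, 84, 128, 160, 256 };
      for (int i = 0; i < 5; ++i)
          printf("%3d VGPRs -> %d wave(s) per SIMD\n",
                 counts[i], waves_per_simd(counts[i]));
      return 0;
  }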

drallan
Challenger

Re: Small temporary arrays in OpenCL

"But beware the devil."

From where the h3ll I get that v_movreld and A0 can access all the 64 lanes of register memory Individually in ONE CYCLE?! That would be so many wires and transistors just for these rare instructions. Thx for opening my eyes, haha!

I know your eyes are wide open but some might not realize how the compiler impliments C in a GPU environment.

I was a bit surprised when I first saw it.

I look forward to your solution, it's a tough problem!

realhet
Miniboss

Re: Small temporary arrays in OpenCL

My actual struggle in a picture ->

7 clocks instead of 1. This seemed like an easy 2-3x boost to my program, but ouch.

And it's not just the 4xxx; I've noticed it on the 6xxx too.

With A0 the exact same thing is around 10% slower:

  ushr r999.x, dwIdx, 2                            ; element index = dwIdx / 4
  iand r999.z, dwIdx, 1                            ; bit 0 of dwIdx
  iand r999.w, dwIdx, 2                            ; bit 1 of dwIdx
  mov  r998, x0[r999.x]                            ; read the 128-bit element (A0-indexed)
  cmov_logical r999.xy, r999.ww, r998.zw, r998.xy  ; bit 1 picks the .zw or .xy pair
  cmov_logical res, r999.z, r999.y, r999.x         ; bit 0 picks the odd or even dword
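
In OpenCL C terms that snippet is doing roughly this (just a restatement for readability; read_dword, x0 and dwIdx are only illustrative names):

  // Pick dword 'dwIdx' out of an array of 128-bit (uint4) elements.
  uint read_dword(const uint4 *x0, uint dwIdx)
  {
      uint4 elem = x0[dwIdx >> 2];                   // mov  r998, x0[r999.x]
      uint2 pair = (dwIdx & 2u) ? elem.zw : elem.xy; // bit 1 picks the .zw or .xy pair
      return (dwIdx & 1u) ? pair.y : pair.x;         // bit 0 picks the odd or even dword
  }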

Another discovery: when I compared the above x0[] dword access using an index that is uniform across the wavefront against cb0 access done the same way, cb0 was faster. (It used a VTEX clause, but was still slightly faster than A0.)
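
From the OpenCL side, that uniform-index case corresponds to reading through a constant buffer, something like this (a sketch; the kernel name and arguments are placeholders, and the cb mapping is what I'd expect, not guaranteed):

  __kernel void uniform_lookup(__constant uint *table,  // typically a cb read on VLIW
                               __global uint *out,
                               uint uniformIdx)          // same value for the whole wavefront
  {
      out[get_global_id(0)] = table[uniformIdx];
  }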

realhet
Miniboss

Re: Small temporary arrays in OpenCL

Finally I had the chance to do some experiments on a 7970:

- v_movrels_b32 does nothing with the contents of the source operand; it only uses its register index, so all lanes read from the same register. Maybe A0 indexing can access a different register per lane, but now I'm sure that movrel can't.

- ds_readx2 is pretty effective (with different addresses for all lanes)! I interleaved it with 10-12 vector instructions and all the latency was hidden. (Make sure to set up the M0 register before using DS_ instructions! I wasted like an hour on that, lol.)

- The amd_il compiler can't deal with indexed arrays effectively: it always copies the contents of the indexed array around with unoptimized movs before using them. (x0[const1] += x0[const2] takes 3 movs and an add.)
