realhet
Miniboss

Small temporary arrays in OpenCL

Hi,

Does OpenCL take advantage of the following techniques when using small local arrays?

- On VLIW -> indexed_temp_arrays (x0) (aka. R55[A0.x] indirect register addressing in ISA)

- On GCN -> v_movrel_b32 instruction

Or if OpenCL always uses LDS memory for local arrays, is there an extension to enable those faster techniques?
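
To make it concrete, something like this toy kernel is what I mean (the names and the size 16 are just placeholders):

  // Small per-work-item array, indexed with a value not known at compile time.
  // Does this become indexed temps / movrel, or LDS, or scratch?
  __kernel void small_private_array(__global const uint *in, __global uint *out)
  {
      uint gid = get_global_id(0);
      uint tmp[16];                      // small "local" (private) array

      for (uint i = 0; i < 16; ++i)      // fill it
          tmp[i] = in[gid * 16 + i];

      uint idx = in[gid] & 15u;          // runtime index
      out[gid] = tmp[idx];
  }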

Thanks in advance.

14 Replies
binying
Challenger

Re: Small temporary arrays in OpenCL

To find out the answer, I think you can write a simple kernel that uses a small local array, compile it with the Kernel Analyzer, and then check the result in the output window...

realhet
Miniboss

Re: Small temporary arrays in OpenCL

I'm not that lazy, but all I have right now is an HD4850, and OpenCL on it is terribly beta-ish.

So now I need 160 dwords of this kind of fast 'memory' for my project (I'm implementing it with amd_il + indexed_temp_array), and I just wonder whether OpenCL can do it.

I know that for GCN I'll have to use some hybrid LDS + register-array scheme to stay inside the 128-VGPR limit. But on VLIW this register-array thing is just awesome (the limit there is 128 x 4-dword registers).

hazeman
Adept II

Re: Small temporary arrays in OpenCL

I've tested this feature on a 58xx card. The OpenCL compiler generates indexed_temp_array in the IL.

The problem is what happens in the IL compiler. Almost randomly (it depends slightly on the size of the array and on whether you use parts (.x, .y, ...) of the indexed 4-vector), it either uses the A indexing register or implements the array in scratch memory (painfully slow).

Unfortunately, in my kernel I couldn't trick the IL compiler into reliably using A-register indexing, and I had to change the kernel design so I wouldn't use this feature.

realhet
Miniboss

Re: Small temporary arrays in OpenCL

Hi,

That's cool that OpenCL can use the A0 index.

I've played with it a little and found that on the HD4850 it always uses A0 indexing when the total NumGPRs <= 118. If NumGPRs would exceed 118 with the array included, it uses scratch instead. And I only used the .x part; I think it doesn't check which parts we use, since it only addresses 128-bit array elements. Maybe your kernel is around that NumGPRs limit.

That's approximately 400 instantly accessible dwords...

But on GCN I think we can address only about 100 dwords (depending on other VGPR usage) without running out of the 128 VGPRs. I need 160 dwords total, and that fits into neither LDS (it would be 41KB for a wavefront) nor VGPRs. I'm afraid I'll have to mix the two if I want to avoid slow memory access.
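
Roughly what I mean by mixing the two, as a sketch (the 96/64 split and the 64-work-item group size are placeholders I picked, nothing fixed):

  #define PRIV_DWORDS 96
  #define LDS_DWORDS  64                        // 96 + 64 = 160 dwords total

  __kernel void hybrid_array(__global const uint *in, __global uint *out)
  {
      uint lid = get_local_id(0);               // assumes a 64-work-item group
      uint gid = get_global_id(0);

      uint priv[PRIV_DWORDS];                   // hopefully ends up in VGPRs
      __local uint lds[LDS_DWORDS * 64];        // 64 * 64 dwords = 16 KB of LDS

      for (uint i = 0; i < PRIV_DWORDS; ++i)    // toy init so nothing gets optimized away
          priv[i] = in[gid] + i;
      for (uint i = 0; i < LDS_DWORDS; ++i)     // each work-item owns one LDS "column",
          lds[i * 64 + lid] = in[gid] ^ i;      // so no barrier is needed for this pattern

      uint idx = in[gid] % 160u;                // runtime index into the virtual 160-dword array
      out[gid] = (idx < PRIV_DWORDS)
                   ? priv[idx]                              // low part: registers
                   : lds[(idx - PRIV_DWORDS) * 64 + lid];   // high part: LDS
  }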

drallan
Challenger

Re: Small temporary arrays in OpenCL (accepted solution)

GCN can use up to 256 VGPRs per thread while still running 4 waves per CU (one per SIMD).

The maximum VGPR array size then depends on what other VGPRs the compiler needs to use.

In one case, I saw the compiler waste 74 VGPRs on temporaries by loading blocks of data before writing them to the array.

Even in that case, int array[160] was not a problem.

But beware the devil. When the array indices are not known at compile time, both GCN and VLIW access the registers serially, in the worst case one thread at a time.

GCN scans the wavefront's lanes looking for an index, then reads/writes all threads that have the same index in parallel using v_movreld/v_movrels, and repeats until all threads are processed. The worst case is all 64 indices in a wave being different = 64 read/write loops (yes, with branching too). The best case is all indices being the same, needing only one read/write (actually, that's pretty cool). Although VLIW uses the A0 register, it does something similar to serially access different indices.

LDS might be faster, but there's not enough of it.
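
To make the access pattern concrete, here is a plain-C model of that waterfall loop (purely illustrative, not the generated ISA; the array size is only an example):

  /* Per-wavefront "waterfall" read: all lanes that share the same runtime
     index are served in one pass; every distinct index costs another pass. */
  #include <stdint.h>

  #define WAVE 64
  #define ARRAY_DWORDS 160

  void waterfall_read(const int idx[WAVE],                /* per-lane runtime index  */
                      const int regs[WAVE][ARRAY_DWORDS], /* per-lane register array */
                      int out[WAVE])
  {
      uint64_t pending = ~0ull;                  /* execution mask: lanes not yet served */
      while (pending) {
          int lane = 0;                          /* pick the first still-pending lane    */
          while (!((pending >> lane) & 1)) lane++;
          int current = idx[lane];               /* broadcast its index (readfirstlane)  */
          for (int l = 0; l < WAVE; ++l) {       /* one movrel pass for that index       */
              if (((pending >> l) & 1) && idx[l] == current) {
                  out[l] = regs[l][current];
                  pending &= ~(1ull << l);
              }
          }
      }
  }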

realhet
Miniboss

Re: Small temporary arrays in OpenCL

(Unfortunately I can't accept two answers, even though each of them answered part of my question (VLIW & GCN).)

"But beware the devil."

Where the h3ll did I get the idea that v_movreld and A0 can access all 64 lanes of register memory individually in ONE cycle?! That would be so many wires and transistors just for these rare instructions. Thanks for opening my eyes, haha!

Btw, my case would of course be the one where every lane accesses different registers.

"GCN can use up to 256 VGPRs per thread while still running 4 waves per CU (one per SIMD)."

That's only true when the instruction stream is not too dense. Please take a look at these charts:

http://x.pgy.hu/~worm/het/7970_isa_test/7970_SV_timings_4-12dwords.png

http://x.pgy.hu/~worm/het/7970_isa_test/7970_SV_timings_8-16dwords.png

If your GCN code [uses a few S instructions and there are also some big 64-bit instructions] AND [you're using more than 128 registers], your kernel can end up running twice as slow as its estimated ideal performance. That's why I try to avoid 128+ registers (right now I can't) and aim for under 84, or even 64.
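
For reference, the register/occupancy arithmetic behind those targets, assuming a 256-entry VGPR file per SIMD, allocation in blocks of 4 VGPRs, and a 10-wave cap (a back-of-the-envelope sketch, nothing more):

  #include <stdio.h>

  int waves_per_simd(int vgprs_per_thread)
  {
      int granule = (vgprs_per_thread + 3) / 4 * 4;  /* VGPRs allocated in blocks of 4 */
      int waves = 256 / granule;                     /* 256-entry VGPR file per SIMD   */
      return waves > 10 ? 10 : waves;                /* hardware cap of 10 waves/SIMD  */
  }

  int main(void)
  {
      int counts[] = { 64, 84, 128, 160, 256 };
      for (int i = 0; i < 5; ++i)
          printf("%3d VGPRs -> %d wave(s) per SIMD\n",
                 counts[i], waves_per_simd(counts[i]));
      return 0;
  }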

drallan
Challenger

Re: Small temporary arrays in OpenCL

"But beware the devil."

From where the h3ll I get that v_movreld and A0 can access all the 64 lanes of register memory Individually in ONE CYCLE?! That would be so many wires and transistors just for these rare instructions. Thx for opening my eyes, haha!

I know your eyes are wide open but some might not realize how the compiler impliments C in a GPU environment.

I was a bit surprised when I first saw it.

I look forward to your solution, it's a tough problem!

realhet
Miniboss

Re: Small temporary arrays in OpenCL

My actual struggle in a picture ->

7 clocks instead of 1. This seemed like an easy 2-3x boost to my program, but ouch.

And it's not just the 4xxx; I've noticed it on the 6xxx too.

With A0 the exact same thing is around 10% slower:

  ushr r999.x, dwIdx, 2                            ; element index = dwIdx / 4
  iand r999.z, dwIdx, 1                            ; bit 0 of dwIdx
  iand r999.w, dwIdx, 2                            ; bit 1 of dwIdx
  mov  r998, x0[r999.x]                            ; read the 128-bit element (A0-indexed)
  cmov_logical r999.xy, r999.ww, r998.zw, r998.xy  ; bit 1 picks the .zw or .xy pair
  cmov_logical res, r999.z, r999.y, r999.x         ; bit 0 picks the odd or even dword
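
In OpenCL C terms that snippet is doing roughly this (just a restatement for readability; read_dword, x0 and dwIdx are only illustrative names):

  // Pick dword 'dwIdx' out of an array of 128-bit (uint4) elements.
  uint read_dword(const uint4 *x0, uint dwIdx)
  {
      uint4 elem = x0[dwIdx >> 2];                   // mov  r998, x0[r999.x]
      uint2 pair = (dwIdx & 2u) ? elem.zw : elem.xy; // bit 1 picks the .zw or .xy pair
      return (dwIdx & 1u) ? pair.y : pair.x;         // bit 0 picks the odd or even dword
  }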

Another discovery: when I compared the above x0[] dword access using an index that is uniform across the wavefront against cb0 access done the same way, cb0 was faster. (It used a VTEX clause, but was still slightly faster than A0.)
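
From the OpenCL side, that uniform-index case corresponds to reading through a constant buffer, something like this (a sketch; the kernel name and arguments are placeholders, and the cb mapping is what I'd expect, not guaranteed):

  __kernel void uniform_lookup(__constant uint *table,  // typically a cb read on VLIW
                               __global uint *out,
                               uint uniformIdx)          // same value for the whole wavefront
  {
      out[get_global_id(0)] = table[uniformIdx];
  }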

realhet
Miniboss

Re: Small temporary arrays in OpenCL

Finally I had the chance to do some experiments on a 7970:

- v_movrels_b32 does nothing with the contents of the source operand; it only uses its register index, so all lanes read from the same register. Maybe A0 indexing can access a different register per lane, but now I'm sure that movrel can't.

- ds_readx2 is pretty effective (with different addresses for all lanes)! I interleaved it with 10-12 vector instructions and all the latency was hidden. (Make sure to set up the M0 register before using DS_ instructions! I wasted like an hour on that, lol.)

- The amd_il compiler can't deal with indexed arrays effectively: it always copies the contents of the indexed array around with unoptimized movs before using them. (x0[const1] += x0[const2] takes 3 movs and an add.)
