Archives Discussions

realhet · ‎12-07-2012

Hi,

Does OpenCL take advantage of the following techniques when using small local arrays?

- On VLIW -> indexed_temp_arrays (x0) (aka. R55[A0.x] indirect register addressing in ISA)

- On GCN -> v_movrel_b32 instruction

Or if OpenCL always uses LDS memory for local arrays, is there an extension to enable those faster techniques?

Thanks in advance.

drallan · ‎12-08-2012

On GCN a thread can use up to 256 VGPRs and the CU can still hold 4 waves (one per SIMD), so every SIMD stays occupied.

The maximum VGPR array size then depends on what other VGPRs the compiler needs to use.

In one case, I saw the compiler wasted 74 vgprs as temporaries by loading blocks of data before writing to the array.

Even here, int array[160] was not a problem.
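To put numbers on that case, here is a rough C sketch of the register budget. It is only arithmetic, not anything the compiler exposes. The function names are made up for illustration, the 10 waves per SIMD cap is an assumption not stated in this thread, and the sketch ignores VGPR allocation granularity and every other occupancy limit (SGPRs, LDS, work-group size).

```c
/* Rough VGPR budget arithmetic for one GCN thread.
 * Assumptions (not from the thread): a SIMD holds at most 10 waves,
 * and allocation granularity and other occupancy limits are ignored. */
#define GCN_MAX_VGPRS_PER_THREAD 256 /* limit quoted in the post above */
#define GCN_SIMDS_PER_CU         4
#define GCN_MAX_WAVES_PER_SIMD   10  /* assumed hardware cap */

/* VGPRs left over after a private array of 32-bit elements and the
 * other registers the compiler needs. Negative means it does not fit. */
int vgprs_left(int array_dwords, int other_vgprs)
{
    return GCN_MAX_VGPRS_PER_THREAD - array_dwords - other_vgprs;
}

/* Waves one SIMD can hold for a given per-thread VGPR count.
 * Returns 0 if the count is outside 1..256. */
int waves_per_simd(int vgprs_per_thread)
{
    if (vgprs_per_thread <= 0 || vgprs_per_thread > GCN_MAX_VGPRS_PER_THREAD)
        return 0;
    int waves = GCN_MAX_VGPRS_PER_THREAD / vgprs_per_thread;
    return waves > GCN_MAX_WAVES_PER_SIMD ? GCN_MAX_WAVES_PER_SIMD : waves;
}

/* Waves per CU is simply the per-SIMD figure times the SIMD count. */
int waves_per_cu(int vgprs_per_thread)
{
    return GCN_SIMDS_PER_CU * waves_per_simd(vgprs_per_thread);
}
```

With int array[160] plus 74 temporaries, vgprs_left(160, 74) gives 22 spare registers, and waves_per_cu(234) gives 4, i.e. one wave per SIMD.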

But beware the devil. When the array indices are not known at compile time, both GCN and VLIW access the registers serially, in the worst case one thread at a time.

GCN scans the lanes for a thread that is still waiting and takes its index. It then reads/writes all threads with the same index in parallel using v_movrels/v_movreld, and repeats until all threads are processed. Worst case, all 64 indices in a wave are different, which means 64 read/write loops (yes, with branching too). Best case, all indices are the same and there is only one read/write (actually, that's pretty cool). Although VLIW uses the A0 register, it also does something similar to access different indices serially.
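The loop described above can be modelled on the CPU to see how many passes a given index pattern costs. This is a hedged C simulation, not the code the compiler emits: the function name waterfall_passes and the bitmask bookkeeping are my own illustration, and the real sequence works on the hardware execution mask.

```c
#include <stdint.h>

#define WAVE_SIZE 64 /* lanes (threads) per GCN wavefront */

/* Simulates the serialized register-array access of one wavefront.
 * idx[lane] is the array index requested by each lane.
 * Returns the number of read/write passes the wave needs:
 * one pass per distinct index. */
int waterfall_passes(const int idx[WAVE_SIZE])
{
    uint64_t pending = ~(uint64_t)0; /* one bit per lane, all waiting */
    int passes = 0;

    while (pending != 0) {
        /* Find the first lane that is still waiting and take its index. */
        int lane = 0;
        while (((pending >> lane) & 1u) == 0)
            lane++;
        int current = idx[lane];

        /* Serve every waiting lane that wants the same index; on the
         * GPU this is one relative move executed for those lanes. */
        for (int l = 0; l < WAVE_SIZE; l++) {
            if (((pending >> l) & 1u) && idx[l] == current)
                pending &= ~((uint64_t)1 << l);
        }
        passes++;
    }
    return passes;
}
```

Filling idx with a single value yields 1 pass, filling it with 0..63 yields 64, and anything in between costs one pass per distinct index in the wave.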

LDS might be faster, but there's not enough of it.


Small temporary arrays in OpenCL