Question: the doc (OpenCL Programming Guide rev 1.03) says
"Each stream processor can generate up to two 4-byte LDS requests per cycle."
How do I actually achieve this for reads in IL? There are LDS_LOAD (one DWORD) and LDS_LOAD_VEC (four DWORDs). Both of them appear to be inefficient on 5870 chips.
In IL if you issue two distinct instructions:
lds_load_id(0) r1.x, r0.x
lds_load_id(0) r1.y, r0.y
then there's a good chance that this will compile to a single coalesced read of 2 addresses, but that is not guaranteed.
lds_load_vec should always result in 2 reads, each covering 2 of the 4 DWORDs.
So it's normally fairly easy to get pairs of reads, but if you don't use lds_load_vec there's a chance that some reads will not be paired. It's a matter of luck.
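To put rough numbers on the pairing point above, here is a minimal sketch of the arithmetic. It is not a model of the real hardware or compiler: it only assumes, per the quoted guide, that one LDS read can carry up to two 4-byte requests, and the function name lds_read_slots is made up for illustration.

```python
import math

# Assumption taken from the quoted guide: up to two 4-byte LDS requests per cycle.
REQUESTS_PER_READ = 2


def lds_read_slots(num_dwords, paired):
    """Count LDS reads needed for num_dwords under this simplified model.

    paired=True  -> the compiler coalesced the loads two at a time.
    paired=False -> every DWORD went out as its own read.
    """
    if paired:
        return math.ceil(num_dwords / REQUESTS_PER_READ)
    return num_dwords


if __name__ == "__main__":
    # Four DWORDs, as with one lds_load_vec versus four unpaired lds_load.
    print(lds_read_slots(4, paired=True))
    print(lds_read_slots(4, paired=False))
```

Under these assumptions, four DWORDs cost 2 reads when paired and 4 when not, which is why unpaired lds_load instructions halve the effective read rate.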
To get 2 writes to happen in parallel you have to use something like:
lds_store_vec_id(0) mem.xyzw, r0.x, r0.0, r1
Note that r1 is the value being stored, and this will produce a pair of instructions, each of which writes 2 values to LDS.
r0.x is the address being written to, and r0.0 is a simple way of specifying an offset of zero from that address. The offset is only used if you declare LDS as structured, so for this example it just needs to be 0.