Question: The doc (OpenCL Programming Guide, rev 1.03) says:
"Each stream processor can generate up to two 4-byte LDS requests per cycle."
How do I actually achieve this rate for reads in IL? There are LDS_LOAD (one DWORD) and LDS_LOAD_VEC (four DWORDs), but both appear to be inefficient on 5870 chips.