I've recently ported the WarpStandard RNG for GPUs to IL:
http://www.doc.ic.ac.uk/~dt10/research/rngs-gpu-uniform.html
Summary: the RNG shares state across the whole wavefront via LDS, which makes it much faster than a Mersenne Twister (MT).
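To make the access pattern concrete, here is a rough CPU-side sketch in Python of the general idea: each lane keeps one word of state, writes it to the shared array, then reads two other lanes' words and combines them. This is not the actual WarpStandard recurrence. The tap offsets, the shift amounts, and the names `step`, `TAP_A` and `TAP_B` are made up for illustration, and the 64-lane wavefront size is an assumption.

```python
WAVEFRONT = 64          # lanes per wavefront (assumed)
MASK = 0xFFFFFFFF       # keep every word to 32 bits
TAP_A, TAP_B = 1, 7     # hypothetical tap offsets, not WarpStandard's

def step(state):
    """Advance an emulated wavefront by one step.

    state holds one 32-bit word per lane. The returned list is both
    the new state and the batch of output samples (one per lane).
    """
    n = len(state)
    lds = list(state)                 # one shared-memory write per lane
    new_state = []
    for lane in range(n):
        a = lds[(lane + TAP_A) % n]   # first shared-memory read
        b = lds[(lane + TAP_B) % n]   # second shared-memory read
        # Hypothetical mixing function: XOR with two shifted copies.
        new_state.append((a ^ (b << 3) ^ (b >> 5)) & MASK)
    return new_state

if __name__ == "__main__":
    state = list(range(1, WAVEFRONT + 1))   # toy seed, one word per lane
    for _ in range(3):
        state = step(state)
    print([hex(w) for w in state[:4]])
```

The point is only that every output word costs one shared write and two shared reads per lane, so throughput depends directly on how fast the LDS is.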
What I'm noticing is that on a 4870, I have rather worse performance than the CUDA version running on a gtx260, and it seems to be due to slow LDS.
My 5770 is more than twice as fast as my 4870.
Here are the #s:
CUDA gtx260: 35 Gsamples/s
IL version on 5770: 21 Gsamples/s for 1280 threads or 7680 threads
IL version on 4870: 7.5 Gsamples/s for 7680 threads and 2.1 Gsamples/s for 1280 threads.
Any ideas why 7680 threads are so much faster on the 4870, whereas it makes no difference on the 5770?
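For reference, the ratios behind that question can be worked out from the numbers above with a few lines of Python. The throughput figures are the ones quoted in this post. The wavefront size of 64 threads is an assumption, and the names `rates`, `speedup` and `wavefronts` are just for this sketch.

```python
WAVEFRONT_SIZE = 64  # threads per wavefront (assumed for both cards)

# Gsamples/s as quoted above, keyed by (device, thread count).
rates = {
    ("gtx260", None): 35.0,
    ("5770", 1280): 21.0,
    ("5770", 7680): 21.0,
    ("4870", 1280): 2.1,
    ("4870", 7680): 7.5,
}

def speedup(fast, slow):
    """Throughput ratio between two entries of `rates`."""
    return rates[fast] / rates[slow]

def wavefronts(threads):
    """Number of whole wavefronts launched for a given thread count."""
    return threads // WAVEFRONT_SIZE

if __name__ == "__main__":
    print("5770 vs 4870 at 7680 threads:", round(speedup(("5770", 7680), ("4870", 7680)), 2))
    print("4870, 7680 vs 1280 threads:", round(speedup(("4870", 7680), ("4870", 1280)), 2))
    print("wavefronts at 1280 / 7680 threads:", wavefronts(1280), wavefronts(7680))
```

So on the 4870, six times as many wavefronts in flight buys roughly a 3.6x gain, while the 5770 gains nothing from the extra wavefronts.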
In the attached disassembly, a lot has clearly changed in how LDS is handled between the two platforms ...
4870 LDS operations:

12 TEX: ADDR(1772) CNT(1)
      19  LOCAL_DS_WRITE  (0)  R2, STRIDE(4) SIMD_REL
13 TEX: ADDR(1774) CNT(1)
      20  LOCAL_DS_READ  R2, R6.xy WATERFALL
14 TEX: ADDR(1776) CNT(1)
      21  LOCAL_DS_READ  R1, R7.xy WATERFALL

5770 LDS operations:

29  x: LDS_WRITE      ____,  R4.x,  PV28.x
30  x: LDS_WRITE      ____,  R4.y,  T0.w
31  x: LDS_WRITE      ____,  R4.z,  T0.z
32  x: LDS_WRITE      ____,  R2.w,  T0.y
33  x: LDS_READ2_RET  QAB,  R2.x,  R3.y
34  x: LDS_READ2_RET  QAB,  R2.z,  R3.w
35  x: LDS_READ2_RET  QAB,  R3.x,  R6.y
36  x: LDS_READ2_RET  QAB,  R3.z,  R4.w