I've recently ported the WarpStandard RNG for GPUs to IL:
http://www.doc.ic.ac.uk/~dt10/research/rngs-gpu-uniform.html
Summary: the RNG shares state across the whole wavefront via LDS, which makes it much faster than an MT.
What I'm noticing is that on a 4870, I have rather worse performance than the CUDA version running on a gtx260, and it seems to be due to slow LDS.
My 5770 is more than twice as fast as my 4870.
Here are the numbers:
CUDA gtx260: 35 Gsamples/s
IL version on 5770: 21 Gsamples/s for 1280 threads or 7680 threads
IL version on 4870: 7.5 Gsamples/s for 7680 threads and 2.1 Gsamples/s for 1280 threads.
Any ideas why 7680 threads are so much faster on the 4870, whereas it makes no difference on the 5770?
In the attached disassembly, clearly a lot has changed in how LDS is accessed between the two platforms ...
4870 LDS operations:
12 TEX: ADDR(1772) CNT(1)
     19 LOCAL_DS_WRITE (0) R2, STRIDE(4) SIMD_REL
13 TEX: ADDR(1774) CNT(1)
     20 LOCAL_DS_READ R2, R6.xy WATERFALL
14 TEX: ADDR(1776) CNT(1)
     21 LOCAL_DS_READ R1, R7.xy WATERFALL

5770 LDS operations:
29 x: LDS_WRITE ____, R4.x, PV28.x
30 x: LDS_WRITE ____, R4.y, T0.w
31 x: LDS_WRITE ____, R4.z, T0.z
32 x: LDS_WRITE ____, R2.w, T0.y
33 x: LDS_READ2_RET QAB, R2.x, R3.y
34 x: LDS_READ2_RET QAB, R2.z, R3.w
35 x: LDS_READ2_RET QAB, R3.x, R6.y
36 x: LDS_READ2_RET QAB, R3.z, R4.w
I only need 128 bits per thread in a wavefront ... might it be possible to use shared temp registers instead of LDS? Perhaps they are faster. I guess I would need to index them as an array using dclarray. Is this even possible for shared registers? Something like:
dcl_shared_temp sr64
dclarray sr0,sr64
mov a0.x, r1.x
mov r5, sr[a0.x]
...
mov a0.x, r2.x
mov sr[a0.x], r4
; where a0.x is always in the range 0-63
Originally posted by: emuller
Any ideas why 7680 threads are so much faster on the 4870, whereas it makes no difference on the 5770?
I think a reasonable guess here is that it has something to do with latency. With 7680 threads you have (7680/10/64) = 12 warps per SIMD core, so it might be enough to hide LDS access latency. With 1280 threads you have only 2 warps per SIMD core, which usually isn't enough. There are also some issues with executing more than 1 group on a SIMD core at the same time (I think the behaviour of RV7xx and RV8xx with more than 1 group per SIMD is rather unknown).
PS. I guess (from your other post) that you use a group size of 64 (so 1 warp per group). Usually you should try to have at least 6 warps per group to hide latency (of course the best value depends on the kernel).
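The warp counts quoted above can be reproduced with a short Python sketch. The 10 SIMD cores and 64 threads per warp come straight from the arithmetic in the post; the function and constant names are only illustrative, and the calculation assumes the threads are spread evenly over the SIMD cores.

```python
# Figures taken from the post's own arithmetic (7680/10/64).
SIMD_CORES = 10   # SIMD cores assumed for the 4870
WARP_SIZE = 64    # threads per warp (wavefront)

def warps_per_simd(threads, simd_cores=SIMD_CORES, warp_size=WARP_SIZE):
    """Average number of warps resident on each SIMD core,
    assuming threads are distributed evenly across the cores."""
    return threads / simd_cores / warp_size

for threads in (7680, 1280):
    print(threads, "threads:", warps_per_simd(threads), "warps per SIMD core")
```

With 1280 threads each SIMD core has only 2 warps to switch between while an LDS access is outstanding, compared with 12 for 7680 threads.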
In the OpenCL forum, I think somebody benchmarked the 5770 LDS and got 540 GB/s of bandwidth.
So yeah, it is much much faster