

Journeyman III

4870 vs 5770 shared memory performance for WarpStandard Random # generator

I've recently ported the WarpStandard RNG for GPUs to IL:

Summary: the RNG shares state across the whole wavefront via LDS, which makes it much faster than an MT.

What I'm noticing is that on a 4870, I have rather worse performance than the CUDA version running on a gtx260, and it seems to be due to slow LDS.

My 5770 is more than twice as fast as my 4870.

Here are the #s:

CUDA gtx260: 35 Gsamples/s

IL version on 5770: 21 Gsamples/s for 1280 threads or 7680 threads

IL version on 4870: 7.5 Gsamples/s for 7680 threads and 2.1 Gsamples/s for 1280 threads.

Any ideas why 7680 threads are so much faster on the 4870, whereas it makes no difference on the 5770? 

In the attached disassembly, clearly a lot has changed for LDS between the two platforms ...






4870 LDS operations:

12 TEX: ADDR(1772) CNT(1)
     19  LOCAL_DS_WRITE  (0)  R2, STRIDE(4) SIMD_REL
13 TEX: ADDR(1774) CNT(1)
     20  LOCAL_DS_READ  R2, R6.xy WATERFALL
14 TEX: ADDR(1776) CNT(1)
     21  LOCAL_DS_READ  R1, R7.xy WATERFALL

5770 LDS operations:

     29  x: LDS_WRITE  ____, R4.x, PV28.x
     30  x: LDS_WRITE  ____, R4.y, T0.w
     31  x: LDS_WRITE  ____, R4.z, T0.z
     32  x: LDS_WRITE  ____, R2.w, T0.y
     33  x: LDS_READ2_RET  QAB, R2.x, R3.y
     34  x: LDS_READ2_RET  QAB, R2.z, R3.w
     35  x: LDS_READ2_RET  QAB, R3.x, R6.y
     36  x: LDS_READ2_RET  QAB, R3.z, R4.w

3 Replies
Journeyman III

I only need 128 bits per thread in a wavefront ... might it be possible to use shared temp registers instead of LDS? Perhaps they are faster. I guess I would need to use them as an array using dclarray. Is this even possible for shared registers? Something like:

dcl_shared_temp sr64

dclarray sr0,sr64

mov a0.x, r1.x

mov r5, sr[a0.x]


mov a0.x, r2.x

mov sr[a0.x], r4

; where a0.x is always in the range 0-63



Adept II

Originally posted by: emuller

Any ideas why 7680 threads are so much faster on the 4870, whereas it makes no difference on the 5770?

I think a reasonable guess here is that it has something to do with latency. With 7680 threads you have (7680/10/64) = 12 warps per SIMD core, which might be enough to hide LDS access latency. With 1280 threads you have only 2 warps per SIMD core, which usually isn't enough. There are also some issues with executing more than 1 group on a SIMD core at the same time (I think the behaviour of RV7xx and RV8xx with more than 1 group per SIMD is rather unknown).

PS. I guess (from your other post) that you use a group size of 64 (so 1 warp per group). Usually you should try to have at least 6 warps per group to hide latency (of course, the best value depends on the kernel).
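For anyone who wants to redo the arithmetic above for other thread counts, here is a rough Python sketch. It only reproduces the division in the reply; the wavefront size of 64, the 10 SIMD cores, and the function name are assumptions taken from that division rather than anything measured, and it says nothing about how many wavefronts a given kernel actually needs.

```python
# Back-of-the-envelope occupancy estimate, mirroring (threads / 10 / 64).
# Assumptions: 64 threads per wavefront ("warp" in the reply) and
# 10 SIMD cores, i.e. the divisors used in the reply above.
WAVEFRONT_SIZE = 64
SIMD_CORES = 10


def wavefronts_per_simd(total_threads, simd_cores=SIMD_CORES,
                        wavefront_size=WAVEFRONT_SIZE):
    """Return the average number of wavefronts resident on each SIMD core."""
    wavefronts = total_threads // wavefront_size  # whole wavefronts launched
    return wavefronts / simd_cores                # spread evenly over SIMDs


if __name__ == "__main__":
    for threads in (1280, 7680):
        print(threads, "threads ->", wavefronts_per_simd(threads),
              "wavefronts per SIMD")
```

With the two launch sizes from the original post this gives 2 and 12 wavefronts per SIMD core, which matches the figures quoted in the reply.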


In the OpenCL forum, I think somebody benchmarked the 5770's LDS and got 540 GB/s of bandwidth.

So yeah, it is much much faster