I only need 128 bits per thread in a wavefront. Might it be possible to use shared temp registers instead of LDS? Perhaps they are faster. I guess I would need to index them as an array using dclarray. Is this even possible for shared registers? Something like:
mov a0.x, r1.x
mov r5, sr[a0.x]
mov a0.x, r2.x
mov sr[a0.x], r4
; where a0.x is always in the range 0-63
Originally posted by: emuller
Any ideas why 7680 threads are so much faster on the 4870, whereas it makes no difference on the 5770?
I think a reasonable guess is that it has something to do with latency. With 7680 threads you have (7680/10/64) = 12 warps per SIMD core, so that might be enough to hide LDS access latency. With 1280 threads you have only 2 warps per SIMD core, which usually isn't enough. There are also some issues with executing more than one group on a SIMD core at the same time (I think the behaviour of RV7xx and RV8xx with more than one group per SIMD is largely unknown).
PS. I guess (from your other post) that you use a group size of 64 (so 1 warp per group). Usually you should try to have at least 6 warps per group to hide latency (of course, the best value depends on the kernel).
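To make the arithmetic above easy to repeat for other launch sizes, here is a rough sketch in Python. The function name is made up for illustration. The defaults (10 SIMD cores, 64 threads per warp/wavefront) are simply the divisors used in the estimate above, and the sketch assumes the threads are spread evenly across the SIMD cores, which the hardware scheduler may not do exactly.

```python
def wavefronts_per_simd(total_threads, simd_count=10, wavefront_size=64):
    """Average number of wavefronts (warps) resident on each SIMD core.

    Assumes an even distribution of wavefronts over all SIMD cores;
    simd_count and wavefront_size are the values used in the post above.
    """
    wavefronts = total_threads // wavefront_size  # whole wavefronts launched
    return wavefronts / simd_count


# The two launch sizes discussed in this thread.
for threads in (7680, 1280):
    print(threads, "threads ->", wavefronts_per_simd(threads), "wavefronts per SIMD")
```

For a different card or group size, pass the appropriate simd_count and wavefront_size instead of relying on the defaults.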
In the OpenCL forum, I think somebody benchmarked the 5770's LDS and got 540 GB/s of bandwidth.
So yeah, it is much, much faster.