I have a memory-bound kernel that reaches only about 70 GB/s on the ATI 5870, very far from the 153 GB/s peak. Every optimization I know of regarding coalescing, occupancy and workgroup size has been done. (On a C2050 I reach ~120-130 GB/s.) The kernel uses 57 vGPRs, so there are 256 active threads, i.e. 4 wavefronts. Is that enough to hide latency?
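As a sanity check on the occupancy figure, here is a minimal sketch of the arithmetic. It assumes a budget of 256 vector registers per work-item per SIMD and 64 work-items per wavefront; both constants are assumptions for illustration and not something taken from a profiler, and the function name is hypothetical.

```python
# Assumption: each SIMD offers 256 vector registers per work-item,
# shared by all wavefronts resident on that SIMD.
VGPR_BUDGET_PER_WORK_ITEM = 256
# Assumption: 64 work-items per wavefront.
WAVEFRONT_SIZE = 64


def resident_wavefronts(vgprs_per_work_item: int) -> int:
    """Wavefronts that fit in the register budget (hypothetical helper)."""
    return VGPR_BUDGET_PER_WORK_ITEM // vgprs_per_work_item


waves = resident_wavefronts(57)       # 57 vGPRs, as reported for the kernel
threads = waves * WAVEFRONT_SIZE      # active work-items
print(waves, threads)
```

Under these assumptions the result matches the 4 wavefronts / 256 active threads quoted above; other limits (local memory, workgroup size) could lower it further.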
On the other hand, I have tested my 5870 with the AMD GlobalBandWidth benchmark, where I get 77 GB/s for uncached reads. Having read the benchmark's kernel for uncached (but coalesced) reads, I can say that its read method is similar
to the one in my kernel. As far as I know, an uncached read should measure the bandwidth of the GDDR5, so I should get something closer to 153 GB/s. So the question is: can someone with a 5870 confirm this number, or is there something I am missing?
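For reference, the 153 GB/s peak and the fraction of it actually reached can be worked out as follows. This is only a sketch: the 1200 MHz memory clock, the 256-bit bus and the 4 transfers per clock for GDDR5 are the commonly quoted specs for the 5870 and are assumptions here, while the 70 and 77 GB/s values are the measurements above.

```python
# Assumed 5870 memory specs (not measured):
MEM_CLOCK_HZ = 1200e6        # memory clock
TRANSFERS_PER_CLOCK = 4      # GDDR5 moves 4 bits per pin per clock
BUS_WIDTH_BITS = 256         # memory bus width

# Theoretical peak in GB/s (1 GB = 1e9 bytes).
peak_gb_s = MEM_CLOCK_HZ * TRANSFERS_PER_CLOCK * BUS_WIDTH_BITS / 8 / 1e9

# Measured figures from this post, in GB/s.
measured = {"my kernel": 70.0, "GlobalBandWidth uncached read": 77.0}

print(f"theoretical peak: {peak_gb_s:.1f} GB/s")
for name, gb_s in measured.items():
    # Fraction of the theoretical peak actually achieved.
    print(f"{name}: {gb_s:.0f} GB/s = {gb_s / peak_gb_s:.0%} of peak")
```

With these numbers both my kernel and the benchmark sit at roughly half of the theoretical peak.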