The writes should be ordered too: linear writes of vec4 elements (e.g. float4, 16 bytes each), where adjacent work-items write adjacent addresses, will give you the maximum bandwidth.
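To make "linear" concrete, here is a rough sketch in plain C that emulates the addressing on the host. It is not the benchmark code: the `float4` struct is a stand-in for OpenCL's built-in type, and `linear_index`, `strided_index`, `byte_gap` and `emulate_linear_fill` are hypothetical names made up for illustration. The comment shows what the equivalent OpenCL kernel would look like, assuming a 1-D launch with one float4 per work-item.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for OpenCL's float4: four floats, 16 bytes per element. */
typedef struct { float x, y, z, w; } float4;

/* Linear (coalesced) mapping: work-item gid writes element gid. */
size_t linear_index(size_t gid) { return gid; }

/* Strided mapping, shown only for contrast: neighbouring work-items
   land `stride` elements apart, so their writes are not contiguous. */
size_t strided_index(size_t gid, size_t stride) { return gid * stride; }

/* Byte distance between the elements at indices a and b (b >= a). */
size_t byte_gap(size_t a, size_t b) {
    assert(b >= a);
    return (b - a) * sizeof(float4);
}

/* Host-side emulation of every work-item storing one float4.
   Assumed OpenCL equivalent:
     __kernel void fill(__global float4 *out) {
         out[get_global_id(0)] = (float4)(1.0f);
     }
*/
void emulate_linear_fill(float4 *out, size_t n) {
    assert(out != NULL);
    for (size_t gid = 0; gid < n; ++gid) {
        float4 v = {1.0f, 1.0f, 1.0f, 1.0f};
        out[linear_index(gid)] = v;
    }
}
```

With the linear mapping, work-items 0 and 1 write 16 bytes apart, i.e. back to back. With a stride of, say, 64 elements they would be 1024 bytes apart, which is the pattern to avoid.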
I have a benchmark that reaches about 100 GB/s global memory write speed on a 5870 (and about 50 GB/s on a 5770). It uses coalesced writes, as I mentioned.
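For reference, a figure like this is just bytes written divided by kernel time. A minimal sketch, assuming decimal gigabytes (1e9 bytes) and a timing you measured yourself; `write_bandwidth_gbs` is a hypothetical helper, not part of the benchmark:

```c
/* Write bandwidth in GB/s (1 GB = 1e9 bytes, an assumption here).
   bytes_written: total bytes stored by the kernel
   seconds:       measured kernel execution time */
double write_bandwidth_gbs(double bytes_written, double seconds) {
    return bytes_written / seconds / 1e9;
}
```

As a made-up example, a kernel that writes 1e9 bytes in 0.01 s works out to roughly 100 GB/s. Those inputs are illustrative only, not measurements from the benchmark.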