Yeah, I read that article, but no source is given, so I'm reluctant to believe it (much less reference it in a published performance analysis). They seem to have gotten that number by multiplying the clock rate by the bus width by 2 (for double data rate), which seems reasonable if the clock rate and bus width are correct.

However, I have a memory-bound application that appears to write ~600 MB of data in 13 ms, i.e. a memory bandwidth of roughly 46 GB/s. This is significantly higher than the ~25 GB/s write bandwidth given in the wiki article, so if the article is correct I have some interesting tribulations ahead of me in understanding this performance.

I presently time execution from just before the kernel's parameters are assigned and the kernel is launched until all events from outstanding kernels have been received, which seems like the intuitive way to do this. These times are captured on the CPU using clock_gettime() from time.h.