I might be missing something, but I don't see a way to measure the amount of data written by kernels via GPUPerf. I would like to have this so that I can automatically generate bandwidth estimates for kernels. So far I have been counting bytes by hand.
Please check out the FastPath and CompletePath performance counters. Note, though, that these correspond to the raw data written on the GPU, so they may not match a byte count done by hand from the kernel source.
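As a rough sketch of how those two counters could feed an automatic bandwidth estimate: the snippet below assumes the FastPath and CompletePath values are reported in kilobytes, that their sum is the total raw data written, and that you have the kernel execution time in milliseconds from the same profiling run. The function name, the parameter names, and the choice of 1024 bytes per kilobyte are my own assumptions for illustration, so please check the units against the counter documentation for your GPUPerf version.

```python
def estimate_write_bandwidth_gbps(fastpath_kb, completepath_kb,
                                  kernel_time_ms, bytes_per_kb=1024):
    """Estimate write bandwidth in GB/s (1 GB = 1e9 bytes).

    Assumptions (not confirmed by the thread):
      - both counters are in kilobytes,
      - their sum is the total raw data written by the kernel,
      - kernel_time_ms is the kernel's execution time in milliseconds.
    """
    if kernel_time_ms <= 0:
        raise ValueError("kernel_time_ms must be positive")

    # Total raw bytes written through either write path.
    total_bytes = (fastpath_kb + completepath_kb) * bytes_per_kb

    # Convert milliseconds to seconds, then bytes/s to GB/s.
    seconds = kernel_time_ms / 1000.0
    return total_bytes / seconds / 1e9


if __name__ == "__main__":
    # Hypothetical counter values, for illustration only.
    bw = estimate_write_bandwidth_gbps(fastpath_kb=65536.0,
                                       completepath_kb=0.0,
                                       kernel_time_ms=2.5)
    print(f"estimated write bandwidth: {bw:.2f} GB/s")
```

Since the counters reflect raw writes, the result is an estimate of achieved write traffic rather than of the bytes your kernel logically stores, so it may differ from a hand count.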