Was it broken before 2.1?
Look at these results:
up: 0.000000 s / exec: 0.054278 s / down: 0.000000s / 309095084
56 / up: 0.000022s / exec: 3.003121 s / down: 0.000022s / 313044173
up: 0.087022 s / exec: 0.025523 s / down: 0.000044s / 149012759
56 / up: 4.966646s / exec: 1.429298 s / down: 0.002827s / 147135768
56 is the number of passes
up is the time it takes to upload 256 MB of data to the kernel
down is the time it takes to read an int4 into my host program
exec is the kernel execution time (16777216 work-items)
It seems that kernel execution time dropped by about 50% for me, but because I include upload and download time in the measurement of results/s (the last column), the overall result is about 50% worse. But yeah, it seems that 0.000000 s for uploading 256 MB of data with 2.0.1 was pretty wrong.
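To make the arithmetic explicit, here is a minimal sketch of how the results/s figure appears to be derived. It assumes results/s = (work-items × passes) / (up + exec + down); that formula is my reading of the numbers, not something stated above, and it reproduces the posted values only to within about 1%. The helper names are hypothetical.

```python
# Assumption: results/s = (work-items * passes) / (up + exec + down).
# The timings are copied from the two 56-pass result lines above.
WORK_ITEMS = 16_777_216  # work-items per kernel launch


def results_per_second(passes, up_s, exec_s, down_s, work_items=WORK_ITEMS):
    """Throughput with upload and download time included (hypothetical helper)."""
    return work_items * passes / (up_s + exec_s + down_s)


def kernel_only_per_second(passes, exec_s, work_items=WORK_ITEMS):
    """Throughput counting kernel execution time only (hypothetical helper)."""
    return work_items * passes / exec_s


runs = [
    # (label, passes, up, exec, down), all times in seconds
    ("upload ~0 s", 56, 0.000022, 3.003121, 0.000022),
    ("upload ~5 s", 56, 4.966646, 1.429298, 0.002827),
]

for label, passes, up_s, exec_s, down_s in runs:
    total = results_per_second(passes, up_s, exec_s, down_s)
    kernel = kernel_only_per_second(passes, exec_s)
    print(f"{label}: {total:,.0f} results/s overall, {kernel:,.0f} results/s kernel only")
```

Counting the kernel alone, the second run is roughly twice as fast; once the ~5 s upload is included, the overall figure drops to roughly half.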