I wrote an OpenCL application that enqueues a series of non-blocking buffer writes, kernel executions, and buffer reads in a loop. At the end of the loop, I call clFinish. When I profile the program in CodeXL, most of the time is spent in the clFinish call (which is expected), but for almost all of that time my GPU (a Fury X) is idle. I've attached a screenshot of a sample run. If I add up the time spent on transfers and kernels, only around 2 seconds out of 10 is spent doing real work. I have also tried adding more queues. With computation and transfer interleaved, the theoretical runtime should then be about 1 second, but I get runtimes ranging from 3 to 10 seconds (examples also attached). Any ideas about what is going on here? I've seen a couple of threads from ~3 years ago pointing to profiling bugs, but hopefully those have been fixed.
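
One way to cross-check the "2 seconds out of 10" figure independently of CodeXL is to collect each command's start and end timestamps and compute how long the device was busy. Below is a minimal sketch in C. It assumes the queues were created with CL_QUEUE_PROFILING_ENABLE and that the timestamps were read per event with clGetEventProfilingInfo (CL_PROFILING_COMMAND_START and CL_PROFILING_COMMAND_END). The `interval_t` type and the `busy_time_ns` helper are hypothetical names for illustration, not part of OpenCL. The helper takes the union of the intervals rather than a plain sum, so commands that overlap across multiple queues are not counted twice.

```c
#include <stdint.h>
#include <stdlib.h>

/* One profiled command. The timestamps are assumed to be the nanosecond
 * values reported for CL_PROFILING_COMMAND_START and
 * CL_PROFILING_COMMAND_END (hypothetical struct, not an OpenCL type). */
typedef struct {
    uint64_t start_ns;
    uint64_t end_ns;
} interval_t;

/* qsort comparator: order intervals by ascending start time. */
static int cmp_start(const void *a, const void *b)
{
    const interval_t *x = (const interval_t *)a;
    const interval_t *y = (const interval_t *)b;
    return (x->start_ns > y->start_ns) - (x->start_ns < y->start_ns);
}

/* Returns the total time (ns) covered by at least one interval.
 * Overlapping intervals are merged, so the result is the length of the
 * union, not the sum. Sorts the array in place. */
uint64_t busy_time_ns(interval_t *iv, size_t n)
{
    if (n == 0) {
        return 0;
    }
    qsort(iv, n, sizeof iv[0], cmp_start);

    uint64_t busy = 0;
    uint64_t cur_start = iv[0].start_ns;
    uint64_t cur_end = iv[0].end_ns;

    for (size_t i = 1; i < n; i++) {
        if (iv[i].start_ns <= cur_end) {
            /* Overlaps (or touches) the current run: extend it. */
            if (iv[i].end_ns > cur_end) {
                cur_end = iv[i].end_ns;
            }
        } else {
            /* Gap before this interval: close the current run. */
            busy += cur_end - cur_start;
            cur_start = iv[i].start_ns;
            cur_end = iv[i].end_ns;
        }
    }
    busy += cur_end - cur_start;
    return busy;
}
```

Comparing this busy time against the wall-clock time measured around the loop and the clFinish call gives the idle fraction without relying on the profiler's timeline view.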