I'm currently working on optimizing an OpenCL C++ application. The platform is an A8-3870 APU with an additional Radeon HD7750 GPU (Capeverde). I'm using the Catalyst 12.10 driver.
The application consists of several kernels and buffer reads, set up as a task graph that uses events for synchronization. They run on command queues that have the out-of-order bit (CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE) set. The kernels are enqueued on the HD7750 GPU queue, while the memory transfers are enqueued on the CPU queue. The buffer objects in question are created with CL_MEM_READ_WRITE | CL_MEM_HOST_READ_ONLY | CL_MEM_USE_PERSISTENT_MEM_AMD. All command queues are flushed before the final call to clFinish().
I expect each memory transfer to start immediately after the kernel it depends on has finished. The timeline in CodeXL (1.0.2409.0), however, shows that the memory transfers in the CPU command queue only start after all kernels in the GPU queue have finished.
When the memory transfers are enqueued on the GPU queue instead, each one is executed immediately after the kernel it depends on has finished, but independent kernels are not executed in parallel: the command queue behaves like an in-order queue.
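To make the expectation concrete, here is a toy model in plain C++ (no OpenCL involved) of commands enqueued on a single queue. It is only a sketch: `Task` and `start_times` are hypothetical names, durations are made-up time units rather than measurements, and the out-of-order case assumes the device can overlap any commands whose dependencies are satisfied.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// One enqueued command. Durations are arbitrary, hypothetical time units.
struct Task {
    double duration;
    std::vector<int> deps;  // indices of earlier tasks whose events this task waits on
};

// Computes the start time of each task, in enqueue order.
// in_order == false: a task starts as soon as all its dependencies have finished.
// in_order == true:  a task additionally waits for the previously enqueued task.
std::vector<double> start_times(const std::vector<Task>& tasks, bool in_order) {
    std::vector<double> start(tasks.size()), finish(tasks.size());
    double queue_busy_until = 0.0;
    for (std::size_t i = 0; i < tasks.size(); ++i) {
        double s = 0.0;
        for (int d : tasks[i].deps) {
            // Dependencies must point at commands enqueued earlier.
            assert(d >= 0 && static_cast<std::size_t>(d) < i);
            s = std::max(s, finish[d]);
        }
        if (in_order) s = std::max(s, queue_busy_until);
        start[i] = s;
        finish[i] = s + tasks[i].duration;
        queue_busy_until = std::max(queue_busy_until, finish[i]);
    }
    return start;
}
```

For two independent kernels K0 (2 units) and K1 (3 units), each followed by a 1-unit read of its result and enqueued as K0, R0, K1, R1, the out-of-order model gives start times 0, 2, 0, 3, while the in-order model gives 0, 2, 3, 6. The second pattern is what the timeline resembles when everything is on the GPU queue.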
What am I missing here? Thanks in advance for any hints.