I am using a Radeon 7750. The first question I have is whether or not this graphics card supports execution of parallel kernels in OpenCL. Is the execution model completely controlled by the driver, or is it possible to provide hints that I want a kernel to run in parallel with other kernels (or the other way around)?
My program uses multiple host threads, each calling a function which in turn performs OpenCL calculations on blocks of up to 128 KB of data. Currently each calculation requires 4 compute units, i.e. 64*4 = 256 work-items, and the card has 8 compute units, which should allow for parallel execution(?).
When I use multiple command queues, the execution time of a kernel scales more or less linearly with the number of command queues in use. Is this intended behaviour? The time for running one kernel sequentially is roughly 0.0002 s. Running the same amount of work, but spread out over multiple host threads, results in an average time of 0.0008 s in each thread, while the overall average execution time per kernel remains the same: with 4 host threads, 0.0008/4 = 0.0002 s.
This suggests that the kernels are queued on the GPU and executed sequentially.
I've tried the 13.1 (stable) and 13.3 beta drivers, and I am using the latest AMD APP SDK 2.8. I've run the code on both a 7750 and a 6870 graphics card, with similar results on both (regarding the parallelism).
Are there any written guidelines for achieving parallelism with multiple kernels?
I read this thread http://devgurus.amd.com/message/1279083#1279083 and changed the program so that kernels and buffers are cached, and are not created and destroyed until the program shuts down. Performance did increase somewhat, but no parallelism was achieved.
That's too bad. I think I read somewhere that the 5000 series had support for it in hardware but the SDK didn't, and I thought that had been fixed since then. I guess I'll have to think of something else.
Yes, I also read that 5000-series GPUs should be capable of running multiple programs in parallel, but only 7000-series GPUs have full hardware support for multiple kernels at once. Try looking here: http://devgurus.amd.com/message/1284207
Interesting, so basically there is (or at least was) experimental support for it? Setting that environment variable doesn't seem to change anything; I don't notice any difference in timings running W8 x64 with the latest beta drivers.
Parallel execution of kernels is enabled by default on the 7000 family of GPUs and above. You can use GPUView to visualize the interaction with the GPU.
Bear in mind a few things:
1.) The concurrent kernels must be run from two queues created consecutively. OCL queues are assigned to GPU async engines in creation order; to guarantee that no two queues are assigned to the same engine, create them consecutively.
2.) All the async entry points to the GPU are implemented on top of the same compute engine, so if the kernels are large enough to utilize all the compute cores you will not see a dramatic improvement in performance; the concurrent kernels will time-share the compute engine. Async dispatch shows the best results when running many small kernels. On a 7970 we need at least 30 K threads to fully utilize the GPU.
3.) Memory transfer commands and kernels will execute concurrently, even on the same queue, as long as there is no resource dependency between them.