I am using a Radeon 7750. My first question is whether this graphics card supports parallel execution of kernels in OpenCL. Is the execution model controlled entirely by the driver, or is it possible to provide hints that a kernel should run in parallel with other kernels (or, conversely, that it should not)?
My program uses multiple host threads, each calling a function that in turn performs OpenCL calculations on blocks of up to 128 KB of data. Currently each calculation requires 4 compute units, i.e. 64*4 work-items, and the card has 8 compute units, which should allow for parallel execution(?).
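The compute-unit budget described above can be written out as a small calculation. This is only a sketch under a simplifying assumption: each wavefront of 64 work-items is treated as occupying one compute unit by itself. Real hardware can keep several wavefronts resident per compute unit, and whether the driver actually co-schedules kernels is exactly the open question. The function names are hypothetical.

```python
WAVEFRONT_SIZE = 64   # work-items per wavefront, as implied by "64*4" in the post
COMPUTE_UNITS = 8     # compute units the post reports for the 7750


def wavefronts_needed(work_items, wavefront_size=WAVEFRONT_SIZE):
    """Number of wavefronts needed to cover work_items (rounded up)."""
    return -(-work_items // wavefront_size)


def kernels_that_fit(compute_units, wavefronts_per_kernel):
    """Kernels that could run side by side, ASSUMING one wavefront per
    compute unit. This is a simplification, not a statement about how
    the driver schedules work."""
    return compute_units // wavefronts_per_kernel


per_kernel = wavefronts_needed(64 * 4)
print("wavefronts per kernel:", per_kernel)
print("kernels that could fit at once:", kernels_that_fit(COMPUTE_UNITS, per_kernel))
```

Under that assumption, two such kernels would fit on the card at the same time, which is why parallel execution looks possible on paper.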
When I use multiple command queues, the execution time of a kernel scales more or less linearly with the number of command queues used. Is this intended behaviour? Running one kernel sequentially takes roughly 0.0002s. Running the same amount of calculations spread over multiple host threads results in an average time of 0.0008s in each thread, but the overall average execution time per kernel remains the same: with 4 host threads, 0.0008/4 = 0.0002s.
This suggests that the kernels are queued on the GPU and executed sequentially.
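The reasoning behind that conclusion can be sketched as a small check. It takes the single-queue time and the per-thread time measured with N host threads, and asks whether the effective cost per kernel is unchanged. The function names and the 10% tolerance are assumptions made for illustration, not part of any OpenCL API.

```python
def effective_time_per_kernel(latency_per_thread, num_threads):
    """Average wall-clock cost per kernel when num_threads kernels are
    in flight and each thread observes latency_per_thread."""
    return latency_per_thread / num_threads


def looks_serialized(single_time, latency_per_thread, num_threads, tolerance=0.1):
    """True if the effective per-kernel cost with N threads matches the
    single-queue cost within a relative tolerance, i.e. nothing was
    gained from submitting through several queues."""
    effective = effective_time_per_kernel(latency_per_thread, num_threads)
    return abs(effective - single_time) <= tolerance * single_time


# Numbers from the post: 0.0002s alone, 0.0008s per thread with 4 threads.
print(looks_serialized(0.0002, 0.0008, 4))

# If the kernels had truly overlapped, each thread would still see
# about 0.0002s, and the check would come out the other way.
print(looks_serialized(0.0002, 0.0002, 4))
```

With the measured numbers the effective cost per kernel stays at 0.0002s, which is what sequential execution on the device would produce.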
I've tried both the 13.1 (stable) and 13.3 beta drivers, and I am using the latest AMD APP SDK, 2.8. I've run the code on both a 7750 and a 6870 graphics card, with similar results on both (regarding the parallelism).
Are there any written guidelines on how to achieve parallelism with multiple kernels?
I read this thread http://devgurus.amd.com/message/1279083#1279083 and changed the program so that I cache kernels and buffers and don't create and destroy buffers and kernels until the program is shut down. The performance did increase somewhat, but no parallelism was achieved.
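One way to confirm directly whether any two kernels ran concurrently is to compare their device timestamps instead of host-side timings. In OpenCL these are available from `clGetEventProfilingInfo` with `CL_PROFILING_COMMAND_START` and `CL_PROFILING_COMMAND_END`, provided the queue was created with `CL_QUEUE_PROFILING_ENABLE`. The sketch below only shows the comparison step. The function names and the sample timestamps are hypothetical, not measured values.

```python
def intervals_overlap(start_a, end_a, start_b, end_b):
    """True if two [start, end) execution intervals share any time."""
    return start_a < end_b and start_b < end_a


def any_overlap(intervals):
    """True if any two (start, end) intervals in the list overlap.

    Sorts by start time and checks each interval against the latest
    end time seen so far.
    """
    latest_end = None
    for start, end in sorted(intervals):
        if latest_end is not None and start < latest_end:
            return True
        latest_end = end if latest_end is None else max(latest_end, end)
    return False


# Hypothetical timestamps in nanoseconds, one (start, end) pair per kernel.
back_to_back = [(0, 200_000), (200_000, 400_000), (400_000, 600_000)]
concurrent = [(0, 200_000), (100_000, 300_000)]

print(any_overlap(back_to_back))
print(any_overlap(concurrent))
```

If the intervals collected from all command queues never overlap, the kernels were executed one after another on the device, regardless of how many queues submitted them.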