Archives Discussions

Wibowit
Journeyman III

Concurrent kernel execution + some other questions

Does the AMD APP SDK fully support out-of-order queues and concurrent kernel execution?

Hi,

 

I couldn't find any info about the level of task parallelism in AMD's OpenCL implementation.

 

OpenCL provides two ways to achieve task-level parallelism:

- multiple in-order queues,

- one out-of-order queue.

 

I am curious which ones are fully implemented. I've read somewhere that the APP SDK ignores events and executes kernel invocations in order. I've also read that the APP SDK supports multiple queues, and that with multiple queues one can execute different kernels concurrently. But maybe when I create multiple queues, each one is bound to a different set of compute units?
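For reference, the two patterns can be sketched in OpenCL 1.1 host code roughly like this. This is a minimal sketch, assuming `context` and `device` already exist; error handling is abbreviated, and whether enqueued kernels actually overlap on a given runtime is exactly what this thread is about:

```c
/* Sketch: pick between the two task-parallelism patterns at runtime. */
#include <CL/cl.h>

void create_queues(cl_context context, cl_device_id device,
                   cl_command_queue *q1, cl_command_queue *q2)
{
    cl_int err;
    cl_command_queue_properties props = 0;

    /* Ask the device whether it claims out-of-order execution support. */
    clGetDeviceInfo(device, CL_DEVICE_QUEUE_PROPERTIES,
                    sizeof(props), &props, NULL);

    if (props & CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE) {
        /* Pattern 1: a single out-of-order queue; dependencies between
         * kernels are then expressed explicitly via event wait lists. */
        *q1 = clCreateCommandQueue(context, device,
                  CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, &err);
        *q2 = *q1;
    } else {
        /* Pattern 2: multiple in-order queues, one per independent
         * task; kernels on different queues have no implicit ordering. */
        *q1 = clCreateCommandQueue(context, device, 0, &err);
        *q2 = clCreateCommandQueue(context, device, 0, &err);
    }
}
```

Independent kernels would then be enqueued on `*q1` and `*q2` followed by a `clFlush` on each; but as the replies below show, the runtime is still free to serialize them.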

 

I'm designing a sorting algorithm (BWT transform) plus compression, and I plan to have many kernels: one heavy on LDS (initial sorting of small blocks), some heavy on global bandwidth (global sorting), and some heavy on ALU & registers (compression), and I want my GPU fully utilized. BWT is a block transform, so one block could be sorted while another block is compressed in parallel.

 

My other questions are:

Why does OpenCL have halved LDS bandwidth? I've read that in AMD's documentation.

 

Does the APP SDK compiler accept any parameters to choose the optimization level, similar to -O2 or -O3 in GCC?

0 Likes
15 Replies
himanshu_gautam
Grandmaster

Wibowit,

Out-of-order queues are not supported, but multiple in-order queues are. I am not aware of any event handling issues.

 

Can you please point out where it is mentioned that LDS bandwidth has been halved?

 

0 Likes

So if I have multiple queues, will different kernels be executed concurrently or serialized? I've read (IIRC) that Cayman has some sort of Hyper-Threading, i.e. one compute unit can switch between two kernels, like it switches between wavefronts. Is that true? And is it supported in AMD's OpenCL implementation, or will it be supported in a few months?

 

In the OpenCL Programming Guide, Chapter 4.10, Using LDS or L1 Cache:

LDS is typically larger than L1 (for example: 32 kB vs 8 kB on Cypress). If it is not possible to obtain a high L1 cache hit rate for an algorithm, the larger LDS size can help. The theoretical LDS peak bandwidth is 2 TB/s, compared to L1 at 1 TB/sec. Currently, OpenCL is limited to 1 TB/sec LDS bandwidth.

If LDS bandwidth on Cypress is halved, then it's probably halved on other products too.


0 Likes

The LDS bandwidth has always been 1 TB/s for OpenCL. It might be different at the CAL level, but I don't know that.

That statement should mean that the LDS is capable of transferring at 2 TB/s, but that is not exposed to OpenCL.

The bandwidths for other devices are also in accordance with this rule. Refer to Appendix D of the OpenCL Programming Guide to compare devices (theoretical peak bandwidths appear to be listed there).

 

0 Likes

What do you mean by "not exposed"? The LDS bandwidths given in Appendix D are chip-wide figures, i.e. the sum of the LDS bandwidths of all compute units. Juniper is basically half of Cypress (not counting DP math support), so it's logical that it would have half the bandwidth.

If one compute unit on Cypress operating at 850 MHz has a theoretical LDS bandwidth of 108.8 GB/s, and OpenCL limits that to 54.4 GB/s, then why should a Juniper compute unit provide full bandwidth?

I assume that LDS bandwidth in OpenCL is halved.
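For what it's worth, the per-CU numbers above can be recomputed from the chip-wide figures. This is just a quick sanity check; the 32 banks times 4 bytes per clock figure for one Cypress CU is my assumption:

```python
# Sanity check of the LDS bandwidth figures discussed above.
# Assumptions: Cypress = 20 compute units at 850 MHz, and each CU's LDS
# peaks at 32 banks x 4 bytes per clock.
CUS = 20
CLOCK_HZ = 850_000_000
BANKS = 32
BYTES_PER_BANK = 4

per_cu_peak = BANKS * BYTES_PER_BANK * CLOCK_HZ   # bytes/s for one CU
chip_peak = per_cu_peak * CUS                     # bytes/s for the whole chip

print(f"per-CU peak:   {per_cu_peak / 1e9:.1f} GB/s")      # 108.8 GB/s
print(f"chip peak:     {chip_peak / 1e12:.3f} TB/s")       # 2.176 TB/s, ~2 TB/s
print(f"per-CU halved: {per_cu_peak / 2 / 1e9:.1f} GB/s")  # 54.4 GB/s
print(f"chip halved:   {chip_peak / 2 / 1e12:.3f} TB/s")   # 1.088 TB/s, ~1 TB/s
```

The halved chip-wide number lands on ~1 TB/s, consistent with the quoted guide text, and the halved per-CU number is the 54.4 GB/s mentioned above.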

 

I am still waiting for answers to my other questions, about kernel/task-level concurrency on AMD GPUs.

0 Likes

I've found some info about CKE (concurrent kernel execution) on the Radeon HD 5000 series. Here: http://forums.amd.com/forum/messageview.cfm?catid=390&threadid=142485&messid=1187105&parentid=118702...

Jeff Golds said that it's possible to run 8 unique programs concurrently. Even 5 would be wonderful. Does anyone have info on when this is planned to be exposed by the AMD APP SDK?

 

If you look at the Fermi whitepaper: http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pd...

Then on page 11, in the summary table, it says that Fermi is able to run 16 different kernels concurrently. That means they have a good ability to fully utilize their hardware when different kernels have different bottlenecks.

 

 

0 Likes

In upcoming releases. That is the answer you will get.

0 Likes

I've tried to run some kernels concurrently on my Radeon 6950 with different command queues, using the 2.4 SDK with the 11.3 driver... no overlap.

CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE is not supported on my GPU, so I get the same result when enqueuing the kernels on the same command queue -> no overlap.

So can anyone confirm that CKE is still not supported in the 2.4 SDK, and are there any plans to implement it in the near future?

Thank you.

0 Likes

Bump.

 

I'm still waiting for answers.... Micah, are you here?

0 Likes

Concurrent kernel execution is not supported in SDK 2.4 and is not planned for 2.5 either.
0 Likes

Bad news!!!

I don't really want to stick with CUDA. I hope this will be done in some future release...

Thank you anyway.

0 Likes

Sorry for reviving this thread. I've also looked at this a bit, and you may be interested in what I found.

I wrote a program that executes kernels using clEnqueueTask, which shows that concurrent kernels do not work, at least for my GPU (HD 6450, SDK v2.5).  However, concurrency via multiple queues works for my quad-core OpenCL CPU device.  Unfortunately, there seems to be a max of about 44 queues that can be created.  (That should be fine until we have a 44-core CPU.)  CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE does not work for either GPU or CPU.  The code is here.

I also have an NVIDIA Fermi card, and there is no way to get concurrent kernels on it using their OpenCL (driver v280.19 w/ OpenCL 1.1), with either a single out-of-order queue or multiple queues.  You can get concurrent kernels using the CUDA Runtime API, but then it's not OpenCL.

Ken

0 Likes

Originally posted by: KenDomino  I wrote a program that executes kernels using clEnqueueTask, which shows that concurrent kernels do not work, at least for my GPU (HD 6450, SDK v2.5).

Concurrent kernel execution is not supported yet on GPUs.

 However, concurrency via multiple queues works for my quad-core OpenCL CPU device.  Unfortunately, there seems to be a max of about 44 queues that can be created.  (That should be fine until we have a 44-core CPU.)

Are you getting an appropriate error message if more than 44 command queues are created?

 CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE does not work for either GPU or CPU.  The code is here.

 

CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE is not supported yet on either CPU or GPU.

0 Likes

Originally posted by: genaganna

 

CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE is not supported yet on either CPU or GPU.

 

 

That's strange. I thought out-of-order execution was a mandatory feature in the OpenCL spec...

0 Likes

Yep, I do get an error return (out of resources) on the command queue create.  That's fine.

I wasn't sure about AMD's OpenCL or GPUs, but the CUDA API itself does support concurrent kernels.

0 Likes

Hi, this thread was very useful for checking concurrent execution of tasks in OpenCL. In my experiments I am creating many queues and using them to enqueue different kernels. I am not able to see full utilization of all the cores in the machine (equal to the number of queues declared). It's an Intel processor with 24 cores, and it shows utilization of no more than 3 to 6 cores at a time (with something like 10 queues). Did someone check the utilization of the CPU when running with many queues? Or is there some way to do it without using device fission?

 

Some values for 10 queues:

  13.510468 seconds elapsed for concurrent (utilization around 4 cores (stable))

  20.554451 seconds elapsed for sequential (1 core)

Thanks

0 Likes