Start using OpenCL™ 2.0 today – AMD is providing a sneak peek that works on GPUs and APUs.
We are still working on the beta SDK, which will be available soon. In the meantime, we have example code ready for the adventurous among you, so you can start learning some of the ins and outs.
We are creating a series of blog posts, called OpenCL 2.0 Demystified - One Feature at a Time. The posts have insights, code snippets, and complete samples that you can download. We are making some serious example code available for you to study and play with.
Review the first blog, on how shared virtual memory can make your code simpler and more efficient. We have planned several posts highlighting various features of OpenCL 2.0. So keep your eyes open – we’ll make announcements here to let you know when they are available.
All these links are available directly in the blog. Make sure you have supported hardware (there’s a complete list on the driver download page). Play with the examples. Maybe write your own OpenCL 2.0 samples. And share your observations with the community here. With this sneak peek at the example code, when the full release is available you’ll be ahead of the curve.
Additional blogs in the series
Thanks for the nice blog post with a clear intro to coarse-grained SVM. I like the simplicity of the pointers being shared but am less sure about the performance implications in the example presented.
Is there code for your OpenCL 1.2 timing comparisons? I'm not sure I quite believe them: on an APU I have zero copy, so if my buffers are set up correctly I can do a zero-penalty map of those buffers in OpenCL 1.2 just fine. Then my data structure just needs to be tweaked slightly to be offset-based rather than (true) pointer-based, and I think I'd get almost the same performance in OpenCL 1.2 as in OpenCL 2.0.
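To make the "offset-based rather than pointer-based" idea concrete, here is a minimal sketch in plain C. The `OffsetNode` layout, the `NIL` sentinel, and both function names are hypothetical and not taken from the blog's sample; the point is only that links stored as indices stay valid wherever the buffer ends up being mapped.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sentinel marking "no next node". */
#define NIL UINT32_MAX

/* Hypothetical node layout: 'next' is an index into the same array,
   not a raw pointer, so it does not depend on the buffer's address. */
typedef struct {
    float    value;
    uint32_t next;   /* index of the next node, or NIL */
} OffsetNode;

/* Build a list of n nodes directly in 'pool' (capacity >= n), holding
   values[0..n-1] in order. Returns the head index, or NIL if n == 0. */
static uint32_t build_list(OffsetNode *pool, const float *values, uint32_t n)
{
    for (uint32_t i = 0; i < n; ++i) {
        pool[i].value = values[i];
        pool[i].next  = (i + 1 < n) ? i + 1 : NIL;
    }
    return n ? 0 : NIL;
}

/* Walk the list by index. The same loop shape would work on any device
   that is handed the base address of the pool. */
static float sum_list(const OffsetNode *pool, uint32_t head)
{
    float total = 0.0f;
    for (uint32_t i = head; i != NIL; i = pool[i].next)
        total += pool[i].value;
    return total;
}
```

Because the structure is built in index form from the start, no separate translation pass is needed before handing the buffer to the device. Whether this matches the OpenCL 2.0 timings in practice is something only a measurement can settle.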
I'm a little confused about SVM: how is it different from using an old-style buffer with a host memory pointer? In my experience, changes to this kind of memory are just as visible between the CPU and GPU, without copying or even mapping the buffer(!), as the blog post describes for SVM. Is it about the inner struct pointers being valid, too?
Thanks for the feedback. While we cannot share the source for the OpenCL 1.2 version, you can easily write it and test it yourself. However, the main point is that the real penalty is in translating the data structures from pointers to indices, as can be seen in the table. Even if we optimize the transfer time, that may not help much with the translation time.
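For readers wondering what the pointer-to-index translation involves, the following is a rough sketch in plain C, not the code behind the table. The `PtrNode` and `IdxNode` types and the `flatten_list` function are hypothetical names chosen for illustration; a linked list stands in for whatever structure the real sample uses.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sentinel marking "no next node". */
#define NIL UINT32_MAX

/* Host-side structure linked by raw pointers. */
typedef struct PtrNode {
    float           value;
    struct PtrNode *next;
} PtrNode;

/* Device-friendly copy linked by array indices. */
typedef struct {
    float    value;
    uint32_t next;   /* index of the next node, or NIL */
} IdxNode;

/* Copy a pointer-linked list into 'out' (capacity 'cap'), rewriting each
   link as an index. Nodes are written in traversal order, so the node at
   position n links to position n + 1. Returns the number of nodes
   written, or (size_t)-1 if 'out' is too small. */
static size_t flatten_list(const PtrNode *head, IdxNode *out, size_t cap)
{
    size_t n = 0;
    for (const PtrNode *p = head; p != NULL; p = p->next) {
        if (n == cap)
            return (size_t)-1;
        out[n].value = p->value;
        out[n].next  = p->next ? (uint32_t)(n + 1) : NIL;
        ++n;
    }
    return n;
}
```

This pass touches every node and has to be repeated whenever the host-side structure changes, which is the cost that remains even when the transfer itself is free.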
Perhaps in the table you could explicitly label what the OpenCL 1.2 timings actually are, i.e. kernel execution time, data structure translation time, transfer times. As they stand it's just implied by the text.
My takeaways were:
a) The inner structure pointers are valid across devices.
b) Creating memory that can be mapped in a zero copy way is now simple and explicit rather than fiddly trying to get your buffer creation flags correct, etc.
I have two questions:
QUESTION 1. My system is an A10-7800. Is it supported by OpenCL 2.0?
After installing the driver, I see the following clinfo output:
Number of platforms: 1
Platform Version: OpenCL 2.0 AMD-APP (1598.5)
Platform Name: AMD Accelerated Parallel Processing
Platform Vendor: Advanced Micro Devices, Inc.
Platform Extensions: cl_khr_icd cl_amd_event_callback cl_amd_offline_devices
Platform Name: AMD Accelerated Parallel Processing
Number of devices: 2
Platform ID: 0x7f587e13d670
Vendor: Advanced Micro Devices, Inc.
Device OpenCL C version: OpenCL C 2.0
Driver version: 1598.5 (VM)
Device Type: CL_DEVICE_TYPE_CPU
Name: AMD A10-7800 Radeon R7, 12 Compute Cores 4C+8G
Device OpenCL C version: OpenCL C 1.2
Driver version: 1598.5 (sse2,avx,fma4)
That is, the "Device OpenCL C version" of the CPU is still "OpenCL C 1.2".
QUESTION 2. Is it normal that the execution time of my program nearly doubles compared with the OpenCL 1.2 driver (fglrx-14.301.1001)?
My program is an example that runs a parallel reduction on the GPU multiple times.
Regarding your first question: when clinfo reports "OpenCL 2.0" for a device, that device supports OpenCL 2.0. In this case, as you can see, the CPU reports "OpenCL 1.2", which means you cannot use the CPU device for OpenCL 2.0 features.
Regarding your second question, it is not normal for the same application to take twice as long on the 2.0 driver. Can you send us more details (such as host code, kernels, and other configuration)?
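The same check can be done programmatically on the version string a device reports. The sketch below only parses a string of the form "OpenCL C <major>.<minor>", as shown in the clinfo output above. In a real program the string would typically come from `clGetDeviceInfo` with `CL_DEVICE_OPENCL_C_VERSION`; that call is left out here so the example stays self-contained, and both function names are hypothetical.

```c
#include <stdio.h>

/* Parse "OpenCL C <major>.<minor>[ vendor text]" into two integers.
   Returns 1 on success, 0 if the string does not match that form. */
static int parse_opencl_c_version(const char *s, int *major, int *minor)
{
    return sscanf(s, "OpenCL C %d.%d", major, minor) == 2;
}

/* Returns 1 if the reported OpenCL C version is 2.0 or later. */
static int supports_opencl_c_2(const char *s)
{
    int major = 0, minor = 0;
    return parse_opencl_c_version(s, &major, &minor) && major >= 2;
}
```

With the two strings from the output above, `supports_opencl_c_2("OpenCL C 2.0")` returns 1 and `supports_opencl_c_2("OpenCL C 1.2")` returns 0, so an application could use this to choose which device gets the OpenCL 2.0 code path.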
>> which means you cannot use CPU device for OpenCL 2.0 features.
OK, I see. Could you please explain to me the difference between the A10-7850K (which is said to be supported) and the A10-7800 (which apparently isn't)? Do you plan to add support for the A10-7800 at some later point, or does that device fundamentally lack some circuitry that is essential for OpenCL 2.0 features? Why is the TDP of the 7850K so much higher?
Ideally I need the lowest-power OpenCL 2.0 APU. What would you recommend for that?
>> Can you send us more details (like host code, kernels and other configurations) ?
Sure, no problem with that. Where should I send the code?
Update: Our next blog in the OpenCL 2.0 demystified series is ready. It’s on pipes. While not exactly a piping hot feature, pipes serve many useful purposes.
The blog explains how pipes in OpenCL 2.0 can make your code simpler and more readable. As usual, we have insights, code snippets, and complete samples that you can download. Go for it.