Thanks for asking,
I can only make comments about my limited experience with OpenCL on GCN cards coming from CUDA.
Kudos: very good hardware that is easy to optimize for most of the time.
Issues: the OpenCL compiler is not an "optimizing" one - there is no way to control or optimize register usage. Several times I have had a kernel compile to 60 registers on a V7 card and 65 registers on a V6 card. This caused an occupancy difference and a performance difference of 20%, which was sometimes not easy to fix, since one can only guess where the registers are being used. It should not be the developer's job to hunt for these small optimizations after every change - Nvidia solves this with nvcc's maxrregcount option. And in general it always seems like a kernel uses more registers than it "should", judging by the code.
Suggestion: if AMD is serious about GPU compute, it should invest in the compiler and tools to make its GPUs easier to develop for.
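To illustrate why 60 vs. 65 registers can matter so much, here is a rough sketch of the usual GCN occupancy arithmetic. It assumes the counts above are vector registers (VGPRs) and uses the commonly documented GCN figures of 256 VGPRs per lane per SIMD, allocation in blocks of 4, and at most 10 wavefronts per SIMD. Those constants and the function name are my assumptions, not something stated in this thread, and SGPR and LDS limits are ignored.

```python
# Rough GCN occupancy estimate from the VGPR count alone.
# Assumed hardware figures (not from the thread): 256 VGPRs per lane per
# SIMD, allocation granularity of 4, at most 10 wavefronts per SIMD.
# SGPR and LDS limits are deliberately ignored in this sketch.

VGPRS_PER_SIMD = 256
VGPR_GRANULARITY = 4
MAX_WAVES_PER_SIMD = 10


def waves_per_simd(vgprs_used: int) -> int:
    """Wavefronts that fit on one SIMD for a kernel using vgprs_used VGPRs."""
    if not 1 <= vgprs_used <= VGPRS_PER_SIMD:
        raise ValueError("vgprs_used must be between 1 and 256")
    # Round the request up to the next multiple of the allocation granularity.
    allocated = -(-vgprs_used // VGPR_GRANULARITY) * VGPR_GRANULARITY
    return min(MAX_WAVES_PER_SIMD, VGPRS_PER_SIMD // allocated)


for regs in (60, 65):
    print(regs, "VGPRs ->", waves_per_simd(regs), "waves per SIMD")
# 60 VGPRs -> 4 waves per SIMD
# 65 VGPRs -> 3 waves per SIMD
```

Under these assumptions the five extra registers cross the 64-register boundary and cost one wavefront in four per SIMD, which would be consistent with the kind of performance gap described above.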
Thanks for the feedback. I cannot confirm/discuss future roadmaps and what we may/may not do with respect to investing in tools. But this is most excellent feedback. The good news is, I know PRECISELY where to send the input on optimizing the compiler. In addition, the coming HSA tools and compiler are open source, so that should help.
Thanks again.
I'd be happy to provide more detailed feedback, along with several other (smaller) compiler issues, to interested parties and spend some time iterating if needed.
Please contact me directly for this.
One of the smaller issues is function inlining - OpenCL "should" be inlining functions, but in fact when I copy/paste the function code into the call site, the register count decreases by a few registers and perf also decreases slightly.
I really like your idea for a "Board Farm". I am currently developing my app on an HD 7700 - it would be very useful to see performance on a newer, more powerful card such as the 290X.
Hi,
Kudos:
Really love how you're pushing low-level APIs, even if they are still not available for public use.
Quite easy to get 50% or so device utilization in OpenCL.
Complaints:
Nigh impossible to go over 50%: Register usage! Already mentioned before, but it's really painful to tune performance when the compiler goes "It's better to save 4 clocks on recomputing this value than use one register less and thus increase occupancy!". A good register rematerialization pass that takes device occupancy into account is a must! Even if the HSA compiler is going to be open source, the backend that does register allocation for GCN is probably going to be closed, so we cannot do this ourselves.
Linux drivers. They always require a kernel/Xorg that is around 9 months old. It's absolutely impossible to keep up to date. Your biggest competitor in the GPU arena is not perfect but still far, far better. And with great Linux drivers they were able to push a new GPU on Android, as it actually uses the same driver stack as their desktop Linux drivers. If you ever want to go into the mobile space you really need to get this fixed. I mean seriously. Plug a GCN-based GPU into your fancy new ARM-based CPUs and voila, you have an Android tablet SoC! Also, whereas your competitor's Linux drivers are around 20% faster in OpenGL than their Windows drivers (due to no WDDM hampering them), your Linux drivers are maybe 20% slower than your Windows drivers.
In addition, to use OpenCL you must have X running, which is kinda ridiculous. How can you tell a customer who spends a few million on a massive FirePro cluster that they have to have an X server running?
Suggestions:
Employ one or two engineers full time to work on just keeping your Linux drivers up to date. Currently it seems to be a separate project, as in "Implement support for latest Ubuntu", which is then completely abandoned until a new iteration of $popular_distro comes out. It's relatively low-intensity work, so you could have a working driver out a few days after a new kernel/Xorg release.
Focus on OpenCL 2.0 really hard. To reach parity with current CUDA one really must have dynamic parallelism and the work_group_reduce/broadcast functions in good shape, using the GCN register shuffles internally.
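On the register complaint above, a small sketch of how little it can take to gain occupancy. It assumes the commonly documented GCN figures (256 VGPRs per lane per SIMD, allocation in blocks of 4, at most 10 wavefronts per SIMD) and looks at VGPRs only. The constants and helper names are hypothetical illustrations, not anything the compiler exposes.

```python
# How many VGPRs a kernel must shed to fit one more wavefront per SIMD.
# Assumed GCN model (not from the thread): 256 VGPRs per lane per SIMD,
# allocation granularity of 4, at most 10 wavefronts per SIMD.
# SGPR and LDS limits are ignored.

VGPRS_PER_SIMD = 256
GRANULARITY = 4
MAX_WAVES = 10


def vgpr_budget(waves: int) -> int:
    """Largest VGPR count that still lets `waves` wavefronts share a SIMD."""
    if not 1 <= waves <= MAX_WAVES:
        raise ValueError("waves must be between 1 and 10")
    return (VGPRS_PER_SIMD // waves) // GRANULARITY * GRANULARITY


def current_waves(vgprs_used: int) -> int:
    """Wavefronts per SIMD for a kernel using vgprs_used VGPRs."""
    if not 1 <= vgprs_used <= VGPRS_PER_SIMD:
        raise ValueError("vgprs_used must be between 1 and 256")
    allocated = -(-vgprs_used // GRANULARITY) * GRANULARITY  # round up to 4
    return min(MAX_WAVES, VGPRS_PER_SIMD // allocated)


def vgprs_to_shed(vgprs_used: int) -> int:
    """VGPRs to remove to gain one wavefront per SIMD (0 if already at 10)."""
    waves = current_waves(vgprs_used)
    if waves == MAX_WAVES:
        return 0
    return vgprs_used - vgpr_budget(waves + 1)


for waves in range(MAX_WAVES, 0, -1):
    print(f"{waves:2d} waves/SIMD: at most {vgpr_budget(waves):3d} VGPRs")

print(vgprs_to_shed(49))  # 1: dropping a single register goes from 4 to 5 waves
```

In this model a kernel sitting just above a threshold, such as 49 VGPRs, is exactly the case where recomputing one value instead of keeping it in a register would buy an extra wavefront, which is the trade-off an occupancy-aware rematerialization pass would have to weigh.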
Hi,
Kudos: The quality of the drivers and OpenCL support has improved in recent years, and there is at least some documentation available!
-- NaN
Captured. Thanks for taking the time. I'm still gathering info, but about to start seeding the various developer teams with this feedback.
Hi,
here is something quick and easy: on the ACML download page, the user guide linked there is still for version 5.3, yet for versions >=6.0 an updated user guide is part of the download. Could the updated user guide also be linked on the download page? Also, on the matter of the ACML user guide, it would be helpful if the PDF contained hyperlinks, e.g. for the table of contents and other references.
Thank you
Yeah, that should have already happened. I think we noticed that a week or two ago. Thanks for applying the boot. Let me go ping the people responsible. Kinda fell off my radar after I alerted them to the problem.
And fixed. THANKS!