Archives Discussions

rougered · ‎02-09-2014

Dear list,

I am involved in a "towards exascale" EU-funded project, and I am trying to understand whether HSA can fit our needs.

Essentially, our problem is that we have an object-oriented code, written in C++, and we would like to offload part of it to the GPU. There is a lot of parallelism, but we do not want to give up OO programming to move to something like OpenCL.

As I understand it, what we want to do should be possible on AMD APUs, thanks to HSA technology. What I don't get is HOW TO DO IT. Could you provide some guidance on this?

If this possibility does not exist now, it would be useful to have an ETA for when it could become available, so we can judge whether to plan on using it in our exascale project.

thank you in advance

Riccardo Rossi

Meteorhead · ‎02-10-2014

The quick answer is: there is no way.

The big problem with all advanced GPGPU programming is that there is virtually no high-level language capable of making efficient use of HSA. OpenCL 1.2 does not yet provide enough extensions to make use of it, let alone in a portable way. OpenCL 2.0 is far away, and it is still OpenCL C, so it is very cumbersome to integrate closely with C++, let alone proper OO C++. C++AMP would be a very good language, but because it is designed so that it must be implementable using DirectX ComputeShaders, it carries with it all the restrictions and dumbness of DirectCompute; more precisely, it knows nothing about Shared Virtual Memory. (My big hope is that at Build 2014 MS unveils its new DirectX 12 in response to Mantle, that it will have SVM, and that a new version of C++AMP will ease most of the restrictions.)

So the longer answer is this. Right now, there is no way to make use of it, and doing so in an OO manner is years away. OpenCL 2.0 is not even out yet in its first form. The point at which a new kernel language becomes available, some dumb C++ variant... that is a long way off. If DirectX 12 together with C++AMP were to take a big leap, even that could debut in Visual Studio 2014 (if that's the name) at the earliest, and not before.

The best chance is that Bolt is going to make use of it in a tricky and pretty much 'black box' way that relies on macro magic, but I would not bet on that either. To get OO through to OpenCL, you are best off generating the kernel code yourself: use Expression Templates on the host side to concatenate operations and, in the last phase, generate one kernel out of them.
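As a rough illustration of the expression-template idea mentioned above, here is a minimal sketch in plain C++11. Everything in it is hypothetical and not taken from Bolt or any other library: the names `Var`, `BinOp` and `make_kernel` are made up, only element-wise `+` and `*` on `float` buffers are handled, and the output argument is assumed to be called `out` (so no input may use that name). The host-side operators do not compute anything; they only record the expression in a type, and the last step turns the whole expression into one fused OpenCL C kernel string.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <type_traits>
#include <utility>

// Tag base class so the overloaded operators only match our expression types.
struct ExprTag {};

// Leaf: refers to one input buffer, read at the work-item's index i.
struct Var : ExprTag {
    std::string name;
    explicit Var(std::string n) : name(std::move(n)) {}
    std::string code() const { return name + "[i]"; }
    void collect(std::set<std::string>& names) const { names.insert(name); }
};

// Interior node: records "lhs op rhs" in its type instead of evaluating it.
template <typename L, typename R>
struct BinOp : ExprTag {
    L lhs;
    R rhs;
    char op;
    BinOp(L l, R r, char o) : lhs(std::move(l)), rhs(std::move(r)), op(o) {}
    std::string code() const {
        return "(" + lhs.code() + " " + op + " " + rhs.code() + ")";
    }
    void collect(std::set<std::string>& names) const {
        lhs.collect(names);
        rhs.collect(names);
    }
};

// Enabled only when both operands derive from ExprTag.
template <typename L, typename R>
using EnableIfExpr = typename std::enable_if<
    std::is_base_of<ExprTag, L>::value && std::is_base_of<ExprTag, R>::value,
    int>::type;

template <typename L, typename R, EnableIfExpr<L, R> = 0>
BinOp<L, R> operator+(const L& l, const R& r) { return BinOp<L, R>(l, r, '+'); }

template <typename L, typename R, EnableIfExpr<L, R> = 0>
BinOp<L, R> operator*(const L& l, const R& r) { return BinOp<L, R>(l, r, '*'); }

// Last phase: emit one OpenCL C kernel for the whole expression.
// Assumption: all buffers are float and the result buffer is named "out".
template <typename E>
std::string make_kernel(const std::string& kernel_name, const E& expr) {
    assert(!kernel_name.empty());
    std::set<std::string> inputs;  // sorted and de-duplicated input names
    expr.collect(inputs);

    std::string src = "__kernel void " + kernel_name + "(";
    for (const std::string& in : inputs) {
        src += "__global const float* " + in + ", ";
    }
    src += "__global float* out) {\n";
    src += "    size_t i = get_global_id(0);\n";
    src += "    out[i] = " + expr.code() + ";\n";
    src += "}\n";
    return src;
}
```

With `Var a("a"), b("b");`, calling `make_kernel("fused", a * b + a)` yields a single kernel whose body assigns `((a[i] * b[i]) + a[i])` to `out[i]`. The string would still have to be handed to the OpenCL runtime for compilation, which is left out here.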

So either way, it is multiple years away before you can use C++ to program HSA.

rougered · ‎02-11-2014

Dear Meteorhead,

First of all, thanks for your answer; it is definitely useful.

Just as a comment, even OpenCL 2.0 or C++AMP would not fit what we are looking for, since they are heavily oriented towards working with arrays and primitive data structures. OpenMP would do, and I know there exists a gcc/HSA project, but if you tell me it is far away then I'll take that into account.

Having said this... should I then conclude that all of the HSA stuff is only hype, at least in the short term? Is there anyone from AMD who could comment on this?

thx

Riccardo

Meteorhead · ‎02-11-2014

The biggest strength of HSA is also its biggest weakness, namely that it is not bound to any language. It defines services and capabilities at an intermediate level, and makes no assumptions about how they are implemented in HW or what language the source originates from.

Even if there is an HSA backend to GCC (as very soon there will be one to Clang), pure C++ does not let you express memory spaces, nor does it have the notion of GPU threads or thread groups. Even if it were to make the default assumption that all memory allocation is shared virtual memory, and that stack variables are thread local (aka __private), there is no notion of __local in C++. OpenMP is no good in this regard either. Even if you could launch GPU threads, there is no way to express __local (shared) memory. You might want to take a look at OpenACC (I have not read through the docs), which NVIDIA co-developed and which is being implemented in GCC, so it might make use of HSA properly (funny how things align). I am not familiar with OpenACC's memory address qualifiers.

I still believe C++AMP could be the best bet, if it gets a major overhaul, as parallel_for_each lambdas can virtually capture anything visible at the given scope. If restrictions to being amp compatible were lifted (all of them), you'd get exactly what you want.

Input or insights from others are welcome, though, as I would like to know more as well.

geal · ‎03-16-2014

"The quick answer is: there is no way. The big problem with all advanced GPGPU programming is that there is virtually no high-level language, that is capable of making efficient use of HSA."

Is this actually true? There are, for example, many implementations of Lisp that allow very tight integration with their runtimes, so you can generate code on the fly and run it. After that, it's only a matter of writing a sufficiently large set of macros (especially compiler macros!) to perform the code transformations. Oh, and the code generator, but that should be vastly easier than writing an x86 one, given the rather clean design of the whole HSAIL thing.

(I can also see the APL/A/J/K people churning out HSA-enabled implementations of their languages fairly fast, given the overall design of these, but that's really not my area.)

Meteorhead · ‎03-18-2014

Of course. Everything is possible if you write the compiler yourself. Even my cookbook is HSA enabled if I write a compiler myself. What I meant was that there is no widely available compiler that I know of that actually makes use of HSA and gives direct control to the developer, nor is there such a language. But if you know of one, just paste the link and we'll all be a little more informed.

rougered · ‎02-12-2014

Hello again,

This is just to say that OpenMP 4.0 should also run on GPUs...

Riccardo

Meteorhead · ‎03-10-2014

I have read the specs of OpenMP 4.0, and aside from the fact that there are no implementations yet, it also cannot account for SVM, as far as I understood. Please correct me if I'm wrong. It can use raw pointers as device data, when declared in a device-sensitive context, but has no means of automatically sharing things between host and device. In fact, this becomes clear on the very first pages of the specs, which state that it strictly follows the fork-join model and is asymmetric with regard to host and device, mainly focusing on the pattern where the host offloads computations to the device.
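To make the host-offloads-to-device pattern described above concrete, here is a minimal sketch of an OpenMP 4.0 `target` region. The function name `saxpy_offload` is made up for illustration. Note that the `std::vector` objects themselves are not mapped; only raw pointers with explicit array sections appear in the `map` clauses, which is the limitation being discussed. Assumption: if the compiler has no OpenMP or no offload support, the pragmas are ignored or the region falls back to the host, so the loop still produces the same result.

```cpp
#include <cassert>
#include <vector>

// y = alpha * x + y, written as an OpenMP 4.0 target region.
// The data movement is spelled out by hand: x is copied to the device,
// y is copied to the device and back again when the region ends.
void saxpy_offload(float alpha, const std::vector<float>& x, std::vector<float>& y) {
    assert(x.size() == y.size());
    const int n = static_cast<int>(x.size());

    // Only raw pointers plus array sections can be named in map clauses here.
    const float* xp = x.data();
    float* yp = y.data();

    #pragma omp target map(to: xp[0:n]) map(tofrom: yp[0:n])
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        yp[i] = alpha * xp[i] + yp[i];
    }
}
```

The host forks the work, waits at the end of the `target` region, and gets `y` back. Nothing in the code says that host and device could simply share the same memory.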

The core of heterogeneous programming is just the opposite of this design, and OpenMP 4.0 again will not be able to make efficient use of HSA, because once more it does not let you express in code what the HW is capable of.

rougered · ‎03-14-2014

Dear Meteorhead,

I am not an expert myself; nevertheless, to my understanding, as far as accelerators are concerned OpenMP 4.0 should be a sort of super-standardization of OpenACC.

There is definitely some work going on to get OpenMP running on HSA (see e.g. the GCC development list thread "OpenACC support in 4.9", or the OpenMP news item "AMD Enables Server APU Software to Reimagine the Server").

The big question is whether this will work with the APUs. Again, my understanding of hUMA, which should be a part of HSA, is that the memory is the same for CPU and GPU, so no transfer at all should be needed.

Next week I will go to a meeting where there will be some compiler experts. I'll definitely try to ask my questions there and maybe report back in this forum.

Nevertheless, I really would like someone from AMD to actually say a word on this. I think that my questions are fair, and that the answer would be of interest... AMD really has a game changer if this becomes truly available!!

Riccardo

how to actually use hsa