As I understand it, this is not possible from OpenCL. Although some success has been achieved in the thread you mentioned, it is not an officially supported feature. You could share your experiences with it in that first thread. If a lot of people show interest in this, AMD can certainly consider including it in its future plans.
Yes, that's what I thought. That leads me to the next question. I'm quite interested in the HSA specs; HSA is supposed to support function calls and a load of interesting new features. Still, it's low level, so is there any plan to build OpenCL on top of it? If so, being able to use HSAIL inside OpenCL, like inline AMDIL, would be great.
As far as I know, there are many problems with achieving true function calls, not only from OpenCL but from AMD_IL as well:
- On the VLIW architecture (below HD7700) there is no GOTO instruction, only Loop, If/Else and Exit.
- The AMD_IL compiler has an optimizer inside it which likes to work on totally unrolled code. So I think that even on the GCN architecture (which has a true GOTO) it will unroll all your CALLs and then do the optimizations on the whole thing.
- Another thing is exchanging variable values between OpenCL and AMD_IL. Unfortunately there's no such thing (as far as I currently know). You can only insert your AMD_IL text into the stream without knowing which OpenCL variable is in which register.
A year ago I played with recursion. I managed to do a recursive Fibonacci with the S ALU, not at a high level such as AMD_IL or HSA (when it arrives), but at the lowest GCN asm level.
Oh, and here comes another problem:
- On GCN you have to drive 2 processors in one instruction stream: Scalar and Vector. In OpenCL and in AMD_IL you can't reach the S ALU, with which you could, for example, jump to a 64-bit physical address in GPU memory (and it can do much more).
Recursion is possible with the S ALU. You can even make a small stack for return addresses and passed parameters in the registers, because you can access the registers indirectly with s_movrel.
I'd say that on the GCN architecture it's possible to do all the complicated things that an IA-64 processor can do. A GCN chip is like an IA-64 with 2048-bit SSE and 10..32 cores, but it comes with a very well designed instruction set.
Thank you for that valuable information.
I had a small success with recursion myself without using AMD_IL.
It was based on a branching table (emulated with a switch, because no goto is allowed in OpenCL) and a do-while loop that ran until the call stack was empty,
pushing state when a recursive call was made and popping it when the call returned.
It worked well (I implemented a Loop subdivision algorithm). It was still a big mess: a switch case couldn't be nested inside a loop (no interleaving of switch and loop), so no recursive calls could be made from inside a loop.
I now know that recursion in pure OpenCL is not really possible. When I read the AMD_IL and HSA specs, they clearly say there is a call stack mechanism (registers are saved when a subroutine is called).
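For anyone who wants to try the same trick, here is a minimal sketch of the switch-plus-explicit-stack approach in plain C. It is not the Loop subdivision code; it uses Fibonacci only because it is short. The names Frame, MAX_DEPTH, push_call and fib_emulated are made up for the illustration, and the fixed stack size is an assumption (a kernel would need a similar compile-time limit).

```c
#include <assert.h>

#define MAX_DEPTH 64  /* hypothetical fixed stack size, chosen for this sketch */

/* One saved activation record: where to resume, plus the "locals". */
typedef struct {
    int resume;   /* branch-table index: 0 = entry, 1 = after 1st call, 2 = after 2nd call */
    int n;        /* argument of this emulated call */
    int partial;  /* result of the first recursive call */
} Frame;

/* Push a new emulated call onto the explicit stack. */
static void push_call(Frame *stack, int *sp, int n)
{
    assert(*sp < MAX_DEPTH);      /* no dynamic growth: overflow is a hard error */
    stack[*sp].resume = 0;
    stack[*sp].n = n;
    stack[*sp].partial = 0;
    (*sp)++;
}

/* fib(n) computed without any recursive C call. */
int fib_emulated(int n)
{
    Frame stack[MAX_DEPTH];
    int sp = 0;    /* number of frames currently on the stack */
    int ret = 0;   /* "return register" shared by all emulated calls */

    push_call(stack, &sp, n);

    do {
        Frame *f = &stack[sp - 1];       /* frame of the call being executed */
        switch (f->resume) {             /* the emulated branching table */
        case 0:                          /* function entry */
            if (f->n < 2) {              /* base case: return n */
                ret = f->n;
                sp--;
            } else {                     /* "call" fib(n - 1), resume at case 1 */
                f->resume = 1;
                push_call(stack, &sp, f->n - 1);
            }
            break;
        case 1:                          /* back from fib(n - 1) */
            f->partial = ret;
            f->resume = 2;
            push_call(stack, &sp, f->n - 2);
            break;
        case 2:                          /* back from fib(n - 2): return the sum */
            ret = f->partial + ret;
            sp--;
            break;
        }
    } while (sp > 0);                    /* loop until the call stack is empty */

    return ret;
}
```

Every recursive call site becomes one extra case label, which is why the approach gets messy quickly once a call has to sit inside a loop.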
- Why isn't that feature present in OpenCL? Is it because other vendors in the OpenCL consortium cannot provide it?
Some subsidiary questions:
- Are we going to see goto in OpenCL, so that we could implement a call stack?
- What happens if gotos inside a wave are divergent (in GCN)?
- Will we see greater assembler/OpenCL interoperability (I know it breaks OpenCL's cross-device principle, but it would still be a great feature)?
"What happen if goto inside a wave are divergent ( in GCN ) ?"
You can mask out lanes with the exec register and then jump (if any bit in the exec register is 1). The jumps are controlled by the S ALU, so all the lanes in the wavefront will go.
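To make the exec-mask idea concrete, here is a rough CPU-side simulation in C. It is only a model of the behaviour described above, not GCN code: the 64-bit variable exec stands in for the exec register, and wave_branch, the threshold test and the "multiply by 2" body are all invented for the example.

```c
#include <assert.h>
#include <stdint.h>

#define WAVE_SIZE 64  /* lanes per wavefront */

/* Simulates one wavefront hitting "if (in > threshold) out = in * 2".
 * Returns the number of lanes that actually executed the body. */
int wave_branch(const int in[WAVE_SIZE], int out[WAVE_SIZE], int threshold)
{
    /* Each lane evaluates the condition; the results form the exec mask. */
    uint64_t exec = 0;
    for (int lane = 0; lane < WAVE_SIZE; lane++) {
        if (in[lane] > threshold)
            exec |= (uint64_t)1 << lane;
    }

    /* The branch is one scalar decision for the whole wave:
     * if no lane is active, every lane skips the block together. */
    if (exec == 0)
        return 0;

    /* Otherwise the whole wave enters the block, but only lanes whose
     * exec bit is set commit results; the others are masked out. */
    int active = 0;
    for (int lane = 0; lane < WAVE_SIZE; lane++) {
        if ((exec >> lane) & 1) {
            out[lane] = in[lane] * 2;
            active++;
        }
    }
    assert(active > 0);  /* exec != 0 implies at least one active lane */
    return active;
}
```

The point of the model is that the jump itself never diverges: there is one program counter per wave, and divergence is expressed only through which lanes are enabled.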
IMO things aren't moving towards assembly. The companies would rather improve the things that make programming GPUs easier for a wide range of developers.
If OpenCL had GOTO and function pointers, the whole OpenCL -> ISA path would have to change drastically. It is way too compatible with VLIW now (where only IF/ELSE/WHILE/BREAK/EXIT exist).
And as it's a wavefront goto, its implementation in OpenCL wouldn't be that straightforward.
Here's a small example of what you can do with advanced GCN stuff. I had a job where I found indirect jumps really effective: I had a 32-element circular buffer, and the main loop worked on that. In order to place the array in registers I had to unroll the loop 32x.
In every loop step I checked a criterion, and when it was met (statistically around 1/100 of the time) a more complicated function had to be called to process the data further. (An alternative would be to make 2 kernels for it, but the data transfer between them would be too high.)
The 32x unroll with the complicated function inlined didn't fit in the instruction cache, so I placed the complicated function into a subroutine, and when the checked criterion evaluated true I jumped out of the loop. There were 32 jump destinations, one for each of the 32 unrolled loop steps, so I was able to save the unrolled step index into a register; I used this later to calculate the return address. Then I put the task into a small queue in the LDS. Writing the LDS was very fast, as you can write 2*64 bits in a single instruction. When the count of tasks reached 64, I launched the complicated function on all 64 workitems (maximum V ALU utilization). This way it happened only 1/100/64 of the time. When it was done, I used the s_setpc instruction to jump back into the main loop and continued the 32x unrolled loop right where it was 'interrupted'.
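The batching part of this can be sketched on the CPU in plain C. This is only a model of the queueing idea, not the GCN code: TaskQueue, is_rare, expensive, flush and process are hypothetical names, the criterion and the "complicated function" are placeholders, and draining the leftover tasks at the end is an assumption that the description above does not cover.

```c
#include <assert.h>
#include <stddef.h>

#define WAVE_SIZE 64  /* one task per workitem in a wavefront */

typedef struct {
    int tasks[WAVE_SIZE];  /* stands in for the small queue in the LDS */
    int count;             /* tasks currently queued */
    int launches;          /* how many times the expensive path ran */
    long long total;       /* accumulated results, just to have an output */
} TaskQueue;

/* Placeholder for the rare criterion (about 1 in 100 items). */
static int is_rare(int x) { return x % 100 == 0; }

/* Placeholder for the complicated function. */
static long long expensive(int x) { return (long long)x * x; }

/* Run the expensive function on everything queued, as one batch. */
static void flush(TaskQueue *q)
{
    if (q->count == 0)
        return;
    for (int i = 0; i < q->count; i++)
        q->total += expensive(q->tasks[i]);
    q->count = 0;
    q->launches++;
}

/* Main loop: the cheap check runs on every item, the expensive work is
 * deferred until a full wavefront's worth of tasks is available. */
void process(const int *data, size_t n, TaskQueue *q)
{
    for (size_t i = 0; i < n; i++) {
        if (is_rare(data[i])) {
            assert(q->count < WAVE_SIZE);
            q->tasks[q->count++] = data[i];
            if (q->count == WAVE_SIZE)
                flush(q);   /* all 64 "workitems" busy at once */
        }
    }
    flush(q);  /* assumption: drain whatever is left at the end */
}
```

With a 1/100 hit rate the expensive path is entered roughly once per 6400 items, which matches the 1/100/64 figure above.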