Does AMD understand that the SDK should give programmers a way to get the most out of their hardware?

Discussion created by adm271828 on Apr 23, 2011
Latest reply on Apr 28, 2011 by adm271828

I fully understand the strategy of sticking to the OpenCL standard, which might be somewhat limiting.

I even understand the decision to drop support for IL in the future. Some months ago, my own analysis was that IL was ill-positioned: too low-level compared to OpenCL, and too high-level to be used as an assembly language for optimizing portions of code that OpenCL cannot optimize. The problem is that IL has been shown to be useful for getting extra performance. And besides the people who will have to deal with their legacy code written in IL, the real problem is this: if IL is no longer here to provide the extra performance, where will this capability come from?

Here are some examples (based on observations using OpenCL with SDK 2.4, driver 11.3, Ubuntu 10.04, 64 bits):

- popcnt: at least the name of an extension is there, but there is no documentation. The good news is that, even if undocumented, it seems to work in SDK 2.4: popcnt issues a BCNT_INT opcode.

- clz(uint): maps to an FFBH_UINT opcode. Good, but how inefficiently! It seems the implementation wants to return 32 if the argument is zero. Why not (the standard says nothing about this special case), but it takes 3 ISA instructions and, worse, the data dependency chain has length 3! It could at least be only 2, by testing the initial argument against zero instead of testing the result of FFBH_UINT against -1.

What I want to see is a native_ffbh instruction (or call it whatever you want) that returns the result of FFBH_UINT unmodified (gcc has a __builtin_clz whose result is documented as undefined if the argument is zero, which is perfectly suited).
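To make the dependency-chain argument concrete, here is a rough sketch in plain C. It is only a software model, not what the compiler actually emits: ffbh_uint_model, clz_test_result, clz_test_arg and native_ffbh_model are hypothetical names, and the model assumes (as described above) that FFBH_UINT yields the leading-zero count, or -1 for a zero input.

```c
#include <stdint.h>

/* Software model of FFBH_UINT as described in this post (an assumption):
   number of leading zero bits, or 0xFFFFFFFF (-1) when the input is zero. */
static inline uint32_t ffbh_uint_model(uint32_t x)
{
    uint32_t n = 0u;
    if (x == 0u)
        return UINT32_MAX;
    while ((x & 0x80000000u) == 0u) {
        x <<= 1;
        ++n;
    }
    return n;
}

/* Variant A: test the RESULT against -1.
   The compare must wait for FFBH, so the chain is FFBH -> compare -> select. */
static inline uint32_t clz_test_result(uint32_t x)
{
    uint32_t r = ffbh_uint_model(x);   /* step 1 */
    int none = (r == UINT32_MAX);      /* step 2: depends on r */
    return none ? 32u : r;             /* step 3: depends on none */
}

/* Variant B: test the ARGUMENT against zero.
   The compare does not depend on FFBH, so both can be issued together
   and the chain shrinks to (FFBH, compare) -> select. */
static inline uint32_t clz_test_arg(uint32_t x)
{
    uint32_t r = ffbh_uint_model(x);   /* step 1 */
    int zero = (x == 0u);              /* step 1 as well: independent of r */
    return zero ? 32u : r;             /* step 2 */
}

/* Hypothetical native_ffbh: the raw result, no fix-up for zero. */
static inline uint32_t native_ffbh_model(uint32_t x)
{
    return ffbh_uint_model(x);
}
```

Both checked variants return the same values for every input; only the shape of the dependency chain differs. The native variant is for callers who already know the argument is nonzero.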

- uint x = y >> z; ... I was horrified to discover that it generates 2 ISA instructions: first a z &= 31, then the LSHR. Isn't the z &= 31 already performed by the LSHR opcode itself (as explained in the Evergreen instruction set reference manual)? And what if I, as a programmer, guarantee that z is in the 0..31 range? Must I still pay the cost of the extra instruction?

OpenCL claims to be better than C99 here, because shifting by a value greater than 31 is fully specified. Well, if it costs me an extra instruction, I'm not sure it is better...
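The two shift semantics can be written side by side in plain C. This is only an illustrative sketch: shr_masked and shr_in_range are hypothetical names, and the second one relies on the caller's guarantee, since in C a shift count of 32 or more on a 32-bit value is undefined.

```c
#include <stdint.h>

/* OpenCL-style semantics for a 32-bit operand: only the low 5 bits of the
   shift count are used, so any count is well defined. Written out like
   this, it costs an AND in addition to the shift. */
static inline uint32_t shr_masked(uint32_t y, uint32_t z)
{
    return y >> (z & 31u);
}

/* What I would like to be able to express: the caller promises
   0 <= z <= 31, so no mask is needed. Precondition: z < 32. */
static inline uint32_t shr_in_range(uint32_t y, uint32_t z)
{
    return y >> z;
}
```

For every z in 0..31 the two functions agree, which is exactly why the extra AND is wasted whenever the range is already known.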

- bitselect still unmapped to the corresponding instruction... Is this really difficult?

- find lsb not mapped to FFBL opcode...
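For reference, here is what these last two operations compute, as a plain C sketch. The function names are hypothetical, clz_model stands in for the OpenCL clz builtin, and find_lsb_via_clz is just one way to get a lowest-set-bit index when no direct instruction is exposed; I am not claiming this is the sequence the compiler generates.

```c
#include <stdint.h>

/* bitselect(a, b, c) as OpenCL defines it: each result bit comes from a
   where the corresponding bit of c is 0, and from b where it is 1. */
static inline uint32_t bitselect_model(uint32_t a, uint32_t b, uint32_t c)
{
    return (a & ~c) | (b & c);
}

/* Stand-in for clz(uint), returning 32 for a zero argument. */
static inline uint32_t clz_model(uint32_t x)
{
    uint32_t n = 0u;
    if (x == 0u)
        return 32u;
    while ((x & 0x80000000u) == 0u) {
        x <<= 1;
        ++n;
    }
    return n;
}

/* Index of the lowest set bit, built from clz.
   Precondition: x != 0 (the zero case is deliberately left out). */
static inline uint32_t find_lsb_via_clz(uint32_t x)
{
    uint32_t lowest = x & (0u - x);    /* isolate the lowest set bit */
    return 31u - clz_model(lowest);    /* convert MSB-based count to an index */
}
```

Each of these is several operations in source form, which is the point: a single hardware instruction would do the same work.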

I recently wrote a kernel with a performance-critical loop that should have taken 20 clock cycles. When I looked at the ISA code, I discovered that 5 clock cycles had been added for nothing, just because of extra unnecessary instructions like the ones above... And the extra data dependencies prevented optimal instruction packing as well.

So here is another wish (to be added to the long list in another thread): provide, as soon as possible, an AMD-specific extension to OpenCL that maps all the RV-specific instructions into the language, without any postprocessing, under native_amd_XXX names. This is probably not difficult to do, and it might make some current IL users less unhappy about the future death of IL.

Sorry to be a little provocative with the title of this post, but I'd like to see an answer different from "well, we don't know, we have no software vision, we have no plan to do this, and maybe it will come in a future release of SDK, but we don't know when...".

Best regards,