
Archives Discussions

adm271828
Journeyman III

Does AMD understand the SDK should give programmers a way to get the most out of their hardware?

I fully understand the strategy of sticking to the OpenCL standard, which might be somewhat limiting.

I even understand the decision to drop support for IL in the future. Some months ago, I came to the conclusion that IL was ill-positioned: too low-level compared to OpenCL, and too high-level to be used as an assembly language to optimize portions of code that OpenCL wouldn't be able to optimize. The problem is that IL has been shown to be useful for getting extra performance. And besides the people who will have to deal with their legacy code written in IL, the true problem is this: if IL is no longer there to provide the extra performance, where will that capability be provided?

Here are some examples (based on observations using OpenCL with SDK 2.4, driver 11.3, Ubuntu 10.04, 64 bits):

- popcnt: at least the name for an extension is here, but there is no documentation. The good news is that, even if undocumented, it seems to work in SDK 2.4: popcnt issues a BCNT_INT opcode.

- clz(uint): maps to an FFBH_UINT opcode. Good, but how inefficiently! It seems the implementation wants to return 32 if the argument is zero. Why not (the standard says nothing about this special case), but it takes 3 ISA instructions and, worse, the data dependency chain has length 3! It could be only 2 by testing the initial argument against zero instead of testing the result of FFBH_UINT against -1.

What I want to see is a native_ffbh instruction (or call it whatever you want) that returns the result of FFBH_UINT unmodified (gcc has a __builtin_clz that says 'result is unspecified if argument is null', which is perfectly suited).

- uint x = y >> z; ... I was horrified to discover that this generates 2 ISA instructions: first a z &= 31, then the LSHR. Is the z &= 31 not performed by the LSHR opcode??? (as explained in the Evergreen instruction set reference manual). And what if I, as a programmer, guarantee that z is in the 0..31 range? Shall I pay the extra instruction cost?

OpenCL says it is better than C99 because shifting by a value greater than 31 is fully specified. Well, if it costs me an extra instruction, I'm not sure it is better...

- bitselect is still not mapped to the corresponding instruction... Is this really difficult?

- find lsb is not mapped to the FFBL opcode...
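To make the clz and bitselect points concrete, here is a small C model. The helper names are made up for illustration, and ffbh_uint is a plain-C stand-in for the FFBH_UINT opcode, not real ISA; this is a sketch of the two possible lowerings, plus the bit-blend that a single BFI-style instruction would compute.

```c
#include <stdint.h>

/* Plain-C model of the FFBH_UINT opcode: position of the highest set bit,
   counted from the MSB; returns -1 for an input of 0. */
static int32_t ffbh_uint(uint32_t x) {
    int32_t n = 0;
    if (x == 0) return -1;
    while (!(x & 0x80000000u)) { x <<= 1; ++n; }
    return n;
}

/* The observed lowering of clz(uint): FFBH, then a compare of the FFBH
   *result* against -1, then a select -- a dependency chain of length 3. */
static uint32_t clz_chain3(uint32_t x) {
    int32_t t = ffbh_uint(x);
    return (t == -1) ? 32u : (uint32_t)t;
}

/* The shorter variant suggested above: test the original argument instead,
   so the compare no longer depends on the FFBH result (chain of length 2). */
static uint32_t clz_chain2(uint32_t x) {
    int32_t t = ffbh_uint(x);
    return (x != 0) ? (uint32_t)t : 32u;
}

/* bitselect(a, b, c): each result bit comes from b where c has a 1, else
   from a -- the blend a single BFI-style instruction performs. */
static uint32_t bitselect_u32(uint32_t a, uint32_t b, uint32_t c) {
    return (a & ~c) | (b & c);
}
```

The point of clz_chain2 is purely the shape of the dependency graph: the x != 0 test can issue in parallel with FFBH instead of waiting on it.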

I recently wrote a kernel with a critical loop (with respect to performance) that should have taken 20 clock cycles. When I looked at the ISA code, I discovered that 5 clock cycles had been added for nothing, just because of unnecessary extra instructions like the ones above... And the extra data dependencies prevented optimal instruction packing as well.

So here is another wish (to be added to the long list in another thread): provide, as soon as possible, a specific AMD extension to OpenCL that maps all the specific RV instructions into the language, without any post-processing, under native_amd_XXX names. This is probably not difficult to do, and it would perhaps make some current IL users less unhappy about the future death of IL.

Sorry to be a little provocative with the title of this post, but I'd like to see an answer different from "well, we don't know, we have no software vision, we have no plan to do this, and maybe it will come in a future release of SDK, but we don't know when...".

Best regards,

     Antoine

0 Likes
15 Replies
nou
Exemplar

I read that the primary goal is achieving correctness in OpenCL (IIRC it was MicahVillmow who said this?), and then optimizing for speed.


Originally posted by: nou I read that the primary goal is achieving correctness in OpenCL (IIRC it was MicahVillmow who said this?), and then optimizing for speed.

Nou,

I'm not asking the OpenCL compiler to generate the best possible code. I'm just asking AMD not to limit programmers by limiting access to hardware instructions. And a way to do that is simply to expose some specific instructions, with direct access, so that we can use them efficiently. Intel did this under the 'intrinsic' name. AMD can do this as an OpenCL vendor-specific extension.

This is not complicated at all (probably less than a few days of work, with almost no risk), and would be an easy quick win for AMD.

Of course general speed optimization can be a long term goal, and I can even understand that achieving correctness in OpenCL might have a higher priority. But there is nothing that prevents doing this quick tactic improvement (exposing interesting instructions), while keeping other goals in mind.
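For comparison, this is roughly what the CPU-side precedent looks like with GCC/Clang builtins (a sketch; the wrapper names are made up, the builtins themselves are standard GCC/Clang, and a 32-bit int is assumed):

```c
#include <stdint.h>

/* CPU precedent for the requested extension: GCC/Clang expose single
   instructions as builtins. __builtin_clz is unspecified for 0, which is
   exactly the "raw hardware result" contract asked for above on the GPU. */
static uint32_t count_leading_zeros(uint32_t x) {
    /* guard the one undefined input ourselves, as a GPU caller would */
    return x ? (uint32_t)__builtin_clz(x) : 32u;
}

static uint32_t popcount32(uint32_t x) {
    return (uint32_t)__builtin_popcount(x);  /* maps to a POPCNT-style op where available */
}
```

The compiler is free to lower each builtin to a single instruction, and the programmer, not the language, decides whether the zero case needs handling.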

Best regards,

Antoine


Originally posted by: adm271828 I'm not asking the OpenCL compiler to generate the best possible code. I'm just asking AMD not to limit programmers by limiting access to hardware instructions. [...]


Of course one needs to substantiate such claims, first of all to know exactly what is missing. Inefficient compilation of existing code is a different matter.

Ideally there should obviously be two ways to program the GPU: both OpenCL and the GPU's own assembler, ideally allowing inline assembler functions, not to be confused with the more restrictive intrinsics.

A big problem of intrinsics compared to assembler is that you only get access to one instruction, with big odds that the compiler still generates inefficient code around it, which is why you would use assembler at that spot in the first place.

For example, an FFT I wrote in 64-bit integer code using some intrinsics in Visual C++ still compiled really inefficiently, as compilers have the habit of generating useless instructions around intrinsics (they are not optimized much for intrinsics - logically, since specint/specfp allows only pure high-level code and no inline assembler or intrinsics).

Now, such functions are always so tiny, and get reused so generically by others (read: cut-and-pasted), that writing them in assembler is not a problem.

So I would really favour inline assembler over intrinsics, as it is mightily more powerful, and we must simply face the fact that you can never write truly high-level code for a GPU anyway: it already requires a deep understanding of parallelism, so you only use these features when you know what you are doing. Inline assembler is far more powerful there.

Now my simple question is of course: is it possible to write programs for the GPUs in assembler right now? If so, which document do I need in order to know how to assemble a working program?

I'm stumbling against the limits of OpenCL. GPUs are completely uninteresting for very generic code, as you need to put in so much effort to get performance; you only use them to get the utmost performance out of the GPU where it is faster than a CPU.

OpenCL maps very well onto GPUs on paper, so let's not give up hope yet. But it sure must deliver low-level performance. I see it as OpenCL being very young and still needing some improvements.

What would of course already help is if one could dump the OpenCL code as assembler and then compile that further, as that would allow removing useless instructions such as the ones you mentioned.

However, soon I'll have to decide whether to continue with OpenCL or switch back to CUDA.

The simple problem I encounter right now in 32 bits is that integer multiplication doesn't output enough bits per cycle per stream core. Yet I realize all too well that a simple addition to OpenCL, or some tricks in the compiler, would solve all that.

Regards,

Vincent


Yes, you're right about inline assembly vs intrinsics, at least if we take into account experience coming from CPUs. I don't know about VC++, but it took 10 years before gcc was able to generate decent code for Intel processors using intrinsics (spurious extra instructions only started to disappear at a reasonable level with version 4.5).

Now the question is whether AMD has enough maturity (and willingness) to provide an inline assembly mechanism shortly. I really don't think so. I asked a few months ago how to directly compile assembly: no answer from AMD, and somebody else answered that there is no known plan. A telling indication is that there apparently was a working assembler for previous GPU models (up to RV6xx?), but I was told development stopped. I also asked about the binary format details: no answer (except one page in the OpenCL guide that basically says it is ELF and gives the names of the sections - I could figure this out myself - but thank you Nou: you at least took some time to post an answer).

The point I want to make is that it is extremely easy to add a few intrinsics right now (it has already been done for float instructions like native_exp2, native_sin, ... so why not finish the work with the few remaining useful instructions?), while providing an inline assembly mechanism would be much more complicated. So, even if the benefit of the former would be smaller than that of the latter, it could at least help us wait.

Regarding the "compile, look at assembly, modify source code, compile again" optimisation method... unfortunately it doesn't work for the examples I gave on my configuration. I found no way to get rid of the extra unnecessary instructions.

And yes, the next question I will ask myself is whether to switch to NVidia + CUDA for my next card. I don't know if their software strategy is really better, but at least I could give it a try.

Best regards,

Antoine


adm271828,
If there are specific cases you find that we do not optimize correctly, please provide small test cases and we will work on improving the generated code. Many times the issue is that we simply have not gotten to optimizing those specific instructions or sequences of instructions yet. A lot of our optimization priority is based on customer feedback about what is causing issues in their applications. For example, there was a request to have BFE generated, so in SDK 2.4 we added optimizations to generate the BFE instruction based on code sequences. In SDK 2.5 we are expanding that, along with other optimizations, to generate BFI instructions. We will also be updating the documentation to show the exact patterns, with examples, that we currently detect.

We are looking at inline assembly in OpenCL, but there is no feedback that I can give on when it might show up, if it ever does. As for the intrinsics, the issue is how to expose them in OpenCL in a manner that is

Thanks Micah for your answer.

Here are some test cases that just demonstrate what I said in my initial message. Since you said BFE patterns had been included in 2.4, I made another test (and BFE is not used).

In these examples the loop is only there to isolate the interesting code. I put the generated ISA plus additional comments inside the code.

As for the need for optimisation, I virtually need direct access to every specific instruction that manipulates integers.

And I'd like to read what you were about to write about intrinsics. It seems your message was posted too fast.

Best regards,

  Antoine


#pragma OPENCL EXTENSION cl_amd_popcnt : enable

__kernel void test1 (__global uint4* data)
{
    const uint guid = get_global_size(0);
    uint x = data[guid].x;
    uint y = data[guid].y;
    for (int k = 0; k < 100000; ++k) {
        // 03 LOOP_DX10 i0 FAIL_JUMP_ADDR(6)
        // 04 ALU_BREAK: ADDR(39) CNT(7)
        //     5  x: AND_INT     ____, R0.z, (0x0000001F, 4.344025239e-44f).x
        // WHY -----> ^^^ this AND is unnecessary:
        // LSHL below is already supposed to clip its second arg to the 0..31 range
        // (Evergreen Instruction Set Manual, Feb 2011, page 9-131)
        //
        //        w: ADD_INT     R0.w, -1, R0.w
        //     6  y: LSHL        ____, R1.x, PV5.x
        //     7  x: XOR_INT     R1.x, R1.x, PV6.y
        //     8  z: ADD_INT     R0.z, R0.z, PV7.x
        //     9  x: PREDNE_INT  ____, R0.w, 0.0f UPDATE_EXEC_MASK UPDATE_PRED
        // 05 ENDLOOP i0 PASS_JUMP_ADDR(4)
        //
        // NB: maybe the documentation for LSHL is incorrect (the one for ASHR_INT is
        // inconsistent, for instance), and maybe LSHL returns 0 if its second arg is > 31.
        // In which case I would LIKE to see an intrinsic that returns the raw result of LSHL.
        x = x ^ (x << y);
        y = y ^ x;
    }
    data[guid].x = x;
}

__kernel void test2 (__global uint4* data)
{
    const uint guid = get_global_size(0);
    uint x = data[guid].x;
    for (int k = 0; k < 100000; ++k) {
        // 02 LOOP_DX10 i0 FAIL_JUMP_ADDR(5)
        // 03 ALU_BREAK: ADDR(38) CNT(8)
        //     4  x: FFBH_UINT   T0.x, R1.x
        //        z: ADD_INT     R0.z, -1, R0.z
        //     5  w: LSHR        ____, PV4.x, (0x0000001F, 4.344025239e-44f).x
        // WHY? -----> ^^^ this extra shift-and-select sequence is unnecessary
        //
        //     6  y: CNDE_INT    R123.y, PV5.w, T0.x, (0x00000020, 4.484155086e-44f).x
        //     7  x: ADD_INT     R1.x, R1.x, PV6.y
        //     8  x: PREDNE_INT  ____, R0.z, 0.0f UPDATE_EXEC_MASK UPDATE_PRED
        // 04 ENDLOOP i0 PASS_JUMP_ADDR(3)
        //
        // Generated code is equivalent to:
        //     tmp0 = FFBH_UINT(x)
        //     tmp1 = (tmp0 >> 31)   will be 0 or 1
        //     result = tmp1 ? 32 : tmp0
        //
        // Should be:
        //     tmp0 = FFBH_UINT(x)
        //     result = x ? tmp0 : 32
        //
        // And an intrinsic that returns the raw result of FFBH_UINT is needed when the
        // programmer handles the specific null case elsewhere (no reason to pay extra cycles).
        x += clz(x);
    }
    data[guid].x = x;
}

__kernel void test3 (__global uint4* data)
{
    const uint guid = get_global_size(0);
    uint x = data[guid].x;
    uint y = data[guid].y;
    uint z = data[guid].z;
    uint w = data[guid].w;
    y &= 0x0f;
    for (int k = 0; k < 100000; ++k) {
        // Is this not a pattern where BFE_INT could be detected (offset=0)??
        //
        // 03 LOOP_DX10 i0 FAIL_JUMP_ADDR(7)
        // 04 ALU_BREAK: ADDR(39) CNT(10)
        //     5  y: AND_INT     T0.y, R2.x, (0x0000000F, 2.101947696e-44f).x
        //        z: ADD_INT     R0.z, -1, R0.z
        //     6  z: SUB_INT     ____, 0.0f, PV5.y
        //     7  x: AND_INT     ____, PV6.z, (0x0000001F, 4.344025239e-44f).x
        // Still unnecessary -----> ^^^
        // (the compiler could even detect that both y and 32-y are already within the 0..31 range)
        //
        //     8  w: LSHL        ____, R1.x, PV7.x
        //     9  y: LSHR        ____, PV8.w, T0.y
        //    10  x: ADD_INT     R1.x, R1.x, PV9.y
        //    11  x: PREDNE_INT  ____, R0.z, 0.0f UPDATE_EXEC_MASK UPDATE_PRED
        // 05 ALU: ADDR(49) CNT(1)
        //    12  x: MOV         R2.x, R1.x
        // 06 ENDLOOP i0 PASS_JUMP_ADDR(4)
        x += (x << (32 - y)) >> y;
        y = x & 0x0f;
    }
    data[guid].x = x;
}


adm271828,
Thanks for providing these examples.

Micah,

I'm sure you can understand that "provide an example, we will optimize it" (though this is a friendly approach, which I thank you for) will not solve the problem. I'm pretty sure I can provide more examples than you will be able to optimize in the near future. BFE is a good example: you said 2.4 had patterns to generate this instruction, and the very first attempt I made was a failure (and it is a special case where one argument is 0).

Basically, what you can do is as follows:

- Improve your compiler (which seems to be your only strategy). This is a good strategy, but a long-term one (keep in mind the time it took CPU compilers to optimize SIMD instructions). And even with very good compilers there are always cases they are unable to optimize.

- Provide inline assembler: also a good strategy, but medium- to long-term. I have no idea whether it is easy to add to your compiler, but I think there are a few issues to deal with. And there is also another problem, which is the learning curve for users.

- Expose specific instructions as intrinsics: we all agree that it would be better to have a perfect high-level compiler, or to have inline assembler, but this solution is easy to implement in the short term. So it is clearly the best one right now.

And the point I want to make is that the three options are not mutually exclusive...

Now, one more word about the "easy" shift instructions, and more specifically ASHR_INT, whose description I said was inconsistent (page 9-56).

The text says: "if src1 > 31, the result is either 0 or -1 (0xFFFFFFFF), depending on the sign of src0". So ASHR_INT with src0=0x80000000 and src1=33 shall give -1. On the other hand, the pseudo-code just below says: "dst = src0 >> (src1 & 0x1F)", in which case the result shall be the same as shifting right by only 1 bit, that is 0xC0000000. Which shall we believe?

If I start to speculate, I would say I trust the first part: we get -1 (I have no time to verify this experimentally right now; you'll probably be able to find this information).

Now go back to OpenCL (1.1, page 370) and we read that OpenCL "defines behavior for oversized shift values [...] For example, if we shift an int4 by 33 bits, OpenCL treats this as a shift by 33%32 = 1 bit". Well, now I start to understand why the generated code uses a mask instruction with 0x1F beforehand, making every shift whose count is not known at compile time a 2-cycle operation. This is confirmed by the description of LSHL_INT: "if src1 is > 31, the result is 0" (page 9-130), hence the need to mask src1 for OpenCL compliance. And the question remains for LSHR_INT (page 9-131).

So, you will never (once again, if my speculation is correct) be able to optimize this case away and remove the mask before shifting, even though this mask is useless 99.9% of the time. The VLSI guys have probably been very happy to hear that all their effort to provide a one-clock-cycle instruction is wasted because a high-level language made such a decision...
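A minimal C sketch of the conflict described above, assuming the documented LSHL_INT behaviour (result 0 for shift counts above 31); the helper names are made up:

```c
#include <stdint.h>

/* Plain-C model of the hardware opcode as documented: LSHL_INT yields 0
   when the shift count exceeds 31 (page 9-130). Not real ISA. */
static uint32_t hw_lshl(uint32_t x, uint32_t n) {
    return (n > 31) ? 0u : (x << n);
}

/* OpenCL 1.1 semantics: the shift count is taken modulo the bit width, so
   shifting by 33 behaves like shifting by 1. Getting this from hw_lshl
   forces the compiler to emit the extra AND with 0x1F first -- the very
   instruction the post argues should be avoidable when the count is known
   to be in range. */
static uint32_t opencl_shl(uint32_t x, uint32_t n) {
    return hw_lshl(x, n & 0x1Fu);
}
```

The two functions agree for all counts in 0..31; they diverge only for oversized counts, which is exactly why the mask cannot be optimized away in general.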

But the good news is that you can do something!

Of course changing OpenCL will be difficult, since it is so proud of having defined the behaviour of shifting by too many bits. You could also implement the shift instructions directly in hardware as OpenCL specifies. I'm not sure I would love that, because filling with the implicit outer-most external bit is a more natural semantic (1 from the left when signed, 0 when unsigned or from the right).

And guess what: intrinsics are here, easy to do, with no need to break compliance with OpenCL, and the guys that need efficiency will be happy.

_amd_rv7xx_lshli, _amd_rv7xx_lshri, _amd_rv7xx_ashri, ... and when you do this, don't forget bfe, bfi, bfm, find msb, find lsb, bit reverse, ... and friends. In 2.5???

Best regards,

Antoine


adm271828,
There is a lot of discussion about your second and third options; a decision on doing them hasn't been made, but they haven't been ruled out either. For the first one, we do have a more advanced optimization pass that specifically optimizes for BFE but, as I didn't know until now, it was disabled in 2.4 because of conflicts with OpenGL. This is something we are looking to get fixed in the 2.5 timeframe.

The basic BFE optimization does this pattern match:
(A >> B) & C where B and C are constants.
I.e. this kernel:

kernel void bfe_opt(global int* a)
{
*a = (a[0] >> 4) & 0xFF;
}

Produces this ISA:
; -------- Disassembly --------------------
00 ALU: ADDR(32) CNT(2) KCACHE0(CB1:0-15)
0 x: LSHR R1.x, KC0[0].x, (0x00000002, 2.802596929e-45f).x
01 TEX: ADDR(48) CNT(1)
1 VFETCH R0._x__, R1.x, fc173 MEGA(4)
FETCH_TYPE(NO_INDEX_OFFSET)
02 ALU: ADDR(34) CNT(2)
2 x: BFE_UINT R0.x, R0.y, (0x00000004, 5.605193857e-45f).y, (0x00000008, 1.121038771e-44f).x
03 MEM_RAT_CACHELESS_STORE_RAW: RAT(11)[R1].x___, R0, ARRAY_SIZE(4) MARK VPM
END_OF_PROGRAM
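For reference, a plain-C model (hypothetical helper name) of what BFE_UINT computes, showing why (A >> B) & C with constant B and C can be folded into it:

```c
#include <stdint.h>

/* Plain-C model of BFE_UINT: extract `width` bits starting at bit `offset`.
   The pattern (A >> B) & C matches when C == (1 << width) - 1, letting the
   compiler fold the shift and mask into this single operation. */
static uint32_t bfe_uint(uint32_t src, uint32_t offset, uint32_t width) {
    if (width == 0) return 0u;
    if (width >= 32) return src >> offset;       /* mask would overflow; full field */
    return (src >> offset) & ((1u << width) - 1u);
}
```

In the kernel above, (a[0] >> 4) & 0xFF corresponds to bfe_uint(a[0], 4, 8), which is exactly what the BFE_UINT line in the ISA shows (offset 4, width 8).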


Well, there is still hope...

I think you should ask your customers in some way. And please pass my feedback on to (the group of) people who will make the decision.

As for BFE, and more generally: if the patterns you optimize are only the easy ones with constants... then, sorry, this is useless for me (and probably not only for me).

Regards,

Antoine


Hello,

Lots of projects try to generate specialized OpenCL code. Instead of using lots of layers of (useless) abstraction and code, e.g. shader language -> shader IR -> OpenCL codegen -> OpenCL compiler -> LLVM (AMD IL) -> hardware IL -> hardware, why not simply cut these layers out?

Why not just release the LLVM AMD IL backend specification, with some kind of inline IL support, as a callable LLVM library backend? Every high-performance / heterogeneous computing project tries to develop its own compiler and runtime nowadays - they will love it. Domain-specific language developers (just check the shader language palette) will also love it. And suddenly we could port (almost) every language that supports LLVM to the GPU.

Best regards,

S


S,
It is something I am working on, I just don't have the time to finish.

Hello,

Please let us know what we can do to help you finish your work. I have some experience with llvm.

A direct llvm backend (with inline IL) for the GPU would be the best of both worlds. Such a backend would spark lots of university research on heterogeneous computing, domain-specific languages and application accelerators.

Please keep in mind that AMD GPUs have a lot more computation resources [1], and we would like to map these resources to predictable(!) direct computation without lots of complex abstraction and code-generation layers.

Regards,

S

[1] http://whitepixel.zorinaq.com/


That would be great.

At some point I looked at whether it was possible to generate my own ISA code, but it would have been too time-consuming (at least for free), and I bumped into another issue, which is the binary format. The .text section is easy to understand with the available documentation, but there is a .rodata section that the loader probably uses to load constants, plus some GPU parameters I couldn't find information about.

If AMD provides an LLVM backend, fully packaged to generate a loadable binary file, I will be very happy indeed.

So, intrinsics in 2.5 and the LLVM backend just after. Very good!

Best regards,

Antoine


ScacPyuf6Ob1,
It is just a time issue to finish up the process. Once we finish development for SDK 2.5, I should have more time to focus on it.