meriken
Adept III

90+% Performance Reduction of OpenCL Application with AMD Radeon Software Crimson Edition

With the latest AMD Radeon Software Crimson Edition drivers, I am seeing a 90+% performance reduction in the tripcode generator I developed, and other users have reported the same problem. The following are the relevant links:

http://meriken.ygch.net/programming/merikens-tripcode-engine-english/

https://github.com/meriken/merikens-tripcode-engine

(Please make sure to uncomment "// #define ENGLISH_VERSION" in "/MerikensTripcodeEngine/Source Files/MERIKENsTripcodeEngine.h" when you build this application.)

When I change OpenCL build options in "/MerikensTripcodeEngine/Source Files/OpenCL12.cpp" from "-O1 -cl-mad-enable" to "-O5 -cl-mad-enable", I either get corrupt results (7970/7990) or the same slow speed (290X). I was able to reproduce the problem with the following drivers:

radeon-crimson-15.12-with-dotnet45-win7-64bit

non-whql-64bit-radeon-software-crimson-16.1.1-win10-win8.1-win7-jan30

non-whql-64bit-radeon-software-crimson-16.2-win10-win8.1-win7-feb23

non-whql-64bit-radeon-software-crimson-16.3-win10-win8.1-win7-march9

This application was working just fine up until Catalyst 15.11.1 Beta. My system configuration is as follows:

GPU0: DIAMOND Radeon HD 7990

GPU1: Gigabyte Radeon HD 7990

GPU2: Sapphire Radeon R9-290X

GPU3: Gigabyte Radeon R9-290X

CPU: Intel Core i7-4770

MB: ASUS Maximus VI Extreme

PSU: Corsair AX1200

OS: Microsoft Windows 7 64bit SP1 English

I would really appreciate it if you could work on this issue ASAP.

0 Likes
meriken
Adept III

> I either get corrupt results (7970/7990)

I forgot to mention that you need to specify the "-b 1" option when you run MerikensTripcodeEngine.exe in order to see an error message for this particular problem. The error message would look like this: "A generated tripcode was corrupt."

I have encountered similar bugs in AMD's OpenCL drivers over the last 3 years or so, and I was able to work around them by tweaking either my OpenCL code or the OpenCL build options. This time, however, I have exhausted all available means, and I am rather desperate. If this bug is not fixed, I will either have to rewrite the OpenCL kernels with a GCN assembler (very time consuming) or drop support for AMD in favor of NVIDIA.

0 Likes

Would it be possible to post the kernel(s) where most of the time is being spent?

0 Likes

Sure, of course! This is the kernel in question, which basically performs Bitslice DES:

merikens-tripcode-engine/OpenCL10.cl

The above kernel is executed here:

merikens-tripcode-engine/OpenCL10.cpp

merikens-tripcode-engine/OpenCL12.cpp
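
For readers new to the term: bitslicing stores bit b of many independent inputs in one machine word, so a single AND/OR/XOR evaluates one logic gate for every input at once. Below is a minimal C++ sketch of that idea using a toy 3-input majority gate. It is not a DES S-box and not the code in OpenCL10.cl; `kLanes`, `to_bitslice`, `majority_bitsliced`, and `majority_scalar` are hypothetical names chosen for illustration.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Number of independent instances packed into one 32-bit word (assumption).
constexpr int kLanes = 32;

// Transpose 32 small inputs into bitslice form: slices[b] collects bit b of
// every input, with input number `lane` stored at bit position `lane`.
inline std::array<std::uint32_t, 3> to_bitslice(
    const std::array<std::uint8_t, kLanes>& inputs) {
    std::array<std::uint32_t, 3> slices{};
    for (int lane = 0; lane < kLanes; ++lane) {
        assert(inputs[lane] < 8);  // toy inputs are 3 bits wide
        for (int bit = 0; bit < 3; ++bit) {
            if ((inputs[lane] >> bit) & 1u) {
                slices[bit] |= std::uint32_t{1} << lane;
            }
        }
    }
    return slices;
}

// Toy gate network: majority(a, b, c) evaluated for all 32 lanes at once.
inline std::uint32_t majority_bitsliced(std::uint32_t a, std::uint32_t b,
                                        std::uint32_t c) {
    return (a & b) | (a & c) | (b & c);
}

// Scalar reference for a single 3-bit input, used only for checking.
inline bool majority_scalar(unsigned value) {
    const unsigned ones = (value & 1u) + ((value >> 1) & 1u) + ((value >> 2) & 1u);
    return ones >= 2;
}
```

In a real bitsliced DES kernel each S-box is a much larger network of such gates with many live temporaries, which is presumably why the compiler's register allocation matters so much here.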

0 Likes

There's a lot here to look at. Would you be able to try to narrow down the cause(s)? For example, could you get finer timestamps so we can see in more detail where the timing differences are? Also, you could collect OpenCL profiling information on the kernel execution itself to see how that has changed.
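
As a rough illustration of the finer timestamps suggestion, here is a minimal host-side sketch in C++ that accumulates wall-clock time per named phase with std::chrono. `PhaseTimer` and the phase names are hypothetical and are not part of the tripcode engine. Host timers only give meaningful numbers around calls that block; for the kernel itself, the standard OpenCL route is a queue created with CL_QUEUE_PROFILING_ENABLE and clGetEventProfilingInfo on the kernel's event.

```cpp
#include <cassert>
#include <chrono>
#include <map>
#include <string>

// Accumulates elapsed host time and call counts per named phase.
class PhaseTimer {
public:
    using Clock = std::chrono::steady_clock;

    // Runs `work` and adds its wall-clock duration to `phase`.
    template <typename Work>
    void run(const std::string& phase, Work&& work) {
        assert(!phase.empty());  // phases are looked up by name
        const Clock::time_point start = Clock::now();
        work();
        totals_[phase] += Clock::now() - start;
        ++calls_[phase];
    }

    // Total seconds spent in `phase`, or 0.0 if it never ran.
    double seconds(const std::string& phase) const {
        const auto it = totals_.find(phase);
        return it == totals_.end()
                   ? 0.0
                   : std::chrono::duration<double>(it->second).count();
    }

    // Number of times `phase` was run, or 0 if it never ran.
    int calls(const std::string& phase) const {
        const auto it = calls_.find(phase);
        return it == calls_.end() ? 0 : it->second;
    }

private:
    std::map<std::string, Clock::duration> totals_;
    std::map<std::string, int> calls_;
};
```

Usage would look like `timer.run("read", [&] { /* blocking read call */ });` followed by `timer.seconds("read")`, repeated for each phase of the search loop so the old and new drivers can be compared phase by phase.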

I noticed you're using blocking WriteBuffer, but I don't see any reason to do that.  Also, since you're using a blocking ReadBuffer, there is no need for the Finish.

Finally, it might be interesting to compare results with and without adding "-legacy" to your build options.

0 Likes

Thanks a lot for your valuable input and pointers. I am currently rewriting the kernel with a GCN assembler, and I will definitely look into your suggestions once I am done. At this point, I am pretty sure that the problem is with code generation of the new OpenCL drivers as kernels I wrote in GCN assembly work just fine. I suspect there is a serious issue with VGPR reassignments for each of the 8 S-Boxes in the innermost loop of Bitslice DES. With the old drivers, numvgprs was less than 128, which would hide memory latency, but with the new drivers it is somewhere around 220 IIRC.
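
To put the VGPR numbers in perspective, here is a small C++ sketch of the usual occupancy arithmetic for GCN, assuming 256 VGPRs per lane per SIMD, at most 10 wavefronts per SIMD, and allocation in multiples of 4 registers. Those constants are assumptions taken from public GCN documentation rather than from this thread, and `waves_per_simd` is a hypothetical helper; LDS and SGPR limits are ignored.

```cpp
#include <algorithm>
#include <cassert>

// Assumed GCN limits (not stated in this thread).
constexpr int kVgprsPerSimdLane = 256;    // VGPR budget shared by resident waves
constexpr int kMaxWavesPerSimd = 10;      // hardware cap on resident waves
constexpr int kVgprAllocGranularity = 4;  // VGPRs are allocated in blocks of 4

// Upper bound on resident wavefronts per SIMD imposed by VGPR usage alone.
inline int waves_per_simd(int vgprs_per_work_item) {
    assert(vgprs_per_work_item >= 1 && vgprs_per_work_item <= kVgprsPerSimdLane);
    // Round the request up to the allocation granularity.
    const int allocated = (vgprs_per_work_item + kVgprAllocGranularity - 1) /
                          kVgprAllocGranularity * kVgprAllocGranularity;
    return std::min(kMaxWavesPerSimd, kVgprsPerSimdLane / allocated);
}
```

Under these assumptions, a kernel using 128 VGPRs can keep 2 wavefronts resident per SIMD, while one using around 220 is limited to 1, leaving nothing to switch to while a memory access is pending.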

By the way, I wasn't aware of the "-legacy" build option. What exactly does it do? Is it documented somewhere?

To People at AMD: PLEASE OFFICIALLY SUPPORT GCN ASSEMBLER! The wild ride I had with AMD's unstable OpenCL drivers over the years makes the old ATI CAL IL compiler look much more appealing in comparison. Open-sourcing drivers is definitely a step in the right direction, but I need a reliable assembler to maintain my sanity. I would rather deal with the GCN ISA instead of unstable compilers. Thank you.

I look forward to hearing your results and analysis of the code generation. If code generation is not ideal and you can give us a short fragment of C and ISA demonstrating the problem, then we can probably handle it much more quickly than we could with the full example.

Regarding assembly, we have heard the requests and are moving to provide support. We have significantly increased our investment in the AMDGPU target in the LLVM compiler. As a byproduct of that effort, the "llvm-mc" command can be used to assemble. However, the result is a code object that the HSA runtime can handle directly but the OpenCL runtime cannot consume directly, since it does not carry everything that the OpenCL runtime needs.

0 Likes

I had a chance to look at the disassembled code, and my suspicion was confirmed: there were register spills all over the place inside the S-Boxes. I will try to come up with a short example that demonstrates these issues. There seem to be at least two code generation problems: register reassignment (slow speeds) and missing s_waitcnt instructions (corrupt results).

As for official support for a GCN assembler, I have been experimenting with CLRadeonExtender with pretty decent results, and it would be great if AMD could help or support this project in one way or another. I really like it because it has a cross-platform assembler that can generate OpenCL-compatible binaries from disassembled code.

0 Likes
meriken
Adept III

Just to give an update, I ended up rewriting the OpenCL kernel in question in GCN assembly:

https://github.com/meriken/merikens-tripcode-engine/blob/master/SourceFiles/OpenCL/bin/OpenCL10GCN.a...

Now my application works reliably across different versions of AMD display drivers. Luckily, I was able to cover GCN 1.0/1.1/1.2 with the same kernel at the source code level.

By the way, CLRadeonExtender is a wonderful toolkit with a GCN assembler and disassembler that really should be part of the AMD APP SDK. I cannot recommend it highly enough.

Great job!

Your function calls are awesome! My guess is that if you unrolled everything, it might not fit into the ICache. So is dealing with s_xxxxPC_b64 worth the effort?

What tool did you use to generate the asm sources? Your own code generator, right?

Sorry for asking so many questions. I'm excited because I rarely see cool GCN ASM nowadays.

0 Likes

Thank you. It's actually quite nice to have somebody who understands my code.

> My guess is that if you unrolled everything, it might not fit into the ICache.

That is exactly right. With Bitslice DES, it is crucial that the main loop fits in the instruction cache, no matter which architecture you are dealing with.

> So is dealing with s_xxxxPC_b64 worth the effort?

Absolutely. With these lightweight function calls, the performance gain was around 100% compared to the unrolled version.

> What tool did you use to generate the asm sources?

It's a combination of a custom code generator and CLRadeonExtender.

I started with an OpenCL kernel. I created GCN binaries with CodeXL for GCN 1.0/1.1/1.2, disassembled them with CLRadeonExtender, and used diff to see how the OpenCL compiler handled the three different GCN architectures. Once I had a functional kernel written in GCN assembly, I analysed its register usage, generated an optimized version of the main loop with a custom code generator, and merged it with the disassembled code.

> Sorry for asking so many questions. I'm excited because I rarely see cool GCN ASM nowadays.

Oh, no problem. Your work was my original inspiration, so I am honored to answer your questions.

0 Likes