cancel
Showing results for 
Search instead for 
Did you mean: 

Archives Discussions

madsbuvi
Adept I

Kernel uses less registers since 13.1 - harming performance of my application

Hello!

Since drivers 13.1 and later, the AMD OpenCL compiler has been rather sparse with allocating registers for my code, and as a result there is massive register spilling and about 3x reduction in code performance.

When compiled with 12.11 or older it would use 244 registers, with no spilling. Still enough registers to achieve full utilization of the GPU since with no register spilling my code would be nearly wholly arithmetically bound while having enough wavefronts to keep the GPU occupied.

But when compiled with 13.1 or later it uses only 131, spilling a lot of registers.

Are there any compiler flags i can pass to allow/force the compiler to be more liberal in allocating registers?

My apologies in advance if i've missed any documents specifying this, or if this question has already been answered (i searched but couldn't find any entirely similar questions).

The specific code in question can be found here:

https://github.com/madsbuvi/MTY_CL/blob/master/readme.md

Run thought CodeXL should give a complete .cl file

the loop at lines 246-250 in gpu.cl and the whole of des.cl/sboxes.cl is the relevant runtime-critical section.

performance dropped to ~40 million from 125-130 million hashes / second with the new drivers. Edit: with the card 7850, i forgot to mention.

(The code also broke completely and generated wrong hashes with driver version 13.1, but this has been fixed in the newest beta driver. Mentioning this in case it might be related.)

edit:

http://devgurus.amd.com/message/1286728

seems somewhat relevant in terms of losing performance. But does not seem to be caused by register spilling.

0 Likes
25 Replies
nou
Exemplar

try add

__attribute__((work_group_size_hint(64, 1, 1))) or __attribute__((reqd_work_group_size(64, 1, 1)))

0 Likes

Thank you, but it made no difference.

0 Likes

btw....I know you would have done.. but just asking for sake of it: "After the workgroup hint, you should be spawning 64 workitems per workgroup while launching the kernel". I hope you did that as well.

0 Likes

Yes, i have of course tried this.

0 Likes

Have you marked this as correct answer by accident? I can un-mark if you need.

0 Likes

Haha, yes, sorry about that.

0 Likes
himanshu_gautam
Grandmaster

Thanks for reporting it. I will try to reproduce it at our end. Is the testcase 32-bit or 64-bit. It contains DLLs so i assume you are using Windows. Win7 or Win8?

0 Likes

It is compiled as 32-bit and links to the 32-bit libraries. I am running a 64-bit version of windows 7.

0 Likes

I see your kernel using 32KB of LDS, which results in 5% kernel occupancy. This may not be a good way to use GPUs, unless your algorithm is so restraining. I would recommend you to tweak the algorithm to achieve better kernel occupancy.

I did observed the performance drop and slight decrease in VGPR usage with 13.1 driver as compared to 12.10 driver. I have reported that to OpenCL Compiler Team.

0 Likes

Thank you for the reply but:

This is not an issue since, with 244 registers, the critical loop would be wholly arithmetically bound, and enough threads to fully use the resources available on the gpu still fit in (4 wavefronts per sm).

The code outside this loop could probably make better use of the gpu resources, but they are a fairly small fraction of the runtime so i haven't spent time optimizing this.

0 Likes

That was just a general suggestion.

If your algorithm is enough ALU bound, 4 wavefronts per compute unit may be good enough. Although still I guess 8 wavefronts are required for AMD GPUs.

Thanks for the testcase.

0 Likes

Suggestions are always appreciated!

Yes, section 5.6.1 of the amd guide says 4 is a bare minimum to achieve peak alu throughput, which is what i believe i am achieving. Even if i'm not, decreasing both LDS and register usage to allow higher occupancy causes a slowdown anyway since keeping all those registers is critical to the performance of the critical loop.

Thanks again for response!

Raistmer
Adept II

Well, I see the same for my app too,

With 13.1 app started to cause driver restarts. Comparing  ISA for too long running kernel I found that under 13.1 it uses only 5 registers while on 12.8 (where no driver restart) it uses 12 GPRs:

SQ_PGM_RESOURCES:NUM_GPRS     = 5

vs

SQ_PGM_RESOURCES:NUM_GPRS     = 12

So, register spilling inevitable under 13.1 that slows down kernel in such big degree that it causes driver restarts.

0 Likes
madsbuvi
Adept I

With drivers 13.4 and 13.5 beta the program crashes with exception code c0000005 at the call to clBuildProgram.

Not sure if i should make a separate thread about this. I'll edit / post again if i can figure more precisely what is causing the crash.

0 Likes

btw.. How many compiler options are you passing?

0 Likes

Just "-cl-opt-disable" or "-O0". Disabling this eliminates the crash. In regards to the original questions trying with or without this made no difference. I had disabled optimizations since the code produced with -O0 was not noticeable slower and the program frequently needs to recompile so reducing compilation time is desirable (but not crucial).

I was able to produce a fairly small case that reproduces the crash in KernelAnalyzer 2 (2.1.671) with either "-O0" or "-cl-opt-disable" enabled.

typedef uchar     uint8_t;

typedef ushort     uint16_t;

typedef uint     uint32_t;

typedef ulong     uint64_t;

typedef char    int8_t;

typedef short    int16_t;

typedef int        int32_t;

typedef long    int64_t;

struct WDW

{

  uint64_t w;

  uint16_t xp;

};

inline uint64_t

xpize(uint64_t m, uint32_t xp, uint32_t len)

{

    int i;

    for (i = 0; i < len; i++){

        if (xp & (1 << i))

        {

            uint32_t c = (m >> 6 * i) & 077;

            uint32_t s;

            if (c >= 046)

              s = 26;

            else if (c == 001)

              s = 1;

            else

              continue;

            m -= (uint64_t)s << 6 * i;

        }

    }

    return m;

}

__kernel void crypt25(

                      __global struct WDW * wdw_pool

                                            ){   

   

    int index = get_global_id(0);

    uint64_t word = wdw_pool[index].w;

    uint64_t xi = xpize(index,wdw_pool[index].xp,10);

}

Good news is, without -O0, the code is back to full speed (tested only on beta driver 3.5)

0 Likes

Hi,

I will just report a personnal experience I had, that is alike to yours. I was also working on a very big kernel, using a lot of registers. The code worked perfectly on Nvidia cards, when i tested it on a HD7950, it didn't produce the good result.

Disabling optimizations made it functionnal but really slow. I eventually found that adding volatile qualifier fixed the problem, even if there were no reason to put a volatile (it was just a simple float variable).

Then the code got bigger and the AMD compatibility didn't follow, at the moment it only works with NVidia cards which is quite annoying for an opencl code. I'm currently waiting for a code generation improvement in AMD IL, because I think it's why my kernel isn't working with AMD cards

0 Likes

Hi roger,

Can you explain what you mean by compatibility did not follow. Has the kernel stopped compiling for AMD cards? It will help if you can share some cut-down version of your kernels.

0 Likes

Hi Himanshu, thank you for answering.

Well, i'm sorry but I can't really post the code, it's a really big kernel more than 2000 lines. I wont post cut down version for the moment either, my companie wouldn't allow it anyway.

When I said the compatibility did not follow, I meant, it compiles fine, it just doesn't produce the right result when NVidia cards are able to.

I can give you the followind details :

I used the casting vector type to pointer trick to access a vector type component dynamically.

example :

float4 vec;

float f =     ((float*)&vec);

That kind of code didn't work well with HD7950 card. I replaced it with that trick :

http://developer.amd.com/community/blog/tips-tricks-a-code-snipit/

It fixed couple of problems, from what I remember the vector type to pointer trick worked with float4 that were not inside a structure (???), but maybe it is pure coincidence.

From there it started to work better, and I finally found that adding a volatile on some random float variable made the kernel functional, that's mainly why I think there is something broken in the OpenCL to AMD IL process.

I never had any problems with small and medium size kernels, I still think AMD Opencl is very reliable but it seems there is some kind of glitchs happening when kernels become quite complex, the kernel I encounter difficulties with has 4/5 nested loop, many break and a load of conditionals statements .

The kernel also relies on warp/wavefront lockstep principal... I used 32/64 to define their width, so it should be ok there.

Roger

0 Likes
madsbuvi
Adept I

While the speed is back to normal, the new drivers seems to have other issues.

I've been unable to produce a smaller case of where something is going wrong, but the drivers 13.4 and 13.5 are producing garbage instead of correct hashes.

Attached are snapshots of searching for any hashes containing the string "Green", where testing under 12.10 shows correct output while 13.4 produces garbage (and 13.5 has the same issue).

This problem was also present in 13.1, but not in any of the beta drivers in between the releases of 13.1 and 13.4.

I'm sorry that i can't provide a smaller case. But what i did find was that changing the randomly generated keys to a hardcoded key would produce the correct output (presumably due to constant propagation hiding the actual issue...).

0 Likes

Hi roger,

Thanks for sharing the Tips and Tricks page. You can probably check if you can send the kernel using some private channel. I have added you as my friend on the forum, just check if private messages allow attachments. I would also suggest to start new threads, if your issue topic is different. The actual thread creator is still active here.

Hi madsbuvi,

Can your issue be reproduced using the repository code, you had shared in the beginning? If so, please give some helpful steps to reproduce it. If things were working earlier and are now failing with new driver, it is a very critical issue for us.

0 Likes

Hi

The problem probably exists also in the repository, but i haven't updated it yet since there are still issues and it will crash due to the compiler issue mentioned earlier.

However i've made a smaller and more clear testcase (Attached as testcase.rar). Snapshots of outputs from runs with driver 12.10 and 13.4 are also attached.

This version is completely deterministic and hashes a single set key (instead of random keys), clearly the program is correct in driver 12.10 and broken in 13.4.

As of driver version 13.9 the program and the test case runs correctly if i disable optimizations with "-cl-opt-disable" when calling clBuildProgram (line 69 of testcase.c).

A more gpu agnostic version i've made will run slowly without optimizations, and will force me to create separate versions for amd and nvidia. But this is manageable so my thread can be marked as solved.

0 Likes

Hi again

Sorry but I can't find a way to send you private messages, really well hidden.

0 Likes

roger512 wrote:

Hi again

Sorry but I can't find a way to send you private messages, really well hidden.

Top right, select create, then Direct message. (not used it myself yet)

Claggy