ryta1203
Journeyman III

Increase GPR usage with new SDK and Driver?

I went from Catalyst 10.5 to 10.7 and SDK 2.1 to SDK 2.2 and now all my kernels have horrible performance and the register allocation is approximately DOUBLE!

What happened?

38 Replies
ryta1203
Journeyman III

BTW, has anyone else noticed this? Has it affected anyone else's performance? What am I missing here?

 

BlackScholes example has gone from 16 to 31 GPRs? Is this correct?


Also, this problem seems to only be with the 5870?

For the reported SKA GPR usage, the 4870 is the same or better... ODD.


I'm using the SKA in SDK 2.2 to check kernels written for SDK 2.1, targeting the 5870. It's a mixed bag. Some kernels are reported to have better throughput, some worse, and some which were reported to use 0 scratch registers now use several (I'm seeing 9, 11 and 15) and have reduced throughput.

Another problem:

#pragma OPENCL EXTENSION cl_amd_fp64 : enable

yields the following error:

error: can't enable all OpenCL extensions or unrecognized OpenCL extension
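A defensive pattern that may help while the toolchain sorts this out: guard the pragma with the extension macro the compiler defines for extensions it actually supports, so unsupported builds fail with a clear message instead of the cryptic one above. This is a hypothetical device-code sketch; whether `cl_khr_fp64` or only `cl_amd_fp64` is defined depends on the device and driver.

```c
/* OpenCL C device code: enable doubles only if the compiler defines
   the matching extension macro (it does so for supported extensions). */
#ifdef cl_khr_fp64
#pragma OPENCL EXTENSION cl_khr_fp64 : enable
#elif defined(cl_amd_fp64)
#pragma OPENCL EXTENSION cl_amd_fp64 : enable
#else
#error "No double-precision extension available on this device"
#endif

__kernel void scale(__global double *buf, double factor)
{
    size_t i = get_global_id(0);
    buf[i] *= factor;  /* requires fp64 support */
}
```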


Originally posted by: ryta1203

BTW, has anyone else noticed this? Has it affected anyone else's performance? What am I missing here?


My app has approximately the same performance under the new Catalyst + SDK 2.2.
On some workloads it became ~2% slower, on others even slightly faster.
Also, rebuilding with the new SDK had no effect on speed; the old binary and the new one execute at the same speed under the new SDK/driver.
But maybe my kernels just have no GPR pressure; I didn't check what happened to the GPRs via SKA.

EDIT: BTW, I use an HD4870, so maybe my card just isn't affected.

Originally posted by: Raistmer
Originally posted by: ryta1203 BTW, has anyone else noticed this? Has it affected anyone else's performance? What am I missing here?
My app has approximately the same performance under the new Catalyst + SDK 2.2. On some workloads it became ~2% slower, on others even slightly faster. Also, rebuilding with the new SDK had no effect on speed; the old binary and the new one execute at the same speed under the new SDK/driver. But maybe my kernels just have no GPR pressure; I didn't check what happened to the GPRs via SKA. EDIT: BTW, I use an HD4870, so maybe my card just isn't affected.


Yes, like I said, I'm not seeing a difference on the 4870 as far as GPR allocation is concerned (I haven't checked performance).

It's the 5870 (and probably the entire 58xx series) where my GPR usage has increased dramatically in most kernels.

EDIT: It's a concern for me since one of my kernels has gone from 31 GPRs to 50 GPRs. Simultaneous wavefronts drop from 8 to 4: quite a difference in performance from using a "new" (and assumed "better") driver/SDK.


I've played some more with the SKA. An example of perplexing behaviour:

Start with three kernels, call them kernel_A, kernel_B and kernel_C, which all take the same arguments and perform similar computations. Individually, they use no scratch registers and have similar throughputs; call those thru_A, thru_B and thru_C (MThreads/s).

Now combine them into a single kernel which takes the same arguments, by simply turning their bodies into blocks of the new kernel. Since there are no shared variables between the blocks, I would expect the compiler to treat each block as it treated the original kernel body. I would still expect to see no scratch register usage, and throughput given by 1/(1/thru_A + 1/thru_B + 1/thru_C).

Instead, I now get plenty of scratch register usage and significantly lower throughput than expected.


Originally posted by: Curious cat I've played some more with the SKA. An example of perplexing behaviour:

Start with three kernels, call them kernel_A, kernel_B and kernel_C, which all take the same arguments and perform similar computations. Individually, they use no scratch registers and have similar throughputs; call those thru_A, thru_B and thru_C (MThreads/s).

Now combine them to a single kernel which takes the same arguments, by simply turning their bodies into blocks of the new kernel. Since there are no shared variables between the blocks, I would expect the compiler to treat each block as it treated the original kernel body. I would still expect to see no scratch register usage and throughput given by 1/(1/thru_A + 1/thru_B + 1/thru_C).

Instead, I now get plenty of scratch register usage and significantly lower throughput than expected.

Have you looked at the ISA and played with moving instructions around?

It turns out that simply cascading kernels is not the best way to get good results. I won't get into this much, but it's not too hard to get the merged kernel's register usage down to max(kernA, kernB, kernC); you will, however, need to look at, and possibly move, the code.


Originally posted by: ryta1203Have you looked at the ISA and played with moving instructions around?

No. The point being that if the compiler were behaving reasonably, it would reuse all registers employed in block A when doing block B, and again reuse all registers employed in block B when doing block C. Instead, it's spilling registers. If I were a compiler developer, I would want to understand why; it may well be the same problem causing the increased register use in 2.2 vs 2.1.


Originally posted by: Curious cat
Originally posted by: ryta1203 Have you looked at the ISA and played with moving instructions around?

No. The point being that if the compiler were behaving reasonably, it would reuse all registers employed in block A when doing block B, and again reuse all registers employed in block B when doing block C. Instead, it's spilling registers. If I were a compiler developer, I would want to understand why; it may well be the same problem causing the increased register use in 2.2 vs 2.1.

Curious cat,

        Could you please post your three kernels here so we can see what is going wrong?


Curious cat,

        Could you please post your three kernels here so we can see what is going wrong?

No, but I could try creating an example and mailing it to you. It will have to wait a few days though (busy). If Micah Villmow still has the code I mailed him back in June when aticaldd was crashing (I don't), it might be enough to use the body of that kernel.


Originally posted by: genaganna
Originally posted by: Curious cat
Originally posted by: ryta1203 Have you looked at the ISA and played with moving instructions around?

No. The point being that if the compiler were behaving reasonably, it would reuse all registers employed in block A when doing block B, and again reuse all registers employed in block B when doing block C. Instead, it's spilling registers. If I were a compiler developer, I would want to understand why; it may well be the same problem causing the increased register use in 2.2 vs 2.1.

Curious cat,

        Could you please post your three kernels here so we can see what is going wrong?

Just take three samples and cascade them.


Originally posted by: Curious cat
Originally posted by: ryta1203 Have you looked at the ISA and played with moving instructions around?

No. The point being that if the compiler were behaving reasonably, it would reuse all registers employed in block A when doing block B, and again reuse all registers employed in block B when doing block C. Instead, it's spilling registers. If I were a compiler developer, I would want to understand why; it may well be the same problem causing the increased register use in 2.2 vs 2.1.

Well, I agree with you, but that's not how it behaves, so what are you going to do?

I'm pretty sure that AMD outsources their compiler (to VizExperts?), so that could help explain things.

As far as 10.7 goes, I just installed the 10.7 from AMD.com's drivers page. Is there a different one I should be using?

And Jawed, it's not just the SKA: the PROFILER (as stated previously) reports the exact same high GPR usage that the SKA reports.


I linked the new release of 10.7 earlier in this thread. That version of 10.7 is unlikely to match the SKA 1.6 instance of 10.7 because normally there's a lag before a shader compiler gets into SKA.


Originally posted by: Jawed I linked the new release of 10.7 earlier in this thread. That version of 10.7 is unlikely to match the SKA 1.6 instance of 10.7 because normally there's a lag before a shader compiler gets into SKA.

I can try the other driver, but I'd expect AMD's driver home page to have the "latest" Catalyst driver, which is where I got the one I'm using just a few days ago (~8/(12/13)).

Also, like I said before, regardless of what the SKA is producing, the RUNTIME PROFILER is reporting the same numbers as the SKA, so unless I have the same catalyst as the SKA...


The driver on the regular driver page is not the latest driver, though. That's why that driver matches the SKA.

The latest driver, the one I linked earlier, is the driver linked from the SDK 2.2 download page. See the section called "Tested Drivers (necessary for OpenCL™ GPU support):"

That's why it's called the "Update Driver" and why I described it as such when I linked it.

The filename includes "10.7b".


SKA 1.6 includes the latest Catalyst 10.7 Update module.

 


Originally posted by: bpurnomo SKA 1.6 includes the latest Catalyst 10.7 Update module.

 

Well, that settles that; the problem still stands. I'll update to the newest driver, though.

Jawed
Adept II

Are you using the update version of 10.7?

update driver


Originally posted by: Jawed Are you using the update version of 10.7?

update driver

Yes, I have all the latest and greatest installed. Even did a complete uninstall and directory delete followed by reinstallation to be sure. No change. I now have kernels which used to have 0 scratch registers using 9, 11, 15 and 20 scratch registers, and feel like Sisyphus.

It would be OK if performance improved by using more registers, but in all those cases it is reported to be down significantly.

Does

#pragma OPENCL EXTENSION cl_amd_fp64 : enable

work for you?


Originally posted by: Curious cat
Does

#pragma OPENCL EXTENSION cl_amd_fp64 : enable

work for you?

Are you facing any problem with the cl_amd_fp64 extension?


Originally posted by: genaganna Are you facing any problem with cl_amd_fp64 extension?

Yes, this:

OpenCL Compile Error: clBuildProgram failed (CL_BUILD_PROGRAM_FAILURE).
Line 10: error: can't enable all OpenCL extensions or unrecognized OpenCL extension
#pragma OPENCL EXTENSION cl_amd_fp64 : enable
^


Originally posted by: Curious cat
Originally posted by: genaganna Are you facing any problem with cl_amd_fp64 extension?

Yes, this:

OpenCL Compile Error: clBuildProgram failed (CL_BUILD_PROGRAM_FAILURE).
Line 10: error: can't enable all OpenCL extensions or unrecognized OpenCL extension
#pragma OPENCL EXTENSION cl_amd_fp64 : enable
^

Could you please run the MatrixMulDouble sample that comes with the SDK and see whether it runs or not?


Originally posted by: genaganna Could you please run the MatrixMulDouble sample that comes with the SDK and see whether it runs or not?

Yes, it runs with "--device cpu" on the command line (I'm on a laptop right now, no AMD graphics). So maybe it's just SKA 1.6 that's borked.

When I try to target x86 Assembly with the SKA, I get "OpenCL Compile Error: X86 asm output is not currently supported." It does work without the cl_amd_fp64 pragma (but only produces stats for GPUs, and no x86 assembly).


Originally posted by: genaganna
Originally posted by: Curious cat
Does

#pragma OPENCL EXTENSION cl_amd_fp64 : enable

work for you?

Are you facing any problem with cl_amd_fp64 extension?

http://www.khronos.org/registry/cl/sdk/1.0/docs/man/xhtml/

http://www.khronos.org/registry/cl/sdk/1.1/docs/man/xhtml/

Weird thing: the PDF specs say to use enable/disable, while the online man pages say to use require instead of enable.


MatrixMulDouble_Kernels.cl uses "enable", but oddly, the SKA does not complain about "require". Now I'm totally confused.

Judging by the stats, though, "require" is simply ignored.


Originally posted by: Jawed Are you using the update version of 10.7?

update driver

Jawed,

If you are talking to me then yes, per my original post.


There are two releases of Catalyst 10.7, so I can't tell which you are using.

For this reason SKA cannot be relied upon, because when 10.7 is selected internally for compilations, you don't know which release of 10.7 is being used.

(For the record: I've got no experience of any of this, as I haven't installed SDK 2.2, nor Catalyst 10.7, nor SKA 1.6).


Originally posted by: Jawed Are you using the update version of 10.7?

update driver

Just for the record, this version of the driver crashes the X server as soon as I start an OpenCL application (Ubuntu 10.04 64-bit, 5870+5850). Maybe it's something related to multi-GPU?

Multi-GPU support seems to work really badly with the standard Catalyst 10.7 since the introduction of SDK 2.2: I get the same performance with 1 or 2 GPUs.

 


Here is the backtrace of the X server crash:

 

Backtrace:
0: /usr/bin/X (xorg_backtrace+0x28) [0x4a3258]
1: /usr/bin/X (0x400000+0x655bd) [0x4655bd]
2: /lib/libpthread.so.0 (0x7f9e00f81000+0xf8f0) [0x7f9e00f908f0]
3: /usr/lib/xorg/modules/drivers/fglrx_drv.so (0x7f9dfd6f2000+0x2af584) [0x7f9dfd9a1584]
4: /usr/lib/xorg/modules/drivers/fglrx_drv.so (0x7f9dfd6f2000+0x2ad843) [0x7f9dfd99f843]
5: /usr/bin/X (0x400000+0x30c3c) [0x430c3c]
6: /usr/bin/X (0x400000+0x261aa) [0x4261aa]
7: /lib/libc.so.6 (__libc_start_main+0xfd) [0x7f9dffc78c4d]
8: /usr/bin/X (0x400000+0x25d59) [0x425d59]
Segmentation fault at address 0x8

Caught signal 11 (Segmentation fault). Server aborting

Please consult the The X.Org Foundation support
         at http://wiki.x.org
 for help.
Please also check the log file at "/var/log/Xorg.0.log" for additional information.

 


Well, all I know is that my ALUBusy % dropped from 100% to 56% for one of my kernels, with no changes at all, not even in comments, if you see what I mean. This is really bizarre, not to mention that my kernel runs slower now! I can't tell about the GPRs because the SDK 2.1 profiler doesn't show this information in Visual Studio 2008 (I've got the Pro version). How can I check the GPRs with SDK 2.1? All I have is 50 GPRs from the new SDK 2.2 profiler.

One more thing: RAM<->VRAM transfers are slower with SDK 2.2 compared to 2.1, something like 20% slower (in both directions).

PS: my card is an HD5770


Originally posted by: laobrasuca Well, all I know is that my ALUBusy % dropped from 100% to 56% for one of my kernels, with no changes at all, not even in comments, if you see what I mean. This is really bizarre, not to mention that my kernel runs slower now! I can't tell about the GPRs because the SDK 2.1 profiler doesn't show this information in Visual Studio 2008 (I've got the Pro version). How can I check the GPRs with SDK 2.1? All I have is 50 GPRs from the new SDK 2.2 profiler.

One more thing: RAM<->VRAM transfers are slower with SDK 2.2 compared to 2.1, something like 20% slower (in both directions).

PS: my card is an HD5770

I also have VS2008; check your profiler settings.

If that doesn't work, you can just dump the ISA and look at the bottom of that file. Or you can use the SKA to check the GPRs; just make sure, if you use the SKA, that you are using the matching version (for 2.1, the SKA that uses Catalyst 10.3, and for 2.2, the SKA that uses Catalyst 10.7).


Ryta,
Our compiler stack for OpenCL is developed internally. However, the CAL compiler is fundamentally a graphics compiler, which has different requirements than a general purpose compute compiler. We are still working on fine tuning our stack for compute compiler loads and it looks like there are some cases where our tuning was less than optimal.

Originally posted by: MicahVillmow Ryta, Our compiler stack for OpenCL is developed internally. However, the CAL compiler is fundamentally a graphics compiler, which has different requirements than a general purpose compute compiler. We are still working on fine tuning our stack for compute compiler loads and it looks like there are some cases where our tuning was less than optimal.


So you can confirm this (the dramatic increase in register usage)? I just want to know; it's not actually affecting my work that much, but I am still curious. Thanks.

It would be awfully difficult for developers to have to decide which SDK and drivers to use based on which ones perform better for their kernels. A slight +/- swing in performance is to be expected, but almost halving the performance of some kernels because the register allocation system is broken makes things difficult.


Ryta,
Yeah, we see this internally. We are still looking into the root cause of why it is occurring, but since a lot of components changed between 2.1 and 2.2, it might take us a little while to figure out exactly which change, or series of changes, caused this.

Originally posted by: MicahVillmow Ryta, Yeah, we see this internally. We are still looking into the root cause of why it is occurring, but since a lot of components changed between 2.1 and 2.2, it might take us a little while to figure out exactly which change, or series of changes, caused this.


Micah,

  Ok, thanks again for confirming this, appreciate it.


Originally posted by: MicahVillmow Ryta, Yeah, we see this internally. We are still looking into the root cause of why it is occurring, but since a lot of components changed between 2.1 and 2.2, it might take us a little while to figure out exactly which change, or series of changes, caused this.


Sorry, so is this an SDK issue or a driver issue? I was just wondering whether this might be fixed in 10.8 or 10.9, or will we have to wait for a new SDK version?


It is an issue with the CAL compiler, which is shipped with the driver.

Micah,

  OK, I thought as much. Unfortunately 10.5 wouldn't work with 2.2, so I had to roll back to 2.1. This sucks because now I'm back to being limited to smaller data sizes, it seems.

0 Likes