Hi,
I have an OpenCL program that runs about 25% faster when the screen goes into standby mode.
At the same time, the compiz process takes 100% CPU usage. This is a known bug that is discussed here: https://bugs.launchpad.net/ubuntu/+source/compiz/+bug/969860
That page suggests a workaround (enable "Force full screen redraw (buffer swap) on repaint") which fixes the 100% CPU usage issue, but the performance then stays low even when the screen goes into standby mode.
This has happened to me on Ubuntu 12.04 with a Radeon 6970 and a 7970, using SDK 2.7 and the 12-4 and 12-6 drivers.
I guess that getting better performance when the screen goes into standby mode is reasonable, but right now to achieve this I must "pay" with one core at 100% usage.
I can't tell if this is a general GPU driver issue or an OpenCL one.
If you are running OpenCL apps remotely, did you try without logging in to Ubuntu? You should be able to run OpenCL apps while the computer is sitting at the login prompt. Try it and let us know the results.
Thanks, I used the info that you wrote in the other post.
I ran the app without logging in, and the performance is still slower than when I log in and let the screen go into standby.
It is possible that when in standby the card does not exit power saving mode. Did you try checking the current speeds with the aticontrol program while running your program to verify this?
How long is your test case? It is also possible that the time it takes for the card to switch from low power to performance mode is where you are losing that 20-25% performance. Perhaps when compiz runs, it forces the card to stay in performance mode at all times (you can confirm this with aticontrol getclocks as well).
I used aticonfig --odgc to see the current peak clock.
From what I saw, the current peak clock is always equal to 925MHz which is the maximum clock in my settings for this card.
I noticed that the GPU load gets higher in standby mode (from about 40% to 55%), which may be caused by the compiz process plus my program.
Each test case runs for a few minutes, and the performance is steady in both the standby and the "awake" mode.
I tried to run samples from the SDK to see if it happens also there.
The only difference I've found so far is in BufferBandwidth on one line :
when awake:
Page fault | 1670.84 ns |
when standby:
Page fault | 837.94 ns |
It doesn't tell me a lot, but maybe it has something to do with my issue.
Hi c0nfig,
I think you should run a test on Windows. That would help confirm whether it's a driver issue or a program issue.
Hi Wenju,
I tested it on Windows, and the performance didn't change when the screen went into standby.
Moreover, my program runs faster on Windows (about 20% faster than the fastest I could get with the Linux version).
On Linux, if you let the CPU usage go back to 100%, what's the result? Is it faster again?
I'll use some numbers to ease this discussion.
It doesn't matter whether I start the program when the screen is off (through SSH) or directly from the computer with the screen on.
Moreover, during these tests, every few minutes I moved the mouse to wake the screen and then let it go off again (after 5 minutes).
Ok, so these are the two cases:
Hi c0nfig,
I think you should do another test: close the compiz process and then run your program.
So far I am just speculating: when the screen is off, the GPU will be idle. It's a bug that the compiz process takes 100% CPU, so at that moment compiz doesn't perform very well at things like 3D rendering. When the screen goes into standby mode, compiz performs much worse, or maybe it doesn't work at all, so a lot of GPU resources are left for your program. On the other hand, once you apply the fix, compiz performs normally even in standby mode.
Hi,
As I previously replied to yurtesen, I ran via SSH without logging in as any user (I will recheck, but I think that means compiz was not running). The result was the same slower performance (with the screen either on or off), just like with compiz and the compiz fix.
For some reason that compiz bug makes the performance higher in standby mode.
I'll try to write a simple kernel so you can see whether it also happens on your systems.
If you can provide some test code, I can try to run it. What I was thinking is that the card is going into power saving mode, or switching back and forth, and losing performance. You already said the current peak was 925, but the important part is 'Current Clocks' (did you mean that?). Perhaps it would make sense to see if 'Current Clocks' shows max speed when compiz is using 100% CPU (to see whether compiz is causing the card to stay in high speed mode).
I don't know if it is possible, but maybe you can try to force the clocks to stay high at all times? I am not sure if CCC would show you anything under Linux there... But I found this utility which you might be able to try...
http://manpages.ubuntu.com/manpages/hardy/man8/rovclock.8.html
I am just fishing here...
Hi,
I've prepared a code sample. I simply took a vector addition example and changed the kernel to be more time consuming so the bottleneck won't be the CPU.
The program starts the kernel and calls a blocking buffer copy and repeats in an infinite loop. The program prints the number of steps every 10 seconds.
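For readers who want to reproduce the measurement without the attached files, here is a minimal sketch of the "count steps per interval" loop described above. It is not the original test program: `count_steps` and `fake_step` are hypothetical names, and `fake_step` is only a CPU stand-in for "enqueue the kernel, then do a blocking buffer read".

```python
import time


def count_steps(step, interval_s, reports, clock=time.monotonic):
    """Call step() in a loop and return how many calls finished per interval."""
    counts = []
    for _ in range(reports):
        deadline = clock() + interval_s
        n = 0
        while clock() < deadline:
            step()  # one "step" = one kernel launch + one blocking read
            n += 1
        counts.append(n)
    return counts


def fake_step():
    # Hypothetical stand-in workload; the real program runs an OpenCL kernel here.
    sum(i * i for i in range(1000))


if __name__ == "__main__":
    # Short interval for a quick demo; the thread uses 10-second intervals.
    for n in count_steps(fake_step, interval_s=0.2, reports=3):
        print("steps per 0.2 secs : %d" % n)
```

The `clock` parameter is there only so the counting logic can be checked with a fake clock; in normal use the default monotonic clock is enough.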
On my computer when compiz is "fine" I get about 3900 steps per 10 seconds.
During the same run, the screen goes into standby after a minute, and when compiz starts to go wild (about 30 seconds after the screen is in standby) I get 4800 steps per 10 seconds.
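As a quick sanity check on those two numbers, the relative gain can be computed directly. This is just arithmetic on the figures quoted above; `relative_speedup` is a hypothetical helper name.

```python
def relative_speedup(baseline, observed):
    """Fractional gain of observed over baseline (0.25 means 25% faster)."""
    return (observed - baseline) / baseline


# 3900 steps/10 s with compiz "fine", 4800 steps/10 s once compiz goes wild
print("%.1f%%" % (100 * relative_speedup(3900, 4800)))  # prints 23.1%
```

So the sample program shows roughly a 23% gain, in line with the "about 25%" seen with the real application.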
During the whole test, aticonfig showed both the current peak and the current clock at 925 MHz.
I can set the clock rates for the core and for the memory using the aticonfig overdrive utility. I tried to play with this; it changed the performance of course, but the behavior remained.
Does rovclock actually force the clock frequencies, or is it like AMD Overdrive?
BTW, yurtesen, you were right: in that message I meant both the current peak and the current clock.
c0nfig wrote:
Does rovclock actually force the clock frequencies, or is it like AMD Overdrive?
I don't know; my hope was that it would force the card to stay at high speed even when idling. If you are able to run on Windows, MSI Afterburner's unofficial overclocking mode (2) does this for Tahiti cards at least. It is able to disable PowerPlay (or it appears to take it over). But you already said it runs fine on Windows, so... you might have to find a utility which does the same on Linux.
PS. I cut my finger today so I am barely able to type... I won't be able to test your code right away; I am taking it easy for a little while...
Oh, sorry to hear that. Resting your finger is absolutely the better idea.
I don't have this issue on Windows, so I'll just try to use rovclock and post an update.
I just downloaded your test files and will try to run them soon. How did it go with rovclock?
Well, I am getting something like this on a 5870 as output
$ ./a.out
steps per 10 secs : 5012
steps per 10 secs : 5070
steps per 10 secs : 5066
steps per 10 secs : 5081
steps per 10 secs : 5080
6x 5000 = 30000 ? Tomorrow I can try to run it on a 7970, I guess... It seems unlikely that a 5870 would beat a 6970 or 7970. The next problem is that the 7970 which I have to use is clocked at 1010 MHz in performance mode... I guess I will be able to tell whether I am having the same problem once I run the program from the console directly.
I checked with top and I have roughly 10% CPU usage from your program only. I tested it without logging in, with the display at the login screen (Ubuntu 12.04, APP SDK 2.7 and Catalyst 12.6 driver; I am not sure if it goes to standby, I just installed Ubuntu on that box). I will try it tomorrow from the console directly also.
Also, just out of curiosity, why didn't you use vector elements in your kernel if this is a vector addition example? (Just for fun I tried float4, and it doubles my performance to ~11800 elements per 10 seconds on the 5870.)
Well, when at console, it appears the performance depends on if there are movements on display or not. I am able to get consistent speeds if nothing is moving on screen. Perhaps X is monopolizing the card?
$ ./a.out
steps per 10 secs : 4569
steps per 10 secs : 5042
steps per 10 secs : 5038
steps per 10 secs : 5035
steps per 10 secs : 5039
steps per 10 secs : 5028
steps per 10 secs : 5037
Do you happen to have that bug where compiz goes to 100% CPU on your Ubuntu 12.04?
Is this last result from your 7970 @ 1010 MHz?
I tried to use rovclock, but it kept throwing the error "Invalid reference clock from BIOS: 0.0 MHz" on every operation I tried, though it did find the ATI card (it prints "Found ATI card on 01:00 ...").
Thanks for the help.
I ran your program only on the 5870 with Ubuntu 12.04. When I tested it, I was not logged in, so compiz was not using the GPU. But I have the same compiz problem: if I leave a user logged in, compiz uses a single core at 100% (but I didn't run your program while compiz was running).
Unfortunately I didn't have time to get to the 7970; there were some hardware problems in that machine (it will be fixed soon), and it is running Ubuntu 11 (I just remembered now). If you want, I can run it on the 7970 at some point when I get it up and running.
From what I see on my 5870, there does not seem to be a problem with the clocks. Sometimes when I run your program the first iteration is a little slower, but the other ones are quite stable at ~5000 steps/10 sec. But of course maybe this problem affects only some cards (though then I would expect the 5000 and 6000 series to perform more or less similarly).
I am getting exactly the same behavior even when compiz is at 100% (the first iteration was sometimes a little slower in my previous tests also). This is on the 5870, 850 MHz GPU / 1200 MHz GDDR5.
# ./a.out
steps per 10 secs : 4638
steps per 10 secs : 5061
steps per 10 secs : 5058
steps per 10 secs : 5062
steps per 10 secs : 5064
steps per 10 secs : 5065
^C
top - 01:12:20 up 4:48, 4 users, load average: 1.37, 1.37, 1.44
Tasks: 199 total, 2 running, 196 sleeping, 0 stopped, 1 zombie
Cpu(s): 2.4%us, 11.9%sy, 0.0%ni, 85.8%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 16435264k total, 1392672k used, 15042592k free, 35116k buffers
Swap: 16775164k total, 0k used, 16775164k free, 431780k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
4066 supremum 20 0 1284m 90m 44m R 100 0.6 147:32.48 compiz
7698 root 20 0 174m 53m 22m S 10 0.3 0:01.83 a.out
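Since the first report of each run tends to be lower than the rest, it may be fairer to compare runs by their steady-state average. A small sketch of that, using the 5870 numbers pasted above; `steady_mean` and the choice to skip exactly one warm-up sample are assumptions for illustration, not part of the test program.

```python
import re

# Output copied from the 5870 run above (compiz at 100% CPU)
RUN = """\
steps per 10 secs : 4638
steps per 10 secs : 5061
steps per 10 secs : 5058
steps per 10 secs : 5062
steps per 10 secs : 5064
steps per 10 secs : 5065
"""


def steady_mean(output, skip=1):
    """Average the reported step counts, ignoring the first `skip` warm-up lines."""
    counts = [int(m) for m in re.findall(r"steps per 10 secs\s*:\s*(\d+)", output)]
    steady = counts[skip:]
    if not steady:
        raise ValueError("not enough samples after skipping warm-up")
    return sum(steady) / len(steady)


print(steady_mean(RUN))  # prints 5062.0
```

With the warm-up line dropped, this run averages 5062 steps per 10 seconds, which matches the ~5000 seen without compiz running.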
Hi,
Thanks for your test.
Hmm, so maybe it depends on the card.
I'll install a clean Ubuntu to test this program again.
Also, the fact that the 5870 outperforms my 7970 is suspicious.
Just to be sure, please tell me whether you are using the latest version of the APP SDK and the graphics drivers.
I have the box with 7970 up and running, I will soon return back with some numbers. I am using SDK 2.7, and 12.6 drivers (I mentioned it earlier). Actually I just installed ubuntu 12.04 from scratch to this box.
Hmm, right something is strange here... GPU load shows 0%
Adapter 0 - AMD Radeon HD 7900 Series
Core (MHz) Memory (MHz)
Current Clocks : 300 150
Current Peak : 1010 1375
Configurable Peak Range : [300-1125] [150-1575]
GPU load : 0%
and the performance is terrible...
$ ./a.out
steps per 10 secs : 1448
steps per 10 secs : 1465
steps per 10 secs : 1468
steps per 10 secs : 1465
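To make the "is the card stuck at low clocks?" check repeatable, the clock report above can be parsed instead of read by eye. This is only a sketch: `parse_odgc` and `is_below_peak` are hypothetical helpers, and the parser assumes the layout shown in the paste above, which may differ between driver versions.

```python
import re

# Sample copied from the 7970 report above
SAMPLE = """\
Adapter 0 - AMD Radeon HD 7900 Series
Core (MHz) Memory (MHz)
Current Clocks : 300 150
Current Peak : 1010 1375
Configurable Peak Range : [300-1125] [150-1575]
GPU load : 0%
"""


def parse_odgc(text):
    """Extract current clocks, peak clocks and GPU load from the report text."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # header lines without a "key : value" pair
        key = key.strip()
        nums = [int(n) for n in re.findall(r"\d+", value)]
        if key in ("Current Clocks", "Current Peak") and len(nums) == 2:
            info[key] = {"core": nums[0], "memory": nums[1]}
        elif key == "GPU load" and nums:
            info[key] = nums[0]
    return info


def is_below_peak(info):
    """True if the core is currently running slower than its configured peak."""
    return info["Current Clocks"]["core"] < info["Current Peak"]["core"]


info = parse_odgc(SAMPLE)
print(info["Current Clocks"], info["GPU load"], is_below_peak(info))
```

For the sample above this reports a 300 MHz core clock against a 1010 MHz peak, i.e. the card is still in its low power state while the benchmark runs. To use it live, feed it the text printed by `aticonfig --odgc` while the test program is running.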
Anyway, there is a problem in your loop also. You are not waiting for kernel execution to finish before running the enqueue read? I get 50% better performance if I put a clFinish between the enqueue kernel and enqueue read statements. But that is not very efficient... (on the 7970, it now uses 50% of the card with clFinish; you should find a better solution...)
On the other hand, if I put clFinish in on the 5870, there is no difference in execution...
Is clFinish necessary? The command queue keeps the order of the clEnqueue commands, and the clEnqueueReadBuffer is blocking. Am I missing something here?
Yes: when you enqueue a kernel, the host program continues and runs the read buffer command (which will try, in a blocking fashion, to read data from the memory your kernel is working on), because the enqueue kernel command is not blocking. You should use events to keep track of kernel execution and try not to read/write memory areas which are used by the kernel while it is executing (obviously). I think I am right, but double check with the manual.