While testing my workload on a different system, to which I had to move an AMD R9 290 card, I encountered several bottlenecks that slowed my workload down by 25% to over 400%. Configuration changes alleviated some of it, but I think I may need to do something programmatically to make my performance more consistent, so I'm looking for suggestions.
Here are some things I did.
1) Deleted a bunch of disk files -- my workload isn't always doing disk I/O, but in this case it was set to do so. It needs to read/write several GB of files. Normally a large amount of RAM hides the disk I/O pretty well, especially in a repeated run. Anyway, I only had several GB of free disk space left, so I cleared up a bunch more, and that brought my runtimes down a lot.
2) I was still at least 200% slower, so I started killing other programs. Nothing helped... finally, I stopped SearchIndexer.exe (disabled Windows Search service) and that got me down to maybe 50% slower than normal. The indexer wasn't using an obviously high number of cycles, but it wasn't quiescent.
3) Disabled processor power management. From Control Panel, find power management, then Edit Power Plan, Advanced Settings, Processor Power Management. I set the minimum and maximum processor state to 100% each, and System Cooling Policy to passive. That got me to where I needed to be, matching the performance of the other system.
4) The next day, after restarting Outlook, the SearchIndexer came back on and slowed things down again, even though I'd disabled the service. Stopping it again worked. What is interesting is that the machine is not obviously busy, yet my OCL program gets slowed down a lot. I haven't tried other OCL workloads yet.
5) I did some experiments with other programs, mainly DirectX things. I used the AMD Leo sample, and also one called HK-2207. Unsurprisingly, running my OpenCL program concurrently with those slows it back down. I noticed that even when I pause the DX sample, my program is still maybe 25% slower.
1) What can I do programmatically to make sure my OCL app can run at full speed? Boosting my thread priority is all I can think of at the moment.
2) If I boost thread priority would I be able to restore default p-states?
3) Why would a paused DX program make my OpenCL workload slow down?
Depending on what you're doing, why, and of course how jarring it would be for you and your programs to migrate, consider trying things under Linux, such as Ubuntu or OpenSuSE (shameless plug). You can nail down this stuff much more easily and get the OS to have minimal influence on your application's real-time performance.
This is an answer to some and noise to others :-).
Gotcha :-). Porting to Linux isn't on the radar right now. It certainly could happen at some point, but the Windows user base is dominant.
1) Pretty much nothing. The issue you are facing is that you are trying to turn a generic end-user distribution into a workstation, or more like a server. If you install Windows Server XY, you will most likely get optimal performance out of the box, since the stock installation has pretty much everything turned off. People generally tend to boot Windows in recovery mode when they want to run ONLY their application and measure its performance, but for GPGPU applications that mostly does not work. Since you are referring to a Windows "user base", you have to assume that your users will listen to MP3s, watch a YouTube video, run Office Word and whatnot while they are running your program. You, as the programmer, should not meddle with THEIR system, because they will become angry. You should write your application so that it does no unnecessary work and does everything as efficiently as it can, but setting up the target system should remain outside the scope of a developer. You don't see StarCraft2 setting thread priorities or turning off indexing services, but you can bet Blizzard wants it to be as fast as possible too.
As jason has mentioned, Linux generally has A LOT fewer services/daemons running and you get better performance out of the box.
2) I don't know what exactly you are referring to by "restoring".
3) "Paused" can mean a lot of things. Today, DX resources are managed by the OS and the application in a weird symbiosis. Even if you minimize a DX application, it is up to the application how it handles being put into the background. It may still run the display loop, just without doing any rendering. You have pretty much no control over how quiet an application becomes once it is put into the background. The app's VRAM should be flushed to disk AFAIK, but the host-side code can do whatever it wants.
4) To boost your application's performance, master async IO for both disk access and GPU memory movement, utilize multiple host threads for reading data, perhaps do data compression if it makes sense in your case... these are the things you as a programmer can do to obtain better performance. YOU, as a user yourself, should keep a clean system with all the unnecessary things turned off (not always easy to do on a dev box), but you cannot require your end users to do the same, let alone achieve it programmatically. Imagine Photoshop turning off indexing on my machine. I'd uninstall the hell out of it the instant I found out.
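To make the async IO point a bit more concrete, here is a minimal sketch of overlapping disk reads with processing, using only standard C++. The names `read_chunk` and `process_file_overlapped` are made up for illustration, and the `process` callback is just a stand-in for whatever your app does with each chunk (for example, handing it to the GPU). It is one possible shape for the idea, not a tuned implementation.

```cpp
#include <cstddef>
#include <fstream>
#include <functional>
#include <future>
#include <string>
#include <vector>

// Hypothetical helper: read up to chunk_bytes from the stream.
// Returns an empty vector once nothing more can be read.
static std::vector<char> read_chunk(std::ifstream& in, std::size_t chunk_bytes) {
    std::vector<char> buf(chunk_bytes);
    in.read(buf.data(), static_cast<std::streamsize>(buf.size()));
    buf.resize(static_cast<std::size_t>(in.gcount()));
    return buf;
}

// Hypothetical driver: while process() works on the current chunk,
// the next chunk is already being read on another thread.
// Returns the total number of bytes handed to process(), or 0 if the
// file could not be opened.
std::size_t process_file_overlapped(
        const std::string& path,
        std::size_t chunk_bytes,
        const std::function<void(const std::vector<char>&)>& process) {
    std::ifstream in(path, std::ios::binary);
    if (!in) return 0;

    std::size_t total = 0;
    // Kick off the first read before entering the loop.
    auto next = std::async(std::launch::async, read_chunk,
                           std::ref(in), chunk_bytes);
    for (;;) {
        std::vector<char> current = next.get();  // wait for the pending read
        if (current.empty()) break;              // end of file
        // Start the next read, then process the chunk we already have.
        // Only one read is in flight at a time, so the stream is never
        // touched by two threads at once.
        next = std::async(std::launch::async, read_chunk,
                          std::ref(in), chunk_bytes);
        process(current);
        total += current.size();
    }
    return total;
}
```

The same double-buffering pattern applies to GPU transfers, where the "read" becomes a non-blocking copy on one buffer while a kernel works on the other.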
...just my thoughts.
Thanks for the insights. Sounds like a reasonable thing to do is provide end-user documentation (like a tech note) regarding system settings and third-party items that may impact the performance. I agree with your other thoughts.
2) Q: If I boost thread priority would I be able to restore default p-states?
2) A: I don't know what exactly you are referring to by "restoring".
Let me try to clarify what I meant by "restore default p-states". To get max OpenCL performance on this particular system, I explicitly set it to performance mode, which disables power-saving mode (e.g., default p-states). So my question was whether programmatically changing thread priority in my app might solve the performance issue a different way, letting me go back to the default power settings. If I get a chance to try it I will report my findings.
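For anyone who wants to try the same experiment, the priority change I have in mind would look roughly like the sketch below. The function names `boost_priority` and `restore_priority` are my own; the Windows calls are the standard Win32 `SetPriorityClass` and `SetThreadPriority`. This only changes the priority of my own process and thread, it does not touch anyone's power plan or services, and whether it actually helps with the p-state issue is untested.

```cpp
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper: raise the priority of the calling process and thread.
// Returns true only if both Win32 calls succeed; on non-Windows builds it
// does nothing and returns false.
bool boost_priority() {
#ifdef _WIN32
    bool ok = SetPriorityClass(GetCurrentProcess(), HIGH_PRIORITY_CLASS) != 0;
    ok = (SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_HIGHEST) != 0) && ok;
    return ok;
#else
    return false;
#endif
}

// Hypothetical helper: put the process and thread back to normal priority,
// e.g. once the OpenCL work has finished.
bool restore_priority() {
#ifdef _WIN32
    bool ok = SetPriorityClass(GetCurrentProcess(), NORMAL_PRIORITY_CLASS) != 0;
    ok = (SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_NORMAL) != 0) && ok;
    return ok;
#else
    return false;
#endif
}
```

The idea would be to call `boost_priority()` just before enqueueing the OpenCL work and `restore_priority()` afterwards, then compare runtimes with the power plan left at its defaults.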