Archives Discussions

holomatix · ‎12-17-2009

Why so slow?

Hi,

We're developing with OpenCL, and have one dev machine with an NVIDIA GTX 260 and another with an ATI 4870. Both seem to be mid-range cards and are similarly priced.

However, I'm sorry to say we are getting approximately 5x the performance from the NVIDIA card compared with the ATI. We're using the same OpenCL kernel and each company's own SDK; in the case of ATI, that is Stream SDK 2.0 beta 4.

Is this performance gulf due to the early stage of ATI's OpenCL support? Is the implementation not well optimised yet? If so, how soon can we expect this gap to close? Or were we mistaken about the equivalence of the two cards?

To be honest, I was expecting more from the 4870. The specs certainly seemed to imply it was fairly powerful, yet currently my CPU (Core 2 Quad, 2.4GHz) outperforms it by roughly a factor of two. Something definitely seems amiss!

Many thanks for any information

Best Regards

Matt Taylor

MicahVillmow · ‎12-17-2009

holomatix,
This depends entirely on how you coded the kernel and which OpenCL features you are using. There are known performance issues for the HD4XXX series of cards on OpenCL, and there is currently no plan to focus exclusively on improving performance for that family. The HD4XXX series was not designed for OpenCL, whereas the HD5XXX series was. The HD4XXX series will pick up some performance improvements as a side effect of work on the HD5XXX series, so it will get better, but it is not our focus.

For example, if you are using local memory, it is currently all emulated in global memory. So it is possible you are going out to main memory twice as often as you do on NVIDIA. This can cause a fairly large performance hit if the application is memory bound. On the HD5XXX series, local memory is mapped to hardware local memory and is thus many times faster than on the HD4XXX series.
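To make the point above concrete, here is a minimal sketch (not code from the thread) of the same 3-point sum written two ways: one kernel stages its input through __local memory, the other reads global memory directly. The kernel names, the stencil, and the zero padding at the edges are all assumptions chosen for illustration. The OpenCL C is held in string literals so the listing builds as plain C without an SDK, and the C function is a CPU reference for what both kernels should compute. On hardware where local memory is emulated in global memory, the staged variant adds extra global traffic for the tile, so the direct variant may be the better starting point; which one wins on a given card is something to measure.

```c
#include <stddef.h>

/* Variant 1: stage the input in __local memory (hypothetical kernel).
   The host must size `tile` as (work-group size + 2) floats. */
const char *const sum3_local_src =
    "__kernel void sum3_local(__global const float *in,\n"
    "                         __global float *out,\n"
    "                         __local float *tile)\n"
    "{\n"
    "    size_t gid = get_global_id(0);\n"
    "    size_t lid = get_local_id(0);\n"
    "    size_t lsz = get_local_size(0);\n"
    "    size_t n   = get_global_size(0);\n"
    "    tile[lid + 1] = in[gid];\n"
    "    if (lid == 0)       tile[0]       = (gid > 0)     ? in[gid - 1] : 0.0f;\n"
    "    if (lid == lsz - 1) tile[lsz + 1] = (gid + 1 < n) ? in[gid + 1] : 0.0f;\n"
    "    barrier(CLK_LOCAL_MEM_FENCE);\n"
    "    out[gid] = tile[lid] + tile[lid + 1] + tile[lid + 2];\n"
    "}\n";

/* Variant 2: no local memory, every operand is read from global memory. */
const char *const sum3_global_src =
    "__kernel void sum3_global(__global const float *in,\n"
    "                          __global float *out)\n"
    "{\n"
    "    size_t gid = get_global_id(0);\n"
    "    size_t n   = get_global_size(0);\n"
    "    float left  = (gid > 0)     ? in[gid - 1] : 0.0f;\n"
    "    float right = (gid + 1 < n) ? in[gid + 1] : 0.0f;\n"
    "    out[gid] = left + in[gid] + right;\n"
    "}\n";

/* CPU reference: out[i] = in[i-1] + in[i] + in[i+1], with zeros past the ends. */
void sum3_reference(const float *in, float *out, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        float left  = (i > 0)     ? in[i - 1] : 0.0f;
        float right = (i + 1 < n) ? in[i + 1] : 0.0f;
        out[i] = left + in[i] + right;
    }
}
```

Either string can be passed to clCreateProgramWithSource(); comparing both kernels against sum3_reference() on the same input is a quick way to check them before timing.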

kbrafford · ‎12-17-2009

>For example, if you are using local memory, it is currently all emulated in global memory.

Will that change in the near future?

MicahVillmow · ‎12-17-2009

On the 7XX series of cards there is no plan to change that. The hardware's limitations mean there are very few situations where it would be beneficial.

kbrafford · ‎12-17-2009

It would seem to me that any algorithm that uses precomputed table lookups would benefit from the fast, close memory. If people clamor for it, will ATI reconsider?

MicahVillmow · ‎12-18-2009

kbrafford,
The current implementation does not do so, but future implementations will place precomputed tables in constant memory, which is faster than local memory.

For example, a table declared as constant float twiddles[] = { ... }; benefits much more from constant memory than from being placed in local memory.
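A minimal sketch of the kind of declaration described above, not taken from the post: a program-scope table in the __constant address space that a kernel indexes directly. The table name lut, its four placeholder values, and the kernel scale_by_table are assumptions for illustration only. The OpenCL C is kept in a string literal so the listing compiles as plain C, and the C function is a CPU reference for the same lookup.

```c
#include <stddef.h>

/* OpenCL C source (hypothetical kernel). `lut` lives in the __constant
   address space; its values are arbitrary placeholders. */
const char *const lut_kernel_src =
    "__constant float lut[4] = { 1.0f, 0.5f, 0.25f, 0.125f };\n"
    "\n"
    "__kernel void scale_by_table(__global const float *in,\n"
    "                             __global float *out)\n"
    "{\n"
    "    size_t gid = get_global_id(0);\n"
    "    out[gid] = in[gid] * lut[gid & 3];\n"
    "}\n";

/* The same placeholder table on the host, for the CPU reference. */
static const float lut_ref[4] = { 1.0f, 0.5f, 0.25f, 0.125f };

/* CPU reference: scale each element by the table entry at (index mod 4). */
void scale_by_table_reference(const float *in, float *out, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        out[i] = in[i] * lut_ref[i & 3];
}
```

Because the table is part of the program source, the host does not need to create or bind a buffer for it.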

holomatix · ‎12-18-2009

Hi Micah

Thanks a lot for your information - it's very helpful.

So with the 4870, it seems I have purchased a turkey - just in time for Christmas! 🙂

Seriously though, I suppose we'll just use it to check compatibility, and perhaps upgrade to a 5xxx series in the new year.

Thanks again, and merry Christmas

Matt Taylor

kbrafford · ‎12-19-2009

Micah, if I declare my lookup tables that way, as constants that are part of the CL program, how much data can I use that way? Is that what the 64K constant number that CLInfo reports means?

Also, how many of these memory accesses per second will I get? If I have 10 computational units, and each one of those has many threads (256, right? I am still figuring this stuff out) how many constant reads per clock cycle should I expect?

Thanks for all your help, btw.

kbrafford · ‎12-19-2009

Also, once samplers are implemented using the texture caches, why wouldn't it be faster to use a sampler with nearest neighbor filtering in order to implement lookup tables?

MicahVillmow · ‎12-18-2009

holomatix,
I wouldn't say that it is a turkey; it just has to be programmed differently from the 5XXX series to get performance, because it lacks proper hardware local memory support. It is possible to get good performance, just not with a direct port from CUDA.

empty_knapsack · ‎12-18-2009

By the way, it would be great if ATI/AMD released a C99-to-IL compiler. Currently, to get good performance on the 4XXX series you're forced to use CAL/IL, and programming in "pure" IL takes way too many resources.

It is possible to write a .cl kernel, compile it via OpenCL, and intercept the IL code that OpenCL passes to calclCompile. However, that requires too many unnecessary steps. How about doing it the easy way?

(Some subset of) C -> IL -> manual IL editing -> profit.

rahulgarg · ‎12-22-2009

empty_knapsack: I actually wrote a small compiler from a tiny C-like language to IL once. I have since broken it while working on another project, but I can release it soon.

empty_knapsack · ‎12-23-2009

Originally posted by: rahulgarg
>empty_knapsack: I actually wrote a small compiler from a tiny C-like language to IL once. I have since broken it while working on another project, but I can release it soon.

I guess all the necessary functionality is already present in OpenCL, so we don't need to write our own C compiler. We just need a convenient interface to access the IL source code after clCreateProgramWithSource() / clBuildProgram().

It would also be great to have some analog of nvcc for compiling pure device kernels. With CUDA we can use nvcc to compile a .cu file into a binary cubin (or first into PTX, the counterpart of IL, edit the PTX by hand, and then build the cubin) and later load that cubin through the driver-level CUDA API (cuModuleLoadData(), cuModuleGetFunction(), etc.). This way we can use all the features of IL/PTX without additional complexity (i.e. without writing everything from scratch in IL/PTX). Setting up inputs and outputs isn't a problem.

genaganna · ‎12-23-2009

>I guess all the necessary functionality is already present in OpenCL, so we don't need to write our own C compiler. We just need a convenient interface to access the IL source code after clCreateProgramWithSource() / clBuildProgram().

Set/export the environment variable GPU_DUMP_DEVICE_KERNEL=3. It saves both .il and .isa files in the local directory.
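If setting the variable in the shell is inconvenient, it can also be set from inside the host program. The sketch below is an illustration rather than anything from the SDK documentation: the helper name enable_kernel_dump is made up, setenv() is POSIX-only, and it assumes the runtime reads the variable no earlier than the first OpenCL call, so the helper should run before any OpenCL API is touched.

```c
#define _POSIX_C_SOURCE 200112L  /* exposes setenv() in <stdlib.h> */
#include <stdlib.h>

/* Hypothetical helper: request IL/ISA dumps from the ATI OpenCL runtime.
   Assumption: the runtime has not been initialized yet when this runs.
   Returns 0 on success and -1 on failure, exactly as setenv() does. */
int enable_kernel_dump(void)
{
    return setenv("GPU_DUMP_DEVICE_KERNEL", "3", 1);
}
```

Calling enable_kernel_dump() as the first statement of main(), ahead of clGetPlatformIDs(), keeps that ordering assumption easy to verify.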

empty_knapsack · ‎12-23-2009

Thanks, nice option, genaganna.

There's only one problem here: when there are several different GPUs in the system (I currently have a 5770 + 4770), calclCompile and calclLink are called separately for each device type, and the IL source code actually differs between RV770 and RV8X0. So it would be good to add the device target ID to the output filename. Currently the outputs for different devices overwrite each other, and we are left with only one set of .il and .isa files even though two or more were actually generated.

Anyway, it can be worked around, so it's not a big problem.
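One possible workaround, sketched here under stated assumptions rather than taken from the thread: call clBuildProgram() for one device at a time and rename the dumped files after each build so the next build cannot overwrite them. The helper names, the example file name kernel.il, and the tag RV770 are hypothetical; the actual dump file names are whatever the runtime writes into the working directory.

```c
#include <stdio.h>
#include <string.h>

/* Builds "<stem>_<tag><ext>" from a bare file name, for example
   "kernel.il" + "RV770" -> "kernel_RV770.il". Expects a name without
   directory components. Returns 0 on success, -1 if `out` is too small. */
int tagged_name(const char *name, const char *tag, char *out, size_t out_len)
{
    const char *dot = strrchr(name, '.');
    size_t stem_len = dot ? (size_t)(dot - name) : strlen(name);
    int n = snprintf(out, out_len, "%.*s_%s%s",
                     (int)stem_len, name, tag, dot ? dot : "");
    return (n >= 0 && (size_t)n < out_len) ? 0 : -1;
}

/* Renames a dump file to its device-tagged name. Intended to be called
   right after building for a single device. Returns 0 on success. */
int keep_dump(const char *name, const char *tag)
{
    char tagged[512];
    if (tagged_name(name, tag, tagged, sizeof tagged) != 0)
        return -1;
    return rename(name, tagged);
}
```

The tag could come from clGetDeviceInfo() with CL_DEVICE_NAME, so each device's .il and .isa files end up with a distinct name.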

holomatix · ‎12-22-2009

Hi Micah

Only joking about the turkey! 🙂

There are certainly still a lot of hardware specific details for an OpenCL developer to consider. With the advent of OpenCL I was hoping that the individual compilers might abstract some of this away.

Is it true that this is the case with the newer cards, and it's only relatively old cards that might have issues? Or are the various GPU platforms just so different that we will always end up writing several versions to get decent performance?

Lastly, I'm looking at a 5970 - very expensive - am I right to expect blistering OpenCL performance from it?

Many thanks for your help, Micah

All the best

Matt Taylor

MicahVillmow · ‎12-18-2009

empty_knapsack,
Dumping of IL/ISA is something that will be exposed in the next release.

MicahVillmow · ‎12-21-2009

kbrafford,
Constant buffer access is about 10x faster than the texture cache (on RV770; I'm not sure about the 8XX series). One reason for this is that constant access occurs during an ALU CF clause and not a TEX CF clause, so multiple lookups can happen per cycle. Constant access is almost as fast as register access. The ISA document has a section on constant file port restrictions, and these are the bottleneck for constant accesses.

MicahVillmow · ‎12-22-2009

holomatix,
As our compiler stack matures, we will be able to do more device-specific optimizations so the user does not have to do them. However, if you write your code using vectors, it should map very well to both the CPU, using SSE, and the GPU's VLIW architecture. As for the 5970, there are some reports of the second chip not clocking up correctly, but a 5970 should perform equivalently to two 5870s.
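As a sketch of what writing code "using vectors" can look like (presumably OpenCL vector types such as float4; this example is not from the post), here is a SAXPY kernel in which each work-item handles one float4. The kernel name saxpy4 and the choice of SAXPY are assumptions for illustration. The OpenCL C is held in a string literal so the listing builds as plain C, and the C function is a CPU reference that walks the same data four lanes at a time.

```c
#include <stddef.h>

/* OpenCL C source (hypothetical kernel): y = a * x + y on float4 elements.
   The scalar `a` is widened to all four lanes by the language. */
const char *const saxpy4_src =
    "__kernel void saxpy4(const float a,\n"
    "                     __global const float4 *x,\n"
    "                     __global float4 *y)\n"
    "{\n"
    "    size_t gid = get_global_id(0);\n"
    "    y[gid] = a * x[gid] + y[gid];\n"
    "}\n";

/* CPU reference: n_vec4 is the number of float4 elements, so the arrays
   hold 4 * n_vec4 floats each. */
void saxpy4_reference(float a, const float *x, float *y, size_t n_vec4)
{
    for (size_t v = 0; v < n_vec4; ++v)
        for (size_t lane = 0; lane < 4; ++lane)
            y[4 * v + lane] = a * x[4 * v + lane] + y[4 * v + lane];
}
```

With this layout the global work size is the element count divided by four, so the host needs to pad or handle any remainder separately.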

MicahVillmow · ‎12-23-2009

empty_knapsack, thanks for the feedback, I'll get this added for the next release.
