Hi,
We're developing using OpenCL, and have one dev machine with an NVIDIA GTX 260 and another with an ATI 4870. Both seem to be mid-range cards, and they are similarly priced.
However, I'm sorry to say we are getting approximately 5x the performance from the NVIDIA card compared with the ATI. We're using the same OpenCL kernel on both, with each vendor's own SDK - in the case of the ATI, Stream SDK 2.0 beta 4.
Is this performance gulf due to the early stage of ATI's OpenCL support? Is the implementation not well optimised yet? If so, how soon can we expect this gap to close? Or were we mistaken about the equivalence of the two cards?
To be honest, I was expecting more from the 4870 - the specs certainly seemed to imply it was fairly powerful - yet currently my CPU (Core 2 Quad, 2.4 GHz) can outperform it by roughly a factor of two. Something definitely seems amiss!
Many thanks for any information
Best Regards
Matt Taylor
> For example, if you are using local memory, it is currently emulated in global memory.
Will that change in the near future?
It would seem to me that any algorithm that uses precomputed table lookups would benefit from fast, on-chip memory. If people clamor for it, will ATI reconsider?
Hi Micah
Thanks a lot for your information - it's very helpful.
So with the 4870, it seems I have purchased a turkey - just in time for Christmas! 🙂
Seriously though, I suppose we'll just use it to check compatibility, and perhaps upgrade to a 5xxx series in the new year.
Thanks again, and merry Christmas
Matt Taylor
Micah, if I declare my lookup tables that way, as constants that are part of the CL program, how much data can I use that way? Is that what the 64K constant number that CLInfo reports means?
Also, how many of these memory accesses per second will I get? If I have 10 compute units, and each of those runs many threads (256, right? I'm still figuring this stuff out), how many constant reads per clock cycle should I expect?
Thanks for all your help, btw.
Also, once samplers are implemented using the texture caches, why wouldn't it be faster to use a sampler with nearest neighbor filtering in order to implement lookup tables?
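For what it's worth, here's a minimal sketch of the kind of program-scope constant table being asked about (kernel side only; the table contents and kernel name are hypothetical, and whether the reads actually hit a hardware constant cache depends on the implementation):

```c
/* Sketch of an OpenCL C kernel using a lookup table in the __constant
 * address space. The 64K figure CLInfo reports corresponds to
 * CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE, which bounds how much data one
 * such buffer can hold. Table values here are placeholders. */
__constant float lut[256] = { 0.0f /* ... precomputed values ... */ };

__kernel void apply_lut(__global const uchar *in, __global float *out)
{
    size_t i = get_global_id(0);
    out[i] = lut[in[i]];   /* one constant read per work-item */
}
```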
By the way, it would be great if ATI/AMD released a C99-to-IL compiler. Currently, to get good performance on the 4xxx series you're forced to use CAL/IL, and programming in "pure" IL takes far too much effort.
It is possible to write a .cl kernel, compile it via OpenCL, and intercept the IL code that OpenCL passes to calclCompile, but that involves too many unnecessary steps. How about an easier way?
(Some subset of) C -> IL -> manual IL editing -> profit.
empty_knapsack: I actually wrote a small compiler from a tiny C-like language to IL once. It's currently broken because of another project, but I can release it soon.
Originally posted by: rahulgarg empty_knapsack: I actually wrote a small compiler from a tiny C-like language to IL once. It's currently broken because of another project, but I can release it soon.
I guess all the necessary functionality is already present in OpenCL, so we don't need to write our own C compiler. We just need a convenient interface to access the IL source code after clCreateProgramWithSource() / clBuildProgram().
And it would be just great to have some analog of nvcc for compiling pure device kernels. I mean, with CUDA we can compile a .cu file into a binary cubin (or first into PTX, which is the equivalent of IL, manually edit the PTX, then assemble it into a cubin) with the help of nvcc, and later use this cubin with the driver-level CUDA API (cuModuleLoadData(), cuModuleGetFunction(), etc.). This way we can use all the features of IL/PTX without additional complexity (i.e. without using IL/PTX for everything and writing code from scratch). Setting up inputs/outputs isn't a problem.
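The CUDA driver-API flow being described goes roughly like this (a sketch only - it needs the CUDA toolkit to compile, error checking is omitted, and the kernel name is hypothetical):

```c
#include <cuda.h>   /* CUDA driver API */

/* Sketch of the nvcc + driver API workflow described above:
 * "nvcc -cubin kernel.cu" produces a cubin (optionally via
 * hand-editable PTX), which the driver API loads at run time. */
void launch_from_cubin(const char *cubin_image)
{
    CUmodule   mod;
    CUfunction fn;

    cuInit(0);
    /* ... create a context for the chosen device ... */
    cuModuleLoadData(&mod, cubin_image);          /* load the precompiled cubin */
    cuModuleGetFunction(&fn, mod, "my_kernel");   /* "my_kernel" is hypothetical */
    /* ... set arguments and launch (cuParamSet* / cuLaunchGrid) ... */
}
```

The request, then, is for an equivalent path on the ATI side: compile C to IL offline, edit the IL, and load the result through CAL.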
Set/Export GPU_DUMP_DEVICE_KERNEL=3. It saves both .il and .isa files in local directory.
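In practice that looks like the following (the host program name is hypothetical; run your own app):

```shell
# With GPU_DUMP_DEVICE_KERNEL=3 the runtime writes both the .il and
# the .isa for every compiled kernel into the current directory.
export GPU_DUMP_DEVICE_KERNEL=3
# ./my_opencl_app    # (hypothetical) run your host program as usual
# ls *.il *.isa      # then inspect the dumped IL and ISA
```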
Thanks, nice option, genaganna.
The only problem here: when there are several different GPUs in the system (like my current 5770 + 4770), calclCompile and calclLink are called once per device type, and the IL source actually differs between RV770 and RV8x0. So it would be good to add the device target ID to the output filename - currently the outputs for different devices overwrite each other, and we're left with only one set of .il and .isa files when two or more were actually generated.
Anyway, it can be bypassed, so not a big problem.
Hi Micah
Only joking about the turkey! 🙂
There are certainly still a lot of hardware specific details for an OpenCL developer to consider. With the advent of OpenCL I was hoping that the individual compilers might abstract some of this away.
Is it the case that only the relatively old cards have these issues, and the newer ones behave? Or are the various GPU platforms just so different that we will always end up writing several versions to get decent performance?
Lastly, I'm looking at a 5970 - very expensive - am I right to expect blistering OpenCL performance from it?
Many thanks for your help, Micah
All the best
Matt Taylor