I upgraded my 4850 to a 5870, but CLInfo still says that I don't get image support. Should I just reinstall the 2.1 SDK? Should I uninstall something first?
Oh, and my samples that used to run in 700ms now take 3700ms. What did I do?!?
in the shell environment (if using a Bourne shell variant) where the applications are being run. On the older SDK 2.0, this enables image support. This environment variable setting is necessary in the shell of OpenCL applications (just leave the X server as-is). I can attest that it works.
There are many undocumented environment variable settings: http://forums.amd.com/forum/messageview.cfm?catid=390&threadid=128237 . Naturally, these are all at your own risk. However, I have had good experiences with images on the 5870.
Thanks for the tip. I tried that and it worked, except CLInfo failed at the end (after reporting that I did have image support).
So I resorted to removing ALL ATI software from my system, including the video driver (just like I did with the Thinkpad, from the other discussion thread). And after two tries, I got what appear to be good, solid drivers, with the Stream SDK code in the right place.
I get good info from CLInfo, get about 733fps from simplegl.exe, simpleimage.exe works without crashing, and my actual work code gets 2x speedup moving from the 4850 to the 5870.
Thank you for the performance comparison for images between the 4850 and 5870. That information is valuable to me. This morning, someone asked me about DGEMM on the 4870. I know that memory buffers are kind of broken on pre-Evergreen hardware as local memory is implemented in global memory. So if images work, that's the way to support older architectures like R700 (especially as the texture units have L1 cache and the memory buffers do not).
"Thank you for the performance comparison for images between the 4850 and 5870. That information is valuable to me."
Whoa!!! Don't take my numbers and assume that they reflect what you will see with your application. I designed a specific algorithm with my old card (the 4850) in mind, and the fact that it improves only 2x on the 5870 is (I think) because I made a point of using no local memory. And I didn't use images on the 4850, because they weren't available. The reason I was asking about images is that now that I have a decent card, I want to be able to use them!
I think most codes will see a much greater than 2x speedup with the 5870. I mean, they should--there are >2x the number of stream processors, AND there is real local memory, as well as the texture caches.
"Whoa!!!" Thanks for the caution. I re-read your post after I replied and realized that I probably inferred a bit too much.
If OpenCL images work on R700 at all, that's encouraging for the problems I am working with as they are essentially bottlenecked by PCIe bus data transfer (arithmetic intensity not high enough). I need to use either local memory or images. As local memory doesn't really work on R700, that means images are the way to go if it is going to work at all.
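To make the "arithmetic intensity not high enough" point concrete, here is a rough back-of-the-envelope sketch in Python. It assumes a square single-precision matrix multiply where A and B are sent to the device and C is read back, and it compares bus time against compute time. The function names and the bandwidth and throughput figures (5 GB/s, 1000 GFLOPS) are hypothetical placeholders for illustration, not measurements from any card.

```python
def sgemm_costs(n, pcie_gb_per_s, device_gflops, bytes_per_element=4):
    """Rough cost model for C = A * B with square n x n matrices.

    All inputs are assumptions supplied by the caller, not measured values.
    """
    flops = 2.0 * n ** 3                           # one multiply and one add per inner-product term
    bytes_moved = 3.0 * n * n * bytes_per_element  # send A and B, read back C
    return {
        "intensity": flops / bytes_moved,          # flops per byte crossing the bus
        "transfer_s": bytes_moved / (pcie_gb_per_s * 1e9),
        "compute_s": flops / (device_gflops * 1e9),
    }


def transfer_bound(costs):
    """True when moving the data takes longer than computing on it."""
    return costs["transfer_s"] > costs["compute_s"]


# Hypothetical figures: 5 GB/s effective PCIe bandwidth, 1000 GFLOPS sustained.
for n in (256, 1024, 4096):
    c = sgemm_costs(n, pcie_gb_per_s=5.0, device_gflops=1000.0)
    print(n, round(c["intensity"], 1), transfer_bound(c))
```

In this model the intensity grows as n/6 flops per byte, so small problems stay limited by the bus no matter how fast the kernel is. With real numbers for your own bus and kernel plugged in, it gives a quick idea of where the crossover might be.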
Some tips from experience - local memory on the Evergreen is relatively slow even if access is fully coalesced. The lack of L1 cache hurts. Images are fast. However, if PCIe bus data transfer is also counted, then memory buffers can be faster. There's some overhead with images, probably related to writing through the cache? It's going to depend on the effective hierarchy between host and device memory and pattern of data transfers. I know this is not directly related to the original question of this topic but here's what I mean: http://golem5.org/gatlas/bench_sgemm/bench_sgemm_pcie.html .
"Some tips from experience - local memory on the Evergreen is relatively slow even if access is fully coalesced."
Then that might explain why my 1KFFT on the 4850 (that beats the AMD 1KFFT sample code, on that card) still beats the AMD 1KFFT on my 5870. Don't get me wrong, the AMD code sped up tremendously when I ran it on the new card, but my own code also sped up about 2x, keeping my implementation about twice as fast as the AMD code, looking at wall time.
I see. So I think what you are saying is that the memory is fast but the array subscript addressing adds overhead (calculating memory addresses is not free). That is the reason for the performance difference (why image based kernels are much faster for matrix multiplication).
Thanks for the clarification. I have read many other thread topics in which others discussed the reasons for observed performance. What you say here makes me remember those discussions.
Is it possible to do an FFT using images instead of local memory?
The answer is of course yes. Googling for "FFT texture sampling gpgpu" turned up these two links, which have examples of old-school GPGPU FFT implementations:
( I apologize to AMD for posting these links to nVidia and Intel but I think it helps developers to fully utilize the goodness of ATI products! )
I have no personal experience implementing the FFT in a stream kernel. The last time I did anything with the FFT was in grad school using MATLAB. But as you can see, it can be done with images. More generally, almost everything we calculate is a finite-dimensional problem that reduces to numerical linear algebra at some point. So if we can do linear algebra with images, then we can do pretty much anything else too. (If someone knows why I am wrong, please let me know.)
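As a small illustration of the "reduces to linear algebra" point, here is a minimal Python sketch that computes a discrete Fourier transform as a plain matrix-vector product. This is the O(n^2) DFT, not an FFT (an FFT is a fast factorization of the same matrix), and it says nothing about how an image-based kernel would lay out the data. The helper names are made up for this example.

```python
import cmath


def dft_matrix(n):
    """Return the n x n DFT matrix W with W[k][j] = exp(-2*pi*i*j*k/n)."""
    return [[cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n)]
            for k in range(n)]


def mat_vec(matrix, vector):
    """Dense matrix-vector product, one inner product per output row."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def dft(signal):
    """Discrete Fourier transform written as W times the signal."""
    return mat_vec(dft_matrix(len(signal)), signal)


# A constant signal has all of its energy in the zero-frequency bin.
spectrum = dft([1.0, 1.0, 1.0, 1.0])
print([round(abs(v), 6) for v in spectrum])  # prints [4.0, 0.0, 0.0, 0.0]
```

Each output element is just an inner product of a matrix row with the input, which is the same access pattern as a matrix multiply kernel.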