I've seen this interesting benchmark that uses the SLG open-source raytracer to compare performance across different video cards.
It is very surprising... we have more 'compute' power on the AMD cards... but in the end they are slower.
If you have any advice, I'd appreciate it... I have a similar application and would like to tune it for AMD specifically.
Benchmarks like these should always be taken with a grain of salt. While correctness is portable in OpenCL (excepting vendor bugs), performance is not. To achieve the highest performance on ATI hardware, you really need to use 128-bit loads and 128-bit operations. However, I've found that this usually hurts performance on Nvidia hardware. So, in the end, a fair comparison requires you to write two shaders (and possibly two frontend calls).
Why does it hurt performance on NV cards? Most DX9-11 games use float4 vectors and 4x4 matrices for all transformations. Do 128-bit loads and stores really hurt NV that much?
I read that the 69xx cards feature hardware acceleration of scalar loads and stores. I hope that will increase AMD performance in such tests; however, it's hard for me to believe NV cards have such a hard time with vector operations. (Most likely they are simply unrolled.)
Maybe I should walk those statements back a little and just say that performance isn't portable. Here is S/DGEMM running on a 5870 and a C2050:
We had to write separate and dramatically different kernels to fully exploit their capabilities.
For the record, I started developing SLG on a 4870 and then continued on a 5870/5850. It uses 128-bit loads/stores and float4 for most operations. I use an NVIDIA 240GT only for compatibility testing.
It could be considered a benchmark somewhat biased toward the AMD platform.
It is just that NVIDIA has nearly doubled their performance with the latest drivers. If you check older Anandtech reviews (e.g. 6870, 580GTX, etc.), you can see that AMD held the performance crown for a while.
It seems NVIDIA has done a really good job improving the quality of their OpenCL driver.
Since the AMD 6970/5870 have 4-issue/5-issue VLIWs, it would seem that float4 operations should run much faster on AMD than on NVIDIA. How does NVIDIA handle them... does it have to stop and do each element of the float4 serially?
It does not run much faster than it did before. The biggest trick is getting the same amount of throughput out of the 4-way VLIW as out of the 5-way VLIW. (When the 5-way VLIW was designed with the HD2xxx series, they knew most operations were 4-wide vector operations, but they saw fit to add the Special Function Unit.) Increasing DP capacity and reducing SIMD size (thus allowing more SIMD engines on the same die) seemed reason enough to change from 5 to 4, but this change has little to do with handling vectors faster.
I realize that... I guess what I am really asking the group is: absent VLIW in any form, how do the NVIDIA cards handle float4s? Does the NVIDIA compiler turn a float4 into 4 operations?
Originally posted by: kbrafford I realize that... I guess what I am really asking the group is: absent VLIW in any form, how do the NVIDIA cards handle float4s? Does the NVIDIA compiler turn a float4 into 4 operations?
It should, but NVIDIA is supposed to have a superscalar architecture, so there should be some kind of parallel execution of float4 operations on NVIDIA hardware too (i.e. like a modern CPU, which is able to execute instructions in parallel if there are no dependencies). This seems confirmed by the good results linked above.
Originally posted by: Jawed Is SLG compute bound?
I would expect it to be more memory bound than compute bound, mostly because of the scattered memory accesses typical of any ray tracer.
However, it still does a non-trivial amount of computation too (i.e. according to the KernelAnalyzer, the ratio between memory ops and compute ops isn't bad).
Originally posted by: nou davibu, I tried SLG in the AMD profiler (there is a Linux version too). ALU busy was around 50%, with ALU packing at 80% and an ALU:Fetch ratio around 10.
This seems to confirm my original idea: somewhere halfway between memory bound and compute bound. What is an average "ALU busy" value?
I know 100% is optimal, but I assume most applications do not reach that value.
Well, I am not very well versed in the ray-tracing problem, but to me an ALU busy of 80% is pretty good. Of course, the optimal ALU busy value hugely depends on the algorithm being ported. I have also seen ALU busy values close to the maximum in some algebraic algorithms.