14 Replies Latest reply on Feb 21, 2011 3:24 AM by himanshu.gautam

    OpenCL Benchmark

    spectral

      Hi,

      I've seen this interesting benchmark that uses the open source SLG raytracer to compare performance across different video cards.

      It is very surprising... we have more 'compute' power on the AMD cards... but in the end they are slower.

      http://www.anandtech.com/show/4135/nvidias-geforce-gtx-560-ti-upsetting-the-250-market/15

      Any advice is welcome, because I have a similar application and would like to tune it for AMD specifically.

       

      Thx

        • OpenCL Benchmark
          rick.weber

          Benchmarks like these should always be taken with a grain of salt. While correctness is portable in OpenCL (excepting vendor bugs), performance is not. To achieve the highest performance on ATI hardware, you really need to use 128-bit loads and 128-bit operations. However, I've found that this usually hurts performance on Nvidia hardware. So, in the end, a fair comparison requires you to write two shaders (and possibly two frontend calls).
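
As a rough illustration of the "two shaders" point above, here is a minimal sketch in Python that keeps two variants of one kernel as OpenCL C source strings and picks one from the platform vendor string. The `scale` kernel, `pick_variant` and `global_size` are hypothetical names made up for this example, and the AMD-gets-float4 / NVIDIA-gets-scalar mapping only encodes the claim in this post; it should be measured, not assumed.

```python
# Two variants of the same (made-up) kernel: one using 128-bit float4
# accesses, one using plain 32-bit floats.
KERNEL_FLOAT4 = """
__kernel void scale(__global const float4 *src, __global float4 *dst, const float k)
{
    size_t i = get_global_id(0);
    dst[i] = src[i] * k;   /* one 128-bit load, one 128-bit store */
}
"""

KERNEL_SCALAR = """
__kernel void scale(__global const float *src, __global float *dst, const float k)
{
    size_t i = get_global_id(0);
    dst[i] = src[i] * k;   /* one 32-bit load, one 32-bit store */
}
"""


def pick_variant(platform_vendor):
    """Return (name, source, floats_per_work_item) for a vendor string.

    Assumption: vectorized code for AMD, scalar code for everything else,
    as suggested in the post above.
    """
    vendor = platform_vendor.lower()
    if "advanced micro devices" in vendor or "amd" in vendor:
        return ("float4", KERNEL_FLOAT4, 4)
    return ("scalar", KERNEL_SCALAR, 1)


def global_size(n_floats, floats_per_work_item):
    """Number of work-items needed to cover n_floats elements (rounded up)."""
    return (n_floats + floats_per_work_item - 1) // floats_per_work_item


if __name__ == "__main__":
    for vendor in ("Advanced Micro Devices, Inc.", "NVIDIA Corporation"):
        name, _source, width = pick_variant(vendor)
        print(vendor, "->", name, "variant,", global_size(1024, width), "work-items")
```

The second value matters for the "two frontend calls" remark: the float4 variant is launched with a quarter of the work-items, so the host code differs as well, and the buffer length has to be padded to a multiple of 4 floats.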

            • OpenCL Benchmark
              Meteorhead

              Why does it hurt performance on NV cards? Most DX9-11 games use float4 vectors and 4x4 matrices for all transformations. Do 128-bit loads and stores really hurt NV that much?

              I read that 69xx cards feature HW accel of scalar loads and stores. Hope that will increase AMD performance in such tests, however it's hard for me to believe NV cards have such a hard time with vector operations. (Simple unrolling is done for them most likely)

              • OpenCL Benchmark
                davibu

                For the record, I started developing SLG on a 4870 and then continued on a 5870/5850. It uses 128-bit loads/stores and float4 for most operations. I use an NVIDIA 240GT only for compatibility testing.

                It could be considered a benchmark somewhat biased toward the AMD platform.

                It is just that NVIDIA has nearly doubled its performance with the latest drivers. If you check older Anandtech reviews (e.g. 6870, 580GTX, etc.), you can see that AMD held the performance crown for a while.

                It seems NVIDIA has done a really good job improving the quality of their OpenCL driver.

                 

                  • OpenCL Benchmark
                    kbrafford

                    Since the AMD 6970/5870 have 4-issue/5-issue VLIWs, it would seem that float4 code should run much faster on AMD than on NVIDIA. How does NVIDIA handle those... does it have to stop and do each element of the float4 serially?

                      • OpenCL Benchmark
                        Meteorhead

                        It is not much faster than it was before. The biggest trick is getting the same amount of power out of the 4-way VLIW as out of the 5-way one. (When the 5-way VLIW was designed for the HD2xxx, they knew most operations were 4-wide vector operations, but they saw fit to add the Special Function Unit.) Now, increasing DP capacity and reducing SIMD size (thus allowing more SIMD engines on the same die) seemed reason enough to try changing from 5 to 4. But this change has little to do with handling vectors faster.

                          • OpenCL Benchmark
                            kbrafford

                            I realize that... I guess what I am really asking the group is: absent VLIW in any form, how do the NVIDIA cards handle float4s? Does the NVIDIA compiler turn a float4 into 4 operations?

                              • OpenCL Benchmark
                                davibu

                                 

                                Originally posted by: kbrafford I realize that... I guess what I am really asking the group is: absent VLIW in any form, how do the NVIDIA cards handle float4s? Does the NVIDIA compiler turn a float4 into 4 operations?

                                 

                                It should, but NVIDIA is supposed to have a superscalar architecture, so there should be some kind of parallel execution of float4 operations on NVIDIA hardware too (much like a modern CPU can execute instructions in parallel when there are no dependencies between them). The good results linked above seem to confirm this.
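
To make "turning a float4 into 4 operations" concrete, here is a small Python sketch of a float4 multiply-add written once as a vector operation and once as four scalar operations. It is only a model of the idea, not a description of what the NVIDIA compiler actually emits; `mad4_vector` and `mad4_scalarized` are hypothetical names.

```python
def mad4_vector(a, b, c):
    """float4 multiply-add, a * b + c, applied lane by lane."""
    return tuple(x * y + z for x, y, z in zip(a, b, c))


def mad4_scalarized(a, b, c):
    """The same operation written as four scalar multiply-adds.

    No line reads a value produced by another line, so the four
    operations are independent and could be issued in any order,
    or side by side on hardware that allows it.
    """
    r0 = a[0] * b[0] + c[0]
    r1 = a[1] * b[1] + c[1]
    r2 = a[2] * b[2] + c[2]
    r3 = a[3] * b[3] + c[3]
    return (r0, r1, r2, r3)


if __name__ == "__main__":
    a = (1.0, 2.0, 3.0, 4.0)
    b = (0.5, 0.5, 0.5, 0.5)
    c = (1.0, 1.0, 1.0, 1.0)
    print(mad4_vector(a, b, c) == mad4_scalarized(a, b, c))
```

The absence of dependencies between the four lanes is what would let either a VLIW compiler pack them into one bundle or a scalar machine overlap them.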

                                 

                                  • OpenCL Benchmark
                                    Jawed

                                    Is SLG compute bound?

                                      • OpenCL Benchmark
                                        davibu

                                         

                                        Originally posted by: Jawed Is SLG compute bound?

                                         

                                         

                                        I would expect it to be more memory bound than compute bound, mostly because of the scattered memory accesses typical of any ray tracer.

                                        However, it still does a non-trivial amount of computation (e.g. according to KernelAnalyzer, the ratio between memory operations and compute operations isn't bad).

                                         

                                          • OpenCL Benchmark
                                            nou

                                            davibu, I tried SLG in the AMD profiler (there is a Linux version too). ALU busy was around 50% with ALU packing at 80%, and the ALU:Fetch ratio was around 10.
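
For readers unfamiliar with these counters, here is a toy Python model of two of them. It assumes ALU packing means the fraction of VLIW slots holding a useful operation and ALU:Fetch means ALU instructions divided by fetch instructions; the function names and the bundle lists are invented for illustration and are not profiler output.

```python
def alu_packing(slots_used_per_bundle, vliw_width):
    """Fraction of VLIW slots that hold a useful operation."""
    bundles = len(slots_used_per_bundle)
    if bundles == 0:
        raise ValueError("need at least one bundle")
    if any(s < 0 or s > vliw_width for s in slots_used_per_bundle):
        raise ValueError("a bundle cannot use more slots than the VLIW width")
    return sum(slots_used_per_bundle) / (bundles * vliw_width)


def alu_fetch_ratio(alu_instructions, fetch_instructions):
    """Ratio of ALU instructions to fetch instructions."""
    return alu_instructions / fetch_instructions


if __name__ == "__main__":
    # Hypothetical kernel in which every bundle carries exactly one float4 op.
    float4_only = [4] * 10
    print("5-wide VLIW packing:", alu_packing(float4_only, 5))
    print("4-wide VLIW packing:", alu_packing(float4_only, 4))
    print("ALU:Fetch:", alu_fetch_ratio(100, 10))
```

In this model, code made purely of float4 operations fills 4 of 5 slots on a 5-wide VLIW, which happens to give the same 80% figure; whether that is the actual reason for the measured value is only a guess.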

                                              • OpenCL Benchmark
                                                davibu

                                                 

                                                Originally posted by: nou davibu, I tried SLG in the AMD profiler (there is a Linux version too). ALU busy was around 50% with ALU packing at 80%, and the ALU:Fetch ratio was around 10.

                                                 

                                                 

                                                This seems to confirm my original idea: it sits somewhere between memory bound and compute bound. What is a typical "ALU busy" value?

                                                I know 100% is optimal, but I assume most applications do not reach that value.

                                                 

                                                 

                                                  • OpenCL Benchmark
                                                    himanshu.gautam

                                                    Well, I am not well versed in raytracing problems, but to me an ALU busy of 80% is pretty good. Of course, the optimum value for ALU busy depends heavily on the algorithm being ported. I have also seen ALU busy values close to the maximum in some algebraic algorithms.

                                                     

                                • OpenCL Benchmark
                                  MicahVillmow
                                  meteorhead,
                                  Graphics are different because they don't usually use the same hardware paths. Graphics use textures heavily where tiling modes can be highly tuned to make sure that data is read in optimally.