5 Replies Latest reply on Feb 16, 2010 11:52 PM by genaganna

    4850 and 5670 in OpenCL

    dongzaixx

      They cost the same, but I want to know which one performs better in OpenCL programs.

        • 4850 and 5670 in OpenCL
          genaganna

           


           

          Dongzaixx,

          The HD4850 has 10 SIMDs, which means 800 shader processors.

          The HD5670 has 5 SIMDs, which means 400 shader processors.

          In most cases the HD4850 performs better than the HD5670, because it has twice as many shaders.

          But the HD5670 is designed to support OpenCL completely.
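The shader counts quoted above both work out to 80 shader processors per SIMD. A minimal sketch of that arithmetic follows. The split of 80 into 16 lanes of 5 ALUs each is an assumption about the hardware layout that the thread itself does not state, and the function name is made up for illustration.

```python
# Assumed layout (not stated in the thread): each SIMD has 16 lanes,
# and each lane has 5 ALUs, giving 80 shader processors per SIMD.
LANES_PER_SIMD = 16
ALUS_PER_LANE = 5


def shader_processors(simd_count):
    """Return the shader processor count for a given number of SIMDs."""
    return simd_count * LANES_PER_SIMD * ALUS_PER_LANE


if __name__ == "__main__":
    # SIMD counts are the ones quoted in the reply above.
    for name, simds in (("HD4850", 10), ("HD5670", 5)):
        print(name, shader_processors(simds))
```

This only reproduces the raw ALU count. It says nothing about clock speed, memory bandwidth, or local memory support, all of which also affect OpenCL performance.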

           

            • 4850 and 5670 in OpenCL
              dongzaixx

               


               

               

               

              The HD4XXX series is not completely designed for OpenCL, which is why I am asking this question. Do you have any benchmarks?

                • 4850 and 5670 in OpenCL
                  n0thing

                  Unless your OpenCL program uses local memory, the 4850 will always give you better performance.

                  You can test the MatrixMultiplication sample in SDK 2.01. It runs without local memory on the 7xx series and uses local memory on the 8xx series. Use the command-line options -x 256 -y 256 -z 256

                    • 4850 and 5670 in OpenCL
                      gapon

                       


                       

                      Just out of curiosity, I tried this test on an HD5870 and on a CPU (dual-core Intel) to see how much I would benefit from using the GPU instead of the CPU, and I see no gain at all. Here are the commands and their output:

                       

                       

                       

                      % MatrixMultiplication -device cpu -x 256 -y 256 -z 256 -i 16
                      Executing kernel for 16 iterations
                      -------------------------------------------
                      KernelTime (ms) : 0.535729
                      GFlops achieved : 62.6332
                      KernelTime (ms) : 0.532304
                      GFlops achieved : 63.0362
                      KernelTime (ms) : 0.816905
                      GFlops achieved : 41.0751
                      KernelTime (ms) : 0.831021
                      GFlops achieved : 40.3774

                       

                      ..
                      % MatrixMultiplication -device gpu -x 256 -y 256 -z 256 -i 16
                      Executing kernel for 16 iterations
                      -------------------------------------------
                      KernelTime (ms) : 0.53732
                      GFlops achieved : 62.4478
                      KernelTime (ms) : 0.529792
                      GFlops achieved : 63.3351
                      KernelTime (ms) : 1.85189
                      GFlops achieved : 18.119
                      KernelTime (ms) : 0.831176
                      GFlops achieved : 40.3698
                      ..
                      Is this what I should expect?
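The GFlops figures in the output above are consistent with counting 2*x*y*z floating-point operations per multiply and dividing by the reported kernel time. That formula is inferred from the posted numbers, not taken from the sample's source, so treat the sketch below as an assumption. The function name is hypothetical.

```python
def gflops(x, y, z, kernel_time_ms):
    """Estimate GFLOPS for a matrix multiply with dimensions x, y, z.

    Assumes 2*x*y*z floating-point operations (one multiply and one add
    per inner-product term), which matches the figures posted above.
    """
    flops = 2.0 * x * y * z
    seconds = kernel_time_ms / 1000.0
    return flops / seconds / 1e9


if __name__ == "__main__":
    # Kernel times copied from the posted output for -x 256 -y 256 -z 256.
    for time_ms in (0.535729, 0.816905, 1.85189):
        print(time_ms, round(gflops(256, 256, 256, time_ms), 2))
```

With the first posted time (0.535729 ms) this gives roughly 62.63, in line with the "GFlops achieved" line next to it.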

                       

                       

                       

                        • 4850 and 5670 in OpenCL
                          genaganna

                           


                           

                          Gapon,

                          Please run with -x 2048 -y 2048 -z 2048.

                          For smaller dimensions, transfer time dominates the kernel time, which is why you see such poor performance.
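One rough way to see why larger dimensions help is to compare the arithmetic done with the data moved. The sketch below is a simplified model and not a measurement. It assumes square n x n matrices of single-precision floats, with two inputs uploaded and one result read back, and it ignores fixed launch overhead. The function name is hypothetical.

```python
BYTES_PER_FLOAT = 4  # assumption: single-precision elements


def flops_per_byte(n):
    """Floating-point operations per byte transferred for an n x n multiply."""
    flops = 2.0 * n ** 3
    # Two input matrices are uploaded and one result matrix is read back.
    bytes_moved = 3 * n * n * BYTES_PER_FLOAT
    return flops / bytes_moved


if __name__ == "__main__":
    for n in (256, 2048):
        print(n, round(flops_per_byte(n), 2))
```

Under these assumptions the ratio grows linearly with n, so the 2048 case does eight times more arithmetic per transferred byte than the 256 case, which gives the GPU more work to amortize the transfer cost against.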