11 Replies Latest reply on Sep 8, 2009 10:02 AM by jarmniku

    Matrix Multiplication Sample, slow?

    riza.guntur

      I feel the speed is slow compared to my own single-threaded matrix multiplication program.

      Why is that? I'm running on an Intel Core 2 Quad Q6600 at 3.3 GHz.

      Task Manager shows 100% processor usage while the OpenCL example runs.

        • Matrix Multiplication Sample, slow?
          riza.guntur

          Even the BlackScholes example is slow. The single-threaded sample in ATI Stream SDK 1.4b, used for comparison with the CPU, shows very nearly the same performance.

          What is actually happening?

          Does OpenCL utilize SSE3 in this case?

          • Matrix Multiplication Sample, slow?
            omkaranathan

            The SDK samples provided with this beta release of the ATI Stream SDK are not necessarily tuned for optimal performance.

             

             

            Originally posted by: riza.guntur I feel the speed is slow compared to my own single-threaded matrix multiplication program.

            Which algorithm are you using in your program? Could you try writing the same algorithm as an OpenCL kernel and give a comparison?
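As a rough illustration of what such a comparison could look like, here is a minimal sketch of a naive single-threaded matrix multiplication in C, together with the same algorithm written as OpenCL C kernel source. The names `matmul_reference`, `matmul_kernel_source`, and the kernel name `matmul` are hypothetical, and square row-major matrices are assumed; this is not the algorithm used by the SDK sample.

```c
#include <stddef.h>

/* Naive single-threaded reference: c = a * b for n x n row-major matrices.
 * Hypothetical helper for illustration, not the SDK sample's code. */
void matmul_reference(const float *a, const float *b, float *c, size_t n)
{
    for (size_t row = 0; row < n; ++row) {
        for (size_t col = 0; col < n; ++col) {
            float sum = 0.0f;
            for (size_t k = 0; k < n; ++k) {
                sum += a[row * n + k] * b[k * n + col];
            }
            c[row * n + col] = sum;
        }
    }
}

/* The same algorithm as OpenCL C source: each work-item computes one
 * output element, so the two outer loops become the 2D global range.
 * Kernel name and argument layout are assumptions for this sketch. */
const char *const matmul_kernel_source =
    "__kernel void matmul(__global const float *a,\n"
    "                     __global const float *b,\n"
    "                     __global float *c,\n"
    "                     const int n)\n"
    "{\n"
    "    int col = get_global_id(0);\n"
    "    int row = get_global_id(1);\n"
    "    float sum = 0.0f;\n"
    "    for (int k = 0; k < n; ++k) {\n"
    "        sum += a[row * n + k] * b[k * n + col];\n"
    "    }\n"
    "    c[row * n + col] = sum;\n"
    "}\n";
```

Timing both versions on the same matrices, with kernel compilation excluded from the OpenCL measurement, would give a like-for-like comparison.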

             

             

            Even the BlackScholes example is slow. The single-threaded sample in ATI Stream SDK 1.4b, used for comparison with the CPU, shows very nearly the same performance.

            What are the data sizes you are using for the BlackScholes sample?

              • Matrix Multiplication Sample, slow?
                n0thing

                The BlackScholes sample uses a data size of 4096 by default. I am getting around 750 options/sec. Most of the time is taken by kernel compilation: if you measure the setup time (the setupCL function), you will get around 5 seconds. In the future you should be able to compile kernels offline, I guess.
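To see how much a fixed setup cost can distort a small benchmark, the figures above can be plugged into a short calculation. This is a minimal sketch: `total_seconds` and `rate_without_setup` are hypothetical helpers, and the 5-second setup time is only the approximate value quoted above, so the result is very sensitive to it.

```c
/* Wall time implied by a reported throughput: items / (items per second). */
double total_seconds(double items, double items_per_second)
{
    return items / items_per_second;
}

/* Throughput once a fixed setup cost (for example, kernel compilation)
 * is subtracted from the total time. Returns -1.0 when the setup time
 * is not smaller than the total, since no rate can be derived then. */
double rate_without_setup(double items, double total_s, double setup_s)
{
    double compute_s = total_s - setup_s;
    if (compute_s <= 0.0) {
        return -1.0;
    }
    return items / compute_s;
}
```

With 4096 options at 750 options/sec, `total_seconds(4096, 750)` is about 5.46 seconds. If roughly 5 of those seconds are setup, `rate_without_setup(4096, 5.46, 5.0)` comes to somewhere near 8900 options/sec, which suggests the reported figure mostly measures compilation rather than the kernel.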

                  • Matrix Multiplication Sample, slow?
                    riza.guntur

                     

                    Originally posted by: n0thing The BlackScholes sample uses a data size of 4096 by default. I am getting around 750 options/sec. Most of the time is taken by kernel compilation: if you measure the setup time (the setupCL function), you will get around 5 seconds. In the future you should be able to compile kernels offline, I guess.

                    I'm getting 4900 options/sec

                    In Task Manager I see that OpenCL does not create threads if the problem is not big enough. Is that so?

                      • Matrix Multiplication Sample, slow?
                        jarmniku

                        Hi!

                        I tried running a bigger job with the -s option to eliminate start-up latencies, as follows:

                        BlackScholes -s 10000000

                        and got 1.91589e+06 Options per sec (i.e. 1.9 M/s) on my

                        Intel(R) Core(TM)2 CPU          6700  @ 2.66GHz

                        Though only one of the two cores seems to be utilized when running on Ubuntu on top of VMware Player with Windows NT.

                        However, Mandelbrot and SimpleConvolution, for example, run extremely slowly, as much as 100x slower than could be expected. A 1024x1024 convolution with a 3x3 mask needs roughly 150 x 1024 x 1024 machine operations (about 157 million), judging from the .cl kernel, but takes 14 seconds. That is about 11 MOPS on a CPU that should easily achieve 1 GOPS.

                        So there is a huge gap. Any ideas why the performance is so bad? Or am I doing something wrong?
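The estimate above can be checked with a few lines of C. This is a minimal sketch: `ops_per_second` is a hypothetical helper, and the figure of 150 operations per output pixel is the rough count quoted above, not a measured value.

```c
/* Achieved throughput in operations per second for an image-sized job:
 * (operations per output element * width * height) / elapsed seconds. */
double ops_per_second(double ops_per_output, double width, double height,
                      double seconds)
{
    return ops_per_output * width * height / seconds;
}
```

For the convolution case, `ops_per_second(150, 1024, 1024, 14)` gives about 11.2 million operations per second, which is roughly 89 times below the 1 GOPS reference point.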

                         

                    • Matrix Multiplication Sample, slow?
                      riza.guntur

                       

                      Originally posted by: omkaranathan The SDK samples provided with this beta release of the ATI Stream SDK are not necessarily tuned for optimal performance.

                       

                       

                      Originally posted by: riza.guntur I feel the speed is slow compared to my own single-threaded matrix multiplication program.

                      Which algorithm are you using in your program? Could you try writing the same algorithm as an OpenCL kernel and give a comparison?

                       

                       

                      Even the BlackScholes example is slow. The single-threaded sample in ATI Stream SDK 1.4b, used for comparison with the CPU, shows very nearly the same performance.

                      What are the data sizes you are using for the BlackScholes sample?

                      For BlackScholes I'm using 1000000 (one million) samples.

                      I haven't learned OpenCL yet, and I'm confused about where to start. Why are there .cl files (I thought they would be like .br files)? How do I compile them? The Brook+ docs are better, since they explain how to set up a VS project and the other compilation procedures.

                        • Matrix Multiplication Sample, slow?
                          genaganna

                          Riza.guntur,

                          There are two ways to compile kernel code as per the OpenCL spec:

                              1. Compile from source

                              2. Compile from binary

                          At present, the SDK supports only compiling from source.

                          Steps to follow to compile from source:

                              1. Read the kernel code string from your kernel file.

                              2. Use clCreateProgramWithSource and clBuildProgram to get executable code for the given kernel.
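The two steps above can be sketched in C as follows. This is a minimal sketch under stated assumptions, not the SDK sample's code: `read_kernel_source` and `build_program_from_file` are hypothetical helper names, and `USE_OPENCL` is a hypothetical macro to define when the OpenCL headers and library are installed (the header path is `CL/cl.h` on most platforms). The context and device are assumed to have been created elsewhere.

```c
#include <stdio.h>
#include <stdlib.h>

/* Step 1: read the whole kernel file into a NUL-terminated string.
 * Returns NULL on failure; the caller frees the result. */
char *read_kernel_source(const char *path, size_t *length_out)
{
    FILE *f = fopen(path, "rb");
    if (!f) return NULL;
    if (fseek(f, 0, SEEK_END) != 0) { fclose(f); return NULL; }
    long size = ftell(f);
    if (size < 0) { fclose(f); return NULL; }
    rewind(f);

    char *buf = malloc((size_t)size + 1);
    if (!buf) { fclose(f); return NULL; }
    size_t n = fread(buf, 1, (size_t)size, f);
    fclose(f);
    if (n != (size_t)size) { free(buf); return NULL; }

    buf[n] = '\0';
    if (length_out) *length_out = n;
    return buf;
}

#ifdef USE_OPENCL /* hypothetical macro: define when OpenCL is available */
#include <CL/cl.h>

/* Step 2: create a program object from the source string and build it
 * for one device. Prints the build log and returns NULL on failure. */
cl_program build_program_from_file(cl_context context, cl_device_id device,
                                   const char *path)
{
    size_t length = 0;
    char *source = read_kernel_source(path, &length);
    if (!source) return NULL;

    cl_int err = CL_SUCCESS;
    const char *sources[] = { source };
    cl_program program =
        clCreateProgramWithSource(context, 1, sources, &length, &err);
    free(source); /* the runtime copies the source into the program object */
    if (err != CL_SUCCESS) return NULL;

    err = clBuildProgram(program, 1, &device, NULL, NULL, NULL);
    if (err != CL_SUCCESS) {
        char log[4096] = "";
        clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG,
                              sizeof log, log, NULL);
        fprintf(stderr, "clBuildProgram failed:\n%s\n", log);
        clReleaseProgram(program);
        return NULL;
    }
    return program;
}
#endif /* USE_OPENCL */
```

Once the program is built, a kernel object can be obtained from it with clCreateKernel. Because the build happens at run time, it is part of the setup cost that shows up when timing small problem sizes.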