
riza_guntur
Journeyman III

Matrix Multiplication Sample, slow?

The sample feels slow compared to my own single-threaded matrix multiplication program.

Why? I'm running on an Intel Core 2 Quad Q6600 at 3.3 GHz.

In Task Manager I see the OpenCL example using 100% of the processor.

0 Likes
11 Replies
riza_guntur
Journeyman III

Even the BlackScholes example is slow. The single-threaded sample in ATI Stream SDK 1.4b, which is there for comparison with the CPU, shows nearly the same performance.

What is actually happening?

Does OpenCL utilize SSE3 in this case?

0 Likes
omkaranathan
Adept I

The SDK samples provided with this beta release of the ATI Stream SDK are not necessarily tuned for optimal performance.

 

Originally posted by: riza.guntur The sample feels slow compared to my own single-threaded matrix multiplication program.

Which algorithm are you using in your program? Could you try writing the same algorithm as an OpenCL kernel and give a comparison?

 

Even the BlackScholes example is slow. The single-threaded sample in ATI Stream SDK 1.4b, which is there for comparison with the CPU, shows nearly the same performance.

What data sizes are you using for the BlackScholes sample?

0 Likes

The BlackScholes sample uses a data size of 4096 by default. I am getting around 750 options/sec. Most of the time is taken by kernel compilation: if you measure the setup time (the setupCL function), you will get around 5 seconds. In the future you should be able to compile kernels offline, I guess.
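As a rough sanity check on the claim that setup dominates, the two figures above can be combined in a few lines of C. This is only a sketch: the function names are made up for illustration and are not part of the SDK sample, and it assumes the reported rate is simply the option count divided by total elapsed time.

```c
/* Hypothetical helpers, not part of the ATI Stream SDK samples. */

/* Total elapsed time implied by a reported throughput figure. */
double implied_seconds(double items, double items_per_sec)
{
    return items / items_per_sec;
}

/* Throughput when a fixed setup cost is added to the compute time. */
double throughput_with_setup(double items, double setup_sec, double compute_sec)
{
    return items / (setup_sec + compute_sec);
}
```

With 4096 options at 750 options/sec, implied_seconds gives about 5.5 seconds in total, which is close to the roughly 5 seconds measured for setup alone. That is consistent with the kernel run itself being a small part of the total at this data size.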

0 Likes

Originally posted by: n0thing The BlackScholes sample uses a data size of 4096 by default. I am getting around 750 options/sec. Most of the time is taken by kernel compilation: if you measure the setup time (the setupCL function), you will get around 5 seconds. In the future you should be able to compile kernels offline, I guess.

I'm getting 4900 options/sec.

In Task Manager I see that OpenCL does not create threads if the problem is not big enough. Is that so?

0 Likes

Hi!

I tried running a bigger job with the -s option to eliminate start-up latencies, as follows:

BlackScholes -s 10000000

and got 1.91589e+06 options per second (i.e. 1.9 M/s) on my Intel(R) Core(TM)2 CPU 6700 @ 2.66GHz.

However, only one of the two cores seems to be utilized when running Ubuntu under VMware Player on Windows NT.

Mandelbrot and SimpleConvolution, however, run extremely slowly, as much as 100x slower than could be expected. For example, a 1024x1024 convolution with a 3x3 mask needs roughly 150x1024x1024 machine operations, based on the CL kernel, but takes 14 seconds. That is about 11 MOPS on a CPU that should easily achieve 1 GOPS.

So there is a huge gap. Any ideas why the performance is so bad? Or am I doing something wrong?
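The estimate above can be reproduced with a short calculation. This is a minimal sketch: the 150 operations per pixel and the 14 seconds are the figures quoted in this post, and the function names are made up for illustration.

```c
/* Hypothetical helpers for the back-of-the-envelope estimate. */

/* Estimated operation count: operations per pixel times image size. */
double estimated_ops(double ops_per_pixel, double width, double height)
{
    return ops_per_pixel * width * height;
}

/* Achieved rate in millions of operations per second (MOPS). */
double achieved_mops(double ops, double seconds)
{
    return ops / seconds / 1e6;
}
```

For 150 operations per pixel on a 1024x1024 image, estimated_ops returns 157,286,400. Dividing by 14 seconds gives about 11.2 MOPS, whereas at 1 GOPS the same work would take roughly 0.16 seconds.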

 

0 Likes

jarmniku,

The samples included in the current SDK are not necessarily optimized for performance.

Still, I could not reproduce a slowdown this large on my end. Could you try running the sample on Windows, or on Ubuntu without VMware, and share the results?

0 Likes

Well, I don't believe it is caused by VMware, since I also tested some reference codes that ran as fast as could be expected. Or maybe the ATI OpenCL implementation does something very specific that makes VMware raise exceptions, for example, which could explain it.

What kind of results did you get?

 

0 Likes

A 1024x1024 SimpleConvolution with a 3x3 mask runs in under 2 seconds on my side.

Below are the results on different processors:

Windows XP

Intel Pentium 4 560 @ 3.60GHz - 1.81s

Intel(R) Core(TM)2 Duo T7250 @ 2.0GHz - 1.06s

AMD Athlon Dual Core 4000+ @ 2.10GHz - 1.85s

openSUSE 11.0

AMD Athlon 64 3500+ @ 2.2GHz - 0.99s

 

What kind of results are you getting when running without VMware?

0 Likes

You seem to get better results. I should try without VMware; let's see if I can do it in the near future.

Thank you for your comments!

 

0 Likes

Originally posted by: omkaranathan The SDK samples provided with this beta release of the ATI Stream SDK are not necessarily tuned for optimal performance.

 

Originally posted by: riza.guntur The sample feels slow compared to my own single-threaded matrix multiplication program.

Which algorithm are you using in your program? Could you try writing the same algorithm as an OpenCL kernel and give a comparison?

 

Even the BlackScholes example is slow. The single-threaded sample in ATI Stream SDK 1.4b, which is there for comparison with the CPU, shows nearly the same performance.

What data sizes are you using for the BlackScholes sample?

For BlackScholes I'm using 1,000,000 (one million) samples.

I haven't learned OpenCL yet. I'm even confused about where to start: why there are .cl files (I thought they would be like .br files), how to compile them, and so on. The Brook+ docs are better, since they explain how to set up a VS project and other compilation procedures.

0 Likes

Riza.guntur,

    There are two ways to compile kernel code as per the OpenCL spec:

          1. Compile from source

          2. Compile from binary

 

     Presently, only compiling from source is supported.

       Steps to compile from source:

        1. Read the kernel code from your kernel file into a string.

        2. Use clCreateProgramWithSource and clBuildProgram to get executable code for the given kernel.
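The two steps can be sketched in C as follows. This is a minimal illustration rather than the SDK's own code: read_kernel_source is a hypothetical helper name, and step 2 is shown only as a comment because it needs a cl_context and a cl_device_id (here called context and device) that your host program must already have created.

```c
#include <stdio.h>
#include <stdlib.h>

/* Step 1: read a whole kernel file into a newly allocated,
 * NUL-terminated buffer. Returns NULL on failure; the caller
 * frees the result. read_kernel_source is a hypothetical name. */
char *read_kernel_source(const char *path, size_t *length_out)
{
    FILE *f = fopen(path, "rb");
    if (!f) return NULL;

    if (fseek(f, 0, SEEK_END) != 0) { fclose(f); return NULL; }
    long size = ftell(f);
    if (size < 0) { fclose(f); return NULL; }
    rewind(f);

    char *buf = malloc((size_t)size + 1);
    if (!buf) { fclose(f); return NULL; }

    size_t n = fread(buf, 1, (size_t)size, f);
    fclose(f);
    if (n != (size_t)size) { free(buf); return NULL; }

    buf[n] = '\0';
    if (length_out) *length_out = n;
    return buf;
}

/* Step 2, as comments (requires <CL/cl.h> and an existing
 * context and device, both assumed here):
 *
 *   size_t length;
 *   char *source = read_kernel_source("my_kernel.cl", &length);
 *   const char *src = source;
 *   cl_int err;
 *   cl_program program =
 *       clCreateProgramWithSource(context, 1, &src, &length, &err);
 *   err = clBuildProgram(program, 1, &device, NULL, NULL, NULL);
 *   free(source);
 */
```

The file name my_kernel.cl is a placeholder. Because the build happens at run time, the cost of clBuildProgram is paid on every launch of the host program.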

 

0 Likes