Archives Discussions

riza_guntur · ‎08-25-2009

The speed seems slow compared to my own single-threaded matrix multiplication program.

Why? I'm running on an Intel Core 2 Quad Q6600 at 3.3 GHz.

In Task Manager, the OpenCL example shows 100% processor usage.

riza_guntur · ‎08-25-2009

Even the BlackScholes example is slow. The single-threaded CPU comparison in the ATI Stream SDK 1.4b sample shows very similar performance.

What is actually happening?

Does OpenCL utilize SSE3 in this case?

omkaranathan · ‎08-25-2009

The SDK samples provided with this beta release of the ATI Stream SDK are not necessarily tuned for optimal performance.

Originally posted by: riza.guntur "The speed seems slow compared to my own single-threaded matrix multiplication program."

Which algorithm are you using in your program? Could you try writing the same algorithm as an OpenCL kernel and give a comparison?

"Even the BlackScholes example is slow. The single-threaded CPU comparison in the ATI Stream SDK 1.4b sample shows very similar performance."

What data sizes are you using for the BlackScholes sample?

n0thing · ‎08-25-2009

The BlackScholes sample uses a data size of 4096 by default. I am getting around 750 options/sec. Most of the time is taken by kernel compilation: if you measure the setup time (the setupCL function), you will get around 5 seconds. In the future you should be able to compile kernels offline, I guess.
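To make the point about setup time concrete: 4096 options divided by roughly 5 seconds of setup is about 800 options/sec, which is in the same range as the 750 options/sec reported above. The sketch below works through that arithmetic in C. The function names are made up for illustration, and the kernel-only rate of 2.0e6 options/sec is an assumption, not a measurement from this thread.

```c
#include <assert.h>
#include <stdio.h>

/* Effective throughput when a fixed setup cost is paid once per run.
 * setup_s:   one-time cost in seconds (e.g. kernel compilation)
 * compute_s: time spent actually pricing the options, in seconds */
double effective_rate(double n_options, double setup_s, double compute_s)
{
    assert(setup_s + compute_s > 0.0);
    return n_options / (setup_s + compute_s);
}

/* Prints the effective rate for the default size and for a larger run. */
void print_setup_overhead(void)
{
    const double setup_s = 5.0;       /* "around 5 seconds" from the post */
    const double kernel_rate = 2.0e6; /* ASSUMPTION: illustrative kernel-only rate */
    const double sizes[] = { 4096.0, 1.0e7 };

    for (int i = 0; i < 2; ++i) {
        double compute_s = sizes[i] / kernel_rate;
        printf("%10.0f options: %.0f options/sec\n", sizes[i],
               effective_rate(sizes[i], setup_s, compute_s));
    }
}
```

Under these assumptions the default run is dominated almost entirely by setup, while a run of ten million options spends about half its time in the kernel, so larger data sizes give a fairer picture of kernel speed.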

riza_guntur · ‎08-26-2009

Originally posted by: n0thing "The BlackScholes sample uses a data size of 4096 by default. I am getting around 750 options/sec. Most of the time is taken by kernel compilation: if you measure the setup time (the setupCL function), you will get around 5 seconds. In the future you should be able to compile kernels offline, I guess."

I'm getting 4900 options/sec.

In Task Manager, I see that OpenCL does not create threads if the problem is not big enough. Is that so?

jarmniku · ‎09-03-2009

Hi!

I tried running a bigger job with the -s option to eliminate start-up latencies, as follows:

BlackScholes -s 10000000

and got 1.91589e+06 options per second (i.e. 1.9 M/s) on my Intel(R) Core(TM)2 CPU 6700 @ 2.66GHz.

However, only one of the two cores seems to be utilized when running on Ubuntu under VMware Player on a Windows NT host.

Mandelbrot and SimpleConvolution, for example, are running extremely slowly -- even 100X slower than could be expected. A 1024x1024 convolution with a 3x3 mask needs roughly 150x1024x1024 machine operations, based on the .cl kernel, but takes 14 seconds. This means about 11 MOPS on a CPU that should easily achieve 1 GOPS.

So there is a huge gap. Any ideas why the performance is so bad? Or am I doing something wrong?
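The estimate in this post can be checked with a few lines of C. This is only a sketch of the arithmetic: the figure of 150 operations per pixel and the 14-second run time are taken from the post, the 1 GOPS target is the poster's own expectation, and the function names are made up for illustration.

```c
#include <assert.h>
#include <stdio.h>

/* Millions of operations per second for a width x height image. */
double convolution_mops(double ops_per_pixel, double width, double height,
                        double seconds)
{
    assert(seconds > 0.0);
    return ops_per_pixel * width * height / seconds / 1.0e6;
}

/* Prints the achieved rate and how far it falls short of 1 GOPS. */
void print_convolution_estimate(void)
{
    double mops = convolution_mops(150.0, 1024.0, 1024.0, 14.0);
    printf("achieved: %.1f MOPS\n", mops);
    printf("shortfall vs 1 GOPS: %.0fx\n", 1000.0 / mops);
}
```

With these inputs the rate comes out at about 11.2 MOPS, roughly 89 times short of 1 GOPS, which matches the "about 11 MOPS" and "even 100X slower" figures quoted above.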

omkaranathan · ‎09-03-2009

jarmniku,

The samples included in the current SDK are not necessarily optimized for performance.

Still, I couldn't see this much of a performance reduction when running on my end. Could you try running the sample in Windows or in Ubuntu without VMware and give the results?

jarmniku · ‎09-07-2009

Well, I don't believe it is caused by VMware, since I also tested some reference codes that ran as fast as could be expected. Or maybe the ATI OpenCL implementation does something very specific that causes VMware to raise, e.g., some exceptions, which could explain it.

What kind of results did you get?

omkaranathan · ‎09-07-2009

The 1024x1024 SimpleConvolution with a 3x3 mask runs in under 2 seconds on my side.

Below are the results with different processors:

Windows XP

Intel Pentium 4 560 @ 3.60 GHz - 1.81 s

Intel(R) Core(TM)2 Duo T7250 @ 2.0 GHz - 1.06 s

AMD Athlon Dual Core 4000+ @ 2.10 GHz - 1.85 s

OpenSuse 11.0

AMD Athlon 64 3500+ @ 2.2 GHz - 0.99 s

What kind of results are you getting when running without VMware?

jarmniku · ‎09-08-2009

You seem to get better results. I should try without VMware; let's see if I can do it in the near future.

Thank you for your comments!

riza_guntur · ‎08-26-2009

Originally posted by: omkaranathan "The SDK samples provided with this beta release of the ATI Stream SDK are not necessarily tuned for optimal performance. [...] Which algorithm are you using in your program? Could you try writing the same algorithm as an OpenCL kernel and give a comparison? [...] What data sizes are you using for the BlackScholes sample?"

For BlackScholes I'm using 1,000,000 (one million) samples.

I haven't learned OpenCL yet. I'm even confused about where to start: why are there .cl files (I thought they would be like .br files), how do I compile them, etc. The Brook+ docs are better, since they explain how to set up a VS project and other compilation procedures.

genaganna · ‎08-26-2009

Riza.guntur,

There are two ways to compile kernel code as per the OpenCL spec:

1. compile from source

2. compile from binary

Presently, only compiling from source is supported.

Steps to follow to compile from source:

1. Read the kernel code from your kernel file into a string.

2. Use clCreateProgramWithSource and clBuildProgram to get executable code for the given kernel.
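The two steps above can be sketched in C roughly as follows. This is a minimal sketch under stated assumptions, not the code the SDK samples use: read_kernel_source, build_from_source and the USE_OPENCL macro are hypothetical names introduced here, and the OpenCL context and device are assumed to have been created already. The OpenCL part is guarded by the macro so that the file-reading part compiles on a machine without the OpenCL headers.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Step 1: read the whole kernel file into a NUL-terminated string.
 * Returns NULL on failure; the caller frees the result. */
char *read_kernel_source(const char *path, size_t *length_out)
{
    assert(path != NULL);
    FILE *f = fopen(path, "rb");
    if (f == NULL) return NULL;

    char *src = NULL;
    long size = -1;
    if (fseek(f, 0, SEEK_END) == 0) size = ftell(f);
    if (size >= 0 && fseek(f, 0, SEEK_SET) == 0) src = malloc((size_t)size + 1);
    if (src != NULL) {
        size_t got = fread(src, 1, (size_t)size, f);
        src[got] = '\0';
        if (length_out != NULL) *length_out = got;
    }
    fclose(f);
    return src;
}

#ifdef USE_OPENCL /* hypothetical macro: define it when the OpenCL headers are installed */
#include <CL/cl.h>

/* Step 2: create a program from the source string and build it for one
 * device. Returns NULL on failure. */
cl_program build_from_source(cl_context ctx, cl_device_id dev,
                             const char *src, size_t length)
{
    cl_int err = CL_SUCCESS;
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, &length, &err);
    if (err != CL_SUCCESS) return NULL;

    err = clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);
    if (err != CL_SUCCESS) {
        clReleaseProgram(prog);
        return NULL;
    }
    return prog;
}
#endif
```

If the build fails, clGetProgramBuildInfo with CL_PROGRAM_BUILD_LOG returns the compiler output, which is usually the quickest way to find errors in the .cl file.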

Matrix Multiplication Sample, slow?