I feel the speed is slow compared to my own single-threaded matrix multiplication program.
Why? I'm running on an Intel Core 2 Quad Q6600 at 3.3GHz.
In Task Manager I see the OpenCL example using 100% of the processor.
Even the BlackScholes example is slow. The single-threaded sample in ATI Stream SDK 1.4b, which I used as a CPU comparison, shows very similar performance.
What is actually happening?
Does OpenCL use SSE3 in this case?
The SDK samples provided with this beta release of the ATI Stream SDK are not necessarily tuned for optimal performance.
Originally posted by: riza.guntur I feel the speed is slow compared to my own single thread matrix multiplication program.
Which algorithm are you using in your program? Could you try writing the same algorithm as an OpenCL kernel and give a comparison?
Even the BlackScholes example is slow. The single-threaded sample in ATI Stream SDK 1.4b, which I used as a CPU comparison, shows very similar performance.
What data sizes are you using for the BlackScholes sample?
The BlackScholes sample uses a data size of 4096 by default. I am getting around 750 options/sec. Most of the time is taken by kernel compilation: if you measure the setup time (the setupCL function), you will get around 5 seconds. In the future you should be able to compile kernels offline, I guess.
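To make the point about setup time concrete, here is a rough sketch in C of how the reported rate depends on whether the one-time setup is counted. The function name options_per_sec is made up for illustration and is not part of the SDK; the 4096 options and the roughly 5 seconds of setup are the figures quoted above.

```c
#include <assert.h>

/* Options processed per second for a run of n_options.
 * total_s is the wall-clock time of the whole run in seconds;
 * setup_s is the part of it spent in one-time setup such as kernel
 * compilation. Pass setup_s = 0.0 for the end-to-end rate.
 * Hypothetical helper, not an SDK function. */
double options_per_sec(double n_options, double total_s, double setup_s)
{
    assert(setup_s >= 0.0 && total_s > setup_s);
    return n_options / (total_s - setup_s);
}
```

With 4096 options and about 5 seconds spent almost entirely in setup, options_per_sec(4096.0, 5.0, 0.0) comes out at roughly 819 options/sec, the same order as the 750 reported. So with the default size the number mostly measures compilation, not the kernel.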
Originally posted by: n0thing The BlackScholes sample uses a data size of 4096 by default. I am getting around 750 options/sec. Most of the time is taken by kernel compilation: if you measure the setup time (the setupCL function), you will get around 5 seconds. In the future you should be able to compile kernels offline, I guess.
I'm getting 4900 options/sec
In Task Manager I see that OpenCL does not create extra threads if the problem is not big enough. Is that so?
Hi!
I tried running a bigger job with the -s option to eliminate start-up latencies, as follows:
BlackScholes -s 10000000
and got 1.91589e+06 options per sec (i.e. 1.9 M/s) on my
Intel(R) Core(TM)2 CPU 6700 @ 2.66GHz
However, only one of the two cores seems to be utilized when running on Ubuntu under VMware Player on a Windows NT host.
However, Mandelbrot and SimpleConvolution, for example, run extremely slowly, even 100X slower than could be expected. A 1024x1024 convolution with a 3x3 mask needs roughly 150x1024x1024 machine operations, based on the .cl kernel, but takes 14 seconds. That is about 11 MOPS on a CPU that should easily achieve 1 GOPS.
So, there is a huge gap, any ideas why the performance is so bad? Or am I doing something wrong?
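The estimate above can be checked with a few lines of C. This is only a sketch of the arithmetic: achieved_mops is a hypothetical helper name, and the 150 operations per pixel is the poster's own rough count from the kernel, not a measured value.

```c
#include <assert.h>

/* Achieved throughput in millions of operations per second (MOPS),
 * given an estimated operation count per output pixel, the image
 * dimensions, and the measured run time in seconds.
 * Hypothetical helper for the back-of-the-envelope estimate. */
double achieved_mops(double ops_per_pixel, double width, double height,
                     double seconds)
{
    assert(seconds > 0.0);
    return ops_per_pixel * width * height / seconds / 1e6;
}
```

achieved_mops(150.0, 1024.0, 1024.0, 14.0) gives about 11.2 MOPS. Against an assumed 1 GOPS (1000 MOPS) that is a factor of roughly 89, which is where the "even 100X slower" figure comes from.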
jarmniku,
The samples included in the current SDK are not necessarily optimized for performance.
Still, I could not reproduce a slowdown of this magnitude on my end. Could you run the sample on Windows, or on Ubuntu without VMware, and post the results?
Well, I don't believe it is caused by VMware, since I also tested some reference codes that ran as fast as could be expected. Or maybe the ATI OpenCL implementation does something very specific that makes VMware raise, e.g., some exceptions, which could explain it.
What kind of results did you get?
The 1024x1024 SimpleConvolution with a 3x3 mask runs in under 2 seconds on my side.
Below are the results with different processors:
Windows XP
Intel Pentium 4 560 @ 3.60GHz - 1.81s
Intel(R) Core(TM)2 Duo T7250 @ 2.0GHz - 1.06s
AMD Athlon Dual Core 4000+ @ 2.10GHz - 1.85s
OpenSuse 11.0
AMD Athlon 64 3500+ @ 2.2GHz - 0.99s
What kind of results are you getting when running without VMware?
You seem to get better results. I should try without VMware; let's see if I can do it in the near future.
Thank you for your comments!
Originally posted by: omkaranathan The SDK samples provided with this beta release of the ATI Stream SDK are not necessarily tuned for optimal performance.
Which algorithm are you using in your program? Could you try writing the same algorithm as an OpenCL kernel and give a comparison?
What data sizes are you using for the BlackScholes sample?
For BlackScholes I'm using 1000000 (one million) samples.
I haven't learned OpenCL yet. I'm even confused about where to start. Why are there .cl files (I thought they would be like .br files)? How do I compile them, etc.? The Brook+ docs are better, since they explain how to set up a VS project and the other compilation procedures.
Riza.guntur,
There are two ways to compile kernel code as per the OpenCL spec:
1. Compile from source
2. Compile from binary
Presently, only compiling from source is supported.
Steps to follow to compile from source:
1. Read the kernel code string from your kernel file.
2. Use clCreateProgramWithSource and clBuildProgram to get executable code for the given kernel.
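The two steps above could look roughly like this in C. Treat it as a sketch under assumptions, not the SDK's own code: read_kernel_source and build_from_source are names made up here, the USE_OPENCL macro is a hypothetical switch so that the file-reading part compiles without an OpenCL SDK, and the header path CL/cl.h is the usual one on Windows and Linux. Creating the context and picking the device are left to the caller.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

#ifdef USE_OPENCL   /* hypothetical switch: build with -DUSE_OPENCL -lOpenCL */
#include <CL/cl.h>
#endif

/* Step 1: read the whole kernel file into a NUL-terminated string.
 * Returns NULL on failure; the caller frees the result.
 * If length_out is not NULL it receives the number of bytes read. */
char *read_kernel_source(const char *path, size_t *length_out)
{
    assert(path != NULL);
    FILE *f = fopen(path, "rb");
    if (f == NULL)
        return NULL;

    char *src = NULL;
    long size = -1;
    if (fseek(f, 0, SEEK_END) == 0 && (size = ftell(f)) >= 0 &&
        fseek(f, 0, SEEK_SET) == 0) {
        src = malloc((size_t)size + 1);
        if (src != NULL) {
            size_t got = fread(src, 1, (size_t)size, f);
            src[got] = '\0';
            if (length_out != NULL)
                *length_out = got;
        }
    }
    fclose(f);
    return src;
}

#ifdef USE_OPENCL
/* Step 2: create a program object from the source string and build it
 * for one device. Returns NULL on failure and prints the build log. */
cl_program build_from_source(cl_context ctx, cl_device_id dev,
                             const char *src, size_t len)
{
    cl_int err = CL_SUCCESS;
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, &len, &err);
    if (err != CL_SUCCESS)
        return NULL;

    err = clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);
    if (err != CL_SUCCESS) {
        /* Fixed-size buffer for brevity; a log longer than this
         * is not retrieved and the buffer stays empty. */
        char log[4096] = "";
        clGetProgramBuildInfo(prog, dev, CL_PROGRAM_BUILD_LOG,
                              sizeof log, log, NULL);
        fprintf(stderr, "clBuildProgram failed (%d):\n%s\n", (int)err, log);
        clReleaseProgram(prog);
        return NULL;
    }
    return prog;
}
#endif
```

Since the build happens at run time on every start, this is also the part that shows up as setup time when you measure a sample.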