To answer your question: look at your algorithm and try to find a way to split it into thousands of independent work units. For example, process each pixel of the input image and then do something like a parallel reduction.
This is peanuts for GPUs, as long as your image fits inside the largest GPU buffer you are allowed to allocate, and 256 MB isn't much.
See it like this: most approaches to this will be bandwidth-limited to RAM, and your CPU's bandwidth to RAM is a factor of ten or more lower than the GPU's.
If you also manage to use the PEs (processing elements) effectively, it will fly, of course: from roughly 100 GFLOPS single precision across a bunch of CPU cores you move to over a teraflop, and with multiply-add you get the add for free.
0.1 seconds should not be a problem; you want more or less exactly what 3D engines already do in real time.
That is, if you have already defined what the object is.
If you want image processing that recognizes what the object is, that moves to a different league, the league of discrete Fourier transforms...
Many at NVIDIA are still struggling to get such transforms working on GPUs. Bit by bit they are succeeding, but it doesn't seem easy, and most transforms end up bandwidth-limited to RAM, just as they already were on CPUs...
The biggest problem there is the parallelization: embarrassingly parallel crunching would only be possible if there were enough RAM for each transform, and for most transforms it is already impossible to run a different transform in each compute unit (a compute unit is 64 processing elements on the 6000-series GPUs). PEs within a compute unit can communicate with each other very well using the local data store (LDS), but communicating between compute units is not so clever: in practice that requires going through RAM (sure, there is the GDS, but that is just a few kilobytes). So, from a parallelism viewpoint, to get maximum performance it is best to do each independent calculation within one compute unit.
That will be similar for you in OpenCL.
Where there are constraints, only actions that CAN be done independently of each other, yet in a parallel manner, can be handed off to other compute units.
A big difference from the CPU, of course, is that every compute unit executes the same instruction at roughly the same time. No escape from that right now at AMD, though that should change eventually; see the announcements there. I'm not sure when it will become possible to steer compute units with code independent of each other, which would simplify the problem, and not only for you.
NVIDIA has already been able to do that for quite a few years.
So the short answer is: a speedup of a factor of 10 over one CPU core should be easy if you need a big buffer.
As you described it, however, this sounds like a simple feature that all 3D engines have, which can be done in full-blown real time; the GPU has all the logic on board, plus all sorts of estimation methods, to get it done.
Please note that most games, as a result of this, are also a tad bandwidth-limited to RAM, so it would be great if you did more sophisticated calculations than basically moving a sub-image away from the image plus a few simple calculations.
See it like this: DDR2 RAM gives the CPU roughly 10 GB/s of bandwidth at most for two sockets. DDR3 is a lot better on paper and achieves roughly 20 GB/s, depending on the chipset version and so on.
GPUs have GDDR5, which hands them 170 GB/s or more.
So realistically, a factor of 10 you get with your eyes closed. Going beyond that is a matter of sophisticated programming skill, which by the way most 3D-engine programmers do not have (you get what you pay for).
Originally posted by: rafidka
I have never used OpenCL before, so forgive me if my question looks funny :-) I am limited on time, so I want to see whether OpenCL is going to help me before digging into it.
I am making a camera controller using optimisation and simple image processing. I allow the user of my controller to express a set of screen properties. For example, the user might say: I want this object to occupy 20% of the screen width. What I do is randomly distribute 36 camera configurations in the space, then move those configurations through the space like particles (the optimisation method is called Particle Swarm Optimisation), and with each movement I render the scene from all configurations and process the resulting images to evaluate how well each configuration satisfies the requirements. If a configuration satisfies the user's requirement, I return it.
To evaluate the satisfaction, I use simple image-processing techniques. For example, if the user wants the element to occupy 20% of the screen width, I find the horizontal coordinates of the minimum and maximum pixels occupied by the object, then subtract them to find the width.
Obviously, in addition to transferring the rendered image back to the CPU for processing, this process is heavy on the CPU itself and makes it impossible to run the camera solver in real time. My question is: is it possible to run these processes on the GPU using OpenCL to make them faster? Not necessarily real time, but as fast as possible. Currently it normally takes around 8 seconds until a good solution is found, so if I could reduce that to 1 second it would be great.
In practical terms, you'd want to use image2d buffers, process them in device (GPU) memory and, ideally, display them without any transfer back.
May I suggest this topic:
From this you can have an idea about what to expect concerning GPU performance regarding image processing.
Hope that helps.
Thanks guys! Your answers are very useful.