General rules of writing GPU code:
1. Create a serial version to verify the correctness of the parallel algorithm. (You might skip this if it is extremely simple.)
2. Make it work. (able to compile and run)
3. Make it right. (produce actually correct results, not garbage)
4. Make it fast. (optimize only at this point)
Golden rule: premature optimization is the root of all evil!
If you find buffers to be simpler than images, use them. When you get bored of buffers, start experimenting with images. If too many concepts are new, try to simplify things and fall back on the ones you understand and know to be working. If something is buggy and you used 3-4 new elements at once, you'll have no idea where you went wrong.
The idea of NDRange will not get simpler, but it is really no black magic. Think of it as the specification of how many threads you want to launch. The global worksize is the total number of threads to launch, and the local worksize is how large the groups of threads should be; threads within a group have a type of memory they can share, so naturally a group cannot be arbitrarily large. For simple algorithms where threads need not communicate with each other, you can safely disregard the local worksize: when calling clEnqueueNDRangeKernel, set the corresponding argument to NULL and let the implementation decide how to group the threads.
Hope that helped.
Do read AMD's OpenCL Programming Guide, as very simple questions such as this usually do not get answered directly; people just get pointed at some tutorial or guide. (It is not laziness, and people won't say "RTFM", but we've all had to learn OpenCL through tutorials and guides ourselves, so if many people did it before you, it cannot be that hard.)
I haven't started learning OpenCL myself, as I have other things to do right now.
Learning is good ;)
Your friendly answer is nice!
There is a MatrixMulImage sample in the AMD APP SDK which should be interesting to you.