Probably best to start with the OpenCL examples and adapt those to your own situation. Also, libraries like ArrayFire (which is the one I work on) can make it much easier to get good performance in just a few hours.
I wrote my application for my AMD GPU, but it still does not perform well enough (the goal is to beat the same application running with SIMD+OpenMP in terms of run time). So I recently took a look at ArrayFire, and it may be an option. Honestly, I have some trouble understanding how to combine it with OpenCL, in the sense of creating a context, adding devices, creating memory objects, executing kernels and so on (since this is the way I learned to communicate with an external device and submit work to it). Is there a guide or a set of examples that walks the user step by step through configuring the environment and adding kernels to the device, showing side by side how it would be written using only OpenCL and how it would be written using OpenCL+ArrayFire?
Thank you for your attention.
In case you are interested in learning OpenCL, I would recommend starting at http://developer.amd.com/tools/heterogeneous-computing/amd-accelerated-parallel-processing-app-sdk/
Read the documentation, install the SDK, and learn from the samples.
There are also some highly optimized libraries, such as clAmdBlas and clAmdFft, available from AMD for AMD GPUs.