I think AMD can be very proud of the path it has taken in exposing hardware features through vendor extensions (cl_amd_*, such as the media ops), since this is the fastest route possible.
Below I describe all the extensions and improvements that would make a perfect OpenCL implementation for me. Hopefully you already have all of them on your radar, but at the least this serves me as a checklist.
AMD employees, please tell us what you think of these requests: whether you are interested in implementing them and when we can expect them, and if not, why not.
What I still find lacking are some features that DX11 hardware and DirectCompute already have. Some of them were even claimed, at the Evergreen press launch in September 2009, to be coming to OpenCL:
*Please support the cl_khr_gl_event extension from OpenCL 1.1 for more efficient OpenGL interop
*Expose the global data share
*Append/consume buffers (these were supposedly coming in Stream 2.2, according to leaked roadmaps)
*DX11-level integer instructions (first bit set, count bits set, etc.); please expose them the same way the media ops are exposed
*Support for accessing host system memory from GPU kernels (an AMD employee said on the AMD Stream forums that this is being worked on). It would be useful for huge datasets that don't fit in GPU memory. Can you also say how this extension would relate to Fusion?
*Concurrent kernels (as on Fermi, maybe via cl_ext_device_fission?)
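To make the integer-instruction request concrete, here is a minimal plain C sketch of what "count bits set" and "first bit set" compute when emulated in software. The function names are hypothetical (they are not part of any AMD or Khronos API), and the conventions of counting the bit index from bit 0 and returning -1 for a zero input are assumptions of mine. A hardware instruction would replace each loop with a single operation.

```c
#include <stdint.h>

/* Hypothetical helper: number of bits set to 1 in x. */
static unsigned count_bits_set(uint32_t x) {
    unsigned n = 0;
    while (x) {
        x &= x - 1;   /* clears the lowest set bit */
        n++;
    }
    return n;
}

/* Hypothetical helper: index (from bit 0) of the lowest set bit,
   or -1 when x is zero (assumed convention). */
static int first_bit_low(uint32_t x) {
    if (x == 0) return -1;
    int i = 0;
    while (!(x & 1u)) {
        x >>= 1;
        i++;
    }
    return i;
}

/* Hypothetical helper: index (from bit 0) of the highest set bit,
   or -1 when x is zero (assumed convention). */
static int first_bit_high(uint32_t x) {
    if (x == 0) return -1;
    int i = 31;
    while (!(x & 0x80000000u)) {
        x <<= 1;
        i--;
    }
    return i;
}
```

Loops like these cost many ALU operations per work-item, which is why I would like the native instructions exposed.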
A good FFT and BLAS library, à la CUBLAS and CUFFT, would be perfect for scientific work.
I know you have ACML-GPU, but it is not OpenCL friendly. Please fix it so it is easy to integrate into OpenCL programs, for example by accepting cl_mem objects as input and output data arguments.
Can you talk about upcoming extensions and improvements such as better Catalyst integration (I mean a single atiocl.dll in the Windows system path), the supposed Open Video Decode (UVD) interop, and the C++ template kernels shown in leaked roadmaps?
Also, will the Open Video Decode API expose the MVC codec present in the 6xxx series for Blu-ray 3D decode? If not, please publish a whitepaper on using that hardware, supposedly via DXVA 2.0 extensions.
This would allow GPU processing of compressed stereoscopic video with no decoding overhead.
I have some more possible upgrades to OpenCL:
*Expose hardware-accelerated 2D texture arrays (read and write). DirectCompute does so (RWTexture2DArray).
*Expose some way of launching a kernel whose grid/local workgroup size comes from graphics memory. This enables algorithms such as marching cubes, where a kernel runs only on the "interesting" voxels selected by a scan algorithm, so the launch size depends on the previous kernel launch (the scan). It avoids a round trip and a CPU/GPU sync through host memory. I have tested this on marching cubes and, for small examples, it gave a 2x improvement. DirectCompute exposes this (DispatchIndirect).
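The dependency behind the indirect-launch request can be sketched in plain C. This is only a CPU stand-in under my own assumptions: both function names are hypothetical, and the "interesting" test is simplified to a threshold on a single density value rather than a real marching cubes cell classification. The point is that the second pass needs `count` from the first pass as its launch size. Today that value has to be read back to the host; with an indirect launch it could stay in a device buffer.

```c
#include <stddef.h>

/* Pass 1 (stand-in for the scan/compaction kernel): write the indices of
   the "interesting" voxels contiguously into active_out and return how
   many there are. The running count plays the role of an exclusive scan
   over the per-voxel flags. */
static size_t compact_active_voxels(const float *density, size_t n,
                                    float iso, size_t *active_out) {
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        int flag = density[i] > iso;   /* simplified "interesting" test */
        if (flag) active_out[count] = i;
        count += (size_t)flag;
    }
    return count;
}

/* Pass 2 (stand-in for the per-voxel kernel): the loop bound is the launch
   size, and it is only known once pass 1 has finished. */
static void process_active_voxels(const size_t *active, size_t count,
                                  float *result) {
    for (size_t gid = 0; gid < count; gid++) {
        result[active[gid]] = 1.0f;    /* placeholder per-voxel work */
    }
}
```

In an OpenCL version, `count` is what clEnqueueNDRangeKernel needs as the global work size of the second kernel, which is exactly the host round trip I would like to avoid.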
Also interesting: I found some supposed Qualcomm improvements to OpenCL that, by using this API and declaring new device types, allow OpenCL to program DSPs, video decode/encode units, GPUs and CPUs, all interoperating efficiently, for example by running all units asynchronously if the hardware supports it and by sharing the same memory space.