Thanks for taking the time to download and use AMD’s OpenCL Beta. In many ways the results you report are positive: they show that, given a sensible number of work-items per group (64-256), the launch overhead can be amortized and performance scales well against the single-threaded version, running 3-4x faster than the single-core code. The missing piece is that, without using vector types (in this case float4, which maps to __m128 for SSE), the OpenCL code will not fully utilize the x86 cores; the same is true of the straight C code.
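As a sketch of what this means in kernel code (an illustrative element-wise add; the kernel names are hypothetical, not from your benchmark), the scalar and float4 versions might look like:

```c
/* Scalar version: one float per work-item. On x86 each operation
   uses only one lane of the SSE unit. */
__kernel void add_scalar(__global const float *a,
                         __global const float *b,
                         __global float *c)
{
    size_t i = get_global_id(0);
    c[i] = a[i] + b[i];
}

/* float4 version: four floats per work-item. On current x86 the
   float4 loads, adds, and stores map to SSE (__m128) instructions. */
__kernel void add_vec4(__global const float4 *a,
                       __global const float4 *b,
                       __global float4 *c)
{
    size_t i = get_global_id(0);
    c[i] = a[i] + b[i];
}
```

Note that the float4 version needs only a quarter as many work-items for the same data, so the global size must be divided by 4 accordingly.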
Of course, auto-vectorization can help here, but there is much work to be done to get it implemented and tested, and we are working on this and other ideas. It is worth remembering that doing the math in SSE is not the only thing that can be optimized; in particular, an application must optimize around load/store width, and on x86 it can be faster to do SSE loads/stores than scalar ones. The problem is that simply taking scalar code and vectorizing the memory operations is not trivial, and combining this with branching can often make it near impossible. This was one of the key motivations behind OpenCL including vector types in a unified and portable manner that map to the underlying vector hardware; on current x86 this is SSE, and in the future AVX.
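To make the load/store-width point concrete, here is a minimal C sketch (a hypothetical helper, not part of the Beta) of what a float4 add compiles down to on x86: one 128-bit load per operand and one 128-bit store, instead of four scalar memory operations each.

```c
#include <immintrin.h>

/* add4: c[i] = a[i] + b[i] for i in 0..3, using SSE.
   This is the x86 code an OpenCL float4 add effectively maps to. */
static void add4(const float *a, const float *b, float *c)
{
    __m128 va = _mm_loadu_ps(a);           /* one 128-bit load, not four scalar loads */
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(c, _mm_add_ps(va, vb));  /* four adds and one store in two instructions */
}
```

Writing the kernel in terms of float4 gives the compiler these wide memory operations for free, which is exactly what is hard to recover automatically from scalar code with branches.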
It is worth noting that an OpenCL device, in this case the CPU, places a limit on the number of work-items per work-group; AMD’s OpenCL Beta implementation currently puts this at 1024, which is most likely why the kernel launch failed for 2048. You can query the maximum work-group size of a device by calling the API routine clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE, ...).
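A minimal sketch of that query, assuming `device` is a cl_device_id already obtained from clGetDeviceIDs() and `local_size` is the work-group size you intend to launch with:

```c
/* Query the device's maximum work-group size before choosing a local size. */
size_t max_wg_size;
cl_int err = clGetDeviceInfo(device,
                             CL_DEVICE_MAX_WORK_GROUP_SIZE,
                             sizeof(max_wg_size),
                             &max_wg_size,
                             NULL);
if (err == CL_SUCCESS && local_size > max_wg_size) {
    /* Clamp or fall back, e.g. launch with max_wg_size instead. */
    local_size = max_wg_size;
}
```

Doing this check up front turns the silent launch failure you saw at 2048 into an explicit, portable fallback.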