I have a question about OpenCL on the CPU with AMD APP.
I wrote a simple OpenCL test that performs a large number of additions in a loop to measure OpenCL performance. The kernel runs with a single work-group containing a single work-item, and I also wrote a C reference implementation that does the same thing on the CPU. I expected the two to perform similarly. However, when I build the reference code for the x64 platform, it is much faster (about 50%) than OpenCL on the CPU. When I build the reference code for the Win32 platform, its performance is roughly the same as OpenCL on the CPU.
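To make the comparison concrete, here is a minimal sketch of the kind of C reference loop described above. It is not my exact test code: the type `char16_ref`, the function `vec_add_ref`, and the `iters` repeat count are hypothetical names chosen for illustration, with `char16_ref` standing in for OpenCL's `char16`.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for OpenCL's char16: sixteen 8-bit lanes. */
typedef struct { int8_t lane[16]; } char16_ref;

/*
 * Adds a[i] and b[i] lane by lane into c[i] for n vectors, and repeats
 * the whole pass `iters` times so the run is long enough to time.
 * The sum is formed in unsigned 8-bit arithmetic so it wraps on overflow;
 * the conversion back to int8_t wraps on the usual two's-complement targets.
 */
void vec_add_ref(const char16_ref *a, const char16_ref *b,
                 char16_ref *c, size_t n, unsigned iters)
{
    for (unsigned it = 0; it < iters; ++it) {
        for (size_t i = 0; i < n; ++i) {
            for (int k = 0; k < 16; ++k) {
                uint8_t sum = (uint8_t)((uint8_t)a[i].lane[k] +
                                        (uint8_t)b[i].lane[k]);
                c[i].lane[k] = (int8_t)sum;
            }
        }
    }
}
```

Built once for x64 and once for Win32, a loop like this can be timed around the call (for example with `clock()` from `<time.h>`) and compared against the kernel's run time.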
My question: is the AMD APP OpenCL compiler optimized for the x64 platform, or only for the Win32 platform? And how can I verify this? Any instructions or suggestions would be appreciated.
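On the host side, one thing I can at least check is the pointer width of each reference build. The sketch below uses a hypothetical helper name, `host_pointer_bits`, and only reports what the C compiler targeted. On the OpenCL side, the device's address width can be queried with `clGetDeviceInfo` and `CL_DEVICE_ADDRESS_BITS`. Whether that value reflects the code the kernel compiler actually generates is an assumption on my part.

```c
#include <limits.h>
#include <stddef.h>

/* Width of a host pointer, in bits, for the current build target. */
unsigned host_pointer_bits(void)
{
    return (unsigned)(sizeof(void *) * CHAR_BIT);
}
```

Printing this value from the x64 and Win32 builds confirms which target each reference binary was really compiled for.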
I used the AMD kernel analyzer to analyze my CL code and generate the x86 assembly shown below.
// Only the vector-add part of the code is pasted here.
// the CL source code is vec[c] = vec[a] + vec
// where vec is char16 data type
movdqa (%ebx), %xmm0
addl $16, %ebx
paddb (%esi), %xmm0
addl $16, %esi
movdqa %xmm0, (%ebp)
addl $16, %ebp