Hi. I'm puzzled by how slowly OpenCL kernels execute on the CPU.
I took the kernel from the AESEncryptDecrypt sample and modified it to take a char string as input instead of a 2D image, so it now runs in a 1D index space with a workgroup size of 64. I also have a regular C implementation, http://dl.dropbox.com/u/4230568/c_aes.c, that encrypts one AES block at a time in a loop over the input buffer.
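To make the 1D setup concrete, here is a minimal sketch in plain C of how the index space could be sized. It assumes one work-item per 16-byte AES block, which may not match how the modified kernel actually divides the work, and the helper names (aes_block_count, global_work_size, block_offset) are made up for illustration. The only fixed facts are the 16-byte AES block and the workgroup size of 64.

```c
#include <stddef.h>

#define AES_BLOCK_BYTES 16  /* AES always works on 16-byte blocks */
#define WORKGROUP_SIZE  64  /* workgroup size used in the 1D index space */

/* Number of 16-byte blocks needed to cover len bytes
   (the last block would need padding if len is not a multiple of 16). */
static size_t aes_block_count(size_t len)
{
    return (len + AES_BLOCK_BYTES - 1) / AES_BLOCK_BYTES;
}

/* Round the work-item count up to a multiple of the workgroup size,
   since the global size has to be divisible by the local size in OpenCL 1.x. */
static size_t global_work_size(size_t work_items)
{
    return ((work_items + WORKGROUP_SIZE - 1) / WORKGROUP_SIZE) * WORKGROUP_SIZE;
}

/* Byte offset of the block handled by work-item gid.
   ASSUMPTION: one work-item per AES block. */
static size_t block_offset(size_t gid)
{
    return gid * AES_BLOCK_BYTES;
}
```

With this mapping, a 1 MiB input gives 65536 blocks, so 65536 work-items in 1024 workgroups. Any work-item whose offset falls past the end of the buffer would have to return early.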
The regular C version maxes out one CPU core for the entire encryption, while the OpenCL kernel maxes out all 8 logical cores (4 physical cores with hyperthreading). Yet both take roughly the same amount of time to finish, regardless of input size. Why is that?
How can 8 cores (or 4) running OpenCL get no more work done than 1 core running ordinary C? Is it local memory latency? Thread context switching (local_mem_barrier)?
I only compare the actual calculation time, not host setup.
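A minimal sketch of how the two timings could be put on the same footing, assuming start and end timestamps in nanoseconds are already available. For the OpenCL side they could come from event profiling (clGetEventProfilingInfo with CL_PROFILING_COMMAND_START and CL_PROFILING_COMMAND_END on a queue created with CL_QUEUE_PROFILING_ENABLE). For the C side any wall-clock timer would do. The helper names are made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Elapsed milliseconds between two nanosecond timestamps. */
static double elapsed_ms(uint64_t start_ns, uint64_t end_ns)
{
    return (double)(end_ns - start_ns) / 1e6;
}

/* Throughput in MB/s (1 MB = 1e6 bytes) for `bytes` processed between
   the two timestamps. Returns 0 if the interval is empty. */
static double throughput_mb_per_s(size_t bytes, uint64_t start_ns, uint64_t end_ns)
{
    double seconds = (double)(end_ns - start_ns) / 1e9;
    return seconds > 0.0 ? ((double)bytes / 1e6) / seconds : 0.0;
}
```

Comparing MB/s rather than raw times makes it easier to see whether the gap between the two versions changes with input size. Note that a CPU-time counter such as clock() would add up time across all cores, so a wall-clock source is the safer choice for the multi-threaded run.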