Hi. I am puzzled by the slow execution of OpenCL kernels on the CPU.
I have taken the kernel from the sample AESEncryptDecrypt and modified it so that it takes a char string as input instead of a 2D image. It now runs in a 1D index space with a workgroup size of 64. I also have a regular C file that does AES on a single block at a time, http://dl.dropbox.com/u/4230568/c_aes.c, called in a loop over the input buffer.
The regular C version maxes out one CPU core during the entire encryption, while the OpenCL kernel maxes out all 8 CPU cores (4 real cores with hyperthreading). But they take roughly the same amount of time to finish, no matter the input size. Why is that?
How can 8 cores (or 4) get as little work done with OpenCL as 1 core running ordinary C? Local memory latency? Thread context switching (local_mem_barrier)?
I only compare the actual calculation time, not host setup.
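For reference, here is a minimal sketch of timing only the computation on the plain C side. It assumes wall-clock time is the quantity being compared: on POSIX systems `clock()` reports CPU time summed over all threads, which would make a multi-threaded run look slower than it is. `wall_seconds`, `checksum` and `time_compute` are hypothetical names, and `checksum` is only a stand-in for the real encryption loop. On the OpenCL side the rough equivalent would be event profiling (`CL_QUEUE_PROFILING_ENABLE` plus `clGetEventProfilingInfo`), which is not shown here.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Wall-clock seconds (C11 timespec_get); not CPU time. */
static double wall_seconds(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Hypothetical stand-in for the work being measured (NOT AES). */
static unsigned long checksum(const unsigned char *buf, size_t n)
{
    unsigned long sum = 0;
    for (size_t i = 0; i < n; ++i)
        sum += buf[i];
    return sum;
}

/* Times only the computation; buffer setup stays outside the timed region. */
static double time_compute(const unsigned char *buf, size_t n,
                           unsigned long *result)
{
    assert(result != NULL);
    double start = wall_seconds();
    *result = checksum(buf, n);
    return wall_seconds() - start;
}
```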
Did you try limiting CPU execution to the 4 physical cores? I have had experiences (unrelated to OpenCL) where CPU-intensive processes actually ran slower when I did not limit them to the number of physical cores.
Originally posted by: nou
Well, in my case using __local and barrier() in the kernel led to several times slower execution on the CPU.
I guessed it was something like this. Thanks.
Originally posted by: himanshu.gautam
Can you provide a test case? Maybe we can figure out some ways to improve the performance of your code.
Test case available here: http://dl.dropbox.com/u/4230568/aes.tar.gz (you will need to adjust the Makefile for your system).
Note that this is the kernel from AESEncryptDecrypt with small changes (workgroup size, internal functions). I think it is the large number of thread context switches caused by local memory and barrier(CLK_LOCAL_MEM_FENCE) that lowers performance on the CPU.
For the CPU, each work-item should probably process its own AES block (16 bytes) and keep everything in private memory.
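A rough sketch of that layout, written as plain C so it can be run on the host without an OpenCL runtime. It is not the AESEncryptDecrypt kernel: `process_block` is a hypothetical placeholder that only XORs the block with the key (NOT encryption), and `work_item` and `run_all_blocks` are made-up names. In a real kernel, `gid` would come from `get_global_id(0)` and `in`/`out` would be `__global` pointers. The point is only the indexing: one work-item per 16-byte block, with the state held in a private array, so no `__local` memory and no `barrier()` are needed.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define AES_BLOCK_BYTES 16

/* Hypothetical placeholder for the AES rounds: XOR with the key.
   NOT encryption; it only marks where the per-block work would go. */
static void process_block(unsigned char state[AES_BLOCK_BYTES],
                          const unsigned char key[AES_BLOCK_BYTES])
{
    for (int i = 0; i < AES_BLOCK_BYTES; ++i)
        state[i] ^= key[i];
}

/* Body of one work-item; gid plays the role of get_global_id(0). */
static void work_item(size_t gid, const unsigned char *in,
                      unsigned char *out, const unsigned char *key)
{
    unsigned char state[AES_BLOCK_BYTES];  /* private copy, no sharing */
    memcpy(state, in + gid * AES_BLOCK_BYTES, AES_BLOCK_BYTES);
    process_block(state, key);
    memcpy(out + gid * AES_BLOCK_BYTES, state, AES_BLOCK_BYTES);
}

/* Host-side emulation of the 1D index space: one work-item per block. */
static void run_all_blocks(const unsigned char *in, unsigned char *out,
                           size_t nbytes, const unsigned char *key)
{
    assert(nbytes % AES_BLOCK_BYTES == 0);  /* whole blocks only */
    size_t nblocks = nbytes / AES_BLOCK_BYTES;
    for (size_t gid = 0; gid < nblocks; ++gid)
        work_item(gid, in, out, key);
}
```

Since the work-items never touch each other's data, the runtime would be free to run each one to completion without switching between them at a barrier.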