Archives Discussions

eklund_n · ‎10-22-2010

compared to standard C implementation

Hi. I am puzzled by the slow execution of OpenCL kernels on the CPU.

I have taken the kernel from the AESEncryptDecrypt sample and modified it so that it takes a char string as input instead of a 2D image. It now works in a 1D index space with a workgroup size of 64. I also have a regular C file that does AES on a single block at a time, http://dl.dropbox.com/u/4230568/c_aes.c, which I call in a loop over the input buffer.

The regular C version maxes out one CPU core during the entire encryption, while the OpenCL kernel maxes out all 8 CPU cores (4 real cores with hyperthreading). But they take roughly the same amount of time to finish, no matter the input size. Why is that?

How can 8 cores (or 4..) do as little work using OpenCL as 1 core using ordinary C? Local memory latency? Thread context switching (local_mem_barrier)?

I only compare the actual calculation time, not host setup.

datlatec · ‎10-22-2010

Did you try to limit CPU execution to the 4 actual cores? I have had (unrelated to OpenCL) experiences where CPU intensive processes actually run slower if I don't limit them to the physical number of cores.

nou · ‎10-22-2010

Well, in my case using __local and barrier() in the kernel led to execution that was several times slower on the CPU.

eklund_n · ‎10-22-2010

Originally posted by: nou Well, in my case using __local and barrier() in the kernel led to execution that was several times slower on the CPU.

I guessed it was something like this. Thanks.

himanshu_gautam · ‎10-22-2010

Can you provide a test case? Maybe we can figure out some ways to improve the performance of your code.

eklund_n · ‎10-25-2010

Originally posted by: himanshu.gautam Can you provide a test case? Maybe we can figure out some ways to improve the performance of your code.

Test case available here: http://dl.dropbox.com/u/4230568/aes.tar.gz (you will need to adapt the Makefile to your system).

Note that this is the kernel from AESEncryptDecrypt with small changes (workgroup size, internal functions). I think it is the large number of thread context switches caused by local memory and barrier(CLK_LOCAL_MEM_FENCE) that lowers performance on the CPU.

For the CPU, each work-item should probably work on its own AES block (16 bytes) and keep everything in private memory.
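
A rough sketch of that layout, written as plain C that emulates the 1D index space on the host rather than as a real kernel. Everything here is an assumption for illustration: the names work_item, run_all_blocks and toy_transform_block are made up, and the transform is a simple XOR placeholder, not AES. The point is only that each work-item copies its own 16 bytes into a private array, works on it, and writes it back, so no __local buffer or barrier() is needed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define AES_BLOCK_BYTES 16

/* Placeholder for the real AES rounds: XORs the block with the key.
   This is NOT AES; it only stands in for the per-block work. */
static void toy_transform_block(uint8_t state[AES_BLOCK_BYTES],
                                const uint8_t key[AES_BLOCK_BYTES])
{
    for (int i = 0; i < AES_BLOCK_BYTES; ++i)
        state[i] ^= key[i];
}

/* Body of one hypothetical work-item. In a kernel, gid would come from
   get_global_id(0) and state would live in private memory. */
static void work_item(size_t gid, const uint8_t *in, uint8_t *out,
                      const uint8_t key[AES_BLOCK_BYTES])
{
    uint8_t state[AES_BLOCK_BYTES];   /* private copy of this block */
    memcpy(state, in + gid * AES_BLOCK_BYTES, AES_BLOCK_BYTES);
    toy_transform_block(state, key);
    memcpy(out + gid * AES_BLOCK_BYTES, state, AES_BLOCK_BYTES);
}

/* Host-side emulation of a 1D index space with one work-item per block.
   Work-items share nothing, so they need no synchronization. */
static void run_all_blocks(const uint8_t *in, uint8_t *out, size_t len,
                           const uint8_t key[AES_BLOCK_BYTES])
{
    assert(len % AES_BLOCK_BYTES == 0);   /* whole blocks only */
    for (size_t gid = 0; gid < len / AES_BLOCK_BYTES; ++gid)
        work_item(gid, in, out, key);
}
```

With this layout the global size would be the input length divided by 16. Whether it is actually faster than the __local version on a given CPU would still have to be measured.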

himanshu_gautam · ‎11-13-2010

eklund.n,

I am not able to use the link you gave.

It would be great if you could send the test case to streamdeveloper@amd.com


AESEncryptDecrypt sample kernel, slow on CPU