Local memory access in CPU

Discussion created by windy96 on May 13, 2010
local memory access latency global memory


When I run the kernel below on an Intel Core 2 Duo CPU, I see a curious result in memory performance.

Running part 1 is faster than running part 2, which means local memory access is faster than global memory access.  Since this CPU has no dedicated local memory, the runtime has to emulate local memory using global memory.  For that reason, I expected local memory access to cost the same as global memory access.  But I got a different result.


I considered whether compiler optimization could explain this.  However, running part 1 alone is clearly slower than running an empty kernel or an empty for loop, so I do not believe the compiler is simply optimizing the local memory writes away.

What is the reason?  Is there a reason the OpenCL CPU platform and runtime are implemented this way?


__kernel void testMemoryAccess(__global float * output,
                               __global float * input,
                               __local float * block,
                               const uint width,
                               const uint height,
                               const uint blockSize)
{
    int i;

    // 1. for accessing local memory
    for (i = 0; i < blockSize * blockSize; i++)
        block[i] = 0.0;

    // 2. for accessing global memory
    for (i = 0; i < blockSize * blockSize; i++)
        output[i] = 0.0;
}
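
One way to compare the two parts in isolation is to split them into separate kernels, so each can be enqueued and timed on its own. The following is only a sketch under assumptions: the kernel names testLocalOnly and testGlobalOnly are hypothetical and not part of the original code, the unused arguments are dropped, and the loop bodies are otherwise the same as parts 1 and 2 above.

```c
// Hypothetical split of testMemoryAccess: each kernel writes to one
// address space only, so its run time can be measured independently.

// Part 1 only: writes blockSize * blockSize floats to __local memory.
__kernel void testLocalOnly(__local float *block, const uint blockSize)
{
    const uint n = blockSize * blockSize;
    for (uint i = 0; i < n; i++)
        block[i] = 0.0f;
}

// Part 2 only: writes blockSize * blockSize floats to __global memory.
__kernel void testGlobalOnly(__global float *output, const uint blockSize)
{
    const uint n = blockSize * blockSize;
    for (uint i = 0; i < n; i++)
        output[i] = 0.0f;
}
```

On the host side, each kernel could be timed with clGetEventProfilingInfo (CL_PROFILING_COMMAND_START and CL_PROFILING_COMMAND_END) on a command queue created with CL_QUEUE_PROFILING_ENABLE, using the same global and local work sizes for both runs.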