0 Replies Latest reply on May 13, 2010 8:24 PM by windy96

    Local memory access in CPU

    windy96
      local memory access latency global memory

       

      When I ran this kernel on an Intel Core 2 Duo CPU, I found a curious case of memory performance.

      Running part 1 is faster than running part 2, which suggests that local memory access is faster than global memory access.  Since this CPU does not have dedicated local memory, it emulates local memory using global memory.  For this reason, local memory access should cost the same as global memory access.  But I got a different result.

       

      I considered whether this could be a compiler optimization.  However, because running part 1 is clearly slower than running an empty kernel or an empty for loop, I do not believe the compiler has simply optimized the local memory loop away.

      What is the reason?  Is there a reason why the OpenCL CPU platform and runtime are implemented this way?

       



      __kernel void testMemoryAccess(__global float * output,
                                     __global float * input,
                                     __local float * block,
                                     const uint width,
                                     const uint height,
                                     const uint blockSize)
      {
          int i;

          // 1. for accessing local memory
          for (i = 0; i < blockSize * blockSize; i++)
              block[i] = 0.0;

          // 2. for accessing global memory
          for (i = 0; i < blockSize * blockSize; i++)
              output[i] = 0.0;
      }
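
      One way to compare the two parts more directly would be to give each loop its own kernel, so that each can be enqueued and timed separately. The following is only a minimal sketch of that idea: the kernel names testLocalOnly and testGlobalOnly are hypothetical, and the small preprocessor shim at the top is an assumption added for illustration so that the same file also compiles as plain C for a quick sanity check of the loops.

```c
/* Host-side shim (an assumption for illustration): when this file is not
   compiled as OpenCL C, define the address-space and kernel qualifiers
   away so the loops can be checked as ordinary C functions. */
#ifndef __OPENCL_VERSION__
#define __kernel
#define __global
#define __local
#endif

/* Hypothetical kernel: part 1 only, writes to the local buffer. */
__kernel void testLocalOnly(__local float *block,
                            const unsigned int blockSize)
{
    unsigned int i;

    for (i = 0; i < blockSize * blockSize; i++)
        block[i] = 0.0f;
}

/* Hypothetical kernel: part 2 only, writes to the global buffer. */
__kernel void testGlobalOnly(__global float *output,
                             const unsigned int blockSize)
{
    unsigned int i;

    for (i = 0; i < blockSize * blockSize; i++)
        output[i] = 0.0f;
}
```

      Each kernel could then be timed on its own, for example with OpenCL event profiling (a command queue created with CL_QUEUE_PROFILING_ENABLE, then clGetEventProfilingInfo with CL_PROFILING_COMMAND_START and CL_PROFILING_COMMAND_END). As in the original kernel, every work-item writes the same elements, so the measurement also depends on the work-group size and on how many work-items run.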