local memory performance question

Discussion created by fajsc88 on Sep 23, 2011
Latest reply on Sep 23, 2011 by notzed
operations on local memory affecting performance of copy to global mem space


In the kernel below, why does the number of respins affect the time required to copy the final results from local to global memory? I'm running a single work-group with a single work-item. The barrier is placed after all the respins have completed. Wouldn't the calculations on local memory be done in cache, with a single cache flush when the barrier is reached? If so, the final copy out to global memory shouldn't depend on the number of respins.

What I'm observing is:

  1. If I omit the final copy out to global memory, the kernel runs extremely fast, regardless of the number of respins.

  2. When the final copy to global memory is included, the kernel run time increases as the number of respins increases.

Observation #2 doesn't make sense to me. Shouldn't the final copy out to global memory take a fixed amount of time regardless of the number of respins?

Thank you.

__kernel void gpufunc(__global uint *input, __global uint *output,
                      __local uint *inl) {
    unsigned int n;
    unsigned int cnt = 0;

    /* Move input to local memory */
    for (n = 0; n < 256; n++) {
        inl[n] = input[n];
    }

respin:
    /* Do some calculations */
    for (n = 0; n < 256; n++) {
        inl[n] = inl[n] + 1;
    }

    /* Repeat the calculation pass until it has run 0xFFF times */
    if (++cnt < 0xFFF) goto respin;

    /* Barrier after all the respins have occurred (fence flag assumed) */
    barrier(CLK_LOCAL_MEM_FENCE);

    /* Move data back to global memory to be read by application */
    for (n = 0; n < 256; n++) {
        output[n] = inl[n];
    }
}
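
A possible explanation to rule out first: when nothing is written to global memory, the kernel has no observable result, so the compiler may be free to drop the respin loop entirely as dead code. In that case observation 1 would be timing an almost empty kernel, and the extra time in observation 2 would be the respin loop itself rather than the copy. The following is a minimal sketch of a probe kernel to test that idea, not a definitive diagnosis. The name gpufunc_probe, the checksum, and the host-side shim at the top are assumptions for illustration; they are not part of the original kernel.

```c
/* Shim (assumption): lets this source also build as plain C for a quick
   CPU sanity check. An OpenCL compiler defines __OPENCL_VERSION__ and
   skips this part. */
#ifndef __OPENCL_VERSION__
typedef unsigned int uint;
#define __kernel
#define __global
#define __local
#define CLK_LOCAL_MEM_FENCE 0
#define barrier(flags) ((void)0)
#endif

/* Hypothetical probe: same local memory work as gpufunc, but only ONE
   word is stored to global memory. The store depends on every element of
   inl, so the respin passes cannot be discarded as unused. */
__kernel void gpufunc_probe(__global uint *input, __global uint *output,
                            __local uint *inl) {
    uint n;
    uint cnt;
    uint sum = 0;

    /* Move input to local memory */
    for (n = 0; n < 256; n++) {
        inl[n] = input[n];
    }

    /* Same number of passes as the goto-based respin loop (0xFFF) */
    for (cnt = 0; cnt < 0xFFF; cnt++) {
        for (n = 0; n < 256; n++) {
            inl[n] = inl[n] + 1;
        }
    }

    barrier(CLK_LOCAL_MEM_FENCE);

    /* Reduce to a single checksum instead of copying all 256 words */
    for (n = 0; n < 256; n++) {
        sum += inl[n];
    }
    output[0] = sum;
}
```

If this probe slows down with the respin count even though it stores only one word, the time is going into the local memory passes and not into the final copy. A compiler could still fold the repeated +1 passes into a single addition, so timings from it are only indicative.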