Adept I

How do OpenCL global memory requests work?

My issue is with Section 2.6.1 of the "OpenCL Optimization Guide" (Hiding Memory Latency with ALU Operations).

It seems that when you send a global memory request, one wavefront on the respective compute unit is tied up waiting for the reply?

"The wavefront that generates the global memory access is made idle until the memory request completes."

"consider a global memory access which takes 400 cycles to execute"
"To fully hide the 400 cycles of latency, the compute unit requires [100 wavefront instructions]."

Only 100... shouldn't it be 300?
A compute unit has 4 SIMDs, each SIMD can process a wavefront, and only 1 SIMD is waiting on memory?

Or is it 400 and this is all wrong because waiting on memory doesn't happen on SIMDs?

// Unrelated Question below

// I need to test it more; all my tests have failed...
I can't seem to understand the example in Section 2.1.1 Channel Conflicts
"In the next example of a kernel for copying, the input and output buffers are
interpreted as though they were 2D, and the work-group size is organized as 2D."
#define WIDTH 1024
#define DATA_TYPE float
#define A(y, x) A[ (y) * WIDTH + (x) ]         // how?
#define C(y, x) C[ (y) * WIDTH + (x) ]
kernel void copy_float (__global const DATA_TYPE* A,
                        __global DATA_TYPE* C)
{
    int idx = get_global_id(0);
    int idy = get_global_id(1);
    C(idy, idx) = A(idy, idx);
}
How does this work? Can't arrays only be constant in size? (It's an input, so it's special? Does it know the size somehow?)
Also, shouldn't it be more like A[y*WIDTH*x]?
I can see how y would have some effect, but x just looks like it's adding a few elements to the end of the array/buffer, not changing the fake image's dimensions?
Is A[ (y) * WIDTH + (x) ] different to A[ y * WIDTH + x ]?
How is the work-group 2D?

3 Replies
Big Boss

A SIMD has 16 ALUs and it takes four cycles to complete a wavefront or 64 work-items. In other words, the hardware executes each instruction in the wavefront in four cycles. So, to fully hide the 400 cycles of latency, the hardware needs (400 / 4) = 100 wavefront instructions.

For the 2nd part of your post:

It's just an example code and it uses macros to quickly produce a set of kernels by changing the width, the data type and the work-group dimensions. 

Is A[ (y) * WIDTH + (x ) ] different to A[ y * WIDTH + x ]

Yes, the outcome can be different. For details, please see this discussion:


I get that on a SIMD, 100 wavefront instructions = 400 cycles, because each instruction executes over 4 cycles.
I don't get what "The wavefront that generates the global memory access is made idle until the memory request completes." means.
Q1: Is 1 of the 4 SIMDs on a compute unit completely idle while it waits for a memory request?
Q2: It says the "wavefront" is idle until the memory request completes; does that mean the work-group it is attached to continues executing? (This sounds like a basic question about how parallel work-groups actually are; I need to go do some revision...)

Thank you for the Stack Overflow link.


1. No. While the wavefront is waiting or idle, the SIMD can select another wavefront which is ready to execute and start executing it.

2. Yes. If other wavefronts belonging to the same work-group have no dependency on the idle wavefront, they can progress. In general, wavefronts are executed independently until synchronization is needed.