How do OpenCL global memory requests work?

Question asked by cardboard on Apr 1, 2019
Latest reply on Apr 3, 2019 by dipak

My question is about Section 2.6.1 of the "OpenCL Optimization Guide" (Hiding Memory Latency with ALU Operations).

It seems that when you send a global memory request, one wavefront is tied up on the respective ALU while it waits for the reply. Is that right?

"The wavefront that generates the global memory access is made idle until the memory request completes."

"consider a global memory access which takes 400 cycles to execute"
"To fully hide the 400 cycles of latency, the compute unit requires [100 wavefront instructions]."

Only 100? Shouldn't it be 300?
My reasoning: an ALU has 4 SIMDs, each SIMD can process a wavefront, and only 1 SIMD is waiting on memory, which leaves 3 free.

Or is it 400, and my reasoning is wrong because waiting on memory doesn't occupy a SIMD at all?
Thanks.

// Unrelated question below
// (I need to test this more; all of my tests have failed so far.)

I can't seem to understand the example in Section 2.1.1 (Channel Conflicts):
"In the next example of a kernel for copying, the input and output buffers are
interpreted as though they were 2D, and the work-group size is organized as 2D."
#define WIDTH 1024
#define DATA_TYPE float
#define A(y, x) A[(y) * WIDTH + (x)]   // how?
#define C(y, x) C[(y) * WIDTH + (x)]

kernel void copy_float(__global const DATA_TYPE *A,
                       __global DATA_TYPE *C)
{
    int idx = get_global_id(0);
    int idy = get_global_id(1);
    C(idy, idx) = A(idy, idx);
}
How does this work? Can't arrays only have a constant size? (It's an input, so is it special? Does it know its size somehow?)
Also, shouldn't it be more like A[y*WIDTH*x]?
I can see how y would have some effect, but x just looks like it's adding a few elements to the end of the array/buffer, not changing the fake image's dimensions.
Is A[ (y) * WIDTH + (x ) ] different from A[ y * WIDTH + x ]?
How is the work group 2D?
Thank you!
