My issue is with Section 2.6.1 of the "OpenCL Optimization Guide" (Hiding Memory Latency with ALU Operations).
It seems that when you send a global memory request, one wavefront is tied up on the respective ALU while it waits for the reply?
"The wavefront that generates the global memory access is made idle until the memory request completes."
"consider a global memory access which takes 400 cycles to execute"
"To fully hide the 400 cycles of latency, the compute unit requires [100 wavefront instructions]."
Only 100? Shouldn't it be 300?
An ALU has 4 SIMDs, each SIMD can process a wavefront, and only 1 SIMD is waiting on memory, right?
Or is it 400, and this is all wrong because waiting on memory doesn't happen on the SIMDs?
// Unrelated question below
// I need to test it more; all my tests have failed so far...
I can't seem to understand the example in Section 2.1.1, Channel Conflicts.
"In the next example of a kernel for copying, the input and output buffers are
interpreted as though they were 2D, and the work-group size is organized as 2D."
#define WIDTH 1024
#define DATA_TYPE float
#define A(y, x) A[(y) * WIDTH + (x)] // how?
#define C(y, x) C[(y) * WIDTH + (x)]
kernel void copy_float(__global const DATA_TYPE *A,
                       __global DATA_TYPE *C)
{
    int idx = get_global_id(0);
    int idy = get_global_id(1);
    C(idy, idx) = A(idy, idx);
}
How does this work? Can't arrays only be constant in size? (It's an input, so is it special? Does it know the size somehow?)
Also, shouldn't it be more like A[y*WIDTH*x]?
I can see how y would have some effect, but x just looks like it's adding a few elements to the end of the array/buffer, not changing the fake image's dimensions.
Is A[ (y) * WIDTH + (x ) ] different from A[ y * WIDTH + x ]?
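For reference, here is a minimal plain-C sketch of the indexing the macro performs, assuming the buffer is laid out row by row (row-major). The names flat_index, IDX_SAFE and IDX_UNSAFE are made up for illustration, and WIDTH is shrunk to 4 so the numbers are easy to check by hand. Under that layout, y * WIDTH skips y whole rows and + x steps along the row. A product like y*WIDTH*x would map every element of row 0 and column 0 to offset 0. The parentheses only matter when an argument is an expression such as idy + 1.

```c
#include <stddef.h>

#define WIDTH 4  /* small illustrative width; the guide's example uses 1024 */

/* Row-major offset of element (y, x) in a flat buffer with WIDTH columns:
   skip y full rows, then move x elements along that row. */
static size_t flat_index(size_t y, size_t x)
{
    return y * WIDTH + x;
}

/* Same arithmetic shape as the guide's A(y, x) macro, without the array name. */
#define IDX_SAFE(y, x)   ((y) * WIDTH + (x))

/* Hypothetical variant without the inner parentheses, for comparison only.
   IDX_UNSAFE(1 + 1, 0) expands to (1 + 1 * WIDTH + 0), which is not row 2. */
#define IDX_UNSAFE(y, x) (y * WIDTH + x)
```

With plain variables as arguments the two macros give the same offset; they differ only when an argument contains an operator of lower precedence than `*`.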
How is the work group 2D?
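As a rough illustration of the 2D part, the sketch below emulates on the CPU what a two-dimensional global range does: the kernel body runs once per (idx, idy) pair. This is only an assumption-laden model, not how a device schedules work. HEIGHT, copy_item and run_copy_2d are hypothetical names, and WIDTH is reduced to 8. In real OpenCL the host, not the kernel, chooses the dimensionality, through the work_dim argument and the global/local size arrays passed to clEnqueueNDRangeKernel. The kernel never sees the buffer size; it just receives a pointer and computes offsets.

```c
#include <stddef.h>

#define WIDTH  8   /* illustrative; the guide's example uses 1024 */
#define HEIGHT 4   /* hypothetical row count, not taken from the guide */

/* Stand-in for the kernel body: one call corresponds to one work-item. */
static void copy_item(const float *A, float *C, size_t idx, size_t idy)
{
    C[idy * WIDTH + idx] = A[idy * WIDTH + idx];
}

/* Emulates a 2D global range of WIDTH x HEIGHT work-items with two loops.
   On a device these iterations would run as parallel work-items. */
static void run_copy_2d(const float *A, float *C)
{
    for (size_t idy = 0; idy < HEIGHT; ++idy)        /* plays get_global_id(1) */
        for (size_t idx = 0; idx < WIDTH; ++idx)     /* plays get_global_id(0) */
            copy_item(A, C, idx, idy);
}
```

Every element of the flat buffer is touched exactly once because the WIDTH x HEIGHT pairs of IDs cover every row-major offset from 0 to WIDTH * HEIGHT - 1.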