First, about me:
I'm a student from Germany. I am studying Computational Engineering and I write GPU-accelerated numerical software.
With the current AMD Radeon 18.2.1 drivers, there seems to be an issue using clEnqueueReadBufferRect (I haven't tested clEnqueueWriteBufferRect though) with large numbers:
Suppose we want to generate BEM-matrices on the GPU. For problem sizes above a specific limit, the problem cannot be calculated in one run because either the total available memory size on the device or the maximum available memory per buffer is too small to hold the full matrices.
We now split the problem and just calculate parts of the matrices on the GPU.
Suppose our problem is large, say n=23171.
The host buffer now has a row width (for std::complex<float>) of 23171 * 8 = 185368 bytes.
The kernel now generates a certain number of columns of the whole matrix, hence clEnqueueReadBufferRect.
Because our matrix is square, the length of the rows, region, is 23171.
Even though nothing is special about the number 23171, it is no longer possible to profile the clEnqueueReadBufferRect workloads, because the values that getProfilingInfo<CL_PROFILING_COMMAND_END>() and getProfilingInfo<CL_PROFILING_COMMAND_START>() return from the profiling events are identical!
I can neither confirm nor deny the correctness of the data read or written yet, because the mesh for the test case is (more or less) randomly generated.
However, let's take a look at the parameters of clEnqueueReadBufferRect:
I identified region and host_row_pitch as the offending parameters.
A few examples (first the region parameter, then the host_row_pitch parameter):
23171 | 23171 * 8 -> FAIL
23170 | 23171 * 8 -> FAIL
23170 | (23171 * 8) - 1 = 23170 | 185367 -> WORKS
Why does 23170 | 185367 work? Well, let's take a look at the products of the parameters:
23171 * 23171 * 8 = 23171 * 185368 = 4 295 161 928 -> FAIL
23170 * 23171 * 8 = 23170 * 185368 = 4 294 976 560 -> FAIL
23170 * 185367 = 4 294 953 390 -> WORKS
Note the limit of uint32_t: 4 294 967 295!
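The three cases above can be checked on the host before enqueueing the read. This is only a minimal sketch of the arithmetic: the helper names (extent_bytes, fits_in_u32, truncated_u32) are made up for illustration and are not part of OpenCL or of the AMD runtime, and the assumption that the driver multiplies exactly these two values in 32 bits is a guess based on the observations above.

```cpp
#include <cstdint>

// Hypothetical helper: the product of the region parameter (as a row count)
// and host_row_pitch, computed in 64 bits so it cannot wrap.
constexpr std::uint64_t extent_bytes(std::uint64_t rows, std::uint64_t host_row_pitch) {
    return rows * host_row_pitch;
}

// True if that product still fits into a uint32_t (at most 4 294 967 295).
constexpr bool fits_in_u32(std::uint64_t rows, std::uint64_t host_row_pitch) {
    return extent_bytes(rows, host_row_pitch) <= UINT32_MAX;
}

// The value that would remain if the product were stored in a uint32_t
// (assumption: this is what happens inside the driver).
constexpr std::uint32_t truncated_u32(std::uint64_t rows, std::uint64_t host_row_pitch) {
    return static_cast<std::uint32_t>(extent_bytes(rows, host_row_pitch));
}
```

With these helpers, fits_in_u32(23171, 185368) and fits_in_u32(23170, 185368) are false, while fits_in_u32(23170, 185367) is true, which matches the FAIL/WORKS pattern above.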
To cut a long story short: It seems like AMD is using uint32_t internally in their drivers, which causes an overflow for clEnqueueReadBufferRect.
Thanks for reporting it. Could you please provide a repro that manifests the issue? Also, please share clinfo output and other setup details.
P.S. You've been whitelisted now.
As I've come to know, the runtime currently has a 4GB limit for clEnqueue[Read/Write]BufferRect, and a transfer may fail when the memory size is larger than this limit. The runtime may address this issue in the future.
[P.S. clEnqueue[Read/Write]Buffer seems to work fine in this case.]
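Until the limit is lifted, one possible workaround is to split a large rectangular read into several calls, each covering few enough rows that rows * host_row_pitch stays below the limit. The sketch below only does the bookkeeping on the host; the names max_rows_per_call and calls_needed are hypothetical, and it assumes the limit applies to exactly this product, which has not been confirmed for the runtime.

```cpp
#include <cstdint>

// Assumed limit: the largest value a uint32_t can hold.
constexpr std::uint64_t kTransferLimit = 0xFFFFFFFFull;

// Largest row count per call so that rows * host_row_pitch <= kTransferLimit.
// host_row_pitch must be non-zero.
constexpr std::uint64_t max_rows_per_call(std::uint64_t host_row_pitch) {
    return kTransferLimit / host_row_pitch;
}

// Number of clEnqueueReadBufferRect calls needed to cover total_rows
// (ceiling division by the per-call row count).
constexpr std::uint64_t calls_needed(std::uint64_t total_rows, std::uint64_t host_row_pitch) {
    return (total_rows + max_rows_per_call(host_row_pitch) - 1)
           / max_rows_per_call(host_row_pitch);
}
```

For the case from the report (host_row_pitch = 185368, 23171 rows), max_rows_per_call gives 23169, so the read would be split into 2 calls, with the buffer and host origins advanced by the number of rows already transferred.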
One point to note. It is recommended to use pre-pinned system memory for transfers to avoid pinning overhead or double copy in read/write.