OpenCL

robin_christ · ‎02-03-2018

Hello community,

First, about me:

I'm a student from Germany. I am studying Computational Engineering and I write GPU-accelerated numerical software.

With the current AMD Radeon 18.2.1 drivers, there seems to be an issue using clEnqueueReadBufferRect (I haven't tested clEnqueueWriteBufferRect though) with large numbers:

Suppose we want to generate BEM-matrices on the GPU. For problem sizes above a specific limit, the problem cannot be calculated in one run because either the total available memory size on the device or the maximum available memory per buffer is too small to hold the full matrices.

We now split the problem and just calculate parts of the matrices on the GPU.
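To make the splitting concrete, here is a minimal sketch of how the number of columns per pass could be derived. The helper names are hypothetical. It assumes a single n x n std::complex<float> matrix, a device buffer that holds one block of full-height columns, and a per-buffer limit such as the value reported for CL_DEVICE_MAX_MEM_ALLOC_SIZE.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: how many full-height columns of an n x n
// std::complex<float> matrix fit into one device buffer.
// max_alloc_bytes is an assumed per-buffer limit (for example the value
// reported for CL_DEVICE_MAX_MEM_ALLOC_SIZE).
inline std::uint64_t columns_per_pass(std::uint64_t n, std::uint64_t max_alloc_bytes) {
    assert(n > 0);
    const std::uint64_t column_bytes = n * sizeof(std::complex<float>);
    const std::uint64_t cols = max_alloc_bytes / column_bytes;
    return cols < n ? cols : n;  // never more columns than the matrix has
}

// Hypothetical helper: number of kernel runs needed to cover all n columns.
inline std::uint64_t passes_needed(std::uint64_t n, std::uint64_t max_alloc_bytes) {
    const std::uint64_t cols = columns_per_pass(n, max_alloc_bytes);
    assert(cols > 0);  // at least one column must fit into the buffer
    return (n + cols - 1) / cols;  // ceiling division
}
```

With an assumed limit of 1 GiB and n = 23171, this gives 5792 columns per pass and 5 passes; the real limit depends on the device.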

Suppose our problem is large, say n=23171.

The host buffer now has a row width (for std::complex<float>) of 23171 * 8 = 185368 bytes.

The kernel now generates a certain number of columns of the whole matrix, hence clEnqueueReadBufferRect.

Because our matrix is square, the number of rows, region[1], is 23171.
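For illustration, the geometry arguments for reading one column block could be assembled roughly like this. RectArgs and make_rect_args are hypothetical names, and the sketch assumes that the device buffer holds only the current column block, tightly packed and row-major, while the host matrix is the full n x n row-major array. In clEnqueueReadBufferRect, region[0] and the x origins are given in bytes, region[1] in rows.

```cpp
#include <array>
#include <complex>
#include <cstddef>

// Hypothetical bundle of the geometry arguments of clEnqueueReadBufferRect.
struct RectArgs {
    std::array<std::size_t, 3> buffer_origin;  // {x in bytes, y in rows, z in slices}
    std::array<std::size_t, 3> host_origin;    // same units as buffer_origin
    std::array<std::size_t, 3> region;         // {width in bytes, height in rows, depth}
    std::size_t buffer_row_pitch;              // bytes per row in the device buffer
    std::size_t host_row_pitch;                // bytes per row in the host matrix
};

// Assumption: the device buffer contains only the column block
// [first_col, first_col + num_cols), tightly packed, all n rows.
inline RectArgs make_rect_args(std::size_t n, std::size_t first_col, std::size_t num_cols) {
    constexpr std::size_t elem = sizeof(std::complex<float>);  // 8 bytes
    RectArgs a{};
    a.buffer_origin = std::array<std::size_t, 3>{0, 0, 0};
    a.host_origin = std::array<std::size_t, 3>{first_col * elem, 0, 0};
    a.region = std::array<std::size_t, 3>{num_cols * elem, n, 1};
    a.buffer_row_pitch = num_cols * elem;
    a.host_row_pitch = n * elem;  // 23171 * 8 = 185368 for n = 23171
    return a;
}
```

The arrays would be passed via .data() as buffer_origin, host_origin and region, with both slice pitches set to 0.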

Even though nothing is special about the number 23171, it is no longer possible to profile the clEnqueueReadBufferRect workloads, because the values returned from the profiling events for getProfilingInfo<CL_PROFILING_COMMAND_END>() and CL_PROFILING_COMMAND_START are identical!

I can neither confirm nor deny the correctness of the data read or written yet, because the mesh for the test case is (more or less) randomly generated.

However, let's take a look at the parameters of clEnqueueReadBufferRect:

I identified region[1] and host_row_pitch as the offending parameters.

A few examples (first the region[1] parameter, then the host_row_pitch parameter):

23171 | 23171 * 8 -> FAIL

23170 | 23171 * 8 -> FAIL

23170 | (23171 * 8) - 1 = 23170 | 185367 -> WORKS

Why does 23170 | 185367 work? Well, let's take a look at the products of the parameters:

23171 * 23171 * 8 = 23171 * 185368 = 4 295 161 928 -> FAIL

23170 * 23171 * 8 = 23170 * 185368 = 4 294 976 560 -> FAIL

23170 * 185367 = 4 294 953 390 -> WORKS

Note the limit of uint32_t: 4 294 967 295!

To cut a long story short: It seems like AMD is using uint32_t internally in their drivers, which causes an overflow for clEnqueueReadBufferRect.
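The arithmetic above can be checked with a short sketch. The function names are hypothetical, and the 32-bit product only models the suspected behaviour; it is not taken from the driver.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical check: true if region[1] * host_row_pitch no longer fits
// into a uint32_t. The product is computed in 64 bits so it cannot wrap.
constexpr bool exceeds_uint32(std::uint64_t rows, std::uint64_t host_row_pitch) {
    return rows * host_row_pitch > std::numeric_limits<std::uint32_t>::max();
}

// Model of what a 32-bit multiplication would keep: the product modulo 2^32.
// Whether the driver really computes this value is an assumption.
constexpr std::uint32_t wrapped_product(std::uint64_t rows, std::uint64_t host_row_pitch) {
    return static_cast<std::uint32_t>(rows * host_row_pitch);
}

static_assert(exceeds_uint32(23171, 185368), "4 295 161 928 is above the limit");
static_assert(exceeds_uint32(23170, 185368), "4 294 976 560 is above the limit");
static_assert(!exceeds_uint32(23170, 185367), "4 294 953 390 is below the limit");
```

Both failing combinations are above 4 294 967 295, and the working one is just below it, which matches the observed FAIL/WORKS pattern.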

dipak · ‎02-07-2018

Thanks for reporting it. Could you please provide a repro that manifests the issue? Also, please share clinfo output and other setup details.

P.S. You've been whitelisted now.

robin_christ · ‎04-07-2018

Hi dipak,

I packed everything into a repo: GitHub - robinchrist/amdbugdemo_1: Proof of concept for bug in AMD Driver

If you need more information, just tell me.

dipak · ‎04-09-2018

Hi Robin,

Thanks for providing the reproducible test-case. We'll check and get back to you.

Regards,

dipak · ‎04-11-2018

I was able to reproduce the issue. I'll report it to the concerned team.

dipak · ‎04-19-2018

As I've come to know, the runtime currently has a 4GB limit for clEnqueue[Read/Write]BufferRect, and a transfer may fail when the memory size is larger than this limit. The runtime may address this issue in the future.

[P.S. clEnqueue[Read/Write]Buffer seems to work fine in this case.]
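Until the limit is lifted, one possible workaround is to split a rectangular transfer into bands of rows, so that rows * host_row_pitch stays within 32 bits for every call. The following is only a sketch under that assumption: RowBand and split_rows are hypothetical names, and the approach has not been verified against the driver.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical description of one band of rows to transfer in a single call.
struct RowBand {
    std::uint64_t first_row;  // would go into buffer_origin[1] and host_origin[1]
    std::uint64_t rows;       // would go into region[1]
};

// Splits total_rows into bands such that rows * host_row_pitch never
// exceeds the uint32_t maximum. This assumes that exactly this product is
// what overflows, which is a hypothesis from the thread, not a documented fact.
inline std::vector<RowBand> split_rows(std::uint64_t total_rows, std::uint64_t host_row_pitch) {
    assert(host_row_pitch > 0);
    const std::uint64_t limit = std::numeric_limits<std::uint32_t>::max();
    const std::uint64_t max_rows = limit / host_row_pitch;
    assert(max_rows > 0);  // a single row must stay below the limit

    std::vector<RowBand> bands;
    for (std::uint64_t row = 0; row < total_rows; row += max_rows) {
        const std::uint64_t left = total_rows - row;
        bands.push_back({row, left < max_rows ? left : max_rows});
    }
    return bands;
}
```

For 23171 rows with a host_row_pitch of 185368 bytes, this yields two bands: 23169 rows starting at row 0 and 2 rows starting at row 23169. Each band would then be issued as its own clEnqueueReadBufferRect call.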

One point to note. It is recommended to use pre-pinned system memory for transfers to avoid pinning overhead or double copy in read/write.

Regards,

Bug in AMD driver