OpenCL

balogh
Adept I

OpenCL async_work_group_copy with non-uniform workgroups

Hello!

I am writing a neighbourhood analysis algorithm in which the kernel receives its data in three distinct buffers, each containing 2D data. Stacked one on top of the other, they logically form a bigger "picture". The neighbourhood operation is performed on the middle buffer. Each analyzed row requires the two rows above it and the two rows below it, so I need the two neighbouring buffers in order to read the last two "rows" of the first buffer and the first two "rows" of the third buffer.

The host code often restricts the width and height of the data to be analyzed by means of global offsets and global sizes, which often results in non-uniform workgroup sizes.

The algorithm runs fine when the workgroup data is copied from global memory to local memory manually: each work item copies its "focus" byte and, if it is close to the "edge" of the 2D data in local memory, also copies its relevant neighbours. This makes the initialization code for the local 2D array a little ugly, since the neighbours, depending on the workgroup's position in the "big picture", might come from any of the three distinct buffers.
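To illustrate why the neighbour lookup gets awkward, here is a minimal host-side C++ sketch of the buffer selection described above. It is not the kernel from this post: the names StackedView and read_stacked are hypothetical, and it assumes all three buffers are row-major with the same width and the same number of rows.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical view over three stacked 2D buffers of identical shape.
struct StackedView {
    const std::uint8_t* above;   // buffer logically on top of the middle one
    const std::uint8_t* middle;  // buffer the neighbourhood operation runs on
    const std::uint8_t* below;   // buffer logically underneath the middle one
    std::size_t width;           // elements per row (assumed equal for all three)
    std::ptrdiff_t rows;         // rows per buffer (assumed equal for all three)
};

// Read one element using a row index relative to the middle buffer:
// negative rows fall into `above`, rows >= v.rows fall into `below`.
std::uint8_t read_stacked(const StackedView& v, std::ptrdiff_t row, std::size_t col) {
    assert(col < v.width);
    assert(row >= -v.rows && row < 2 * v.rows);
    const std::uint8_t* base = v.middle;
    if (row < 0) {
        base = v.above;
        row += v.rows;   // e.g. row -1 is the last row of `above`
    } else if (row >= v.rows) {
        base = v.below;
        row -= v.rows;   // e.g. row v.rows is the first row of `below`
    }
    return base[static_cast<std::size_t>(row) * v.width + col];
}
```

With a two-row halo, a work item in row 0 of the middle buffer would read rows -2 and -1 through such a helper, and a work item in the last row would read rows v.rows and v.rows + 1.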

Therefore I wanted to try async_work_group_copy to copy entire rows into local memory from top to bottom, making the code a little more readable while hoping to get at least the same performance.
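A row-by-row copy of this kind can be modelled on the host to produce reference data for comparison with what the kernel leaves in local memory. The sketch below is only such a model, under the assumption of a row-major source with a fixed pitch; copy_rows and count_mismatches are hypothetical names, and each per-row copy merely stands in for one async_work_group_copy call (followed by wait_group_events) in a real kernel.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Copy `row_count` rows of `row_len` elements each, starting at
// (first_row, first_col) of a row-major source with `src_pitch` elements per row.
// One loop iteration models one row-wise async copy into the local tile.
std::vector<std::uint8_t> copy_rows(const std::uint8_t* src, std::size_t src_pitch,
                                    std::size_t first_row, std::size_t first_col,
                                    std::size_t row_len, std::size_t row_count) {
    assert(first_col + row_len <= src_pitch);
    std::vector<std::uint8_t> tile(row_len * row_count);
    for (std::size_t r = 0; r < row_count; ++r) {
        const std::uint8_t* src_row = src + (first_row + r) * src_pitch + first_col;
        std::copy_n(src_row, row_len, tile.begin() + r * row_len);
    }
    return tile;
}

// Count positions where the data read back from the device differs
// from the host-side reference tile.
std::size_t count_mismatches(const std::vector<std::uint8_t>& expected,
                             const std::vector<std::uint8_t>& actual) {
    assert(expected.size() == actual.size());
    std::size_t mismatches = 0;
    for (std::size_t i = 0; i < expected.size(); ++i) {
        if (expected[i] != actual[i]) ++mismatches;
    }
    return mismatches;
}
```

Writing the local tile back to a global debug buffer and comparing it against copy_rows output is one way to see exactly which elements of a row arrive wrong.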

With uniform workgroup sizes the copying is always correct.

However, with non-uniform workgroup sizes, the "rightmost" workgroups, i.e. those that are narrower than the rest (local size(0) < enqueued local size(0)), get random values in the last 4 bytes of the "rows" that were supposed to be copied from global memory to local memory (the values differ with every run of the test program).
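For clarity, the relation between the two sizes mentioned above can be written down as plain arithmetic. The following C++ sketch is a host-side model for one dimension only; num_groups and local_size_of_group are hypothetical helper names, not OpenCL functions. It mirrors the OpenCL 2.0 behaviour where the enqueued local size is what the host passed in, while the actual local size of the last workgroup shrinks to whatever remains of the global size.

```cpp
#include <cassert>
#include <cstddef>

// Number of workgroups along one dimension (ceiling division).
std::size_t num_groups(std::size_t global_size, std::size_t enqueued_local_size) {
    assert(enqueued_local_size > 0);
    return (global_size + enqueued_local_size - 1) / enqueued_local_size;
}

// Actual local size of workgroup `group_id` along one dimension.
// Every group has the enqueued size except possibly the last one,
// which only covers the remainder of the global size.
std::size_t local_size_of_group(std::size_t group_id, std::size_t global_size,
                                std::size_t enqueued_local_size) {
    assert(group_id < num_groups(global_size, enqueued_local_size));
    const std::size_t start = group_id * enqueued_local_size;
    const std::size_t remaining = global_size - start;
    return remaining < enqueued_local_size ? remaining : enqueued_local_size;
}
```

For example, a global size of 100 with an enqueued local size of 16 gives seven workgroups, the last of which is only 4 work items wide. A global size smaller than the enqueued local size gives a single, narrower workgroup.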

GPU: Radeon Instinct MI100, ROCm v5.4.0, openSUSE 15.3

My question is whether this is a known limitation of the AMD implementation of async_work_group_copy or simply a stealthy bug that manifests with my usage pattern. I don't have an NVIDIA card that supports non-uniform workgroups, so I couldn't test this with a different OpenCL implementation.

I could provide source code with irrelevant parts removed if it might help.

Thank you!

Best regards,

Claudiu Balogh

21 Replies
dipak
Big Boss

Hi Claudiu,

Thank you for reporting it. I have moved the post to the OpenCL forum and whitelisted you for the AMD Devgurus community.

Could you please provide a minimal reproducible code example? Also, please share the driver information and the clinfo output.

Thanks.

balogh
Adept I

Hi Dipak

Thank you for your interest in my report.

The source code for a working example can be found at async-test.

Build command:

g++ -std=c++14 async-test.cpp -o async-test -I/opt/rocm/include -L/opt/rocm/lib -lOpenCL

If this is not an error in my OpenCL code, I would be happy to open a ROCm issue on GitHub.

Thank you!

Best regards,

Claudiu Balogh

Thanks for providing the reproducible source code. I will forward the issue to the OpenCL team. I will let you know once I get their feedback on this.

Thanks.

Hi @balogh ,

Could you please try compiling the kernel without optimization (i.e. with -O0 or -cl-opt-disable) to check whether it works?

Thanks.

Hi @dipak

It reproduces the same way without optimizations.

HTH

Best regards,

Claudiu Balogh

Thanks for the information.

Hi @balogh ,

According to the OpenCL team, async_work_group_copy is expected to work fine with non-uniform workgroups. They checked the implementation, and it properly accounts for non-uniform workgroups.

Since there are multiple calls to async_work_group_copy in the code example you provided, could you please identify the specific calls that you think are not producing the correct result?

Thanks.

Hi @dipak 

Any of the calls can fail for non-uniform workgroups.

Thank you!

Best regards,

Claudiu Balogh

Okay, thanks for this information.

Hi @balogh ,

I was trying to reproduce the issue with the source code you provided and I observed the following points:

1) If TILE_H is set to 1, there appears to be no issue for non-uniform workgroups. When TILE_H > 1, the mismatch occurs (even for a single work-group).

2) For non-uniform workgroups, the number of wrong values (i.e. "mismatches") per row seems to be related to the neighboring column and row values, i.e. UNBR, LNBC, BUF_HEAD_PAD etc. For example, when I changed all these values to 3, the mismatch count changed accordingly.

Could you please try the above cases? If you observe the same, it would be helpful if you could check the program logic for non-uniform workgroups to confirm whether it works as expected.

One more point: in the following line (in "async-test.cpp"), it is better to use "cl_long" instead of "long", as this line may produce an error on platforms where "long" is not the same as "cl_long".

cl::KernelFunctor<cl::Buffer, cl::Buffer, cl::Buffer, long, long, long> k(prg, "copylocal", &kErr);
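The width concern can be checked without any OpenCL headers. In the sketch below, kernel_long is a hypothetical stand-in for cl_long, assumed here to be a 64-bit signed integer like the "long" type inside an OpenCL C kernel. The host "long" is 64 bits on typical Linux systems but only 32 bits on some other platforms, which is where a mismatch would show up.

```cpp
#include <cassert>
#include <climits>
#include <cstdint>
#include <limits>

// Stand-in for cl_long (assumption: 64-bit signed, matching the kernel's `long`).
using kernel_long = std::int64_t;

static_assert(sizeof(kernel_long) * CHAR_BIT == 64,
              "kernel-side long is expected to be 64 bits wide");

// True when passing a host `long` where the kernel expects a 64-bit long
// happens to have the right size on this platform.
constexpr bool host_long_matches_kernel_long() {
    return sizeof(long) == sizeof(kernel_long);
}

// Widening a host `long` to the kernel's type is always safe.
kernel_long to_kernel_long(long value) {
    return static_cast<kernel_long>(value);
}

// Narrowing back is only safe when the value fits into the host `long`.
long to_host_long(kernel_long value) {
    assert(value >= std::numeric_limits<long>::min() &&
           value <= std::numeric_limits<long>::max());
    return static_cast<long>(value);
}
```

In the real host code, the equivalent change would be to spell the last three functor template arguments as cl_long, so that the argument size no longer depends on the host platform.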

Thanks. 

Hi @dipak ,

Since the code works fine with uniform workgroups I assume the code is correct, unless someone can point to the error when run with non-uniform workgroups.

I checked the code several times and all the arguments passed to the async_work_group_copy calls are correct regardless of the uniformity of the workgroup.

At this point it would be useful if the compiler team could try the code and point to the problem, be it in my code or the OpenCL implementation.

Thank you!

Best regards,

Claudiu Balogh

Thanks for your feedback. I will check with the OpenCL team and file a bug ticket if needed.

Thanks.

Hi @balogh ,

I modified your kernel code to create a reduced test-case (attached here) for this "async_work_group_copy" issue. It seems the issue is reproducible even for a single work-group when the global work size is less than the local work-group size.

Could you please try the attached code and let me know if the issue is reproducible with it?

Thanks.

Hi @dipak

Thank you for the reduced test case.

I tried it and the issue is reproducible.

Best regards,

Claudiu Balogh

Thanks for the confirmation. I have shared this test-case and my observations with the OpenCL team. I will let you know once I get their feedback on this.

Thanks.

Hi @balogh ,

The OpenCL team was able to reproduce the issue and has identified its root cause. There seems to be a bug in a low-level function used by the async copy. They will investigate it in detail and implement a fix.

Thanks.

balogh
Adept I

Hi @dipak 

Was a bug report submitted for this problem?

If there was, I would like to follow it.

Thank you!

Best regards,

Claudiu Balogh

I will check with the OpenCL team and get back to you.

Thanks.

Hi @balogh ,

The OpenCL team has informed me that the issue has been fixed internally.

Thanks.

Hello @dipak

Could you please guide me in trying out the fix?

Is the fix available in some ROCm release, and will it be available in ROCm 5.4.0 or older (we currently use 5.2.1 in production)? Newer ROCm releases dropped support for the OS that we currently use in production, so it would be best if we could also benefit from the fix with the older ROCm releases.

If it is not available, could you point me to where I could get it from?

Thank you!

Best regards,

Claudiu Balogh

Hi @balogh ,

As I have been informed, the fix has not been publicly released yet. ROCm 5.7 will probably include it, but this is only a tentative timeline and may change as the release team decides.

"will it be available in rocm 5.4.0 or older"

I think it is unlikely that the fix will be ported to older ROCm versions.

Thanks.