OpenCL

lolliedieb · ‎12-13-2019

Hi all.
Well, the title already describes it.
I have code that uses 64 KB of LDS on a Radeon VII and an RX 5700. The work-group size is 1024.

It works fine on Ubuntu 16.04 and 18.04 using amdgpu-pro 18.50, 19.30 and ROCm 2.10 (all on the VII), and on another system using 19.30 on the RX 5700.

Unfortunately it does not work on the first test system (VII) when booted into Windows 10 with Adrenalin 19.10.1 WHQL. The code compiles fine, but once queued it exits with a CL_OUT_OF_RESOURCES error. I doubt the compiler is at fault, since my impression is that Linux 19.30 and Adrenalin 19.10.1 produce more or less binary-compatible kernels.

A variant that uses only 32 KB of shared memory and moves the remaining shared-memory operations to global memory works on all systems, but is very slow. Some of my clients run Windows, so I wonder how to get this to work with the Adrenalin runtime, especially since the Vega ISA documentation states that the full 64 KB is available.

Additionally, I wonder whether there is any documentation of the runtime environment variables the AMD drivers understand. Specifically, I am looking for options to switch between WAVE32 / WAVE64 and WGP / CU mode on Navi.

Thanks in advance

lolliedieb · ‎12-17-2019

No one? It is obviously not a limitation of the hardware or of the compiler; it rather seems to be a simple check in the queueing system that claims the resources are exceeded. There should be a simple way to disable that check or raise the bound for these cards. If not, I would consider that a bug in the runtime.

dipak · ‎12-17-2019

Thank you for the above query. I have forwarded your query to the OpenCL team for their feedback. As soon as I get their reply, I'll come back to you.

Thanks.

dipak · ‎12-18-2019

According to the OpenCL team's reply, Vega currently has 64 KB of local/shared memory enabled on Linux, but only 32 KB on Windows. This could be the reason for the CL_OUT_OF_RESOURCES error.

Navi has 64KB local/shared memory enabled on both Windows and Linux, so the code is expected to work fine on Navi.
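
A host program could use this difference to choose a kernel variant at run time instead of failing at enqueue. The following is only a minimal sketch: `kernel_variant` and `pick_variant` are hypothetical names, and it assumes the value the runtime reports for CL_DEVICE_LOCAL_MEM_SIZE (queried with clGetDeviceInfo in a real host program) matches the limit it actually enforces.

```c
#include <stdint.h>

/* Hypothetical labels for the two kernel variants described in the thread:
   one keeps everything in 64 KB of LDS, the slower one uses 32 KB of LDS
   and moves the rest to global memory. */
typedef enum {
    VARIANT_LDS_64K,
    VARIANT_LDS_32K_GLOBAL,
    VARIANT_NONE
} kernel_variant;

/* local_mem_bytes is assumed to be the value reported by the runtime for
   CL_DEVICE_LOCAL_MEM_SIZE on the selected device. */
kernel_variant pick_variant(uint64_t local_mem_bytes)
{
    if (local_mem_bytes >= 64u * 1024u) {
        return VARIANT_LDS_64K;         /* e.g. Vega on Linux, Navi */
    }
    if (local_mem_bytes >= 32u * 1024u) {
        return VARIANT_LDS_32K_GLOBAL;  /* e.g. Vega on Windows */
    }
    return VARIANT_NONE;                /* neither variant fits */
}
```

With this, the fast variant would be built and enqueued only where the runtime exposes the full 64 KB, and the slow variant would serve as the fallback elsewhere.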

Thanks.

lolliedieb · ‎12-18-2019

Thanks for the reply. So I can ship the faster version for Navi, but not for Vega on Windows. Hmm, sad to hear that. Are there any plans to make the full size available on Vega in upcoming runtime releases?

Also, what about the 128 KB of shared memory on Navi in WGP mode? Is there any chance to activate that yet?

Thanks again.

dipak · ‎12-18-2019

"Are there any plans to make the full size on Vega available with upcoming runtime releases?"

Sorry, I cannot provide a time frame at this moment.

Regarding your other query about the shared memory on Navi in WGP mode, I'll check with the OpenCL team and confirm.

Thanks.

dipak · ‎12-20-2019

Regarding LDS usage on Navi, here are some important insights shared by the OpenCL team:

WGP is the default mode for Navi. To switch to CU mode, one needs to pass the "-mcumode" option. As the "User Guide for AMDGPU Backend" in the LLVM 10 documentation says:

-m[no-]cumode:
Control the default wavefront execution mode used when generating code for kernels. When disabled native WGP wavefront execution mode is used, when enabled CU wavefront execution mode is used.

Maximum LDS that can be accessed from a single workgroup is 64KB. Therefore, in order to access all 128KB available, at least 2 workgroups are needed to run on a WGP. If run in CU mode, each workgroup will access only its “nearby” half of the LDS. If run in WGP mode, LDS allocations could possibly span the two halves of the LDS. [Note: The LDS on a WGP is built from two 64KB arrays]
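
The arithmetic behind this can be sketched as follows. It is a rough upper bound that considers LDS only; the function name is hypothetical, and the sketch assumes that allocation granularity, registers and wavefront slots impose no tighter limit, and that in WGP mode an allocation may be placed anywhere in the combined 128 KB.

```c
#include <stdint.h>

#define LDS_PER_CU_BYTES  (64u * 1024u)            /* one of the two 64 KB arrays */
#define LDS_PER_WGP_BYTES (2u * LDS_PER_CU_BYTES)  /* 128 KB per WGP */

/* Upper bound on resident workgroups per WGP, looking at LDS alone.
   Returns 0 for invalid input: a request of 0 bytes, or more than the
   64 KB per-workgroup maximum. */
uint32_t max_workgroups_per_wgp(uint32_t lds_bytes_per_wg, int cu_mode)
{
    if (lds_bytes_per_wg == 0 || lds_bytes_per_wg > LDS_PER_CU_BYTES) {
        return 0;
    }
    if (cu_mode) {
        /* Each workgroup only sees its "nearby" 64 KB half. */
        return 2u * (LDS_PER_CU_BYTES / lds_bytes_per_wg);
    }
    /* WGP mode: allocations may span the two halves (assumption above). */
    return LDS_PER_WGP_BYTES / lds_bytes_per_wg;
}
```

For a kernel using the full 64 KB, both modes give two workgroups per WGP, which is how all 128 KB end up in use. For a 40 KB kernel, the bound is two workgroups in CU mode and three in WGP mode.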

Thanks.

How to access more than 32 KB of shared memory on Vega & Navi using Windows?