OpenCL

lolliedieb
Adept II

How to access more than 32 KB of shared memory on Vega & Navi using Windows?

Hi all.
The title already describes it.
I have a kernel that uses 64 KB of LDS on a Radeon VII and an RX 5700. The work-group size is 1024.

It works fine on Ubuntu 16.04 and 18.04 with amdgpu-pro 18.50, 19.30 and ROCm 2.10 (all on the Radeon VII), and on another system with 19.30 on the RX 5700.

Unfortunately, it does not work on the first test system (Radeon VII) when booted into Windows 10 with Adrenalin 19.10.1 WHQL. The code compiles fine, but once the kernel is enqueued it fails with a CL_OUT_OF_RESOURCES error. I doubt the compiler is at fault, since my impression is that kernel binaries from Linux 19.30 and Adrenalin 19.10.1 are more or less compatible.

A variant that uses only 32 KB of shared memory and moves the remaining shared-memory operations to global memory works on all systems, but is very slow. Some of my clients run Windows, so I wonder how to get this to work with the Adrenalin runtime, especially since the Vega ISA documentation states that the full 64 KB is available.

Additionally, I wonder whether there is any documentation of the runtime environment variables that the AMD drivers understand. Specifically, I am looking for options to switch between WAVE32 / WAVE64 and WGP / CU mode on Navi.

Thanks in advance

lolliedieb
Adept II

No one? This is obviously not a limitation of the hardware or of the compiler; it rather seems to be a simple check in the queueing system that claims resources are exceeded. There should be a simple way to disable that check or raise the bound for the cards mentioned. If not, I would consider it a bug in the runtime.

dipak
Staff

Thank you for the query. I have forwarded it to the OpenCL team for their feedback. As soon as I get their reply, I'll get back to you.

Thanks.

dipak
Staff

According to the OpenCL team, Vega currently has 64 KB of local/shared memory enabled on Linux, but only 32 KB on Windows. This could be the reason for the CL_OUT_OF_RESOURCES error.

Navi has 64KB local/shared memory enabled on both Windows and Linux, so the code is expected to work fine on Navi.
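Given these limits, one option is to choose the kernel variant at run time from the local memory size the device reports. The sketch below is a minimal illustration, not AMD's recommended method: `pick_lds_variant` and the `lds_variant` enum are hypothetical names, and it assumes the runtime reports the enabled size (32 KB or 64 KB) through `clGetDeviceInfo` with `CL_DEVICE_LOCAL_MEM_SIZE`.

```c
/* Hypothetical variant identifiers; not part of any AMD or OpenCL API. */
enum lds_variant { LDS_VARIANT_32K = 0, LDS_VARIANT_64K = 1 };

/*
 * local_mem_bytes: the cl_ulong value obtained on the host from
 *   clGetDeviceInfo(dev, CL_DEVICE_LOCAL_MEM_SIZE, ...).
 * required_bytes:  LDS used by the fast kernel (64 KB in this thread).
 *
 * Returns the fast 64 KB variant only when the device reports enough
 * local memory; otherwise falls back to the slower 32 KB variant.
 */
enum lds_variant pick_lds_variant(unsigned long long local_mem_bytes,
                                  unsigned long long required_bytes)
{
    return (local_mem_bytes >= required_bytes) ? LDS_VARIANT_64K
                                               : LDS_VARIANT_32K;
}
```

With the figures above, a Vega device on Windows (32 KB enabled) would get the slow fallback, while Navi and Vega on Linux (64 KB enabled) would get the fast variant.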

Thanks.

lolliedieb
Adept II

Thanks for the reply. So I can ship the faster version for Navi, but not for Vega on Windows. Hmm, sad to hear that. Are there any plans to make the full size available on Vega in upcoming runtime releases?

I also wonder about the 128 KB of shared memory on Navi in WGP mode. Is there any way to enable that yet?

Thanks again.

dipak
Staff

Are there any plans to make the full size on Vega available with upcoming runtime releases?

Sorry, I cannot provide a time frame at this moment.

Regarding your other query about the shared memory on Navi in WGP mode, I'll check with the OpenCL team and confirm.

Thanks.

dipak
Staff

Regarding LDS usage on Navi, here are some important insights shared by the OpenCL team:

-m[no-]cumode:

Controls the default wavefront execution mode used when generating code for kernels. When disabled, native WGP wavefront execution mode is used; when enabled, CU wavefront execution mode is used.

  • The maximum LDS that can be accessed from a single workgroup is 64 KB. Therefore, to use all 128 KB available, at least 2 workgroups need to run on a WGP. In CU mode, each workgroup accesses only its “nearby” half of the LDS. In WGP mode, LDS allocations could possibly span the two halves of the LDS. [Note: The LDS on a WGP is built from two 64 KB arrays.]
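The arithmetic behind this can be sketched as follows. This is a rough upper bound based only on the figures quoted above (two 64 KB arrays per WGP, 64 KB maximum per workgroup). The function name `groups_per_wgp_by_lds` is hypothetical, and the estimate ignores allocation granularity and every other occupancy limit (registers, wavefront slots). For WGP mode it optimistically assumes allocations can always span both halves.

```c
/* Figures quoted in this thread for a Navi WGP. */
#define LDS_HALF_BYTES  (64u * 1024u)           /* one of the two LDS arrays */
#define WGP_LDS_BYTES   (2u * LDS_HALF_BYTES)   /* 128 KB per WGP in total   */
#define GROUP_LDS_CAP   (64u * 1024u)           /* per-workgroup maximum     */

/*
 * Upper bound on workgroups resident on one WGP, limited by LDS alone.
 * cu_mode != 0: each workgroup stays within its "nearby" 64 KB half,
 *               so each half is packed separately.
 * cu_mode == 0: WGP mode; assumes allocations may span both halves.
 * Returns 0 when the request is zero or exceeds the per-workgroup cap,
 * i.e. when this estimate does not apply.
 */
unsigned groups_per_wgp_by_lds(unsigned lds_per_group_bytes, int cu_mode)
{
    if (lds_per_group_bytes == 0u || lds_per_group_bytes > GROUP_LDS_CAP)
        return 0u;
    if (cu_mode)
        return 2u * (LDS_HALF_BYTES / lds_per_group_bytes);
    return WGP_LDS_BYTES / lds_per_group_bytes;
}
```

For the 64 KB kernel discussed in this thread, the bound is 2 workgroups per WGP in either mode, which matches the statement that at least 2 workgroups are needed to touch all 128 KB.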

Thanks.
