On GPUs with more than 3 GB of on-board memory, it does not seem possible to make the OpenCL runtime expose more than 3 GB of RAM.
Changing the GPU_MAX_HEAP_SIZE environment variable as described here only makes it possible to decrease the amount of exposed RAM, not to increase it.
For example, on an R9 290 with 4 GB of memory, setting GPU_MAX_HEAP_SIZE to 100 results in 3 GB becoming available. Setting the variable to 50 causes only 2 GB to be exposed, so the variable is respected, but there appears to be an absolute maximum of 3 GB.
An R9 290X with 8 GB of RAM likewise allows only up to 3 GB to be used, in both 32-bit and 64-bit applications.
Why is this happening? Are there any plans to expose more memory in future driver versions?
Try setting GPU_FORCE_64BIT_PTR=1.
With multiple 4 GB GPUs, the newer driver seems to allow 4 GB for the first card and 3.2 GB for the rest of the GPUs. With the environment variable set, all cards should expose their full memory.
Try setting GPU_FORCE_64BIT_PTR=1 so the runtime begins generating 64-bit kernels. Then you should be able to get more than 3 GB of RAM.
Thanks, titanius and nou -- I wish I could mark both answers as correct. Changing this variable indeed causes all memory to become exposed. However, it's still not clear what exactly the variable does.
For example, if the kernels are compiled offline (-fno-bin-source -fno-bin-llvmir -fno-bin-amdil -fbin-exe), do you need to set this variable at compile time or at run time?
What happens if the variable was set at compile time and is not set at run time (or the other way around)?
What happens if the host application is a 32-bit one (so sizeof(size_t) on the CPU will be 4)? How much memory will be reported for, say, an 8 GB GPU? 4 GB?
Does anyone have any information on how GPU_FORCE_64BIT_PTR and offline compilation are related, and on what exactly this variable does?
Those are all good questions.
AMD GPUs, much like CPUs, have a virtual-to-physical address translation system and can operate in either 32-bit or 64-bit mode. While a 64-bit address space enables access to a larger memory bank, it may also degrade performance, since a pointer access requires twice as many clocks as in 32-bit mode.
In OCL 1.2:
1.) By default, GPU 32-bit mode is enabled for all process types unless the environment variable above is set.
2.) The process bitness and the GPU bitness are completely independent. Hence, 32-bit processes can run in GPU 64-bit mode and vice versa.
3.) When the runtime exports binaries, it also exports all intermediate representations of the code. If the options or the environment have changed when the binary is loaded, the compiler library will silently recompile from the first convergent point.
In OCL 2.0:
1.) Because of SVM, CPU bitness and GPU bitness are tightly coupled: GPU 64-bit mode is enabled for 64-bit processes and GPU 32-bit mode for 32-bit processes.
The application can discover the mode of operation by calling clGetDeviceInfo with the CL_DEVICE_ADDRESS_BITS flag.
Thanks for jumping in, Tzachi.
What you are saying was my initial assumption; however, the tests I have conducted contradict it. Specifically, if I compile a kernel offline with GPU_FORCE_64BIT_PTR=0, this kernel (and, in fact, our whole application) then works correctly whether the variable is set to 0 or to 1 at run time. The opposite seems to be true as well: if GPU_FORCE_64BIT_PTR is set to 1 during compilation, the application works correctly with the variable set to either 0 or 1 at run time. I don't see any performance difference either.
For reference: I was using Catalyst 14.9 and a Radeon 7970 with 3 GB of RAM in these tests.
Thank you for updating your answer, Tzachi.
When the runtime exports binaries, it also exports all intermediate representations of the code. If the options or the environment have changed when the binary is loaded, the compiler library will silently recompile from the first convergent point.
We are compiling our kernels offline with the following parameters: -fno-bin-source -fno-bin-llvmir -fno-bin-amdil -fbin-exe. As far as I understand, in this case no intermediate representation is saved (neither LLVM IR nor AMD IL), only the binary. The source is also not included. Nevertheless, as I described in the previous post, the application works correctly even when the value of GPU_FORCE_64BIT_PTR at compile time differs from the one at run time. Any suggestions?