2nd question today, but more or less unrelated to the first one.
I have an application that needs to use almost all of the memory the GPU offers (up to 5 GByte). The driver reports (via CL_DEVICE_GLOBAL_MEM_SIZE and CL_DEVICE_GLOBAL_FREE_MEMORY_AMD) that enough memory is available, and the application also stays below CL_DEVICE_MAX_MEM_ALLOC_SIZE for each single allocation. The total memory is split into two or three buffers, not necessarily of the same size.
Running the card under Linux this works fine, while under Windows 10 I see the driver start to swap memory out to the host system once I touch the last ~256 MByte - and that slows down execution extremely.
In both cases the GPU does no graphics output - that is handled by another device.
Question: is there any undocumented environment variable or other switch to turn off GPU memory virtualization completely, so that the buffers are forced to stay on the device?
Thanks very much
Thank you for your query. It would be helpful if you can provide a reproducible test-case or a code snippet to show the buffer allocation and other details. Also, please attach the clinfo output.
Well, the buffer allocation is just two standard calls:
buffer0 = cl::Buffer(*ctx, CL_MEM_READ_WRITE, 128UL*25165824, NULL, &err);
buffer1 = cl::Buffer(*ctx, CL_MEM_READ_WRITE, 128UL*7790592, NULL, &err);
so 3072 MByte in buffer 0 and 951 MByte in buffer 1, 4023 MByte in total.
The card is a standard 4 GByte RX 580, nothing special about it afaik.
Relevant clinfo excerpts:
Max clock frequency: 1366 MHz
Address bits: 64
Max memory allocation: 3414669721
Global memory size: 4280254464 (3.986GiB)
Global free memory (AMD): 4159488 KiB (3.966GiB)
Allocating the two buffers and running the computation at full speed works well on Linux, while in Windows 10 I see some virtual memory popping up (only 0.1 - 0.2 GByte) and the kernel slows down significantly. All other buffers in the application use CL_MEM_ALLOC_HOST_PTR, and no other applications are running at the same time.
Below is the OpenCL team's feedback:
"The application can't allocate all available memory; there is a display surface and WDDM activity. Also, AMD GPUs have a visible heap (i.e. host-visible device memory), so in reality the app has to exclude 256MB from the reported memory, or the last allocations must be in tiny sizes (< 64MB).
It depends on what exactly the user was running under Linux (PAL or ROCm) and whether clinfo reported exactly the same amount as on Windows, but the visible-heap issue exists under Linux as well."