Apologies once again for taking so long to reply. I have now implemented your suggestion and, on the 5700 XT, I can remove all the flushes and am getting competitive performance, which is fantastic. However, I have encountered one issue. For simplicity, I wanted to allocate a single large buffer of CL_DEVICE_MAX_MEM_ALLOC_SIZE bytes but, when I do this, performance drops hugely until I reduce the buffer size a bit (around 4GB is fine; I haven't managed to find the exact threshold). Is this a known issue, or am I misunderstanding CL_DEVICE_MAX_MEM_ALLOC_SIZE? For reference, clinfo shows the following related values:
Global memory size (CL_DEVICE_GLOBAL_MEM_SIZE) 8573157376 (7.984GiB)
Max memory allocation (CL_DEVICE_MAX_MEM_ALLOC_SIZE) 7059013632 (6.574GiB)
Max size for global variable (CL_DEVICE_MAX_GLOBAL_VARIABLE_SIZE) 6353112064 (5.917GiB)
Thanks again for all your help
I can't say much without investigation. In general, Windows doesn't support single allocations larger than 4GB, and the runtime requires extra logic to handle that case, but the split is enabled even for much smaller allocations. A performance drop usually occurs when the runtime can't fit the allocation in device memory and falls back to system memory. Check a memory monitor to see whether something else is consuming GPU memory on your system.
Thanks for your rapid response. The allocation coming from system memory would totally explain this, but this is on Linux and CL_DEVICE_GLOBAL_FREE_MEMORY_AMD reports that there is 7.922 GiB free. I can try to make a minimal reproducible example if that helps. Also, on Windows, would the 32-bit limit be reflected in the CL_DEVICE_MAX_MEM_ALLOC_SIZE device info?
It is indeed the first allocation of the app, but the size at which everything slows down (found after some binary searching) is actually 4645191681 bytes, which doesn't seem to have any significance in binary or any relation to any of the device info values.
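For anyone wanting to repeat the binary search described above, here is a minimal sketch in C. It assumes the slowdown is monotonic in buffer size (fast below some size, slow at and above it). The names `find_slow_threshold`, `is_slow_fn`, `example_is_slow` and `print_size` are hypothetical and not part of OpenCL. In a real run the predicate would allocate a buffer of the given size and time some work on it; the stand-in below simply models the reported number.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical predicate: nonzero if a buffer of `bytes` runs slowly.
   A real version would allocate the buffer and time a kernel or fill. */
typedef int (*is_slow_fn)(uint64_t bytes);

/* Smallest size in (lo, hi] for which is_slow() is true.
   Assumes is_slow(lo) is false, is_slow(hi) is true, and that the
   behaviour is monotonic in between. */
static uint64_t find_slow_threshold(uint64_t lo, uint64_t hi, is_slow_fn is_slow)
{
    assert(lo < hi);
    while (hi - lo > 1) {
        uint64_t mid = lo + (hi - lo) / 2;  /* avoids overflow of lo + hi */
        if (is_slow(mid))
            hi = mid;   /* threshold is at or below mid */
        else
            lo = mid;   /* threshold is above mid */
    }
    return hi;
}

/* Stand-in predicate for illustration only: models the size reported above. */
static int example_is_slow(uint64_t bytes)
{
    return bytes >= UINT64_C(4645191681);
}

/* Print a size in decimal, in hex, and as whole MiB plus leftover bytes. */
static void print_size(uint64_t bytes)
{
    printf("%llu = 0x%llX = %llu MiB + %llu bytes\n",
           (unsigned long long)bytes,
           (unsigned long long)bytes,
           (unsigned long long)(bytes >> 20),
           (unsigned long long)(bytes & ((UINT64_C(1) << 20) - 1)));
}
```

Calling `print_size(find_slow_threshold(UINT64_C(4) << 30, UINT64_C(7059013632), example_is_slow));` prints `4645191681 = 0x114E00001 = 4430 MiB + 1 bytes`. Written that way, the reported size is exactly one byte past 4430 MiB, which may or may not be meaningful to the runtime.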
After the app allocates the memory, just run clEnqueueFillBuffer() (use a clear pattern size of 4 or 8 bytes) and measure performance. Do you see the drop with > 4645191681 bytes?
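To compare fill timings across buffer sizes, it can help to normalise them to throughput first. The sketch below covers only that arithmetic and makes no OpenCL calls. It assumes the elapsed time in nanoseconds has already been measured, for example from the fill's event via clGetEventProfilingInfo with CL_PROFILING_COMMAND_START and CL_PROFILING_COMMAND_END on a queue created with CL_QUEUE_PROFILING_ENABLE. The function names and the drop fraction are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Throughput in GiB/s for `bytes` filled in `elapsed_ns` nanoseconds. */
static double gib_per_second(uint64_t bytes, uint64_t elapsed_ns)
{
    assert(elapsed_ns > 0);
    double seconds = (double)elapsed_ns / 1e9;
    double gib = (double)bytes / (1024.0 * 1024.0 * 1024.0);
    return gib / seconds;
}

/* Nonzero if `candidate` throughput is below `fraction` of `baseline`.
   `fraction` is an arbitrary cut-off chosen by the caller, e.g. 0.5. */
static int is_throughput_drop(double baseline, double candidate, double fraction)
{
    return candidate < baseline * fraction;
}
```

A possible use is to time one fill on a buffer just below the suspected threshold and one just above it, then pass both results to `is_throughput_drop` with a cut-off such as 0.5.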