I have a 295X2 and, for each GPU, I use multiple processes (let's assume >=2) to allocate and work on large memory buffers.
The sum of the buffers is less than either of the following two limits:
1. CL_DEVICE_MAX_MEM_ALLOC_SIZE ~2.4GB
2. CL_DEVICE_GLOBAL_MEM_SIZE ~3.2GB
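For anyone wanting to compare numbers, the two limits above can be read back per device. This is only a minimal sketch, assuming the Python `pyopencl` bindings are installed (an assumption; my actual program does not necessarily use Python). The helper names `to_gib` and `query_limits` are hypothetical.

```python
def to_gib(num_bytes):
    """Convert a byte count to GiB for easier reading."""
    return num_bytes / float(1024 ** 3)


def query_limits():
    """Return (name, max_alloc_bytes, global_mem_bytes) for every GPU device."""
    import pyopencl as cl  # imported lazily; assumes pyopencl is installed

    limits = []
    for platform in cl.get_platforms():
        for device in platform.get_devices(device_type=cl.device_type.GPU):
            limits.append((
                device.name,
                device.max_mem_alloc_size,  # CL_DEVICE_MAX_MEM_ALLOC_SIZE
                device.global_mem_size,     # CL_DEVICE_GLOBAL_MEM_SIZE
            ))
    return limits
```

Calling `query_limits()` and passing the two byte counts through `to_gib` should give figures comparable to the ~2.4GB and ~3.2GB quoted above, although the exact values depend on the driver.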
If I allocate buffers that total up to 2.4GB (2 processes, each with 1.2GB spread over many buffers), the program finishes 10x faster than when the total buffer size is ~3.4GB (2 processes, each with 1.7GB spread over many buffers).
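To make the arithmetic explicit, here is a small sketch that compares the two totals against the reported global memory size. It uses only the approximate figures quoted in this post, the function name `total_and_fit` is hypothetical, and it does not by itself establish the cause of the slowdown.

```python
# Approximate limits quoted in the post, in GB (not exact driver values).
MAX_MEM_ALLOC_GB = 2.4  # CL_DEVICE_MAX_MEM_ALLOC_SIZE
GLOBAL_MEM_GB = 3.2     # CL_DEVICE_GLOBAL_MEM_SIZE


def total_and_fit(processes, per_process_gb, global_mem_gb=GLOBAL_MEM_GB):
    """Return the combined buffer size and whether it fits in global memory."""
    total = processes * per_process_gb
    return total, total <= global_mem_gb


# Fast case: 2 processes with 1.2GB each.
fast_total, fast_fits = total_and_fit(2, 1.2)
# Slow case: 2 processes with 1.7GB each.
slow_total, slow_fits = total_and_fit(2, 1.7)

print("fast: %.1f GB total, fits in global memory: %s" % (fast_total, fast_fits))
print("slow: %.1f GB total, fits in global memory: %s" % (slow_total, slow_fits))
```

With these figures the fast case stays within the reported ~3.2GB, while the slow case adds up to more than that across the two processes.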
My Questions:
1. Is there a reason for this slowdown?
2. Do I have to make do with using at most 2.4GB per GPU, even when using multiple processes?
Strangely enough, I didn't run into any such limit on a 7870 with 2GB of memory, where I was able to allocate buffers whose sum approached CL_DEVICE_GLOBAL_MEM_SIZE.
I am using Debian with Catalyst 14.9 and SDK 2.9.1. I tried searching for answers but couldn't find any.
Thanks for reading.