Table 4.2 of the AMD APP Programming Guide tells you where the runtime allocates buffers.
Usually, if you don't pass any flags to clCreateBuffer(), the buffer gets allocated on the discrete GPU.
(I am assuming a dGPU system and not an APU-based system.)
I don't understand what you mean by
Can you explain what happens? Are you allocating an array of buffers, or just a single buffer?
In any case, what is the size of the buffer, and what is the array size (if you are allocating an array of buffers)?
global_tmp is a single buffer equivalent to
on the GPU. The only flag used is CL_MEM_READ_WRITE.
size_function is just sizeof(some_struct)*batch_size
I found a bug in my own code that caused batch_size to be 1. That explains a lot, but now I'm not able to recreate the problem.
For reference: with batch_size = 1, global_tmp is 25088 bytes, and the loop executes around 56000 kernels.
Currently this allocates only around 5 GB of host memory, which is manageable.
Note: the problem is technically solved when batch_size is correct (kernels executed = 4), but I'm still curious about that 5 GB.
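As a rough sanity check on those numbers, here is a minimal sketch in C of the size calculation and of the average host memory per kernel launch. The layout of some_struct is a hypothetical stand-in (a plain 25088-byte blob, chosen only to match the reported size at batch_size = 1), and host_bytes_per_launch is a helper made up for this illustration. It only does the arithmetic and says nothing about where the memory actually goes.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for some_struct: a 25088-byte blob, matching the
   size reported for global_tmp when batch_size is 1. */
typedef struct {
    unsigned char payload[25088];
} some_struct;

/* Mirrors size_function as described: sizeof(some_struct) * batch_size. */
static size_t size_function(size_t batch_size)
{
    return sizeof(some_struct) * batch_size;
}

/* Average host bytes per kernel launch (integer division, 0 if no launches).
   Illustrative helper, not part of any OpenCL API. */
static uint64_t host_bytes_per_launch(uint64_t host_bytes, uint64_t launches)
{
    return launches ? host_bytes / launches : 0;
}
```

Calling host_bytes_per_launch with 5 GB and 56000 launches gives roughly 89 to 96 KB per launch, depending on whether 5 GB is read as decimal or binary. That is a few times the size of global_tmp itself (size_function(1) is 25088 bytes).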
5 GB of host memory allocated for a 25 KB buffer? Hmm... that is absurd. Can you provide a small repro case?
Before that, make sure there is no memory leak elsewhere in your code, especially in the looping part. Do a quick cross-check and then post a test case here for reproduction. Thanks.
I am still not sure I understand the issue completely. Perhaps you can share some code. Anyway, it looks like you found a bug in your code and the issue no longer exists. Please confirm.