
sei
Journeyman III

Massive memory allocation

Hi.

I have an OpenCL program which abstractly looks like this:

int batch_size = X << num;

cl::Buffer global_tmp(context, CL_MEM_READ_WRITE, size_function(batch_size));

kernel1.setArg(0, global_tmp);
kernel1.setArg(1, cl::Local(local_mem_size)); // dynamic local memory size
kernel2.setArg(0, global_tmp);
kernel2.setArg(1, cl::Local(local_mem_size));

// total_size: number of work-items over the whole dataset
for (int done = 0; done < total_size; done += batch_size)
{
    queue.enqueueNDRangeKernel(kernel1, cl::NDRange(done), cl::NDRange(batch_size), cl::NullRange);
    queue.enqueueNDRangeKernel(kernel2, cl::NDRange(done), cl::NDRange(batch_size), cl::NullRange);
    //queue.finish();
}

The kernels process a large dataset and have to work in small batches; otherwise the global_tmp buffer becomes too large.

When this loop runs, my allocated memory suddenly skyrockets, followed by massive disk access (which I assume is paging).

I'm guessing this happens because the OpenCL driver stores the configuration for each kernel invocation in order to be asynchronous, but I have 16 GB of RAM and it gets filled.

The only explanation I can think of is that global_tmp and/or the local memory is also stored in RAM.

Is this true? If so, I don't see why anyone would ever want that kind of behaviour.

If the local memory is saved, why exactly? Once a kernel starts executing, it runs to completion without interruption, right?

If the global memory is saved, the driver is assuming prior knowledge of how the application uses that memory. Here the buffer is shared between kernel1 and kernel2, so kernel2 must see the output of kernel1, not the state global_tmp was in when kernel2 was enqueued.

I know that kernel2 does see the correct output of kernel1, because the algorithm still works properly, so is global_tmp only saved for each kernel1 invocation?

Exactly what happens here?

Note: there is no problem when queue.finish() is used; I am just wondering about the finer details of how the driver works.
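For reference, the workaround looks roughly like this (a sketch; finishing every 16 batches is an arbitrary choice, just enough to bound how far the queue can run ahead):

int batches = 0;
for (int done = 0; done < total_size; done += batch_size)
{
    queue.enqueueNDRangeKernel(kernel1, cl::NDRange(done), cl::NDRange(batch_size), cl::NullRange);
    queue.enqueueNDRangeKernel(kernel2, cl::NDRange(done), cl::NDRange(batch_size), cl::NullRange);
    if (++batches % 16 == 0)
        queue.finish(); // block until everything enqueued so far has completed
}
queue.finish(); // drain the tail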

I am running this on a 7970 with the latest driver.

cheers

himanshu_gautam
Grandmaster

Table 4.2 of the AMD APP Programming Guide tells you where the runtime allocates buffers.

Usually, if you don't pass any flags at clCreateBuffer() time, the buffer gets allocated on the discrete GPU.

(I am assuming a dGPU system and not an APU-based system.)
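For example (a sketch, not your code; where each allocation actually lands is driver-dependent, so treat the comments as typical behaviour rather than a guarantee):

cl_int err;
// No placement flags: typically ends up in device (dGPU) memory
cl_mem dev_buf  = clCreateBuffer(context, CL_MEM_READ_WRITE, size, NULL, &err);
// CL_MEM_ALLOC_HOST_PTR: the runtime allocates the backing store in host memory
cl_mem host_buf = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR, size, NULL, &err);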

I don't understand what you mean by

cl::Buffer global_tmp(size_function(batch_size));

Can you explain what happens? Are you allocating an array of buffers, or just a single buffer?

In any case, what is the size of the buffer? And, if you are allocating an array of buffers, what is the array size?

sei
Journeyman III

global_tmp is a single buffer equivalent to

some_struct global_tmp[batch_size];

on the GPU. The only flag used is CL_MEM_READ_WRITE.

size_function is just sizeof(some_struct) * batch_size.
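In code, the setup is roughly this (some_struct's real fields are omitted; the placeholder payload is only sized to match the 25088-byte figure mentioned below):

struct some_struct { cl_float payload[6272]; }; // placeholder fields; 6272 * 4 = 25088 bytes

size_t size_function(int batch_size)
{
    return sizeof(some_struct) * batch_size;
}

cl::Buffer global_tmp(context, CL_MEM_READ_WRITE, size_function(batch_size));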

I found a bug in my own code that caused batch_size to be 1, which explains a lot, but now I'm not able to recreate the problem.

For reference: with batch_size = 1, global_tmp is 25088 bytes, and the loop executes around 56000 kernels.

Currently this allocates only around 5 GB of host memory, which is manageable.

Note: the problem is technically solved when batch_size is correct (kernels executed = 4), but I'm still curious about that 5 GB.
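A rough back-of-the-envelope, purely speculative, assuming the runtime kept a private copy of global_tmp for every enqueue:

size_t per_enqueue = 25088; // global_tmp size at batch_size = 1
size_t enqueues    = 56000; // kernel launches in the loop
// 25088 * 56000 ≈ 1.4 GB -- the right order of magnitude, but still well
// short of the ~5 GB observed, so buffer copies alone wouldn't explain it.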


5 GB of host memory allocated for a 25 KB buffer? Hmm... that is absurd. Can you provide a small repro case?

Before that, make sure there is no memory leak in other parts of your code, especially in the looping part. Do a quick cross-check and then post a test case here for reproduction. Thanks.
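A classic pattern to check for (a hypothetical illustration, not your code) is creating an event per enqueue and never releasing it, which grows host memory on every iteration:

for (int done = 0; done < total; done += batch_size)
{
    size_t offset = done, gsize = batch_size;
    cl_event evt;
    clEnqueueNDRangeKernel(queue, kernel1, 1, &offset, &gsize, NULL, 0, NULL, &evt);
    // missing clReleaseEvent(evt); -- each retained event leaks host memory
}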


Hi Sei,

I am still not sure I understand the issue completely; perhaps you can share some code. Anyway, it looks like you found a bug in your code and the issue no longer exists. Please confirm.
