Heyho,
I need to allocate and free global memory on the GPU, so that one work-item is able to allocate memory and another to free it,
but I don't have an idea how to implement this. Any suggestions or hints?
Thanks,
Srdja
You cannot do this. Work-items cannot allocate global memory. Please refer to the OpenCL spec (the memory model chapter) for more information.
I know there is no built-in function for this... my homebrew solution would be that every process holds a memory list which it "allocates" from and "frees" to, and if one process runs out of memory there must be some global memory-list split procedure....
--
Srdja
While this would be theoretically possible, holding such a memory page on a per-work-item basis is very register-consuming, and loading it from global memory is very slow. Local memory cannot be used, as it is not coherent across the entire device. Global Data Share could be used for this, as it is a fast, globally visible memory space, but it is not yet exposed to OpenCL. Hopefully it will make it into SDK 2.6.
Since there is no dynamic memory allocation, you must hardcode a maximum length for this memory-list array. A first approach could be a "sparse matrix" kind of list, where one integer holds the thread ID the block belongs to, followed by the size of that block. A single integer would hold the last block's array index, and that is what must be atomically incremented. The problem is that scanning for one thread's own memory requires a scan of the entire array, which is VERY inefficient in __global memory. That is what GDS would speed up dramatically, since it is roughly as fast as __local but globally visible.
Originally posted by: smatovic Heyho,
I need to allocate and free global memory on the GPU, so that one work-item is able to allocate memory and another to free it,
but I don't have an idea how to implement this. Any suggestions or hints?
Thanks,
Srdja
Best suggestion: change your algorithm. Any solution requiring fully dynamic memory will be much slower than one that doesn't.
You could implement a dynamic allocator fairly cheaply using an atomic, like a stack allocator (think alloca()), but (a) you can't pass anything allocated to another work-group anyway (global memory consistency rules prevent this), and (b) there would be no efficient way to free it.
This might still be useful, but these are not general-purpose processors running a multi-tasking operating system, so you have to start thinking differently to use them.