
nou
Exemplar

multiple device shared buffer

I read Appendix A of the specification, about shared buffers across multiple devices:

In the command-queue that wants to synchronize to the latest state of a memory object,
commands queued by the application must use the appropriate event objects that represent
commands that modify the state of the shared memory object as event objects to wait on.


Does that mean it is absolutely necessary to pass an event from one queue to another? Because the next sentence:

This is
to ensure that commands that use this shared memory object complete in the previous command-
queue before the memory objects are used by commands executing in this command-queue.


is subtly different. It means I must ensure that the modifying command finishes before any other command on another queue can start. The sentence in the next paragraph even confirms this.

So will it be consistent when I, for example, call clFinish() on all queues? Will changes in the buffers propagate to the other devices after this synchronization?

Or consider this scenario: I have many small buffers, so I spread them across multiple devices and clEnqueueNDRangeKernel() the simulation kernel with each one as a parameter.

Then I clEnqueueMarker() on all queues.

After that, clEnqueueWaitForEvents() with all the events returned from the markers. That way I achieve a multiple-device global barrier.

Then I must share boundary data between these buffers, as they form a 2D grid.

And after that, again a global barrier, a simulation step, and so on.
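A minimal C sketch of such a barrier, assuming all queues belong to one context and that queues/num_queues already exist (hypothetical names, error checking omitted):

#include <CL/cl.h>

/* Hypothetical helper: a "global barrier" across several in-order
 * command queues of the same context. Assumes num_queues <= 16. */
static void multi_queue_barrier(cl_command_queue *queues, cl_uint num_queues)
{
    cl_event markers[16];
    cl_uint i;

    /* A marker's event completes once everything enqueued before it
     * on that queue has finished. */
    for (i = 0; i < num_queues; ++i)
        clEnqueueMarker(queues[i], &markers[i]);

    /* Every queue waits on all markers, so no queue proceeds until
     * every queue has drained its earlier commands. */
    for (i = 0; i < num_queues; ++i)
        clEnqueueWaitForEvents(queues[i], num_queues, markers);

    /* Flush so the devices actually start executing. */
    for (i = 0; i < num_queues; ++i)
        clFlush(queues[i]);

    for (i = 0; i < num_queues; ++i)
        clReleaseEvent(markers[i]);
}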

 

With this I came across another problem. I tested clCreateBuffer(): I can create as many buffers as fit into host memory, and they are transferred to the device as they become necessary. But once they fill up device memory, I can't enqueue a kernel with another buffer; it returns CL_OUT_OF_RESOURCES. So is there a way to free device memory for additional buffers, since they are processed sequentially? I have a huge data set that does not fit into device memory but can be split into small chunks. But I don't want to create and release buffers, or manage writes and reads from the device manually, to swap chunks of data. I hope the OpenCL runtime will manage this automatically.

0 Likes
13 Replies
nou
Exemplar

Any comments?

0 Likes
Meteorhead
Challenger

You are coming across a very interesting topic, but I feel that what you are trying to achieve is not quite possible the way you wish. Or to be precise, I do not see the point of sharing a memory object across devices if the devices do not process it in parallel. That is clearly not possible, and if you have one host thread, you cannot even have the devices process different objects at the same time. How do you reach multiple devices at once while keeping the syncing events in the same thread? If you have multiple threads to reach multiple devices, you cannot share events to synchronize. Or am I missing something?

I fail to see the advantage of sharing buffers across devices if the devices cannot be used at the same time. Then it could just as well be one single device.

0 Likes

My English is not so good, so I will try to write it again. I have N buffers arranged in a 2D grid; each buffer represents one tile of a big 2D space. I run one simulation step, and after each simulation step I must copy the shared boundary data between neighbors. The computation within one simulation step can be done independently of the other buffers.

So, for example, I have four devices and a queue per device. I enqueue 1/4 of the buffers per queue with the simulation kernel, then use that multi-queue barrier to ensure that all computations are done, then enqueue another kernel which copies the boundary data between neighboring buffers. After that I call clFlush() and clFinish() on each queue and repeat. A rough outline is sketched below.

It should be possible to do this from a single thread, as all the function calls are asynchronous. Even a multithreaded approach is possible, since OpenCL calls are thread safe; the only critical section is creating the global barrier.
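Roughly, as a C outline (hypothetical names: enqueue_tile_kernels() stands for the per-queue clEnqueueNDRangeKernel() calls on that queue's quarter of the tiles, and multi_queue_barrier() is the marker/wait sketch from the first post):

int step;
cl_uint q;

for (step = 0; step < num_steps; ++step) {
    /* Each queue computes its quarter of the tile buffers. */
    for (q = 0; q < num_queues; ++q)
        enqueue_tile_kernels(queues[q], simulation_kernel, q);

    multi_queue_barrier(queues, num_queues);  /* all tiles computed */

    /* Copy the shared boundary data between neighboring tiles. */
    for (q = 0; q < num_queues; ++q)
        enqueue_tile_kernels(queues[q], boundary_copy_kernel, q);

    /* Drain everything before the next step. */
    for (q = 0; q < num_queues; ++q)
        clFlush(queues[q]);
    for (q = 0; q < num_queues; ++q)
        clFinish(queues[q]);
}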

0 Likes

I made a picture explaining my barrier: http://img402.imageshack.us/i/multibarrier.png/

Squares are clEnqueueNDRangeKernel() simulation kernels.

Circles are clEnqueueMarker() calls.

Rectangles are clEnqueueWaitForEvents() calls; the arrows point to the marker events they wait on.

0 Likes

As far as I understand, this is simple multi-device usage with syncing across devices. However, as far as I know, if you have a single thread (and even multiple contexts and queues), computation will be done serially, since no driver threads are spawned upon creating multiple command queues.

When an OpenCL app crashes, the stack dump reveals (on Linux) that there is a host thread and one thread presumably for the driver. I do not have time to test whether your program really does computations in parallel on the devices, but I doubt it would.

If one uses multiple threads, OpenCL cannot sync with its own events, since threads most likely don't access each other's variables, and many other things won't work with multiple host threads (such as the memory object syncing in question).

Could someone confirm that multi-GPU usage is possible this way, without forking threads for extra device usage? I lack the time to test it myself.

0 Likes

I still don't understand why sharing events between threads should not be possible.

And even a single thread should work: all OpenCL calls are asynchronous, so you can just call clFlush() to start execution, and after that you can make a blocking call like clFinish()/clWaitForEvents().
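A minimal sketch of that single-thread pattern (assuming queues, num_queues, kernel and global_size exist; error checking omitted):

cl_uint i;

/* Enqueue work on every queue; these calls do not block. */
for (i = 0; i < num_queues; ++i)
    clEnqueueNDRangeKernel(queues[i], kernel, 1, NULL,
                           &global_size, NULL, 0, NULL, NULL);

/* Flush all queues so the devices can run concurrently... */
for (i = 0; i < num_queues; ++i)
    clFlush(queues[i]);

/* ...and only then block until each device is done. */
for (i = 0; i < num_queues; ++i)
    clFinish(queues[i]);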

0 Likes

Originally posted by Meteorhead:
Or to be precise, I do not see the point of sharing a memory object across devices if the devices do not process them in parallel.


Well, I see this as a possible solution to my problem. In the case of OpenGL/OpenCL interop on a multi-GPU system, I don't know how to share the OpenGL context among GPUs in order to run OpenCL on a GPU other than the default one, i.e. the one for which the OpenGL context is created by default (the one hooked to the screen). For benchmarking purposes, it would be really nice to be able to select the card I want to manipulate the OpenGL buffers with through OpenCL. But now, as nou suggested, if I'm able to share CL buffers with a different device, so a different GPU, I'd be able to do what I want. Well, I don't know about the performance degradation, but this is the beginning of an answer.

Now, if you know how to create/share an OpenGL context for different cards in the system, not necessarily the one hooked to a screen, I'm listening!

0 Likes

Well, OpenCL should do it automatically. For example, you can use CL/GL interoperability with a CPU device.

The OpenCL runtime automatically moves buffers between CPU RAM and GPU RAM, so OpenCL should do the automatic copying between GPUs too. But of course it will be slow, as it needs to transfer across the PCIe bus.

0 Likes

Do the devices need to be in the same platform to share CL buffers?

0 Likes

They need to be in the same context. If you mean sharing with OpenGL, then no. But you should query which CL devices can be used in a context shared with the OpenGL context; see the sketch below.
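With cl_khr_gl_sharing the query looks roughly like this (a sketch; props is the same property list later passed to clCreateContextFromType, and on OpenCL 1.1 the entry point has to be fetched at run time):

#include <CL/cl.h>
#include <CL/cl_gl.h>

typedef cl_int (*getGLContextInfoKHR_fn)(const cl_context_properties *,
                                         cl_gl_context_info,
                                         size_t, void *, size_t *);

/* Sketch: list the CL devices that can share with the GL context
 * described by props. Returns the number of devices found. */
static size_t devices_for_gl_context(const cl_context_properties *props,
                                     cl_device_id *devices,
                                     size_t max_devices)
{
    getGLContextInfoKHR_fn getInfo = (getGLContextInfoKHR_fn)
        clGetExtensionFunctionAddress("clGetGLContextInfoKHR");
    size_t size = 0;

    if (getInfo == NULL)
        return 0;  /* extension not available */

    getInfo(props, CL_DEVICES_FOR_GL_CONTEXT_KHR,
            max_devices * sizeof(cl_device_id), devices, &size);
    return size / sizeof(cl_device_id);
}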

0 Likes

OK, so this is not a solution in my case... I have 2 GPUs, but each belongs to a different platform (one Radeon and one GeForce). Since we can't create a context with devices from different platforms, I won't be able to share CL buffers...

 

By the way, when creating the OpenCL context using:

cl_context_properties cpsGL[] = {
    CL_CONTEXT_PLATFORM, (cl_context_properties)platform,
    CL_WGL_HDC_KHR,      (cl_context_properties)glCurrentDC,
    CL_GL_CONTEXT_KHR,   (cl_context_properties)glCtx,
    0
};

context = clCreateContextFromType(cpsGL, CL_DEVICE_TYPE_GPU, NULL, NULL, &status);

I'm able to use the GPU from the AMD platform (or even the CPU device, as you said previously), but when using the NVIDIA platform I receive a NULL context and an error code of -1000, which isn't one of the standard OpenCL error codes! Unless I can create an OpenGL context/device for the NVIDIA card (which is not hooked to a screen), I don't know what to do!

0 Likes
himanshu_gautam
Grandmaster

With this I came across another problem. I tested clCreateBuffer(): I can create as many buffers as fit into host memory, and they are transferred to the device as they become necessary. But once they fill up device memory, I can't enqueue a kernel with another buffer; it returns CL_OUT_OF_RESOURCES. So is there a way to free device memory for additional buffers, since they are processed sequentially? I have a huge data set that does not fit into device memory but can be split into small chunks. But I don't want to create and release buffers, or manage writes and reads from the device manually, to swap chunks of data. I hope the OpenCL runtime will manage this automatically.

I think releasing buffers manually is the preferred method, as it gives the programmer full flexibility. Consider a situation with three kernels: kernel1, kernel2, kernel3. kernel1 modifies a buffer which is required later by kernel3, so it would be inefficient if kernel2 swapped out that buffer just because it does not require it. How can the runtime know in advance which buffers will be required in the future? A sketch of the manual approach follows.
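For illustration, a minimal sketch of that manual chunk processing (hypothetical names: ctx, queue, kernel, host_data as an unsigned char array, plus chunk_bytes, num_chunks and global_size; error checking omitted):

/* Process a huge host-side data set chunk by chunk, creating and
 * releasing the device buffer each time so device memory is freed
 * explicitly instead of running into CL_OUT_OF_RESOURCES. */
size_t c;
for (c = 0; c < num_chunks; ++c) {
    cl_int err;
    cl_mem buf = clCreateBuffer(ctx,
                                CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                                chunk_bytes, host_data + c * chunk_bytes,
                                &err);

    clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, NULL,
                           0, NULL, NULL);

    /* The blocking read drains the queue; then release the buffer so
     * the runtime can reuse the device memory for the next chunk. */
    clEnqueueReadBuffer(queue, buf, CL_TRUE, 0, chunk_bytes,
                        host_data + c * chunk_bytes, 0, NULL, NULL);
    clReleaseMemObject(buf);
}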

0 Likes

It can know in advance: after clFlush() is called on a queue, the runtime can go through the enqueued kernels and check which buffers are needed for execution.

What I am saying is that there should be some mechanism to free device memory. Releasing the buffer is not really an option, as I want to reuse it later. So when there is no more free space in device memory, the OpenCL runtime should swap out some buffers.

IMHO there should be automatic buffer management within the OpenCL runtime. If the programmer wants more control, they can use cl_ext_migrate_memobject, which can migrate a buffer to a particular device or back to the host. While a buffer is migrated to a device it could be locked there; when it is migrated back to the host it would become automatically managed again.
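For reference, this is roughly what the later OpenCL 1.2 core version of that mechanism looks like (a sketch with hypothetical queue and buf; the EXT entry point is analogous):

/* Explicitly migrate a buffer to the device that owns `queue`
 * (flags = 0), or push it back to host memory to free device
 * memory (CL_MIGRATE_MEM_OBJECT_HOST). OpenCL 1.2 core API. */
cl_mem bufs[1];
bufs[0] = buf;

clEnqueueMigrateMemObjects(queue, 1, bufs, 0, 0, NULL, NULL);

/* ... later, when device memory is needed for other buffers: */
clEnqueueMigrateMemObjects(queue, 1, bufs,
                           CL_MIGRATE_MEM_OBJECT_HOST, 0, NULL, NULL);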

0 Likes