12 Replies Latest reply on Jun 8, 2010 5:57 AM by Raistmer

    OpenCL memory allocation and multi-GPU host

    Raistmer
Will buffers be allocated on all GPUs?

      Another case of failure:
      Host with 2 GPUs of different type.
      HD5770 and some HD4xxx GPU (RV770).
      Again, when a single instance of the app is launched there is no error.
      But when 2 instances are launched, the second one (which uses device 1 instead of device 0) fails with ERROR: clCreateBuffer(cpu_pinned_buf): -4.
      That is, not enough memory (CL_MEM_OBJECT_ALLOCATION_FAILURE).

      Does this mean that any memory buffer the app allocates is created on all GPUs, even though the app uses only one (the command queue was created for just 1 GPU, either GPU0 or GPU1)?
      What other reason could lead to insufficient GPU memory in this case?

      Or does device selection not work in SDK 2.1, so the first device is always used??
        • OpenCL memory allocation and multi-GPU host
          Illusio

          My understanding is that buffers will be allocated on all devices connected to the same context. The reason for this is likely to simplify runtime management, as otherwise any function involving buffers could fail with out-of-memory errors during execution (say, if you suddenly used the buffer on GPU1 in your case).

          So although I haven't tried it myself, I'd guess creating separate contexts for the two GPUs is the way to go for you.
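          A minimal sketch of that separate-contexts approach (assuming an OpenCL 1.x runtime as in this SDK; most error handling trimmed for brevity):

          ```c
          /* Sketch: one context per GPU, so a buffer created in a context is
           * allocated only on the single device that context contains.
           * Assumes an OpenCL 1.x runtime; error handling is trimmed. */
          #include <stdio.h>
          #include <CL/cl.h>

          int main(void)
          {
              cl_platform_id platform;
              cl_uint num_devices = 0;
              cl_int err;

              if (clGetPlatformIDs(1, &platform, NULL) != CL_SUCCESS)
                  return 1;

              /* First call: ask only for the device count (devices == NULL). */
              err = clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 0, NULL, &num_devices);
              if (err != CL_SUCCESS || num_devices == 0)
                  return 1;

              cl_device_id devices[8];
              if (num_devices > 8) num_devices = 8;
              clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, num_devices, devices, NULL);

              /* One context per device: a buffer created in the i-th context
               * exists only on devices[i], not on every GPU in the system. */
              for (cl_uint i = 0; i < num_devices; ++i) {
                  cl_context ctx = clCreateContext(NULL, 1, &devices[i],
                                                   NULL, NULL, &err);
                  cl_command_queue q = clCreateCommandQueue(ctx, devices[i], 0, &err);
                  cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE,
                                              1 << 20, NULL, &err);
                  printf("device %u: buffer %s\n", i,
                         err == CL_SUCCESS ? "created" : "failed");
                  clReleaseMemObject(buf);
                  clReleaseCommandQueue(q);
                  clReleaseContext(ctx);
              }
              return 0;
          }
          ```

          With this layout, a second app instance bound to device 1 only consumes memory on its own GPU, so the -4 allocation failure caused by mirroring buffers onto the other device should no longer occur.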

           

          • OpenCL memory allocation and multi-GPU host
            Raistmer
            But how do I limit the context to a single GPU then?
            I use only one device when creating the command queue, and all buffer allocations go through that same command queue, so the runtime should know that only that particular device is used...
              • OpenCL memory allocation and multi-GPU host
                nou

                I recommend reading this thread. As I understand it, RAM on the GPU should behave just like a big cache.

                http://www.khronos.org/message_boards/viewtopic.php?f=28&t=2706

                But the current implementation creates each buffer on all devices in the context.

                  • OpenCL memory allocation and multi-GPU host
                    Raistmer
                    Originally posted by: nou

                    I recommend reading this thread. As I understand it, RAM on the GPU should behave just like a big cache.

                    http://www.khronos.org/message...c.php?f=28&t=2706

                    But the current implementation creates each buffer on all devices in the context.

                    Thanks, interesting reading (I've only gotten through the first page so far, though).

                    And I still don't understand why such a design was used: creating the buffer on all devices but using it only with a specific queue bound to a particular device. If a buffer is needed on many devices, create it for many queues and use it on many queues; what could be simpler?...
                      • OpenCL memory allocation and multi-GPU host
                        Illusio

                         

                        Originally posted by: Raistmer Thanks, interesting reading (I've only gotten through the first page so far, though). And I still don't understand why such a design was used: creating the buffer on all devices but using it only with a specific queue bound to a particular device. If a buffer is needed on many devices, create it for many queues and use it on many queues; what could be simpler?...


                        Just to clear up a small issue here... buffers are not "created on queues"; they are created in a context: cl_mem clCreateBuffer(cl_context context, cl_mem_flags flags, size_t size, void *host_ptr, cl_int *errcode_ret).

                        A context is a grouping construct that's there to let you share data/code/workloads between devices in an easy and/or automatic manner. If I have a context containing a CPU and a GPU device, I can do a single clBuildProgram call to compile code for all devices. Buffers allocated in this context will also be trivially passable to kernels on both devices, because the buffer is represented on all devices (not necessarily as a complete copy taking up RAM; it could just be mapped). Propagating data written to the buffer from the host to multiple devices is also automatic.

                        If you don't want cooperation between devices, then you don't want a context containing more than a single device. In effect, creating a context with several devices amounts to informing the OpenCL runtime that you plan to use the buffers allocated in the context, and the code compiled into it, on several devices.

                        Command queues, on the other hand, provide you with just that: a mechanism to issue commands to a device. You can even have several of them for the same device to aid in thread synchronization, and they have nothing to do with memory management.
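                        That separation can be sketched like this (hypothetical helper; assumes a context ctx already created with both devices):

                        ```c
                        /* Sketch: buffers belong to the context, queues only issue commands.
                         * Hypothetical helper; assumes 'ctx' was created with both devices. */
                        #include <CL/cl.h>

                        void demo(cl_context ctx, cl_device_id dev0, cl_device_id dev1)
                        {
                            cl_int err;

                            /* One buffer, created once, in the context (not "on a queue"). */
                            cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, 4096, NULL, &err);

                            /* One queue per device; the same cl_mem can be used from both,
                             * which is exactly why the runtime backs it on every device
                             * in the context. */
                            cl_command_queue q0 = clCreateCommandQueue(ctx, dev0, 0, &err);
                            cl_command_queue q1 = clCreateCommandQueue(ctx, dev1, 0, &err);

                            /* ... enqueue kernels/reads/writes referencing 'buf' on q0 or q1 ... */

                            clReleaseCommandQueue(q1);
                            clReleaseCommandQueue(q0);
                            clReleaseMemObject(buf);
                        }
                        ```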

                         

                    • OpenCL memory allocation and multi-GPU host
                      n0thing

                       

                      Originally posted by: Raistmer But how do I limit the context to a single GPU then? I use only one device when creating the command queue, and all buffer allocations go through that same command queue, so the runtime should know that only that particular device is used...


                      You can query the individual device IDs in your app and call clCreateContext for each device to create separate contexts for your devices.

                      A command queue is specific to a single device, so you can do independent operations on each command queue.

                        • OpenCL memory allocation and multi-GPU host
                          Raistmer
                          Originally posted by: n0thing

                          Originally posted by: Raistmer But how do I limit the context to a single GPU then? I use only one device when creating the command queue, and all buffer allocations go through that same command queue, so the runtime should know that only that particular device is used...

                          You can query the individual device IDs in your app and call clCreateContext for each device to create separate contexts for your devices.

                          A command queue is specific to a single device, so you can do independent operations on each command queue.

                          Could you post a sample or point me to one, please?

                          In my case the whole app needs only a single device, so a queue bound to a single device is the perfect abstraction for me. But the context is created for all GPUs at once...
                            • OpenCL memory allocation and multi-GPU host
                              Raistmer
                              1)
                              If err = clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, num_entries, devices, &num_devices);

                              is called with NULL instead of devices (AFAIK allowed by the manual), the call fails with -30 (CL_INVALID_VALUE).
                              2) When I create the context I use one of the devices returned by this enumeration call in the devices array. But later, when I create a command queue for this single device in the context, what device ID should I use? The same one as before, or should I run another enumeration call for the devices belonging to the already created context (as I did when creating the context from a device type)?
                              Anyway, initialization currently fails here: Error: Building Program (clBuildProgram)
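                              For what it's worth, a sketch of the single-device sequence (assuming OpenCL 1.x; the kernel source is a hypothetical placeholder). The usual pattern is to pass num_entries = 0 together with devices = NULL when querying only the count, which avoids the -30 some implementations return otherwise; the same cl_device_id is then reused for the context, the queue, and clBuildProgram, so nothing is built for devices outside the context:

                              ```c
                              /* Sketch of the single-device flow (OpenCL 1.x; error checks trimmed).
                               * The kernel source here is a hypothetical placeholder. */
                              #include <stdio.h>
                              #include <CL/cl.h>

                              static const char *src =
                                  "__kernel void noop(__global float *p) { p[get_global_id(0)] = 0; }";

                              int main(void)
                              {
                                  cl_platform_id platform;
                                  cl_uint n = 0;
                                  cl_int err;

                                  clGetPlatformIDs(1, &platform, NULL);

                                  /* Query the count first: num_entries = 0 and devices = NULL. */
                                  clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 0, NULL, &n);
                                  if (n == 0) return 1;

                                  cl_device_id dev;                      /* take device 0 here */
                                  clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &dev, NULL);

                                  /* Reuse the same cl_device_id for context, queue, and build. */
                                  cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, &err);
                                  cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, &err);

                                  cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);
                                  /* Build for this one device only, instead of NULL/all devices. */
                                  err = clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);
                                  printf("build: %d\n", err);

                                  clReleaseProgram(prog);
                                  clReleaseCommandQueue(q);
                                  clReleaseContext(ctx);
                                  return 0;
                              }
                              ```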
                        • OpenCL memory allocation and multi-GPU host
                          Raistmer
                          Thanks. "not other" was my mistake. I had left clBuildProgram unchanged, so it tried to build for all devices (and even though I have only 1 GPU installed, this somehow resulted in an error).
                          Now the app works OK on my host with a single GPU. We'll see if it helps with the initial problem (and with the driver restarts on another host).
                          • OpenCL memory allocation and multi-GPU host
                            Raistmer
                            Report from a tester who has 4 identical HD5970s installed (only 2 of them had work for now):
                            "
                            This is what I have observed:

                            No more driver restart.
                            Both GPUs are used from the start
                            After some time only gpu 0 is used, and the gpu 1 app is high on CPU.

                            EDIT: after 34 minutes and 4,5% progress on gpu 0 the progress on gpu 1 got to 0,9000%.

                            So there IS progress on gpu 1, albeit veeeery slow.
                            "
                            Why such a big difference in speed between identical GPUs?
                            Any ideas? Higher CPU usage was also reported... could it come from the Brook+ part of the app?
                            • OpenCL memory allocation and multi-GPU host
                              Raistmer
                              This problem probably came from a poor choice of context-creation method in the template sample. That sample should show the simplest and most common case, but it demonstrates context creation from a device type rather than for a single GPU.
                              It makes no difference when only 1 GPU is installed, but with many GPUs it implies GPU cooperation...