9 Replies Latest reply on Feb 5, 2010 10:11 AM by gaurav.garg

    Different device types in a single context

    _Big_Mac_
      Why is this a bad idea?

      In Dominik Bohr's slides (http://gpgpu.org/wp/wp-content/uploads/2009/09/C1-OpenCL-API.pdf), on page 6, we read:

      "You may mix different device types in a context. Not necessarily a good idea though."

      Why is this not necessarily a good idea? Why not have one context per platform that encompasses all device types? Is this something specific to AMD's implementation at the time of writing or is it something more general that's likely to apply to all OpenCL implementations?

      If we have a CPU and a GPU plus a single buffer and we want to process it on both devices for some reason, is it better to have both devices in a common context and let OpenCL replicate the buffer or should we have two contexts and manually sync and copy data?

        • Different device types in a single context
          nou

          Practically, you need at least one core dedicated to the GPU to keep it fed with data. On NVIDIA you need one core per GPU because they use a spinlock. So when you also use the CPU as a compute device, you may see decreased performance.

          Secondly, you must handle load balancing between the devices yourself.

            • Different device types in a single context
              _Big_Mac_

              I don't think that really answers the question. You can use nonblocking copying and feed data to many GPUs in a single thread. You could also use many host threads with a single context - copying buffers is done at command-queue granularity, not context granularity, so as long as you have 1 device : 1 command-queue : 1 host thread you should be golden. That's beside the point anyway, the question is why mixing device types is considered "not necessarily a good idea". Apparently there's nothing wrong with having many devices of the same type in a single context.
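
              A rough illustration of the 1 device : 1 command-queue : 1 host thread layout, written in plain Python rather than with real OpenCL calls. The names `feeder` and `run`, and the use of `queue.Queue` as a stand-in for a command queue, are assumptions made for this sketch only.

```python
import queue
import threading


def feeder(commands, log):
    """Drain one device's private queue on its own host thread."""
    while True:
        cmd = commands.get()
        if cmd is None:  # sentinel: no more work for this device
            break
        log.append(cmd)  # stand-in for enqueueing a copy or a kernel


def run(devices, work_items):
    """Deal work_items round-robin to one queue and one thread per device."""
    queues = {d: queue.Queue() for d in devices}
    logs = {d: [] for d in devices}
    threads = [
        threading.Thread(target=feeder, args=(queues[d], logs[d]))
        for d in devices
    ]
    for t in threads:
        t.start()
    for i, item in enumerate(work_items):
        queues[devices[i % len(devices)]].put(item)
    for d in devices:
        queues[d].put(None)  # tell each feeder thread to stop
    for t in threads:
        t.join()
    return logs
```

              Each queue has exactly one consumer thread, so the per-device order of commands is preserved no matter how the threads interleave.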

                • Different device types in a single context
                  nou

                  Yes, you can use CPU+GPU devices together, but it can lead to worse performance than when only the GPU is used. That is why I think it is "not necessarily a good idea".

                    • Different device types in a single context
                      _Big_Mac_

                      Some algorithms may work faster on a CPU. This is still beside the point - why would having a shared context make anything worse? If you design your algorithm such that it makes frequent memcopies between CPU and GPU, this might cost you, sure. But that's a question of algorithm design, not of why a certain API solution would be a bad idea.

                       

                      Or in other words, if you came to the conclusion that GPU+CPU is indeed the best idea for a given problem, would it be better to still have both in separate contexts or a single one?

                        • Different device types in a single context
                          nou

                          Not necessarily a good idea == not always a good idea.

                          OK?

                            • Different device types in a single context
                              _Big_Mac_

                              I still don't understand why.

                              I mean sure, designing an algorithm that uses multiple device types is not necessarily a good idea. But we're not talking about algorithm design, we're talking about the API.

                              So if I decide to use only one device but have a context that spans several device types - will I suffer some kind of penalty? Or if I decide to use several device types (for example, I have to because no single device supports all the extensions I need for all the processing), will I be better off having separate contexts? See, how I design my app is orthogonal to why one way of setting up contexts is a better idea than the other. Having everything in a single context has its merits, e.g. OpenCL can move a shared buffer around without the need to explicitly block the host thread and synchronize things, as long as I juggle the events properly. What are the problems that might ensue, which were hinted at in the slides yet not explained?

                                • Different device types in a single context
                                  gaurav.garg

                                  I see it as a warning to developers that this feature might cause performance overhead.

                                  E.g. take a case where you allocate a CL buffer in a context associated with multiple devices (let's say CPU + GPU). This CL buffer is exposed to the developer as part of a shared memory pool. But internally the OpenCL driver might create two copies of the same buffer, one on the CPU and one on the GPU. Then every time you run a kernel on either the CPU or the GPU, the driver has to copy the modified buffer to the GPU or the CPU respectively. Of course, if the buffer is created with the CL_MEM_USE_HOST_PTR or CL_MEM_ALLOC_HOST_PTR flags, the driver doesn't have to do all these implicit copies and the buffer will actually be part of the shared memory pool.

                                  All this is just a guess, I haven't experimented with shared memory pool myself.
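
                                  To make the guess concrete, here is a toy Python model of the worst case described above, not a description of any real driver. `EagerSharedBuffer`, `count_copies` and the `host_backed` switch are hypothetical names; `host_backed=True` stands in for the assumption that a host-pointer buffer needs no implicit copies.

```python
class EagerSharedBuffer:
    """Toy model: one private copy per device, re-synced after every kernel."""

    def __init__(self, devices, host_backed=False):
        self.devices = list(devices)
        self.host_backed = host_backed  # assumption: shared host memory, no copies
        self.copies = 0

    def run_kernel(self, device):
        if device not in self.devices:
            raise ValueError(f"unknown device: {device}")
        if not self.host_backed:
            # Pessimistic driver: push the modified data to every other device.
            self.copies += len(self.devices) - 1


def count_copies(launches, host_backed=False):
    """Return the number of implicit copies for a sequence of kernel launches."""
    buf = EagerSharedBuffer(["cpu", "gpu"], host_backed)
    for device in launches:
        buf.run_kernel(device)
    return buf.copies
```

                                  In this model the copy count grows with the number of kernel launches, which is the overhead the warning would be about; whether a real implementation behaves this way is exactly the open question.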

                                    • Different device types in a single context
                                      _Big_Mac_

                                      Wouldn't the driver copy a shared buffer only when it's needed? E.g. when I launch 1000 kernels on the GPU and then 100 on the CPU, I'd get only one memcopy in between, when the buffer actually needs to switch devices, instead of 1100?

                                        • Different device types in a single context
                                          gaurav.garg

                                          Yes, I think that is how it should work. An OpenCL implementation should check a dirty flag on the buffer before copying.

                                          That's why I said I see this statement as a warning to developers in case they don't know what is happening at the driver level.

                                          So, using the shared memory pool is similar to managing all the memory sharing yourself. But in the former case a lot depends on the driver implementation.
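
                                          A minimal Python sketch of the dirty-flag idea, assuming every kernel writes the buffer and that the driver tracks which device holds a valid copy. `LazySharedBuffer` and `count_copies` are made-up names for illustration; real drivers may track this differently.

```python
class LazySharedBuffer:
    """Toy model: copy only when a device without valid data needs the buffer."""

    def __init__(self, initial_owner):
        self.valid_on = {initial_owner}  # devices whose copy is up to date
        self.copies = 0

    def run_kernel(self, device):
        if device not in self.valid_on:
            self.copies += 1  # fetch the current data before running
        # Assumption: the kernel writes the buffer, so all other copies go stale.
        self.valid_on = {device}


def count_copies(launches, initial_owner="gpu"):
    """Return the number of copies for a sequence of kernel launches."""
    buf = LazySharedBuffer(initial_owner)
    for device in launches:
        buf.run_kernel(device)
    return buf.copies
```

                                          With 1000 launches on "gpu" followed by 100 on "cpu", this model performs a single copy, at the first CPU launch. Alternating devices on every launch is the pattern that makes the copies pile up.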