8 Replies Latest reply on Jun 7, 2011 11:57 AM by himanshu.gautam

    double buffering

    Meteorhead
      how to do it efficiently

      Hi!

      I have needed double buffering in OpenCL several times, but I cannot find a good way to do it efficiently. The problem is that there are two buffers, one READ_ONLY and the other WRITE_ONLY. Once the kernels finish, I want to read from the active dataset, but buffer flags cannot be changed after creation. So I end up copying on the device, moving the contents of the WRITE_ONLY buffer back into the READ_ONLY one so it can be used the same way again. Device memory bandwidth is high, of course, but one can see this is really unneeded work.

      Is there really no other way of doing this?

        • double buffering
          maximmoroz

          What does prevent you from using CL_MEM_READ_WRITE flag when creating buffers?

            • double buffering
              Meteorhead

              CL_MEM_READ_ONLY enables cached reads, because the compiler then knows for certain that the data behind the buffer will not change during kernel execution. It may also enable better vectorization for the same reason. (I am less sure about this second point; the compiler might vectorize the code regardless, but then correctness is naturally not guaranteed, as opposed to the buffer being READ_ONLY.) Setting buffer flags correctly can significantly increase performance.

              So to put it shortly: I want the advantage of it being READ_ONLY, namely cached access.

                • double buffering
                  himanshu.gautam

                  Hi Meteorhead,

                  A cached buffer is certainly highly beneficial for reads. But I guess if you create the WRITE_ONLY buffer as READ_WRITE instead, it might work for you.

                  Also, you can use clEnqueueMapBuffer/clEnqueueUnmapMemObject.

                    • double buffering
                      Meteorhead

                      Himanshu, I do not quite understand what your point is.

                      I have two buffers: one that stores data valid for iteration x, and another that will hold the data of iteration x+1. The first is READ_ONLY, the other WRITE_ONLY. Once the iteration is done, I want to repeat the same thing: read the new x+1 dataset from a READ_ONLY buffer and write into one that is possibly WRITE_ONLY.

                      I am not sure whether WRITE_ONLY has any performance impact or whether the compiler can optimize anything based on it, and I do not know how the NVIDIA compiler works either. Generally, one should use the most restrictive flag that still suits the application, to let compilers optimize (even if they are currently not capable of it).

                      If one buffer is READ_ONLY and the other READ_WRITE, I cannot see how I could swap the roles of the buffers without copying the data explicitly.

                      I do not quite see how mapping and unmapping helps here (apart from being slightly faster than clEnqueueReadBuffer).

                      A roughly optimal application design would be a double-buffered layout with the flags set properly: launch the kernels to update the simulated system, swap the roles of the buffers, and while the next iteration step is taken, the data actively being read by the kernels is a valid dataset that can be read back to the host for processing.

                      So basically a simulation runs on the GPU, and the host application polls it at arbitrary intervals to save the dataset (not necessarily every iteration, but it could be).

                      As a worst-case scenario I thought of using clEnqueueCopyBuffer after each iteration to copy from the WRITE_ONLY buffer back to the READ_ONLY one, and possibly fetching the data back to the host at the same time with clEnqueueReadBuffer, both in a non-blocking manner. This way host memory is untouched when the host application is not polling, and the buffer copies are (hopefully) hidden behind the read-back.

                      Opinions?

                    • double buffering
                      maximmoroz


                      Originally posted by: Meteorhead CL_MEM_READ_ONLY enables cached reads, because the compiler then knows for certain that the data behind the buffer will not change during kernel execution. It may also enable better vectorization for the same reason. (I am less sure about this second point; the compiler might vectorize the code regardless, but then correctness is naturally not guaranteed, as opposed to the buffer being READ_ONLY.) Setting buffer flags correctly can significantly increase performance.

                      So to put it shortly: I want the advantage of it being READ_ONLY, namely cached access.

                      The compiler does not know which flags you used when you created the buffers you subsequently pass with clSetKernelArg. The only information the compiler has is the OpenCL source code. To enable caching you just need to specify the const and restrict keywords on the kernel function's parameters. It works.

                      I mean: when you build the program, you might not have created any buffers at all yet; you create them later. Either way, you set the buffers as arguments on a kernel from the already compiled program.

                        • double buffering
                          Meteorhead

                          So you suggest writing the kernel once, with the input __global pointer declared const and the other left plain __global, creating both buffers as READ_WRITE, and simply swapping the buffers with clSetKernelArg between iterations.

                          I somehow doubt that will work, but I will check. Thanks for the tip.

                          Could someone explain what the restrict keyword exactly does? The OpenCL spec says it works "normally" (or something like that). As I read on Wikipedia, the restrict keyword tells the compiler that no two pointers will access the same memory location, meaning there is no overlap in memory accesses. Is this the same on the GPU? Isn't it only supposed to mean that writes will not overlap? Reading memory does not collide, or does it?

                            • double buffering
                              maximmoroz


                              Originally posted by: Meteorhead So you suggest writing the kernel once, with the input __global pointer declared const and the other left plain __global, creating both buffers as READ_WRITE, and simply swapping the buffers with clSetKernelArg between iterations.

                              I somehow doubt that will work, but I will check. Thanks for the tip.

                              Could someone explain what the restrict keyword exactly does? The OpenCL spec says it works "normally" (or something like that). As I read on Wikipedia, the restrict keyword tells the compiler that no two pointers will access the same memory location, meaning there is no overlap in memory accesses. Is this the same on the GPU? Isn't it only supposed to mean that writes will not overlap? Reading memory does not collide, or does it?

                              Exactly. And it works, at least in the library I am working on.

                              The restrict keyword means that the memory accessed through that pointer is accessed only through it and by no other means. This effectively enables the compiler to use cached reads for the const * buffers.

                            • double buffering
                              himanshu.gautam

                              Meteorhead,

                              sorry I interpreted the problem wrong.

                              I also believe what maximmoroz said, and that should work nicely in your case. You can also pass the -fno-alias flag when building the kernel to enable cached reads.

                              Thanks