6 Replies Latest reply on Aug 6, 2010 8:42 PM by jeff_golds

    How to apply multiple kernels to buffered work-items before proceeding to rest of the NDRange

    noah_r
      discussing strategies of CFD applications where multiple kernels are applied, and problem does not fit within GPU memory

      I am porting a scientific computing application to use OpenCL for the main computational workload.  Specifically, this is a Computational Fluid Dynamics solver, so the nature of the algorithm involves communication only with nearest neighbor 'cells' in the fluid.

       

      In targeting modern GPU hardware, my algorithm is well suited in the sense that most communication is limited to neighboring work-items executed in the same work group.  It is limited though, because the amount of floating point calculation per cell at a given time step is relatively small compared to the PCIe bandwidth and latency .

       

      The issue:

      Say I am running on a machine with 16 GB of RAM and equipped with a GPU with on-card 1GB of RAM.

      The overall computation each time step is broken up into multiple sub-steps.  I want each sub-step to be written as an OpenCL function -- probably each a kernel function, because the order and specific functions is determined at run-time.

      This means that if my problem size is 8GB, it is too big to all fit on the GPU buffer.  I want all sub-steps to be executed using the cells that have already been buffered into GPU RAM, before swapping out the buffer over the PCIe bus.

       

      In my reading of the OpenCL spec, including the memory map functions that have been discussed in this forum before, I don't see a way to apply multiple kernels to memory buffers in part only.

       

      I would like to call clEnqueueNDRangeKernel using the entire 8GB problem size.  With a properly linked queue of kernels with dependencies established, it seems this will require executing each kernel on the entire 8GB problem using many PCIe bus transfers, before the next kernel is started, again with many unnecessary memory transfers.

       

      I have considered two possible solutions, but I am mainly wondering if other OpenCL users have encountered this issue or can suggest something more elegant.

       

      One would be to add more complexity to my host code, such the problem domain is broken into smaller parts that can fit entirely on the GPU memory.  Then I can execute a string of kernels on each smaller part, using the device queue, and presumably avoid unnecessary bus communication.  This complexity would include include managing neighbor cells at the splits -- duplicating them using a technique commonly called 'ghost cells'.

       

      There may be another option using only one kernel function that some how dynamically  builds a list of other OpenCL functions to be called for each work item.  I understand that function pointers are not supported in the language, but I think something could still be worked out to implement this.  Then, when a work-item is executed, multiple algorithm steps could be applied using that cell's value and neighboring cells while still only calling clEnqueueNDRangeKernel on the entire problem domain.  I'll admit, I just thought about this while writing, so I haven't thought it all the way through.

       

      Can anyone offer some advice or report having tried something else?

      Thanks.

      -Noah 

        • How to apply multiple kernels to buffered work-items before proceeding to rest of the NDRange
          jeff_golds

          There are current limitations to how much memory you can bind to the GPU.  For example, there's no feasible way to bind 8 GB to the device currently.

          We are working on improving this behavior by allowing host memory to be bound directly to the device, alleviating the need to copy all data to the device before kernel execution.  Obviously, accessing host memory is slower than local device memory, but if your data set is so large, then there's no other solution.

          Jeff

          • How to apply multiple kernels to buffered work-items before proceeding to rest of the NDRange
            LeeHowes

            It isn't even possible to run against an 8GB buffer is it? I can't see that the runtime could ever know which chunks of it to copy across given that OpenCL has free pointer access to buffers. Why not write a loop that enqueues a copy, enqueues the sequence of kernels on that chunk, enqueues a copy back and then enqueues the copy for the next chunk and so on. Even worse than that how would it know what ghost cells to maintain on the device? OpenCL doesn't provide any information to allow the run time to substantially optimise transfers, unfortunately.

            Probably you even want to use a smaller chunk of the array than the full allowed GPU memory allocation, maybe half, and then double buffer so that once DMA is enabled the copies and computations are able to overlap.

              • How to apply multiple kernels to buffered work-items before proceeding to rest of the NDRange
                jeff_golds

                 

                Originally posted by: LeeHowes It isn't even possible to run against an 8GB buffer is it? I can't see that the runtime could ever know which chunks of it to copy across given that OpenCL has free pointer access to buffers.


                With direct access to the host memory, there's no need to copy anything, as I mentioned.

                Jeff

                  • How to apply multiple kernels to buffered work-items before proceeding to rest of the NDRange
                    LeeHowes

                     

                    With direct access to the host memory, there's no need to copy anything, as I mentioned.


                    That's true, but if you were going to work on that much data in that fashion there'd be no point getting the GPU to do it at all, it'd be faster to do it over hypertransport than PCIe.

                     

                     

                    If the implementation only supports buffers smaller than the available GRAM size, I guess my concern is misplaced.  This would be dissapointing though.


                    Other than Jeff's solution of using host memory directly I don't think I can see how it could be automated. Possibly a Cell-style software cache on large chunks, but that never worked very efficiently even on Cell. Without clearer descriptions of memory access patterns like a stream programming model I don't think it would make sense to automate it. OpenCL is a low level API and programming model, the down side of having that level of control when compared with a more stream-like model such as Brook is that you have to manually control your data movement.

                  • How to apply multiple kernels to buffered work-items before proceeding to rest of the NDRange
                    noah_r

                    So you're saying a call to clCreateBuffer for an 8GB buffer would fail?  I haven't tried.  What is the size limitation or necessary work-around?

                    If the implementation only supports buffers smaller than the available GRAM size, I guess my concern is misplaced.  This would be dissapointing though.

                      • How to apply multiple kernels to buffered work-items before proceeding to rest of the NDRange
                        jeff_golds

                         

                        Originally posted by: noah_r So you're saying a call to clCreateBuffer for an 8GB buffer would fail?  I haven't tried.  What is the size limitation or necessary work-around?

                         

                        If the implementation only supports buffers smaller than the available GRAM size, I guess my concern is misplaced.  This would be dissapointing though.



                        Yes, it would fail.  If you check the clGetDeviceInfo() it will tell you how much memory is available.  I believe currently we report 128 MB max allocation size and there should also be information about how much memory is available for allocation (although you aren't guaranteed to be able to use all of that at once).  We also provide a sample app called clinfo that can provide data about device characterics, such as max allocation size and max memory available.

                        Jeff