1 Reply Latest reply on Mar 2, 2011 3:32 PM by himanshu.gautam

    Concurrent Buffer Reads/Writes?

      (and kernels)


      I am getting started on my first OpenCL project.  After spending a few hours reading through the forums, I came across a few limitations of OpenCL that were a little surprising. My original plan for the project was to develop five 'stages', similar to a processor pipeline, that would perform different operations on a set of data.  After each 'cycle', the data would advance through the pipeline and a different kernel would work on it. Meanwhile, new data would be piped in (buffer write) while everything was working.

      After my investigation, it seems that in order for this plan to work, I have to gather all of my code into one kernel, and somehow branch to the various tasks within the combined kernel. Are there any disadvantages to this approach?  For example, there is overhead associated with branching and checking what 'stage' we're working on.  Also, I realized that each task needs to be sized in multiples of 64 threads. Is there anything else I need to know with this approach?

      Furthermore, I was planning on having kernels running while buffer reads and writes were occurring (new input would be written to the VRAM, and processed data would be read from the VRAM). I haven't found a definitive answer on whether this is possible, but it seems like it isn't?  Are there any schemes to handle this, or does it all happen serially?

      (It's amazing that a library that enables parallelism is implemented in serial??? ... Is this a problem with resource management on the chip? e.g., which cores are in use, and which aren't?)



        • Concurrent Buffer Reads/Writes?

          hi chris,

          Nice to hear your plans. To answer a few of the questions:

          Gathering all the code into one kernel will likely leave you with more code controlling the logic than actually implementing it. Branches are really bad for the GPU when the threads within a wavefront diverge. But writing a separate kernel for each part of the code can mean more data transfers, plus launch overhead for each extra kernel. So the actual answer can go either way. I guess experimenting is the only route.
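
          If you do try the combined kernel, one way to keep the branching cheap is to pick the stage once per block of 64 work-items, so that every thread in a wavefront takes the same path. Below is a rough sketch of that idea in plain C, emulated on the CPU rather than written as real OpenCL. The wavefront width of 64 is taken from the figure in the question, and every name here (stage_scale, stage_offset, combined_kernel, run_combined, stage_of_wavefront) is made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed wavefront width, taken from the 64-thread figure in the question. */
#define WAVEFRONT 64

/* Hypothetical stage bodies; stand-ins for the real per-stage work. */
static float stage_scale(float x)  { return x * 2.0f; }
static float stage_offset(float x) { return x + 1.0f; }

/*
 * Emulates one work-item of a combined kernel.
 * The stage is looked up by wavefront index (gid / WAVEFRONT), so all 64
 * work-items that share a wavefront take the same branch of the switch.
 */
static void combined_kernel(size_t gid, const int *stage_of_wavefront,
                            float *data)
{
    int stage = stage_of_wavefront[gid / WAVEFRONT];

    switch (stage) {
    case 0:
        data[gid] = stage_scale(data[gid]);
        break;
    case 1:
        data[gid] = stage_offset(data[gid]);
        break;
    default:
        /* Unknown stage: leave the element untouched. */
        break;
    }
}

/*
 * CPU stand-in for a kernel launch over n work-items.
 * stage_of_wavefront must hold n / WAVEFRONT entries.
 */
void run_combined(float *data, size_t n, const int *stage_of_wavefront)
{
    /* Each stage must cover whole wavefronts for the branch to stay uniform. */
    assert(n % WAVEFRONT == 0);

    for (size_t gid = 0; gid < n; ++gid)
        combined_kernel(gid, stage_of_wavefront, data);
}
```

          In a real kernel the loop would go away and gid would come from get_global_id(0). Whether this beats separate kernels still depends on your data and device, so it is worth timing both.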