20 Replies Latest reply on Feb 5, 2010 6:33 AM by genaganna

    GPU Global Memory

    ryta1203

      I know this has probably been mentioned before but please refresh my memory:

      If I have a kernel (streaming) that has 4 inputs and 1 output, how do I call the global memory in CAL/IL?

      For inputs with size 1024 is it:

      g[Tid.x+0], g[Tid.x+1], g[Tid.x+2], g[Tid.x+3]

      OR

      g[Tid.x+0], g[Tid.x+1024], g[Tid.x+2048], g[Tid.x+3072]

      ???

      And for the output, what is it? If I use g[Tid.x+0] won't that overwrite my input value??

        • GPU Global Memory
          hazeman

          You can attach the g register to only one input/output buffer (an image in CAL terminology, and it can be both input and output at the same time).

          So it's your decision where in this buffer you want to place your data.

          You can interleave the data, so g[Tid.x+0], g[Tid.x+1], g[Tid.x+2], g[Tid.x+3] is OK. But you can also put the data from the first buffer, then from the second, and so on (so g[Tid.x+0], g[Tid.x+1024], g[Tid.x+2048], g[Tid.x+3072] will also be correct).

          And yes, g[Tid.x+0] will overwrite the data from the first buffer. You can reserve some part of the buffer for output data (and write at an offset into that part).

          On the Cypress family you can use UAVs to access multiple buffers (so there is no need to struggle with putting all the data into one buffer). You could also use the texture units (TUs) to read data from buffers; they have the advantage of a cache, which could speed things up.

          PS. To be clear, g[] indexing starts from 0, so g[0] gives the first float4 (or int4 or uint4) from the attached buffer.
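
The two layouts described above can be modelled on the host side. The following is a minimal Python sketch, not CAL/IL code: the helper names (`interleaved_index`, `blocked_index`, `pack_interleaved`, `pack_blocked`) are hypothetical, and it assumes that "interleaved" means element `tid` of input `k` sits at `4*tid + k`, so the g[Tid.x+0..3] shorthand in the thread would correspond to a per-thread base of `4*tid`.

```python
# Hypothetical host-side model of a single global buffer holding 4 inputs.
NUM_INPUTS = 4
N = 1024  # elements per input, as in the question


def interleaved_index(tid, k, num_inputs=NUM_INPUTS):
    """Index of element tid of input k when the inputs are interleaved."""
    return tid * num_inputs + k


def blocked_index(tid, k, n=N):
    """Index of element tid of input k when the inputs are stored back to back."""
    return k * n + tid


def pack_interleaved(inputs):
    """Pack equally sized inputs as in0[0], in1[0], ..., in0[1], in1[1], ..."""
    num_inputs = len(inputs)
    n = len(inputs[0])
    buf = [None] * (num_inputs * n)
    for k, data in enumerate(inputs):
        for tid, value in enumerate(data):
            buf[interleaved_index(tid, k, num_inputs)] = value
    return buf


def pack_blocked(inputs):
    """Pack inputs as all of in0, then all of in1, and so on."""
    buf = []
    for data in inputs:
        buf.extend(data)
    return buf
```

Either way the kernel and the host only have to agree on the index formula; the buffer itself carries no notion of where one input ends and the next begins.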

           

          • GPU Global Memory
            MicahVillmow
            ryta,
            The global buffer is a uniform address space, so you can only bind a single resource/memory to it. How you lay out your input and output data is kernel specific. If you want bursting on global, you need to add 0, 1, 2, 3, ... to your base offset into the global.
            For example:
            mov g[1024 + 0], r0
            mov g[1024 + 1], r1
            mov g[1024 + 2], r2
            mov g[1024 + 3], r3

            Would get you bursting.

            Also, since the g register is both an input and output register and it is a uniform address space, writing to the wrong location can clobber your input data.
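
The clobbering risk can be illustrated with a small host-side model. This is a Python sketch under stated assumptions, not CAL/IL: the blocked layout, the `OUT_BASE` constant, and the `make_global` and `run_kernel` helpers are all hypothetical, and the "kernel" simply sums the four inputs per element.

```python
# Hypothetical model: 4 inputs of N elements stored back to back in one
# global buffer, followed by a region reserved for the output.
N = 1024
NUM_INPUTS = 4
OUT_BASE = NUM_INPUTS * N  # first index past the last input element


def make_global():
    """Build the global buffer; input k is filled with the value k + 1."""
    g = [0] * (OUT_BASE + N)
    for k in range(NUM_INPUTS):
        for tid in range(N):
            g[k * N + tid] = k + 1
    return g


def run_kernel(g, out_base):
    """For each 'thread', sum its four inputs and write to g[out_base + tid]."""
    for tid in range(N):
        total = sum(g[k * N + tid] for k in range(NUM_INPUTS))
        g[out_base + tid] = total
    return g


safe = run_kernel(make_global(), OUT_BASE)  # output lands after the inputs
clobbered = run_kernel(make_global(), 0)    # output lands on top of input 0
```

With `out_base = OUT_BASE` the first input is still intact after the run; with `out_base = 0` it has been replaced by the results, which is the overwrite described above.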
              • GPU Global Memory
                ryta1203

                So if you have a kernel with 8 inputs, then the developer must combine these inputs into one "uniform address space"? This does not seem like a good idea to me. Also, the developer must keep track of the specific address where the output begins? Again, this does not seem like a good idea.

                Also, OpenCL obviously has to "handle" this in some way. Does it do some address translation/combining, and how much overhead is associated with that?

                Also, so when you copy over 4 inputs, how is that handled? If I have 4 inputs of 1k size on the CPU side and want to copy to the gpu must I manually compress them into one array and then copy that array to global memory and then manually address each input using the compression offset?? Again, this does not seem like a good idea.

              • GPU Global Memory
                MicahVillmow
                OpenCL currently uses a single UAV and maps all global and emulated pointers onto the same memory surface using a combination of CAL API calls and copy shaders.
                • GPU Global Memory
                  MicahVillmow
                  g[idx.x] and g[idx.x + size_first_buffer]
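
The `g[idx.x + size_first_buffer]` form generalizes to any number of buffers: each buffer's base is the sum of the sizes before it. The sketch below only illustrates that arithmetic in Python; `base_offsets` and `global_index` are hypothetical names, and nothing here is taken from the actual OpenCL runtime.

```python
from itertools import accumulate


def base_offsets(sizes):
    """Start offset of each buffer when the buffers are laid out back to back."""
    # Running totals give the end of each buffer; shift by one to get the starts.
    return [0] + list(accumulate(sizes))[:-1]


def global_index(bases, buffer_id, idx):
    """Index into the single surface for element idx of buffer buffer_id."""
    return bases[buffer_id] + idx


bases = base_offsets([1024, 1024, 512])
second_buffer_elem = global_index(bases, 1, 7)  # i.e. g[7 + size_first_buffer]
```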
                    • GPU Global Memory
                      ryta1203

                      Micah,

                        Thank you, this is what I thought. So if you have multiple outputs and you want to write out to the indexed element of each of those outputs then it's not possible to burst write, correct? Since the output would be: g[idx.x+0], g[idx.x+size_first_buffer]?

                      • GPU Global Memory
                        ryta1203

                         

                        Originally posted by: MicahVillmow g[idx.x] and g[idx.x + size_first_buffer]


                        So is this considered coalesced (I know not normally, but I'm talking about how the memory hardware is laid out and how it's read)?

                        If not and OpenCL does it this way then why? Wouldn't it be better to interleave the input elements to achieve coalescing?

                      • GPU Global Memory
                        MicahVillmow
                        Input/output interleaving is app/kernel specific and is not something that we can reliably generate in the compiler stack.
                        • GPU Global Memory
                          MicahVillmow
                          Not between i/o pointers, but it is possible to trigger bursting to the same pointer.
                            • GPU Global Memory
                              ryta1203

                              Micah,

                                So essentially, if I have 2 outputs, you are saying it's possible to burst write to multiple consecutive locations of one output, but not to burst write into both outputs the way it's set up now, because you can't guarantee that the two outputs are set up for consecutive location writes?

                            • GPU Global Memory
                              MicahVillmow
                              Burst writing only happens to sequential addresses. Where the memory exists doesn't matter; as long as the addresses are sequential, the writes can be bursted. OpenCL can't interleave pointers, since pointers can be aliased and point to the exact same memory location. Hope this helps.
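
The "sequential addresses" condition can be checked on paper with a few lines of Python. This sketch models only the address arithmetic discussed in the thread; the helper names are hypothetical, and real hardware may impose further conditions that are not covered here.

```python
def is_sequential(addresses):
    """True if each address is exactly one past the previous one."""
    return all(b - a == 1 for a, b in zip(addresses, addresses[1:]))


def writes_same_pointer(base, count):
    """Addresses written as g[base + 0], g[base + 1], ... to one output."""
    return [base + i for i in range(count)]


def writes_across_pointers(idx, bases):
    """Addresses written as g[idx + base] for each output pointer's base."""
    return [b + idx for b in bases]


one_output = writes_same_pointer(1024, 4)          # one pointer, 4 writes
two_outputs = writes_across_pointers(5, [0, 1024])  # same idx, two pointers
aliased = writes_across_pointers(5, [0, 0])         # two pointers, same memory
```

Under this model, `is_sequential(one_output)` holds and `is_sequential(two_outputs)` does not, which matches the answer above. The `aliased` case shows why interleaving cannot be applied blindly: both pointers resolve to the same address, so there are not two distinct streams to interleave.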