11 Replies Latest reply on Jan 15, 2009 1:56 PM by thesquiff

    GPU memory architecture?

    jski

      Some quick (hopefully) questions:

      1) Where does streamRead() store the data on the video card (texture memory)?

      2) Where does streamWrite() get the data from to write back to RAM?

      3) Is there a good discussion of this somewhere?

      ---jski

        • GPU memory architecture?
          jean-claude

          Same questions, plus one related to memory block boundaries:

          How does Brook/CAL cope with out-of-bounds memory accesses, i.e. for instance whenever one points to a pixel outside of the texture?

          Let me just take a very simplistic example to illustrate:

          kernel void test_average(out float4 filtered<>, float4 input[][]) {
              const float2 si = indexof(filtered);   // pixel position
              float4 val = 0.0f;                     // local accumulator for the 5x5 average
              float i, j;

              for (i = -2.0f; i <= 2.0f; i += 1.0f) {
                  for (j = -2.0f; j <= 2.0f; j += 1.0f) {
                      float2 index = {i, j};
                      val += input[si + index];      // here: risk of out-of-bounds access
                  }  // end j loop
              }      // end i loop

              filtered = val / 25.0f;
          }

          So: what, for instance, is the behaviour when i = -2.0f and j = -2.0f?

          Which memory positions are sampled?

          Should the boundary check be performed within the kernel code? In that case, additional computation overhead is obviously implied...

          BTW, I did run a scaled-up version of this for 1024*1024 image filtering with no in-kernel boundary check, and apparently (and surprisingly!) the result is perfect...

          So, all this brings us back to jski's question: where and how does streamRead() store the data on the video card?



            • GPU memory architecture?
              MicahVillmow
              A streamRead operation can be mapped to the following CAL function calls:
              1) calResMap the memory resource backing your stream
              2) memcpy the data from your data pointer to the mapped stream resource
              3) calResUnmap the memory resource
              4) calCtxMemCopy the memory to the graphics device (this is only required if the resource was allocated as remote)

              The streamWrite operation would do the opposite.
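
              For illustration, here is a rough sketch of that sequence at the CAL level. This is my own reconstruction, not code from the Brook+ runtime: the variables (ctx, res, remoteMem, localMem, event, hostData, numBytes) are assumed to be set up already, the CAL signatures are written from memory and may differ in detail, and all error checking is omitted.

              CALvoid* mapped = 0;
              CALuint  pitch  = 0;

              /* 1) map the stream's resource into host address space */
              calResMap(&mapped, &pitch, res, 0);

              /* 2) copy from the user pointer into the mapped resource
                    (a real copy must honour 'pitch', i.e. copy row by row) */
              memcpy(mapped, hostData, numBytes);

              /* 3) release the mapping so the GPU may use the resource */
              calResUnmap(res);

              /* 4) only if the resource was allocated remote (PCIe memory):
                    DMA it into a local, on-card resource and wait for completion */
              calCtxMemCopy(&event, ctx, remoteMem, localMem, 0);
              while (calCtxIsEventDone(ctx, event) == CAL_RESULT_PENDING)
                  ;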

              When using the CAL backend in Brook+, all reads that go out of bounds are clamped by the hardware to the boundary pixels. When using the CPU backend, a buffer overrun occurs and the results are undefined.
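
              If you do want the kernel itself to be safe (e.g. for the CPU backend), an explicit clamp inside the kernel is one option. Here is a minimal sketch based on the averaging kernel above; it assumes the image width and height are passed in as extra parameters and that min/max intrinsics and component writes behave as in the other Brook+ samples:

              kernel void test_average_clamped(out float4 filtered<>, float4 input[][],
                                               float width, float height) {
                  const float2 si = indexof(filtered);   // pixel position
                  float4 val = 0.0f;                     // local accumulator
                  float i, j;

                  for (i = -2.0f; i <= 2.0f; i += 1.0f) {
                      for (j = -2.0f; j <= 2.0f; j += 1.0f) {
                          float2 index = {i, j};
                          float2 pos = si + index;
                          // clamp the sample position to the valid [0, size-1] range
                          pos.x = min(max(pos.x, 0.0f), width  - 1.0f);
                          pos.y = min(max(pos.y, 0.0f), height - 1.0f);
                          val += input[pos];
                      }
                  }
                  filtered = val / 25.0f;
              }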
              • GPU memory architecture?
                MicahVillmow
                There are three memory domains that are important when developing GPGPU apps: host memory, host PCIe memory, and graphics memory. CAL resources, which are used by Brook+, can be located in two of the three locations. If the memory is created using the local version of the API function, calResAlloc*, then it is stored in the graphics card's RAM; if it is created with the remote version, then it is stored in PCIe memory.
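
                As a purely illustrative sketch of the two allocation paths (the exact calResAlloc* signatures here are written from memory and may differ slightly; error checking is omitted):

                CALresource localRes  = 0;   /* will live in the graphics card's RAM  */
                CALresource remoteRes = 0;   /* will live in PCIe-visible host memory */

                calResAllocLocal2D(&localRes, device, width, height,
                                   CAL_FORMAT_FLOAT_4, 0);

                calResAllocRemote2D(&remoteRes, &device, 1, width, height,
                                    CAL_FORMAT_FLOAT_4, 0);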
                • GPU memory architecture?
                  MicahVillmow
                  jski,
                  1) The host memory is the memory domain that can be directly mapped into what a normal program would use. This memory is only available to the user space program and the GPU cannot access it. This is where all your data structures and program data will reside in the normal run of a computing session.
                  2) The host PCIe memory is a section of main memory on your system that is set aside and mapped into the PCIe memory space. This memory is accessible from both the host program and the GPU and thus can be modified by both. Modification of this memory requires synchronization between the GPU and CPU, usually with the calCtxIsEventDone API call. In Brook+ this is handled for you.
                  3) The graphics memory is the GPU's version of host memory. It is only accessible by the GPU and not accessible via the CPU. There are three ways to copy data to the GPU memory: implicitly through calResMap/calResUnmap, explicitly via calCtxMemCopy, or via a custom copy shader that reads from PCIe memory and writes to GPU memory.

                  The major difference between these three memory domains is the amount of copying involved. In a simple, naive program that doesn't manage memory transfers, like all of the samples, there is a double copy involved: host->PCIe and PCIe->graphics. This is why you see a huge performance difference between the System GFlops and the Kernel GFlops. With proper memory transfer management and the use of system pinned memory, via calCtxResCreate in cal_ext.h, the host->PCIe copy can be removed. However, it is not a very easy API call to use and comes with a lot of caveats.

                  The reason for this is simply the memory bandwidth achievable at each stage: host->PCIe copies are usually in the hundreds of MB/s, PCIe->graphics memory copies are in the GB/s range, and on-chip memory bandwidth is in the tens to hundreds of GB/s range. In GPGPU programming you want to drastically reduce these copy bottlenecks, which can be done by pipelining execution and copies or by some other novel technique.
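
                  To put rough, purely illustrative numbers on that (the rates below are assumptions for the sake of arithmetic, not measurements): a 256 MB stream copied host->PCIe at ~500 MB/s takes about 0.5 s, the PCIe->graphics hop at ~4 GB/s takes about 60 ms, while a kernel that streams the same 256 MB out of on-card memory at ~100 GB/s needs only about 2.5 ms. The transfers, not the kernel, dominate the System GFlops figure.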
                    • GPU memory architecture?
                      jski

                      "In a single naive program that doesn't handle memory transfers, like all of the samples, there is a double copy involved between host->pcie and pcie->graphics."

                      Could we get an example of, say, simple_matmult, that is optimized to eliminate unnecessary copies?

                      ---jski

                      • GPU memory architecture?
                        thesquiff

                         

                        Originally posted by: MicahVillmow jski, 1) The host memory is the memory domain that can be directly mapped into what a normal program would use. This memory is only available to the user space program and the GPU cannot access it. This is where all your data structures and program data will reside in the normal run of a computing session. 2) The host pcie memory is a section of main memory on your system that is set aside and mapped into the PCIe memory space. This memory is accessible from both the host program and the GPU and thus can be modified by both. Modification of this memory requires synchronization between the GPU and CPU usually with the calCtxIsEventDone api call. In brook+ this is handled for you. 3) The graphics memory is the GPU's version of host memory. It is only accessible by the GPU and not accessible via the CPU. There are three ways to copy data to the GPU memory, either implicitly through calResMap/calResUnmap or explicitly via calCtxMemCopy or via a custom copy shader that reads from PCIe memory and writes to GPU memory.


                        Which of these memories are cached and where does the local memory of the stream processors fit into this hierarchy? Do the stream processors have separate local memories or is this part of the graphics memory? There seems to be a huge void of information in the programming guide.

                        Thanks in advance.

                      • GPU memory architecture?
                        MicahVillmow
                        Jski,
                        I'll pass on your request.
                          • GPU memory architecture?
                            rick.weber

                            Where does a "stream" physically reside on the GPU? I have the following function:

                             

                            void throughputUp()
                            {
                                // five 8192x8192 float buffers in host memory (256 MB each)
                                float* test1 = (float*)malloc(sizeof(test1[0])*8192*8192);
                                float* test2 = (float*)malloc(sizeof(test2[0])*8192*8192);
                                float* test3 = (float*)malloc(sizeof(test3[0])*8192*8192);
                                float* test4 = (float*)malloc(sizeof(test4[0])*8192*8192);
                                float* test5 = (float*)malloc(sizeof(test5[0])*8192*8192);

                                // five matching 8192x8192 float streams on the GPU
                                float testGPU1<8192,8192>;
                                float testGPU2<8192,8192>;
                                float testGPU3<8192,8192>;
                                float testGPU4<8192,8192>;
                                float testGPU5<8192,8192>;

                                // copy each host buffer into its stream
                                streamRead(testGPU1,test1);
                                streamRead(testGPU2,test2);
                                streamRead(testGPU3,test3);
                                streamRead(testGPU4,test4);
                                streamRead(testGPU5,test5);
                            }



                            This function causes a resource allocation exception on the 4th streamRead call. According to my calculations, the GPU should have enough storage for this much data: my GPU has 2 GB of RAM (FireStream 9170), each matrix is 256 MB in size, and there are five of them (thus 1.25 GB total). Also, I tried putting a single streamRead in a for loop, loading to the same location/stream each time. However, I was getting unreasonable bandwidth numbers (125 GB/s). Is the Brook+ environment aware of the fact that I've already loaded these streams?