7 Replies Latest reply on Aug 17, 2011 3:42 AM by zhuzxy

    Why is it so expensive to write the GPU result to global memory and transfer it back to the CPU?

    zhuzxy

      Hello,

         I have a problem with my final kernel. It takes about 2 ms to execute if I do not write the result back to global memory. If I do write it back (which of course I must, because I need the calculation result), the kernel time rises to about 7.5 ms. The result is about 40 bytes per work item, with 640 work items in total. Why is it so expensive to write the result back to global memory? Is there a better way to get the result back to the CPU?

      Thanks

        • genaganna

           


           

          From the information given, it is difficult to say anything. Please go through chapter 4 of the programming guide, where optimizations are explained. Feel free to ask any questions you have about that chapter.

            • nou

              The compiler may optimize away the whole calculation when you don't use its result, so the shorter time may not include the real work.
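To make the point concrete, here is a minimal sketch in plain C rather than OpenCL. The function names and the scaling step are assumptions for illustration, not the poster's kernel. It contrasts a loop whose results are stored with the same loop whose results are dropped.

```c
#include <stddef.h>

/* Variant A: every value is stored where the caller can read it, so the
   compiler has to keep both the arithmetic and the stores. */
int scale_and_store(const float *in, signed char *out, size_t n, float norm_fact)
{
    int squ_len = 0;
    for (size_t i = 0; i < n; i++) {
        int val = (int)(norm_fact * in[i]);
        out[i] = (signed char)val;   /* observable side effect */
        squ_len += val * val;
    }
    return squ_len;
}

/* Variant B: same arithmetic, but no result leaves the function. An
   optimizing compiler is allowed to delete the whole loop, so timing this
   variant may measure almost nothing. */
void scale_and_discard(const float *in, size_t n, float norm_fact)
{
    int squ_len = 0;
    for (size_t i = 0; i < n; i++) {
        int val = (int)(norm_fact * in[i]);
        squ_len += val * val;
    }
    (void)squ_len;   /* result is never used */
}
```

Comparing the timing of variant B against variant A therefore says little about the cost of the stores alone. A fairer comparison keeps at least one observable output in both versions.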

                • zhuzxy

                  Thanks, nou, for the reminder. I have modified my measurement.

                  My code looks like the following:

                  normFact   : private float variable
                  descriptor : private array of size 36
                  desc       : global variable
                  sum_val    : global variable
                  squLen     : private variable

                  {
                      // a lot of computation happens before this point

                      for (int i = 0; i < 36; i++)
                      {
                          int val = (int)(normFact * (float)descriptor[i]);
                          desc[desc_offset + i] = (signed char)val; // with this line commented out the kernel takes 5.1 ms, otherwise 7.6 ms
                          squLen += val * val;
                      }

                      sum_val[pos] = squLen;

                      // kernel finishes and exits
                  }

                  If I comment out the line 'desc[desc_offset + i] = (signed char)val;', the kernel takes 5.1 ms; if I leave it in, the kernel takes 7.6 ms. There are 512 work items in total, 64 work items per work group. So the bandwidth is 512 * 36 bytes / 2.5 ms, which is about 7 MB/s. I am using a zero-copy buffer for the global memory. The platform is an A8-3850. Can you give me some advice on the copy performance? Thanks a lot.
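The bandwidth estimate in this post can be checked with a small sketch in plain C. The function and parameter names are illustrative assumptions. It counts only the bytes written to desc, since sum_val is written in both measurements.

```c
/* Effective bandwidth, in bytes per second, of the stores that were
   commented out. All names here are illustrative. */
double store_bandwidth(int work_items, int bytes_per_item,
                       double ms_with_store, double ms_without_store)
{
    /* Total bytes written to desc, e.g. 512 * 36 = 18432. */
    double bytes = (double)work_items * bytes_per_item;

    /* Extra time attributed to the stores, converted from ms to seconds. */
    double seconds = (ms_with_store - ms_without_store) / 1000.0;

    return bytes / seconds;
}
```

Calling store_bandwidth(512, 36, 7.6, 5.1) gives roughly 7.4 million bytes per second, which agrees with the figure of about 7 MB/s quoted in the post.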

                    • genaganna

                       


                      Zhuzxy,  

                      512 work items is a very small number.

                      Which flags did you use to create the following global buffers?

                            1. desc

                            2. sum_val