7 Replies Latest reply on Jun 12, 2010 2:33 PM by niravshah00

    can you execute a kernel and do a dma memcopy in parallel?

    foobar2342

      So, can you DMA a memory block to or from the GPU while a kernel is executing?

          godsic

          Yes you can.

          However, I'm not sure how calMemCopy is implemented in the CAL runtime (see this post: http://forums.amd.com/forum/messageview.cfm?catid=328&threadid=133598).

          Using the benchmark provided in that post, one can see that calMemCopy results are sometimes close to those of a simple data-copy kernel. Based on my measurements on the HD5850, HD2600Pro, HD4890, and HD3470, I suspect that calMemCopy uses DMA for Remote->Remote and Remote<->Local transfers, while for Local->Local transfers it uses a different mechanism, most likely special data-copy kernels.

          As for DMA, it looks like it is extremely slow, so I suspect that the performance of calMemCopy depends strongly on the DMA implementation. The measurements also suggest that the DMA function (the DMA controller) is not optimized for large data transfers. Additionally, I noticed that the results with an AMD northbridge (NB) are much higher than with an Intel NB.

          From this one can conclude that, for efficient GPGPU usage, vendors should integrate and OPTIMIZE ALL the HARDWARE and software, rather than just increasing the GPU's SIMD count. Moreover, AMD is a full-platform vendor, so I cannot understand why they neglect all of this optimization.

              hazeman

              And one more thing: any transfer (DMA or non-DMA) during kernel execution results in some performance degradation (the kernel runs slower). It's a small penalty on 5xxx cards, but quite a large one on 4xxx.

              If the data transmission takes less than 5-10% of the execution time, then it's best to simply use the send-compute-receive approach.

                  godsic

                  If a DMA transfer causes any performance reduction, then it is probably not completely DMA.

                  I'm also not sure how the GPU memory controller deals with GPU cache coherency. AMD definitely did not take any steps towards specific hardware optimizations, since there are no new PCIe or IOMMU features in the latest RD890 chipset and Phenom II processors.

                      niravshah00

                      How do you do this in Brook+? Is it even possible to do it in Brook+?

                      What I want to do: I am running my algorithm in tiles, so once the first tile has completed I want to copy its results while the second tile runs in parallel.

                      This would be very helpful for me.

                          genaganna

                          Originally posted by: niravshah00
                          How do you do this in Brook+? Is it even possible to do it in Brook+? What I want to do: I am running my algorithm in tiles, so once the first tile has completed I want to copy its results while the second tile runs in parallel. This would be very helpful for me.

                          Yes, you can do this in Brook+ as well. Please take a look at the Asynchronous sample available in samples\CPP\tutorials and read Section 2.13 of Stream_Computing_User_Guide.pdf, which comes with the Brook+ SDK.

                              niravshah00

                              Thanks genaganna, I got what I wanted.

                              Thanks a ton.

                              Also, could you help me with my post "How to return values from the kernel?"

                              I remember you started helping with the same issue on another of my posts, and I also gave you the C reference code.

                              That part is the only thing I am stuck on; if I figure it out, I am done with the project.

                               

                                  niravshah00

                                  int main(int argc, char ** argv)
                                  {
                                      int i, j, k;
                                      int startRange = 1000;
                                      int endRange = 1100;
                                      int *solution = NULL;
                                      time_t start, end;
                                      unsigned int dim[] = {10, 10, 10};
                                      unsigned int prevDim[] = {0, 0, 0};  // extents of the tile held in solution

                                      start = time(NULL);

                                      // Walk the range in tiles of at most 8192 x 90 x 90 elements
                                      for (i = 0; i < (endRange - startRange);)
                                      {
                                          if ((endRange - startRange - i) < 8192)
                                              dim[0] = endRange - startRange - i;
                                          else
                                              dim[0] = 8192;
                                          for (j = 0; j < (endRange - startRange);)
                                          {
                                              if ((endRange - startRange - j) < 90)
                                                  dim[1] = endRange - startRange - j;
                                              else
                                                  dim[1] = 90;
                                              for (k = 0; k < (endRange - startRange);)
                                              {
                                                  if ((endRange - startRange - k) < 90)
                                                      dim[2] = endRange - startRange - k;
                                                  else
                                                      dim[2] = 90;
                                                  Stream<int> aStream(3, dim);
                                                  threadABC(startRange + i, startRange + j, startRange + k, aStream);

                                                  // Every pass writes the result of the previous pass.
                                                  // This check skips the very first pass, which has no previous result.
                                                  // I am not checking aStream.isSync() since in either case
                                                  // I want to do this step;
                                                  // by default it will be done in parallel
                                                  if (i != 0 || j != 0 || k != 0) {
                                                      // Use the previous tile's extents, not the current ones
                                                      writeResultsToFile(solution, prevDim);
                                                      free(solution);
                                                  }

                                                  solution = (int *)malloc(dim[0] * dim[1] * dim[2] * sizeof(int));

                                                  // streamWrite is blocking: will it wait for the kernel to finish?
                                                  streamWrite(aStream, solution);
                                                  prevDim[0] = dim[0];
                                                  prevDim[1] = dim[1];
                                                  prevDim[2] = dim[2];
                                                  k += 90;
                                              }
                                              j += 90;
                                          }
                                          i += 8192;
                                      }
                                      // Write the last tile, which no later pass will flush
                                      writeResultsToFile(solution, prevDim);
                                      free(solution);
                                      end = time(NULL);
                                      printf("according to difftime() %.2f secs\n", difftime(end, start));

                                      getch();
                                      return 0;
                                  }

                                  Is this correct? I mean, look at the code that is in bold.