
Archives Discussions

foobar2342
Journeyman III

can you execute a kernel and do a dma memcopy in parallel?

So, can you DMA a memory block to or from the GPU while a kernel is executing?

0 Likes
7 Replies
godsic
Journeyman III

Yes you can.

However, I'm not sure (see this post http://forums.amd.com/forum/messageview.cfm?catid=328&threadid=133598) how calMemCopy is implemented in the CAL runtime.

So far, using the benchmark provided in that post, one can see that calMemCopy results are sometimes close to those of a simple data-copy kernel. Based on my measurements on an HD5850, HD2600Pro, HD4890 and HD3470, I suggest that calMemCopy uses the DMA engine for Remote->Remote and Remote<->Local transfers, while for Local->Local transfers it uses another mechanism: most likely special data-copy kernels.

As for DMA, it looks like it is extremely slow. Therefore I suggest that calMemCopy performance depends strongly on the DMA implementation. Moreover, based on the measurements one can conclude that the DMA engine is not optimized for large transfers. Additionally, I noticed that results on an AMD northbridge are much better than on an Intel northbridge.

From that point one can conclude that for efficient GPGPU usage, vendors should integrate and OPTIMIZE the whole hardware and software stack, rather than just increasing the SIMD count of the GPU. Moreover, AMD is a full-platform vendor, so I cannot understand why they neglect all this optimization.

0 Likes

And one more thing: any transfer (DMA or non-DMA) during kernel execution results in some performance degradation (the kernel runs slower). It's a small penalty on the 5xxx series, but quite a large one on the 4xxx series.

If the data transfer takes less than 5-10% of the execution time, then it's best to simply use a send-compute-receive approach.

0 Likes

If a DMA transfer causes any performance reduction, then it is probably not pure DMA.

I'm also not sure how the GPU memory controller deals with GPU cache coherency. AMD definitely did not make any steps towards specific hardware optimizations, since there are no new PCIe or IOMMU features in the latest RD890 chipset or Phenom II processors.


0 Likes

How do you do this in Brook+?
Is it even possible to do it in Brook+?

Well, what I want to do is: I'm running my algorithm in tiles, so once the first tile has completed, I want to copy its results while running the second tile in parallel.

This would be very helpful for me.

0 Likes

Originally posted by: niravshah00 How do you do this in Brook+? Is it even possible to do it in Brook+? Well, what I want to do is: I'm running my algorithm in tiles, so once the first tile has completed, I want to copy its results while running the second tile in parallel. This would be very helpful for me.

Yes, you can do this in Brook+ as well. Please take a look at the Asynchronous sample available in samples\CPP\tutorials and read Section 2.13 of Stream_Computing_User_Guide.pdf, which comes with the Brook+ SDK.

0 Likes

Thanks genaganna, I got what I wanted.

Thanks a ton.

Also, could you help me with my post "How to return values from the kernel?"

I remember you started helping with the same issue on another of my posts, where I also gave you C reference code.

I am only stuck on that part; if I figure it out, I am done with the project.

 

0 Likes

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <conio.h>

/* Stream<int>, threadABC() and writeResultsToFile() come from the
 * Brook+-generated code and helpers elsewhere in the project. */

int main(int argc, char **argv)
{
    int i, j, k;
    int startRange = 1000;
    int endRange = 1100;
    int *solution = NULL;
    time_t start, end;
    unsigned int dim[] = {10, 10, 10};

    start = time(NULL);

    for (i = 0; i < (endRange - startRange); i += 8192)
    {
        if ((endRange - startRange - i) < 8192)
            dim[0] = endRange - startRange - i;
        else
            dim[0] = 8192;
        for (j = 0; j < (endRange - startRange); j += 90)
        {
            if ((endRange - startRange - j) < 90)
                dim[1] = endRange - startRange - j;
            else
                dim[1] = 90;
            for (k = 0; k < (endRange - startRange); k += 90)
            {
                if ((endRange - startRange - k) < 90)
                    dim[2] = endRange - startRange - k;
                else
                    dim[2] = 90;
                Stream<int> aStream(3, dim);
                threadABC(startRange + i, startRange + j, startRange + k, aStream);

                // Every pass writes out the result of the previous pass;
                // this check skips only the very first pass, before any
                // solution buffer exists. I am not checking aStream.isSync()
                // since I want to do this step in either case; by default
                // it will run in parallel with the kernel.
                if (i != 0 || j != 0 || k != 0) {
                    writeResultsToFile(solution, dim);
                    free(solution);
                }

                solution = (int *)malloc(dim[0] * dim[1] * dim[2] * sizeof(int));

                // streamWrite copies the stream back to host memory;
                // it is blocking, so will it wait for the kernel to finish?
                streamWrite(aStream, solution);
            }
        }
    }
    writeResultsToFile(solution, dim);
    end = time(NULL);
    printf("according to difftime() %.2f sec's\n", difftime(end, start));

    getch();
    return 0;
}

Is this correct? I mean, look at the code that is in bold.

0 Likes