zhuzxy
Journeyman III

Why is writing GPU results to global memory and transferring them back to the CPU so expensive?

Hello,

   I ran into a problem. My final kernel takes about 2 ms to execute if I do not write the result back to global memory. If I do write it (which of course I need to, because I need the calculation result), the kernel time is about 7.5 ms. The result is about 40 bytes per work item, with 640 work items in total. Why is it so expensive to write the result back to global memory? Is there a better way to get the result back to the CPU?

Thanks

7 Replies
genaganna
Journeyman III

From the information given, it is difficult to say anything. Please go through chapter 4 of the programming guide, where optimizations are explained, and ask here if you have any questions about that chapter.


If you never write the result out, the compiler may optimize away the whole calculation, since nothing uses it.

zhuzxy
Journeyman III

Thanks, nou, for the reminder. I modified my measurement.

My code looks like the following:

normFact   : private float variable
descriptor : private array of size 36
desc       : global variable
sum_val    : global variable
squLen     : private variable

{
    // do a lot of computation before

    for (int i = 0; i < 36; i++)
    {
        int val = (int)(normFact * (float)descriptor[i]);
        desc[desc_offset + i] = (signed char)val; // commented out: 5.1 ms; otherwise: 7.6 ms
        squLen += val * val;
    }

    sum_val[pos] = squLen;

    // kernel finishes and exits
}

If I comment out the line 'desc[desc_offset + i] = (signed char)val;', the kernel takes 5.1 ms; if I leave it in, the kernel takes 7.6 ms. There are 512 work items in total, 64 work items per work group. So the bandwidth is 512 * 36 bytes / 2.5 ms, which is about 7 MB/s. I am using a zero-copy buffer for the global memory. The platform is an A8-3850. Can you give me some advice on the copy performance? Thanks a lot.
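
As a check on the bandwidth figure quoted in this post, the arithmetic can be written out as a small C helper. This is only a sketch: the function name `effective_write_rate` is made up for illustration, and it assumes the whole 2.5 ms difference (7.6 ms minus 5.1 ms) is caused by the `desc` stores. The `sum_val` store is present in both measurements, so it is left out.

```c
/* Effective write rate in bytes per second:
   bytes written by all work items divided by the extra kernel time. */
double effective_write_rate(int work_items, int bytes_per_item, double extra_ms)
{
    double bytes = (double)work_items * (double)bytes_per_item;
    return bytes / (extra_ms / 1000.0);
}
```

Calling `effective_write_rate(512, 36, 7.6 - 5.1)` gives roughly 7.4 million bytes per second, which matches the "about 7 MB/s" estimate in the post.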

Zhuzxy,

512 work items is a very small number.

Which flags did you use to create the following global buffers?

      1. desc

      2. sum_val


The global memory buffer was created using 'CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR'.

I also just tried treating the desc buffer as a char4 vector, and the performance improved a lot.
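
The char4 change is not shown in the thread, so the following is only a guess at its shape. It assumes the 36-element descriptor is written as 9 four-byte stores and that `desc + desc_offset` is 4-byte aligned, so that it can be viewed as a `char4` pointer. The helper name `store_desc_char4` and the `GLOBAL` macro are hypothetical. The `#else` branch is a host-side shim so the same function also builds as plain C for checking the logic.

```c
#ifdef __OPENCL_VERSION__
#define GLOBAL __global
#else
/* Host-side shim (assumption, for testing only): mimic char4 as a struct. */
#define GLOBAL
typedef struct { signed char x, y, z, w; } char4;
#endif

#define DESC_LEN 36 /* descriptor length from the post; 36 = 9 * 4 */

/* Quantize the private descriptor, write it as 9 four-byte stores,
   and return the sum of squares of the quantized values. */
int store_desc_char4(GLOBAL char4 *desc4, const float *descriptor, float normFact)
{
    int squLen = 0;
    for (int i = 0; i < DESC_LEN / 4; i++) {
        int v0 = (int)(normFact * descriptor[4 * i + 0]);
        int v1 = (int)(normFact * descriptor[4 * i + 1]);
        int v2 = (int)(normFact * descriptor[4 * i + 2]);
        int v3 = (int)(normFact * descriptor[4 * i + 3]);

        char4 packed;
        packed.x = (signed char)v0;
        packed.y = (signed char)v1;
        packed.z = (signed char)v2;
        packed.w = (signed char)v3;

        desc4[i] = packed; /* one 4-byte store replaces four 1-byte stores */
        squLen += v0 * v0 + v1 * v1 + v2 * v2 + v3 * v3;
    }
    return squLen;
}
```

Inside the kernel, this would be called with something like `(__global char4 *)(desc + desc_offset)`, and the return value would be stored to `sum_val[pos]`.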

Try int4; you should see more improvement.

Please also try with just the CL_MEM_READ_WRITE flag.


I am trying to benefit from the A8-3850's zero-copy buffers. If I use only the CL_MEM_READ_WRITE flag, I will still pay for copying the data from the GPU to the CPU at the end.

As for int4: since my data type is char, I am not sure about the cost of converting four chars into an int and copying it into an int4. I may try it later.

Thanks for your advice.
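
On the question of what it costs to convert four chars into an int, one possible packing is sketched below in plain C. The names `pack4` and `unpack4` are hypothetical, and this is not measured on the A8-3850. It only shows that the conversion is a few masks, shifts, and ORs per 32-bit word. Note that 36 bytes is 9 such words, which is not a whole number of int4 values (16 bytes each), so an int4 version would need padding or a separate store for the last word.

```c
#include <stdint.h>

/* Pack four signed 8-bit values into one 32-bit word,
   with the first value in the lowest byte. */
uint32_t pack4(signed char a, signed char b, signed char c, signed char d)
{
    return  (uint32_t)(uint8_t)a
         | ((uint32_t)(uint8_t)b << 8)
         | ((uint32_t)(uint8_t)c << 16)
         | ((uint32_t)(uint8_t)d << 24);
}

/* Recover the k-th (0..3) signed value from a packed word. */
int unpack4(uint32_t word, int k)
{
    int byte = (int)((word >> (8 * k)) & 0xFFu);
    return (byte ^ 0x80) - 0x80; /* sign-extend 8 bits to int */
}
```

On a little-endian host, a word packed this way has the same byte order in memory as the original char array, so the CPU side could keep reading the buffer as signed chars.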
