Archives Discussions

zhuzxy · ‎08-15-2011

Hello,

I ran into a problem: my final kernel takes about 2 ms to execute if I do not write the result back to global memory. If I do write it back (which of course I need to, because I need the calculation result), the kernel time rises to about 7.5 ms. The result is about 40 bytes per work item, with 640 work items in total. Why is it so expensive to write the result back to global memory? Is there a better way to get the result back to the CPU?

Thanks

genaganna · ‎08-15-2011

From the given information, it is difficult to say anything. Please go through chapter 4 of the programming guide, where optimizations are explained. You can ask questions if you have any from that chapter.

nou · ‎08-15-2011

The compiler may optimize away the whole calculation, since you don't use its result.

zhuzxy · ‎08-16-2011

Thanks, nou, for the reminder. I modified my measurement.

My code looks like the following:

normFact   : private float variable
descriptor : private array of size 36
desc       : global variable
sum_val    : global variable
squLen     : private variable

{
    // do a lot of computation before

    for (int i = 0; i < 36; i++)
    {
        int val = (int)(normFact * (float)descriptor[i]);
        desc[desc_offset + i] = (signed char)val; // commented out: 5.1 ms; otherwise: 7.6 ms
        squLen += val * val;
    }

    sum_val[pos] = squLen;

    // kernel finishes and exits
}

If I comment out 'desc[desc_offset + i] = (signed char)val;', the kernel takes 5.1 ms; if I leave it in, the kernel takes 7.6 ms. There are 512 work items in total, 64 work items per work group. So the bandwidth is 512 * 36 bytes / 2.5 ms, which is about 7 MB/s. I am using a zero-copy buffer for the global memory. The platform is an A8-3850. Can you give me some advice on the copy performance? Thanks a lot.
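The bandwidth estimate in this post can be checked with a few lines of host-side C. This is only a sketch of the arithmetic: the function names are hypothetical, and it assumes 1 MB = 10^6 bytes and that the 2.5 ms difference between the two timings is spent entirely on the desc writes.

```c
#include <stddef.h>

/* Hypothetical helper: bytes written to desc by all work items together. */
size_t desc_bytes_written(size_t work_items, size_t bytes_per_item)
{
    return work_items * bytes_per_item;
}

/* Hypothetical helper: effective bandwidth in MB/s (1 MB = 10^6 bytes)
 * for `bytes` written in `ms` milliseconds. */
double effective_bandwidth_mb_per_s(size_t bytes, double ms)
{
    double seconds = ms / 1000.0;
    return ((double)bytes / seconds) / 1.0e6;
}
```

With 512 work items writing 36 bytes each, desc_bytes_written(512, 36) is 18432 bytes, and effective_bandwidth_mb_per_s(18432, 2.5) comes out at roughly 7.4 MB/s, in line with the "about 7 MB/s" figure above.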

genaganna · ‎08-16-2011

Zhuzxy,

512 work items is a very small number.

Which flags did you use to create the following global buffers?

1. desc

2. sum_val

zhuzxy · ‎08-16-2011

The global memory buffers were created using 'CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR'.

I also just tried treating the desc buffer as a char4 vector, and the performance improved a lot.
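The char4 change can be illustrated with a host-side C emulation of the restructured loop. This is a sketch, not the poster's actual kernel: char4_t is a stand-in struct for OpenCL C's built-in char4 type, and the function name and parameters are hypothetical. It assumes the 36-byte descriptor is written as 9 groups of 4 bytes.

```c
#include <stddef.h>

#define DESC_LEN 36

/* Stand-in for OpenCL C's built-in char4 (assumption: emulated on the host). */
typedef struct { signed char x, y, z, w; } char4_t;

/* Quantize the descriptor and write it out in 9 groups of 4 bytes.
 * Returns the sum of squares, like squLen in the original loop. */
int store_descriptor_char4(const float *descriptor, float norm_fact,
                           char4_t *desc_out)
{
    int squ_len = 0;
    for (size_t g = 0; g < DESC_LEN / 4; g++) {
        int v[4];
        for (size_t k = 0; k < 4; k++) {
            v[k] = (int)(norm_fact * descriptor[4 * g + k]);
            squ_len += v[k] * v[k];
        }
        char4_t packed = { (signed char)v[0], (signed char)v[1],
                           (signed char)v[2], (signed char)v[3] };
        /* One struct assignment here; in an OpenCL C kernel this would be
         * a single char4 store in place of four 1-byte stores. */
        desc_out[g] = packed;
    }
    return squ_len;
}
```

In an actual kernel the same grouping would use the built-in char4 type or vstore4 on the global pointer. Casting desc to a char4 pointer presumably requires desc_offset to be a multiple of 4 bytes, whereas vstore4 only needs element alignment.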

genaganna · ‎08-16-2011

Try int4; you should see more improvement.

Please also try with just the CL_MEM_READ_WRITE flag.

zhuzxy · ‎08-17-2011

I am trying to benefit from the A8-3850's zero-copy buffers. If I use only the CL_MEM_READ_WRITE flag, I will still suffer the cost of copying the data from the GPU to the CPU at the end.

As for int4, since my data type is char, I am not sure about the cost of converting 4 chars into an int and copying it into an int4. I may try it later.

And thanks for your advice.
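On the conversion cost raised above: packing four signed chars into one 32-bit word takes only a few shifts and ORs per word. The following host-side C sketch shows the idea; the function names are hypothetical, and it assumes a little-endian layout (byte 0 in the low bits) and two's complement wraparound when converting back to a signed byte, both of which hold on x86 hosts.

```c
#include <stdint.h>

/* Pack four signed bytes into one 32-bit word, byte 0 in the low 8 bits.
 * Going through uint8_t avoids sign extension smearing into upper bytes. */
uint32_t pack4(int8_t b0, int8_t b1, int8_t b2, int8_t b3)
{
    return  (uint32_t)(uint8_t)b0
         | ((uint32_t)(uint8_t)b1 << 8)
         | ((uint32_t)(uint8_t)b2 << 16)
         | ((uint32_t)(uint8_t)b3 << 24);
}

/* Recover signed byte number `index` (0..3) from a packed word.
 * Assumes two's complement conversion from uint8_t to int8_t. */
int8_t unpack_byte(uint32_t word, unsigned index)
{
    return (int8_t)(uint8_t)(word >> (8u * index));
}
```

With this packing, the 36-byte descriptor becomes 9 words, which would map to two int4 stores plus one int store. Whether that is faster than the char4 version on this hardware would still need to be measured.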

Why is it so expensive to write the GPU result into global memory and transfer it back to the CPU?