rexiaoyu
Journeyman III

Bad performance on moving data between private memory and local memory

Is moving data from private memory to local memory inherently time-consuming? As soon as I use local memory in the kernel, my program runs much slower than before.

code:

__private float4 block[4];
__local float4 local_block[16];

//very slow here. Why?
local_block[local_id] = block[0];
local_block[local_id + 1] = block[1];
local_block[local_id + 2] = block[2];
local_block[local_id + 3] = block[3];

barrier(CLK_LOCAL_MEM_FENCE);



3 Replies
n0thing
Journeyman III

The Local Data Share (LDS) supports only owner writes on R7xx-series GPUs. Local memory is emulated in global memory internally, and hence you will not get the performance you expect.

See this slide (note the asterisk on LDS): http://img17.imageshack.us/img17/1153/openclarchitecture.jpg

 


Please forgive my temporary inability to check for myself, but these older cards do report CL_GLOBAL for the local memory type (CL_DEVICE_LOCAL_MEM_TYPE), right?
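
For anyone who wants to check this on their own card, here is a minimal host-side sketch of the query. It assumes an OpenCL SDK is installed and simply takes the first platform and the first GPU device it finds; that selection is an assumption for illustration, not something from this thread. The call itself is the standard clGetDeviceInfo with CL_DEVICE_LOCAL_MEM_TYPE.

```c
#include <stdio.h>
#ifdef __APPLE__
#include <OpenCL/opencl.h>
#else
#include <CL/cl.h>
#endif

int main(void)
{
    cl_platform_id platform;
    cl_device_id device;
    cl_device_local_mem_type mem_type;

    /* Assumption: the first platform and first GPU device are the ones of interest. */
    if (clGetPlatformIDs(1, &platform, NULL) != CL_SUCCESS) {
        fprintf(stderr, "no OpenCL platform found\n");
        return 1;
    }
    if (clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL) != CL_SUCCESS) {
        fprintf(stderr, "no GPU device found\n");
        return 1;
    }

    /* Ask the device how it implements __local memory. */
    if (clGetDeviceInfo(device, CL_DEVICE_LOCAL_MEM_TYPE,
                        sizeof(mem_type), &mem_type, NULL) != CL_SUCCESS) {
        fprintf(stderr, "clGetDeviceInfo failed\n");
        return 1;
    }

    if (mem_type == CL_LOCAL)
        printf("local memory type: CL_LOCAL (dedicated local storage)\n");
    else if (mem_type == CL_GLOBAL)
        printf("local memory type: CL_GLOBAL (backed by global memory)\n");
    else
        printf("local memory type: other (%u)\n", (unsigned)mem_type);

    return 0;
}
```

If this reports CL_GLOBAL, staging data through __local memory is unlikely to be faster than reading global memory directly on that device.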


rexiaoyu,
One thing you can try that might help with performance is to use async_work_group_copy instead of copying manually. It performs the copy using the whole work-group in parallel.
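
A rough sketch of what that could look like in a kernel is below. Note that async_work_group_copy copies between __global and __local memory, so it would not move data out of a __private array such as block; it only applies if the data can be staged straight from global memory. The kernel name stage_tile, the src and dst buffers, and the TILE size are hypothetical, and the sketch assumes both buffers hold a whole number of tiles.

```c
/* OpenCL C kernel sketch. stage_tile, src, dst and TILE are hypothetical names. */
#define TILE 16

__kernel void stage_tile(__global const float4 *src, __global float4 *dst)
{
    __local float4 local_block[TILE];

    /* Every work-item in the group must reach this call with the same arguments. */
    const size_t group_offset = get_group_id(0) * TILE;
    event_t ev = async_work_group_copy(local_block, src + group_offset, TILE, 0);

    /* Wait for the copy to finish before any work-item reads local_block. */
    wait_group_events(1, &ev);

    /* Example use: each work-item reads back one staged element. */
    const size_t lid = get_local_id(0);
    if (lid < TILE)
        dst[group_offset + lid] = local_block[lid];
}
```

Whether this is any faster on a device that backs local memory with global memory is something you would have to measure.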