Archives Discussions

rexiaoyu · ‎11-03-2009

Moving data from private memory to local memory is a very time-consuming job, isn't it? When using the local memory in the kernel, my program runs much slower than before.

code:

__private float4 block[4];
__local float4 local_block[16];

// very slow here. Why?
local_block[local_id] = block[0];
local_block[local_id + 1] = block[1];
local_block[local_id + 2] = block[2];
local_block[local_id + 3] = block[3];

barrier(CLK_LOCAL_MEM_FENCE);

n0thing · ‎11-03-2009

The Local Data Share (LDS) supports only owner writes on R7xx-series GPUs. OpenCL local memory is therefore emulated in global memory internally, so you will not get the performance you would expect from on-chip memory.

See this slide (note the asterisk on LDS): http://img17.imageshack.us/img17/1153/openclarchitecture.jpg
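
To illustrate what "owner writes" means, here is a minimal sketch in which each work-item stores only to the four slots it owns and reads a neighbour's slot only after the barrier. The kernel name `owner_writes`, the buffers `in` and `out`, and the work-group size of 4 are assumptions made for the illustration; they are not from the original post. Note that with more than one work-item per group, the indices local_id + 1 .. local_id + 3 in the posted code overlap the slots written by neighbouring work-items, so a stride of 4 is used here. This shows the access pattern only and is not a claim that it makes emulated local memory fast on R7xx hardware.

```c
// Sketch only: names and the work-group size of 4 are assumptions.
__kernel __attribute__((reqd_work_group_size(4, 1, 1)))
void owner_writes(__global const float4 *in, __global float4 *out)
{
    __private float4 block[4];
    __local float4 local_block[16];   // 4 work-items * 4 slots each

    size_t lid = get_local_id(0);
    size_t gid = get_global_id(0);

    // Fill the private block from global memory.
    for (int i = 0; i < 4; ++i)
        block[i] = in[gid * 4 + i];

    // Each work-item writes only its own, non-overlapping slots.
    for (int i = 0; i < 4; ++i)
        local_block[lid * 4 + i] = block[i];

    // Make the writes visible to the rest of the work-group.
    barrier(CLK_LOCAL_MEM_FENCE);

    // Reading another work-item's slot is safe after the barrier.
    size_t next = (lid + 1) % get_local_size(0);
    out[gid] = local_block[next * 4];
}
```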

jcpalmer · ‎11-03-2009

Please forgive my temporary inability to check for myself, but these older cards do report CL_GLOBAL as the local memory type, right?

MicahVillmow · ‎11-03-2009

rexiaoyu,
One thing you can try that might help with performance is to use async_copy instead of copying manually. This performs the copy using the whole work-group in parallel.
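
A minimal sketch of that suggestion, assuming "async_copy" refers to the OpenCL built-in async_work_group_copy. That built-in copies between global and local memory, not from private memory, so the sketch assumes the data starts in a global buffer. The kernel name `stage_block`, the buffers `src` and `dst`, and the tile size of 16 are hypothetical. Whether this is actually faster on a device that emulates local memory would need to be measured.

```c
// Sketch only: assumes the data originates in a __global buffer, because
// async_work_group_copy does not take a __private source.
__kernel void stage_block(__global const float4 *src, __global float4 *dst)
{
    __local float4 local_block[16];

    // Each work-group stages its own 16-element tile.
    size_t tile = get_group_id(0) * 16;

    // Every work-item in the group must reach this call with the same
    // arguments; the copy is carried out by the group as a whole.
    event_t ev = async_work_group_copy(local_block, src + tile, 16, 0);

    // Wait for the copy to complete before touching local_block.
    wait_group_events(1, &ev);

    // Use the staged data; the strided loop works for any group size.
    for (size_t i = get_local_id(0); i < 16; i += get_local_size(0))
        dst[tile + i] = local_block[i] * 2.0f;
}
```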

Bad performance on moving data between private memory and local memory