__kernel void half_temp_range1_kernel(__global float4* gpu_dechirped,
                                      __global float4* gpu_power,
                                      const float ravg_pow)
{
    uint tid = get_global_id(0);
    uint dchunk = get_global_id(1);
    float4 temp;
    float4 power;
    //R: processing 4 elements per workitem
    for (uint i = 0; i < 2*16; i++) {
        temp = gpu_dechirped[32*(32768/2)*dchunk + i*(32768/2) + 2*tid];
        power.x = (temp.x*temp.x + temp.y*temp.y)*ravg_pow;
        power.y = (temp.z*temp.z + temp.w*temp.w)*ravg_pow;
        temp = gpu_dechirped[32*(32768/2)*dchunk + i*(32768/2) + 2*tid + 1];
        power.z = (temp.x*temp.x + temp.y*temp.y)*ravg_pow;
        power.w = (temp.z*temp.z + temp.w*temp.w)*ravg_pow;
        gpu_power[tid + i*(32768/4/2) + 32*dchunk*(32768/4/2)] = power;
    }
}
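For anyone following the addressing, here is a rough host-side C sketch of the index arithmetic the kernel above uses. It only mirrors the index expressions, so the read and write patterns can be checked on the CPU. The helper names (in_index, out_index) and the constant names are made up for illustration. Reading each float4 as two complex samples is an interpretation of the (x,y) and (z,w) pairing, and the range of 4096 work-items in dimension 0 is inferred from the output row length, not stated in the post.

```c
#include <stddef.h>

/* Constants copied from the kernel's index expressions; the names are
 * hypothetical and chosen only for this sketch. */
enum {
    ROWS       = 32,            /* loop count 2*16 in the kernel */
    IN_ROW_F4  = 32768 / 2,     /* float4 elements per input row */
    OUT_ROW_F4 = 32768 / 4 / 2  /* float4 elements per output row */
};

/* float4 index read by work-item (tid, dchunk) in loop iteration i.
 * k selects the first (0) or second (1) of the two reads. */
static size_t in_index(size_t dchunk, size_t i, size_t tid, size_t k)
{
    return (size_t)ROWS * IN_ROW_F4 * dchunk + i * IN_ROW_F4 + 2 * tid + k;
}

/* float4 index written by work-item (tid, dchunk) in loop iteration i. */
static size_t out_index(size_t dchunk, size_t i, size_t tid)
{
    return tid + i * OUT_ROW_F4 + (size_t)ROWS * dchunk * OUT_ROW_F4;
}
```

With these helpers, neighbouring work-items write neighbouring float4 elements (out_index grows by 1 per tid), while each work-item reads two adjacent float4 elements, so consecutive work-items start their reads two elements apart.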
Hi raistmer,
Are you saying that the kernel with the smaller buffer takes 0.97183 ms and the kernel with the larger buffer takes 30.45218 ms? Can you provide a simple test case? You can also send the test case to streamdeveloper@amd.com if your code is proprietary.
Hi raistmer,
I tried to reproduce your problem on my Juniper system but did not see the values you reported. I will try to reproduce it on 7xx cards.
It would be really kind if you could provide a test case.
As for your global write access pattern problem:
Global reads/writes can be either coalesced (all work-items write to consecutive memory elements) or issued one by one. If a coalesced write is not possible, try writing the maximum number of elements per work-item.
Also try to use the same memory channel for all work-items in a workgroup (one-by-one writes). This might seem to serialize the writes when coalescing is not used, but it is a good access pattern because many workgroups can write concurrently.
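To make the difference between the two write patterns concrete, here is a small host-side C sketch. It is an illustration only, not AMD's definition of coalescing: it assumes a simple model in which work-item tid writes one element at index tid * stride, and the helper names (write_index, is_coalesced, span_bytes) are hypothetical.

```c
#include <stddef.h>

/* Element index written by work-item `tid` under the assumed model:
 * one element per work-item, placed `stride` elements apart. */
static size_t write_index(size_t tid, size_t stride)
{
    return tid * stride;
}

/* Returns 1 if every work-item writes the element directly after the
 * one written by the previous work-item, 0 otherwise. */
static int is_coalesced(size_t n_items, size_t stride)
{
    for (size_t tid = 1; tid < n_items; tid++) {
        if (write_index(tid, stride) != write_index(tid - 1, stride) + 1)
            return 0;
    }
    return 1;
}

/* Number of bytes between the start of the first write and the end of
 * the last write for a group of n_items work-items (n_items >= 1). */
static size_t span_bytes(size_t n_items, size_t stride, size_t elem_size)
{
    return (write_index(n_items - 1, stride) + 1) * elem_size;
}
```

For example, 64 work-items writing 16-byte float4 values with stride 1 cover one contiguous 1024-byte block, while the same work-items with stride 2 are spread over 2032 bytes and is_coalesced reports 0.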
I hope it helps.
Raistmer,
The counters provided in the profiler are hardware dependent, so we cannot have all the counters that the 5xx series has.
Regarding the manual, I think all AMD GPUs have similar architectures, so the same principles apply. You can, however, expect 5xx GPUs to perform much better, as the implementation must have improved.
I recommend that you register for the webinar series AMD is organizing. I think most of the optimization techniques and hardware properties will be explained well by AMD professionals.
You can send the test case to streamdeveloper@amd.com if the code is proprietary.
raistmer,
Did you try to produce a simple test case? It would be really difficult to go through a large code base, but you can send the bigger version too if it can't be avoided.
You can try SDK 2.3, which is expected to be released in a few weeks. It might improve the performance of your code.
Originally posted by: MicahVillmow
Raistmer,
Are you changing the kernel at all between runs? If you are changing the kernel, most likely you are either spilling registers to scratch or causing really bad memory access patterns.
AFAIK, 4xxx cards support only one UAV buffer, while 5xxx cards support multiple UAV buffers accessible from one kernel.
Originally posted by: nou
AFAIK, 4xxx cards support only one UAV buffer, while 5xxx cards support multiple UAV buffers accessible from one kernel.