A few of my kernels show a quite high write-unit-stalled percentage (~80%).
How could the writes be rearranged to decrease this counter (the percentage is measured in the CodeXL profiler)?
Currently each work-item writes 4 floats at once, and adjacent threads write to adjacent locations.
The kernel's ISA looks like:
143 MEM_RAT_CACHELESS_STORE_RAW: RAT(11)[R1], R0, ARRAY_SIZE(4) MARK VPM
144 ALU: ADDR(1728) CNT(2)
    466 x: MULADD_e R2.x, R5.x, R3.w, -R1.y
        y: MULADD_e R2.y, R5.y, R3.w, R1.w
145 MEM_RAT_CACHELESS_STORE_RAW: RAT(11)[R3], R2, ARRAY_SIZE(4) MARK VPM
And the OpenCL code looks like:
__global float4* gpu_dechirped
...
uint tid = get_global_id(0);
uint dchunk = get_global_id(1); ....
gpu_dechirped[32*(FFT_SIZE/2)*dchunk + (2*i+1)*(FFT_SIZE/2) + tid] = cur_dechirp;
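As a quick sanity check of that store pattern, here is a minimal host-side C sketch of the index arithmetic. It only reproduces the index expression from the line above; the value of FFT_SIZE and the helper names dechirp_index and stores_are_contiguous are assumptions made for illustration, and `i` stands for the loop counter that is elided in the snippet.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical value: the post does not state FFT_SIZE. */
#define FFT_SIZE 1024u

/* float4 element index written by one work-item, copied from the kernel's
   store expression. `i` is the loop counter elided in the post. */
static size_t dechirp_index(unsigned tid, unsigned dchunk, unsigned i)
{
    return (size_t)32 * (FFT_SIZE / 2) * dchunk
         + (size_t)(2 * i + 1) * (FFT_SIZE / 2)
         + tid;
}

/* Returns 1 if every pair of neighbouring work-items along dimension 0
   writes neighbouring float4 elements (16 bytes apart), 0 otherwise. */
static int stores_are_contiguous(unsigned n_items, unsigned dchunk, unsigned i)
{
    assert(n_items > 0);
    for (unsigned tid = 0; tid + 1 < n_items; ++tid) {
        size_t step = dechirp_index(tid + 1, dchunk, i)
                    - dechirp_index(tid, dchunk, i);
        if (step != 1)
            return 0;
    }
    return 1;
}
```

Calling stores_are_contiguous(64, 0, 0) for one 64-wide wavefront should return 1: for fixed dchunk and i the index grows by exactly one float4 per work-item, so the stores are already contiguous along dimension 0. That suggests the stall is probably not caused by the access pattern alone.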
Are you using a discrete GPU (dGPU) or an integrated GPU? How did you create the global buffers?
In the case of a dGPU, using CL_MEM_USE_HOST_PTR for the output buffer will give you worse performance. I would recommend creating a device output buffer and doing a single clEnqueueReadBuffer call after the kernel execution.