A few of my kernels show a quite high write unit stalled percentage (~80%).
What could be done to rearrange the writes to decrease this counter (percentage measured in the CodeXL profiler)?
Currently the kernel writes 4 floats at once per work-item, from adjacent threads.
The kernel's ISA looks like:
143 MEM_RAT_CACHELESS_STORE_RAW: RAT(11)[R1], R0, ARRAY_SIZE(4) MARK VPM
144 ALU: ADDR(1728) CNT(2)
    466 x: MULADD_e R2.x, R5.x, R3.w, -R1.y
        y: MULADD_e R2.y, R5.y, R3.w, R1.w
145 MEM_RAT_CACHELESS_STORE_RAW: RAT(11)[R3], R2, ARRAY_SIZE(4) MARK VPM
And the CL code looks like:
__global float4* gpu_dechirped
...
uint tid = get_global_id(0);
uint dchunk = get_global_id(1);
....
gpu_dechirped[32*(FFT_SIZE/2)*dchunk + (2*i+1)*(FFT_SIZE/2) + tid] = cur_dechirp;
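
To make the write pattern easier to reason about, here is a small host-side C sketch of the index arithmetic from the store above. It is only an illustration: the FFT_SIZE value is an assumption (the real value is not given here), the helper names dechirp_index and dechirp_byte_offset are made up for this example, and i is taken to be the kernel's loop counter.

```c
#include <stddef.h>

/* Assumed value for illustration only; the real FFT_SIZE is not given above. */
#define FFT_SIZE 1024

/* Element index (in float4 units) written by work-item `tid` of chunk
   `dchunk` at loop step `i`, copied from the kernel's store expression.
   The function name is hypothetical. */
static size_t dechirp_index(size_t tid, size_t dchunk, size_t i)
{
    return 32 * (FFT_SIZE / 2) * dchunk + (2 * i + 1) * (FFT_SIZE / 2) + tid;
}

/* Byte offset of that store, assuming a float4 occupies 16 bytes. */
static size_t dechirp_byte_offset(size_t tid, size_t dchunk, size_t i)
{
    return dechirp_index(tid, dchunk, i) * 16;
}
```

With this expression, work-items with consecutive tid store to consecutive 16-byte slots, each step of i moves the store forward by two rows of FFT_SIZE/2 float4 elements (only odd rows are written by this statement), and each dchunk owns a block of 32 such rows.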