Of course, each addressable element of the global buffer is 16 bytes long.
You can just copy the kernel listed above into ShaderAnalyzer and look at the resulting disassembly.
Let us change the l1 literal to
dcl_literal l1, 2047, 0, 0, 0
and we will see the following code:
1 MEM_EXPORT_WRITE: DWORD_PTR[8188], R0, ELEM_SIZE(3)
Now change the literal to this:
dcl_literal l1, 2048, 0, 0, 0
and the code becomes:
01 MEM_EXPORT_WRITE: DWORD_PTR[0], R0, ELEM_SIZE(3)
And so on,
dcl_literal l1, 2049, 0, 0, 0
01 MEM_EXPORT_WRITE: DWORD_PTR[4], R0, ELEM_SIZE(3)
dcl_literal l1, 4095, 0, 0, 0
01 MEM_EXPORT_WRITE: DWORD_PTR[8188], R0, ELEM_SIZE(3)
dcl_literal l1, 4096, 0, 0, 0
01 MEM_EXPORT_WRITE: DWORD_PTR[0], R0, ELEM_SIZE(3)
etc.
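The pattern in the listings above can be reproduced with a few lines of Python. This is only a sketch of what the disassembly shows, not of what the compiler actually does internally: it assumes the element index is converted to a DWORD offset (4 DWORDs per 16-byte element) and then wrapped at 8192 DWORDs, as if the offset were stored in a 13-bit field. The wrap point is inferred from the outputs above, and the names used here are hypothetical.

```python
# Sketch reproducing the DWORD_PTR values seen in the disassembly above.
# Assumption: the wrap at 8192 DWORDs is inferred from the listed outputs,
# it is not taken from any documentation.

DWORDS_PER_ELEMENT = 4   # one 16-byte element = 4 DWORDs
WRAP_DWORDS = 8192       # assumed wrap point (2**13)


def observed_dword_ptr(element_index):
    """Return the DWORD_PTR value observed for a constant element index."""
    return (element_index * DWORDS_PER_ELEMENT) % WRAP_DWORDS


if __name__ == "__main__":
    for index in (2047, 2048, 2049, 4095, 4096):
        print(index, observed_dword_ptr(index))
```

If this model is right, every constant index of 2048 or more ends up aliasing an element in the first 2048 elements of the global buffer.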
In the R600 ISA documentation, the microcode format of the MEM_EXPORT instruction contains an INDEX_GPR field:
"The address in the INDEX_GPR is a DWORD address, no matter how much data exported.
SP supplies a 32-bit integer address offset per pixel (assume zero if no EA export).
Per_pixel DWORD address=
{BASE reg, 6'h0} + clamp({ARRAY_SIZE,6'h0}, (BC increment counter *elemsize + INDEX_GPR + ARRAY_BASE))"
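To make the quoted formula easier to read, here is a rough Python transcription. It rests on several assumptions: {X, 6'h0} is read as Verilog-style concatenation, that is X shifted left by 6 bits; clamp(limit, value) is taken to mean "value limited to at most limit"; and elem_size is treated as a plain multiplier. The function and parameter names are hypothetical.

```python
# Rough transcription of the quoted per-pixel DWORD address formula.
# Assumptions: {X, 6'h0} means X << 6, and clamp(limit, value) means
# min(value, limit). Neither is confirmed by the quoted text alone.


def per_pixel_dword_address(base, array_size, array_base, index_gpr,
                            bc_counter=0, elem_size=1):
    """Compute the DWORD address following the quoted formula."""
    offset = bc_counter * elem_size + index_gpr + array_base
    limit = array_size << 6          # {ARRAY_SIZE, 6'h0}
    return (base << 6) + min(offset, limit)


if __name__ == "__main__":
    # An index of 8192 DWORDs passed through INDEX_GPR is not wrapped,
    # provided ARRAY_SIZE is large enough.
    print(per_pixel_dword_address(base=0, array_size=1024,
                                  array_base=0, index_gpr=8192))
```

Under this reading, an offset supplied through INDEX_GPR is only clamped by ARRAY_SIZE and does not wrap at 8192 DWORDs.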
So, the problem seems to be in the IL compiler?