From the access pattern "dataInput[nx + Nx * ny + Nx * Ny * nz]" alone, it is hard to say anything.
The memory bandwidth usage depends on what successive work-items in a workgroup are doing, and on what each workgroup is doing.
Are you spawning a 3D workgroup and partitioning the 3D volume among multiple workgroups?
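To show why the mapping matters, here is a rough host-side sketch (plain Python, not kernel code) that lists the flat offsets successive work-items would touch under two hypothetical mappings. The sizes Nx = Ny = 64 and both mappings are my assumptions for illustration, not something taken from your post.

```python
# Host-side model only. Nx, Ny and the two mappings below are
# illustrative assumptions, not the original poster's actual kernel.

def linear_index(nx, ny, nz, Nx, Ny):
    """Flat offset for the pattern dataInput[nx + Nx * ny + Nx * Ny * nz]."""
    return nx + Nx * ny + Nx * Ny * nz

def offsets_x_fastest(num_items, Nx, Ny):
    """Successive work-items take successive nx (ny = nz = 0)."""
    return [linear_index(i, 0, 0, Nx, Ny) for i in range(num_items)]

def offsets_z_fastest(num_items, Nx, Ny):
    """Successive work-items take successive nz (nx = ny = 0)."""
    return [linear_index(0, 0, i, Nx, Ny) for i in range(num_items)]

if __name__ == "__main__":
    Nx, Ny = 64, 64
    print(offsets_x_fastest(4, Nx, Ny))  # [0, 1, 2, 3]
    print(offsets_z_fastest(4, Nx, Ny))  # [0, 4096, 8192, 12288]
```

In the first case neighbouring work-items read neighbouring elements, which GPUs can generally combine into few memory transactions. In the second case each work-item is Nx * Ny elements away from its neighbour, which typically wastes bandwidth. Which case applies to you depends on how the work-item IDs map to nx, ny and nz.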
Also, what does "X_000 Y_000 Z_000 X_001 Y_001 Z_001 ... X_010 Y_010 Z_010 X_011 Y_011 Z_011 ..." mean?
Are you storing the data for (0,0,0), followed by (1,1,1), followed by (2,2,2), and so on? It looks like you are storing the diagonal...
Please clarify.