I'm having issues using __local variables in the attached kernel. When the kernel is run in a loop with constant input data, the output differs between iterations at semi-random locations. On a Tahiti GPU, the data first differs in a random iteration at a random location. On a Tonga GPU, it differs in the second iteration at a fixed location. In both cases, the inconsistency starts at memory addresses written by work items with get_local_id(1) >= 64. For my use case, I'd expect the contents of 'sums' and 'textures' to be the same in each iteration.
Here is the relevant input data:
IMAGE_WIDTH is defined to be 680
IMAGE_HEIGHT is defined to be 512
NUM_DISP is defined to be 112
WINDOW_SIZE is defined to be 5
The global work size is (IMAGE_HEIGHT, NUM_DISP); the work-group size is (1, NUM_DISP).
left, right, sums, textures, and prefilterCap are identical for each kernel run.
Both GPUs use the latest non-beta Catalyst drivers.
The inconsistency disappears for NUM_DISP <= 64, and also when the kernel runs on a CPU device. Did I miss a barrier call somewhere? As far as I can see, all work items hit every barrier and only read the local variables' contents after the barrier.
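To make the launch geometry concrete, here is a small host-side sketch in plain C (no OpenCL calls) that works out the numbers from the sizes above. The helper names and ASSUMED_WAVEFRONT_SIZE are made up for illustration and are not part of the kernel. The value 64 is my understanding of the wavefront width on GCN GPUs such as Tahiti and Tonga, and the linear mapping of local ids to wavefronts is likewise an assumption, not something I have verified.

```c
#include <assert.h>
#include <stddef.h>

/* Values taken from the post. */
#define IMAGE_HEIGHT 512
#define NUM_DISP     112

/* ASSUMPTION: 64 work items per wavefront (GCN hardware); not stated in the post. */
#define ASSUMED_WAVEFRONT_SIZE 64

/* Number of work-groups along one dimension of the NDRange. */
size_t num_groups(size_t global_size, size_t local_size)
{
    assert(local_size > 0 && global_size % local_size == 0);
    return global_size / local_size;
}

/* Number of wavefronts needed to hold one work-group (rounded up). */
size_t wavefronts_per_group(size_t items_per_group, size_t wavefront_size)
{
    assert(wavefront_size > 0);
    return (items_per_group + wavefront_size - 1) / wavefront_size;
}

/* Wavefront index of a work item, given its flattened local id.
 * ASSUMPTION: work items are packed into wavefronts in linear id order.
 * With a work-group size of (1, NUM_DISP), the flattened id equals
 * get_local_id(1). */
size_t wavefront_of(size_t flat_local_id, size_t wavefront_size)
{
    assert(wavefront_size > 0);
    return flat_local_id / wavefront_size;
}
```

With these sizes, num_groups(IMAGE_HEIGHT, 1) * num_groups(NUM_DISP, NUM_DISP) gives 512 work-groups of 112 work items each. Under the 64-wide assumption, wavefronts_per_group(NUM_DISP, ASSUMED_WAVEFRONT_SIZE) is 2, and local ids 64 to 111 fall into the second wavefront, whereas a work-group with NUM_DISP <= 64 fits into a single one. That lines up with where the differences start, but I have not confirmed that it is the cause.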