I'm having issues with __local variables in the attached kernel. When the kernel is run in a loop with constant input data, the output differs at semi-random locations. On a Tahiti GPU, the data starts to differ at a random iteration and a random location. On a Tonga GPU, the data differs from the second iteration onward at a fixed location. In both cases, the inconsistency starts at memory addresses written by work items with get_local_id(1) >= 64. For my use case, I'd expect the contents of 'sums' and 'textures' to be the same in each iteration.
Here is the relevant input data:
IMAGE_WIDTH is defined to be 680
IMAGE_HEIGHT is defined to be 512
NUM_DISP is defined to be 112
WINDOW_SIZE is defined to be 5
The global work size is (IMAGE_HEIGHT, NUM_DISP); the work group size is (1, NUM_DISP).
left, right, sums, textures, and prefilterCap are identical for each kernel run.
Both GPUs use the latest non-beta Catalyst drivers.
The inconsistency disappears for NUM_DISP <= 64, and also when the kernel runs on a CPU device. Did I miss a barrier call somewhere? As far as I can see, all work items hit every barrier and only read the local variables' contents after the barrier.
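One observation that may be relevant: the threshold of 64 matches the wavefront size of GCN GPUs such as Tahiti and Tonga. The rough Python sketch below is not part of the kernel; it only illustrates how a (1, NUM_DISP) work group would split into wavefronts. It assumes that work items are packed into wavefronts in linear local-ID order, which I have not verified, and the helper names are made up for illustration.

```python
# Assumption: wavefront size is 64 and work items are packed into
# wavefronts in linear local-ID order (dimension 0 varies fastest).
WAVEFRONT_SIZE = 64
NUM_DISP = 112
LOCAL_SIZE = (1, NUM_DISP)  # work group size from the post


def wavefront_of(local_id0, local_id1, local_size=LOCAL_SIZE):
    """Return the index of the wavefront that would hold this work item."""
    linear_id = local_id1 * local_size[0] + local_id0
    return linear_id // WAVEFRONT_SIZE


def wavefronts_per_group(local_size=LOCAL_SIZE):
    """Return how many wavefronts are needed for one work group."""
    items = local_size[0] * local_size[1]
    return (items + WAVEFRONT_SIZE - 1) // WAVEFRONT_SIZE


# First value of get_local_id(1) that lands in the second wavefront.
first_in_second = next(i for i in range(NUM_DISP) if wavefront_of(0, i) == 1)

if __name__ == "__main__":
    print("wavefronts per work group:", wavefronts_per_group())
    print("first local_id(1) in second wavefront:", first_in_second)
```

Under these assumptions, a work group with NUM_DISP <= 64 fits in a single wavefront, while NUM_DISP = 112 needs two, and the second one begins exactly at get_local_id(1) == 64. That would be consistent with a synchronization problem between wavefronts, but it does not prove one.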
My apologies for this late reply.
Has your problem been resolved? If not, please provide the complete project (including the host-side code) so that we can run it on our end. Please also include setup details such as the OS, driver, SDK, and GPU.