I am not sure if this is the right place for OpenCL bug reports, so please forgive me if I am wrong. Here is the link to the simple program that should add two vectors multiple times: https://gist.github.com/ddemidov/5398174. The source is also attached here for convenience.
This simple program, when compiled with
g++ -std=c++0x -o vector_sum vector_sum.cpp -lOpenCL
outputs 4096 == 4096 on NVIDIA and Intel OpenCL implementations. When, however, it is executed on AMD GPUs (the ones I tested are HD 7970 'Tahiti' and HD 7770 'Capeverde'), it may output 4096 == 4081, 4096 == 4082, or something else.
Adding a call to cl::CommandQueue::finish() after each kernel launch (but not after the complete loop) solves the issue, but should be unnecessary according to the standard.
Replacing the definition of global_size at line 99 with
size_t global_size = alignup(N, workgroup_size);
also helps, but is equally unnecessary.
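For readers without the gist at hand, here is a minimal sketch of what a helper like alignup might look like. It assumes alignup(n, m) simply rounds n up to the nearest multiple of m; the actual definition is in the linked source and may differ.

```cpp
#include <cassert>
#include <cstddef>

// Assumed behavior (not copied from the gist): round n up to the
// nearest multiple of m, so that the result is divisible by m.
inline std::size_t alignup(std::size_t n, std::size_t m) {
    assert(m > 0 && "alignment must be positive");
    // Integer division truncates, so adding m - 1 first rounds up.
    return (n + m - 1) / m * m;
}
```

With this sketch, alignup(1000, 256) gives 1024, while a value that is already a multiple of m, such as alignup(4096, 256), is returned unchanged.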
The current operating system is Gentoo Linux, kernel version 3.7.1, and the ati-drivers package is version 13.1. However, I have observed this behavior on several machines, across several consecutive versions of ati-drivers and several Linux kernels.
Is this a bug in AMD OpenCL, or am I doing something wrong?