Hello. I've noticed strange behaviour in our scientific application when several instances are launched on the same GPU: sometimes it gives incorrect results.
I've reproduced this behaviour with a small program (see attachment). The program does the following:
1. It creates two slightly different matrix multiplication kernels.
2. It creates three buffers a, b, c.
3. The first kernel multiplies a * b and writes the result to c. The second kernel multiplies b * a and writes the result to the same c.
4. It launches kernel 1 and kernel 2 in a loop several times (the last launch is always kernel 1), then reads the result back and compares it with the gold a * b.
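To make the expected outcome explicit, here is a minimal CPU-only sketch of the sequence above in plain Python. It is not the attached program: the names matmul and run_sequence, the 2x2 matrices and the iteration count are made up for illustration, and I'm assuming the final kernel 1 launch comes after the alternating launches. Since the last write to c is always a * b, the comparison with gold should never fail, no matter how many times kernel 2 overwrote c before that.

```python
def matmul(x, y):
    """Plain-Python product of two square matrices given as lists of rows."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def run_sequence(a, b, iterations):
    """Model of the launch order: kernel 1, kernel 2, ..., ending with kernel 1."""
    c = None
    for _ in range(iterations):
        c = matmul(a, b)  # kernel 1: c = a * b
        c = matmul(b, a)  # kernel 2: c = b * a, overwrites the same buffer
    c = matmul(a, b)      # assumed: the last launch is always kernel 1
    return c


# Illustrative inputs chosen so that a * b differs from b * a.
a = [[1, 2], [3, 4]]
b = [[0, 1], [1, 0]]

gold = matmul(a, b)
result = run_sequence(a, b, iterations=5)
print("match" if result == gold else "MISMATCH")
```

If the launches execute in order, the only value c can hold at the end is a * b, so a mismatch means either a stale b * a result or some other corruption of c.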
When executing single.sh (see attachment): the result is correct on all platforms (AMD, Intel, NVIDIA).
When executing multi.sh (which launches 9 instances of the program in parallel): some of the instances gave wrong results on the AMD platform (tested on a FirePro W9100 with fglrx-13.352.1014 and fglrx-14.20); on Intel and NVIDIA the results were always correct.
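For anyone who wants to reproduce the parallel launch without the attached script, a rough Python equivalent might look like the sketch below. It is not the contents of multi.sh: run_parallel is a hypothetical helper, and the command in the demo is a stand-in that just exits successfully. It should be replaced with the real reproducer binary, and I'm assuming that binary signals a mismatch through a non-zero exit code.

```python
import subprocess
import sys


def run_parallel(cmd, instances=9):
    """Start `instances` copies of cmd at once and return their exit codes."""
    procs = [subprocess.Popen(cmd) for _ in range(instances)]
    return [p.wait() for p in procs]


if __name__ == "__main__":
    # Placeholder command (assumption): substitute the actual test program here.
    placeholder_cmd = [sys.executable, "-c", "pass"]
    codes = run_parallel(placeholder_cmd)
    failed = [i for i, code in enumerate(codes) if code != 0]
    print("failed instances:", failed)
```

All processes are started before any of them is waited on, so they compete for the same GPU in the same way as the 9 parallel instances described above.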