With ratGPU this is the render I get using the AMD CPU implementation:
and this is the correct one, which I get from the GPU and also from the Intel CPU implementation.
So it seems your compiler is messing up some ray signs, probably due to some kind of optimization applied in the drivers?
Apparently, what messes up the result is the -cl-fast-relaxed-math flag.
If I use -cl-opt-disable or an empty options string, then it works OK.
I think it's a bug in your OpenCL CPU implementation because:
1. It works ok with the Radeon GPU implementation.
2. It works ok with other implementations like Intel or NVIDIA.
Btw, I'm using Windows 7 x64 and Catalyst 13.9/13.11 beta, a Radeon 7790 and ratGPU 0.6.0.
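The A/B comparison described above (fast relaxed math vs. optimizations disabled vs. an empty options string) can be sketched as a small host-side helper. This is only an illustration: the `build_mode` enum and the `build_options` function are hypothetical names, not part of ratGPU or the OpenCL API. Only the two option strings themselves come from the OpenCL specification.

```c
#include <stddef.h>

/* Hypothetical build modes for comparing compiler behaviour. */
typedef enum {
    BUILD_DEFAULT,      /* empty options string */
    BUILD_FAST_RELAXED, /* -cl-fast-relaxed-math */
    BUILD_NO_OPT        /* -cl-opt-disable */
} build_mode;

/* Returns the options string intended for the 4th argument of clBuildProgram. */
const char *build_options(build_mode mode)
{
    switch (mode) {
    case BUILD_FAST_RELAXED: return "-cl-fast-relaxed-math";
    case BUILD_NO_OPT:       return "-cl-opt-disable";
    default:                 return "";
    }
}
```

The returned string would then be passed along the lines of `clBuildProgram(program, 1, &device, build_options(mode), NULL, NULL)`, building the same kernel once per mode and comparing the outputs.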
What a runtime does internally for "-cl-fast-relaxed-math" can be vendor dependent.
Anyway, I will go check...
Thanks for testing and posting the result here,
I hear that "cl-fast-relaxed-math" must be more accurate on CPU than GPU.
Can you confirm that what you are seeing is just the opposite?
Any repro-case would be helpful.
I visited the ratGPU page.
My proxy blocked the download, but I guess you are supplying only binaries.
Any chance you can give us a small repro case?
What is the speed gain you achieve using the -cl-fast-relaxed-math option on all platforms?
That's probably an indication of the degree of loss of accuracy.
Debugging it, I can clearly see that the sign of some operations is inverted. Values that should be X come out as -X or 0.0f, causing the bug.
The fast relaxed math improves performance by around 10%, because I'm doing lots of MAD operations as well as 1/sqrt and 1/x operations.
Repro case: download ratGPU 0.6.0 for Windows and test it yourself with Catalyst 13.9/13.11 beta with only the AMD CPU OpenCL device enabled.
ratGPU is available as deb package or EXE installer.
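The symptom described above (a value that should be X coming out as -X or 0.0f) could be checked on the host by comparing a known-good output buffer against the suspect one. The following is a minimal sketch, assuming 32-bit IEEE-754 floats. The function names are hypothetical and the buffers would come from reading back the kernel results of a working and a failing device.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Returns 1 if the IEEE-754 sign bit of x is set (also true for -0.0f). */
static int sign_bit(float x)
{
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits); /* well-defined way to inspect the bits */
    return (int)(bits >> 31);
}

/* Counts elements whose sign differs from the reference, or where a
 * nonzero reference value turned into zero in the tested buffer. */
size_t count_sign_errors(const float *ref, const float *test, size_t n)
{
    size_t errors = 0;
    for (size_t i = 0; i < n; ++i) {
        int flipped = sign_bit(ref[i]) != sign_bit(test[i]);
        int zeroed  = ref[i] != 0.0f && test[i] == 0.0f;
        if (flipped || zeroed)
            ++errors;
    }
    return errors;
}
```

A nonzero count for the AMD CPU device, with zero for the GPU or another vendor's CPU device on the same input, would narrow the problem down to specific elements that can then be traced back to individual operations.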
Does it install the sources as well?
I need a compact test case. If you think the signs are inverted, can you create a small test case that shows the problem?
I am planning to develop a repro-case template with the necessary host-code to launch a kernel and get output.
Maybe you can then just plug in your kernel fragment, replace some host stubs, and provide us with a repro case.
Do you think that would help?
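As a rough idea of what the pluggable part of such a template might look like, here is a sketch that pairs a tiny kernel source string with a host-side reference for the same arithmetic. Everything in it is an assumption made for illustration: the kernel name `repro`, the two-outputs-per-input layout, and the `repro_reference` helper are hypothetical, and the OpenCL host boilerplate (context, queue, buffers, launch) is deliberately left out. The kernel exercises a MAD and a 1/x, two of the operation types mentioned in this thread.

```c
#include <stddef.h>

/* Hypothetical kernel fragment: for each input x, writes mad(x, x, -1)
 * and 1/x into two consecutive output slots. */
const char *const repro_kernel_src =
    "__kernel void repro(__global const float *in, __global float *out)\n"
    "{\n"
    "    size_t i = get_global_id(0);\n"
    "    float x = in[i];\n"
    "    out[2 * i + 0] = mad(x, x, -1.0f);\n"
    "    out[2 * i + 1] = 1.0f / x;\n"
    "}\n";

/* Host-side reference for one input value, same layout as the kernel. */
void repro_reference(float x, float out[2])
{
    out[0] = x * x + -1.0f; /* multiply-add, computed on the host */
    out[1] = 1.0f / x;      /* reciprocal */
}
```

Running the kernel once with and once without -cl-fast-relaxed-math and comparing each result against `repro_reference` would show whether a single operation is enough to trigger the sign problem, or whether it only appears in a larger expression.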
I have just attached a repro test case that I wrote while tracking down a problem in another thread.
Maybe this can give you a head start in writing your own simple repro case.