i have a problem with kernels of mine. They are three lowpass filtering kernels for each direction in a voxel field(z,x,y).
The CPU implementation, which I want to parallize is not an ordinary lowpass filter. So using just a standard lowpass filter is out of question.
First my hardware:
AMD Phenom II X4 945
AMD APP SDK
The CPU implementation uses regular X87 code. X87 precision is set to 32bit from the 80bit extended mode. This lowpass filter algorithm uses many iterations and the weight of each sample is 1,2,1(1 for previous voxel, 2 for own voxel, 1 for next voxel). First a there is z-Filtering, which is done according to voxel distances. With my dataset there are 44 iterations in z-distance. Then there is xy-Filtering, which are combined into one loop. First there is x-Filtering, then y-Filtering is executed. This is done 100 times.
This is just for your information. With more iterations, more rounding errors will occure.
Now to my question:
Can it be, that the GPU uses different rounding mechanism than the CPU? I get different results, which are, of course, comparable, but still the difference is significant. Racing conditions shouldn't be a problem, I've already fixed them.^^ For each instance of the kernel, there are the same results, so racing conditions should be out of the way.
Another theory of mine is, that the OpenCL Compiler or the Drivers optimize my code in a certain way, that the execution order is not comparable with CPU implementation. Because of so many iterations, errors will be accumulated. I really don't know the reason, so help me there.
Some infos to the kernels:
For each each kernel the workgroup sizes are set accordingly:
Each workgroup executes one row of values. First the values are stored in local memory, which are as big as the total workgroup size (256!). Workgroup size is set via a #define, which is added before the kernels are compiled.
Then the execution is performed. Each work item checks his own entry according to another voxel field(MRI) of the same size as the input voxel field. If the value of the MRI voxel field is zero, there is nothing to be done, and the value is saved back into the local memory, without any processing. Each workitem as well checks its next and previous entry, according to the value of its entry in the MRI voxel field. If the previous or the next entry are zero, that entry will not be processed as well. For example, if a workitem has no previous value(according to its place in MRI voxel field), but has a next value, the previous value is not processed for the lowpass filter. The same goes vice versa, if the next value is zero for its place in the MRI voxel field.
I hope you can help me. Thx, in advance!
Mhh, I cannot run that code on CPU, because OpenCl is encapsulated in a other Toolkit(ITK; itk.org). This Toolkit prohibits any OpenCL Code on the CPU.
Where are those rounding errors documented? I didn't find them in the specification. I thought that OpenCL is IEEE 754 compliant?
I would slightly argue with nou's comment that there can be rounding errors on the GPU. A rounding error is a consequence of the application. It is true that the GPU will round *differently* from the CPU under some circumstances, but rounding is an inaccuracy by definition.
So it could be that you're seeing slight differences in rounding and particularly if you're using any trig functions or division then you will have very different code on each device which would make huge differences. Beyond that even from one CPU compiler to another you can get variation in floating point rounding because of instruction ordering.
So the question really is are these rounding issues problematic? You should always expect different answers in floating point from one device to another, because floating point isn't accurate. You just need to judge whether the difference is a problem for your algorithm: usually it isn't unless you have a chaotic algorithm.
So my question is: how bad are these "errors"?
Thank you for your answer. Of course, the different CPU compilers generate different code. As well code optimization can be a problem.
"So the question really is are these rounding issues problematic? You should always expect different answers in floating point from one device to another, because floating point isn't accurate. You just need to judge whether the difference is a problem for your algorithm: usually it isn't unless you have a chaotic algorithm."
This is what I want to clarify. The algorithm is pretty chaotic for the parallel execution on the GPU. This is what, I believe, because I've wrote a test software, which compares the software execution and the GPU execution of the same voxel field. Voxels, which are to be processed, are defined by a other voxel field(i call it MRI). Each voxel in my field corresponds to a voxel in the MRI field. If that voxel in the MRI field is not zero, it is processed, but only, if one of its neighbours(according to Z/X/Y) is not zero, too(again according to that MRI). So for each each direction not every voxel is processed the same way. And again it can happen that neighboring voxels are processed differently. This is a pretty beefy task for a GPU, i believe.
And yeah, the rounding errors are significant. For my Test Software, I generate that certain MRI image, which verifies a voxel, and process the lowpass filter for GPU and Software. After that I compare the results. That MRI image is generated via "rand()%(intvalue)" for each entry in that MRI Image(If i set intvalue to 4, the errors are highest). For the voxel field, each values changes from 0 to (1/squrt(2*pi))*10)(3.98942 ), which looks like a checked pattern. The maximum error then lies around 0.017(for rand()%4), which is very problematic(results are processed again). So the maximum error corresponds to the entropy of my MRI image.
It is pretty much just an adaption of FSLs FAST algorithm, with some modifications for OpenCL. I recommend to take the original source code, which works under Cygwin, GCC and MinGW.
btw.. How did you force the x87 to use 32-bit only. I thought the only way to circumvent is to use scalar-SSE
Also, note that order of floating point operations dictate the result of a floating point arithmetic.
Since parallel break down, changes the order in which floating point operations are performed, signficant changes can be expected.
I fear there is no way out. Thats how the parallel world is and computer memories are too finite to represent the infinite shades of a real number, What computer stores is a faked real number. Sigh!
It is possible to set certain flags in the X87 FPU control register. There you can change the precision of the mantissa.
If I change from 80bits to 32bits there are slightly different results, which don't explain the roundig problematique of the GPU execution.
Interesting is the fact, that divisions have errors of <=2,5 ULPs(specified by OpenCL 1.2), where all other operations(Mul, Add, Sub) are correctly rounded. Maybe I will replace the division through a multiplication.