i have a problem with kernels of mine. They are three lowpass filtering kernels for each direction in a voxel field(z,x,y).
The CPU implementation, which I want to parallize is not an ordinary lowpass filter. So using just a standard lowpass filter is out of question.
First my hardware:
AMD Phenom II X4 945
AMD APP SDK
The CPU implementation uses regular X87 code. X87 precision is set to 32bit from the 80bit extended mode. This lowpass filter algorithm uses many iterations and the weight of each sample is 1,2,1(1 for previous voxel, 2 for own voxel, 1 for next voxel). First a there is z-Filtering, which is done according to voxel distances. With my dataset there are 44 iterations in z-distance. Then there is xy-Filtering, which are combined into one loop. First there is x-Filtering, then y-Filtering is executed. This is done 100 times.
This is just for your information. With more iterations, more rounding errors will occure.
Now to my question:
Can it be, that the GPU uses different rounding mechanism than the CPU? I get different results, which are, of course, comparable, but still the difference is significant. Racing conditions shouldn't be a problem, I've already fixed them.^^ For each instance of the kernel, there are the same results, so racing conditions should be out of the way.
Another theory of mine is, that the OpenCL Compiler or the Drivers optimize my code in a certain way, that the execution order is not comparable with CPU implementation. Because of so many iterations, errors will be accumulated. I really don't know the reason, so help me there.
Some infos to the kernels:
For each each kernel the workgroup sizes are set accordingly:
Each workgroup executes one row of values. First the values are stored in local memory, which are as big as the total workgroup size (256!). Workgroup size is set via a #define, which is added before the kernels are compiled.
Then the execution is performed. Each work item checks his own entry according to another voxel field(MRI) of the same size as the input voxel field. If the value of the MRI voxel field is zero, there is nothing to be done, and the value is saved back into the local memory, without any processing. Each workitem as well checks its next and previous entry, according to the value of its entry in the MRI voxel field. If the previous or the next entry are zero, that entry will not be processed as well. For example, if a workitem has no previous value(according to its place in MRI voxel field), but has a next value, the previous value is not processed for the lowpass filter. The same goes vice versa, if the next value is zero for its place in the MRI voxel field.
I hope you can help me. Thx, in advance!