
# OpenCL

## Vectorization results in more precision?

I have been working with the binomial lattice problem lately and my original implementation assigned one value per thread. This method worked fine, but resulted in precision being lost as the number of timesteps increased past 32768.

Afterwards, in an attempt to increase performance, I tried to vectorize the kernel. I was successful in getting the vectorization to work (although it doesn't run faster because of the branching it requires), but oddly, I found it gave me greater precision, up to 114688 timesteps — which is the same point where the CPU single-precision implementation loses too much precision compared to a CPU-based double-precision implementation.

Does anyone have any idea why this is the case? Is it an effect of vectorization forcing the math not to be truncated at 24 bits (as MAD and MUL are, IIRC) and instead carried out at 32 bits? I took a brief look through both versions of the IL code in the kernel analyzer and didn't see anything out of place that would explain why the original was being truncated.

-Matt

13 Replies
Journeyman III

## Vectorization results in more precision?

That would seem quite improbable to me. Vectorization is essentially a hint to the compiler; it should have no effect on your results if your vectorized code is algorithmically equivalent to your original code. If you're seeing different results with and without vectorization, you're missing something in your own implementation.

Highlighted Journeyman III

## Vectorization results in more precision?

Could it be a floating-point issue? Floating-point numbers (float and double data types) are not perfect representations of numbers, and the order in which operations are executed can produce different results. For instance, consider an array A whose i-th element is A[i] = 1.0/2.0^i.

If you want to add the elements in A, a program will return different results depending on how you traverse the array, i.e. adding elements from i = 0 to N gives a different answer than adding from i = N to 0. The reason is that accuracy is lost when you add small and big numbers.

That is just an example of how order of operations can affect results in floating-point arithmetic, but there are a few other potential issues that could have been mitigated in the vectorized version of your program.

## Vectorization results in more precision?

 Originally posted by: eugenek That would seem quite improbable to me. Vectorization is essentially a hint to the compiler, it should have no effect on your results if your vectorized code is algorithmically equivalent to your original code. If you're seeing different results with and without vectorization, you're missing something in your own implementation.

That's what I thought. Just to give a bit of background: I received the developer's original CUDA code, which worked up to 32768 time steps, after which he also lost too much precision. I converted his code to OpenCL and found that my implementation likewise lost too much precision after 32768 time steps (so our solutions were equal).

Then I vectorized the code in an attempt to make it run faster, at which point I noticed that I was accurate to 5-6 decimal places instead of only 3.

 Originally posted by: N3KO Could it be a floating point number issue? Floating point numbers (float and double data types) are not perfect representations of numbers and the order in which operations are executed can produce different results. For instance, consider the case where A is an array and the 'i' element of the array is A = 1.0/2.0^i If you want to add the elements in A, a program would return different results depending of how you traverse the array, i.e. adding elements from i = 0 to N would give a different answer than if you do from i=N to 0. The reason is that accuracy is lost when you add small and big numbers. That is just an example of how order of operations can affects results in floating point arithmetic but there are a few other potential issues that could have been mitigated in your vectorized version of your program.

I hadn't thought of that, but that shouldn't be an issue if all threads are independent and working on independent data, right?

Attached are both the original and the vectorized versions, which do the same thing, but the vectorized version works for a larger number of time steps. One thing I also noticed while running some more tests is that on my 5870M I was able to get to 114688 time steps while keeping precision to 3 decimal places, but when I ran the same code on an A6-3400 (laptop), it only seemed to be precise to 3 decimal places for up to 65536 time steps. In this case I would assume it has become a compiler "issue", since the target devices are different.

EDIT: re-ran my algorithm and the precision was now the same for both the dedicated and integrated GPU (but results still differ between the vectorized and unvectorized code).

```c
// original unvectorized kernel
// NOTE: the forum formatting ate some [...] subscripts; the indices marked
// "reconstructed" are inferred from the else-branch and the vectorized version.
__kernel void LookbackOpenCL(__global float *optionValues,
                             __global float *tempOptionValues,
                             int N, float u, float d, float pu, float pd,
                             float disc, int i)
{
    int globalID = get_global_id(0);
    int localID  = get_local_id(0);
    float myYValue, tempOptionValue;

    myYValue = pow(u, (float)globalID);

    if (globalID < i) {
        if (globalID == 0)
            // [1] and [0] reconstructed
            tempOptionValue = fmax(myYValue - 1.0f,
                ((pu * optionValues[1] * d) + (pd * optionValues[0] * u)) * disc);
        else
            tempOptionValue = fmax(myYValue - 1.0f,
                ((pu * optionValues[localID + 1] * d) +
                 (pd * optionValues[globalID - 1] * u)) * disc);
        tempOptionValues[globalID] = tempOptionValue;
    }
}

// vectorized kernel
__kernel void LookbackOpenCL(__global float2 *optionValues,
                             __global float2 *tempOptionValues,
                             int N, float u, float d, float pu, float pd,
                             float disc, int i)
{
    int globalID = get_global_id(0);
    int localID  = get_local_id(0);
    float2 myYValue;
    float2 tempOptionValue;

    // the next two lines were lost in the forum formatting and are
    // reconstructed: each thread handles `size` consecutive scalar values
    const int size = 2;
    int lowIndex  = globalID * size;
    int highIndex = globalID * size + size;

    myYValue.s0 = pow(u, (float)lowIndex);
    myYValue.s1 = pow(u, (float)(lowIndex + 1));

    if (highIndex < i) {
        if (globalID == 0) {
            // [0]/[1] subscripts reconstructed
            tempOptionValue.s0 = fmax(myYValue.s0 - 1.0f,
                ((pu * optionValues[0].s1 * d) + (pd * optionValues[0].s0 * u)) * disc);
            tempOptionValue.s1 = fmax(myYValue.s1 - 1.0f,
                ((pu * optionValues[1].s0 * d) + (pd * optionValues[0].s0 * u)) * disc);
        } else {
            tempOptionValue.s0 = fmax(myYValue.s0 - 1.0f,
                ((pu * optionValues[localID].s1 * d) +
                 (pd * optionValues[globalID - 1].s1 * u)) * disc);
            tempOptionValue.s1 = fmax(myYValue.s1 - 1.0f,
                ((pu * optionValues[localID + 1].s0 * d) +
                 (pd * optionValues[globalID].s0 * u)) * disc);
        }
        tempOptionValues[globalID].s0 = tempOptionValue.s0;
        tempOptionValues[globalID].s1 = tempOptionValue.s1;
    }

    barrier(CLK_GLOBAL_MEM_FENCE);

    // boundary case: only the first component is still inside the lattice
    if (highIndex == i) {
        if (globalID == 0) {
            // [0] subscripts reconstructed
            tempOptionValue.s0 = fmax(myYValue.s0 - 1.0f,
                ((pu * optionValues[0].s1 * d) + (pd * optionValues[0].s0 * u)) * disc);
        } else {
            tempOptionValue.s0 = fmax(myYValue.s0 - 1.0f,
                ((pu * optionValues[localID].s1 * d) +
                 (pd * optionValues[globalID - 1].s1 * u)) * disc);
        }
        tempOptionValues[globalID].s0 = tempOptionValue.s0;
    }
}
```
Journeyman III

## Vectorization results in more precision?

The 5870M should have a fused multiply-add (FMA) instruction that computes a = a + (b*c) in a single step. The A6-3400 (CPU?) is most probably doing the multiplication and addition as two separate operations (first b*c, then the add).

If you consider that each floating-point operation rounds its result to a float (32-bit) or a double (64-bit), the A6-3400 performs two rounding operations while the 5870M's FMA performs a single one. FMA thus enhances the precision of the computation even with single-precision data types.

Moreover, floating-point arithmetic is neither associative nor distributive. To check this, you can force the order in which operations are performed by using parentheses.

## Vectorization results in more precision?

I was using the GPU of the A6-3400. Where did you get that the 5870M has FMA? If you have a source, I'd like to add it to my collection since information like that would be useful.

Staff

## Vectorization results in more precision?

The 5870M does not have FMA since it does not support double precision and that is a requirement of FMA.

A relatively complete list can be found here:
http://en.wikipedia.org/wiki/C...hics_processing_units

No mobile chip has had double precision since the R7XX series, and FMA was introduced with the EG (Evergreen) chips, so no mobile chip has FMA.
## Vectorization results in more precision?

 Originally posted by: MicahVillmow The 5870M does not have FMA since it does not support double precision and that is a requirement of FMA. A relatively complete list can be found here: http://en.wikipedia.org/wiki/C...hics_processing_units There is no mobile chip that has double precision since R7XX series, and FMA was introduced with EG chips, so no mobile with FMA.

EDIT: re-ran my algorithm and it looks like the dedicated vs. integrated GPU precision problem has been solved.

3 questions:

1. Is there any documentation available publicly as to which chips support which features?

2. Do you have any ideas as to why this is occurring? It shouldn't differ between the two versions, since they should compile to nearly the same base code.

3. I understand you can't comment on unreleased products, so this is more of a theoretical question. Have you considered doing a limited production run of APUs that support double precision? It would be useful for research where 64-bit precision is required but the algorithm shouldn't be run on a dedicated GPU because its memory access pattern is better suited to an APU.

Grandmaster

## Vectorization results in more precision?

1. You can refer to the ISA documents to see whether a particular instruction is supported. Analyzing the generated ISA would also be a good idea.

2. Not really sure about this. It should mostly be some algorithmic alteration due to which better precision is observed. Do you see the same precision difference on every other card, or does it happen on the 5870M only?

3. I guess that would surely be under consideration.
