
    IEEE 754 Floating point division discrepancy

    liwoog
      Different results on x86, NVidia and ATI

      1.0f / 96.0f (expression evaluated at runtime as 1.0f / x, with x = 96.0)

      Gives on x86 (westmere) and NVidia:

      1.041666697711e-02 (abs error compared to 1.04166..e-02 is ~3.1e-10)

      On ATI 6970 HD, OpenCL 1.1, SDK 2.3

      1.041666604578e-02 (abs error compared to 1.04166..e-02 is ~6.2e-10)

      So the x86 and NVIDIA hardware give the correctly rounded answer (the one with the lowest error).

      What can I do to get full accuracy on the 6970 and OpenCL?
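
      A minimal kernel along the lines of the sketch below (kernel and argument names are illustrative, not taken from the actual test) is enough to reproduce the comparison; run it on each device and print the result with %.12e:

      // Each work-item divides 1.0f by the value it reads, so the device's
      // single-precision division path is what gets measured.
      __kernel void recip_test(__global const float *x, __global float *out)
      {
          size_t i = get_global_id(0);
          out[i] = 1.0f / x[i];   // x[i] = 96.0f in the case reported above
      }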

        • IEEE 754 Floating point division discrepancy
          MicahVillmow
          The error bounds for division are specified in the OpenCL spec.
          • IEEE 754 Floating point division discrepancy
            moozoo

            If you compare the ULP information in the CUDA programming guide with that in the OpenCL 1.1 spec, you will see that CUDA specifies a much lower error.

            I'm guessing that CUDA is targeted more at engineering and scientific usage than OpenCL is.
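
            To put rough numbers on that: a quick host-side check like the sketch below (the literals are the values from the first post; this is assumed C, linked with -lm) puts the 6970 result at about 0.7 ulp from the exact quotient and the x86/NVIDIA result at about 0.3 ulp, i.e. correctly rounded. Both sit comfortably inside the 2.5 ulp that the OpenCL 1.1 spec allows for single-precision division, so the 6970 result is within spec even though the x86/NVIDIA result happens to be the correctly rounded one.

            #include <math.h>
            #include <stdio.h>

            /* Measure the error of the reported results in units of last place (ulp). */
            int main(void)
            {
                double exact = 1.0 / 96.0;                        /* reference, in double */
                float  ulp   = nextafterf((float)exact, INFINITY) - (float)exact;
                float  x86nv = 1.041666697711e-02f;               /* x86 / NVIDIA result  */
                float  ati   = 1.041666604578e-02f;               /* Radeon 6970 result   */

                printf("x86/NV error: %.2f ulp\n", fabs((double)x86nv - exact) / ulp);
                printf("6970 error:   %.2f ulp\n", fabs((double)ati   - exact) / ulp);
                return 0;
            }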


              • IEEE 754 Floating point division discrepancy
                nou

                No. OpenCL targets a much wider variety of devices, so the specification must accommodate those devices' capabilities.

                  • IEEE 754 Floating point division discrepancy
                    Meteorhead

                    OpenCL aims to target CPUs, GPUs, (APUs,) mobile phones, calculators and heaps of similar low-power devices. The standard specifies the absolute minimum precision a device has to achieve.

                    We all know that OpenCL is an API built on top of CAL on AMD, just as it is built on top of CUDA on NVIDIA. If NV cards are capable of reaching a certain precision under CUDA but do not deliver the same under OpenCL, that is almost an own goal. If AMD cards have lower precision in division, then AMD has to work on that a little more. But it is not a matter of the API.

                      • IEEE 754 Floating point division discrepancy
                        moozoo


                        Originally posted by Meteorhead: We all know that OpenCL is an API built on top of CAL on AMD, just as it is built on top of CUDA on NVIDIA. If NV cards are capable of reaching a certain precision under CUDA but do not deliver the same under OpenCL, that is almost an own goal. If AMD cards have lower precision in division, then AMD has to work on that a little more. But it is not a matter of the API.


                        No, AMD has met the requirements of the API; they don't have to do any more work on the lower precision. That is what Micah was saying in his short, to-the-point answer.

                        If you program against CUDA then you are guaranteed a certain precision, and that precision is higher than what you are guaranteed with OpenCL 1.1.

                        NVIDIA or AMD might actually deliver higher precision than the OpenCL minimum, as both did prior to the 6xxx series, but you cannot count on it.

                        The correct way to handle this is exactly what liwoog did: add one step of the Newton-Raphson method (a sketch of one such step follows below).

                        It's just something you have to be aware of. Don't automatically assume a higher level of precision than the OpenCL 1.1 spec guarantees based on the hardware you just happen to have.
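
                        For reference, a single refinement step in kernel code might look roughly like the sketch below (the helper name is illustrative; the correction relies on fma(), so it pays off most on hardware with a real fused multiply-add):

                        // One Newton-Raphson style correction of a / b in single precision.
                        float div_refined(float a, float b)
                        {
                            float y = 1.0f / b;        // device reciprocal, within the spec's error bound
                            float q = a * y;           // first quotient estimate
                            float r = fma(-b, q, a);   // residual a - b*q, kept accurate by the fused multiply-add
                            return fma(r, y, q);       // corrected quotient: q + r*(1/b)
                        }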


                          • IEEE 754 Floating point division discrepancy
                            golgo_13

                            Moozoo, I think you mean accuracy, not precision.

                            I'm wondering why liwoog is using single precision at all if accuracy is a concern?  Isn't the flushing of subnormal values to zero also a problem?

                            I'd also like to mention that single precision fma only has hardware support on double-capable GPUs.  It takes a lot of work to get fma right in software (try it!).
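
                            (If it matters for a given device, both of those properties can be queried at run time. A minimal host-side sketch, assuming a cl_device_id obtained earlier from clGetDeviceIDs():)

                            #include <stdio.h>
                            #include <CL/cl.h>

                            /* Report whether single-precision denormals are preserved and whether
                             * an IEEE-style fused multiply-add is supported on this device. */
                            void print_fp_caps(cl_device_id device)
                            {
                                cl_device_fp_config cfg;
                                clGetDeviceInfo(device, CL_DEVICE_SINGLE_FP_CONFIG, sizeof cfg, &cfg, NULL);

                                printf("denormals: %s\n", (cfg & CL_FP_DENORM) ? "preserved" : "flushed to zero");
                                printf("fma:       %s\n", (cfg & CL_FP_FMA)    ? "supported" : "not supported");
                            }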

                              • IEEE 754 Floating point division discrepancy
                                liwoog

                                golgo_13:

                                 single precision = 4x the speed and half the memory

                                • IEEE 754 Floating point division discrepancy
                                  moozoo


                                   Originally posted by golgo_13: Moozoo, I think you mean accuracy, not precision.


                                  Yep, sorry I do mean accuracy.

                                  Comparing the double-precision ULP information (for all functions) between CUDA and OpenCL shows the same holds for double precision, though of course double precision is much more accurate than single precision.


                                    • IEEE 754 Floating point division discrepancy
                                      Alexium

                                       I also have a story to tell. I created a very simple, basic N-body simulation program with an absolute minimum of FP instructions. I tested it on the CPU (without OpenCL!) and it was all fine. I then ported it to OpenCL in order to test it on my AMD GPU; the kernel code was exactly the same as the corresponding function in the initial CPU code. But when I tested it, FP errors blew the simulation to hell. And though my GPU (RV770) supports double precision, it's not available via OpenCL. The interesting thing is that the same single-precision kernel written and compiled with Brook+ worked fine. That's OpenCL for you...

                                        • IEEE 754 Floating point division discrepancy
                                          genaganna


                                          Originally posted by Alexium: I also have a story to tell. I created a very simple, basic N-body simulation program with an absolute minimum of FP instructions. I tested it on the CPU (without OpenCL!) and it was all fine. I then ported it to OpenCL in order to test it on my AMD GPU; the kernel code was exactly the same as the corresponding function in the initial CPU code. But when I tested it, FP errors blew the simulation to hell. And though my GPU (RV770) supports double precision, it's not available via OpenCL. The interesting thing is that the same single-precision kernel written and compiled with Brook+ worked fine. That's OpenCL for you...


                                          Double precision is supported on RV770. Only a few math functions won't work on RV770.

                                          Could you please post your OpenCL kernel and Brook+ kernel?

                                          What happens when you run on CPU through OpenCL?
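
                                          (Side note on the RV770 doubles mentioned above: using them in a kernel also requires enabling the FP64 extension, which on AMD hardware of that generation was typically the vendor extension rather than cl_khr_fp64. A minimal sketch, assuming the device reports cl_amd_fp64 in CL_DEVICE_EXTENSIONS:)

                                          #pragma OPENCL EXTENSION cl_amd_fp64 : enable

                                          __kernel void scale(__global double *y, double a)
                                          {
                                              size_t i = get_global_id(0);
                                              y[i] *= a;   // simple double-precision work per item
                                          }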

                                            • IEEE 754 Floating point division discrepancy
                                              Alexium

                                              1) Sorry, my bad. I had a problem with double and forgot that I had solved it. But using double instead of float didn't help (which is strange).
                                              2) I couldn't get it working on the CPU. It was quite some time ago, but AFAIR I was getting a runtime error somewhere inside the OpenCL function calls. That probably (obviously?) indicates problems with my code, but still, it works OK on the GPU, and I couldn't find any mistakes, so I was lost there and stopped trying to test on the CPU.
                                              Like I said, my code is strange in that it shouldn't fail on the CPU. But I don't think it's the same error that's causing the precision problems.
                                              The Brook+ code was not written by me but by my friend; it's a bit simpler (I'm doing additional computations). When I encountered problems, I made my kernel just like his, but that didn't help much. I think I'll try to play with the code to find out what's going on, but not today, unfortunately.
                                              I realize I didn’t provide strict evidence, but I’m telling you something’s not right.
                                              Below are links to the code:
                                              http://codepaste.ru/5403/

                                              http://codepaste.ru/5402/