Auto-suggest helps you quickly narrow down your search results by suggesting possible matches as you type.

Showing results for

- AMD Community
- Communities
- Developers
- Devgurus Archives
- Archives Discussions
- Re: Floating point operations difference between C...

- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Mute
- Printer Friendly Page

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

11-15-2012
06:37 AM

Floating point operations difference between CPU and GPU

Hallo,

I have an OpenCL kernel that implements a dot product between two float arrays. The first is an array of size*n elements and the second is an array of n elements.

This is a sample code

_kernel

void evaluate_product(__global const float *pFirstArray,

const int n,

__global const float *pSecondArray,

__global float *pOutput)

{

int gid = get_global_id(o);

int size = get_global_size(o);

if (gid>=0 && gid<size)

{

float output = 0;

for (int k=0; k<n; k++)

{

output += pFirstArray[gid + k*size]*pSecondArray

}

pOutput[gid] = output;

}

}

If I execute the same operations on CPU, I have different results, above all after 6 or 7 decimal digit. Why this strange behaviour? In kronos OpenCL specification (v 1.2) they say the x+y and x*y are correctly rounded as well as IEEE 754 compliant.

Any ideas?

7 Replies

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

11-15-2012
09:58 AM

Lately this is so common "problem" that I just repost my answer from other thread

First of all if we talk about basic operations ( +, -, /, * ) AMD GPUs give exactly the same results as CPU ( with exception of native double div ). For fused mad the accuracy is even higher than what CPUs can do.

Most people just simply forget that CPU/FPU uses 80 bit precision for internal registers and all operations. So only when you store float/double values in memory they are truncated to proper size/representation.

The difference is not because of GPU's inaccuracies but because you compare results from 80 bit math with results from 32 or 64 bit math.

There are 2 options to get the same results on CPU. You can make basic operations that store results in memory before they are reused ( overload operators in C++ ). Or you can switch to SSE because it doesn't use this archaic FPU 80 bit mode ( you can force gcc ( flag -fmfpmath=sse ) to use SSE instead of FPU ).

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

11-23-2012
09:24 AM

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

11-23-2012
08:25 PM

AMD GPU's don't have true/native double precision transcendental ( sin, cos, log, tan, ... ) functions. Native double functions just return single precision result converted to double precision ( with lowest bits 0 ). To improve accuracy GPU must use basic operations. Achieving full accuracy ( error <=0.5 ulp ) takes too much instructions and isn't practical. That's why OpenCL specifies quite huge error for transcendental functions ( >2-3 ulp ).

Also native single precision transcendental functions on AMD's GPU can return results with quite huge error ( but it depends on a function ).

Usually CPU gives much lower error for those functions ( Although if I remember correctly it isn't always fully accurate so in some cases error is >0.5 ulp ).

But like I said with exception of native double div ( which is single precision div converted to double precision by adding 0's ) all AMD's GPU basic operations ( add,sub,div,mul,... ) give error <=0.5 ulp.

This means they are exactly rounded and are the same as on CPU.

Of course we should remember that any change in order of floating point operations can give different result ( and it can be quite drastic ). So sometimes enabling code optimization in compiler can be an issue.

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

11-25-2012
03:54 AM

That is good to know. Could you please give me the sources for this information?

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

11-25-2012
05:16 AM

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

11-26-2012
11:27 AM

Also native single precision transcendental functions on AMD's GPU can return results with quite huge error ( but it depends on a function ).

Usually CPU gives much lower error for those functions ( Although if I remember correctly it isn't always fully accurate so in some cases error is >0.5 ulp ).

This is true of older VLIW5 and VLIW4 devices but the GCN architecture seems to have full 32 bit floating point accuracy pretty much equal to anything on a PC. The GCN native sin function, v_sin_f32, compared to gcc's sinf() library function over a full range of values gives:

native v_sin_f32 average error: 2.78e-8

gcc float sinf() average error: 2.62e-8

The other 32 bit hardware transcendentals exp, log, sqrt, rsqrt, pow, and div (reciprocal) have roughly the same accuracy, at least for the values I've tested. With a little effort, the GPU's can do a wide range of floating point calculations almost completely in hardware while the PC/CPU must use long library functions for most transcendental calculations. Although CPU's do have hardware acceleration for sqrt and sin/cos, it's very slow compared to the GPU's single cycle (4 clock) transcendentals. The problem with GPU's is integer divide!

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

08-08-2013
10:51 AM

Seems like you made a mistake in the option name to not use FPU but SSE :

it's not -fmfpmath=sse but **-mfpmath=sse.**