
Bdot
Adept III

random calculation error

Hi,

this part of my code

Code:

    if (get_global_id(0)==TRACE_TID) printf((__constant char *)"div2.6: q.d4=%x, carry=%x, nn.d3=%x\n", q.d4, carry, nn.d3);
    q.d4 = q.d4 - nn.d3 + carry;
    if (get_global_id(0)==TRACE_TID) printf((__constant char *)"div2.7: q.d4=%x, carry=%x, nn.d3=%x\n", q.d4, carry, nn.d3);

prints the following:

Code:

    div2.6: q.d4=51ed, carry=0, nn.d3=51ed
    div2.7: q.d4=394d8646, carry=0, nn.d3=51ed

So, according to Catalyst 13.9 on HD5770,

0x51ed - 0x51ed + 0 = 0x394d8646.

In fact, the result is different each time I run it. All variables and struct members are of type uint. The 13.10 beta driver does not fix the problem. When I change the variables' type to uint2 or uint4, the code works correctly. It also works correctly when run on the CPU device.
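For reference, here is a minimal host-side sketch in plain C of what that one step should produce. The helper name `sub_limb` is made up for illustration and is not part of mfakto; the only assumption is that all three operands are 32-bit unsigned, as stated above, so the result is taken modulo 2^32. How `carry` is encoded elsewhere in the kernel is not shown here, so the sketch just adds it as given.

```c
#include <stdint.h>

/* Hypothetical helper: one limb of a multiword subtraction, mirroring
   q.d4 = q.d4 - nn.d3 + carry from the kernel snippet above.
   All operands are 32-bit unsigned, so the arithmetic wraps modulo 2^32. */
static uint32_t sub_limb(uint32_t q_limb, uint32_t nn_limb, uint32_t carry)
{
    return q_limb - nn_limb + carry;
}
```

With the traced inputs, `sub_limb(0x51ed, 0x51ed, 0)` is 0, so no wraparound can explain a stored value of 0x394d8646.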

Of course, when I put just that into a small test program, it always works. I can provide all source code and/or binaries if needed. But first I wanted to check whether this is a known problem.

5 Replies
himanshu_gautam
Grandmaster

printf() has been quite quirky. Just yesterday an issue was reported with printing 64-bit numbers using %lu.

I have asked for source code to reproduce that one.

Please provide a repro case and I will file a bug tomorrow.

-

Bruhaspati


Oh, maybe I was not really clear in my description: printf is perfectly OK and prints the correct values. It is the actual calculation that is wrong.

Where can I send my program, sources and explanation? (As mentioned, it is not reproducible in a simple test case.)


Okay. It could actually be either way.

Use the advanced editor (see the top right of the text box; it looks disabled and faded, but it is not).

It allows you to add attachments.

Please give me a small repro case so that I can take up the case with the engineering team.

+

Please tell me which platform, driver, platform bitness, APP SDK version, CPU and GPU you are using.

Thank you,


Every attempt to simplify the OpenCL code made the issue disappear, so I hope the full program is still useful. On my HD5770 / Cat13.12 / Win7/64 / APPSDK 2.8.1214.3 it shows this:

To run the selftest call "mfakto -st". It will start like this:

mfakto 0.14pre3-Win (64bit build)

Runtime options

...

######### testcase 1/1559 (M50804297[67-68]) #########

Starting trial factoring M50804297 from 2^67 to 2^68 (0.59GHz-days)

Using GPU kernel "cl_barrett32_87"

2097152 FCs copied in 0.00 ms (1.#J MB/s), proc'd in 18.05 ms (116.22 M/s)

2097152 FCs copied in 0.00 ms (1.#J MB/s), proc'd in 18.04 ms (116.27 M/s)

2097152 FCs copied in 0.00 ms (1.#J MB/s), proc'd in 18.04 ms (116.27 M/s)

Date    Time | class   Pct |   time     ETA | GHz-d/day    Sieve     Wait

Mar 12 23:36 | 3387   0.1% |  0.500    n.a. |      n.a.    25000    n.a.%

no factor for M50804297 from 2^67 to 2^68 [mfakto 0.14pre3-Win cl_barrett32_87_1]

ERROR: selftest failed for M50804297 (cl_barrett32_87)

  no factor found

tf(): total time spent:  0.515s

Press Ctrl-C to interrupt the test. The ERROR line indicates a calculation error. To locate it, I added tracing. To enable it, edit the #define TRACE_KERNEL in mfakto_Kernels.cl, line 45 (higher numbers mean more trace; 2 shows the issue), then rerun "mfakto -st":

...

######### testcase 1/1559 (M50804297[67-68]) #########

Starting trial factoring M50804297 from 2^67 to 2^68 (0.59GHz-days)

Using GPU kernel "cl_barrett32_87"

cl_barrett32_87: tid=0, f=9:ca9c44cb:6c467957, shift=19

div2.6: q=0:1380:383666f:dfba1994:d1000000, nn=1380:c21d0313:6dab666b:2f000000, tmp=383666f, c=0

div2.6: q.d4=1380, carry=0, nn.d3=1380, q.d4 - nn.d3 + carry=0

div2: q=0:9f39e252:383666f:dfba1994:d1000000(c=0), nn=1380:c21d0313:6dab666b:2f000000, res=d12981fd:e9000000:0

div3.1: q=0:9f39e252:383666f:dfba1994:d1000000:0, n=9:ca9c44cb:6c467957, qi=ffffffff, nf=2.3779E-011

cl_barrett32_87: f=9:ca9c44cb:6c467957, u=d1298206:e9007ff8:ffff7fff, ff=2.3779E-011

...

(Press Ctrl-C twice to abort. For some reason the program always hangs when printf is used in a kernel.)

The source code for this part is in barrett.cl, around line 421:

#if (TRACE_KERNEL > 1)
    if (get_global_id(0)==TRACE_TID) printf((__constant char *)"div2.6: q=%x:%x:%x:%x:%x, nn=%x:%x:%x:%x, tmp=%x, c=%x\n",
        q.d5, q.d4, q.d3, q.d2, q.d1, nn.d3, nn.d2, nn.d1, nn.d0, tmp, carry);
    if (get_global_id(0)==TRACE_TID) printf((__constant char *)"div2.6: q.d4=%x, carry=%x, nn.d3=%x, q.d4 - nn.d3 + carry=%x\n",
        q.d4, carry, nn.d3, q.d4 - nn.d3 + carry);
#endif
  res->d0 = q.d4 - nn.d3 + carry; // attempt to trick AMD compiler to correctly save the result of the subtraction
  q.d4 = q.d4 - nn.d3 + carry;  // <<<===  this one saves a random result into q.d4
#if (TRACE_KERNEL > 1)
    if (get_global_id(0)==TRACE_TID) printf((__constant char *)"div2: q=%x:%x:%x:%x:%x(c=%x), nn=%x:%x:%x:%x, res=%x:%x:%x\n",
        q.d5, q.d4, q.d3, q.d2, q.d1, carry, nn.d3, nn.d2, nn.d1, nn.d0, res->d2, res->d1, res->d0);
#endif

When "q.d4 - nn.d3 + carry" (which is 0x1380 - 0x1380 + 0) is calculated inside the printf call, "0" is printed correctly. But when the same expression is stored to q.d4 (which is used later on), a seemingly random value (0x9f39e252) ends up there. This leads to the calculation error.
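To make the mismatch easy to spot in a long trace, the two relevant lines can be checked on the host. The following is only a sketch in plain C: the function name `check_div2_step` is made up for illustration, and it assumes the trace line formats are exactly those produced by the printf calls above (the second field of the "div2:" line is q.d4).

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical checker for one pair of trace lines.
   before_line: the "div2.6: q.d4=..., carry=..., nn.d3=..." line
   after_line:  the "div2: q=<d5>:<d4>:..." line printed after the store
   Returns 1 if the stored q.d4 matches q.d4 - nn.d3 + carry (mod 2^32),
   0 if it does not, and -1 if either line could not be parsed. */
static int check_div2_step(const char *before_line, const char *after_line)
{
    unsigned int q_d4, carry, nn_d3, q_d5_after, q_d4_after;

    if (sscanf(before_line, "div2.6: q.d4=%x, carry=%x, nn.d3=%x",
               &q_d4, &carry, &nn_d3) != 3)
        return -1;
    if (sscanf(after_line, "div2: q=%x:%x", &q_d5_after, &q_d4_after) != 2)
        return -1;

    /* Recompute the step with 32-bit unsigned wraparound. */
    uint32_t expected = (uint32_t)q_d4 - (uint32_t)nn_d3 + (uint32_t)carry;
    return expected == (uint32_t)q_d4_after ? 1 : 0;
}
```

Fed the "div2.6: q.d4=1380, ..." and "div2: q=0:9f39e252:..." lines from the trace above, this returns 0, i.e. the stored value does not match the recomputed one.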

When running on the CPU device ("mfakto -d cpu -st"), all tests finish successfully.

When the data type of each of the components is changed from a scalar uint to a uint vector of any size, all tests finish successfully (*).

When the trace level is increased to 3 or higher, this calculation step yields the correct result (but the program then hangs because of the printf calls inside the kernel).

I'm looking for hints on how to get correct results on the GPU with scalars.

I'd like AMD to look into this issue: is it a compiler bug or an error in my code?

I'd also like to know why printf hangs my program.

Thanks a lot for any help.

(*) To do this, disable tracing (#define TRACE_KERNEL 0), edit mfakto.ini, line 107, and set VectorSize=2 (or any other vector size).


I know this is a pretty big use case to be analyzed. But did anyone at least run it to try to reproduce the problem? I'm still in need of a solution, or at least a workaround. Thanks.
