
eugenek
Journeyman III

CUDA port, 64-bit multiplication/division, & a likely bug in Stream SDK

A bit of background. I'm working on a computational mathematics problem. The problem is such that it gets a nice boost from GPU optimizations. I have a working CUDA implementation and I'm trying to port it to AMD (since AMD chips appear to have an edge in operations per second, at least on paper).

The port is mostly going well. The two biggest issues encountered so far were the limit on the number of work-items per kernel invocation (NVIDIA can do 32768x256, while my current Barts can only do 124x256; this is silly since it's clearly not a hardware issue, and easy to fix), and the lack of 64-bit atomics (which took some coding to work around, but is also surmountable).

Here's the issue I'm facing right now. My program has a lot of 64-bit integer divisions, which are typically quite slow. But there's a workaround. I have this macro:

#define divide(a, b, c) ((__umul64hi(a,b)+a)>>(c))

Here 'a' and 'b' are uint64_t, and 'c' is uint8_t. '__umul64hi' computes the 128-bit product of two 64-bit unsigned integers and returns the upper 64 bits.

If the list of divisors is known in advance, we can precompute b & c and store them somewhere. On NVIDIA (and on x86, too), this speeds up the division by something like a factor of 3.

I tried to get this working on AMD, too. The closest fit to __umul64hi I could find is 'mul_hi' (defined in the OpenCL 1.1 spec, section 6.11.3), though I'm not sure if it's exactly what I need. But when I plug that one into my macro, I get a crash during the kernel compilation stage (see attached log).

So: is this the correct function to call? If it is, there's apparently a bug in the run-time compiler that ships with 2.3.

 

(gdb) r
Program received signal SIGSEGV, Segmentation fault.
0x00007ffff6db4392 in ?? () from /lib/libc.so.6
(gdb) bt
#0  0x00007ffff6db4392 in ?? () from /lib/libc.so.6
#1  0x00007fffb9522e10 in ?? () from /home/eugene/Downloads/ati-stream-sdk-v2.3-lnx64/lib/x86_64/libatiocl64.so
#2  0x00007fffb9523534 in ?? () from /home/eugene/Downloads/ati-stream-sdk-v2.3-lnx64/lib/x86_64/libatiocl64.so
#3  0x00007fffb9524a40 in ?? () from /home/eugene/Downloads/ati-stream-sdk-v2.3-lnx64/lib/x86_64/libatiocl64.so
<snip>
#24 0x00007fffb8dec16a in ?? () from /home/eugene/Downloads/ati-stream-sdk-v2.3-lnx64/lib/x86_64/libatiocl64.so
#25 0x00007fffb8d7d77a in clBuildProgram () from /home/eugene/Downloads/ati-stream-sdk-v2.3-lnx64/lib/x86_64/libatiocl64.so

12 Replies
nou
Exemplar

The best thing you can do is provide a test case.


That is a good point, and here's an extremely simple kernel that reproduces the crash:

__kernel void do_test_case(__global uint64_t* d_p, uint64_t val, __global uint64_t* output)
{
    output[0] = mul_hi(val, d_p[0]);
}


Insert this snippet into HelloCL.cl from the SDK samples and add

typedef ulong uint64_t;

and it compiles without error.

EDIT: I also tried adding the macro #define divide(a, b, c) ((mul_hi(a,b)+a)>>(c))

and output[1] = divide(val, d_p[1], 4); it passes as well.


This is what I get with your kernel.
E:\Users\mvillmow.AMD\AppData\Local\Temp\OCLA260.tmp.cl(1): error: identifier
"uint64_t" is undefined
__kernel void do_test_case(__global uint64_t* d_p, uint64_t val, __global uint64_t* output)
^

E:\Users\mvillmow.AMD\AppData\Local\Temp\OCLA260.tmp.cl(1): error: identifier
"uint64_t" is undefined
__kernel void do_test_case(__global uint64_t* d_p, uint64_t val, __global uint64_t* output)
^

E:\Users\mvillmow.AMD\AppData\Local\Temp\OCLA260.tmp.cl(1): error: identifier
"uint64_t" is undefined
__kernel void do_test_case(__global uint64_t* d_p, uint64_t val, __global uint64_t* output)
^

E:\Users\mvillmow.AMD\AppData\Local\Temp\OCLA260.tmp.cl(5): error: the opencl
builtin function can only takes [unsigned] char/short/int/long as
first argument
output[0] = mul_hi(val,d_p[0]);
^

Please make sure you use normal CL types or provide a complete test case.

If you typedef as nou says, it compiles normally.

Okay, I see what the problem is now.

I had 'uint64_t' defined as 'unsigned long long' (which is standard in 32-bit gcc and also works in 64-bit gcc).

But in the OpenCL spec, the 64-bit type is 'unsigned long', and 'unsigned long long' is the 128-bit type.

If I define uint64_t as 'ulong' as per the OpenCL spec, mul_hi works correctly.

The reason why this didn't blow up earlier is that, apparently, Stream does NOT really treat 'unsigned long long' as a 128-bit type. (So it doesn't get off scot-free; there's still a bug there.) In particular, sizeof(unsigned long long) is 8, and all my code except for the mul_hi call works as if it were uint64_t.


eugenek,
With our next release this doesn't crash but produces an error about not finding the type.

Here is the example I used, is this sufficient to crash on your setup?

typedef unsigned long long uint64_t;

__kernel void do_test_case(__global uint64_t* d_p, uint64_t val, __global uint64_t* output)
{
    output[0] = mul_hi(val, d_p[0]);
}


Yes, it is sufficient, I just pasted this into a CL file in one of the samples and got a segmentation fault.

 


Ok, this is fixed already internally. Thanks for reporting it.

Great. Why don't you also add built-in support for the C99 fixed-width integer types, to avoid this kind of confusion in the future? I would've just #included stdint.h in the .cl file, but that doesn't work for some reason.

http://en.wikipedia.org/wiki/Stdint.h#Exact-width_integer_types
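In the meantime, a preamble along these lines could be prepended to kernel sources to get the fixed-width names. This is a hypothetical workaround of mine, not SDK-provided, and it relies on the widths the OpenCL spec guarantees for its scalar types (char is 8 bits, short 16, int 32, long 64); it is OpenCL C, not host code:

```c
/* Hypothetical stdint.h-style preamble for .cl files. */
typedef char      int8_t;
typedef uchar     uint8_t;
typedef short     int16_t;
typedef ushort    uint16_t;
typedef int       int32_t;
typedef uint      uint32_t;
typedef long      int64_t;
typedef ulong     uint64_t;
```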

The kernel that uses that mul_hi instruction seems to run NINE TIMES faster with the macro than with explicit integer division.

 

 


Because those types are not the same as the OpenCL types. I don't know the reasoning, but stdint.h is implicitly excluded by the spec, since inttypes.h is not allowed.

Back to the original topic: 64-bit integer division (in cases where I can't replace it with the multiplication macro) appears to be incredibly slow. According to my estimates, I'm getting 700M divisions/second at most, which for my device works out to about 200-250 clock ticks per division per stream core. Is that reasonable, or am I doing something wrong?

 

Edit: according to the ISA dump, each 64-bit integer division compiles into 900 low-level instructions. Yikes!


eugenek,
64-bit integer division is not a native instruction, so it is emulated in software. We can look into improving it in the future, but it will never be fast until we get 64-bit hardware instructions.