Turn on suggestions

Auto-suggest helps you quickly narrow down your search results by suggesting possible matches as you type.

Showing results for

- AMD Community
- Communities
- Developers
- Devgurus Archives
- Archives Discussions
- CUDA port, 64-bit multiplication/division, & a lik...

- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Mute
- Printer Friendly Page

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

01-28-2011
08:11 AM

CUDA port, 64-bit multiplication/division, & a likely bug in Stream SDK

A bit of background. I'm working on a computational mathematics problem. The problem is such that it gets a nice boost from GPU optimizations. I have a working CUDA implementation and I'm trying to port it to AMD (since AMD chips appear to have an edge in operations per second, at least on paper).

The port is mostly going well, two biggest issues encountered so far were the limit on the number of work-units per kernel invocation (NVIDIA can do 32768x256, my current Barts can only do 124x256. This is silly since it's clearly not a hardware issue, and easy to fix), and the lack of 64-bit atomics (that took some coding to work around, but it's also surmountable.)

Here's the issue I'm facing right now. My program has a lot of 64-bit integer divisions. Those are typically quit slow. But there's a workaround. I have this macro:

#define divide(a, b, c) ((__umul64hi(a,b)+a)>>(c))

here 'a' and 'b' are uint64_t, and 'c' is uint8_t. '__umul64hi' computes the 128-bit product of two 64-bit unsigned integers and saves the upper 64 bit.

If the list of divisors is known in advance, we can precompute b & c and store them somewhere. On NVIDIA (and on x86, too), this speeds up the division by something like a factor of 3.

I tried to get this working on AMD, too. The closest fit to __umul64hi I could find is 'mul_hi' (defined in the OpenCL 1.1 spec, section 6.11.3), though I'm not sure if it's exactly what I need. But when I plug that one into my macro, I get a crash during the kernel compilation stage (see attached log).

So: is this the correct function to call? If it is, there's apparently a bug in the run-time compiler that ships with 2.3.

(gdb) r Program received signal SIGSEGV, Segmentation fault. 0x00007ffff6db4392 in ?? () from /lib/libc.so.6 (gdb) bt #0 0x00007ffff6db4392 in ?? () from /lib/libc.so.6 #1 0x00007fffb9522e10 in ?? () from /home/eugene/Downloads/ati-stream-sdk-v2.3-lnx64/lib/x86_64/libatiocl64.so #2 0x00007fffb9523534 in ?? () from /home/eugene/Downloads/ati-stream-sdk-v2.3-lnx64/lib/x86_64/libatiocl64.so #3 0x00007fffb9524a40 in ?? () from /home/eugene/Downloads/ati-stream-sdk-v2.3-lnx64/lib/x86_64/libatiocl64.so <snip> #24 0x00007fffb8dec16a in ?? () from /home/eugene/Downloads/ati-stream-sdk-v2.3-lnx64/lib/x86_64/libatiocl64.so #25 0x00007fffb8d7d77a in clBuildProgram () from /home/eugene/Downloads/ati-stream-sdk-v2.3-lnx64/lib/x86_64/libatiocl64.so

12 Replies

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

01-28-2011
09:08 AM

best what you can do is provide test case.

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

01-28-2011
09:21 AM

That is a good point, and here's an extremely simple kernel that reproduces the crash:

__kernel void do_test_case(__global uint64_t* d_p, uint64_t val, __global uint64_t* output) { output[0] = mul_hi(val,d_p[0]); }

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

01-28-2011
09:40 AM

insert this snipet into HelloCL.cl from SDK samples and add

typedef ulong uint64_t;

and it compiled without error.

EDIT: just try add macro #define divide(a, b, c) ((mul_hi(a,b)+a)>>(c))

and output[1] = divide(val, d_p[1], 4); it pass.

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

01-28-2011
02:18 PM

E:\Users\mvillmow.AMD\AppData\Local\Temp\OCLA260.tmp.cl(1): error: identifier

"uint64_t" is undefined

__kernel void do_test_case(__global uint64_t* d_p, uint64_t val, __global uint64_t* output)

^

E:\Users\mvillmow.AMD\AppData\Local\Temp\OCLA260.tmp.cl(1): error: identifier

"uint64_t" is undefined

__kernel void do_test_case(__global uint64_t* d_p, uint64_t val, __global uint64_t* output)

^

E:\Users\mvillmow.AMD\AppData\Local\Temp\OCLA260.tmp.cl(1): error: identifier

"uint64_t" is undefined

__kernel void do_test_case(__global uint64_t* d_p, uint64_t val, __global uint64_t* output)

^

E:\Users\mvillmow.AMD\AppData\Local\Temp\OCLA260.tmp.cl(5): error: the opencl

builtin function can only takes [unsigned] char/short/int/long

first argument

output[0] = mul_hi(val,d_p[0]);

^

Please make sure you use normal CL types or provide a complete test case.

If you typedef as nou says, it compiles normally.

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

01-28-2011
08:24 PM

Okay, I see what the problem is now.

I had 'uint64_t' defined as 'unsigned long long' (which is standard in 32-bit gcc and also works in 64-bit gcc).

But in the OpenCL spec, the 64-bit type is 'unsigned long', and 'unsigned long long' is the 128-bit type.

If I define uint64_t as 'ulong' as per the OpenCL spec, mul_hi works correctly.

The reason why didn't blow up earlier is that, apparently, Stream does NOT really treat 'unsigned long long' as a 128-bit type. (So it does not get off scot-free, there's still a bug there). In particular, sizeof(unsigned long long) is 8, and all my code except for the mul_hi instruction works as if it were uint64_t.

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

01-28-2011
08:37 PM

With our next release this doesn't crash but produces an error about not finding the type.

Here is the example I used, is this sufficient to crash on your setup?

typedef unsigned long long uint64_t; __kernel void do_test_case(__global uint64_t* d_p, uint64_t val, __global uint64_t* output) { output[0] = mul_hi(val,d_p[0]); }

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

01-28-2011
08:42 PM

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

01-28-2011
08:48 PM

Ok, this is fixed already internally. Thanks for reporting it.

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

01-28-2011
09:04 PM

Great. Why don't you also add built-in support for C99 fixed-width integer types, to avoid possible confusion in the future? I would've just #included stdint.h in the cl file, but that does not work for some reason.

http://en.wikipedia.org/wiki/Stdint.h#Exact-width_integer_types

The kernel that uses that mul_hi instruction seems to run NINE TIMES faster with the macro than with explicit integer division.

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

01-28-2011
09:43 PM

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

01-29-2011
12:19 AM

Back to the original topic: 64-bit integer division (in cases where I can't replace it with a multiplication macro) appears to be incredibly slow. According to my estimates, I'm getting 700M divisions/second at most, which, for my device, indicates about 200-250 clock ticks per division per stream core. Is that reasonable or am I doing something wrong?

Edit: according to the ISA dump, each 64-bit integer division compiles into 900 low-level instructions. Yikes!

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

01-31-2011
01:26 PM

64-bit integer division is not a native instruction, so it is software emulated. We can look into improving it in the future, but it will never be fast until we get 64bit hardware instructions.