
mike3
Journeyman III

papers that compare Nvidia C2070 and Firestream 9350?

Are there any published studies that compare Nvidia C2070 and Firestream 9350 when implementing algorithms for the same problem?

 

0 Likes
6 Replies
rick_weber
Adept II

Not that I'm aware of, but we compare a C2050 to a Radeon 5870 here:

www.netlib.org/lapack/lawnspdf/lawn228.pdf

The Radeon 5870 is roughly equivalent to the Firestream 9370 in theoretical performance, according to Wikipedia. My guess is you're paying a premium for the 4GB of memory on the Firestream.

0 Likes
eugenek
Journeyman III

I don't know of any studies. I might put a section into an article. Here are some of my observations so far:

- Firestream will generally be faster with single-precision floating point arithmetic.

- Tesla will be faster with integers.

- Tesla will be especially faster with 64-bit integers. That's mostly because of weaknesses in the AMD compiler. But Tesla also has some hardware support for 64-bit operations (e.g. there's a native instruction to convert a 32-bit float to a 64-bit int and vice versa).

- On Tesla, you can pull things out of global memory at will: any instruction can take a global memory location as one of its arguments. On AMD, the compiler has to generate an explicit sequence of fetch instructions. Things look especially complicated if you're trying to fetch through a char* pointer (you have to fetch a dword and then shift it to extract a char).

To get good performance, you have to isolate and merge your memory accesses, e.g. instead of just blindly reading through a char* any time you want, it is a good idea to read a 'char4' and save the result somewhere. It's surmountable, but it complicates your life.
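As an illustrative sketch of the two access patterns (my own example, not from the thread; the kernel names are made up), compare a blind byte-by-byte read with a merged char4 read:

// Naive version: each in[i] is a separate byte load. On AMD the compiler
// has to turn every one of these into a dword fetch plus shift/mask.
__kernel void sum_bytes_naive(__global const uchar* in,
                              __global uint* out)
{
    uint gid = get_global_id(0);
    uint acc = 0;
    for (uint i = 0; i < 4; ++i)
        acc += in[gid * 4 + i];
    out[gid] = acc;
}

// Merged version: one uchar4 (dword-sized) read, then the bytes are
// taken out of the register instead of out of global memory.
__kernel void sum_bytes_merged(__global const uchar4* in,
                               __global uint* out)
{
    uint gid = get_global_id(0);
    uchar4 v = in[gid];                      // single 32-bit fetch
    out[gid] = (uint)v.x + (uint)v.y + (uint)v.z + (uint)v.w;
}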

0 Likes

If you don't have cl_khr_byte_addressable_store enabled, then you'll get an error if you try to access global memory with anything other than at least a 32-bit stride.

I'd imagine you'd want to stream char4s on Nvidia hardware to shared memory and char16s on ATI hardware. Then you can iterate through shared memory 1 byte at a time. You might want to upcast the shared array into another unsigned int array if you need more than O(n) accesses.
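A rough sketch of that staging pattern (illustrative only and untested; WG_SIZE and the kernel name are placeholders, and on ATI you would widen the load to uchar16):

#define WG_SIZE 64   // placeholder work-group size

// Each work-item stages one uchar4 into local (shared) memory; the group
// can then walk the staged bytes one at a time without touching global
// memory again.
__kernel __attribute__((reqd_work_group_size(WG_SIZE, 1, 1)))
void stage_and_scan(__global const uchar4* in, __global uint* out)
{
    __local uchar tile[WG_SIZE * 4];
    __local uchar4* tile4 = (__local uchar4*)tile;

    uint lid = get_local_id(0);
    uint gid = get_global_id(0);

    tile4[lid] = in[gid];                    // coalesced vector load
    barrier(CLK_LOCAL_MEM_FENCE);

    // Byte-at-a-time (even unaligned) access is cheap from local memory;
    // the modulo just wraps the window so it stays inside the tile.
    uint acc = 0;
    for (uint i = 0; i < 4; ++i)
        acc += tile[(lid + i) % (WG_SIZE * 4)];
    out[gid] = acc;
}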

0 Likes

Originally posted by: rick.weber
If you don't have cl_khr_byte_addressable_store enabled, then you'll get an error if you try to access global memory with anything other than at least a 32-bit stride.

 

Regardless of the value of cl_khr_byte_addressable_store, the compiler will still access a dword internally. Behold:

The sole purpose of instructions 4 through 11 in the disassembly below is to fetch a dword and then extract a byte out of that dword.

As you can imagine, if it takes 8 clock ticks (the equivalent of 8x5 = 40 ALU instructions, which could have done 40 single-precision FMAs) to pull a char out of global memory, that can become a huge bottleneck.

 

#pragma OPENCL EXTENSION cl_khr_byte_addressable_store : enable

__kernel void do_test(__global uint* out, __global uchar* in)
{
    out[0] = in[get_global_id(0)];
}

; -------- Disassembly --------------------
00 ALU: ADDR(32) CNT(14) KCACHE0(CB0:0-15) KCACHE1(CB1:0-15)
      0  x: LSHR        R2.x,  KC1[0].x,  (0x00000002, 2.802596929e-45f).x
         t: MULLO_INT   ____,  R1.x,  KC0[1].x
      1  z: ADD_INT     ____,  R0.x,  PS0
      2  y: ADD_INT     ____,  PV1.z,  KC0[6].x
      3  w: ADD_INT     ____,  PV2.y,  KC1[1].x
      4  y: AND_INT     ____,  (0xFFFFFFFC, -1.#QNANf).x,  PV3.w
         z: AND_INT     ____,  PV3.w,  (0x00000003, 4.203895393e-45f).y
      5  x: MOV         R0.x,  PV4.z
         y: ADD_INT     R0.y,  (0xFFFFFFFE, -1.#QNANf).x,  PV4.z
         z: LSHR        R0.z,  PV4.y,  (0x00000002, 2.802596929e-45f).y
         w: ADD_INT     R0.w,  -1,  PV4.z
01 TEX: ADDR(64) CNT(1)
      6  VFETCH R1.x___, R0.z, fc156  MEGA(4)
         FETCH_TYPE(NO_INDEX_OFFSET)
02 ALU: ADDR(46) CNT(9)
      7  y: CNDE_INT    ____,  R0.x,  0.0f,  (0x00000018, 3.363116314e-44f).x
      8  x: CNDE_INT    ____,  R0.w,  (0x00000008, 1.121038771e-44f).x,  PV7.y
      9  w: CNDE_INT    ____,  R0.y,  (0x00000010, 2.242077543e-44f).x,  PV8.x
     10  y: ASHR        ____,  R1.x,  PV9.w
     11  x: AND_INT     R1.x,  PV10.y,  (0x000000FF, 3.573311084e-43f).x
03 MEM_RAT_CACHELESS_STORE_RAW: RAT(1)[R2].x___, R1, ARRAY_SIZE(4)  MARK  VPM
END_OF_PROGRAM

0 Likes

Originally posted by: rick.weber
I'd imagine you'd want to stream char4s on Nvidia hardware to shared memory and char16s on ATI hardware. Then you can iterate through shared memory 1 byte at a time. You might want to upcast the shared array into another unsigned int array if you need more than O(n) accesses.


 

Nvidia has the memory accessing and caching worked out in such a way that you can just grab data through a char*. Most of the time you don't even need to use shared memory: an explicit fragment of code that stages data from global to shared memory will often make things slightly _slower_, and grouping reads into char4s does not really give any performance boost, unlike on AMD.

Edit: I got around to compiling an identical kernel on NVIDIA. As you can see, a single 'ld u8' instruction pulls the char.

On the other hand, if I declare both pointers as char4, that confuses NVIDIA's compiler, and I get some crazy byte manipulation from them too (instead of just a simple load & store). The moral is: check the assembly early and often, and be wary of the compiler doing things it shouldn't.

00000000: l3 mov b32 $r1 c1[0x100]
00000008:    mov b32 $r0 $tidx
00000010: l3 mov b32 $r4 c0[0x20]
00000018:    mov b32 $r5 c0[0x24]
00000020:    add b32 $c $r2 $r0 c0[0x28]
00000028:    add b32 $r3 0 c0[0x2c] $c
00000030:    ld u8 $r0 ca g[$r2d]
00000038:    st b32 wb g[$r4d] $r0
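For reference, the char4 variant described above would look roughly like this (a reconstruction from the description, since the exact code wasn't posted; the kernel name do_test4 is made up):

// Both pointers declared as vector types: per the comment above, this
// variant made NVIDIA's compiler emit extra byte manipulation instead of
// a plain vector load and store.
__kernel void do_test4(__global char4* out, __global char4* in)
{
    out[0] = in[get_global_id(0)];
}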

0 Likes

eugenek,
This is because you are going down the fast path. If you want to load only a byte, you must go down the complete path, which is dramatically slower.
0 Likes