
Archives Discussions

eci
Journeyman III

Performance Comparison ATI-NVidia

Hi

I'm currently comparing OpenCL performance on different platforms. I'm especially interested in comparing NVidia and ATI graphics cards. The cards I'm currently using are an NVidia Quadro FX5600 and an ATI Radeon HD 5870.

The task I use for comparison is a backprojection algorithm.

After running some experiments, the ATI card is two times slower than the NVidia card, while from a theoretical point of view it should be at least twice as fast. That's a very disappointing result, and I'm curious what the reason for this poor performance is.

 

The problem is as follows:

I reconstruct a volume from projection data. The volume size is 512^3 and I have 400 projections. For every projection, one kernel run is launched. The task of each kernel is to compute, for every voxel, a position in the current projection image and use that sampled value to increment the voxel value. For the projection images I use image2d_t with a sampler for linear interpolation.
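For reference, the per-voxel step described above can be written as a scalar sketch in plain Python. Names (m0..m11, R_L, O_L) follow the posted kernel; the bilinear image lookup is replaced by a plain `sample` callback, which is an assumption made here for brevity.

```python
# Scalar sketch of one backprojection update, mirroring the posted kernel.
# m is a 12-element projection matrix in the same order (m0..m11) as the
# kernel arguments; sample(u, v) stands in for read_imagef with a linear sampler.
def backproject_voxel(id1, id2, id3, R_L, O_L, m, sample):
    z = O_L + id1 * R_L
    y = O_L + id2 * R_L
    x = O_L + id3 * R_L
    w_n = m[2] * x + m[5] * y + m[8] * z + m[11]
    u_n = (m[0] * x + m[3] * y + m[6] * z + m[9]) / w_n
    v_n = (m[1] * x + m[4] * y + m[7] * z + m[10]) / w_n
    # distance weighting 1/w^2, exactly as in the kernel
    return 1.0 / (w_n * w_n) * sample(u_n + 0.5, v_n + 0.5)
```

This is only the increment for one voxel; the real kernel adds it into `volume[id1*L*L + id2*L + id3]`.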

On the NVidia graphics card I use a 2D problem over the x-z extent of the volume. Each work-item processes one line in the y-direction, and work-groups are aligned along the x-direction. This way memory access is coalesced and I get very good performance.

On the ATI graphics card I tried the same approach, but performance was devastating, so I went back to a 3D problem. I experimented with work-group sizes, and alignment along the x-direction seems to be beneficial here too. This implementation currently yields the best performance on ATI, but as stated, it takes double the time of the NVidia card.

I tried different versions of the kernel and I'm fairly sure that memory access is the limiting factor. But why? Am I missing something?

One more question: is there a way to get around the memory restrictions in the current Stream SDK? I'm already using the environment variables to get access to the full 1 GB of memory, but I can still only allocate 256 MB in a single block, which is very annoying!
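The workaround implied by the 256 MB cap is to split one logical allocation into several blocks. A hypothetical helper (names and the byte figures are assumptions, not part of the Stream SDK) illustrates the arithmetic for the 512^3 float volume:

```python
# Split a logical allocation into blocks that each stay under a
# per-allocation cap (256 MB in the Stream SDK described above).
# All sizes are in bytes; this is only the bookkeeping, not cl calls.
def split_allocation(total_bytes, max_block_bytes):
    blocks = []
    remaining = total_bytes
    while remaining > 0:
        blocks.append(min(remaining, max_block_bytes))
        remaining -= blocks[-1]
    return blocks

# A 512^3 float volume is 512 MB, so it needs two 256 MB blocks,
# which matches the "two voxels on two memory blocks" scheme below.
volume_bytes = 512 ** 3 * 4
```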

I attached the simplest version of my kernel code. For NVidia, the inner part is surrounded by a loop; in my current ATI version, every kernel processes two voxels in two different memory blocks because of the 256 MB maximum memory block size in the current Stream SDK.

 

Thanks for your support!

Greetings

Christian

const sampler_t sampler = CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_CLAMP_TO_EDGE | CLK_FILTER_LINEAR;

__kernel void RoadRunner(
    __global float* volume,
    __read_only image2d_t projection,
    int L, int S_x, int S_y,
    float R_L, float O_L,
    float m0, float m1, float m2, float m3, float m4,  float m5,
    float m6, float m7, float m8, float m9, float m10, float m11)
{
    size_t id1 = get_global_id(0);
    size_t id2 = get_global_id(1);
    size_t id3 = get_global_id(2);

    float z = O_L + (float)id1 * R_L;
    float y = O_L + (float)id2 * R_L;
    float x = O_L + (float)id3 * R_L;

    float w_n = m2 * x + m5 * y + m8 * z + m11;
    float u_n = (m0 * x + m3 * y + m6 * z + m9 ) / w_n;
    float v_n = (m1 * x + m4 * y + m7 * z + m10) / w_n;

    volume[id1 * L * L + id2 * L + id3] +=
        (float)(1.0 / (w_n * w_n) * read_imagef(projection, sampler, (float2)(u_n + 0.5, v_n + 0.5)).x);
}

48 Replies
cjang
Journeyman III


Some suggestions:

1. sampler = CLK_FILTER_NEAREST | CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_NONE;

2. Vectorize memory access: use __global floatX* volume, where X is 2 or 4. Scalar calculation is fine inside the kernel; just the gather and scatter should use float2 or float4.

3. Lay out the image data to maximize use of the texture cache. Right now your code uses only the first "x" component, while each read_imagef returns a quad of four floats.

I don't mean this purely as a shameless plug: I've done a lot of optimization at the application level. Here's something I wrote that may give you an idea of how different the vectorized cores of ATI are from the scalar cores of NVIDIA: http://golem5.org/gatlas/CaseStudyGATLAS.htm . Kernel design can differ between vendors (and between architectures).
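The vectorized gather/scatter idea in suggestion 2 can be sketched in plain Python: one "work-item" loads four contiguous voxels as a unit, does scalar math on them, and stores them back as a unit. This is only an illustration of the access pattern, not OpenCL code; the names are invented here.

```python
# float4-style access pattern: i4 indexes groups of 4 floats.
def update_float4(volume, i4, increments):
    base = i4 * 4
    vec = volume[base:base + 4]                          # vectorized gather
    vec = [v + inc for v, inc in zip(vec, increments)]   # scalar math per lane
    volume[base:base + 4] = vec                          # vectorized scatter
    return volume
```

On ATI's VLIW hardware the point is that each work-item then moves 128 bits per memory transaction instead of 32.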

zeland
Journeyman III


A simple way to optimize is to reduce expensive operations like division.

Also consider using __constant for m0, m1, etc.

float w_n = 1.0f / (m2 * x + m5 * y + m8 * z + m11);
float u_n = (m0 * x + m3 * y + m6 * z + m9 ) * w_n;
float v_n = (m1 * x + m4 * y + m7 * z + m10) * w_n;
volume[id1 * L * L + id2 * L + id3] +=
    (float)((w_n * w_n) * read_imagef(projection, sampler, (float2)(u_n + 0.5, v_n + 0.5)).x);
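The algebraic identity behind this rewrite is easy to verify: with w_rcp = 1/w, the weight 1.0/(w*w) equals w_rcp*w_rcp, so the three per-voxel divides collapse into one reciprocal plus multiplies. A quick Python check (function names invented here):

```python
# Original formulation: one divide per weight evaluation.
def weight_original(w, p):
    return 1.0 / (w * w) * p

# Rewritten formulation: one reciprocal, then only multiplies.
def weight_reciprocal(w, p):
    w_rcp = 1.0 / w
    return w_rcp * w_rcp * p
```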

debdatta_basu
Journeyman III


Does vectorizing solve the problem? If it does, the compiler is REALLY bad, because it wasn't able to vectorize three consecutive identical statements...

Are you sure the compiler is set to full optimization? In any case, try the vectorized version and share the details.

Cheers!

-Debdatta Basu.

cjang
Journeyman III


To clarify: I suggest vectorizing the gather/scatter memory access but not the calculations. Leave those as scalar operations. eci's speculation that "memory access is the limiting factor" is almost certainly correct.

debdatta.basu, in my experience the static analysis required for a compiler to auto-vectorize and schedule is non-trivial. It's a hard problem in general.

n0thing
Journeyman III


Try this kernel; it might pass your correctness check.

The number of global threads should now be 1/4th of the original kernel's.

const sampler_t sampler = CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_CLAMP_TO_EDGE | CLK_FILTER_LINEAR;

__kernel void RoadRunner(
    __global float4* volume,
    __read_only image2d_t projection,
    float R_L, float O_L,
    float4 m2m5m8m11,
    float4 m0m3m6m9,
    float4 m1m4m7m10,
    float4 L_eq)  // (L^2, L, 1, 0)
{
    float4 id = (float4)(get_global_id(0), get_global_id(1), get_global_id(2), 0);
    float4 pos = O_L + id * R_L;

    float w_n     = dot(pos, m2m5m8m11);
    float w_n_rcp = native_recip(w_n);
    float u_n     = dot(pos, m0m3m6m9)  * w_n_rcp;
    float v_n     = dot(pos, m1m4m7m10) * w_n_rcp;

    int write_pos = convert_int(dot(id, L_eq));
    volume[write_pos] += w_n_rcp * w_n_rcp * read_imagef(projection, sampler, (float2)(u_n + 0.5, v_n + 0.5));
}
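The core of this version is that packing the matrix columns into float4s turns each projection row into a single dot(). A pure-Python stand-in (function names invented here) checks that the dot-product form reproduces the original scalar arithmetic, assuming the position vector carries a homogeneous 1 in its fourth component so that m11 enters unscaled:

```python
# Pure-Python stand-in for the float4 dot-product formulation.
def dot4(a, b):
    return sum(x * y for x, y in zip(a, b))

# pos is (x, y, z, 1); each matrix row is packed as a 4-vector,
# mirroring the m2m5m8m11 / m0m3m6m9 / m1m4m7m10 kernel arguments.
def project(pos, m2m5m8m11, m0m3m6m9, m1m4m7m10):
    w = dot4(pos, m2m5m8m11)
    u = dot4(pos, m0m3m6m9) / w
    v = dot4(pos, m1m4m7m10) / w
    return u, v, w
```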

eduardoschardong
Journeyman III


I put the original code through the SKA (Stream KernelAnalyzer) and got the attached ISA. Can I blame the compiler now?

; -------- Disassembly --------------------
00 ALU: ADDR(32) CNT(122) KCACHE0(CB0:0-15) KCACHE1(CB1:0-31)
0 t: MULLO_INT ____, R1.z, KC0[1].z
1 z: ADD_INT ____, R0.z, PS0 t: MULLO_INT ____, R1.y, KC0[1].y
2 x: ADD_INT R1.x, PV1.z, KC0[6].z w: ADD_INT ____, R0.y, PS1 t: MULLO_INT ____, R1.x, KC0[1].x
3 x: ADD_INT ____, R0.x, PS2 y: ADD_INT T3.y, PV2.w, KC0[6].y t: U_TO_F T0.x, PV2.x
4 y: ADD_INT T2.y, PV3.x, KC0[6].x t: U_TO_F ____, PV3.y
5 y: MULADD_e T1.y, KC1[6].x, PS4, KC1[7].x w: MULADD_e T0.w, KC1[6].x, T0.x, KC1[7].x t: U_TO_F T0.x, PV4.y
6 x: MUL_e T1.x, KC1[9].x, PV5.w z: MUL_e ____, KC1[10].x, PV5.w
7 x: MUL_e T2.x, KC1[8].x, T0.w y: MULADD_e T0.y, KC1[13].x, T1.y, PV6.z
8 w: MULADD_e T1.w, KC1[6].x, T0.x, KC1[7].x
9 z: MULADD_e ____, KC1[16].x, PV8.w, T0.y w: MULADD_e T0.w, KC1[12].x, T1.y, T1.x
10 y: ADD ____, KC1[19].x, PV9.z w: MULADD_e ____, KC1[11].x, T1.y, T2.x
11 x: MOV T2.x, PV10.y y: MULADD_e ____, KC1[14].x, T1.w, PV10.w z: MULADD_e ____, KC1[15].x, T1.w, T0.w w: MOV T0.w, PV10.y t: MUL_e R0.y, PV10.y, PV10.y
12 x: AND_INT ____, PV11.x, (0x807FFFFF, -1.175494211e-38f).x y: ADD T0.y, KC1[18].x, PV11.z z: AND_INT T2.z, PV11.x, (0x7F800000, 1.#INFf).y w: ADD T3.w, KC1[17].x, PV11.y t: AND_INT T0.z, PV11.x, (0x80000000, -0.0f).z
13 x: SETE_INT T0.x, PV12.z, 0.0f y: AND_INT T1.y, PV12.y, (0x807FFFFF, -1.175494211e-38f).x z: OR_INT ____, PV12.x, (0x3F800000, 1.0f).y w: AND_INT R0.w, PV12.y, (0x7F800000, 1.#INFf).z t: AND_INT R1.w, PV12.w, (0x7F800000, 1.#INFf).z
14 x: SETE_INT T1.x, T2.z, (0x7F800000, 1.#INFf).x y: AND_INT ____, T3.w, (0x807FFFFF, -1.175494211e-38f).y z: CNDE_INT T0.z, PV13.x, PV13.z, T0.z w: SETE_INT T2.w, PV13.w, 0.0f t: AND_INT T1.w, T0.y, (0x80000000, -0.0f).z
15 x: SETE_INT T3.x, R1.w, 0.0f y: OR_INT ____, T1.y, (0x3F800000, 1.0f).x z: AND_INT T1.z, T0.w, (0x7F800000, 1.#INFf).y VEC_120 w: AND_INT ____, T3.w, (0x80000000, -0.0f).z VEC_201 t: OR_INT ____, PV14.y, (0x3F800000, 1.0f).x
16 x: CNDE_INT T2.x, T1.x, T0.z, T2.x y: SETE_INT ____, R0.w, (0x7F800000, 1.#INFf).x z: CNDE_INT T0.z, PV15.x, PS15, PV15.w w: CNDE_INT ____, T2.w, PV15.y, T1.w VEC_102 t: SETE_INT T1.w, PV15.z, (0x7F800000, 1.#INFf).x
17 x: SETE_INT ____, R1.w, (0x7F800000, 1.#INFf).x y: SETE_INT ____, T1.z, 0.0f z: CNDE_INT T1.z, PV16.y, PV16.w, T0.y w: OR_INT ____, PV16.y, T1.x VEC_021 t: OR_INT ____, T2.w, T0.x
18 x: OR_INT ____, T1.w, PV17.x y: CNDE_INT ____, PV17.x, T0.z, T3.w z: OR_INT T0.z, PV17.w, PS17 w: OR_INT ____, PV17.y, T3.x t: RCP_e ____, T2.x
19 x: SUB_INT ____, R0.w, T2.z y: MUL_e T0.y, T1.z, PS18 z: OR_INT R2.z, PV18.x, PV18.w w: SUB_INT ____, R1.w, T2.z VEC_120 t: MUL_e T3.w, PS18, PV18.y
20 x: AND_INT ____, (0x7FFFFFFF, 1.#QNANf).x, PV19.y y: CNDE_INT T2.y, PV19.z, PV19.w, 0.0f z: AND_INT ____, (0x7FFFFFFF, 1.#QNANf).x, PS19 w: CNDE_INT T1.w, T0.z, PV19.x, 0.0f t: MULLO_INT ____, T2.y, KC1[3].x
21 x: ASHR ____, PV20.z, (0x00000017, 3.222986468e-44f).x y: ASHR ____, PV20.w, (0x00000017, 3.222986468e-44f).x z: ASHR ____, PV20.x, (0x00000017, 3.222986468e-44f).x w: ADD_INT ____, T3.y, PS20 t: ASHR ____, PV20.y, (0x00000017, 3.222986468e-44f).x
22 x: AND_INT T3.x, (0x80000000, -0.0f).x, T0.y y: ADD_INT T0.y, PV21.x, PS21 z: ADD_INT T2.z, T1.w, T0.y w: ADD_INT ____, PV21.z, PV21.y t: MULLO_INT ____, PV21.w, KC1[3].x
23 x: SETGE_INT T2.x, PV22.w, (0x000000FF, 3.573311084e-43f).x y: ADD_INT T2.y, R1.x, PS22 z: SETGE_INT T1.z, 0.0f, PV22.w w: AND_INT R2.w, (0x80000000, -0.0f).y, T3.w t: ADD_INT R5.z, T2.y, T3.w
24 x: AND_INT ____, R0.y, (0x7F800000, 1.#INFf).x y: AND_INT T0.y, R0.y, (0x80000000, -0.0f).y z: SETGE_INT R4.z, 0.0f, T0.y w: AND_INT ____, R0.y, (0x807FFFFF, -1.175494211e-38f).z t: SETGE_INT R3.z, T0.y, (0x000000FF, 3.573311084e-43f).w
25 x: SUB_INT R1.x, (0x3F800000, 1.0f).x, PV24.x y: SETE_INT ____, PV24.x, 0.0f z: SETE_INT R0.z, PV24.x, (0x7F800000, 1.#INFf).y w: OR_INT ____, PV24.w, (0x3F800000, 1.0f).x t: LSHL R0.w, T2.y, (0x00000002, 2.802596929e-45f).z
26 x: OR_INT R0.x, PV25.z, PV25.y y: CNDE_INT R1.y, T1.z, T2.z, T3.x z: CNDE_INT R1.z, PV25.y, PV25.w, T0.y w: CNDE_INT R1.w, T0.z, T2.x, 0.0f VEC_201 t: OR_INT R2.y, T3.x, (0x7F800000, 1.#INFf).x
01 ALU: ADDR(154) CNT(24) KCACHE0(CB1:0-15)
27 x: ADD_INT T2.x, KC0[0].x, R0.w y: CNDE_INT ____, R0.z, R1.z, R0.y z: CNDE_INT T0.z, R0.x, R1.x, 0.0f w: OR_INT T3.w, R2.w, (0x7F800000, 1.#INFf).x t: CNDE_INT ____, R1.w, R1.y, R2.y
28 x: CNDE_INT T3.x, R2.z, R3.z, 0.0f y: ASHR T0.y, PV27.z, (0x00000017, 3.222986468e-44f).x z: ADD R2.z, PS27, 0.5 t: RCP_e ____, PV27.y
29 x: AND_INT ____, (0x7FFFFFFF, 1.#QNANf).x, PS28 y: AND_INT R0.y, (0x80000000, -0.0f).y, PS28 z: ADD_INT R4.z, T0.z, PS28 w: CNDE_INT ____, R4.z, R5.z, R2.w VEC_120 t: LSHR R3.x, T2.x, (0x00000002, 2.802596929e-45f).z
30 x: CNDE_INT ____, T3.x, PV29.w, T3.w y: OR_INT R1.y, PV29.y, (0x7F800000, 1.#INFf).x z: ASHR ____, PV29.x, (0x00000017, 3.222986468e-44f).y
31 x: ADD R2.x, PV30.x, 0.5 w: ADD_INT R2.w, PV30.z, T0.y
02 TEX: ADDR(192) CNT(2)
32 SAMPLE R2.x___, R2.xz0x, t0, s0 UNNORM(XYZW)
33 VFETCH R1.x___, R3.x, fc156 MEGA(4) FETCH_TYPE(NO_INDEX_OFFSET)
03 ALU: ADDR(178) CNT(7)
34 x: SETGE_INT ____, R2.w, (0x000000FF, 3.573311084e-43f).x z: SETGE_INT ____, 0.0f, R2.w
35 z: CNDE_INT ____, PV34.z, R4.z, R0.y w: CNDE_INT ____, R0.x, PV34.x, 0.0f
36 y: CNDE_INT ____, PV35.w, PV35.z, R1.y
37 x: MULADD_e R1.x, R2.x, PV36.y, R1.x
04 MEM_RAT_CACHELESS_STORE_RAW: RAT(1)[R3].x___, R1, ARRAY_SIZE(4) MARK VPM
END_OF_PROGRAM

MicahVillmow
Staff


eduardoschardong,
What do you see as inefficient in the ISA generation? The compiler looks like it is packing pretty well into the ALU slots: ~150 ALU instructions packed into 35 ALU bundles, for a packing ratio of ~4.28.

nou
Exemplar


Try using the profiler to see what the bottleneck is.

eduardoschardong
Journeyman III


Micah, what I'm seeing is 17 multiplications, 17 additions, and 3 divisions becoming ~150 instructions. The GPU is doing much more work than what is visible in that small piece of code, and likely much more than on NVidia GPUs.

 

Well... with performance so far below expectations, something must be wrong, right?
