
Performance Comparison ATI-NVidia

Discussion created by eci on Nov 19, 2010
Latest reply on Mar 2, 2011 by himanshu.gautam

Hi

I'm currently comparing the performance of OpenCL on different platforms. I'm especially interested in comparing NVidia and ATI graphics cards. The cards I'm currently using are an NVidia Quadro FX5600 and an ATI Radeon HD 5870.

The task I use for comparison is a backprojection algorithm.

After running some experiments, the ATI card is two times slower than the card from NVidia, while from a theoretical point of view it should be at least twice as fast. That's a very disappointing result and I'm curious what the reason for this poor performance is.

 

The problem is as follows:

I reconstruct a volume from projection data. The volume size is 512^3 and I have 400 projections. One kernel run is launched for every projection. The task of the kernel is to compute, for every voxel, a position in the current projection image and to increment the voxel value by the image value at that position. For the projection images I am using image2d_t with a sampler for linear interpolation.
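To make the per-voxel work concrete, here is a minimal CPU sketch in plain C of what one projection contributes to one voxel. It is only a reference model, not the GPU code: the names project_voxel, sample_bilinear and backproject_voxel are made up for illustration, the image is assumed to be a row-major float array, and the interpolation is my understanding of what CLK_FILTER_LINEAR with CLK_ADDRESS_CLAMP_TO_EDGE does once the 0.5 pixel-center offset is applied.

```c
#include <stddef.h>

/* Result of projecting one voxel onto the detector (hypothetical helper type). */
typedef struct {
    float u;      /* detector column, in pixels */
    float v;      /* detector row, in pixels */
    float weight; /* distance weight 1 / w^2 */
} Projected;

/* m[0..11] holds the 3x4 projection matrix column by column,
   in the same order as the kernel arguments m0..m11. */
static Projected project_voxel(const float m[12], float x, float y, float z)
{
    float w = m[2] * x + m[5] * y + m[8] * z + m[11];
    Projected p;
    p.u = (m[0] * x + m[3] * y + m[6] * z + m[9]) / w;
    p.v = (m[1] * x + m[4] * y + m[7] * z + m[10]) / w;
    p.weight = 1.0f / (w * w);
    return p;
}

/* Floor without <math.h>, so the sketch needs no extra link flags. */
static int floor_to_int(float x)
{
    int i = (int)x;
    return ((float)i > x) ? i - 1 : i;
}

static int clamp_int(int i, int lo, int hi)
{
    return i < lo ? lo : (i > hi ? hi : i);
}

/* Bilinear lookup at pixel-index coordinates (u, v) with clamp-to-edge.
   Assumption: img is a row-major width x height array of floats. */
static float sample_bilinear(const float *img, int width, int height,
                             float u, float v)
{
    int fu = floor_to_int(u);
    int fv = floor_to_int(v);
    float a = u - (float)fu; /* horizontal interpolation weight */
    float b = v - (float)fv; /* vertical interpolation weight */
    int i0 = clamp_int(fu, 0, width - 1);
    int i1 = clamp_int(fu + 1, 0, width - 1);
    int j0 = clamp_int(fv, 0, height - 1);
    int j1 = clamp_int(fv + 1, 0, height - 1);
    return (1.0f - a) * (1.0f - b) * img[(size_t)j0 * width + i0]
         + a * (1.0f - b) * img[(size_t)j0 * width + i1]
         + (1.0f - a) * b * img[(size_t)j1 * width + i0]
         + a * b * img[(size_t)j1 * width + i1];
}

/* Adds the contribution of one projection image to one voxel. */
static void backproject_voxel(float *voxel, const float m[12],
                              const float *img, int width, int height,
                              float x, float y, float z)
{
    Projected p = project_voxel(m, x, y, z);
    *voxel += p.weight * sample_bilinear(img, width, height, p.u, p.v);
}
```

Calling backproject_voxel for every voxel and every projection gives a slow but simple result that the GPU output can be compared against.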

On the NVidia graphics card I am using a 2D problem over the x-z plane of the volume. Every work-item processes one line in the y-direction, and work-groups are aligned along the x-direction. This way memory access is coalesced and I get very good performance.
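The reason alignment along x matters can be seen from the memory layout. This is a small sketch in plain C, assuming the same x-fastest layout as the indexing expression in my kernel; voxel_index is just an illustrative helper name.

```c
#include <stddef.h>

/* Linear offset (in elements) of voxel (z, y, x) in an L*L*L volume
   stored with x as the fastest-varying dimension. */
static size_t voxel_index(size_t z, size_t y, size_t x, size_t L)
{
    return z * L * L + y * L + x;
}
```

With L = 512, neighbors along x are 1 element apart, neighbors along y are 512 elements apart, and neighbors along z are 262144 elements apart. Work-items that are adjacent in x therefore touch adjacent floats, which is the access pattern that can be coalesced.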

On the ATI graphics card I tried the same approach, but performance was terrible. So I went back to a 3D problem. I experimented with work-group sizes, and alignment along the x-direction seems to be beneficial here too. This type of implementation currently yields the best performance on ATI, but as stated it takes twice as long as on the NVidia card.

I tried different versions of the kernel and I'm pretty sure that memory access is the limiting factor. But why? Am I missing something?

One more question: Is there a way to get around the memory restrictions with the current Stream SDK? I'm already using the environment variables to get access to the full 1GB of memory, but I can still only allocate 256MB in one block, which is very annoying!

I attached the simplest version of my kernel code. For NVidia the inner part is surrounded by a loop, and in my current ATI version every work-item processes two voxels in two different memory blocks because of the 256MB limit on the maximum memory block size in the current Stream SDK.
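For reference, this is the arithmetic behind the two memory blocks, as a small sketch in plain C. It assumes 4-byte floats, that 256MB means 256 * 2^20 bytes, and that the volume is split along z into whole slices; the helper names are made up for illustration.

```c
typedef unsigned long long u64;

/* Size in bytes of an L*L*L volume of floats. */
static u64 volume_bytes(u64 L)
{
    return L * L * L * sizeof(float);
}

/* Number of allocations needed when one allocation may hold at most
   max_block bytes (rounded up). */
static u64 blocks_needed(u64 total, u64 max_block)
{
    return (total + max_block - 1) / max_block;
}

/* Number of whole z-slices (L*L floats each) that fit into one block. */
static u64 slices_per_block(u64 L, u64 max_block)
{
    return max_block / (L * L * sizeof(float));
}
```

For L = 512 the volume takes 536870912 bytes (512MB), so with a 256MB limit it needs two blocks of 256 z-slices each. Slice z then lives in block z / 256 at local slice z % 256.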

 

Thanks for your support!

Greetings

Christian

const sampler_t sampler = CLK_NORMALIZED_COORDS_FALSE |
                          CLK_ADDRESS_CLAMP_TO_EDGE |
                          CLK_FILTER_LINEAR;

__kernel void RoadRunner(
    __global float* volume,
    __read_only image2d_t projection,
    int L, int S_x, int S_y,
    float R_L, float O_L,
    float m0, float m1, float m2, float m3, float m4, float m5,
    float m6, float m7, float m8, float m9, float m10, float m11)
{
    // One work-item per voxel: id1 -> z, id2 -> y, id3 -> x.
    size_t id1 = get_global_id(0);
    size_t id2 = get_global_id(1);
    size_t id3 = get_global_id(2);

    // Voxel index to volume coordinates (offset O_L, spacing R_L).
    float z = O_L + (float)id1 * R_L;
    float y = O_L + (float)id2 * R_L;
    float x = O_L + (float)id3 * R_L;

    // Project the voxel into the projection image (homogeneous divide by w_n).
    float w_n = m2 * x + m5 * y + m8 * z + m11;
    float u_n = (m0 * x + m3 * y + m6 * z + m9 ) / w_n;
    float v_n = (m1 * x + m4 * y + m7 * z + m10) / w_n;

    // Add the linearly interpolated image value, weighted by 1 / w_n^2.
    volume[id1 * L * L + id2 * L + id3] += (float)(1.0 / (w_n * w_n) * read_imagef(projection, sampler, (float2)(u_n+0.5, v_n+0.5)).x);

    return;
}
