
eci
Journeyman III

Performance Comparison ATI-NVidia

Hi

I'm currently comparing the performance of OpenCL on different platforms. I'm especially interested in comparing NVidia and ATI graphics cards. The cards I'm currently using are an NVidia Quadro FX5600 and an ATI Radeon HD 5870.

The task I use for comparison is a backprojection algorithm.

After running some experiments, the ATI card is two times slower than the NVidia card, while from a theoretical point of view it should be at least twice as fast. That's a very disappointing result, and I'm curious what the reason for this poor performance is.

 

The problem is as follows:

I reconstruct a volume from projection data. The volume size is 512^3 and I have 400 projections. For every projection, one kernel run is launched. The task of the kernel is to compute, for every voxel, a position in the current projection image and use the sampled value to increment the voxel value. For the projection images I am using image2d_t with a sampler for linear interpolation.

On the NVidia graphics card I am using a 2D problem over the x-z-direction of the volume. Each work-item processes one line in y-direction, and work-groups are aligned along the x-direction. This way memory access is coalesced and I get very good performance.
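Roughly, the looped NVidia variant looks like this (just a sketch reconstructed from the description above, with a made-up kernel name; the attached code is the authoritative version):

const sampler_t sampler = CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_CLAMP_TO_EDGE | CLK_FILTER_LINEAR;

__kernel void RoadRunner2D(
    __global float* volume,
    __read_only image2d_t projection,
    int L, float R_L, float O_L,
    float m0, float m1, float m2,  float m3,
    float m4, float m5, float m6,  float m7,
    float m8, float m9, float m10, float m11)
{
    size_t ix = get_global_id(0);   // x: consecutive work-items hit consecutive addresses
    size_t iz = get_global_id(1);   // z
    float x = O_L + (float)ix * R_L;
    float z = O_L + (float)iz * R_L;

    // Each work-item walks one full line in y-direction.
    for (int iy = 0; iy < L; ++iy) {
        float y = O_L + (float)iy * R_L;
        float w_n = m2 * x + m5 * y + m8 * z + m11;
        float u_n = (m0 * x + m3 * y + m6 * z + m9 ) / w_n;
        float v_n = (m1 * x + m4 * y + m7 * z + m10) / w_n;
        volume[iz * L * L + iy * L + ix] +=
            1.0f / (w_n * w_n) *
            read_imagef(projection, sampler, (float2)(u_n + 0.5f, v_n + 0.5f)).x;
    }
}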

On the ATI graphics card I tried the same approach, but performance was devastating. So I went back to a 3D problem. I experimented with work-group sizes, and alignment along the x-direction seems to be beneficial here too. This type of implementation currently yields the best performance on ATI, but, as stated, it takes double the time of the NVidia card.

I tried different versions of the kernel, and I'm pretty sure that memory access is the limiting factor. But why? Am I missing something?

One more question: Is there a way to get around the memory restrictions with the current Stream SDK? I'm already using the environment variables to get access to the full 1GB of memory, but I can still only allocate 256MB in one block, which is very annoying!

I attached the simplest version of my kernel code. For NVidia the inner part is surrounded by a loop, and for my current ATI version every kernel processes two voxels on two different memory blocks because of the limitation of the maximum memory block size of 256MB in the current Stream SDK.
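For illustration, the host-side split looks something like this (a sketch; the buffer names are made up and context/kernel setup is omitted):

// 512^3 floats = 512 MB total; split into two 256 MB halves to stay
// under the Stream SDK's per-allocation limit.
size_t halfBytes = (size_t)512 * 512 * 512 / 2 * sizeof(float);
cl_int err;
cl_mem volume1 = clCreateBuffer(context, CL_MEM_READ_WRITE, halfBytes, NULL, &err);
cl_mem volume2 = clCreateBuffer(context, CL_MEM_READ_WRITE, halfBytes, NULL, &err);
// The kernel takes both halves and updates one voxel in each per work-item.
clSetKernelArg(kernel, 0, sizeof(cl_mem), &volume1);
clSetKernelArg(kernel, 1, sizeof(cl_mem), &volume2);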

 

Thanks for your support!

Greetings

Christian

const sampler_t sampler = CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_CLAMP_TO_EDGE | CLK_FILTER_LINEAR;

__kernel void RoadRunner(
    __global float* volume,
    __read_only image2d_t projection,
    int L, int S_x, int S_y,
    float R_L, float O_L,
    float m0, float m1, float m2,  float m3,
    float m4, float m5, float m6,  float m7,
    float m8, float m9, float m10, float m11)
{
    size_t id1 = get_global_id(0);
    size_t id2 = get_global_id(1);
    size_t id3 = get_global_id(2);

    // Voxel position in world coordinates.
    float z = O_L + (float)id1 * R_L;
    float y = O_L + (float)id2 * R_L;
    float x = O_L + (float)id3 * R_L;

    // Project the voxel into the projection image (3x4 projection matrix m0..m11).
    float w_n = m2 * x + m5 * y + m8 * z + m11;
    float u_n = (m0 * x + m3 * y + m6 * z + m9 ) / w_n;
    float v_n = (m1 * x + m4 * y + m7 * z + m10) / w_n;

    // Bilinear fetch, distance weighting, accumulation into the volume.
    volume[id1 * L * L + id2 * L + id3] +=
        1.0f / (w_n * w_n) *
        read_imagef(projection, sampler, (float2)(u_n + 0.5f, v_n + 0.5f)).x;
}

0 Likes
48 Replies
eci
Journeyman III

Sorry, I don't understand what you mean.

I execute the kernel a total of 496 times on both cards, but that shouldn't affect the runtime of each one. They are launched one after another and there is synchronization in between because I have to load new projection data onto the card.

On ATI, each kernel execution means: 512*512*512 work-items grouped into work-groups of 64 work-items each.

And on NVidia: 512*512 work-items grouped into work-groups of 128 work-items each.

0 Likes
Lev
Journeyman III

 ------------

0 Likes
Lev
Journeyman III

 ---

0 Likes
eci
Journeyman III

Yes, I'm aware of that. I meant that one work-group consists of 64 work-items.

 

For clarity:

global work sizes: {512, 512, 512}

local work sizes: {64, 1, 1}
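In host code, the launch is simply (a sketch, assuming queue and kernel are already set up):

size_t global[3] = {512, 512, 512};
size_t local[3]  = {64, 1, 1};
// One launch per projection; error handling omitted.
cl_int err = clEnqueueNDRangeKernel(queue, kernel, 3, NULL, global, local, 0, NULL, NULL);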

0 Likes
Lev
Journeyman III

 

I was wrong, but it is strange that the best work-group size is 64. I can say that kernel launch time on NV is much shorter.

Also, how do you measure the time? The GPU does not start executing until a command like clFlush is issued.
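For example, something like this (a sketch; the queue must have been created with CL_QUEUE_PROFILING_ENABLE):

cl_event evt;
size_t global[3] = {512, 512, 512};
size_t local[3]  = {64, 1, 1};
clEnqueueNDRangeKernel(queue, kernel, 3, NULL, global, local, 0, NULL, &evt);
clFinish(queue);   // without this (or clFlush plus a wait), the GPU may not even have started

cl_ulong t0, t1;   // device timestamps in nanoseconds
clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START, sizeof(t0), &t0, NULL);
clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END,   sizeof(t1), &t1, NULL);
double ms = (t1 - t0) * 1e-6;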

I suggest trying to optimize the initial NV variant; there shouldn't be any problems with it.

0 Likes

@eci, would your RoadRunner kernel go faster if you changed it as follows?


From this:

if (id2 < L/2)
    volume1[id3 * L * L/2 + id2 * L + id1] += read_imagef(projection, sampler, (float2)(u_n+0.5f, v_n+0.5f)).x * w * w;
else
    volume2[id3 * L * L/2 + (id2-L/2) * L + id1] += read_imagef(projection, sampler, (float2)(u_n+0.5f, v_n+0.5f)).x * w * w;

To this:

float temp = read_imagef(projection, sampler, (float2)(u_n+0.5f, v_n+0.5f)).x * w * w;
int index = id3 * L * L/2 + id1;
if (id2 < L/2)
    volume1[index + id2 * L] += temp;
else
    volume2[index + (id2-L/2) * L] += temp;
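The idea is to hoist the read_imagef call and the shared part of the index arithmetic out of the branch, so both sides of the divergent if reuse one image fetch instead of issuing it twice.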

0 Likes

I think your kernel is slower because the 5870's shaders run at 850 MHz, while the Quadro FX5600's shaders run at 1350 MHz.

The 5870 has a bit more SIMD power (like 2x, due to large wavefronts and the VLIW design), but if your kernel is not well optimized and vectorised, then more MHz will win.

 

And, yep, NVIDIA's OpenCL/CUDA drivers might be a bit more optimised. Remember that Quadro == super high end.

0 Likes

I noticed in the stats you posted, back in December, that ALU Busy is ~9%.

And the write unit is stalled for 51% of the time.

OUCH.

Each kernel launch is only doing 3.4 billion ALU cycles. If you were ALU-limited, that would take less than 1.5 milliseconds. (I don't know why the ALUInsts column says 25; I thought you were reporting more, so perhaps 2 ms.)

You might like to try using a 3D local work size, e.g. 4x4x4. This might change the memory access patterns, reducing the write bottleneck. ATI isn't good at intensive scatter with lots of clashes, which this seems to be doing. The HD6970 is meant to be better.
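On the host side that's just (a sketch):

size_t global[3]  = {512, 512, 512};
size_t local3d[3] = {4, 4, 4};   // still 64 work-items per group, but a cubic block
clEnqueueNDRangeKernel(queue, kernel, 3, NULL, global, local3d, 0, NULL, NULL);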

I got better performance with 3D images than using 2D images simulating 3D images, so you might like to try that (though I was read-bottlenecked and you're write-bottlenecked). I'm able to allocate a 512MB 3D image (actually I can allocate two on a 1GB HD 5870).

You need to use:

#pragma OPENCL EXTENSION cl_khr_3d_image_writes : enable

(hmm, not sure about SDK 2.3, definitely in SDK 2.2).

You can read and write 3D images from one kernel. If you're really sneaky you can read and write the same 3D image (though OpenCL says this is not supported), but I don't think that's relevant here.
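A minimal sketch of a kernel that reads one 3D image and writes another (kernel and argument names made up):

#pragma OPENCL EXTENSION cl_khr_3d_image_writes : enable

const sampler_t nn = CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_CLAMP_TO_EDGE | CLK_FILTER_NEAREST;

__kernel void accumulate3d(__read_only image3d_t src, __write_only image3d_t dst)
{
    int4 p = (int4)(get_global_id(0), get_global_id(1), get_global_id(2), 0);
    float4 v = read_imagef(src, nn, p);
    write_imagef(dst, p, v);   // the write requires the pragma above
}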

0 Likes

Originally posted by: eci

Hi

I can provide profiler results now:

Method , ExecutionOrder , GlobalWorkSize , GroupWorkSize , Time , LDSSize , DataTransferSize , GPRs , ScratchRegs , FCStacks , Wavefronts , ALUInsts , FetchInsts , WriteInsts , LDSFetchInsts , LDSWriteInsts , ALUBusy , ALUFetchRatio , ALUPacking , FetchSize , CacheHit , FetchUnitBusy , FetchUnitStalled , WriteUnitStalled , FastPath , CompletePath , PathUtilization , ALUStalledByLDS , LDSBankConflict
WriteBuffer , 1 , , , 219.35086 , , 262144.00 , , , , , , , , , , , , , , , , , , , , , ,
WriteBuffer , 2 , , , 170.11653 , , 262144.00 , , , , , , , , , , , , , , , , , , , , , ,
WriteImage2D , 3 , , , 0.00042 , , 4680.00 , , , , , , , , , , , , , , , , , , , , , ,
RoadRunner__k1_Cypress1 , 4 , {512 512 512} , {64 1 1} , 141.96102 , 0 , , 7 , 0 , 1 , 2097152.00 , 25.00 , 2.00 , 1.00 , 0.00 , 0.00 , 8.58 , 12.50 , 38.00 , 270610.31 , 98.38 , 5.85 , 5.85 , 51.11 , 524301.88 , 0.00 , 100.00 , 0.00 , 0.00
WriteImage2D , 5 , , , 0.00042 , , 4680.00 , , , , , , , , , , , , , , , , , , , , , ,
RoadRunner__k1_Cypress1 , 6 , {512 512 512} , {64 1 1} , 142.21886 , 0 , , 7 , 0 , 1 , 2097152.00 , 25.00 , 2.00 , 1.00 , 0.00 , 0.00 , 8.50 , 12.50 , 38.00 , 269159.88 , 98.27 , 5.75 , 5.75 , 50.83 , 524302.38 , 0.00 , 100.00 , 0.00 , 0.00
WriteImage2D , 7 , , , 0.00042 , , 4680.00 , , , , , , , , , , , , , , , , , , , , , ,
RoadRunner__k1_Cypress1 , 8 , {512 512 512} , {64 1 1} , 136.65246 , 0 , , 7 , 0 , 1 , 2097152.00 , 25.00 , 2.00 , 1.00 , 0.00 , 0.00 , 8.66 , 12.50 , 38.00 , 265210.19 , 99.15 , 5.79 , 5.79 , 50.81 , 524290.75 , 0.00 , 100.00 , 0.00 , 0.00
WriteImage2D , 9 , , , 0.00042 , , 4680.00 , , , , , , , , , , , , , , , , , , , , , ,

The current version of my kernel is attached. Total runtime is 4.5 times longer than on an NVidia Quadro FX 5600. Can you give me any pointers as to why this is the case? Based on some experiments, I would guess it has something to do with global memory access.

I'm working on a second kernel (for another problem) that only works on images, and that one performs around 2 times faster than the NVidia card.

Anyone?

 

I guess the Quadro FX 5600 and Cypress do not have similar compute power. Is that so? Can someone quote the compute power of the Quadro FX 5600?

In addition, the kernel doesn't seem to be optimized for the AMD device. There seem to be a large number of memory accesses, and I think they could be vectorized, which would be a lot more efficient.
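Something along these lines, maybe (an untested float4 sketch of eci's kernel with a made-up name; each work-item handles four consecutive x positions, so the global size in that dimension drops to 128):

const sampler_t sampler = CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_CLAMP_TO_EDGE | CLK_FILTER_LINEAR;

__kernel void RoadRunnerVec4(
    __global float* volume,
    __read_only image2d_t projection,
    int L, float R_L, float O_L,
    float m0, float m1, float m2,  float m3,
    float m4, float m5, float m6,  float m7,
    float m8, float m9, float m10, float m11)
{
    size_t id1 = get_global_id(0);   // z
    size_t id2 = get_global_id(1);   // y
    size_t id3 = get_global_id(2);   // x/4: four voxels per work-item

    float  z = O_L + (float)id1 * R_L;
    float  y = O_L + (float)id2 * R_L;
    float4 x = O_L + ((float)(4 * id3) + (float4)(0.0f, 1.0f, 2.0f, 3.0f)) * R_L;

    // The projection math vectorizes directly; scalars broadcast across the float4.
    float4 w_n = m2 * x + (m5 * y + m8 * z + m11);
    float4 u_n = (m0 * x + (m3 * y + m6 * z + m9 )) / w_n;
    float4 v_n = (m1 * x + (m4 * y + m7 * z + m10)) / w_n;

    // Image reads stay scalar; the ALU work and the global write are vectorized.
    float4 s;
    s.x = read_imagef(projection, sampler, (float2)(u_n.x + 0.5f, v_n.x + 0.5f)).x;
    s.y = read_imagef(projection, sampler, (float2)(u_n.y + 0.5f, v_n.y + 0.5f)).x;
    s.z = read_imagef(projection, sampler, (float2)(u_n.z + 0.5f, v_n.z + 0.5f)).x;
    s.w = read_imagef(projection, sampler, (float2)(u_n.w + 0.5f, v_n.w + 0.5f)).x;

    // One 128-bit read-modify-write instead of four 32-bit ones.
    size_t base = id1 * L * L + id2 * L + 4 * id3;
    float4 acc = vload4(0, volume + base) + s / (w_n * w_n);
    vstore4(acc, 0, volume + base);
}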

0 Likes