MicahVillmow
Staff

cheapo GT120M outperforms HD5850 in prefix sum

After looking closer at the ISA for the case I pasted above, the problem is that the ALU packing ratio is only about 2.17 out of 5. There are 302 ALU instructions and 139 ALU bundles, so you are only utilizing about 43% of the capacity of the chip. This is fairly low, and you need to increase the parallelism of the kernel. If you remove the barriers, overall performance improves by about 10% and utilization increases to 2.9 out of 5, or about 58%. This is still pretty low, as most video games average between 3.8 and 4.1.
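For reference, the packing arithmetic above can be reproduced with a few lines. This is only a sketch: the function name `alu_packing` is made up for illustration, and the default of 5 slots per bundle is taken from the "out of 5" figure quoted above.

```python
def alu_packing(alu_instructions: int, alu_bundles: int, slots_per_bundle: int = 5):
    """Return (average ALU instructions per bundle, fraction of slots used)."""
    ratio = alu_instructions / alu_bundles
    return ratio, ratio / slots_per_bundle


# Counts quoted in the post: 302 ALU instructions in 139 ALU bundles.
ratio, utilization = alu_packing(302, 139)
print(f"{ratio:.2f} of 5 slots, {utilization:.0%} utilization")
# prints "2.17 of 5 slots, 43% utilization"
```

The same division gives the post-barrier-removal figure: 2.9 / 5 is 58%.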

Is it possible to have ThreadSum work on 2 or 4 or 8 data points in parallel? This would help with performance.
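To make the suggestion concrete, here is a CPU-side sketch of a scan in which each simulated thread handles several values instead of one. It is not the kernel from this thread: the function name, the `items_per_thread` parameter, and the choice of an exclusive scan are all assumptions made for illustration. The point is that the per-thread work in phases 1 and 3 is independent across values, which gives the compiler more instructions to pack per bundle.

```python
from itertools import accumulate


def blocked_exclusive_scan(values, items_per_thread=4):
    """Exclusive prefix sum where each simulated thread owns a chunk of values."""
    # Split the input so that every "thread" owns items_per_thread contiguous values.
    chunks = [values[i:i + items_per_thread]
              for i in range(0, len(values), items_per_thread)]

    # Phase 1: each thread sums its own chunk (no communication needed).
    totals = [sum(chunk) for chunk in chunks]

    # Phase 2: exclusive scan over the per-thread totals.
    # On a GPU this is the cooperative step done in shared/local memory.
    offsets = [0] + list(accumulate(totals))[:-1]

    # Phase 3: each thread rescans its chunk, starting from its offset.
    out = []
    for chunk, offset in zip(chunks, offsets):
        running = offset
        for v in chunk:
            out.append(running)
            running += v
    return out


print(blocked_exclusive_scan([1, 2, 3, 4, 5, 6, 7, 8, 9], items_per_thread=4))
# prints [0, 1, 3, 6, 10, 15, 21, 28, 36]
```

With 2, 4, or 8 items per thread, the cooperative phase 2 shrinks by the same factor, so fewer barriers are needed per element processed.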
BarnacleJunior
Journeyman III


Thanks for running that. There is something wrong with the driver then, not my code. The 5850 is obviously not 10x slower than the 5870. I am using your kernel code with the intrinsics:

GPU velocity: 260.555M
GPU velocity: 260.693M
GPU velocity: 260.320M
GPU velocity: 258.789M
GPU velocity: 259.763M
GPU velocity: 261.208M
GPU velocity: 261.393M
GPU velocity: 261.182M
GPU velocity: 261.354M
GPU velocity: 261.292M

OpenCL is reporting 18 for CL_DEVICE_MAX_COMPUTE_UNITS, and the performance in D3D11 is 3200M/sec.  Here's the version info from Catalyst Control Center:

Driver Packaging Version    8.681-091124a-092499C-ATI   
Catalyst™ Version    09.12   
Provider    ATI Technologies Inc.   
2D Driver Version    8.01.01.984   
2D Driver File Path    /REGISTRY/MACHINE/SYSTEM/ControlSet001/Control/CLASS/{4D36E968-E325-11CE-BFC1-08002BE10318}/0000   
Direct3D Version    8.14.10.0716   
OpenGL Version    6.14.10.9232   
Catalyst™ Control Center Version    2009.1214.1801.32312

 

BarnacleJunior
Journeyman III


So interesting! I'm on Win7 x64. The 32bit builds that I've been using until now all perform at ~260M uints/sec. I just tried linking to your 64bit libraries:

My kernel without intrinsics:

GPU velocity: 3822.475M
GPU velocity: 3913.279M
GPU velocity: 3915.547M
GPU velocity: 3918.406M
GPU velocity: 3911.325M
GPU velocity: 3911.484M
GPU velocity: 3910.957M
GPU velocity: 3908.489M
GPU velocity: 3912.629M
GPU velocity: 3905.520M

 

Your kernel with intrinsics:

GPU velocity: 3826.765M
GPU velocity: 3919.825M
GPU velocity: 3918.040M
GPU velocity: 3917.697M
GPU velocity: 3916.497M
GPU velocity: 3914.409M
GPU velocity: 3911.509M
GPU velocity: 3911.349M
GPU velocity: 3913.832M
GPU velocity: 3912.431M

This is encouraging. I'll just exclusively use the 64bit builds until Catalyst 9.13.

 

MicahVillmow
Staff


I've reported the performance delta between the 32bit and 64bit DLLs, so maybe we can figure out exactly what is going wrong. Enjoy the new year!
BarnacleJunior
Journeyman III


For comparison purposes, here is the equivalent prefix sum in each configuration.

OpenCL 32bit:

GPU velocity: 260.555M
GPU velocity: 260.693M
GPU velocity: 260.320M
GPU velocity: 258.789M
GPU velocity: 259.763M
GPU velocity: 261.208M
GPU velocity: 261.393M
GPU velocity: 261.182M
GPU velocity: 261.354M
GPU velocity: 261.292M

 

OpenCL 64bit:

 

GPU velocity: 3822.475M
GPU velocity: 3913.279M
GPU velocity: 3915.547M
GPU velocity: 3918.406M
GPU velocity: 3911.325M
GPU velocity: 3911.484M
GPU velocity: 3910.957M
GPU velocity: 3908.489M
GPU velocity: 3912.629M
GPU velocity: 3905.520M

D3D11 32bit:

GPU velocity: 3733.582M
GPU velocity: 3910.886M
GPU velocity: 3910.342M
GPU velocity: 3911.228M
GPU velocity: 3910.527M
GPU velocity: 3910.804M
GPU velocity: 3911.896M
GPU velocity: 3911.685M
GPU velocity: 3909.983M
GPU velocity: 3906.159M

D3D11 64bit:

GPU velocity: 3741.084M
GPU velocity: 3911.670M
GPU velocity: 3912.159M
GPU velocity: 3912.828M
GPU velocity: 3912.276M
GPU velocity: 3913.058M
GPU velocity: 3911.648M
GPU velocity: 3912.201M
GPU velocity: 3912.554M
GPU velocity: 3909.024M

BarnacleJunior
Journeyman III


OK Micah. Thanks for your help. Have a good New Year too.

apollo_maverick
Journeyman III


Weird. Here are my results:

GPU velocity: 4507.252M
GPU velocity: 4802.451M
GPU velocity: 4789.285M
GPU velocity: 4801.221M
GPU velocity: 4784.729M
GPU velocity: 4785.236M
GPU velocity: 4784.610M
GPU velocity: 4781.841M
GPU velocity: 4775.066M
GPU velocity: 4797.403M

These are significantly faster than the results you guys posted for the same chip model, the 5870, and I'd like to know why. Thanks.

BTW, my system config: Windows 7 x64; i7 965 EE @3.6GHz; 12GB DDR3 @1600MHz; dual SATA HDD RAID 0; HD 5870 1GB.

The program is compiled for x64.

BarnacleJunior
Journeyman III


Originally posted by: apollo_maverick

Weird. Here are my results:

GPU velocity: 4507.252M
GPU velocity: 4802.451M
GPU velocity: 4789.285M
GPU velocity: 4801.221M
GPU velocity: 4784.729M
GPU velocity: 4785.236M
GPU velocity: 4784.610M
GPU velocity: 4781.841M
GPU velocity: 4775.066M
GPU velocity: 4797.403M

These are significantly faster than the results you guys posted for the same chip model, the 5870, and I'd like to know why. Thanks.

BTW, my system config: Windows 7 x64; i7 965 EE @3.6GHz; 12GB DDR3 @1600MHz; dual SATA HDD RAID 0; HD 5870 1GB.

The program is compiled for x64.


The numbers you just posted are consistent with the ones I had been getting in D3D and 64bit OpenCL: your card is about 25% faster and so are your results.

However, since then I have written a truly optimized prefix sum (and am nearly done with an extremely efficient radix sort) for D3D11. I'm getting 6000M uints/sec for small arrays (512k uints) and more than 7000M for arrays of 4M elements.

I think the OpenCL drivers just suck. I've got an optimized prefix sum and have nearly finished an optimized radix sort for D3D11. The prefix sum is doing 6000M uints/sec for a 1 million element array, and 7000M+ uints/sec for 4 million element arrays.

Also, that code I posted is pretty stupid, because the first pass outputs the scanned array to the UAV. The card is very much write bandwidth limited. In my current code (it's a mess of macros so I won't paste it here yet) I only write on the second pass.

Have you run any memory bandwidth tests? It seems my HD5850 does 100GB/s read and 42GB/s write. They are concurrent though, so if you read and write in the same shader the effective bandwidth is 42GB/s.

apollo_maverick
Journeyman III


BarnacleJunior, I've run the memory bandwidth test code you posted in this thread, and I got a result of about 56GB/s.

jeff_golds
Staff


I ran the test on both 32- and 64-bit and am getting about 56 GB/s on a HD5870.  Note that since you are doing reads *and* writes you should count the total bandwidth used.  Thus, you're actually hitting 112 GB/s.
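The accounting here is simple enough to sketch. The function name and the byte counts below are hypothetical and only stand in for whatever the test actually measures; the one assumption is that the test reads and writes the same amount of data, as a copy does.

```python
def total_bandwidth_gbs(bytes_read: float, bytes_written: float, seconds: float) -> float:
    """Total memory traffic in GB/s, counting both reads and writes."""
    return (bytes_read + bytes_written) / seconds / 1e9


# A copy that moves 56 GB in one second reads 56 GB and writes 56 GB.
print(total_bandwidth_gbs(56e9, 56e9, 1.0))  # prints 112.0
```

Reporting only the bytes copied per second gives the 56 GB/s figure; counting both directions gives the 112 GB/s figure.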
