
BarnacleJunior
Journeyman III

cheapo GT120M outperforms HD5850 in prefix sum

Originally posted by: jeff_golds I ran the test on both 32- and 64-bit and am getting about 56 GB/s on a HD5870.  Note that since you are doing reads *and* writes you should count the total bandwidth used.  Thus, you're actually hitting 112 GB/s.

 

The reason for the performance was a bug in the Catalyst driver.  As for counting bandwidth (I do most of my work in D3D11, which seems to have more reliable drivers, though far from perfect): the read bandwidth on the HD5850 is 111 GB/s and the write bandwidth is 42 GB/s.  The two don't appear to combine as a simple sum.  I can achieve 111 GB/s of reads even when writing a significant amount of data.  I don't know for sure, but I think reads and writes operate more or less independently, and you are bottlenecked by the slower one.  I've managed to write a radix sort that does 329M pairs/sec and 408M uints/sec by exploiting wavefront concurrency, avoiding divergent branching, and being very parsimonious with global writes.  Reads are way, way cheaper.
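
A rough sketch of the "bottlenecked by the slower one" idea, for illustration only. It assumes reads and writes overlap completely, which is a guess rather than anything confirmed about the hardware. The function names are hypothetical, and the two rates are the HD5850 figures quoted above.

```c
#include <assert.h>

/* Peak rates quoted in the post above (HD5850), in GB/s. */
#define READ_GBPS  111.0
#define WRITE_GBPS 42.0

/* Assumed model: reads and writes proceed independently, so one pass
 * takes as long as the slower of the two streams. Returns seconds. */
double pass_seconds(double read_gb, double write_gb)
{
    assert(read_gb >= 0.0 && write_gb >= 0.0);
    double read_s  = read_gb / READ_GBPS;
    double write_s = write_gb / WRITE_GBPS;
    return read_s > write_s ? read_s : write_s;
}

/* Returns 1 if the pass is limited by writes under this model, else 0. */
int is_write_bound(double read_gb, double write_gb)
{
    return write_gb / WRITE_GBPS > read_gb / READ_GBPS;
}
```

Under this model a straight copy (equal read and write volume) is write-bound, and a pass only becomes read-bound once it reads more than about 2.6 times (111/42) as many bytes as it writes.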

jeff_golds
Staff


Reads and writes both use bandwidth.  Thus, when you are doing both, you need to count the total bandwidth used.

      double elapsed = period * (end.QuadPart - begin.QuadPart);

      double velocity = NumElements * (NumLoops / elapsed);
      printf("GPU velocity: %1.3fGB/s\n", velocity * 4 / (1<< 30));

You need to add a "*2" someplace in your equation to report the bandwidth used, otherwise you are misrepresenting how much bandwidth you are using.  Maybe something like this:

      int readBytes = sizeof(cl_uint);   // bytes read per element
      int writeBytes = sizeof(cl_uint);  // bytes written per element

      double elapsed = period * (end.QuadPart - begin.QuadPart);

      double velocity = NumElements * (NumLoops / elapsed);
      // Count read and write traffic together
      printf("GPU velocity: %1.3fGB/s\n", velocity * (readBytes + writeBytes) / (1 << 30));

You can't expect to achieve the same write bandwidth while doing reads and writes as you achieve with writes only.
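
For anyone who wants to reuse the calculation outside the benchmark, here is a minimal standalone sketch of the same accounting. The function name and parameters are made up for illustration; it just counts bytes read plus bytes written per element, as suggested above, and reports GiB/s (dividing by 1 << 30, like the original printf).

```c
#include <assert.h>
#include <stddef.h>

/* Total memory traffic in GiB/s for a kernel that reads read_bytes and
 * writes write_bytes per element. Hypothetical helper, not part of the
 * original benchmark. */
double total_bandwidth_gib(double num_elements, double num_loops,
                           double elapsed_sec,
                           size_t read_bytes, size_t write_bytes)
{
    assert(elapsed_sec > 0.0);
    double elements_per_sec = num_elements * (num_loops / elapsed_sec);
    double bytes_per_sec = elements_per_sec * (double)(read_bytes + write_bytes);
    return bytes_per_sec / (double)(1u << 30);
}
```

With 4 bytes read and 4 bytes written per element, a run that the original formula reports as 56 GB/s comes out as 112 GB/s here, which matches the correction quoted at the top of the thread.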

hocheng
Staff


I'm interested in this topic too.  I tested on my machine (HD 5870, i7 2.67 GHz quad-core CPU, Windows 7 64-bit) and got:
GPU velocity: 4704.873M
GPU velocity: 4929.441M
GPU velocity: 4930.918M
GPU velocity: 4941.142M
GPU velocity: 4932.678M
GPU velocity: 4937.237M
GPU velocity: 4942.717M
GPU velocity: 4931.638M
GPU velocity: 4906.104M
GPU velocity: 4911.905M

BTW, there was almost no change when I used the atomics kernel from MicahVillmow instead.

noxnet
Journeyman III


I ran the benchmarks on an HD5750 with the following results:

Copy:

32-Bit
GPU velocity: 36,518GB/s
GPU velocity: 47,213GB/s
GPU velocity: 46,405GB/s
GPU velocity: 47,128GB/s
GPU velocity: 47,504GB/s
GPU velocity: 46,260GB/s
GPU velocity: 47,171GB/s
GPU velocity: 47,129GB/s
GPU velocity: 47,339GB/s
GPU velocity: 46,899GB/s

64-Bit
GPU velocity: 17,410GB/s
GPU velocity: 26,471GB/s
GPU velocity: 25,673GB/s
GPU velocity: 26,436GB/s
GPU velocity: 26,473GB/s
GPU velocity: 26,417GB/s
GPU velocity: 26,516GB/s
GPU velocity: 26,512GB/s
GPU velocity: 26,138GB/s
GPU velocity: 26,556GB/s

Scan:
32-Bit
GPU velocity: 116,712M
GPU velocity: 121,473M
GPU velocity: 121,298M
GPU velocity: 121,555M
GPU velocity: 121,409M
GPU velocity: 121,598M
GPU velocity: 120,943M
GPU velocity: 119,926M
GPU velocity: 121,391M
GPU velocity: 120,498M

64-Bit
GPU velocity: 95,510M
GPU velocity: 121,505M
GPU velocity: 121,431M
GPU velocity: 121,573M
GPU velocity: 119,621M
GPU velocity: 121,465M
GPU velocity: 121,393M
GPU velocity: 120,325M
GPU velocity: 120,980M
GPU velocity: 120,731M

When running the SDK sample BinomialOption with the suggested parameters (BinomialOption.exe -x 1048576 -i 10 -q -t),
my screen goes black and then I get a Windows message that the display driver has stopped working correctly
and has recovered. When running with other parameters (BinomialOption.exe -x 524288 -i 10 -q -t), I get the following result:

Executing kernel for 10 iterations
-------------------------------------------
Option Samples           Time(sec)                KernelTime(sec)          Options/sec

524288                  8.58333                  8.0323                  61082.1

Any ideas why the performance of the HD5750 is so bad on my system?


System:
----------------------------------------------------------------
Windows 7 64-Bit
Stream SDK 2.0.1 64-Bit
Visual Studio VC++ Express
Catalyst 10.3


OpenCL Query results:
----------------------------------------------------------------
Platform Name:   ATI Stream
Platform Version:  OpenCL 1.0 ATI-Stream-v2.0.1
Vendor:   Advanced Micro Devices, Inc.
Device Name:   Juniper

Profile:   FULL_PROFILE
Supported Extensions:  cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics

Local Mem Type (Local=1, Global=2): 1
Local Mem Size (KB):    32
Global Mem Size (MB):   256
Global Mem Cache Size (Bytes):  0
Clock Frequency (MHz):   700
Max Work Group Size:   256
Address Bits:    32
Max Compute Units:   9

Vector type width for: char =  16
Vector type width for: short =  8
Vector type width for: int =  4
Vector type width for: long =  2
Vector type width for: float =  4
Vector type width for: double =  0

noxnet
Journeyman III


Very strange.

When running the same program (scan) on the same system, but with an ATI HD5450 instead of the HD5750, I got an average GPU velocity of 143M (64-bit).

So to me it looks like the HD5450 is performing correctly. Since an HD 58xx processes about 3000M uint/sec, the HD5450 is about 20 times slower.

Considering only the number of stream processors (80 vs. 1600), the GPU velocity of 143M seems quite OK.
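
The estimate above can be written out as a small sketch. It scales a reference throughput purely by stream processor count and ignores clock speed, memory bandwidth and everything else, so it is only a sanity check. The function name is hypothetical.

```c
#include <assert.h>

/* Scale a reference throughput by the ratio of stream processor counts.
 * Deliberately crude: clocks and memory bandwidth are ignored. */
double scaled_rate(double ref_rate, int ref_sps, int sps)
{
    assert(ref_sps > 0 && sps >= 0);
    return ref_rate * (double)sps / (double)ref_sps;
}
```

With the figures above, scaled_rate(3000.0, 1600, 80) gives 150M uint/sec, close to the 143M measured on the HD5450. Assuming the HD5750 has 720 stream processors (9 compute units, as in the query output earlier in the thread), the same estimate gives 1350M, roughly ten times the ~121M actually measured on that card.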

noxnet
Journeyman III


Can anyone please try these benchmarks on a 5750 or similar?

The benchmarks in this thread can be used by simply copying and pasting them.
