
ibird
Adept I

Global Memory BandWidth

I have a memory-bound kernel that reaches about 70 GB/s on the ATI 5870, very far from the 153 GB/s peak. All possible optimizations have been done regarding coalescing, occupancy, and workgroup size. (On a C2050 I reach ~120-130 GB/s.) The kernel uses 57 vGPRs, so 256 active threads, i.e. 4 wavefronts. (Enough to hide latency?)
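The step from 57 vGPRs to 4 wavefronts can be checked with a little arithmetic. The sketch below assumes a budget of 256 vector registers per work-item lane on one SIMD and 64 work-items per wavefront. Both constants are assumptions for illustration, not values taken from this thread, and the helper names are hypothetical.

```python
# Hedged sketch: how many wavefronts fit on one SIMD when vector registers
# are the limiting resource. Both constants below are assumptions.
REGISTERS_PER_LANE = 256   # assumed register budget per work-item lane
WAVEFRONT_SIZE = 64        # assumed work-items per wavefront


def wavefronts_per_simd(vgprs_per_work_item: int) -> int:
    """Wavefronts that fit if every work-item needs this many vGPRs."""
    return REGISTERS_PER_LANE // vgprs_per_work_item


def active_work_items(vgprs_per_work_item: int) -> int:
    """Work-items resident on the SIMD at that register usage."""
    return wavefronts_per_simd(vgprs_per_work_item) * WAVEFRONT_SIZE


if __name__ == "__main__":
    for vgprs in (19, 57):
        print(vgprs, wavefronts_per_simd(vgprs), active_work_items(vgprs))
```

Under these assumptions, 57 vGPRs gives 4 wavefronts and 256 active work-items, which matches the figures quoted in the post.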

On the other hand, I have tested my 5870 with the AMD benchmark GlobalMemoryBandwidth, where I get 77 GB/s for uncached reads. Reading the benchmark kernel for the uncached (but coalesced) read, I can say that its read method is similar to the one in my kernel.

As far as I know, the uncached read should measure the bandwidth of the GDDR5, so I should get something nearer to 153 GB/s. So the question is: can someone with a 5870 confirm this number, or is there something I am missing?
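For reference, the 153 GB/s figure follows from the memory configuration. The sketch below assumes a 256-bit memory bus and an effective GDDR5 data rate of 4.8 Gbit/s per pin for the HD 5870. These numbers are not stated in the thread and should be treated as assumptions, and the function name is hypothetical.

```python
# Hedged sketch: theoretical peak bandwidth and the fraction actually reached.
def peak_bandwidth_gb_s(bus_width_bits: int, data_rate_gbps_per_pin: float) -> float:
    """Peak bandwidth in GB/s = bytes per transfer * transfers per second."""
    return bus_width_bits / 8 * data_rate_gbps_per_pin


if __name__ == "__main__":
    peak = peak_bandwidth_gb_s(256, 4.8)   # assumed HD 5870 configuration
    measured = 77.0                        # uncached read result quoted above
    print(f"peak: {peak:.1f} GB/s, reached: {measured / peak:.0%}")
```

With these assumptions, the uncached read reaches only about half of the theoretical peak.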

0 Likes
8 Replies
ibird
Adept I

LDS usage is only 3 KB and the workgroup size is 128.

0 Likes

It would be good if you could share your approach. I guess all the channels must be busy to get 153 GB/s. Profiling may also help you: check whether you are getting any channel/bank conflicts.

0 Likes

Sure

Accesses are of type float2 or float4, which means that each wavefront accesses all adjacent bytes of 2 or 4 (different) channels.

(The attached code is not simple to understand and is hard to isolate because of its structure, so you need to have faith: the accesses are adjacent because of the geometric properties of the tables and the storage layout used.)
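The "2 or 4 channels" claim can be illustrated with a simple address model. The sketch below assumes 8 memory channels interleaved at a 256-byte granularity and a 64-wide wavefront reading consecutive elements. These values are assumptions chosen for illustration, not figures given in the thread, and the function name is hypothetical.

```python
# Hedged sketch: which memory channels one wavefront touches when its
# work-items read consecutive elements. All constants are assumptions.
CHANNEL_INTERLEAVE_BYTES = 256   # assumed bytes mapped to a channel before switching
NUM_CHANNELS = 8                 # assumed number of memory channels
WAVEFRONT_SIZE = 64              # assumed work-items per wavefront


def channels_touched(base_address: int, bytes_per_work_item: int) -> list:
    """Channels covered by one contiguous wavefront-wide read."""
    last_byte = base_address + bytes_per_work_item * WAVEFRONT_SIZE - 1
    first_block = base_address // CHANNEL_INTERLEAVE_BYTES
    last_block = last_byte // CHANNEL_INTERLEAVE_BYTES
    return sorted({block % NUM_CHANNELS for block in range(first_block, last_block + 1)})


if __name__ == "__main__":
    print("float2:", channels_touched(0, 8))    # 64 * 8 bytes  = 512 bytes
    print("float4:", channels_touched(0, 16))   # 64 * 16 bytes = 1024 bytes
```

In this model an aligned float2 read spans 2 channels and an aligned float4 read spans 4. Whether all channels stay busy then depends on how the base addresses of concurrently running wavefronts are distributed.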

On the other hand, from other observations the accesses can be assumed to be uncached and linear from the wavefront's point of view, just like the linear uncached test in the AMD GlobalMemoryBandwidth sample code.

Now, the AMD GlobalMemoryBandwidth sample for linear uncached reads (DATATYPE = float4) achieves only 77 GB/s (similar to my code), which does not sound right to me.

I am expecting 130 GB/s or something similar.

If I understand why this test performs so poorly, I can fix the problem in my code.

So the question is:

Why does the AMD GlobalMemoryBandwidth linear uncached test not reach 130 GB/s on the AMD 5870?

0 Likes

My own tests do much better.  On HD5870, I have hit 143 GB/s uncached read speed.  Even on HD6870, I can hit 121 GB/s out of a peak of 134 GB/s.

What does the profiler tell you?  Maybe you are getting bank and/or channel collisions.

0 Likes

Global Memory Read

AccessType      : single

VectorElements  : 4

Bandwidth       : 1061.48 GB/s

Global Memory Read

AccessType      : linear

VectorElements  : 4

Bandwidth       : 618.043 GB/s

Global Memory Read

AccessType      : linear(uncached)

VectorElements  : 4

Bandwidth       : 77.0595 GB/s

Global Memory Write

AccessType      : linear

VectorElements  : 4

Bandwidth       : 153.615 GB/s

# ProfilerVersion=2.4.1314

# Application=/opt/AMDAPP/samples/opencl/bin/x86_64/GlobalMemoryBandwidth

# ApplicationArgs=

# Device Cypress PlatformVendor=Advanced Micro Devices, Inc.

# Device Cypress PlatformName=AMD Accelerated Parallel Processing

# Device Cypress PlatformVersion=OpenCL 1.1 AMD-APP (831.4)

# Device Cypress CLDriverVersion=CAL 1.4.1646

# Device Cypress CLRuntimeVersion=OpenCL 1.1 AMD-APP (831.4)

# Device Cypress NumberAppAddressBits=32

# Device Intel(R) Core(TM)2 Quad  CPU   Q8200  @ 2.33GHz PlatformVendor=Advanced Micro Devices, Inc.

# Device Intel(R) Core(TM)2 Quad  CPU   Q8200  @ 2.33GHz PlatformName=AMD Accelerated Parallel Processing

# Device Intel(R) Core(TM)2 Quad  CPU   Q8200  @ 2.33GHz PlatformVersion=OpenCL 1.1 AMD-APP (831.4)

# Device Intel(R) Core(TM)2 Quad  CPU   Q8200  @ 2.33GHz CLDriverVersion=2.0

# Device Intel(R) Core(TM)2 Quad  CPU   Q8200  @ 2.33GHz CLRuntimeVersion=OpenCL 1.1 AMD-APP (831.4)

# Device Intel(R) Core(TM)2 Quad  CPU   Q8200  @ 2.33GHz NumberAppAddressBits=64

# OS=Ubuntu 11.04 \n \l

Method , ExecutionOrder , ThreadID , CallIndex , GlobalWorkSize , WorkGroupSize , Time , LocalMemSize , VGPRs , SGPRs , ScratchRegs , FCStacks , Wavefronts , ALUInsts , FetchInsts , WriteInsts , LDSFetchInsts , LDSWriteInsts , ALUBusy , ALUFetchRatio , ALUPacking , FetchSize , CacheHit , FetchUnitBusy , FetchUnitStalled , WriteUnitStalled , FastPath , CompletePath , PathUtilization , LDSBankConflict

read_linear_uncached__k3_Cypress1 ,  2638 , 2853 , 13297 , {1048576       1       1} , {  256     1     1} ,         6.97622 ,           0 ,     8 , NA ,     0 ,     0 ,     16384.00 ,        46.00 ,        32.00 ,         1.00 ,         0.00 ,         0.00 ,         2.54 ,         1.44 ,        86.52 ,    524288.00 ,         0.00 ,        23.04 ,        17.30 ,        54.48 ,     16385.00 ,         0.00 ,       100.00 ,         0.00

This is from my kernel (cache hit is only 9%), with similar percentages for FetchUnitBusy, FetchUnitStalled, and WriteUnitStalled (3K local memory size). The profiler inside the APP SDK reports 19 (????) VGPRs; I remember 57 from the other profiler (the one downloaded separately). 57 is more realistic, counting the active wavefronts.

DslashKernelEO__k5_Cypress1 ,   436 , 3514 , 14644 , {  65536       1       1} , {  128     1     1} , 0.49078 ,        3584 ,    19 , NA ,     0 ,     0 ,      1024.00 ,       235.00 ,        65.00 ,         6.00 ,         3.00 , 7.00 ,        10.93 ,         3.62 ,        80.09 ,     37885.19 , 9.69 ,        27.73 ,        16.99 ,        49.95 ,      3072.25 ,         0.00 ,       100.00 ,         0.00
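The two profiler rows above can be turned into a fetch bandwidth estimate. The sketch below assumes that the FetchSize column is reported in kilobytes and the Time column in milliseconds. Those units are an assumption about the profiler output, and the function name is hypothetical.

```python
# Hedged sketch: fetch bandwidth implied by the profiler counters.
# Assumes FetchSize is in kilobytes and Time is in milliseconds.
def fetch_bandwidth_gb_s(fetch_size_kb: float, time_ms: float) -> float:
    """Bytes fetched from memory per second, expressed in GB/s."""
    bytes_fetched = fetch_size_kb * 1024
    seconds = time_ms * 1e-3
    return bytes_fetched / seconds / 1e9


if __name__ == "__main__":
    # Values copied from the two profiler rows above.
    benchmark = fetch_bandwidth_gb_s(524288.00, 6.97622)   # read_linear_uncached
    kernel = fetch_bandwidth_gb_s(37885.19, 0.49078)       # DslashKernelEO
    print(f"read_linear_uncached: {benchmark:.1f} GB/s")
    print(f"DslashKernelEO:       {kernel:.1f} GB/s")
```

Under these unit assumptions, the benchmark row comes out close to the 77 GB/s that the sample itself reports, and the Dslash kernel lands in the same range. The estimate counts fetches only, so a kernel that also writes moves more data than this figure shows.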

This does not sound good.

0 Likes

Possibly related: I recently noticed that running MemoryOptimizations (from the AMD APP samples) on my GPU produces entirely different results under Windows 7 than under Linux (RHEL 6.2); see attachments.

@ibird: perhaps you could try compiling and running GlobalMemoryBandwidth under Windows to see if it makes any difference?

0 Likes

This is the result on Windows: linear uncached is a little better at 80 GB/s, but still far from 140 GB/s.

Platform 0 : Advanced Micro Devices, Inc.

Platform found : Advanced Micro Devices, Inc.

Selected Platform Vendor : Advanced Micro Devices, Inc.

Device 0 : Cypress Device ID is 0000000002054400

Build Options are : -D DATATYPE=float4 -D OFFSET=16384

Global Memory Read

AccessType      : single

VectorElements  : 4

Bandwidth       : 1061.29 GB/s

Global Memory Read

AccessType      : linear

VectorElements  : 4

Bandwidth       : 617.82 GB/s

Global Memory Read

AccessType      : linear(uncached)

VectorElements  : 4

Bandwidth       : 80.9219 GB/s

Global Memory Write

AccessType      : linear

VectorElements  : 4

Bandwidth       : 151.803 GB/s

0 Likes
elizabethswell
Journeyman III

I like your approach because you are working well now a days.....

0 Likes