I have a memory-bound kernel that reaches about 70 GB/s on the ATI 5870, very far from the 153 GB/s peak. All possible optimizations regarding coalescing, occupancy and workgroup size have been done. (On a C2050 I reach ~120-130 GB/s.) The kernel uses 57 VGPRs, so there are 256 active threads, i.e. 4 wavefronts (enough to hide latency?).
On the other hand, I have tested my 5870 with the AMD GlobalMemoryBandwidth benchmark, where I get 77 GB/s for uncached reads. Reading the benchmark kernel for uncached (but coalesced) reads, I can say that its read method is similar to the one in my kernel. As far as I know, the uncached read test should measure the bandwidth of the GDDR5, so I should get something much nearer to 153 GB/s. So the question is: can someone with a 5870 confirm this number, or is there something I am missing?
LDS usage is only 3 KB and the workgroup size is 128.
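For reference, the 153 GB/s figure can be reproduced with a bit of arithmetic. This is only a sketch: the 1200 MHz memory clock, the 256-bit bus and the 4 transfers per clock for GDDR5 are the published HD 5870 specs as I remember them, not values taken from this thread, and the function name is made up for illustration.

```python
# Sketch: theoretical peak DRAM bandwidth from clock and bus width.
# Assumptions (not from the thread): HD 5870 memory clock 1200 MHz,
# 256-bit bus, GDDR5 moving 4 transfers per pin per memory clock.

def peak_bandwidth_gbs(mem_clock_mhz, bus_width_bits, transfers_per_clock=4):
    """Return peak bandwidth in GB/s (1 GB = 1e9 bytes)."""
    bytes_per_transfer = bus_width_bits / 8
    return mem_clock_mhz * 1e6 * transfers_per_clock * bytes_per_transfer / 1e9

hd5870_peak = peak_bandwidth_gbs(1200, 256)
print(f"HD 5870 peak: {hd5870_peak:.1f} GB/s")

# Measured numbers quoted in the post above, as a fraction of that peak.
for label, measured in (("my kernel", 70.0), ("uncached read test", 77.0)):
    print(f"{label}: {measured:.1f} GB/s = {100 * measured / hd5870_peak:.0f}% of peak")
```

Under those assumptions both measurements sit at roughly half of the theoretical peak.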
It would be good if you could share your approach. I guess all the channels must be busy to get 153 GB/s. Profiling may also help: check whether you are getting any channel/bank conflicts.
Sure
Accesses are of type float2 or float4, which means that each wavefront accesses all the adjacent bytes of 2 or 4 (different) channels.
(The attached code is not simple to understand and is hard to isolate because of its structure, so you will need to have faith: accesses are adjacent because of the geometric properties of the tables and of the storage layout used.)
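The "2 or 4 channels" claim can be checked with a small model. This is a rough sketch, assuming 8 memory channels interleaved every 256 bytes (which is how I recall the 5870 being described in AMD's programming guide) and a wavefront of 64 work-items reading consecutive elements. The constants and the helper name are illustrative, not taken from the attached code.

```python
# Sketch: which memory channels does one wavefront touch on a linear read?
# Assumptions: 8 channels, channel switches every 256 bytes of address,
# 64 work-items per wavefront, each reading one consecutive element.

CHANNEL_INTERLEAVE_BYTES = 256  # assumed interleave granularity
NUM_CHANNELS = 8                # assumed channel count
WAVEFRONT_SIZE = 64

def channels_touched(base_addr, elem_bytes):
    """Return the sorted list of channels hit by one wavefront."""
    channels = set()
    for lane in range(WAVEFRONT_SIZE):
        addr = base_addr + lane * elem_bytes
        channels.add((addr // CHANNEL_INTERLEAVE_BYTES) % NUM_CHANNELS)
    return sorted(channels)

print("float2:", channels_touched(0, 8))   # 64 * 8 bytes  = 512 bytes
print("float4:", channels_touched(0, 16))  # 64 * 16 bytes = 1024 bytes
```

In this model a single wavefront never covers all channels by itself, so keeping every channel busy depends on how concurrently running wavefronts are spread over the address space.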
On the other hand, from other observations the accesses can be assumed to be uncached and linear from the wavefront's point of view, just like the linear uncached test in the AMD GlobalMemoryBandwidth sample code.
Now, the AMD GlobalMemoryBandwidth linear uncached test (DATATYPE = float4) reaches only 77 GB/s (similar to my code), which does not sound right to me.
I am expecting 130 GB/s or something similar.
If I understand why this test performs so poorly, I can fix the problem in my code.
So the question is:
Why does the AMD GlobalMemoryBandwidth linear uncached test on the AMD 5870 not reach 130 GB/s?
My own tests do much better. On HD5870, I have hit 143 GB/s uncached read speed. Even on HD6870, I can hit 121 GB/s out of a peak of 134 GB/s.
What does the profiler tell you? Maybe you are getting bank and/or channel collisions.
Global Memory Read
AccessType : single
VectorElements : 4
Bandwidth : 1061.48 GB/s
Global Memory Read
AccessType : linear
VectorElements : 4
Bandwidth : 618.043 GB/s
Global Memory Read
AccessType : linear(uncached)
VectorElements : 4
Bandwidth : 77.0595 GB/s
Global Memory Write
AccessType : linear
VectorElements : 4
Bandwidth : 153.615 GB/s
# ProfilerVersion=2.4.1314
# Application=/opt/AMDAPP/samples/opencl/bin/x86_64/GlobalMemoryBandwidth
# ApplicationArgs=
# Device Cypress PlatformVendor=Advanced Micro Devices, Inc.
# Device Cypress PlatformName=AMD Accelerated Parallel Processing
# Device Cypress PlatformVersion=OpenCL 1.1 AMD-APP (831.4)
# Device Cypress CLDriverVersion=CAL 1.4.1646
# Device Cypress CLRuntimeVersion=OpenCL 1.1 AMD-APP (831.4)
# Device Cypress NumberAppAddressBits=32
# Device Intel(R) Core(TM)2 Quad CPU Q8200 @ 2.33GHz PlatformVendor=Advanced Micro Devices, Inc.
# Device Intel(R) Core(TM)2 Quad CPU Q8200 @ 2.33GHz PlatformName=AMD Accelerated Parallel Processing
# Device Intel(R) Core(TM)2 Quad CPU Q8200 @ 2.33GHz PlatformVersion=OpenCL 1.1 AMD-APP (831.4)
# Device Intel(R) Core(TM)2 Quad CPU Q8200 @ 2.33GHz CLDriverVersion=2.0
# Device Intel(R) Core(TM)2 Quad CPU Q8200 @ 2.33GHz CLRuntimeVersion=OpenCL 1.1 AMD-APP (831.4)
# Device Intel(R) Core(TM)2 Quad CPU Q8200 @ 2.33GHz NumberAppAddressBits=64
# OS=Ubuntu 11.04 \n \l
Method , ExecutionOrder , ThreadID , CallIndex , GlobalWorkSize , WorkGroupSize , Time , LocalMemSize , VGPRs , SGPRs , ScratchRegs , FCStacks , Wavefronts , ALUInsts , FetchInsts , WriteInsts , LDSFetchInsts , LDSWriteInsts , ALUBusy , ALUFetchRatio , ALUPacking , FetchSize , CacheHit , FetchUnitBusy , FetchUnitStalled , WriteUnitStalled , FastPath , CompletePath , PathUtilization , LDSBankConflict
read_linear_uncached__k3_Cypress1 , 2638 , 2853 , 13297 , {1048576 1 1} , { 256 1 1} , 6.97622 , 0 , 8 , NA , 0 , 0 , 16384.00 , 46.00 , 32.00 , 1.00 , 0.00 , 0.00 , 2.54 , 1.44 , 86.52 , 524288.00 , 0.00 , 23.04 , 17.30 , 54.48 , 16385.00 , 0.00 , 100.00 , 0.00
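As a sanity check, the profiler row above can be turned back into a bandwidth figure. This sketch assumes that FetchSize is reported in kilobytes and Time in milliseconds, which is my reading of the profiler counters rather than something stated in the thread. It also assumes that each fetch instruction reads one float4 (16 bytes) per work-item.

```python
# Sketch: effective read bandwidth from the read_linear_uncached profiler row.
# Assumptions: FetchSize is in KB (1024 bytes), Time is in ms,
# and every fetch instruction reads one float4 (16 bytes) per work-item.

global_work_size = 1048576
fetch_insts = 32.0
fetch_size_kb = 524288.00
time_ms = 6.97622

def fetch_bandwidth_gbs(size_kb, t_ms):
    """Bytes fetched divided by kernel time, in GB/s (1 GB = 1e9 bytes)."""
    return size_kb * 1024 / (t_ms * 1e-3) / 1e9

# Cross-check: bytes implied by the instruction count vs. the FetchSize counter.
bytes_from_insts = global_work_size * fetch_insts * 16
bytes_from_counter = fetch_size_kb * 1024
print("counters agree:", bytes_from_insts == bytes_from_counter)

bw = fetch_bandwidth_gbs(fetch_size_kb, time_ms)
print(f"effective read bandwidth: {bw:.1f} GB/s")
```

With these assumptions the result lands at about 77 GB/s, in line with what the benchmark itself prints. That suggests the benchmark number reflects the real kernel time rather than a timing artifact.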
This is from my kernel (mostly uncached, cache hit only about 9%), with similar percentages for FetchUnitBusy, FetchUnitStalled and WriteUnitStalled (3K local memory size). The profiler inside the APP SDK reports 19 (????) VGPRs; I remember 57 from the other profiler (the one downloaded separately). 57 is more realistic given the number of active wavefronts.
DslashKernelEO__k5_Cypress1 , 436 , 3514 , 14644 , { 65536 1 1} , { 128 1 1} , 0.49078 , 3584 , 19 , NA , 0 , 0 , 1024.00 , 235.00 , 65.00 , 6.00 , 3.00 , 7.00 , 10.93 , 3.62 , 80.09 , 37885.19 , 9.69 , 27.73 , 16.99 , 49.95 , 3072.25 , 0.00 , 100.00 , 0.00
That does not sound good.
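The 19 vs 57 VGPR discrepancy matters because the register count limits how many wavefronts can be resident. Here is a minimal sketch, assuming a budget of about 256 vector registers per lane on each SIMD. That figure is an assumption: it is consistent with the "57 VGPRs, 4 wavefronts" figure quoted earlier in the thread, but the exact budget may be slightly lower, and other hardware limits are ignored here.

```python
# Sketch: register-limited wavefronts per SIMD.
# Assumption: roughly 256 vector registers per lane per SIMD; real hardware
# may reserve a few and also caps the wavefront count by other resources.

WAVEFRONT_SIZE = 64
VGPR_BUDGET_PER_LANE = 256  # assumed, see note above

def wavefronts_per_simd(vgprs_per_thread, budget=VGPR_BUDGET_PER_LANE):
    """Number of wavefronts whose registers fit on one SIMD."""
    return budget // vgprs_per_thread

for vgprs in (57, 19):
    waves = wavefronts_per_simd(vgprs)
    print(f"{vgprs} VGPRs: {waves} wavefronts, {waves * WAVEFRONT_SIZE} active threads")
```

If the kernel really used 19 VGPRs, this model would allow about three times as many resident wavefronts as with 57, so it is worth finding out which profiler is right.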
Possibly related: I recently noticed that running MemoryOptimizations (from the AMD APP samples) on my GPU produces entirely different results under Windows 7 than under Linux (RHEL 6.2), see attachments.
@ ibird: perhaps you could try compiling and running GlobalMemoryBandwidth under Windows to see if it makes any difference?
This is the result on Windows: linear uncached is a little better at 80 GB/s, but still far from 140 GB/s.
Platform 0 : Advanced Micro Devices, Inc.
Platform found : Advanced Micro Devices, Inc.
Selected Platform Vendor : Advanced Micro Devices, Inc.
Device 0 : Cypress Device ID is 0000000002054400
Build Options are : -D DATATYPE=float4 -D OFFSET=16384
Global Memory Read
AccessType : single
VectorElements : 4
Bandwidth : 1061.29 GB/s
Global Memory Read
AccessType : linear
VectorElements : 4
Bandwidth : 617.82 GB/s
Global Memory Read
AccessType : linear(uncached)
VectorElements : 4
Bandwidth : 80.9219 GB/s
Global Memory Write
AccessType : linear
VectorElements : 4
Bandwidth : 151.803 GB/s