
dertomas
Adept I

Max Read Bandwidth of Trinity APU

Hi,

I have a bandwidth question.

I have an A10-5800K Trinity APU and I use OpenCL on its integrated GPU, the Radeon HD 7660D.

The board is an Asus F2 A85-M Pro with 4x 8 GB 1866 MHz RAM; the OS is Windows 7.

I ran the AMD BufferBandwidth OpenCL sample on the GPU.

There I got about 25 GB/s for reading. All fine so far.

Then I changed the BufferBandwidth code to launch not nThreads (a smaller, calculated number) but maxThreads (the maximum number of threads for the input data). I also increased the input data from 32 MB to 128 MB.

Now I get 70 to 80 GB/s for reading, which sounds too high.

My first back-of-the-envelope calculation was:

1.866 GT/s (memory transfer rate) * 256 bit (Radeon Memory Bus) * 2 (dual channel) ≈ 111 GiB/s

With that figure, 70 to 80 GB/s sounds plausible. But ordinary DDR3 has only a 64-bit bus per channel; with dual channel that is 128 bits in total, which works out to only about 30 GB/s.
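To make the arithmetic explicit, here is the same calculation as a tiny program; the 256-bit "Radeon Memory Bus" width is just my assumption from above, not a confirmed figure for Trinity:

// Peak-bandwidth arithmetic for DDR3-1866, spelled out for the two
// bus-width assumptions above (the 256-bit figure is an assumption,
// not a confirmed spec).
#include <cstdio>

// transfer rate in GT/s, per-channel bus width in bits, channel count
static double peakGBps(double gtps, int bitsPerChannel, int channels)
{
    return gtps * (bitsPerChannel / 8.0) * channels;
}

int main()
{
    const double gibPerGB = 1e9 / (1024.0 * 1024.0 * 1024.0);
    const double wide     = peakGBps(1.866, 256, 2);   // assumed Radeon Memory Bus
    const double narrow   = peakGBps(1.866,  64, 2);   // ordinary 64-bit DDR3 channels

    std::printf("256 bit x 2 channels: %5.1f GB/s (%5.1f GiB/s)\n", wide,   wide   * gibPerGB);
    std::printf(" 64 bit x 2 channels: %5.1f GB/s (%5.1f GiB/s)\n", narrow, narrow * gibPerGB);
    return 0;   // prints roughly 119.4 GB/s (111.2 GiB/s) and 29.9 GB/s (27.8 GiB/s)
}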

Can anyone clarify the bus widths involved? Can an integrated GPU really deliver 80 GB/s, or is there a bug in the AMD OpenCL BufferBandwidth test? Are there any resources for further reading?

Thanks a lot!

0 Likes
1 Solution

15 Replies
himanshu_gautam
Grandmaster

Can you publish the buffer bandwidth output here?

0 Likes

>BufferBandwidth.exe -nb 134217728

Platform found : Advanced Micro Devices, Inc.

Device  0            Devastator

Build:               _WINxx release

GPU work items:      8388608

Buffer size:         134217728

CPU workers:         1

Timing loops:        20

Repeats:             1

Kernel loops:        20

inputBuffer:         CL_MEM_READ_ONLY

outputBuffer:        CL_MEM_WRITE_ONLY

Host baseline (naive):

Timer resolution     269.93  ns

Page fault           856.44  ns

Barrier speed        139.41  ns

CPU read             4.58 GB/s

memcpy()             4.93 GB/s

memset(,1,)          7.62 GB/s

memset(,0,)          7.59 GB/s

AVERAGES (over loops 2 - 19, use -l for complete log)

--------

1. Host mapped write to inputBuffer

      clEnqueueMapBuffer(WRITE):  0.019561 s [     6.86 GB/s ]

                       memset():  0.017087 s       7.85 GB/s

      clEnqueueUnmapMemObject():  0.020841 s [     6.44 GB/s ]

2. GPU kernel read of inputBuffer

       clEnqueueNDRangeKernel():  0.038843 s      69.11 GB/s

                 verification ok

3. GPU kernel write to outputBuffer

       clEnqueueNDRangeKernel():  0.085660 s      31.34 GB/s

4. Host mapped read of outputBuffer

       clEnqueueMapBuffer(READ):  0.018308 s [     7.33 GB/s ]

                       CPU read:  0.029805 s       4.50 GB/s

                 verification ok

      clEnqueueUnmapMemObject():  0.000079 s [  1702.97 GB/s ]

Passed!

0 Likes

Just a few cents of knowledge sharing:

As far as CPU main-memory bandwidth goes, the theoretical bandwidth cannot be reached by a single thread.

Multiple threads may be needed, depending on how many memory links there are.

If you have 2 memory links, you need to load both of them; you also need at least 2 DIMMs to realize that bandwidth.

On APUs, system memory is shared between the CPU and the GPU. You can find some good explanations in the slides linked below.

Check this out: http://amddevcentral.com/afds/assets/presentations/1004_final.pdf
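As a rough illustration of the single-thread point, a minimal multi-threaded read benchmark could look like the sketch below (plain C++11, nothing AMD-specific; the buffer size and thread counts are arbitrary, and the actual scaling depends on the memory controller and DIMM population):

#include <chrono>
#include <cstdint>
#include <cstdio>
#include <numeric>
#include <thread>
#include <vector>

// Sum a slice of the buffer so the reads cannot be optimized away.
static void readSlice(const uint64_t* p, size_t n, uint64_t* out)
{
    uint64_t acc = 0;
    for (size_t i = 0; i < n; ++i)
        acc += p[i];
    *out = acc;
}

int main()
{
    const size_t bytes = 512ull << 20;                     // 512 MB working set
    std::vector<uint64_t> buf(bytes / sizeof(uint64_t), 1);

    for (unsigned nThreads : {1u, 2u, 4u})
    {
        std::vector<std::thread> pool;
        std::vector<uint64_t> sums(nThreads);
        const size_t slice = buf.size() / nThreads;

        auto t0 = std::chrono::steady_clock::now();
        for (unsigned t = 0; t < nThreads; ++t)
            pool.emplace_back(readSlice, buf.data() + t * slice, slice, &sums[t]);
        for (auto& th : pool)
            th.join();
        auto t1 = std::chrono::steady_clock::now();

        const double secs = std::chrono::duration<double>(t1 - t0).count();
        std::printf("%u thread(s): %.2f GB/s (checksum %llu)\n", nThreads, bytes / secs / 1e9,
                    (unsigned long long)std::accumulate(sums.begin(), sums.end(), 0ull));
    }
    return 0;
}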


As for the inflated numbers: you say you have changed the BufferBandwidth code. Have you made sure you changed all the relevant places? Can you post the modified sample?

0 Likes

Hi,

I tested how much I need to change to see these high bandwidths. It turns out I need only one additional line.

I added this line:

nThreads = maxThreads;

in BufferBandwidth.cpp at line 1283. That's it. When I run the program with the standard configuration (no arguments), I get around 70 GB/s read bandwidth.

Please find the code in the attachment. Again, I tested on Windows 7 with the A10-5800K APU.
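For anyone who does not want to open the attachment, the change sits where the sample decides how many work-items to launch. Only the last line below is the actual modification; the surrounding lines are paraphrased for context and may not match the SDK source exactly:

// Paraphrased context around BufferBandwidth.cpp line 1283 (reconstructed,
// not copied from the SDK); only the final assignment is the real change.
size_t maxThreads = nBytes / sizeof(cl_float4); // one work-item per float4 element
                                                // (128 MB / 16 B = 8388608, matching the
                                                // "GPU work items" line in the log above)
size_t nThreads   = 8192;                       // the sample's default work size

nThreads = maxThreads;                          // <-- the one added line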

0 Likes

Thanks for posting. Will look into it.

0 Likes

I do not see any drastic results using your modified BufferBandwidth.cpp file. I hope you sent the other files just for completeness and did not modify them. Note that I ran this test on a DGPU, though.

Can you share your BufferBandwidth sample output from a run where you see 70 GB/s? Please also mention the command-line options you used.

According to http://amddevcentral.com/afds/assets/presentations/1004_final.pdf, the transfer bandwidth should not be more than 13 GB/s, although it might have improved somewhat for Trinity.

0 Likes

Yes, I sent the other files only for completeness. The only change I made was the one line, as described above. Still, it might be best to use all of the files, so that there are no mismatches caused by different SDK versions.

In my earlier post (22/04/2013) I already included the BufferBandwidth output:

2. GPU kernel read of inputBuffer

     clEnqueueNDRangeKernel():  0.038843 s      69.11 GB/s

The command line options for the output were just: -nb 134217728

However, I see similar effects with the default options.

Do you mean a discrete GPU by DGPU? I think my question is quite specific to the Trinity APU, or rather to its integrated GPU, the Radeon HD 7660D (Devastator). Would it be possible for you to test on a Trinity APU and check whether you see the same effect?

Thanks a lot.

0 Likes

Yes, I meant a discrete GPU by DGPU.

I just tested it on the first test machine I could get my hands on (for a quick repro).

I do understand that you are looking at an APU.

I will try it and let you know soon. Thanks for your patience.

0 Likes

BTW: I just tested it on Ubuntu Linux (on the same APU) and I also see 69 GB/s read bandwidth.

0 Likes

While we wait:

Is it at least theoretically feasible to reach such high bandwidths with integrated GPUs (given the Radeon Memory Bus and dual-channel 1866 MHz RAM), or is it far beyond what is possible, so that we should be looking for a bug in the implementation?

0 Likes

My personal answer:

I don't think you can reach such high bandwidth on APUs, as long as the APU's memory is carved out of system memory (i.e. RAM). That is what the linked PDF says as well.

Maybe this will change in the future; I have no idea. But chances are that what you are seeing is incorrect reporting from the sample.

I lost track of this thread for a while. I will run this on an APU and see why it is happening.

0 Likes

Just profile the two kernels using CodeXL. In the default case nThreads = 8192 and I get a 0% cache hit rate. When I set nThreads = maxThreads (2197152), I see a cache hit rate of 94%, which is what produces the high apparent read bandwidth. Sorry for the late reply here.

You can also run the GlobalMemoryBandwidth sample to check the uncached bandwidth for your device.
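To see why the thread count changes the cache behaviour so much, here is a heavily simplified kernel that only illustrates the access pattern of the read test (it is not the actual SDK kernel):

// Sketch of a BufferBandwidth-style read kernel (access pattern only).
__kernel void readBandwidth(__global const float4* input,
                            __global float4* output,
                            uint elementsPerItem,   // nElements / nThreads
                            uint nKLoops)           // "Kernel loops" in the log above
{
    uint gid   = get_global_id(0);
    uint gsize = get_global_size(0);
    float4 acc = (float4)(0.0f);

    // Every work-item reads its share of the buffer, and the whole
    // read is repeated nKLoops times.
    for (uint k = 0; k < nKLoops; ++k)
        for (uint i = 0; i < elementsPerItem; ++i)
            acc += input[i * gsize + gid];

    output[gid] = acc;   // keep the reads from being optimized away
}

With the default nThreads (8192), every pass streams far more data than the GPU caches can hold, so by the time a pass repeats the data has been evicted: 0% cache hits, and the number reflects DRAM bandwidth. With nThreads = maxThreads, each work-item owns a single float4; the first of the nKLoops passes pulls it into cache and the remaining passes hit, i.e. roughly (nKLoops - 1)/nKLoops ≈ 95%, close to the 94% CodeXL reports, so the reported GB/s is effectively cache bandwidth. Note also that the reported figure counts all the passes: 134217728 bytes * 20 loops / 0.038843 s ≈ 69.1 GB/s, which matches the log above.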

0 Likes

Thanks a lot, caching makes perfect sense.
I didn't realize that the GPU caches global memory reads automatically; I always thought the programmer had to manage that explicitly.

I ran the GlobalMemoryBandwidth sample and got these results:

Device 0 : Devastator Device ID is 0000000002FB0090

1. Global Memory Read (single) = 303.891 GB/s
2. Global Memory Read (linear) = 281.221 GB/s
3. Global Memory Read (linear, uncached) =  16.4991 GB/s
4. Global Memory Write (linear) =  41.3259 GB/s
5. Global Memory Read (random) =  27.6856 GB/s
6. Global Memory Read (unCombine_unCache) =  19.2543 GB/s

So I would guess the 3rd and 4th test cases should be similar to a BufferBandwidth run with a 0% cache hit rate (i.e. the unmodified BufferBandwidth)?
For the original BufferBandwidth test I get 24.62 GB/s for reading and 21.10 GB/s for writing, which is quite different from cases 3 and 4.
Is this caused by different memory access patterns and different float/int compute characteristics, or am I missing something? (All of these tests show 0% cache hit in CodeXL.)

I have attached the full results of GlobalMemoryBandwidth.exe and BufferBandwidth.exe.

0 Likes

Thanks for the tests.

I am puzzled that CodeXL lists no cache hits while you still get ~300 GB/s for certain tests.

I think CodeXL is reporting only L2 hits; the accesses must still be hitting the L1.

That would be understandable, as not all of the GlobalMemoryBandwidth tests measure raw bandwidth; some of them explicitly exercise the cache.

The linear-uncached test is the one that gives you the correct numbers; it measures the raw bandwidth between the compute device and global memory.

For more details, check the kernels used by the GlobalMemoryBandwidth tests.

I am open to a discussion of the individual kernels used in the GlobalMemoryBandwidth tests here.
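To give a flavour of the difference, a linear-uncached style read essentially streams every element exactly once, along these lines (a sketch of the idea, not the shipped kernel):

// Sketch of a linear, single-pass streaming read (illustrative only,
// not the kernel shipped with the GlobalMemoryBandwidth sample).
__kernel void readLinearUncached(__global const float4* input,
                                 __global float4* output,
                                 uint elementsPerItem)
{
    uint gid   = get_global_id(0);
    uint gsize = get_global_size(0);
    float4 acc = (float4)(0.0f);

    // Consecutive work-items read consecutive elements (coalesced), and
    // every element is touched exactly once, so there is no reuse and the
    // result approximates raw DRAM bandwidth.
    for (uint i = 0; i < elementsPerItem; ++i)
        acc += input[i * gsize + gid];

    output[gid] = acc;
}

The "single" and cached "linear" variants, by contrast, keep re-reading the same addresses, which is why they report ~300 GB/s above: that is cache bandwidth, not memory bandwidth.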

0 Likes

The kernels used in the two samples are quite different, and were written by different people at quite different times. I will raise a bug against these samples, as both should indeed return very similar values for uncached reads and writes.

0 Likes