Hi,
I recently ran some tests on GDDR5 memory bandwidth.
In a few words:
The kernel launches numberOfCUes*4 wavefronts, and each wavefront (WF) reads 1024 bytes from a random aligned location in a large buffer.
The whole kernel reads 100GB in total. The ideal bandwidth is 1.850GHz * 256bit / 8 * 4 = 236.8GB/s.
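The arithmetic above can be checked with a short Python sketch. The function and parameter names are made up for illustration and are not part of the benchmark; the factor of 4 is taken as the number of transfers per clock, as in the formula above.

```python
def peak_bandwidth_gb_per_s(clock_ghz: float, bus_width_bits: int,
                            transfers_per_clock: int) -> float:
    """Theoretical peak bandwidth in GB/s (1 GB = 10^9 bytes)."""
    bytes_per_transfer = bus_width_bits / 8  # bus width in bytes
    return clock_ghz * bytes_per_transfer * transfers_per_clock


def ideal_runtime_s(total_gb: float, bandwidth_gb_per_s: float) -> float:
    """Shortest possible time to move total_gb at the given bandwidth."""
    return total_gb / bandwidth_gb_per_s


# Values from the post: 1.850GHz memory clock, 256-bit bus, 4 transfers per clock.
peak = peak_bandwidth_gb_per_s(1.850, 256, 4)
print(f"peak bandwidth: {peak:.1f} GB/s")                          # 236.8 GB/s
print(f"ideal time for 100GB: {ideal_runtime_s(100, peak):.2f} s")  # 0.42 s
```

Dividing the 100GB read by the measured kernel time gives the achieved bandwidth, which can then be compared against this peak.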
I varied the buffer size from 1KB to 4GB and got the following results:
You can easily spot the 2 cache levels in it, but I also found a third step somewhere above 1GB.
Does anyone have an explanation for what that is?
I have 8GB but only tested up to 4GB; it looks like the bandwidth would settle at about 1/3 of the nominal bandwidth if I used the whole memory.
This behavior would certainly be unacceptable for PC memory. But is it normal for graphics memory?
Thank you for the answers!
Hi there!
Just to venture a guess, could it be TLB misses?
The default page size is 4K (I think), so if you read memory at random, you'll hit many different pages. Each page's translation is either in the TLB or not; if it isn't, the access slows down significantly because the TLB has to be updated, which is a very slow operation on CPUs as well. I think this was an issue in the mining community a while ago; AMD then shipped so-called blockchain drivers that increased the page size to 64K (again, I think), and everything worked fine for them after that. As far as I know, there are two TLB levels, so there could be another step somewhere in your bandwidth graph.
For instance, see this release, and look for the large page info -> https://support.amd.com/en-us/kb-articles/Pages/AMDGPU-Pro-Driver-17.40-for-Linux.aspx
If you're on Windows, the Crimson or Adrenalin drivers should provide a switch between graphics and compute workloads. For example, see this page -> https://support.amd.com/en-us/kb-articles/Pages/DH-024.aspx I assume the compute setting means large pages, though I can't be sure.
I'm sorry if the cause of your problem is different and/or if I'm wrong, as I don't want to give you bad advice, but I think running your test with large pages is worth a shot.
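To illustrate why the page size could matter, here is a rough Python sketch of how many pages a buffer spans and how much memory a TLB can cover. The 4K and 64K page sizes are the guesses from above; the TLB entry count is a purely hypothetical number, since I don't know the real value for any particular GPU.

```python
KIB = 1024
GIB = 1024 ** 3


def pages_spanned(buffer_bytes: int, page_bytes: int) -> int:
    """Number of pages needed to cover the buffer (rounded up)."""
    return -(-buffer_bytes // page_bytes)


def tlb_reach_bytes(tlb_entries: int, page_bytes: int) -> int:
    """Amount of memory addressable without a TLB miss."""
    return tlb_entries * page_bytes


HYPOTHETICAL_TLB_ENTRIES = 512  # assumption for illustration, not a real spec

for page in (4 * KIB, 64 * KIB):
    pages = pages_spanned(4 * GIB, page)
    reach = tlb_reach_bytes(HYPOTHETICAL_TLB_ENTRIES, page)
    print(f"{page // KIB}K pages: a 4GB buffer spans {pages} pages, "
          f"TLB reach {reach // KIB} KB")
```

With random reads, once the buffer is larger than the TLB reach, most accesses land on a page whose translation is not cached. Going from 4K to 64K pages multiplies the reach by 16 for the same number of entries, which would move any such step in the graph to a correspondingly larger buffer size.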
Finally, would you mind sharing the model of your card and your memory benchmark kernel?
Best,
sp314