Hello,
we are using the Radeon Pro SSG[0] to read files from the SSG into OpenGL Buffers of large size (e.g. SSBOs with size 1 GiB, up to the maximum size of 2147483647 bytes (~2047.99 MiB) per SSBO).
We encounter several bugs when using SSG functionality, namely: (1) reading the end of files on SSG, (2) uninvolved GL buffers seem to interfere with reading files from SSG, (3) ensuring asynchronous operations on SSG files.
You can find a Visual Studio Solution with a working minimal example and system information below.
The example code works on two example files with sizes of 1024 and 1008 bytes, which are attached for convenience, although their content does not matter.
The code expects the SSG to be the "G:\" drive on Windows, but this can be overridden via a command-line argument (e.g. executing '.\miniSSG.exe H:\' in PowerShell instead of '.\miniSSG.exe' to load files from H:\).
** (1) Reading the end of files
We encounter bugs with the OpenGL SSG Extensions when reading the end of a file via glReadFileAMD.
The SSG User Manual[1] requires reads to the end of a file to be aligned to a given block size.
Quote from the Manual: "If the file size is not a multiple of the block size, read the end of the file by aligning the read size with the next block multiple beyond the file size."
But we simply cannot get glReadFileAMD to read the end of a file: the GL driver reports GL_INVALID_VALUE for every combination of function parameters that reads the end of the file, unless the file size itself is a multiple of the block size.
However, inflating our input files to a multiple of the SSG block size is not a practical solution.
So there seems to be a bug in the driver when reading the end of files from the SSG?
In Code:
GLuint dstBuffer = makeBuffer();
GLFileHandleAMD fileHandle = openFile(); // fileSize is not multiple of block size
glReadFileAMD(dstBuffer, fileHandle, 0/*bufferOffset*/, 0/*fileOffset*/, fileSize/*read size*/, nullptr/*GLsync*/); // => GL_INVALID_VALUE
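For reference, a minimal sketch of the read-size computation as we understand the manual's rule. It assumes a block size of 512 bytes (the value our tests point to) and that the destination buffer is large enough for the aligned size. The helper names alignUp and tailReadSize are ours for illustration and are not part of the SSG API.

```cpp
#include <cassert>
#include <cstdint>

// Round `size` up to the next multiple of `blockSize` (blockSize must be > 0).
constexpr std::uint64_t alignUp(std::uint64_t size, std::uint64_t blockSize) {
    return (size + blockSize - 1) / blockSize * blockSize;
}

// Read size that the manual's rule yields for a read that starts at
// `fileOffset` (assumed to be block-aligned) and runs to the end of the file.
inline std::uint64_t tailReadSize(std::uint64_t fileSize,
                                  std::uint64_t fileOffset,
                                  std::uint64_t blockSize) {
    assert(blockSize > 0);
    assert(fileOffset % blockSize == 0 && fileOffset <= fileSize);
    return alignUp(fileSize - fileOffset, blockSize);
}
```

For our 1008-byte example file, this gives a read size of 1024 bytes when reading from offset 0. Reads with sizes computed this way are among the parameter combinations that return GL_INVALID_VALUE for us.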
** (2) Uninvolved GL buffers interfere with glReadFileAMD
The User Manual suggests creating GL buffers "using glNamedBufferStorage with the GL_MAP_READ_BIT | GL_MAP_WRITE_BIT | GL_MAP_PERSISTENT_BIT flag set" for best performance and to allow asynchronous file reads.
A bug arises when a spare OpenGL buffer object is created with this method but is never used with glReadFileAMD.
When such a buffer of a certain (small or large) size exists, it seems to interfere with subsequent glReadFileAMD operations, leading to GL_INVALID_OPERATION errors, although those glReadFileAMD operations work on a different buffer object.
In Code:
GLbitfield flags = GL_MAP_READ_BIT | GL_MAP_WRITE_BIT | GL_MAP_PERSISTENT_BIT;
GLuint unusedBuffer = makeBuffer(flags); // buffer size is 513 bytes
GLuint dstBuffer = makeBuffer(flags); // buffer size is 1024 bytes
GLFileHandleAMD fileHandle = openFile();
glReadFileAMD(dstBuffer, fileHandle, 0/*bufferOffset*/, 0/*fileOffset*/, acceptableReadSize/*read size*/, nullptr/*GLsync*/); // => GL_INVALID_OPERATION
When the buffers are created WITHOUT the GL_MAP_PERSISTENT_BIT, no GL_INVALID_OPERATION errors occur. When 'unusedBuffer' is created with a size of 512 bytes, no errors occur either.
** (3) Async SSG file read on large GL buffers
The User Manual states that file "read/write operations work in asynchronous mode" on GL buffers when using the bit flags given above, such that "the buffer is created in local visible video memory".
Since local visible video memory is only a few hundred MB, our SSBO allocations of 1-2 GiB will not fit into that, especially when we allocate several such buffers.
Quote (Manual pages 14+15):
"Access to local visible memory enables the highest performance, but this memory is only 256 MB and the system reserves most of it.
The application can only allocate about 100 MB; attempts to allocate more than the unallocated local visible memory will fail.
… Only when the buffer is created in local visible video memory (using glNamedBufferStorage with the GL_MAP_READ_BIT | GL_MAP_WRITE_BIT | GL_MAP_PERSISTENT_BIT flag set)
will read/write operations work in asynchronous mode. Otherwise, the driver will ignore the sync object. "
So it is clear that large buffers will not give the best performance, but the important question is: do we still get asynchronous file read/write operations for buffers of size 1-2 GiB?
The wording in the User Manual suggests that only (small) buffers in local visible memory get asynchronous mode.
But when allocating large buffers with the bit flag combination for local visible memory, buffer creation does not fail. So do we get async mode for such buffers?
Best,
Florian Frieß
------------------------------------------------------------
Visualization Research Center (VISUS)
University of Stuttgart
[0] https://www.amd.com/en/products/professional-graphics/radeon-pro-ssg
[1] https://www.amd.com/Documents/ssg-api-user-manual.pdf pages 12+13, 14+15
=== System Information
-> Overview
Radeon Pro and AMD FirePro Software Version - 18.Q3
Radeon Pro and AMD FirePro Software Edition - Radeon Pro Software Enterprise Edition
Graphics Chipset - Radeon Pro SSG
High Bandwidth Cache Size - 16368 MB
High Bandwidth Cache Type - HBM2
Core Clock - 1500 MHz
Windows Version - Windows 10 (64 bit)
System Memory - 32 GB
CPU Type - AMD Ryzen 7 1800X Eight-Core Processor
-> Software
Radeon Pro and AMD FirePro Settings Version - 2018.0814.1443.24654
Driver Packaging Version - 18.20.24.03-180814a-332362C-RadeonProEnterprise
Provider - Advanced Micro Devices, Inc.
2D Driver Version - 8.1.1.1634
Direct3D® Version - 9.14.10.01350
OpenGL® Version - 24.20.11000.14565
OpenCL™ Version - 24.20.12024.3003
AMD Mantle Version - Not Available
AMD Mantle API Version - Not Available
AMD Audio Driver Version - 10.0.1.6
Vulkan™ Driver Version - 2.0.33
Vulkan™ API Version - 1.1.73
-> Hardware
Graphics Card Manufacturer - Designed and built by AMD
Graphics Chipset - Radeon Pro SSG
Device ID - 6862
Vendor ID - 1002
Subsystem ID - 0B1E
Subsystem Vendor ID - 1002
Revision ID - 00
Bus Type - PCI Express 3.0
Current Bus Settings - PCI Express 3.0 x16
BIOS Version - 016.001.001.000
BIOS Part Number - 113-D0690103-102
BIOS Date - 2017/09/21 16:12
High Bandwidth Cache Size - 16368 MB
High Bandwidth Cache Type - HBM2
High Bandwidth Cache Clock - 945 MHz
Core Clock - 1500 MHz
High Bandwidth Cache Bandwidth - 483 GByte/s
Memory Bit Rate - 1.89 Gbps
2D Driver File Path - /REGISTRY/MACHINE/SYSTEM/ControlSet001/Control/Class/{4d36e968-e325-11ce-bfc1-08002be10318}/0001
Hello, thanks for your report!
I managed to build your application and will investigate the problem. I will probably need to work with other engineers on it. Thank you for your patience.
Hello,
As far as I can tell, there is a driver defect behind the first problem, (1) Reading the end of files, and we will fix it. Thanks for reporting it.
For the remaining problems, we need more time for investigation and debugging to find the root cause. I will keep you updated.
Hello,
For (3) Async SSG file read on large GL buffers:
The sync object is only useful when the buffer is successfully allocated from local visible memory. If the buffer is allocated from local invisible memory, the user-defined sync object is ignored, because the AMD OGL driver already defines an internal sync object for the staging buffer (data cannot be read directly from the SSD into local invisible memory). The data R/W path is: NVMe SSD <=> staging buffer <=> invisible video memory.
A 1-2 GB buffer should be allocated from invisible memory, and no sync object is needed in that case.
Hello,
thank you for the extensive explanation of the R/W path.
So that means we do not get async reads, since we use the invisible video memory?
For us it would be great to read multiple big files asynchronously from the SSD, i.e. without the read-file call blocking the application. Is there a way to do that?
Hello,
We suggest you use two smaller buffers to fill the larger buffer. Allocate several buffers of 40 MB each, so that they fit into the DGMA aperture, then alternate reading asynchronously into them. Once a buffer has finished reading (wait for the async event to be signaled), copy that buffer into the larger buffer. The copy should be fast because it is done within HBM memory and 40 MB is small, while at the same time the second buffer is reading from the file, eliminating any latency.
The whole idea is to keep all the data in video memory and not go through the system PCIe bus and system memory.
Let me know if you have any further questions, thanks.
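A minimal sketch of how such an alternating schedule could be planned on the CPU side, under stated assumptions: a 512-byte block size, a chunk size that is a multiple of it, and file offsets that map one-to-one to offsets in the large buffer. The struct ChunkStep and the function planChunks are hypothetical names for illustration. The actual GL calls are not shown, since their exact usage depends on the SSG extension and driver.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// One step of the alternating (ping-pong) schedule.
struct ChunkStep {
    int           stagingIndex; // which small staging buffer receives the read
    std::uint64_t fileOffset;   // offset in the file, reused as offset in the large buffer
    std::uint64_t readSize;     // size for the file read, rounded up to a block multiple
    std::uint64_t copySize;     // bytes of real file data to copy into the large buffer
};

// Split a file into chunks and assign them round-robin to the staging buffers.
std::vector<ChunkStep> planChunks(std::uint64_t fileSize, std::uint64_t chunkSize,
                                  std::uint64_t blockSize, int stagingCount) {
    assert(blockSize > 0 && chunkSize > 0 && stagingCount > 0);
    assert(chunkSize % blockSize == 0); // keeps every chunk's file offset block-aligned
    std::vector<ChunkStep> steps;
    int index = 0;
    for (std::uint64_t offset = 0; offset < fileSize; offset += chunkSize) {
        const std::uint64_t copySize = std::min(chunkSize, fileSize - offset);
        // Only the last chunk can be shorter than chunkSize; round its read up.
        const std::uint64_t readSize = (copySize + blockSize - 1) / blockSize * blockSize;
        steps.push_back({index, offset, readSize, copySize});
        index = (index + 1) % stagingCount;
    }
    return steps;
}
```

For each step, the application would issue the asynchronous read into the staging buffer at stagingIndex, wait for that buffer's sync object (for example with glClientWaitSync), and then copy copySize bytes into the large buffer at fileOffset (for example with glCopyNamedBufferSubData). With two staging buffers, the read for step i+1 can be in flight while step i is being copied.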
Thank you again for the explanation, we will implement the copy operation as you described.
One more question: does the 40 MB copy work, given that we could not copy anything whose size is not evenly divisible by 512 (see problem (1))? Does the new driver 18.Q3.1 already contain the fix for problem (1)? I could not find anything in the documentation.
Hello,
friessfn wrote:
Thank you again for the explanation, we will implement the copy operation as you described.
One more question: does the 40 MB copy work, given that we could not copy anything whose size is not evenly divisible by 512 (see problem (1))? Does the new driver 18.Q3.1 already contain the fix for problem (1)? I could not find anything in the documentation.
I'm afraid not. The fix for the first problem, related to glReadFileAMD(), is still under internal testing, and every driver release has a long testing cycle. We may already have missed the next driver release. I will try to get the fix promoted to the release branch if possible, and I will let you know of any progress. Thanks for your patience.
As for the small buffer size in your test, I will double-check with the developers and get back to you.