Perhaps there's something I'm not seeing in the docs, so I apologize in advance.
I've got 16 dwords in scalar registers s16-s31. I need to copy that data from the scalar registers to GDS, at the GDS base address plus a 64-byte offset. The best way I have found so far is to:
- mask all lanes but one in the wave with the exec;
- move the data from the scalar to the vector registers;
- issue a bunch of ds_write_bNNN gds instructions;
- re-enable all lanes with the exec.
This sounds cumbersome. Is there a better way to store the data from a bunch of scalar registers to GDS?
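For readers who want the data movement spelled out, here is a toy Python model of the four steps above. It is not GCN assembly and it is not the code from the screenshot; it only simulates which lane writes which dword. The function name, the list standing in for GDS, and the choice of four dwords per write (the question leaves the ds_write_bNNN width open) are all assumptions made for illustration.

```python
WAVE_SIZE = 64          # lanes per wave on Vega
GDS_BYTE_OFFSET = 64    # offset from the GDS base, taken from the question


def single_lane_store(sregs, gds, dwords_per_write=4):
    """Model of the exec-masked copy. `gds` is a plain list of dwords.

    dwords_per_write=4 is an assumed write width (128 bits); the question
    only says ds_write_bNNN.
    """
    full_exec = (1 << WAVE_SIZE) - 1
    exec_mask = 1                       # step 1: only lane 0 stays enabled

    # Step 2: copy each scalar register into a vector register.
    # Only enabled lanes receive the value.
    vregs = [[None] * WAVE_SIZE for _ in sregs]
    for reg, value in enumerate(sregs):
        for lane in range(WAVE_SIZE):
            if (exec_mask >> lane) & 1:
                vregs[reg][lane] = value

    # Step 3: one write per group of registers, issued by the enabled lane.
    writes = 0
    for first in range(0, len(sregs), dwords_per_write):
        base = GDS_BYTE_OFFSET // 4 + first
        for lane in range(WAVE_SIZE):
            if (exec_mask >> lane) & 1:
                for i in range(dwords_per_write):
                    gds[base + i] = vregs[first + i][lane]
        writes += 1

    exec_mask = full_exec               # step 4: re-enable every lane
    return writes, exec_mask


gds = [0] * 64
sregs = list(range(100, 116))           # stand-ins for s16..s31
writes, exec_mask = single_lane_store(sregs, gds)
```

Under these assumptions the model needs 16 register copies and four writes, with 63 of the 64 lanes idle throughout, which is the overhead the question is about.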
I've attached a screenshot of the concept of the code that I have (sorry for blanking out the tiny part that's under the NDA, I promise it doesn't matter for this question). This code may not compile or anything, this is a conceptual explanation of what's going on. For the purposes of limiting and simplifying this question, assume that we only have one wave - wave 0 - going over the entire GPU. If it matters, the platform is ROCm on Linux, and the card is Vega 64.
Surely, there must be a better way to do this, so I must be missing something? What is it? What's the best way of copying the data from scalar registers to GDS?
(Thank you in advance.)
I forwarded your query to the compiler team and here is their reply:
"We have no current plan to support GDS. It’s also worth pointing out that it’s not just a matter of the ISA instructions. Setting up GDS requires runtime involvement and trap handler support and those are not currently planned either. "
Hi, dipak, thank you for your quick reply.
Just to clarify and make sure: they're talking about the OpenCL compiler, and not the ROCm assembler, right? (GDS is working just fine under ROCm at the moment.)
Also - just out of curiosity - why would trap handler support be needed for this?
Indeed, that wastes a lot of arithmetic power.
Since you can do a 64-byte write into the GDS with one instruction, the problem can be reduced to this:
How to transfer 16 adjacent sregs into the first 16 lanes of a single vreg?
So if you have GCN3, you can use 4x S_(BUFFER)_STORE_DWORD to write the 16 sregs into the L1 cache, then enable only the lower 16 lanes and read the data back into a vreg, and finally write that vreg into GDS.
Because of the required s_waitcnt instructions it will also take a while, but in the meantime you can overlap some extra arithmetic.
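As a rough illustration of that idea, here is a toy Python model (again, not GCN assembly). The scratch list stands in for the memory the scalar stores go through, and the assumption that each of the four stores moves four dwords is mine; the function name and data are hypothetical.

```python
WAVE_SIZE = 64
GDS_BYTE_OFFSET = 64    # offset from the GDS base, taken from the question


def sixteen_lane_store(sregs, gds):
    """Model of the store / reload / single-write route for 16 dwords."""
    # Four scalar stores, assumed to move four dwords each.
    scratch = []
    for first in range(0, 16, 4):
        scratch.extend(sregs[first:first + 4])

    # Enable only lanes 0..15, then load one dword per lane into one vreg.
    exec_mask = (1 << 16) - 1
    vreg = [None] * WAVE_SIZE
    for lane in range(WAVE_SIZE):
        if (exec_mask >> lane) & 1:
            vreg[lane] = scratch[lane]

    # A single write: each enabled lane stores its own dword,
    # so 16 lanes x 4 bytes cover the 64 bytes.
    for lane in range(WAVE_SIZE):
        if (exec_mask >> lane) & 1:
            gds[GDS_BYTE_OFFSET // 4 + lane] = vreg[lane]
    return 1            # number of GDS write instructions modelled


gds = [0] * 64
sregs = list(range(100, 116))           # stand-ins for s16..s31
gds_writes = sixteen_lane_store(sregs, gds)
```

The model shows the trade: one GDS write instead of several, at the cost of a round trip through memory and the waits that come with it.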