Archives Discussions

adubinsky · ‎08-30-2016

I would like this question posted in the OpenCL forum: although I am using the ROCm/HSA stack, there is a lot more activity there, particularly on the topic of GCN assembly. Pretty please.

I've written a memcpy kernel in GCN assembly for the LLVM assembler bundled with ROCm, and I am testing it on Fiji. This is an experiment; my real goal is to write a fast kernel that needs to read ~1MB of data. On larger transfer sizes, such as 32MB, my kernel achieves reasonable performance (240 us, which translates to about 280 GB/s when both the reads and the writes are counted), although the performance varies a lot. However, when copying 1MB the performance is poor (45-50 us). A very simple kernel can run in about 10 us, so I don't believe the problem is launch overhead. Measurement is done with the CodeXL/HSA profiler.
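
For reference, the bandwidth figures can be checked with a few lines of Python. This is only a sketch under two assumptions that the post does not state outright: that "MB" means 2^20 bytes, and that the 280 GB/s figure counts read and write traffic together (it matches the numbers only under that reading). The helper name `copy_bandwidth_gb_s` is made up for illustration.

```python
MIB = 1024 * 1024  # assumption: "MB" in the post means 2**20 bytes

def copy_bandwidth_gb_s(size_bytes, time_us, count_read_and_write=True):
    """Effective bandwidth of a copy in GB/s (1 GB = 1e9 bytes)."""
    # A copy moves every byte twice: once as a read, once as a write.
    traffic = size_bytes * (2 if count_read_and_write else 1)
    return traffic / (time_us * 1e-6) / 1e9

bw_32mb = copy_bandwidth_gb_s(32 * MIB, 240)      # the large transfer
bw_1mb_slow = copy_bandwidth_gb_s(1 * MIB, 50)    # 1MB copy, slow end
bw_1mb_fast = copy_bandwidth_gb_s(1 * MIB, 45)    # 1MB copy, fast end

print(f"32MB in 240 us: {bw_32mb:.0f} GB/s")
print(f"1MB in 45-50 us: {bw_1mb_slow:.0f}-{bw_1mb_fast:.0f} GB/s")
```

Under these assumptions the 1MB copy runs at roughly one sixth of the effective bandwidth of the 32MB copy.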

I've attached the code, but I'll briefly describe how the kernel works. 256 wavefronts are launched. The `flat_load_dword` instruction reads a contiguous 64 KB into registers across all wavefronts. This instruction is repeated 16 times until the whole 1MB is loaded into registers. `s_waitcnt 0` waits for the loads to finish. Then the data is written out. When experimenting with 32MB transfers, 8192 wavefronts are launched, but each wavefront still does 16 loads and 16 stores. The kernel doesn't currently need many registers, but eventually I will have low occupancy. Configuring the kernel to use 256 registers does not change the result for the 1MB copy, but the performance of the 32MB copy becomes quite interesting: some runs finish in 260 us, while others take up to 400 us.
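
The access pattern described above can be modelled in Python as follows. This is a sketch, not the attached kernel: the exact address computation (lanes consecutive within a wavefront, wavefronts consecutive within each 64 KB round) is an assumption consistent with the description, and `byte_offset` is a hypothetical helper.

```python
WAVE_SIZE = 64    # lanes per GCN wavefront
DWORD = 4         # bytes loaded per lane by flat_load_dword
NUM_WAVES = 256   # wavefronts launched for the 1MB copy
NUM_LOADS = 16    # loads issued by each wavefront

BYTES_PER_WAVE_LOAD = WAVE_SIZE * DWORD            # 256 bytes per wavefront per load
BYTES_PER_ROUND = NUM_WAVES * BYTES_PER_WAVE_LOAD  # 64 KB across all wavefronts

def byte_offset(round_idx, wave, lane):
    """Assumed offset read by one lane in one of the 16 load rounds."""
    return round_idx * BYTES_PER_ROUND + wave * BYTES_PER_WAVE_LOAD + lane * DWORD

# Every dword of the 1MB buffer should be touched exactly once.
offsets = sorted(
    byte_offset(r, w, l)
    for r in range(NUM_LOADS)
    for w in range(NUM_WAVES)
    for l in range(WAVE_SIZE)
)
covers_1mb = offsets == list(range(0, 1024 * 1024, DWORD))
print(BYTES_PER_ROUND, covers_1mb)
```

The same arithmetic gives 8192 wavefronts x 256 bytes x 16 loads = 32MB for the larger transfer.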

There are a lot of things that could be at issue, and the OpenCL Guide doesn't yet describe the memory channels and banks of Fiji, but perhaps someone here has run into this before and can offer some advice. My suspicion is that the initial loading of parameters causes the wavefronts to go out of sync, so that even a theoretically perfect memory access pattern seems out of reach. Thoughts?

adubinsky · ‎08-30-2016

Update: `flat_load_dwordx4` helped performance. The runtime dropped from 45 us to 33 us. Using `dwordx4` isn't the ideal solution for me, but I'm glad there's progress. I also tried `buffer_load_dwordx4`, which gave a slight further improvement (31 us) and more consistent performance (perhaps because there are fewer instructions involved). It would have been interesting to try a `dwordx8` or `dwordx16`.
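
As a rough illustration of why the wider load might help, the following Python sketch counts how many wavefront-level load instructions are needed to read 1MB at each width. The post does not say how the launch was reconfigured for `dwordx4`, so this only compares total instruction counts; `wave_loads_needed` is a hypothetical helper, and "MB" is assumed to mean 2^20 bytes.

```python
WAVE_SIZE = 64              # lanes per GCN wavefront
COPY_BYTES = 1024 * 1024    # assumption: 1MB = 2**20 bytes

def wave_loads_needed(bytes_per_lane):
    """Wavefront-level load instructions needed to read COPY_BYTES."""
    bytes_per_instruction = WAVE_SIZE * bytes_per_lane
    return COPY_BYTES // bytes_per_instruction

loads_dword = wave_loads_needed(4)      # flat_load_dword: 256 bytes per instruction
loads_dwordx4 = wave_loads_needed(16)   # flat_load_dwordx4: 1024 bytes per instruction

print(loads_dword, loads_dwordx4)
```

Each `dwordx4` instruction covers 1024 bytes per wavefront instead of 256, so a quarter as many load instructions are issued for the same data.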

It looks like what I hoped would happen isn't happening. I hoped that 256-byte successive accesses by successive wavefronts would act as one large access and get split perfectly among all of the memory channels.


Quickest way to read 1MB in GCN assembly