I'm manipulating a texture resource ('RWTexture2D', not declared 'globallycoherent') with a compute shader on a Radeon HD 5770, and I'm running into a bandwidth bottleneck. The shader is using CompletePath.
CompletePath should not be necessary here, since every texel is read and written only once. Reads and writes of different thread groups never overlap, and no thread reads from the texture after it has been written to. (This is, however, hard for a compiler or driver to figure out, as the reads and writes are heavily scattered.)
What can I do to make the driver choose FastPath?
What I already figured out:
- The driver will choose FastPath for reading and writing if I use a 'RWStructuredBuffer' resource instead of 'RWTexture2D'. The overall performance, however, degrades: other shaders require a texture memory layout and perform awfully with a buffer memory layout. Furthermore, a 2-dimensional memory layout simplifies the shader code.
- The driver will always choose FastPath for read-only textures ('Texture2D'). So, if I allocate a second texture, create a shader resource view of the original texture, bind it as 'Texture2D', and write to the new texture (bound as 'RWTexture2D') instead of the original one, the driver will choose FastPath for reading. Although writing still goes through CompletePath, this gives a great performance benefit. However, I'm short on VRAM and cannot afford to double the amount of required memory.
- Adding the 'globallycoherent' storage class does not change the shader's assembly code, so the driver apparently treats the resource as 'globallycoherent' all the time.
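
To make the second workaround above concrete (reading through 'Texture2D', writing through 'RWTexture2D'), here is a minimal sketch. The resource names, register slots, thread-group size, the single-channel float format, and the trivial addressing are all placeholders for illustration, not the actual shader:

```hlsl
// Sketch only: names, registers and the float element type are assumptions.
Texture2D<float>   gSource : register(t0); // SRV of the original texture (read-only)
RWTexture2D<float> gTarget : register(u0); // UAV of the second texture (write-only)

[numthreads(8, 8, 1)]
void CSMain(uint3 id : SV_DispatchThreadID)
{
    // The real shader uses heavily scattered coordinates; id.xy stands in for them.
    uint2 readPos  = id.xy;
    uint2 writePos = id.xy;

    // Read through the read-only view (mip level 0).
    float value = gSource.Load(int3(readPos, 0));

    // Write through the unordered access view.
    gTarget[writePos] = value;
}
```

In this layout the shader never reads from the resource it writes to, which is the property the single-texture version also has in practice but cannot express through its declaration.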