I'm manipulating a texture resource ('RWTexture2D'; not declared 'globallycoherent') using a compute shader on a Radeon HD 5770, and I'm hitting a bandwidth bottleneck. The shader is using CompletePath.
CompletePath should not be necessary here, as every texel is read from and written to only once. Reads and writes of two different thread groups never overlap, and threads never read from the texture after it has been written. (This is, however, hard for a compiler or driver to figure out, as the R/W is heavily scattered.)
What can I do to make the driver choose FastPath?
What I already figured out:
Originally posted by: Shrinker You say OpenCL, so this subforum might yield an answer for you: http://forums.amd.com/devforum/categories.cfm?catid=390&entercat=y
No, I only mentioned that I read the OpenCL programming guide (because there is no DirectCompute programming guide). This is an HLSL- / compute shader- / DirectCompute-only question.
The difference between RWByteAddressBuffer/RWStructuredBuffer (very fast) and RWBuffer/RWTexture (awfully slow) is that Direct3D assumes RWBuffer and RWTexture to be typed: Writing to RWBuffer or RWTexture may require data conversion (e.g. when writing a float to DXGI_FORMAT_R8_UNORM).
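To make the typed/untyped distinction concrete, here is a minimal sketch of the four resource kinds side by side. The register slots, the `Texel` struct, the `kWidth` constant and the stored value are assumptions made up for illustration; they are not taken from my actual shader.

```hlsl
// Hypothetical element type for the structured buffer.
struct Texel
{
    float value;
};

// Typed UAVs: the element format comes from the view's DXGI_FORMAT,
// so a store may involve a format conversion (e.g. float -> R8_UNORM).
RWTexture2D<float> gTypedTex : register(u0);
RWBuffer<float>    gTypedBuf : register(u1);

// Untyped UAVs: raw 32-bit words or fixed-stride structs, no conversion.
RWByteAddressBuffer       gRawBuf    : register(u2);
RWStructuredBuffer<Texel> gStructBuf : register(u3);

// Assumed row width, only used to build a linear index.
static const uint kWidth = 1024;

[numthreads(8, 8, 1)]
void CSMain(uint3 id : SV_DispatchThreadID)
{
    uint  linearIndex = id.y * kWidth + id.x;
    float v = 0.5f;

    // Typed writes: converted according to the UAV format.
    gTypedTex[id.xy]       = v;
    gTypedBuf[linearIndex] = v;

    // Untyped writes: the bits are stored as they are.
    gRawBuf.Store(linearIndex * 4, asuint(v));
    gStructBuf[linearIndex].value = v;
}
```

In my tests, only the last two kinds of access are fast.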
One possibility to bypass this (proposed by Microsoft for use in in-place image editing) is to use the uint version of these objects, but not even that changes anything.
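For reference, the uint approach looks roughly like the sketch below. It assumes the texture was created with a typeless format such as DXGI_FORMAT_R8G8B8A8_TYPELESS and that the UAV is created as DXGI_FORMAT_R32_UINT; the helpers `UnpackRGBA8` and `PackRGBA8` and the inversion step are hypothetical and written by hand here, so treat them as an illustration rather than the exact code Microsoft ships.

```hlsl
// UAV viewed as R32_UINT (assumption: resource format is R8G8B8A8_TYPELESS).
RWTexture2D<uint> gImage : register(u0);

// Hypothetical helper: 8-bit-per-channel packed value -> normalized float4.
float4 UnpackRGBA8(uint p)
{
    return float4(p & 0xFF, (p >> 8) & 0xFF, (p >> 16) & 0xFF, p >> 24) / 255.0f;
}

// Hypothetical helper: normalized float4 -> packed value, with rounding.
uint PackRGBA8(float4 c)
{
    uint4 u = (uint4)(saturate(c) * 255.0f + 0.5f);
    return u.x | (u.y << 8) | (u.z << 16) | (u.w << 24);
}

[numthreads(8, 8, 1)]
void CSMain(uint3 id : SV_DispatchThreadID)
{
    // Each texel is read once and written once, in place.
    float4 color = UnpackRGBA8(gImage[id.xy]);
    color.rgb = 1.0f - color.rgb; // placeholder edit: invert the colors
    gImage[id.xy] = PackRGBA8(color);
}
```

The format conversion is done manually in the shader, so the hardware only ever sees 32-bit uint loads and stores. As said, this did not get me off CompletePath either.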
Apparently, typed accesses cause the shader compiler to always choose CompletePath, even if the shader accesses the resource without overlap, read-only, write-only, or through uint.
RWBuffer and RWTexture are so awfully slow on AMD hardware that they're practically unusable for real-time applications. Can someone give me a rationale for this, or should I just file a performance bug?