PC Graphics

phridrich · 01-07-2021

Hi,

I'm creating a DirectX 12 application and profiling it with Radeon GPU Profiler on a 5700 XT card. I use indirect drawing to rasterize the scene, and per-instance vertex buffers to provide mesh-related data to the shaders. Here is one of the vertex shaders that uses this approach:

void main(
    in  float3   in_position  : POSITION,   // per-vertex
    in  float4x4 in_transform : TRANSFORM,  // per-instance
    out float4   out_position : SV_Position
)
{
    float4 hdc_position = mul( float4( in_position, 1.0f ), in_transform );
    out_position = float4( hdc_position.xyz, 1.0f );
}

According to RGP, this results in the following Radeon ISA code:

s_inst_prefetch 0x3                                                                                 // 000000000000: BFA00003
s_getpc_b64 s[0:1]                                                                                  // 000000000004: BE801F80
s_mov_b32 s0, s5                                                                                    // 000000000008: BE800305
s_load_dwordx8 s[4:11], s[0:1], 0x0                                                                 // 00000000000C: F40C0100 FA000000
v_add_nc_u32_e32 v0, s2, v0                                                                         // 000000000014: 4A000002
v_add_nc_u32_e32 v1, s3, v3                                                                         // 000000000018: 4A020603
s_waitcnt lgkmcnt(0)                                                                                // 00000000001C: BF8CC07F
tbuffer_load_format_xyz v[2:4], v0, s[4:7],  format:74, 0 idxen                                     // 000000000020: EA522000 80010200
s_clause 0x3                                                                                        // 000000000028: BFA10003
tbuffer_load_format_xyz v[5:7], v1, s[8:11],  format:74, 0 idxen offset:16                          // 00000000002C: EA522010 80020501
tbuffer_load_format_xyz v[8:10], v1, s[8:11],  format:74, 0 idxen                                   // 000000000034: EA522000 80020801
tbuffer_load_format_xyz v[11:13], v1, s[8:11],  format:74, 0 idxen offset:32                        // 00000000003C: EA522020 80020B01
tbuffer_load_format_xyz v[14:16], v1, s[8:11],  format:74, 0 idxen offset:48                        // 000000000044: EA522030 80020E01
s_waitcnt vmcnt(3)                                                                                  // 00000000004C: BF8C3F73
v_mul_f32_e32 v1, v3, v5                                                                            // 000000000050: 10020B03
v_mul_f32_e32 v5, v3, v6                                                                            // 000000000054: 100A0D03
v_mul_f32_e32 v3, v3, v7                                                                            // 000000000058: 10060F03
s_waitcnt vmcnt(2)                                                                                  // 00000000005C: BF8C3F72
v_mac_f32_e32 v1, v2, v8                                                                            // 000000000060: 3E021102
v_mac_f32_e32 v5, v2, v9                                                                            // 000000000064: 3E0A1302
v_mac_f32_e32 v3, v2, v10                                                                           // 000000000068: 3E061502
s_waitcnt vmcnt(1)                                                                                  // 00000000006C: BF8C3F71
v_mac_f32_e32 v1, v4, v11                                                                           // 000000000070: 3E021704
v_mac_f32_e32 v5, v4, v12                                                                           // 000000000074: 3E0A1904
v_mac_f32_e32 v3, v4, v13                                                                           // 000000000078: 3E061B04
s_waitcnt vmcnt(0)                                                                                  // 00000000007C: BF8C3F70
v_add_f32_e32 v0, v14, v1                                                                           // 000000000080: 0600030E
v_add_f32_e32 v1, v15, v5                                                                           // 000000000084: 06020B0F
v_add_f32_e32 v2, v16, v3                                                                           // 000000000088: 06040710
v_mov_b32_e32 v3, 1.0                                                                               // 00000000008C: 7E0602F2
exp pos0 v0, v1, v2, v3 done                                                                        // 000000000090: F80008CF 03020100
s_endpgm                                                                                            // 000000000098: BF810000
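
To make the mapping from the shader to the listing easier to follow, here is a minimal CPU-side sketch of the same arithmetic, in the same order as the v_mul/v_mac/v_add instructions. The names InstanceTransform and transform_position are hypothetical and are not part of my application. It assumes v[2:4] holds the position and that the four per-instance loads at offsets 0, 16, 32 and 48 fetch the xyz parts of the four matrix rows.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

using Float3 = std::array<float, 3>;

// Hypothetical CPU-side mirror of one per-instance element: four rows of
// 16 bytes each, matching the load offsets 0, 16, 32 and 48 in the listing.
struct InstanceTransform {
    float rows[4][4];
};
static_assert(sizeof(InstanceTransform) == 64,
              "assumed per-instance stride of 64 bytes");

// Computes xyz of mul(float4(p, 1), M) in the order the ISA uses.
// Only the xyz part of each row is read, as in the format_xyz loads;
// the w column is unused because the shader writes 1.0 to output w.
Float3 transform_position(const Float3& p, const InstanceTransform& m) {
    Float3 out{};
    for (std::size_t c = 0; c < 3; ++c) {
        float acc = p[1] * m.rows[1][c];  // v_mul_f32: y * row 1
        acc += p[0] * m.rows[0][c];       // v_mac_f32: += x * row 0
        acc += p[2] * m.rows[2][c];       // v_mac_f32: += z * row 2
        out[c] = m.rows[3][c] + acc;      // v_add_f32: row 3 + accumulated sum
    }
    return out;
}
```

In this sketch the position differs per vertex while the rows are the same for every vertex of an instance, which is why I would expect the row fetches to be candidates for scalar loads if my assumption below holds.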

What caught my attention here is that vector memory load instructions are used to load the per-instance data. As I understand it, a vertex shader group always processes vertices from a single instance, so scalar memory loads could be used here instead. So here are my questions:

1. Is my assumption that a vertex shader group only processes a single instance valid? If not, scalar memory loads cannot be used here, and everything is fine.

2. If my assumption is valid, are these vector memory loads actually a big problem? I assume scalar loads would be better, but the difference may be barely visible thanks to memory caching.

3. Are there other limitations that prevent the compiler from using scalar loads here? Perhaps I am not providing some critical information on the CPU side? Or is it just a matter of driver implementation?

DirectX 12 per instance data fetch