I've been experimenting to find the magic formula for getting close to 4 tris/clock on an RX 480. Something rather odd came up!
Test scenario: 2048 instances of each of 17 different meshes, 34816 instances in total. 27M triangles in total, all sub-pixel sized.
Instanced method: 1 vkCmdDrawIndexedIndirect command sourced from 17 VkDrawIndexedIndirectCommand entries, each with 2048 instances.
pre-z pass: 12.20ms (2214MTri/s)
g-buf pass: 12.25ms (2204MTri/s)
Unrolled method: 1 vkCmdDrawIndexedIndirectCountAMD command sourced from 34816 VkDrawIndexedIndirectCommand entries, each with 1 instance.
pre-z pass: 7.35ms (3673MTri/s)
g-buf pass: 7.40ms (3649MTri/s)
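To make the two setups concrete, here is a rough sketch of how the two indirect command lists differ for 17 meshes with 2048 instances each. It is not my actual renderer code: the `ctypes` struct only mirrors the field order of `VkDrawIndexedIndirectCommand`, the per-mesh triangle and vertex counts are made-up placeholders, and the `firstInstance` numbering is an assumption about how per-instance data would be looked up.

```python
import ctypes


class DrawIndexedIndirectCommand(ctypes.Structure):
    """Mirrors the field order of Vulkan's VkDrawIndexedIndirectCommand."""
    _fields_ = [
        ("indexCount", ctypes.c_uint32),
        ("instanceCount", ctypes.c_uint32),
        ("firstIndex", ctypes.c_uint32),
        ("vertexOffset", ctypes.c_int32),
        ("firstInstance", ctypes.c_uint32),
    ]


MESH_COUNT = 17
INSTANCES_PER_MESH = 2048


def make_placeholder_meshes():
    """Return (indexCount, firstIndex, vertexOffset) per mesh. All values are made up."""
    meshes, first_index, vertex_offset = [], 0, 0
    for m in range(MESH_COUNT):
        triangles = 700 + 10 * m       # hypothetical; picked so the total lands near 27M
        vertices = triangles // 2 + 2  # hypothetical vertex count
        meshes.append((3 * triangles, first_index, vertex_offset))
        first_index += 3 * triangles
        vertex_offset += vertices
    return meshes


def build_instanced(meshes):
    """Instanced method: one entry per mesh, each covering 2048 instances."""
    return [
        DrawIndexedIndirectCommand(ic, INSTANCES_PER_MESH, fi, vo, m * INSTANCES_PER_MESH)
        for m, (ic, fi, vo) in enumerate(meshes)
    ]


def build_unrolled(meshes):
    """Unrolled method: one entry per instance, each with instanceCount = 1.

    Assumption: firstInstance carries the instance id so the shader sees the
    same instance index as in the instanced method. A non-zero firstInstance
    in an indirect draw needs the drawIndirectFirstInstance feature.
    """
    return [
        DrawIndexedIndirectCommand(ic, 1, fi, vo, m * INSTANCES_PER_MESH + i)
        for m, (ic, fi, vo) in enumerate(meshes)
        for i in range(INSTANCES_PER_MESH)
    ]


def total_triangles(cmds):
    """Triangles submitted by a command list (triangle-list topology assumed)."""
    return sum((c.indexCount // 3) * c.instanceCount for c in cmds)


meshes = make_placeholder_meshes()
instanced = build_instanced(meshes)
unrolled = build_unrolled(meshes)
print(len(instanced), len(unrolled))
print(total_triangles(instanced), total_triangles(unrolled))
```

Both lists submit exactly the same geometry; the only difference is whether the 2048 copies of a mesh arrive as one instanced draw or as 2048 single-instance draws.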
Those timings are not exactly what I expected. I then had a look at the two cases in RGP. It reports 23.8M shaded vertices in the instanced case, but only 17.3M in the unrolled case. So that's where the performance difference comes from, but why is there any difference at all? What makes vertex caching work so poorly in the instanced case?
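As a quick sanity check on those numbers, the snippet below works out shaded vertices per triangle and the pre-z throughput for both cases. It only uses the figures quoted in this post; the 27M triangle count is the rounded total, so the throughput can differ from the measured values by a MTri/s or so.

```python
TRIANGLES = 27e6  # rounded total from the test scenario

cases = {
    "instanced": {"prez_ms": 12.20, "shaded_vertices": 23.8e6},
    "unrolled": {"prez_ms": 7.35, "shaded_vertices": 17.3e6},
}


def verts_per_triangle(case):
    """Average number of vertex shader invocations per triangle."""
    return case["shaded_vertices"] / TRIANGLES


def mtris_per_second(case):
    """Pre-z pass throughput in millions of triangles per second."""
    return TRIANGLES / (case["prez_ms"] * 1e-3) / 1e6


for name, case in cases.items():
    print(f"{name}: {verts_per_triangle(case):.2f} verts/tri, "
          f"{mtris_per_second(case):.0f} MTri/s")

# How much more vertex work and time the instanced case needs.
vertex_ratio = cases["instanced"]["shaded_vertices"] / cases["unrolled"]["shaded_vertices"]
time_ratio = cases["instanced"]["prez_ms"] / cases["unrolled"]["prez_ms"]
print(f"vertex ratio {vertex_ratio:.2f}, time ratio {time_ratio:.2f}")
```

That gives roughly 0.88 shaded vertices per triangle for the instanced case against 0.64 for the unrolled case. The instanced case shades about 1.38x the vertices but takes about 1.66x the time, so the extra vertex work may not account for the whole gap. That is only my reading of the arithmetic, not something RGP showed directly.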