The GCN docs state that the L1 data cache bandwidth is 64 bytes/clock, but they don't explain how that bandwidth is arbitrated among the 4 SIMD units in each CU. Is the cache banked so that each SIMD can get 16 bytes per clock, or does one SIMD at a time get a full cache line through some arbitration mechanism?