I think threads are assigned to thread processors in quads, so in your case threads (0,0), (1,0), (0,1), (1,1) will be assigned to TP0; threads (2,0), (3,0), (2,1), (3,1) will be assigned to TP1; and so on.
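To make the quad idea concrete, here is a rough sketch of that mapping in Python. The function name, the `group_width` parameter, and the row-major ordering of quads are my own assumptions for illustration. Only the first two quads (TP0 and TP1) come from the description above, and the wrap-around at 16 thread processors per SIMD is likewise a guess about what "and so on" means.

```python
def quad_thread_processor(x, y, group_width, num_tps=16):
    """Return the thread-processor index for work-item (x, y).

    Assumptions (hypothetical, not confirmed by any vendor documentation):
    - work-items are grouped into 2x2 quads,
    - quads are numbered row-major across the work-group,
    - quad numbers wrap around after num_tps thread processors.
    """
    quads_per_row = group_width // 2
    quad_index = (y // 2) * quads_per_row + (x // 2)
    return quad_index % num_tps


if __name__ == "__main__":
    # Print the mapping for the first two rows of an 8-wide work-group.
    for y in range(2):
        for x in range(8):
            print(f"thread ({x},{y}) -> TP{quad_thread_processor(x, y, 8)}")
```

With an 8-wide group this puts (0,0), (1,0), (0,1), (1,1) on TP0 and (2,0), (3,0), (2,1), (3,1) on TP1, matching the pattern described above.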
Broadcast from LDS is supported on Cypress: in my benchmark, broadcast reads and 32-bit linear reads both reach the maximum bandwidth, i.e. read bandwidth is around 850 GB/s on the 5870.
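For a sense of scale, an aggregate figure like this can be turned into bytes per clock per SIMD. This is only back-of-the-envelope arithmetic: the 850 MHz engine clock and 20 SIMDs are the stock 5870 specs as I understand them, and the function name is made up for the example.

```python
def bytes_per_clock_per_simd(bandwidth_gb_s, clock_mhz=850.0, num_simds=20):
    """Convert an aggregate bandwidth (GB/s) to bytes per clock per SIMD.

    Defaults assume a stock Radeon HD 5870 (850 MHz, 20 SIMDs);
    adjust them for other parts or overclocked boards.
    """
    bytes_per_second = bandwidth_gb_s * 1e9
    clocks_per_second = clock_mhz * 1e6
    return bytes_per_second / (clocks_per_second * num_simds)


if __name__ == "__main__":
    measured = 850.0  # GB/s, the benchmark figure quoted above
    print(f"{bytes_per_clock_per_simd(measured):.1f} bytes/clock/SIMD")
```

The same helper can be used to compare other measured numbers (for example, wider reads) on an equal per-SIMD footing.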