Broadcasting is supported in the local memory of the 5xx series: 32 threads reading the same memory location will have their request serviced in one cycle.
Constant buffers give a bandwidth of about 600 GB/s when all threads access the same memory location, even when the index is dynamic.
Global memory gives a bandwidth of about 250 GB/s because there is some cache reuse: global buffer memory operations translate to the VFETCH instruction, which means the accesses go through the L1 texture cache.
So you will definitely get higher bandwidth when all threads are accessing the same memory location.
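To make the access patterns concrete, here is a minimal kernel sketch (CUDA syntax; the names `lut`, `broadcast_read`, and the buffer sizes are illustrative assumptions, not from the original post) contrasting a uniform read that the hardware can broadcast with a per-thread divergent read:

```cuda
// Hypothetical micro-benchmark sketch: all threads read the SAME
// element of a constant buffer, so the hardware can broadcast that
// value to the whole wavefront/warp in a single transaction.
__constant__ float lut[256];

__global__ void broadcast_read(const float* __restrict__ global_in,
                               float* __restrict__ out, int idx)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    // Uniform access: `idx` is the same for every thread, so one
    // constant-cache fetch serves all of them (the fast case above).
    float c = lut[idx];

    // Divergent access: each thread touches a different global
    // address, so no broadcast is possible and bandwidth drops
    // to whatever the L1 texture cache path can sustain.
    float g = global_in[tid];

    out[tid] = c + g;
}
```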
1) kernels that compress the first k levels into one long vector (of nulls or node addresses)
2) kernels that access the nodes from level k onward directly from that vector, like a hash map