1) LDS is on-chip memory and global memory is off-chip, so their throughput differs.
2) Neither global memory nor LDS currently bursts reads correctly. This is a software issue and should be fixed in a future driver update.
3) The peak bandwidth of global memory and LDS is card-specific. You can measure the peaks with the CAL sample export_burst_perf (for global) and with ldsread/ldswrite in the samples/runtime directory of the CAL SDK.
4) This could be possible but is not currently supported. The only current way to turn off waterfall is to use the _neighborExch flag, but this does a 4x4 transpose on reads.
5) Some applications benefit from LDS and some do not. The best way to decide is to run the simple performance samples, see the peaks you get on your card, and determine which path is optimal. For example, if you are reading/writing 4 sequential float4s, global memory can reach almost peak bandwidth and there is no performance reason to use LDS.
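
As a rough way to apply the comparison in point 5, here is a minimal Python sketch that turns a measured kernel time into achieved bandwidth and a fraction of the card's peak. It is not part of the CAL SDK. The function names, thread count, kernel time, and peak figure are all hypothetical placeholders; substitute the numbers you measure with the performance samples on your own card.

```python
# Sketch only: helper names and all numeric inputs are assumptions for
# illustration, not values taken from any card or from the CAL samples.

FLOAT4_BYTES = 16  # one float4 = four 32-bit floats


def bytes_per_thread(float4_reads, float4_writes):
    """Bytes moved by one thread that reads and writes whole float4 values."""
    return (float4_reads + float4_writes) * FLOAT4_BYTES


def effective_bandwidth_gbps(bytes_moved, seconds):
    """Achieved bandwidth in GB/s, using 1 GB = 1e9 bytes."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return bytes_moved / seconds / 1e9


def fraction_of_peak(achieved_gbps, peak_gbps):
    """Ratio of achieved bandwidth to the measured peak for the same path."""
    if peak_gbps <= 0:
        raise ValueError("peak_gbps must be positive")
    return achieved_gbps / peak_gbps


if __name__ == "__main__":
    threads = 1 << 20            # hypothetical number of threads launched
    kernel_seconds = 0.002       # hypothetical measured kernel time
    measured_peak_gbps = 80.0    # hypothetical peak from a bandwidth sample

    # Each thread reads 4 sequential float4s and writes 4 sequential float4s.
    total_bytes = threads * bytes_per_thread(4, 4)
    achieved = effective_bandwidth_gbps(total_bytes, kernel_seconds)
    print("achieved GB/s:", achieved)
    print("fraction of peak:", fraction_of_peak(achieved, measured_peak_gbps))
```

If the global memory path already lands close to its measured peak for your access pattern, moving the data through LDS is unlikely to help. If it falls well short, run the same calculation with the LDS sample numbers before deciding.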
On a side note, there are known performance issues with LDS that are being worked on. If we can fix them, that should bring a 4x speed improvement.