Most of the information on the 5xxx and 6xxx series cards is out there - there are a lot of models in each range, so I suggest checking through them yourself. Details like the number of compute units you should be able to work out yourself; you can find them on AMD's product pages. At the end of the day it comes down to how much you're willing to spend.
If you want a good idea of the relative performance of each product, pop your OpenCL code into AMD's KernelAnalyzer and see what performance figures it spits out for each card - this will give you the most realistic figures you can get without actually running your code on the different devices.
In a nutshell - the top-of-the-range single-GPU products in the 5xxx series (5870) and 6xxx series (6970) are very similar. They have similar theoretical FLOPS, but the 6970 has more compute units (24 vs 20), so you're going to get better real-world performance for your 'typical' code, particularly where branches diverge more frequently. The 6970 has slightly more memory bandwidth (176 GB/s vs 153 GB/s). In the 6970 the T-unit has been done away with and its functions merged into the other stream processors, for a massive improvement on specific operations like floating-point to integer conversion.
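To see why the theoretical figures come out so close despite the different compute unit counts, here is a rough back-of-the-envelope sketch. The compute unit counts are the ones quoted above; the clock speeds (850 MHz and 880 MHz), the 16 stream cores per compute unit, the 5-wide vs 4-wide ALU layout, and the 2 FLOPs per ALU per clock (one multiply-add) are assumptions taken from the published specs rather than from this thread, and the names in the code are made up for illustration.

```python
from dataclasses import dataclass

# Assumed from published specs, not from this thread.
STREAM_CORES_PER_CU = 16        # stream cores in each compute unit
FLOPS_PER_ALU_PER_CLOCK = 2     # one multiply-add counts as 2 FLOPs


@dataclass(frozen=True)
class Gpu:
    name: str
    compute_units: int          # from the post: 20 vs 24
    alus_per_stream_core: int   # 5 with the T-unit, 4 without (assumed)
    core_clock_mhz: int         # assumed stock clock


def total_alus(gpu: Gpu) -> int:
    """Number of individual ALUs ("stream processors") on the chip."""
    return gpu.compute_units * STREAM_CORES_PER_CU * gpu.alus_per_stream_core


def peak_gflops(gpu: Gpu) -> float:
    """Theoretical single-precision peak in GFLOPS."""
    return total_alus(gpu) * FLOPS_PER_ALU_PER_CLOCK * gpu.core_clock_mhz / 1000.0


HD5870 = Gpu("HD 5870", compute_units=20, alus_per_stream_core=5, core_clock_mhz=850)
HD6970 = Gpu("HD 6970", compute_units=24, alus_per_stream_core=4, core_clock_mhz=880)

if __name__ == "__main__":
    for gpu in (HD5870, HD6970):
        print(f"{gpu.name}: {total_alus(gpu)} ALUs, {peak_gflops(gpu):.1f} GFLOPS peak")
```

Under these assumptions the 5870 has slightly more ALUs in total, but they are spread over fewer compute units with wider stream cores, which are harder to fill with 'typical' code. That is consistent with the point above about real-world performance, though it is only a peak-rate estimate and says nothing about how your own kernel will behave.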
The 7xxx isn't out, so no-one here knows how it'll compare.
With regard to all your other questions (3-9), things are very similar between the 5xxx and 6xxx, both of which have improved significantly over the 4xxx. Both support local atomics and feature hardware local shared memory (32 KB for the 5870 and 64 KB for the 6970). Global atomics are horrendously slow on any card, as in most cases the memory accesses have to be serialised. I don't know if asynchronous DMA has been implemented yet. The 5870 features 2560 bytes/clk of LDS bandwidth; I'm not sure about the 6970, but I assume it will be similar. I don't think anyone has compared kernel latencies across the different series, but your code should be written so that kernel latency is low compared to execution time, whatever you're programming for, so I wouldn't concern yourself with this. Real-world PCIe bandwidth is around the 4 GB/s mark, whatever your card.
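To put a couple of those numbers in context, here is a small sketch of the arithmetic. It reads the PCIe figure as roughly 4 gigabytes per second and uses the 2560 bytes/clk and 20 compute units quoted for the 5870; the 850 MHz core clock is an assumption from the published spec, and the function names and the example timings are purely illustrative.

```python
# Figures quoted in the thread (PCIe figure read as gigabytes per second).
PCIE_BYTES_PER_SEC = 4e9          # "around the 4 GB/s mark"
LDS_BYTES_PER_CLK_5870 = 2560     # aggregate LDS bandwidth, HD 5870
COMPUTE_UNITS_5870 = 20

# Assumption from the published spec, not from this thread.
CORE_CLOCK_HZ_5870 = 850e6


def transfer_seconds(n_bytes: int, bytes_per_sec: float = PCIE_BYTES_PER_SEC) -> float:
    """Estimated host<->device copy time at a sustained bandwidth."""
    return n_bytes / bytes_per_sec


def lds_bytes_per_clk_per_cu() -> float:
    """LDS bandwidth of a single compute unit, in bytes per clock."""
    return LDS_BYTES_PER_CLK_5870 / COMPUTE_UNITS_5870


def lds_gbytes_per_sec() -> float:
    """Aggregate LDS bandwidth in GB/s at the assumed core clock."""
    return LDS_BYTES_PER_CLK_5870 * CORE_CLOCK_HZ_5870 / 1e9


def overhead_fraction(overhead_s: float, kernel_s: float) -> float:
    """Share of total time lost to fixed overhead (launch latency, copies)."""
    return overhead_s / (overhead_s + kernel_s)


if __name__ == "__main__":
    buf = 256 * 1024**2  # a 256 MiB buffer, chosen only as an example
    print(f"PCIe copy of 256 MiB: {transfer_seconds(buf) * 1e3:.1f} ms")
    print(f"LDS per compute unit: {lds_bytes_per_clk_per_cu():.0f} bytes/clk")
    print(f"LDS aggregate: {lds_gbytes_per_sec():.0f} GB/s")
    # Hypothetical timings: 1 ms of fixed overhead against a 9 ms kernel.
    print(f"Overhead share: {overhead_fraction(1e-3, 9e-3):.0%}")
```

The gap between the LDS figure and the PCIe figure is the reason to keep data on the card for as long as possible, and the overhead share is a quick way to check whether a kernel runs long enough for launch latency not to matter.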
As I mentioned, aside from actually testing your code, try AMD's KernelAnalyzer for a rough performance comparison.
Try doing searches and look at magazine articles about the hardware details if you want some pretty diagrams. Some of the better reviews also cover potentially important details like power and heat.
Other than that, I could say read the manual ...
Pretty much all the important stuff is in the AMD APP OpenCL Programming Guide.
Appendix D has tables with most of the numbers you're asking about for each device up to the HD 6970, and chapter 4 has lots of other details about memory.
PS I went looking for an update recently and the documentation index seems somewhat ... muddled. You want: http://developer.amd.com/sdks/AMDAPPSDK/assets/AMD_Accelerated_Parallel_Processing_OpenCL_Programming_Guide.pdf
Thanks, AMD_Accelerated_Parallel_Processing_OpenCL_Programming_Guide.pdf is the answer to my questions.