Thank you, Lihan, for the insights.
But I really cannot get my head around the cache hits I am observing, even if I assume they come from the L2 cache. It would be really helpful if you could share some of your expertise (simple microbenchmarks: http://devgurus.amd.com/message/1287539#1287539). Could this be caused by some unusual way of calculating L2 cache hits?
Himanshu: Thanks! Bringing in multiple lines on a miss could be a possible explanation (though not an entirely convincing one, given the coalesced accesses). Btw, the shared L1 cache is only for scalar data and instructions. The vector L1 data cache is not shared; there is one per CU.
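To put a rough number on the "multiple lines per miss" idea, here is a toy model in Python. It is only a sketch under my own assumptions, not a description of how the hardware or the profiler actually counts hits: the cache has unlimited capacity (no evictions), the access pattern is a pure sequential stream, each coalesced request touches exactly one line, and a miss fills an aligned group of `lines_per_miss` lines. The function and parameter names are made up for illustration.

```python
def streaming_hit_rate(num_requests, request_bytes, line_bytes, lines_per_miss):
    """Toy model: hit rate of a sequential stream through an idealized cache.

    Assumptions (hypothetical, not taken from any hardware documentation):
    - unlimited capacity, so nothing is ever evicted
    - request i starts at address i * request_bytes and touches one line
    - a miss fills an aligned group of `lines_per_miss` consecutive lines
    """
    cached_lines = set()
    hits = 0
    misses = 0
    for i in range(num_requests):
        line = (i * request_bytes) // line_bytes
        if line in cached_lines:
            hits += 1
        else:
            misses += 1
            # Fill the whole aligned group that contains the missing line.
            base = line - (line % lines_per_miss)
            cached_lines.update(range(base, base + lines_per_miss))
    return hits / (hits + misses)


if __name__ == "__main__":
    # One 64-byte request per 64-byte line, varying the fill granularity.
    for k in (1, 2, 4):
        print(k, streaming_hit_rate(1024, 64, 64, k))
```

In this model a stream with one request per line gets no hits when a miss fills a single line, and a hit rate of (k - 1) / k when a miss fills k lines. So a nonzero hit rate on fully coalesced streaming reads would be consistent with multi-line fills, but the model cannot tell that apart from a counter that simply counts hits differently.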