As we are all impatiently waiting for the new Bulldozers, I noticed the new Software Optimization Guide for AMD Fam15h chips (doc 47414).
It mentions that on the new chips, just as with Fam10h and the older K8 etc., there are no separate prefetch levels (PREFETCHT0/T1/T2), or rather all of these opcodes behave as PREFETCHT0: they prefetch into the L1 cache.
Why is that? Wouldn't it be advantageous to be able to prefetch into L2/L3 and leave it to the core or the HW prefetcher to actually demand the data later?
Yes, L1-D is now 4-way associative (as opposed to 2-way on K10), but this can still hurt if newly prefetched data gets evicted from L1 before it is actually used, or if it evicts data that is needed before then...
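
To make the question concrete, here is a minimal sketch of what I mean by asking for a specific level. It uses the standard `_mm_prefetch` intrinsic from `<xmmintrin.h>`, where `_MM_HINT_T0`, `_MM_HINT_T1`, `_MM_HINT_T2` and `_MM_HINT_NTA` map to PREFETCHT0, PREFETCHT1, PREFETCHT2 and PREFETCHNTA. The function name `sum_with_prefetch` and the `PREFETCH_AHEAD` distance are my own placeholders, not anything from the guide, and the distance is an untuned guess.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* _mm_prefetch and the _MM_HINT_* constants */

/* Hypothetical tuning value: how many elements ahead to prefetch.
   Not taken from the AMD guide; a real value would need measuring. */
#define PREFETCH_AHEAD 64

/* Sum an array while hinting upcoming data towards the cache.
   The hint argument must be a compile-time constant, so it is fixed here. */
long sum_with_prefetch(const int *data, size_t n)
{
    long total = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_AHEAD < n) {
            /* _MM_HINT_T1 emits PREFETCHT1, i.e. "L2 is close enough".
               If all levels behave as PREFETCHT0, as described above,
               this line would land in L1 anyway. */
            _mm_prefetch((const char *)&data[i + PREFETCH_AHEAD],
                         _MM_HINT_T1);
        }
        total += data[i];
    }
    return total;
}
```

The prefetch is only a hint, so the result is the same whatever level the hardware picks; only the cache placement, and therefore the eviction behaviour I am worried about, would differ.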