According to the CodeAnalyst documentation, the L1 data cache access event counts all accesses to the data cache, both loads and stores. It may also include some "scratchpad accesses" made by microcoded (VectorPath) instructions, but those should be very rare.
For Athlon 64 or Turion, each count represents one 8-byte access, even if only part of those 8 bytes is actually transferred. I don't know how that affects the 128-bit loads in Barcelona and Shanghai processors, though. (The CodeAnalyst documentation for Family 10h processors seems to be missing.)
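As a rough illustration of what that means for Athlon 64 or Turion, the count can be turned into an upper bound on bytes moved. This is only a sketch: the function name is made up for the example, and the 8-bytes-per-count factor is taken from the statement above, so it should not be applied to Barcelona or Shanghai without checking their documentation.

```python
# Assumption from the text above: on Athlon 64 / Turion, one count of the
# L1 data cache access event stands for one 8-byte access.
BYTES_PER_COUNT = 8


def max_bytes_accessed(dc_access_count: int) -> int:
    """Upper bound on bytes moved, given a raw data cache access count.

    It is an upper bound because an access that transfers fewer than
    8 bytes is still counted as a full 8-byte access.
    """
    if dc_access_count < 0:
        raise ValueError("event counts cannot be negative")
    return dc_access_count * BYTES_PER_COUNT


if __name__ == "__main__":
    print(max_bytes_accessed(1_000_000))
```

The real number of bytes transferred can be lower than this figure, so treat it as a ceiling rather than a measurement.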
To count non-cacheable, streaming-store, or write-combining accesses, use event 0x065 (Memory Requests by Type) instead.
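If you are not using CodeAnalyst, here is a minimal sketch of how event 0x065 plus a unit mask could be packed into an AMD event-select value and written as a Linux `perf` raw event string. The helper names are invented for this example, and the unit-mask bit values (0x01, 0x02, 0x80) are my recollection of the BKDG, so please verify them for your processor before relying on them.

```python
# Unit-mask bits for event 0x065. These values are an assumption to be
# checked against the BKDG for the processor family in question.
UMASK_UNCACHEABLE = 0x01      # requests to non-cacheable (UC) memory
UMASK_WRITE_COMBINING = 0x02  # requests to write-combining (WC) memory
UMASK_STREAMING_STORE = 0x80  # streaming store (SS) requests


def amd_raw_config(event: int, unit_mask: int) -> int:
    """Pack an event number and unit mask in the AMD event-select layout.

    Event bits 7:0 go to config bits 7:0, the unit mask to bits 15:8,
    and event bits 11:8 to config bits 35:32.
    """
    if not 0 <= event <= 0xFFF:
        raise ValueError("event select must fit in 12 bits")
    if not 0 <= unit_mask <= 0xFF:
        raise ValueError("unit mask must fit in 8 bits")
    low = event & 0xFF
    high = (event >> 8) & 0xF
    return low | (unit_mask << 8) | (high << 32)


def perf_raw_event(event: int, unit_mask: int) -> str:
    """Format the packed value as a raw event name, e.g. for `perf stat -e`."""
    return f"r{amd_raw_config(event, unit_mask):x}"


if __name__ == "__main__":
    all_types = (UMASK_UNCACHEABLE | UMASK_WRITE_COMBINING
                 | UMASK_STREAMING_STORE)
    print(perf_raw_event(0x065, all_types))
```

Selecting a single unit-mask bit instead of OR-ing all three counts just that request type.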
Alright, sounds good.