1 Reply Latest reply on Oct 15, 2010 11:20 PM by aravinda

    Refill from SYS Mem

    aravinda

       

      I have a program that loops over an array of 128K entries with stride = 128, so each pass makes 1024 accesses.

      I used perfctr to measure cache hits and misses in two cases: when every iteration is scheduled on the same AMD core, and when each iteration is scheduled on a different core.

       

      1. AFAIK, SYS_REFILL should be equal to DATA_MISSES on the first pass, because the first time the array is accessed it has to be fetched from main memory.

         Why is SYS_REFILL so much lower than DATA_MISSES?

      2. L2_REFILL is nearly equal to DATA_MISSES. Does that mean most of the data is prefetched into L2? If so, how can I measure the number of hardware data prefetches satisfied by system memory?

      3. If there's no way to measure the number of hardware data prefetches satisfied by main memory, how can I defeat the hardware prefetcher so that I see SYS_REFILL = DATA_MISSES the first time the array is accessed?

       

      const u_int STRIDE = 128;
      const u_int ARRAY_LEN = 128 * 1024; // 128K entries or 512KB
      u_long array[ARRAY_LEN];

      for (j = 0; j < 8; j++) {
          if (sched_core) {
              cpu_set_t cpuset;
              CPU_ZERO(&cpuset);
              CPU_SET(j, &cpuset);
              sched_setaffinity(getpid(), sizeof(cpu_set_t), &cpuset);
          }
          gettimeofday(&begin[j], NULL);
          perf_read(&before[j]);
          for (i = 0; i < ARRAY_LEN; i+=STRIDE) {
              unsigned long d = array[i];
              d++;
          }
          perf_read(&after[j]);
          gettimeofday(&end[j], NULL);
      }
      }

      I ran this program on a Quad core Opteron 2376 that has 2 CPUs with 4 cores each.

      Program: ./lattest -e 0x00410040 -e 0x00410041 -e 0x00411E42 -e 0x00411F43

      CPU, TSC, DATA_ACCESS, DATA_MISSES, L2_REFILL, SYS_REFILL
      4, 360390, 11590, 1027, 1018, 8
      4, 37159, 8314, 1023, 1022, 0
      4, 32848, 8314, 1021, 1020, 0
      4, 32831, 8272, 1022, 1021, 0
      4, 32823, 8272, 1020, 1019, 0
      4, 32823, 8272, 1020, 1019, 0
      4, 32820, 8272, 1021, 1020, 0
      4, 32823, 8272, 1020, 1019, 0

      Program: ./lattest --sched-core -e 0x00410040 -e 0x00410041 -e 0x00411E42 -e 0x00411F43

      CPU, TSC, DATA_ACCESS, DATA_MISSES, L2_REFILL, SYS_REFILL
      0, 354429, 11582, 1029, 1010, 20
      1, 37357, 8288, 1029, 1010, 18
      2, 37422, 8290, 1030, 1010, 22
      3, 37486, 8288, 1029, 1010, 20
      4, 38688, 8286, 1030, 1010, 18
      5, 38427, 8290, 1030, 1010, 20
      6, 38490, 8285, 1029, 1010, 18
      7, 38859, 8292, 1030, 1010, 18

        • Refill from SYS Mem
          aravinda

           

          Also, I was expecting to see some difference in DATA_MISSES between the two cases, that is, when every iteration runs on the same core as opposed to on different cores.

          But they look more or less the same to me, except for a few SYS_REFILL events when the iterations run on different cores.

          That means there was not much benefit in running them on the same core. Can that only happen if walking the array pollutes the L1 cache just as much as starting with a cold cache on a different core?

           

          So I ran the same program to walk over 2K entries at stride = 128 (16 data accesses).

          Same core:

          There are almost no misses. Does this mean all the data is prefetched into L1?

          Different core:

          It looks like all data misses are being filled from L3. But why would data be prefetched this time?

           

          Program: ./lattest -e 0x00410040 -e 0x00410041 -e 0x00411E42 -e 0x00411F43

          CPU, TSC, DATA_ACCESS, DATA_MISSES, L2_REFILL, SYS_REFILL
          0, 3639, 250, 13, 4, 8
          0, 447, 239, 1, 0, 0
          0, 436, 232, 1, 0, 0
          0, 398, 202, 1, 0, 0
          0, 398, 202, 1, 0, 0
          0, 398, 202, 1, 0, 0
          0, 399, 202, 1, 0, 0
          0, 398, 202, 1, 0, 0

          Program: ./lattest --sched-core -e 0x00410040 -e 0x00410041 -e 0x00411E42 -e 0x00411F43

          CPU, TSC, DATA_ACCESS, DATA_MISSES, L2_REFILL, SYS_REFILL
          0, 5134, 260, 19, 0, 17
          1, 1143, 226, 20, 0, 19
          2, 1072, 232, 19, 0, 17
          3, 1147, 234, 20, 0, 17
          4, 1800, 227, 20, 0, 19
          5, 1777, 237, 20, 0, 17
          6, 1366, 220, 21, 0, 19
          7, 1414, 226, 20, 0, 19