Transpose/copy slow on AMD Opteron?

Performance comparison between Xeon/Opteron


The transpose code below was compiled with gcc, using either -O0 or -O3. It ran 3 to 5 times slower on a Dual-Core Opteron 2220 than on a Dual-Core Xeon 5160. This was observed on both 64-bit Linux and 32-bit XP. What could be the reason for this, and how can the code be sped up?

======= Sample Code ===============

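#include <stdio.h>
#include <stdlib.h>     /* malloc, free */
#include <sys/time.h>   /* gettimeofday */

/* Wall-clock time in seconds. */
double timing(void)
{
    struct timeval tv;      /* Structure for storing time    */
    gettimeofday(&tv, NULL);
    return((double)tv.tv_sec + (double)tv.tv_usec*1e-6);
}

int main(void)
{
    double t;
    float *a,*b;
    int i,j,size;

    size = 10000;
    a = (float *)malloc((size_t)size*size*sizeof(float));
    b = (float *)malloc((size_t)size*size*sizeof(float));
    if (a == NULL || b == NULL) {
        fprintf(stderr, "malloc failed\n");
        return 1;
    }

    t = timing();

    /* a = transpose of b: contiguous writes to a, reads from b with a
       stride of size floats. The contents of b are never initialized;
       only the access pattern is timed. */
    for(i = 0; i < size; i++){
        for(j = 0; j < size; j++){
            a[i*size + j] = b[i + j*size];
        }
    }

    printf("Time taken for size %i is: %f\n",size,timing()-t);

    free(a);
    free(b);
    return 0;
}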

============End sample code==================
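
For comparison, here is a blocked (tiled) version of the same loop. The inner loop above reads b with a stride of size floats (40000 bytes for size = 10000), so every read touches a different cache line and a different page. Tiling is a common way to reduce that, though I have not confirmed that it accounts for the Opteron/Xeon gap. This is only a sketch: the function name transpose_blocked and the block parameter are made up for illustration, and a good tile size would have to be found by measurement on each machine.

```c
#include <stddef.h>

/* Hypothetical helper, not part of the original test program.
   Writes the transpose of the n x n row-major matrix src into dst,
   one block x block tile at a time, so the strided reads from src
   are confined to a small region before moving on. */
void transpose_blocked(float *dst, const float *src, size_t n, size_t block)
{
    if (block == 0)
        block = 1;                      /* guard against a zero tile size */

    for (size_t ii = 0; ii < n; ii += block) {
        size_t i_end = (ii + block < n) ? ii + block : n;
        for (size_t jj = 0; jj < n; jj += block) {
            size_t j_end = (jj + block < n) ? jj + block : n;
            /* Same assignment as the original loop, restricted to one tile. */
            for (size_t i = ii; i < i_end; i++) {
                for (size_t j = jj; j < j_end; j++) {
                    dst[i * n + j] = src[j * n + i];
                }
            }
        }
    }
}
```

It would replace the double loop in the sample as transpose_blocked(a, b, size, 32), where 32 is just a starting guess for the tile size.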