2 Replies Latest reply on Jun 12, 2009 12:31 AM by edward_yang

    __builtin_popcountl()

    david.cownie@amd.com
      opencc should compile this as a single instruction for amdfam10 like gcc 4.3

      With gcc 4.3 it puts out a single instruction:

      > gcc -march=amdfam10 -O2 -S -o test_popcnt.s  test_popcnt.c
      > grep "popcnt" test_popcnt.s
              .file   "test_popcnt.c"
              popcntq (%rdx), %rax
      > gcc -march=amdfam10 -O2 -o test_popcnt  test_popcnt.c
      dcownie@shanghai:~/cvs/benchmarks/popcount/example> ./test_popcnt
      sizeof uint64 is 8
      overhead  0.42 secs sum 1966981120
      elapsed  0.54 secs sum -1990967296
      popcount 0.23 ns elasped  0.12 secs sum -1990967296

      But opencc 4.2.2.1 emits a multi-instruction asm sequence instead, which runs more than a factor of ten slower:

      > opencc -march=barcelona -O2 -S  -o test_popcnt.s test_popcnt.c
      > grep "popcnt" test_popcnt.s
              # Compiling test_popcnt.c (/tmp/ccI#.9GaocO)
              .file   1       "/home/dcownie/cvs/benchmarks/popcount/example/test_popcnt.c"
       #  58    double tstart, elapsed, overhead, rate, ns_per_popcnt;
       #  95        // sum = sum + popcnt (a);
       #  96        // sum = sum + __popcnt64 (a);
       # 129    ns_per_popcnt = (elapsed * (double)1.0E9) / ((double)loops * (double)N);
       # 131    printf ("popcount %4.2f ns elasped  %4.2f secs sum %d\n", ns_per_popcnt, elapsed, (int)sum);
              .ident  "#Open64 Compiler Version 4.2.2.1 : test_popcnt.c compiled with : -O2 -march=barcelona -msse2 -msse3 -mno-3dnow -mno-sse4a -m64"
       
      > opencc -march=barcelona -O2 -S  -o test_popcnt test_popcnt.c
      > opencc -march=barcelona -O2 -o test_popcnt test_popcnt.c
      > ./test_popcnt
      sizeof uint64 is 8
      overhead  0.18 secs sum 1966981120
      elapsed  2.49 secs sum -1990967296
      popcount 4.51 ns elasped  2.31 secs sum -1990967296

      The asm sequence takes 4.51 ns per popcount vs 0.23 ns for the single instruction.
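For readers without the original test_popcnt.c (it is not included in this post), here is a minimal sketch of the two code paths being compared. It is an illustration, not the benchmark used above: the function names are hypothetical, the fallback is the well-known SWAR bit-counting routine rather than whatever sequence opencc actually emits, and `__builtin_popcountll` is used so the argument is 64 bits wide on any ABI (on 64-bit Linux it should behave the same as `__builtin_popcountl`).

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: population count via the GCC builtin.
 * With -march=amdfam10 gcc can lower this to one popcnt instruction. */
static unsigned popcount64_builtin(uint64_t x) {
    return (unsigned)__builtin_popcountll(x);
}

/* Hypothetical helper: portable SWAR fallback, shown only to illustrate
 * what a multi-instruction software sequence looks like. */
static unsigned popcount64_swar(uint64_t x) {
    x = x - ((x >> 1) & 0x5555555555555555ULL);           /* 2-bit sums */
    x = (x & 0x3333333333333333ULL)
      + ((x >> 2) & 0x3333333333333333ULL);               /* 4-bit sums */
    x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0FULL;           /* 8-bit sums */
    return (unsigned)((x * 0x0101010101010101ULL) >> 56); /* add the bytes */
}

/* Compare both versions on 1000 pseudo-random values (xorshift64);
 * returns the number of values on which they disagree. */
static int popcount_mismatches(void) {
    uint64_t v = 0x9E3779B97F4A7C15ULL;
    int bad = 0;
    for (int i = 0; i < 1000; i++) {
        v ^= v << 13;
        v ^= v >> 7;
        v ^= v << 17;
        if (popcount64_builtin(v) != popcount64_swar(v))
            bad++;
    }
    return bad;
}

#ifdef POPCOUNT_DEMO_MAIN
int main(void) {
    assert(popcount_mismatches() == 0);
    printf("popcount(0xF0F0) = %u\n", popcount64_builtin(0xF0F0ULL));
    return 0;
}
#endif
```

Building with -DPOPCOUNT_DEMO_MAIN gives a small standalone program. Compiling with -S and grepping the output for "popcnt", as in the transcripts above, shows whether a given compiler and -march setting lowers the builtin to the single instruction.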

      It's not often you can get a factor of ten speedup, so adding this to the compiler would be a pleasant (and simple) task!

      Also, I'd suggest that -TARG:processor should follow the gcc convention and use amdfam10, since it is going to be confusing to use -march=barcelona when your processor is in fact a Shanghai or Istanbul. The AMD internal code name should be replaced with the processor family architecture number (various different processors share the same architecture, as described in the SWOG, the Software Optimization Guide).

      (It could recognise both -march=barcelona AND -march=amdfam10, with no loss of current functionality, while gently moving to a more logical naming scheme.)

      Thanks !