david.cownie@amd.com

__builtin_popcountl()

Discussion created by david.cownie@amd.com on Jun 11, 2009
Latest reply on Jun 12, 2009 by edward_yang
opencc should compile __builtin_popcountl() to a single instruction for amdfam10, as gcc 4.3 does.

gcc 4.3 puts out a single instruction:

> gcc -march=amdfam10 -O2 -S -o test_popcnt.s  test_popcnt.c
> grep "popcnt" test_popcnt.s
        .file   "test_popcnt.c"
        popcntq (%rdx), %rax
> gcc -march=amdfam10 -O2 -o test_popcnt  test_popcnt.c
dcownie@shanghai:~/cvs/benchmarks/popcount/example> ./test_popcnt
sizeof uint64 is 8
overhead  0.42 secs sum 1966981120
elapsed  0.54 secs sum -1990967296
popcount 0.23 ns elasped  0.12 secs sum -1990967296
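
For anyone who wants to try this without the full benchmark, here is a minimal sketch of the kind of loop being timed. It is not the actual test_popcnt.c (that source is not posted here); the function name popcount_sum is made up for illustration, and it assumes an LP64 target where unsigned long is 64 bits, as the "sizeof uint64 is 8" output suggests.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the timed loop in test_popcnt.c (not shown in
   this thread). Sums the number of set bits over an array of 64-bit words.
   Assumes LP64, i.e. unsigned long is 64 bits wide. */
long popcount_sum(const uint64_t *a, size_t n)
{
    long sum = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        /* GCC builtin; this is the call gcc 4.3 turns into popcntq
           when given -march=amdfam10. */
        sum += __builtin_popcountl((unsigned long)a[i]);
    }
    return sum;
}
```

Compiling a file containing this with the gcc command line above and grepping the .s output for "popcnt" should show whether the builtin was lowered to the hardware instruction.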

But opencc 4.2.2.1 puts out a multi-instruction asm sequence instead, which runs more than a factor of ten slower:

> opencc -march=barcelona -O2 -S  -o test_popcnt.s test_popcnt.c
> grep "popcnt" test_popcnt.s
        # Compiling test_popcnt.c (/tmp/ccI#.9GaocO)
        .file   1       "/home/dcownie/cvs/benchmarks/popcount/example/test_popcnt.c"
 #  58    double tstart, elapsed, overhead, rate, ns_per_popcnt;
 #  95        // sum = sum + popcnt (a);
 #  96        // sum = sum + __popcnt64 (a);
 # 129    ns_per_popcnt = (elapsed * (double)1.0E9) / ((double)loops * (double)N);
 # 131    printf ("popcount %4.2f ns elasped  %4.2f secs sum %d\n", ns_per_popcnt, elapsed, (int)sum);
        .ident  "#Open64 Compiler Version 4.2.2.1 : test_popcnt.c compiled with : -O2 -march=barcelona -msse2 -msse3 -mno-3dnow -mno-sse4a -m64"
 
> opencc -march=barcelona -O2 -S  -o test_popcnt test_popcnt.c
> opencc -march=barcelona -O2 -o test_popcnt test_popcnt.c
> ./test_popcnt
sizeof uint64 is 8
overhead  0.18 secs sum 1966981120
elapsed  2.49 secs sum -1990967296
popcount 4.51 ns elasped  2.31 secs sum -1990967296

The asm sequence takes 4.51 ns per popcount vs 0.23 ns for the single instruction.
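
To give an idea of what a software popcount involves, here is the classic branch-free bit-counting routine in C. This is only an illustration of a typical fallback; the exact sequence opencc emits is not shown in this post and may differ. The name popcount64_soft is made up for the example.

```c
#include <stdint.h>

/* Portable, branch-free popcount for one 64-bit word (classic SWAR method).
   Illustrative software fallback only; not the sequence opencc emits. */
unsigned popcount64_soft(uint64_t x)
{
    /* Each 2-bit field now holds the count of its two original bits. */
    x = x - ((x >> 1) & 0x5555555555555555ULL);
    /* Each 4-bit field holds the count of its four original bits. */
    x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);
    /* Each byte holds the count of its eight original bits. */
    x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0FULL;
    /* The multiply adds all eight byte counts into the top byte. */
    return (unsigned)((x * 0x0101010101010101ULL) >> 56);
}
```

That is around a dozen dependent ALU operations per word, against one popcnt instruction, which is in line with the kind of gap measured above.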

It's not often you can get a factor-of-ten speedup, so adding this to the compiler would be a pleasant (and simple) task!

Also, I'd suggest that -TARG:processor should follow the gcc convention and use amdfam10, since it is going to be confusing to use -march=barcelona when your processor is in fact a Shanghai or Istanbul. The AMD internal code name should be replaced with the processor family architecture number (various different processors share the same architecture, as described in the SWOG, the Software Optimization Guide).

(It could recognise both -march=barcelona AND -march=amdfam10 with no loss of current functionality but gently moving to a more logical plan.)
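
As a rough sketch of what accepting both names could look like, here is a small alias lookup in C. Everything in it (the table, the struct, the function name canonical_march) is hypothetical and only illustrates the idea; it is not taken from the Open64 driver.

```c
#include <string.h>

/* Hypothetical alias table: both spellings select the same target.
   Not actual Open64 driver code. */
struct march_alias {
    const char *name;    /* value given to -march= */
    const char *target;  /* canonical target it maps to */
};

static const struct march_alias march_aliases[] = {
    { "barcelona", "amdfam10" },  /* existing code name, kept for compatibility */
    { "amdfam10",  "amdfam10" },  /* gcc-style family name */
};

/* Returns the canonical target for a -march= value, or NULL if unknown. */
const char *canonical_march(const char *name)
{
    size_t i;

    for (i = 0; i < sizeof march_aliases / sizeof march_aliases[0]; i++) {
        if (strcmp(name, march_aliases[i].name) == 0)
            return march_aliases[i].target;
    }
    return NULL;
}
```

With a table like this, existing build scripts using -march=barcelona keep working while new ones can use the family name.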

Thanks !
