28 Replies Latest reply on Mar 7, 2011 4:37 AM by nervi

    Large openmp/bandwidth performance regression in newer open64 releases.

    sbike
      OpenMP shows poor performance after 4.2.2.1

      Source code for a popular OpenMP memory benchmark:

      http://www.cs.virginia.edu/stream/FTP/Code/stream.c

      Works great with:

      GNU gcc version 4.2.0 (Open64 4.2.2.1 driver)

      (I suggest increasing N by a factor of 10 to get more reliable timings despite cpuspeed and related variables).
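As a quick sanity check on the array size, the footprint STREAM reports can be reproduced by hand. This is a minimal sketch, assuming the stock N of 2,000,000 raised tenfold to 20,000,000, three arrays of 8-byte doubles, and a megabyte of 1048576 bytes; the helper name `stream_memory_mb` is made up for illustration.

```python
def stream_memory_mb(n: int, bytes_per_word: int = 8, arrays: int = 3) -> float:
    """Memory used by STREAM's arrays, in units of 1048576 bytes."""
    return arrays * bytes_per_word * n / 1048576.0


# Assumption: N = 20,000,000, i.e. ten times the stock 2,000,000.
print(f"N = 20,000,000 -> {stream_memory_mb(20_000_000):.1f} MB")
# An array size of 40,000,000 also appears later in this thread.
print(f"N = 40,000,000 -> {stream_memory_mb(40_000_000):.1f} MB")
```

The two values should match the "Total memory required" lines of 457.8 MB and 915.5 MB that appear in the outputs in this thread.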

      $ opencc -O4 -m64 -mp stream.c -o stream && ./stream
      ...

      Total memory required = 457.8 MB.
      ...

      Number of Threads requested = 16

      ...
      Function      Rate (MB/s)   Avg time     Min time     Max time
      Copy:       30546.3775       0.0110       0.0105       0.0145
      Scale:      30583.9645       0.0107       0.0105       0.0122
      Add:        29600.7575       0.0164       0.0162       0.0171
      Triad:      29507.9135       0.0165       0.0163       0.0174
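For readers comparing the columns, the Rate figure follows from the Min time column and the number of bytes each kernel moves. The sketch below assumes N = 20,000,000 and 8-byte doubles (inferred from the 457.8 MB footprint, not stated in the post), and counts 1 MB as 10**6 bytes; the function name is illustrative only.

```python
# Bytes moved per array element by each kernel, for 8-byte doubles:
# Copy and Scale read one array and write one; Add and Triad read two and write one.
BYTES_PER_ELEMENT = {"Copy": 2 * 8, "Scale": 2 * 8, "Add": 3 * 8, "Triad": 3 * 8}


def stream_rate_mb_s(kernel: str, n: int, min_time_s: float) -> float:
    """Bandwidth in MB/s (1 MB = 10**6 bytes) from the best (minimum) time."""
    return BYTES_PER_ELEMENT[kernel] * n / min_time_s / 1e6


N = 20_000_000  # assumption: stock N raised by a factor of 10
for kernel, min_time in [("Copy", 0.0105), ("Triad", 0.0163)]:
    print(f"{kernel}: about {stream_rate_mb_s(kernel, N, min_time):.0f} MB/s")
```

The results land within about one percent of the reported rates; the small gap comes from the Min time column being rounded to four decimal places.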

      Now if I switch to 4.2.2.2:

      export PATH=/share/apps/open64-4.2.2.2/bin:$PATH
      export LD_LIBRARY_PATH=/share/apps/open64-4.2.2.2/lib

      $ opencc -V
      x86 Open64 Compiler Suite: Version 4.2.2.2
      ...
      opencc -O4 -m64 -mp stream.c -o stream && ./stream

      ...

      Function      Rate (MB/s)   Avg time     Min time     Max time
      Copy:       11630.9547       0.0277       0.0275       0.0278
      Scale:      11618.5706       0.0278       0.0275       0.0283
      Add:        13120.5255       0.0367       0.0366       0.0368
      Triad:      12992.2490       0.0370       0.0369       0.0371

      I also tried the newest 4.2.3 beta:

      $ export PATH=/share/apps/open64-4.2.3/bin:$PATH
      $ export LD_LIBRARY_PATH=/share/apps/open64-4.2.3/lib
      $ opencc -V
      Open64 Compiler Suite: Version 4.2.2.99
      $ opencc -O4 -m64 -mp stream.c -o stream && ./stream
      ...

      Number of Threads requested = 16
      ...

      Function      Rate (MB/s)   Avg time     Min time     Max time
      Copy:       11585.7750       0.0279       0.0276       0.0284
      Scale:      11610.6305       0.0279       0.0276       0.0284
      Add:        13131.6509       0.0367       0.0366       0.0369
      Triad:      12981.7771       0.0372       0.0370       0.0381

      I seem to recall that some Intel-specific library was mistakenly left out of 4.2.2.2, with a promise that it would be included again in the next release.  Maybe that was forgotten?

       

       

       

       

        • Large openmp/bandwidth performance regression in newer open64 releases.
          prao

          Thank you for pointing this out.

          We're looking into this and will get back to you as soon as possible.

            • Large openmp/bandwidth performance regression in newer open64 releases.
              prao

              We tried this here, but were unable to replicate your observation.

              Could you please share the details of the machine you tested this on, so that we can take a closer look?

              Thanks,

                • Large openmp/bandwidth performance regression in newer open64 releases.
                  sbike

                   

                  An Intel Nehalem E5530, running CentOS 5.4.  I believe the shared libs supporting the newer Intel architectures were dropped, causing opencc to default to producing binaries for a 32-bit 386 or so.  Only -m64 will force the generation of a 64-bit binary.
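One way to test the 32-bit suspicion directly is to look at the ELF header of the produced binary. The following is only a sketch: it relies on the standard ELF layout (magic bytes `\x7fELF`, then the class byte at offset 4, where 1 means 32-bit and 2 means 64-bit), and the function names are invented for this example.

```python
def elf_class(header: bytes) -> str:
    """Classify an ELF file from its first five bytes."""
    if len(header) < 5 or header[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    # Byte 4 is EI_CLASS: 1 = ELFCLASS32, 2 = ELFCLASS64.
    return {1: "32-bit", 2: "64-bit"}.get(header[4], "unknown")


def elf_class_of_file(path: str) -> str:
    """Read just enough of the file at `path` to classify it."""
    with open(path, "rb") as f:
        return elf_class(f.read(5))


# Synthetic headers, so the example runs without a compiled binary.
print(elf_class(b"\x7fELF\x01"))  # 32-bit
print(elf_class(b"\x7fELF\x02"))  # 64-bit
```

On the machine in question, `elf_class_of_file("./stream")` would show what opencc actually emitted with and without -m64.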

                   

                   

                    • Large openmp/bandwidth performance regression in newer open64 releases.
                      dgilmore

                      Could you make a copy of /proc/cpuinfo and attach it to another post?

                      Thanks,

                      Doug

                       

                        • Large openmp/bandwidth performance regression in newer open64 releases.
                          sbike

                          I don't see a way to attach.  I'll just include the 1st of 8 CPUs for the E5520 and E5530.

                          E5520:

                          processor       : 0
                          vendor_id       : GenuineIntel
                          cpu family      : 6
                          model           : 26
                          model name      : Intel(R) Xeon(R) CPU           E5520  @ 2.27GHz
                          stepping        : 5
                          cpu MHz         : 2261.076
                          cache size      : 8192 KB
                          physical id     : 0
                          siblings        : 8
                          core id         : 0
                          cpu cores       : 4
                          apicid          : 0
                          fpu             : yes
                          fpu_exception   : yes
                          cpuid level     : 11
                          wp              : yes
                          flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx rdtscp lm constant_tsc nonstop_tsc pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr sse4_1 sse4_2 popcnt lahf_lm
                          bogomips        : 4522.15
                          clflush size    : 64
                          cache_alignment : 64
                          address sizes   : 40 bits physical, 48 bits virtual
                          power management: [8]

                          E5530:

                          processor       : 0
                          vendor_id       : GenuineIntel
                          cpu family      : 6
                          model           : 26
                          model name      : Intel(R) Xeon(R) CPU           E5530  @ 2.40GHz
                          stepping        : 5
                          cpu MHz         : 1600.000
                          cache size      : 8192 KB
                          physical id     : 0
                          siblings        : 8
                          core id         : 0
                          cpu cores       : 4
                          apicid          : 0
                          fpu             : yes
                          fpu_exception   : yes
                          cpuid level     : 11
                          wp              : yes
                          flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx rdtscp lm constant_tsc ida nonstop_tsc pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr sse4_1 sse4_2 popcnt lahf_lm
                          bogomips        : 4800.32
                          clflush size    : 64
                          cache_alignment : 64
                          address sizes   : 40 bits physical, 48 bits virtual
                          power management: [8]

                           

                           

                            • Large openmp/bandwidth performance regression in newer open64 releases.
                              dgilmore

                              For Intel "cpu family" 6 we only recognize up to model 23.  We'll fix this.

                              Thanks!

                              Doug
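The check described in this reply can be mirrored against a cpuinfo dump. This is a rough sketch, not the compiler's actual detection code: the threshold of 23 is taken from the reply, the sample text is an excerpt of the E5530 entry posted earlier, and all function and constant names are hypothetical.

```python
MAX_RECOGNIZED_FAMILY6_MODEL = 23  # figure quoted in the reply, not read from compiler source


def parse_first_cpu(text: str) -> dict:
    """Return the key/value fields of the first processor block of cpuinfo text."""
    fields = {}
    for line in text.splitlines():
        if not line.strip():
            if fields:
                break  # a blank line ends the first processor block
            continue
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def is_unrecognized_intel(fields: dict) -> bool:
    """True for an Intel family 6 CPU whose model is above the quoted limit."""
    return (
        fields.get("vendor_id") == "GenuineIntel"
        and int(fields.get("cpu family", -1)) == 6
        and int(fields.get("model", -1)) > MAX_RECOGNIZED_FAMILY6_MODEL
    )


SAMPLE = """processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 26
model name      : Intel(R) Xeon(R) CPU           E5530  @ 2.40GHz
"""

print(is_unrecognized_intel(parse_first_cpu(SAMPLE)))  # True: model 26 is above 23
```

What the compiler falls back to for an unrecognized model is not shown here; the sketch only flags the condition.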

                               

                                • Large openmp/bandwidth performance regression in newer open64 releases.
                                  sbike

                                  Strange that 4.2.2.1 works so well.  Has a patch been checked in?  Can I check it out of version control?  Or do I have to wait until the next release?

                                  • Large openmp/bandwidth performance regression in newer open64 releases.
                                    prao

                                    Incidentally, I notice from your cpuinfo that the CPU clock is at 1.6GHz on the E5530, while it's rated at 2.4GHz. I would recommend that you disable power management and clock scaling when making performance measurements.
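The comparison made here, current clock against rated clock, can be automated from the two cpuinfo fields involved. A minimal sketch, assuming the rated frequency can be read from the "@ x.xxGHz" suffix of the model name and using an arbitrary 5% tolerance; the helper names are illustrative.

```python
import re


def rated_mhz(model_name: str):
    """Extract the rated clock in MHz from a model name such as '... @ 2.40GHz'."""
    match = re.search(r"@\s*([\d.]+)\s*GHz", model_name)
    return float(match.group(1)) * 1000 if match else None


def is_clocked_down(model_name: str, cpu_mhz: float, tolerance: float = 0.05) -> bool:
    """True if the current clock is more than `tolerance` below the rated clock."""
    rated = rated_mhz(model_name)
    if rated is None:
        return False  # no rated frequency in the name, so nothing to compare against
    return cpu_mhz < rated * (1 - tolerance)


# Values taken from the two cpuinfo excerpts posted in this thread.
print(is_clocked_down("Intel(R) Xeon(R) CPU           E5530  @ 2.40GHz", 1600.000))  # True
print(is_clocked_down("Intel(R) Xeon(R) CPU           E5520  @ 2.27GHz", 2261.076))  # False
```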

                                    My earlier measurements were on the Shanghai (2P4C), Penryn and Nehalem (all at 8 threads), and no regressions were found. Could you please provide additional pointers to the (Intel-specific?) libraries that you suspect were left out?

                                    My measurements for 8 threads on the Nehalem, also family 6 and model 26, (at 2.93GHz, 8GB memory, SLES10SP2) are attached below.

                                    --

                                    GNU gcc version 4.2.0 (Open64 4.2.2.1 driver)
                                    opencc -O4 -m64 -mp stream.c -o stream
                                    -------------------------------------------------------------
                                    Function      Rate (MB/s)   Avg time     Min time     Max time
                                    Copy:       11492.6214       0.0279       0.0278       0.0279
                                    Scale:      11176.3353       0.0287       0.0286       0.0287
                                    Add:        10579.2097       0.0455       0.0454       0.0456
                                    Triad:      10431.5918       0.0461       0.0460       0.0461
                                    -------------------------------------------------------------

                                    GNU gcc version 4.2.0 (Open64 4.2.2.3 driver)
                                    opencc -O4 -m64 -mp stream.c -o stream
                                    -------------------------------------------------------------
                                    Function      Rate (MB/s)   Avg time     Min time     Max time
                                    Copy:       12058.2283       0.0266       0.0265       0.0267
                                    Scale:      11640.5377       0.0275       0.0275       0.0275
                                    Add:        11440.0509       0.0420       0.0420       0.0421
                                    Triad:      11256.4769       0.0427       0.0426       0.0428
                                    -------------------------------------------------------------

                                    GNU gcc version 4.2.0 (Open64 4.2.2.99 driver)
                                    opencc -O4 -m64 -mp stream.c -o stream
                                    -------------------------------------------------------------
                                    Function      Rate (MB/s)   Avg time     Min time     Max time
                                    Copy:       12061.3708       0.0266       0.0265       0.0266
                                    Scale:      11645.6888       0.0275       0.0275       0.0276
                                    Add:        11439.4659       0.0420       0.0420       0.0421
                                    Triad:      11253.8342       0.0427       0.0427       0.0428
                                    -------------------------------------------------------------

                                      • Large openmp/bandwidth performance regression in newer open64 releases.
                                        sbike

                                         

                                        Originally posted by: prao Incidentally, I notice from your cpuinfo that the cpu clock is at 1.6GHz on the E5530 while its rated at 2.4GHz. I would recommend that you disable power management and clock scaling when making performance measurements.


                                        Disabling it didn't help.

                                         

                                         

                                        My earlier measurements were on the Shanghai (2P4C), Penryn and Nehalem (all at 8 threads) and no regressions were found. Could you please provide additional pointers to the (intel specific ?) libraries that you suspect were left out ?



                                        Ah, one of the releases was missing wolfdale.so.  All of the ones I'm trying have that file, so that isn't the problem.

                                         

                                        My measurements for 8 threads on the Nehalem, also family 6 and model 26, (at 2.93GHz, 8GB memory, SLES10SP2) are attached below.

                                        Single socket or dual?  Your numbers are terrible for a dual socket, or at least dual socket with 8 or 16 threads.

                                        My problem is quite repeatable:

                                        $ which opencc
                                        /share/apps/open64-4.2.2/bin/opencc
                                        $ opencc -O4 -m64 -mp stream.c -o stream && ./stream| grep ":"
                                        STREAM version $Revision: 5.9 $
                                        Copy:       30058.6151       0.0107       0.0106       0.0108
                                        Scale:      29869.9711       0.0108       0.0107       0.0109
                                        Add:        28154.4152       0.0171       0.0170       0.0172
                                        Triad:      27924.7936       0.0172       0.0172       0.0173
                                        $ export PATH=/share/apps/open64-4.2.2.2/bin:$PATH
                                        $ export LD_LIBRARY_PATH=/share/apps/open64-4.2.2.2/lib
                                        $ opencc -O4 -m64 -mp stream.c -o stream && ./stream| grep ":"
                                        STREAM version $Revision: 5.9 $
                                        Copy:       11613.9460       0.0278       0.0276       0.0280
                                        Scale:      11625.3132       0.0279       0.0275       0.0283
                                        Add:        13120.4400       0.0367       0.0366       0.0368
                                        Triad:      12988.7286       0.0371       0.0370       0.0373

                                          • Large openmp/bandwidth performance regression in newer open64 releases.
                                            prao

                                             

                                            Single socket or dual?  Your numbers are terrible for a dual socket, or at least dual socket with 8 or 16 threads.


                                            The numbers were for a single-socket Nehalem with 8 threads.

                                             

                                             

                                             

                                            • Large openmp/bandwidth performance regression in newer open64 releases.
                                              prao

                                              Could you please post the output from

                                              $ numactl -H

                                              $ find /sys/devices/system/node/

                                              and

                                              $ numactl --interleave=all ./stream-xxx

                                              where stream-xxx is STREAM compiled with Open64 4.2.2.1 and with 4.2.2.2 or later.

                                              Thanks,

                                               

                                                • Large openmp/bandwidth performance regression in newer open64 releases.
                                                  sbike

                                                  numactl -H didn't work; hopefully this provides what you want:

                                                  $ numactl --show
                                                  policy: default
                                                  preferred node: current
                                                  physcpubind: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
                                                  cpubind: 0 1
                                                  nodebind: 0 1
                                                  membind: 0 1
                                                  $ numactl --hardware
                                                  available: 2 nodes (0-1)
                                                  node 0 size: 12091 MB
                                                  node 0 free: 3467 MB
                                                  node 1 size: 12120 MB
                                                  node 1 free: 44 MB
                                                  node distances:
                                                  node   0   1
                                                    0:  10  20
                                                    1:  20  10
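When reading output like this, the per-node free memory is worth a glance before a bandwidth run, since STREAM's arrays have to be placed somewhere. The sketch below is an illustration, not a diagnosis: it assumes the 457.8 MB footprint would ideally be split evenly across the two nodes (about 229 MB each), and whether the low free memory on node 1 contributes to the numbers in this thread is not established. The function names are made up.

```python
import re


def node_free_mb(text: str) -> dict:
    """Map NUMA node number to free memory in MB, from `numactl --hardware` output."""
    return {
        int(m.group(1)): int(m.group(2))
        for m in re.finditer(r"node (\d+) free: (\d+) MB", text)
    }


def nodes_short_of(text: str, needed_mb: float) -> list:
    """List the nodes whose free memory is below `needed_mb`."""
    return [node for node, free in sorted(node_free_mb(text).items()) if free < needed_mb]


SAMPLE = """available: 2 nodes (0-1)
node 0 size: 12091 MB
node 0 free: 3467 MB
node 1 size: 12120 MB
node 1 free: 44 MB
"""

# Assumption: an even split of 457.8 MB over two nodes.
print(nodes_short_of(SAMPLE, 457.8 / 2))  # [1]
```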

                                                   

                                                  Here's how I compiled:

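                                                  #!/bin/bash
                                                  # Build one stream binary per installed Open64 version.
                                                  for i in 4.2.2 4.2.2.2 4.2.3; do
                                                   echo "$i"
                                                   # Put this version's compiler and runtime libraries first.
                                                   export PATH="/share/apps/open64-$i/bin:$PATH"
                                                   export LD_LIBRARY_PATH="/share/apps/open64-$i/lib"
                                                   opencc -O4 -mp stream.c -o "stream-$i"
                                                  done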

                                                  Here's how I ran:

                                                  #!/bin/bash
                                                  # Run each binary against the runtime libraries of the version that built it.
                                                  for i in 4.2.2 4.2.2.2 4.2.3; do
                                                   echo "Compiled with open64 version $i"
                                                   export LD_LIBRARY_PATH="/share/apps/open64-$i/lib"
                                                   "./stream-$i" | grep -E "Total memory|requested|:"
                                                  done

                                                  Here's the result:

                                                  $ ./run
                                                  Compiled with open64 version 4.2.2
                                                  STREAM version $Revision: 5.9 $
                                                  Total memory required = 457.8 MB.
                                                  Number of Threads requested = 16
                                                  Copy:       22073.5758       0.0145       0.0145       0.0146
                                                  Scale:      21902.8872       0.0146       0.0146       0.0147
                                                  Add:        23982.0022       0.0200       0.0200       0.0201
                                                  Triad:      24082.9529       0.0200       0.0199       0.0201
                                                  Compiled with open64 version 4.2.2.2
                                                  STREAM version $Revision: 5.9 $
                                                  Total memory required = 457.8 MB.
                                                  Number of Threads requested = 16
                                                  Copy:       11184.5094       0.0288       0.0286       0.0289
                                                  Scale:      11169.3204       0.0288       0.0286       0.0289
                                                  Add:        11935.2444       0.0404       0.0402       0.0406
                                                  Triad:      11929.9266       0.0403       0.0402       0.0404
                                                  Compiled with open64 version 4.2.3
                                                  STREAM version $Revision: 5.9 $
                                                  Total memory required = 457.8 MB.
                                                  Number of Threads requested = 16
                                                  Copy:       11295.4216       0.0285       0.0283       0.0288
                                                  Scale:      11294.2913       0.0286       0.0283       0.0287
                                                  Add:        11998.1961       0.0401       0.0400       0.0404
                                                  Triad:      11997.2976       0.0401       0.0400       0.0402
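To put a number on the gap between versions, the filtered output can be parsed and each version expressed as a fraction of a baseline. This is a small sketch that assumes the output format shown above (a "Compiled with open64 version" line followed by kernel lines); the sample is an excerpt of that output, and the function names are illustrative.

```python
import re


def parse_runs(text: str) -> dict:
    """Map compiler version -> {kernel: rate in MB/s} from the filtered run output."""
    results = {}
    version = None
    for raw in text.splitlines():
        line = raw.strip()
        header = re.match(r"Compiled with open64 version (\S+)", line)
        if header:
            version = header.group(1)
            results[version] = {}
            continue
        kernel = re.match(r"(Copy|Scale|Add|Triad):\s+([\d.]+)", line)
        if kernel and version is not None:
            results[version][kernel.group(1)] = float(kernel.group(2))
    return results


def ratio_to_baseline(results: dict, baseline: str, version: str) -> dict:
    """Rate of `version` divided by rate of `baseline`, per kernel."""
    return {k: results[version][k] / results[baseline][k] for k in results[baseline]}


SAMPLE_OUTPUT = """Compiled with open64 version 4.2.2
STREAM version $Revision: 5.9 $
Copy:       22073.5758       0.0145       0.0145       0.0146
Triad:      24082.9529       0.0200       0.0199       0.0201
Compiled with open64 version 4.2.2.2
STREAM version $Revision: 5.9 $
Copy:       11184.5094       0.0288       0.0286       0.0289
Triad:      11929.9266       0.0403       0.0402       0.0404
"""

runs = parse_runs(SAMPLE_OUTPUT)
for kernel, ratio in ratio_to_baseline(runs, "4.2.2", "4.2.2.2").items():
    print(f"{kernel}: {ratio:.2f}x of the 4.2.2 rate")
```

For this excerpt both kernels come out at roughly half of the 4.2.2 rate.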

                                                  If I add the numactl --interleave=all before each stream run:

                                                  $ ./run
                                                  Compiled with open64 version 4.2.2
                                                  STREAM version $Revision: 5.9 $
                                                  Total memory required = 457.8 MB.
                                                  Number of Threads requested = 16
                                                  Copy:       18814.7722       0.0182       0.0170       0.0193
                                                  Scale:      18574.4795       0.0184       0.0172       0.0196
                                                  Add:        20701.3398       0.0247       0.0232       0.0269
                                                  Triad:      20734.4085       0.0245       0.0231       0.0260
                                                  Compiled with open64 version 4.2.2.2
                                                  STREAM version $Revision: 5.9 $
                                                  Total memory required = 457.8 MB.
                                                  Number of Threads requested = 16
                                                  Copy:        9354.2708       0.0344       0.0342       0.0346
                                                  Scale:       9279.9049       0.0346       0.0345       0.0347
                                                  Add:         9958.3117       0.0484       0.0482       0.0485
                                                  Triad:       9981.2750       0.0481       0.0481       0.0482
                                                  Compiled with open64 version 4.2.3
                                                  STREAM version $Revision: 5.9 $
                                                  Total memory required = 457.8 MB.
                                                  Number of Threads requested = 16
                                                  Copy:        9374.8175       0.0342       0.0341       0.0344
                                                  Scale:       9337.6292       0.0344       0.0343       0.0345
                                                  Add:        10021.3158       0.0481       0.0479       0.0485
                                                  Triad:      10046.0343       0.0479       0.0478       0.0480

                                                  Clearly there's some large regression.  From the above it sounded like it's not recognizing my CPU as a Nehalem.  Does that not explain the above?  Is your CPU a model 23 or below?

                                                    • Large openmp/bandwidth performance regression in newer open64 releases.
                                                      sbike

                                                      I suspect that because Open64 doesn't recognize my Nehalem, it defaults to building 32-bit binaries.  I added -m64 and now get better numbers, but still see a large difference.

                                                      Without numactl:

                                                      $ ./run
                                                      Compiled with open64 version 4.2.2
                                                      STREAM version $Revision: 5.9 $
                                                      Total memory required = 457.8 MB.
                                                      Number of Threads requested = 16
                                                      Copy:       29051.4563       0.0131       0.0110       0.0187
                                                      Scale:      29183.4768       0.0133       0.0110       0.0186
                                                      Add:        30460.6457       0.0190       0.0158       0.0297
                                                      Triad:      30293.3526       0.0190       0.0158       0.0275
                                                      Compiled with open64 version 4.2.2.2
                                                      STREAM version $Revision: 5.9 $
                                                      Total memory required = 457.8 MB.
                                                      Number of Threads requested = 16
                                                      Copy:       10735.3571       0.0299       0.0298       0.0300
                                                      Scale:      10791.1148       0.0298       0.0297       0.0301
                                                      Add:        12933.4904       0.0373       0.0371       0.0375
                                                      Triad:      12860.3746       0.0376       0.0373       0.0378
                                                      Compiled with open64 version 4.2.3
                                                      STREAM version $Revision: 5.9 $
                                                      Total memory required = 457.8 MB.
                                                      Number of Threads requested = 16
                                                      Copy:       10722.0642       0.0299       0.0298       0.0300
                                                      Scale:      10802.7533       0.0298       0.0296       0.0300
                                                      Add:        12929.6696       0.0373       0.0371       0.0374
                                                      Triad:      12838.7236       0.0375       0.0374       0.0377

                                                      With numactl --interleave=all :

                                                      $ ./run
                                                      Compiled with open64 version 4.2.2
                                                      STREAM version $Revision: 5.9 $
                                                      Total memory required = 457.8 MB.
                                                      Number of Threads requested = 16
                                                      Copy:       16246.1693       0.0226       0.0197       0.0242
                                                      Scale:      14736.2459       0.0237       0.0217       0.0245
                                                      Add:        14842.7154       0.0347       0.0323       0.0366
                                                      Triad:      15146.6763       0.0334       0.0317       0.0355
                                                      Compiled with open64 version 4.2.2.2
                                                      STREAM version $Revision: 5.9 $
                                                      Total memory required = 457.8 MB.
                                                      Number of Threads requested = 16
                                                      Copy:       10992.8030       0.0297       0.0291       0.0300
                                                      Scale:      10933.5219       0.0300       0.0293       0.0303
                                                      Add:        12230.2972       0.0399       0.0392       0.0403
                                                      Triad:      11919.5870       0.0407       0.0403       0.0411
                                                      Compiled with open64 version 4.2.3
                                                      STREAM version $Revision: 5.9 $
                                                      Total memory required = 457.8 MB.
                                                      Number of Threads requested = 16
                                                      Copy:       11135.0740       0.0317       0.0287       0.0350
                                                      Scale:      11247.7983       0.0313       0.0285       0.0354
                                                      Add:        11861.7893       0.0425       0.0405       0.0461
                                                      Triad:      11714.1606       0.0429       0.0410       0.0466

                                                        • Large openmp/bandwidth performance regression in newer open64 releases.
                                                          s1974

                                                          According to your experience with STREAM, what are the best opencc optimisation flags for a quad-core AMD Opteron 8356?

                                                           

                                                          I tried opencc -O3 -mp stream.c -o stream_test, and the results are very bad: 

                                                           

                                                          Function      Rate (MB/s)   Avg time     Min time     Max time
                                                          Copy:        5760.0591       0.1113       0.1111       0.1114
                                                          Scale:       5759.2187       0.1112       0.1111       0.1114
                                                          Add:         6472.5769       0.1486       0.1483       0.1489
                                                          Triad:       6492.0083       0.1481       0.1479       0.1485



                                                           

                                                           

                                                           

                                                            • Large openmp/bandwidth performance regression in newer open64 releases.
                                                              santosh.zanjurne

                                                              To understand the issue better, could you please provide the following details?

                                                              What type of DDR RAM are you using?
                                                              How many threads are you running (OMP_NUM_THREADS)?
                                                              How many sockets does your machine have, and do you use NUMA?

                                                              Regards,
                                                              Santosh

                                                                • Large openmp/bandwidth performance regression in newer open64 releases.
                                                                  s1974

                                                                  I'm using a machine with:

                                                                   

                                                                  • 8 sockets and 8 quad-core AMD Opteron 8356 
                                                                  • PC2-5300 (DDR2-667)
                                                                  • OS: SLES10sp2
                                                                  • numactl -H


                                                                   

                                                                  available: 8 nodes (0-7)
                                                                  node 0 size: 32301 MB
                                                                  node 0 free: 134 MB
                                                                  node 1 size: 32320 MB
                                                                  node 1 free: 9 MB
                                                                  node 2 size: 32320 MB
                                                                  node 2 free: 707 MB
                                                                  node 3 size: 32320 MB
                                                                  node 3 free: 29898 MB
                                                                  node 4 size: 32320 MB
                                                                  node 4 free: 1068 MB
                                                                  node 5 size: 32320 MB
                                                                  node 5 free: 9 MB
                                                                  node 6 size: 32320 MB
                                                                  node 6 free: 9 MB
                                                                  node 7 size: 32320 MB
                                                                  node 7 free: 9 MB
                                                                  node distances:
                                                                  node   0   1   2   3   4   5   6   7
                                                                    0:  10  20  20  20  20  20  20  20
                                                                    1:  20  10  20  20  20  20  20  20
                                                                    2:  20  20  10  20  20  20  20  20
                                                                    3:  20  20  20  10  20  20  20  20
                                                                    4:  20  20  20  20  10  20  20  20
                                                                    5:  20  20  20  20  20  10  20  20
                                                                    6:  20  20  20  20  20  20  10  20
                                                                    7:  20  20  20  20  20  20  20  10



                                                                   

                                                                   

                                                                  -------------------------------------------------------------


                                                                   

                                                                  Running 2 threads
                                                                  Array size = 40000000, Offset = 0
                                                                  Total memory required = 915.5 MB.
                                                                  -------------------------------------------------------------
                                                                  Function      Rate (MB/s)   Avg time     Min time     Max time
                                                                  Copy:        4621.9738       0.1388       0.1385       0.1391
                                                                  Scale:       4617.1403       0.1389       0.1386       0.1394
                                                                  Add:         3864.0523       0.2490       0.2484       0.2493
                                                                  Triad:       3865.5918       0.2486       0.2483       0.2490
                                                                  -------------------------------------------------------------
                                                                  Running 8 threads
                                                                  Array size = 40000000, Offset = 0
                                                                  Total memory required = 915.5 MB.
                                                                  -------------------------------------------------------------
                                                                    Function      Rate (MB/s)   Avg time     Min time     Max time
                                                                    Copy:        5189.2042       0.1235       0.1233       0.1239
                                                                    Scale:       5191.2213       0.1234       0.1233       0.1236
                                                                    Add:         4265.3578       0.2254       0.2251       0.2257
                                                                    Triad:       4266.2887       0.2253       0.2250       0.2256
                                                                  -------------------------------------------------------------
                                                                  Running 16 threads
                                                                  Array size = 40000000, Offset = 0
                                                                  Total memory required = 915.5 MB.
                                                                  -------------------------------------------------------------
                                                                  Function      Rate (MB/s)   Avg time     Min time     Max time
                                                                  Copy:        7176.4976       0.0894       0.0892       0.0897
                                                                  Scale:       7206.5725       0.0890       0.0888       0.0895
                                                                  Add:         6659.6845       0.1443       0.1442       0.1447
                                                                  Triad:       6702.6868       0.1435       0.1432       0.1440
                                                                  -------------------------------------------------------------
                                                                  Running 32 threads
                                                                  Array size = 40000000, Offset = 0
                                                                  Total memory required = 915.5 MB.
                                                                  -------------------------------------------------------------
                                                                  Function      Rate (MB/s)   Avg time     Min time     Max time
                                                                  Copy:        5756.1189       0.1114       0.1112       0.1117
                                                                  Scale:       5759.2311       0.1113       0.1111       0.1115
                                                                  Add:         6467.0464       0.1486       0.1484       0.1488
                                                                  Triad:       6488.3260       0.1481       0.1480       0.1482
                                                                  -------------------------------------------------------------
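As a sanity check on the figures above, the reported values can be reproduced from the array size. The sketch below assumes the usual STREAM conventions: 8-byte doubles, three arrays, Copy and Scale touching two arrays, Add and Triad touching three, and rates given in units of 10^6 bytes per second computed from the minimum time. The function names are made up for illustration.

```shell
#!/bin/sh
# Sanity-check STREAM figures for N = 40000000 doubles (8 bytes each).
N=40000000

# Total footprint of the three arrays in MiB (2^20 bytes).
total_mib() {
    awk -v n="$1" 'BEGIN { printf "%.1f\n", 3 * 8 * n / 1048576 }'
}

# Rate in MB/s (10^6 bytes/s) for a kernel that touches `arrays` arrays
# of n doubles in `secs` seconds.
rate_mbs() {
    awk -v n="$1" -v arrays="$2" -v secs="$3" \
        'BEGIN { printf "%.1f\n", arrays * 8 * n / secs / 1000000 }'
}

total_mib "$N"            # matches "Total memory required = 915.5 MB"
rate_mbs "$N" 2 0.0892    # Copy, 16 threads, from the printed Min time
rate_mbs "$N" 3 0.1432    # Triad, 16 threads, from the printed Min time
```

The computed rates agree with the reported ones only to within the rounding of the printed Min time, since the benchmark divides by the unrounded value.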


                                                                    • Large openmp/bandwidth performance regression in newer open64 releases.
                                                                      kvikrant

                                                                      Thanks for the info.

                                                                      First, it looks like several nodes have very little free memory (probably because of temporary buffers, other processes, etc.). May I suggest a reboot before a run?

                                                                      Then you can run as follows. This binds the threads to unique nodes, using local memory for the 1-thread run and interleaved memory for the 8-thread run:

                                                                      1. one thread run

                                                                      export OMP_NUM_THREADS=1

                                                                      export O64_OMP_AFFINITY_MAP=0

                                                                      stream_omp_open64


                                                                      2. 8-thread run

                                                                      export OMP_NUM_THREADS=8

                                                                      export O64_OMP_AFFINITY_MAP=0,4,8,12,16,20,24,28

                                                                      numactl --interleave=all stream_omp_open64


                                                                      Thanks

                                                                      Vikrant
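The affinity map in step 2 picks one core on each of the eight nodes. Here is a minimal sketch of how such a list can be generated, assuming cores are numbered consecutively within each node (four per node here). That numbering is an assumption and should be checked against the actual topology, for example with `numactl -H`. The helper name is hypothetical.

```shell
#!/bin/sh
# Print a comma-separated list holding the first core of each NUMA node.
# ASSUMPTION: node k owns cores k*cores_per_node .. (k+1)*cores_per_node-1.
first_core_per_node() {
    nodes=$1
    cores_per_node=$2
    i=0
    map=""
    while [ "$i" -lt "$nodes" ]; do
        # Append a comma only when the list is already non-empty.
        map="${map:+$map,}$((i * cores_per_node))"
        i=$((i + 1))
    done
    printf '%s\n' "$map"
}

first_core_per_node 8 4   # prints 0,4,8,12,16,20,24,28
```

The result could then be exported as `O64_OMP_AFFINITY_MAP=$(first_core_per_node 8 4)` before launching the benchmark.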

                                                                      -------------------------
                                                                      -------------------------
                                                                      The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. Links to third party sites are for convenience only, and no endorsement is implied.

                                                                        • Large openmp/bandwidth performance regression in newer open64 releases.
                                                                          nervi

                                                                          Any news about the performance of Open64?

                                                                          I'm planning to compile heavy number-crunching applications that should run in parallel (OpenMP or OpenMPI) and in 64-bit mode. The sources are in Fortran and partially in C/C++. I'd like to get the best running speed from my 4x6168 Opteron CPUs.

                                                                          Which version of AMD Open64 should I download? I always assume that the latest is somewhat better and produces the fastest code, but...

                                                                          I would greatly appreciate any suggestions...

                                                                          I forgot to mention that I'm currently using Debian, but I'm planning to move to Gentoo.

                                                                            • Large openmp/bandwidth performance regression in newer open64 releases.
                                                                              santosh.zanjurne

                                                                              We think that the latest release of the compiler 4.2.4 should give you the best performance numbers. You can download it from: http://developer.amd.com/cpu/open64/pages/default.aspx#four

                                                                               

                                                                              The Open64 compiler officially supports SUSE and RHEL platforms; however, you should be able to use it on other Linux distributions as well.

                                                                              On Debian/Gentoo you may face some issues. In that case, please build the compiler from the sources; see HOWTO-INSTALL-OPEN64 in the compiler source documentation.

                                                                               

                                                                              We can help you if it’s possible for you to let us know which App/Benchmark you are using.

                                                                               

                                                                              We also think that you should get better numbers for the benchmark on SLES/RHEL platforms.  Is there any particular reason you are using Debian/Gentoo?

                                                                               

                                                                              It is a known issue that some Linux operating systems running on Magny-Cours processors have a node numbering problem that may greatly affect the performance numbers. Please make sure you are not affected by this issue, and apply the patch if necessary.

                                                                               

                                                                              We have already fixed one bug related to using OpenMP with gcc-4.5 on Debian, but the fix is not out yet. Please use the links below to learn more about earlier problems.

                                                                               

                                                                              Debian:

                                                                              http://forums.amd.com/devforum/messageview.cfm?catid=373&threadid=134566&enterthread=y

                                                                               

                                                                              Gentoo:

                                                                              http://forums.amd.com/devforum/messageview.cfm?catid=373&threadid=128518&enterthread=y

                                                                               

                                                                              Regards,

                                                                              Santosh

                                                                               

                                                                                • Large openmp/bandwidth performance regression in newer open64 releases.
                                                                                  nervi

                                                                                  Hello,

                                                                                  Thank you for your reply, and sorry for the late answer.

                                                                                  I'm using Gentoo and I'd like to compile Quantum ESPRESSO so that I get the fastest parallel code. I'm going to test both OpenMP and OpenMPI on my single motherboard.

                                                                                  Thanks again for the suggestions and very helpful hints. I'm working on it and I hope to get back soon. Thanks,

                                                                                   Carlo

                                                                                    • Large openmp/bandwidth performance regression in newer open64 releases.
                                                                                      nervi

                                                                                      Hello again, I continue on this thread...

                                                                                      I downloaded both the source and the binaries of open64 4.2.4-1 and extracted them into the directory /home/user/src/x86_open64-4.2.4

                                                                                      Following the instructions in the "INSTALL" file, and since I'm only interested in 64-bit executables, I did:

                                                                                      [CODE]export TOOLROOT=/home/user/src/x86_open64-4.2.4
                                                                                      export PATH=${TOOLROOT}/bin:$PATH

                                                                                      make all MACHINE_TYPE=x86_64

                                                                                      make lib MACHINE_TYPE=x86_64 BUILD_COMPILER=OSP
                                                                                      make -C osprey/targx8664_x8664 BUILD_COMPILER=OSP[/CODE]

                                                                                      The last make resulted in the following error:

                                                                                      Any hints? - Thank you!

                                                                                      [CODE]make -C osprey/targx8664_x8664 BUILD_COMPILER=OSP
                                                                                      . .
                                                                                      ../../libcif/libcif.a(cif_conv.o): In function `Cif_Make_Cifconv':
                                                                                      cif_conv.c:(.text+0x3d8a): warning: the use of `tempnam' is dangerous, better use `mkstemp'
                                                                                      ../../libcif/libcif.a(cif_conv.o): In function `Cif_Cifconv':
                                                                                      cif_conv.c:(.text+0x4579): warning: the use of `mktemp' is dangerous, better use `mkstemp'
                                                                                      decorate_utils.o: In function `parse_decorate_script(char const*)':
                                                                                      decorate_utils.cxx:(.text+0x1084): undefined reference to `std::ctype<char>::_M_widen_init() const'
                                                                                      f2c_abi_utils.o: In function `Check_FF2C_Script':
                                                                                      f2c_abi_utils.cxx:(.text+0x854): undefined reference to `std::ctype<char>::_M_widen_init() const'
                                                                                      ../fe90/fe90.a(fold.o): In function `fold_operation__':
                                                                                      fold.f:(.text+0xb88): undefined reference to `_gfortran_selected_real_kind'
                                                                                      fold.f:(.text+0x1039): undefined reference to `_gfortran_selected_real_kind'
                                                                                      fold.f:(.text+0x1960): undefined reference to `_gfortran_pow_i4_i4'
                                                                                      fold.f:(.text+0x1d59): undefined reference to `_gfortran_pow_i4_i4'
                                                                                      fold.f:(.text+0x1ff5): undefined reference to `_gfortran_pow_i4_i4'
                                                                                      fold.f:(.text+0x2084): undefined reference to `_gfortran_ishftc4'
                                                                                      fold.f:(.text+0x226b): undefined reference to `_gfortran_pow_i8_i8'
                                                                                      fold.f:(.text+0x2341): undefined reference to `_gfortran_ishftc8'
                                                                                      fold.f:(.text+0x2388): undefined reference to `_gfortran_ishftc4'
                                                                                      fold.f:(.text+0x2655): undefined reference to `_gfortran_ishftc4'
                                                                                      fold.f:(.text+0x2ec9): undefined reference to `_gfortran_ishftc4'
                                                                                      fold.f:(.text+0x3058): undefined reference to `_gfortran_ishftc4'
                                                                                      fold.f:(.text+0x3255): undefined reference to `_gfortran_ishftc4'
                                                                                      ../fe90/fe90.a(fold.o):fold.f:(.text+0x3482): more undefined references to `_gfortran_ishftc4' follow
                                                                                      ../fe90/fe90.a(fold.o): In function `fold_operation__':
                                                                                      fold.f:(.text+0x34af): undefined reference to `_gfortran_selected_real_kind'
                                                                                      fold.f:(.text+0x3529): undefined reference to `_gfortran_ishftc8'
                                                                                      fold.f:(.text+0x38c4): undefined reference to `_gfortran_selected_real_kind'
                                                                                      fold.f:(.text+0x38ee): undefined reference to `_gfortran_selected_real_kind'
                                                                                      fold.f:(.text+0x390b): undefined reference to `_gfortran_selected_real_kind'
                                                                                      fold.f:(.text+0x3aa7): undefined reference to `_gfortran_ishftc4'
                                                                                      fold.f:(.text+0x3eee): undefined reference to `_gfortran_ishftc8'
                                                                                      collect2: ld returned 1 exit status
                                                                                      make[3]: *** [mfef95] Error 1
                                                                                      make[2]: *** [default] Error 2
                                                                                      make[1]: *** [first] Error 2
                                                                                      make: *** [default] Error 2[/CODE]

                                                                                        • Large openmp/bandwidth performance regression in newer open64 releases.
                                                                                          nervi
                                                                                          I found the error during "make all MACHINE_TYPE=x86_64":

                                                                                          Update: I manually added the "-cpp" switch to the gfortran command (in the appropriate directory) and it works!

                                                                                          gfortran -fsecond-underscore -m64 -m64 -mno-sse2 -I../../include -O2 -fno-strict-aliasing -D_MIPSEL -D_LONGLONG -D_MIPS_SZINT=32 -D_MIPS_SZPTR=64 -D_MIPS_SZLONG=64 -D_LP64 -MMD -c ../../../crayf90/fe90/fold.f
                                                                                          Fatal Error: To enable preprocessing, use -cpp
                                                                                          make[4]: *** [fold.o] Error 1
                                                                                          make[3]: *** [default] Error 2
                                                                                          make[2]: *** [first] Error 2
                                                                                          make[2]: Leaving directory `/home/nervi/src/Open64/x86_open64-4.2.4/osprey/targx8664_x8664/crayf90'
                                                                                          make[1]: *** [mfef95] Error 2
                                                                                          make[1]: Leaving directory `/home/nervi/src/Open64/x86_open64-4.2.4'
                                                                                          make: *** [build] Error 2

                                                                                            • Large openmp/bandwidth performance regression in newer open64 releases.
                                                                                              nervi
                                                                                              Okay, I've only shifted the problem. Now the errors appear after
                                                                                              "make -C osprey/targx8664_x8664 BUILD_COMPILER=OSP"

                                                                                              make first
                                                                                              MAKE /home/nervi/src/Open64/x86_open64-4.2.4/osprey/targx8664_x8664/crayf90/libf90sgi/../../include
                                                                                              make make_libdeps
                                                                                              make mfef95 cf95.cat
                                                                                              sh: hg: command not found
                                                                                              GEN compiler_build_date.c
                                                                                              C /home/nervi/src/Open64/x86_open64-4.2.4/osprey/targx8664_x8664/crayf90/sgi/compiler_build_date.c
                                                                                              LD /home/nervi/src/Open64/x86_open64-4.2.4/osprey/targx8664_x8664/crayf90/sgi/mfef95
                                                                                              ../../libcif/libcif.a(cif_conv.o): In function `Cif_Make_Cifconv':
                                                                                              cif_conv.c:(.text+0x3d8a): warning: the use of `tempnam' is dangerous, better use `mkstemp'
                                                                                              ../../libcif/libcif.a(cif_conv.o): In function `Cif_Cifconv':
                                                                                              cif_conv.c:(.text+0x4579): warning: the use of `mktemp' is dangerous, better use `mkstemp'
                                                                                              decorate_utils.o: In function `parse_decorate_script(char const*)':
                                                                                              decorate_utils.cxx:(.text+0x1084): undefined reference to `std::ctype<char>::_M_widen_init() const'
                                                                                              f2c_abi_utils.o: In function `Check_FF2C_Script':
                                                                                              f2c_abi_utils.cxx:(.text+0x854): undefined reference to `std::ctype<char>::_M_widen_init() const'
                                                                                              ../fe90/fe90.a(fold.o): In function `fold_operation__':
                                                                                              fold.f:(.text+0xb88): undefined reference to `_gfortran_selected_real_kind'
                                                                                              fold.f:(.text+0x1039): undefined reference to `_gfortran_selected_real_kind'
                                                                                              fold.f:(.text+0x1960): undefined reference to `_gfortran_pow_i4_i4'
                                                                                              fold.f:(.text+0x1d59): undefined reference to `_gfortran_pow_i4_i4'
                                                                                              fold.f:(.text+0x1ff5): undefined reference to `_gfortran_pow_i4_i4'
                                                                                              fold.f:(.text+0x2084): undefined reference to `_gfortran_ishftc4'
                                                                                              fold.f:(.text+0x226b): undefined reference to `_gfortran_pow_i8_i8'
                                                                                              fold.f:(.text+0x2341): undefined reference to `_gfortran_ishftc8'
                                                                                              fold.f:(.text+0x2388): undefined reference to `_gfortran_ishftc4'
                                                                                              fold.f:(.text+0x2655): undefined reference to `_gfortran_ishftc4'
                                                                                              fold.f:(.text+0x2ec9): undefined reference to `_gfortran_ishftc4'
                                                                                              fold.f:(.text+0x3058): undefined reference to `_gfortran_ishftc4'
                                                                                              fold.f:(.text+0x3255): undefined reference to `_gfortran_ishftc4'
                                                                                              ../fe90/fe90.a(fold.o):fold.f:(.text+0x3482): more undefined references to `_gfortran_ishftc4' follow
                                                                                              ../fe90/fe90.a(fold.o): In function `fold_operation__':
                                                                                              fold.f:(.text+0x34af): undefined reference to `_gfortran_selected_real_kind'
                                                                                              fold.f:(.text+0x3529): undefined reference to `_gfortran_ishftc8'
                                                                                              fold.f:(.text+0x38c4): undefined reference to `_gfortran_selected_real_kind'
                                                                                              fold.f:(.text+0x38ee): undefined reference to `_gfortran_selected_real_kind'
                                                                                              fold.f:(.text+0x390b): undefined reference to `_gfortran_selected_real_kind'
                                                                                              fold.f:(.text+0x3aa7): undefined reference to `_gfortran_ishftc4'
                                                                                              fold.f:(.text+0x3eee): undefined reference to `_gfortran_ishftc8'
                                                                                              collect2: ld returned 1 exit status
                                                                                              make[3]: *** [mfef95] Error 1
                                                                                              make[2]: *** [default] Error 2
                                                                                              make[1]: *** [first] Error 2
                                                                                              make: *** [default] Error 2
                                                                                              make: Leaving directory `/home/nervi/src/Open64/x86_open64-4.2.4/osprey/targx8664_x8664'
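The link log is long but names only a handful of distinct missing symbols. The `_gfortran_*` names are normally provided by the GNU Fortran runtime (libgfortran) and `std::ctype<char>::_M_widen_init()` by libstdc++, so a summary of which symbols are missing can help narrow down which runtime the link step is not picking up. Below is a small sketch for producing that summary from a saved log. The helper name is hypothetical, the sample lines are copied from the log above, and `grep -o` is assumed to be available (it is in GNU and BSD grep).

```shell
#!/bin/sh
# Read linker output on stdin and print "count symbol" for every
# symbol reported as an undefined reference, most frequent first.
undefined_symbols() {
    grep -o "undefined references* to \`[^']*'" |
        sed "s/.*\`\(.*\)'/\1/" |
        sort | uniq -c | sort -rn |
        awk '{ n = $1; sub(/^ *[0-9]+ /, ""); print n, $0 }'
}

# Sample input: a few lines taken from the log in this post.
undefined_symbols <<'EOF'
decorate_utils.cxx:(.text+0x1084): undefined reference to `std::ctype<char>::_M_widen_init() const'
fold.f:(.text+0x2084): undefined reference to `_gfortran_ishftc4'
fold.f:(.text+0x226b): undefined reference to `_gfortran_pow_i8_i8'
fold.f:(.text+0x2388): undefined reference to `_gfortran_ishftc4'
EOF
```

With a full log saved to a file (say `build.log`, a made-up name), the same helper would be run as `undefined_symbols < build.log`.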

                                                                          • Large openmp/bandwidth performance regression in newer open64 releases.
                                                                            prao

                                                                             

                                                                            Clearly there's some large regression.  From the above it sounded like it's not recognizing my CPU as a Nehalem.  Does that not explain the above?  Is your CPU a model 23 or below?

                                                                             

                                                                            Mine's a model 26 Nehalem as well, but a single socket.

                                                                            Thanks for the info above. Could you also please post the output of

                                                                            $ find /sys/devices/system/node/

                                                                             

                                                                            Thanks,

                                                                              • Large openmp/bandwidth performance regression in newer open64 releases.
                                                                                sbike

                                                                                 

                                                                                Mine's a model 26 Nehalem as well, but a single socket.

                                                                                Ah, interesting, I assume by default your compiler produces a 64-bit binary?

                                                                                 

                                                                                 

                                                                                 

                                                                                Thanks for the info above. Could you also please post the output of

                                                                                 

                                                                                $ find /sys/devices/system/node/

                                                                                 

                                                                                $  find /sys/devices/system/node/

                                                                                /sys/devices/system/node/
                                                                                /sys/devices/system/node/node1
                                                                                /sys/devices/system/node/node1/cpu15
                                                                                /sys/devices/system/node/node1/cpu14
                                                                                /sys/devices/system/node/node1/cpu13
                                                                                /sys/devices/system/node/node1/cpu12
                                                                                /sys/devices/system/node/node1/cpu7
                                                                                /sys/devices/system/node/node1/cpu6
                                                                                /sys/devices/system/node/node1/cpu5
                                                                                /sys/devices/system/node/node1/cpu4
                                                                                /sys/devices/system/node/node1/distance
                                                                                /sys/devices/system/node/node1/numastat
                                                                                /sys/devices/system/node/node1/meminfo
                                                                                /sys/devices/system/node/node1/cpumap
                                                                                /sys/devices/system/node/node0
                                                                                /sys/devices/system/node/node0/cpu11
                                                                                /sys/devices/system/node/node0/cpu10
                                                                                /sys/devices/system/node/node0/cpu9
                                                                                /sys/devices/system/node/node0/cpu8
                                                                                /sys/devices/system/node/node0/cpu3
                                                                                /sys/devices/system/node/node0/cpu2
                                                                                /sys/devices/system/node/node0/cpu1
                                                                                /sys/devices/system/node/node0/cpu0
                                                                                /sys/devices/system/node/node0/distance
                                                                                /sys/devices/system/node/node0/numastat
                                                                                /sys/devices/system/node/node0/meminfo
                                                                                /sys/devices/system/node/node0/cpumap
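The listing above can be condensed into a per-node CPU list, which makes the numbering easier to read when choosing an affinity map. This is a small sketch with a hypothetical helper name that reads `find`-style paths on standard input; the sample input is a subset of the paths posted above.

```shell
#!/bin/sh
# Turn paths like .../node1/cpu12 into one line per node: "node1: 4 12".
node_cpu_map() {
    # Keep only nodeN/cpuM entries (cpumap, meminfo, etc. do not match).
    sed -n 's|.*/node\([0-9][0-9]*\)/cpu\([0-9][0-9]*\)$|\1 \2|p' |
        sort -k1,1n -k2,2n |
        awk '{
                 if (!($1 in cpus)) order[++n] = $1   # remember first-seen node order
                 cpus[$1] = cpus[$1] " " $2
             }
             END { for (i = 1; i <= n; i++) printf "node%s:%s\n", order[i], cpus[order[i]] }'
}

# Sample input: a few of the paths from the listing in this post.
node_cpu_map <<'EOF'
/sys/devices/system/node/node1/cpu12
/sys/devices/system/node/node1/cpu4
/sys/devices/system/node/node1/cpumap
/sys/devices/system/node/node0/cpu8
/sys/devices/system/node/node0/cpu0
EOF
```

On a live system the same helper would be fed directly, as in `find /sys/devices/system/node/ | node_cpu_map`.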



                                                                                 

                                                                                 

                                                                                 

                                                                                Thanks,