2 replies; latest reply on Aug 30, 2011 2:47 PM by milos

    Missing threads on OpenMP code compiled by openf95

    milos

      I am running a parallelized electromagnetic Fortran 90 code on 32 cores (four 8-core Opteron processors on one motherboard).

      If I use the Intel compiler, I run the code with OMP_NUM_THREADS=32 and all 32 cores are used.

      Since the Intel compiler cannot bind threads to cores on AMD processors, I would like to use the AMD Open64 compiler (openf95).  However, when I compile with it, only the first 29 of the 32 cores are used, and the other three sit idle.

      The code I'm running simply has a loop that is split 32 ways, so I can't see why 29 cores would be very busy while 3 cores sit idle.
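      For what it's worth, a minimal sketch of that kind of loop (a hypothetical reproducer with made-up names and iteration count, not the actual EM code): each iteration records which thread executed it, so running it with OMP_NUM_THREADS=32 shows directly whether all 32 threads are given work.

          ! Hypothetical reproducer sketch, not the real code:
          ! count how many loop iterations each OpenMP thread executes.
          program thread_check
             use omp_lib
             implicit none
             integer, parameter :: n = 1000          ! placeholder iteration count
             integer :: i, tid
             integer, allocatable :: per_thread(:)
             real :: work(n)

             allocate(per_thread(0:omp_get_max_threads()-1))
             per_thread = 0

          !$omp parallel do private(tid)
             do i = 1, n
                tid = omp_get_thread_num()
                per_thread(tid) = per_thread(tid) + 1   ! each thread updates only its own slot
                work(i) = sqrt(real(i))                 ! stand-in for the real loop body
             end do
          !$omp end parallel do

             do tid = 0, omp_get_max_threads() - 1
                print '(a,i3,a,i8,a)', 'thread ', tid, ' ran ', per_thread(tid), ' iterations'
             end do
             print *, 'checksum: ', sum(work)           ! keep the loop from being optimized away
          end program thread_check

      Building the same file with both compilers (with openf95 I believe the OpenMP flag is -mp) and comparing the per-thread counts should show whether the Open64 runtime hands out iterations differently.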

      Can someone explain, or point me to literature on, how the threads are allocated, i.e. whether there are any initial "manager" threads that have to be skipped?  I recall having to use the "dplace" command when running jobs on SGI Altix machines to skip master threads in order to distribute and bind threads to cores properly.

      Thanks for any help!

      Milos

        • Missing threads on OpenMP code compiled by openf95
          santosh.zanjurne

          Hello Milos,

          We tried to reproduce the problem you mentioned but could not see the difference you reported.  Would it be possible for you to share a test case with us?  Please also give us more information about your environment, e.g. AMD part number, Open64 compiler version, OS, etc.

           

          Thanks & Regards,

          Santosh

           

            • Missing threads on OpenMP code compiled by openf95
              milos

               


              I ran into some odd behavior, so I need to look into this further before I reply properly.  At first I wasn't making the number of loop iterations an even multiple of the 32 cores, so when the work was divided up I guess some cores may have been left free.  When I make the count close to (or just under) such a multiple, almost all cores are busy.

              I am still confused, though, about why in the previous case the last 3 cores sit at 0%, the 4th from last is at about 65%, and the rest are high (>80%, maybe 90%+) and roughly equal.  Why would only the very last cores be less busy, since presumably OpenMP divides the iterations up equally?  And I am binding threads to cores using O64_OMP_AFFINITY_MAP, so they shouldn't be wandering between cores.
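              One possible explanation, though this is purely an assumption about how the Open64 runtime might partition a static schedule rather than anything confirmed here: if it uses a chunk size of ceiling(niter/nthreads) and hands out whole chunks in order, the trailing threads get nothing whenever the iteration count is not a multiple of the thread count.  A quick sketch of that arithmetic with a made-up count of 115 iterations:

                  ! Sketch of a ceiling(niter/nthreads) static chunking; the iteration
                  ! count below is made up purely for illustration.
                  program chunk_demo
                     implicit none
                     integer, parameter :: nthreads = 32
                     integer, parameter :: niter    = 115   ! hypothetical, not a multiple of 32
                     integer :: chunk, tid, lo, hi

                     chunk = (niter + nthreads - 1) / nthreads   ! ceiling division -> 4 here
                     do tid = 0, nthreads - 1
                        lo = tid * chunk + 1
                        hi = min((tid + 1) * chunk, niter)
                        if (lo > niter) then
                           print '(a,i3,a)', 'thread ', tid, ': no iterations'
                        else
                           print '(a,i3,a,i4,a,i4)', 'thread ', tid, ': iterations ', lo, ' to ', hi
                        end if
                     end do
                  end program chunk_demo

              With 115 iterations the chunk size is 4, so threads 0-27 get 4 iterations each, thread 28 gets only 3, and threads 29-31 get none, which would look exactly like three idle cores and one core only partly busy.  A runtime that instead spreads the remainder one extra iteration per thread (which the Intel runtime may be doing) would keep every core busy.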

              I am running on a 4-way AMD Opteron 6128 system on a Supermicro motherboard (4 sockets x 8 cores = 32 cores), with 64 GB of RAM, far more than the test uses, but the test is much bigger than cache (it's a few GB).

              Milos