4 Replies Latest reply on May 15, 2013 2:26 PM by ortizjavier

    ACML set num threads not working on trsm functions

    ortizjavier

      Hi community, I'm writing because I'm seeing strange behavior with the acmlsetnumthreads function...

       

       

      I set the max number of threads using the following statement:

           acmlsetnumthreads(4);

       

       

      And then when I execute for example a DGEMM or a DGETF2 call, I can see through the 'top' command that only 4 processors are being used...

       

       

      When the call is a DTRSM function, I see that almost all of my 32 processors are working... That makes me think that this function is probably running more than 4 threads...

      I'm trying to make a Thread Pool with nested parallelism and affinity but I have a very low execution speed when two or more xTRSM are executing at the same time.

       

       

      Can somebody shed some light on this?

       

       

      I'm using ACML with GCC on openSUSE... If any other information would be helpful, please tell me.

       

       

      Thanks for reading!

      Javier

        • Re: ACML set num threads not working on trsm functions
          chipf

          Which version of the library?

          I tried a simple experiment with the 5.3.1 and 5.3.0 gfortran libraries. Both honor OMP_NUM_THREADS and calls by the program to omp_set_num_threads. My program used OpenMP for threading, not pthreads.

           

          Only if OMP_NUM_THREADS is set to 32 did I see 32 threads. Note that if OMP_NUM_THREADS is not set, the GCC OpenMP runtime interprets that to mean all available threads.
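The interaction between the default thread count and an explicit request can be checked with a few lines of plain OpenMP. This is only a minimal sketch: `request_threads` is a hypothetical helper, not an ACML function, and it falls back to a single thread when the file is compiled without `-fopenmp`.

```c
#include <assert.h>
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical helper (not part of ACML): request a thread count and
   return what the OpenMP runtime will use for the next parallel region. */
int request_threads(int requested)
{
    assert(requested > 0);
#ifdef _OPENMP
    /* Before any explicit request this reflects OMP_NUM_THREADS, or the
       runtime's default when that variable is unset. */
    printf("max threads before request: %d\n", omp_get_max_threads());
    omp_set_num_threads(requested);
    return omp_get_max_threads();
#else
    (void)requested;
    return 1; /* compiled without OpenMP: always one thread */
#endif
}
```

Calling `request_threads(4)` early in `main` and comparing the printed value with what `top` shows is a quick way to tell whether the limit is being set at all or is being exceeded later.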

           

           

          Note that 5.3.1 has a new affinity feature that may work against you in this application. It always binds starting with the first core or core unit and adds cores until the requested number of threads is bound. If your application has two threads calling ACML, they will both bind to the same cores. Unfortunately, we did not put in a way to override this in the 5.3.1 release.

          Because of this you might need to stay with 5.3.0 for your application.

            • Re: ACML set num threads not working on trsm functions
              ortizjavier

              Hi Chip, thank you for your answer!

               

              I'm currently using the 5.3.0 C libraries with gcc v4.7.2 and linking with libacml_mp.so...

               

              My code looks like this:

               

              #include <omp.h>
              #include <stdio.h>
              #include <stdlib.h>   /* malloc */
              #include <acml.h>

              int main(int argc, char *argv[]){
                  int M = 10000;
                  int i;

                  /* Three M x M matrices stored as flat arrays */
                  double *A = (double *) malloc (M * M * sizeof(double));
                  double *B = (double *) malloc (M * M * sizeof(double));
                  double *C = (double *) malloc (M * M * sizeof(double));

                  printf("Matrix Creation...\n");
                  for(i = 0; i < M * M; i++){
                      A[i] = 1;
                      B[i] = 1;
                      C[i] = 1;
                  }

                  printf("OMP max threads: %d \n", omp_get_max_threads());
                  omp_set_nested(1);      /* enable nested parallelism */
                  acmlsetnumthreads(4);   /* limit ACML to 4 threads */
                  printf("OMP max threads: %d \n", omp_get_max_threads());

                  printf("Working\n");

                  //dgemm('N', 'N', M, M, M, -1, A, M, B, M, 1, C, M);
                  dtrsm('R','U', 'N', 'N', M, M, 1.0, A, M, B, M);

                  return 0;
              }

               

              The output is:

              Creating...

              OMP max threads: 32

              OMP max threads: 4

              Working

               

              But while the DTRSM function is executing, I see at least 16 of my 32 processors working... When I use DGEMM, DGETF2 or DGETRF I see only 4 processors working... That is the expected result, but we can't figure out why DTRSM uses more threads than the maximum...

               

              Maybe we are doing something wrong... Any idea what it could be?

               

              Thanks again!

              Javier

                • Re: ACML set num threads not working on trsm functions
                  chipf

                  The call to omp_set_nested is causing problems for the case that you show. If I comment it out, I see only the number of threads specified in the acmlsetnumthreads call.

                   

                  Note that dtrsm and dgeqrf both call dgemm. All of these subroutines have OpenMP threading enabled. Our library assumes that nesting is disabled, so that when multiple threads of dtrsm call dgemm, each dtrsm thread runs only one dgemm thread. When you enable nested parallelism, you break this assumption. With nesting enabled, N threads of DTRSM will each spawn N threads of DGEMM, resulting in NxN threads.
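The NxN effect can be reproduced without ACML. The sketch below uses plain OpenMP and a hypothetical helper, `count_inner_threads`, in which an outer parallel region stands in for DTRSM and an inner region stands in for the DGEMM calls. It uses `omp_set_nested` as in the program above; OpenMP 5.0 deprecates that call in favor of `omp_set_max_active_levels`. The counts assume the runtime grants every requested thread, which it is not obliged to do.

```c
#include <assert.h>
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical helper: an outer region of `outer` threads each opens an
   inner region of `inner` threads. Returns how many innermost threads ran. */
int count_inner_threads(int outer, int inner, int nested)
{
    int total = 0;

    assert(outer > 0 && inner > 0);
#ifdef _OPENMP
    omp_set_dynamic(0);                /* ask for exactly the requested counts */
    omp_set_nested(nested ? 1 : 0);    /* the switch discussed above */
#else
    (void)nested;
#endif
    #pragma omp parallel num_threads(outer) reduction(+:total)
    {
        int local = 0;                 /* private to each outer thread */
        #pragma omp parallel num_threads(inner) reduction(+:local)
        {
            local += 1;                /* one increment per inner thread */
        }
        total += local;
    }
    return total;
}
```

With nesting off, each inner region is serialized, so `count_inner_threads(4, 4, 0)` should report at most 4. With nesting on, the same call can report up to 16, which is consistent with the 16 busy processors reported for DTRSM.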

                   

                  When you call only DGEMM, there is no nesting involved, at least the way your test application works, so you won't see the same type of thread oversubscription.

                   

                  There are some odd things going on. I tried this with only 2 and 3 threads and saw strange thread placement; for instance, I didn't see 4 or 9 threads. I think this shows that the GCC OpenMP runtime may not always get thread placement correct when using nesting. I suspect the runtime is scheduling multiple threads to the same CPU, which is obviously bad. Evidence for this is that as the program runs, some CPUs finish their tasks and go idle, but a few are still left running at 100%. The ones that stick around for a long time have multiple threads running on them. This may also interact with the OS you are using; I'm running this on SLES 11 SP2.

                   

                  In general we recommend against calling the OpenMP library from within a threaded program. You might be able to do this by using numa and sched_setaffinity calls to set the number of threads and CPU affinity for each ACML call, but you should not enable nested parallelism within the ACML library.
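For the affinity part of that suggestion, a minimal Linux-only sketch might look like the following. Both function names are hypothetical, nothing here is ACML API, and whether the library's own worker threads end up on the chosen CPUs depends on when the OpenMP runtime creates them, so treat it as a starting point rather than a tested recipe.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>

/* Number of CPUs the calling thread may currently run on, or -1 on error. */
int allowed_cpu_count(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    if (sched_getaffinity(0, sizeof(set), &set) != 0)
        return -1;
    return CPU_COUNT(&set);
}

/* Hypothetical helper: restrict the calling thread to the first `count`
   CPUs of its current affinity mask. Returns 0 on success, -1 on failure. */
int pin_to_first_allowed(int count)
{
    cpu_set_t current, wanted;
    int cpu, picked = 0;

    assert(count > 0);
    CPU_ZERO(&current);
    if (sched_getaffinity(0, sizeof(current), &current) != 0)
        return -1;

    CPU_ZERO(&wanted);
    for (cpu = 0; cpu < CPU_SETSIZE && picked < count; cpu++) {
        if (CPU_ISSET(cpu, &current)) {   /* only pick CPUs we may use */
            CPU_SET(cpu, &wanted);
            picked++;
        }
    }
    if (picked == 0)
        return -1;
    return sched_setaffinity(0, sizeof(wanted), &wanted); /* pid 0 = caller */
}
```

A pool worker would call `pin_to_first_allowed(4)`, or a variant that selects a different CPU range per worker, before its first ACML call. Threads created afterwards by that worker inherit its mask, while threads that already exist keep theirs.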