Hi folks
I'm having serious problems with OpenMP-parallelized nested loops of the following type:
#pragma omp parallel for num_threads(THS) schedule(static, N_z/THS)
for (i = 0; i < N_z; i++) {
    for (j = 0; j < N_r; j++)
        B[i*N_r+j] += sin(A[i*N_r+j]);
}
I'm using a dual-Opteron TYAN motherboard with two Opteron 6272 CPUs, 128 GB of DDR3, and four GeForce GTX 690 cards. All this is running Debian Linux, kernel 3.2.0-4-amd64. The compiler is GCC 4.7.
I see a SLOWDOWN of 1.5x with 16 threads (THS in the code) compared to a single-threaded run. With pthreads I get a normal ~10x speedup. On a Nehalem server with 8 cores everything is okay: not 8x but a 4x speedup, yet it is a speedup, not a slowdown! You are welcome to test my code on your systems (see attachment).
Attachment contents:
- *.c/*.h - source
- test - the binary compiled on my system, but for pthreads with 16 threads... see the run script; you can remove it
- compile - compile script
- run - run script
- run.log - output log from my system
P.S. I used taskset and also tested the case where only cores 0, 2, 4, 6, ... are used. The result is the same: OpenMP does not work. All threads are at 100% load, but execution time is 1.5x longer than in the single-threaded version, while the pthreads version behaves fine.
Message was edited by: Daniil Fadeev
Hi there again,
On the OpenMP forum I got the solution to my problem: I just need to declare j locally inside the outer loop.
#pragma omp parallel for num_threads(THS) schedule(static, N_z/THS)
for (i = 0; i < N_z; i++) {
    int j;
    for (j = 0; j < N_r; j++)
        B[i*N_r+j] += sin(A[i*N_r+j]);
}
But I'm still confused about this slowdown, because nearly all the time in my code is spent computing the sine. Why does this possible conflict affect performance so much? On the Nehalem architecture the impact is not nearly as dramatic! I'm waiting for a response on the OpenMP forum, because I don't understand in detail how OpenMP works.
Here is the link to the thread on the OpenMP forum:
OpenMP® Forum • View topic - Opteron 6200 issue - dramatic speed-down for nested loops