We'll have to try and reproduce this. If you have a simple test case it would be helpful.
Are you multithreading in the application, and then calling DGEMM? Or is the application running a single master thread that expects DGEMM to multithread?
How many threads?
I am actually using a mix of 4 ACML functions. The application is single threaded and takes advantage of the threaded ACML. I use anywhere between 8 and 64 threads, depending on the size of the matrices. The system is a 64-core Interlagos and can actually support 64 threads.
I'll try to reproduce it on a simple example, but until then here is the succession of ACML calls. Please let me know if you see a smoking gun. dsymm operates on large NxN matrices, dgemm on smaller 6xN ones.
N = 5370
while not_converged:
    dsyrk (NxN)
    dsymv (NxN, N)
Thanks a lot,
I have started a simple test case using gfortran. DGEMM seems to work as expected; I'll now add cases for dsymm, and also for the ifort compiler.
One thing you can try is the ACML_FAST_MALLOC_DEBUG environment variable, as documented in the user guide. You might also try reducing the number of threads, both to reduce the volume of messages and to see whether that affects the problem.
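For reference, a minimal shell sketch of that suggestion. The binary name ./myapp is hypothetical, and enabling the debug variable by setting it to 1 is an assumption; check the user guide for the exact value it expects.

```shell
# Assumption: setting the variable to 1 enables the debug messages
# (consult the ACML user guide for the exact value it expects).
export ACML_FAST_MALLOC_DEBUG=1

# Fewer threads means fewer debug messages to wade through.
export OMP_NUM_THREADS=8

# Hypothetical application binary; capture the debug output for inspection.
# ./myapp > temp 2>&1
```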
Maybe this will tell us something useful.
I'm assuming this is a linux environment?
It turns out this happens with the call to dsyrk, which in turn calls DGEMM. I can demonstrate a problem with memory allocation; it just has to be debugged now.
It seems that with the problem sizes you are using, dsyrk calls dgemm in a way that sometimes makes the fast alloc mechanism effective and sometimes not, and this appears to be causing the leak.
There is a variable, ACML_FAST_MALLOC_CHUNK_SIZE, that controls how much memory is allowed for an allocation that will be retained. It's set at 10MB, which is too small for this case. I was using N=6000, and I was able to work around the issue effectively by setting ACML_FAST_MALLOC_CHUNK_SIZE to 35000000.
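In shell terms, the workaround is just one export before launching the application (35000000 being the rounded-up value that covered the largest request observed for N=6000):

```shell
# Raise the retained-allocation cap from the 10MB default to ~35MB;
# the value comes from the largest allocation request seen in the
# debug output, rounded up a bit.
export ACML_FAST_MALLOC_CHUNK_SIZE=35000000
```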
I chose the size by grabbing the debug output into a file called temp, then found the largest allocation requested using:
grep "new malloc size" temp | sed -e"s/^.*size //" | sort -n
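To illustrate on synthetic data (the debug line format below is a stand-in, not the exact ACML output), appending tail -1 picks out the largest request directly:

```shell
# Fake debug output standing in for the real capture; the exact ACML
# message format may differ, but the pipeline only needs "size NNN".
printf '%s\n' \
  'thread 0 new malloc size 10485760' \
  'thread 1 new malloc size 34560000' \
  'thread 2 new malloc size 524288' > temp

# Same pipeline as above, with tail -1 to keep only the maximum.
grep "new malloc size" temp | sed -e"s/^.*size //" | sort -n | tail -1
# prints 34560000
```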
I just rounded that largest number up a bit, and then memory usage no longer showed a leak. For 32 threads, memory use capped at 1.8G.
This doesn't resolve the actual bug, but it is an effective workaround. With the smaller default chunk size, the fast malloc is essentially not working anyway, so there would be no performance benefit even if the bug were fixed.