We recently acquired a new machine with a Threadripper 3990X. I am upgrading a compute-intensive application that is used for building an OCR classifier. This application runs for several weeks at a time, using as many threads as are available on the computer.
As part of the upgrade, I identified and implemented some significant performance enhancements in the disk file access. The functions doing disk access now complete in less than a tenth of the time they took before: for a small run of the program across 63 threads, the total time spent in disk access dropped from 5300 seconds to 54 seconds.
However, all the gains in disk access are being offset by computation time in one function which does matrix multiplication. The function is pasted below.
Is there any limit to performing such floating-point computations on multiple threads simultaneously?
Earlier versions of the program conditionally use ASM routines depending on the CPU type. For this CPU, however, the program falls back to the C version shown below.
Thanks
Kimman
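/******************************************************************************/
/* Dot product of n elements: x is INT8 read with a stride of m elements,     */
/* y is contiguous FLOAT32. The sum is accumulated in FLOAT64.                */
/******************************************************************************/
FLOAT32 VLIB_ENTRY flt_off_dotp_i8xf32_0
    (const INT8 *x, const FLOAT32 *y, INT32 m, INT32 n)
{
    FLOAT64 sd;                 /* double-precision accumulator */

    sd = 0.0;
    while (n--) {
        /* widen both operands to double before multiplying */
        sd += (FLOAT64)(*x) * (FLOAT64)(*(y++));
        x += m;                 /* step x by its stride */
    }
    return((FLOAT32)sd);        /* narrow the result back to single precision */
}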
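
In case it helps anyone try this in isolation, here is a minimal self-contained sketch. The typedefs are only my assumption of what the project headers define, VLIB_ENTRY is left out, and the names dotp_ref and dotp_unroll4 are hypothetical. dotp_ref repeats the arithmetic of the function above. dotp_unroll4 is one idea I could test: it keeps four independent partial sums so that each addition does not have to wait for the previous one. I have not measured it on the 3990X, and it assumes n >= 0.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed typedefs; the real project headers may define these differently. */
typedef int8_t  INT8;
typedef int32_t INT32;
typedef float   FLOAT32;
typedef double  FLOAT64;

/* Reference: same arithmetic as the posted function (VLIB_ENTRY omitted). */
static FLOAT32 dotp_ref(const INT8 *x, const FLOAT32 *y, INT32 m, INT32 n)
{
    FLOAT64 sd = 0.0;

    while (n--) {
        sd += (FLOAT64)(*x) * (FLOAT64)(*(y++));
        x += m;
    }
    return (FLOAT32)sd;
}

/* Hypothetical variant: four independent accumulators, combined at the end.
 * Assumes n >= 0. The summation order differs from dotp_ref, so the last
 * bits of the result can differ for general inputs. */
static FLOAT32 dotp_unroll4(const INT8 *x, const FLOAT32 *y, INT32 m, INT32 n)
{
    FLOAT64 s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    const ptrdiff_t step = (ptrdiff_t)m;   /* stride of x, in elements */
    INT32 i = 0;

    /* Main loop: four products per iteration, one per accumulator. */
    for (; n - i >= 4; i += 4) {
        const INT8 *p = x + (ptrdiff_t)i * step;
        s0 += (FLOAT64)p[0]        * (FLOAT64)y[i];
        s1 += (FLOAT64)p[step]     * (FLOAT64)y[i + 1];
        s2 += (FLOAT64)p[2 * step] * (FLOAT64)y[i + 2];
        s3 += (FLOAT64)p[3 * step] * (FLOAT64)y[i + 3];
    }
    /* Tail: the remaining 0 to 3 elements. */
    for (; i < n; i++) {
        s0 += (FLOAT64)x[(ptrdiff_t)i * step] * (FLOAT64)y[i];
    }
    return (FLOAT32)((s0 + s1) + (s2 + s3));
}
```

Both functions take the same arguments as the original, so either can be dropped into a timing loop with the same x, y, m and n to compare single-thread and 63-thread runs.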