Server Processors

ripthis · 04-11-2021

Greetings. I have been testing the performance of the new chip and have found significant slowdowns relative to the Intel 7700K I was using previously.

Take this example code. There is a 2D Boolean map called "finds", which has already been computed by searching a 2D array "data" for values. The number of true elements ("Nfinds") in the "finds" map is much smaller than the number of elements in the map itself (which has the same dimensions as "data"): less than 1%.

Please now note the two commented-out lines giving the #pragma directives. If uncommented, the compiler will use OpenMP to parallelize the outer for loop. Inside the loop, if "finds" is true at an element, then some data needs to be stored and an indexer must be incremented. Note the "critical" directive, which enforces that this section of code is executed by only one thread at a time. Of course, this breaks complete parallelism; however, given that the frequency of true elements in "finds" is low, it is rare that threads would need to wait for each other to get through this section. On the Intel chip this wasn't a problem, and for 5k x 5k arrays (25 megapixels) this code completed essentially instantaneously when parallelized.

However, on the AMD chip this code section requires 21 seconds to complete. If I comment out the OpenMP directives, as shown in the code below, then this code executes in less than 1 second on a single thread.

Other parallelized code runs quickly, as you would expect for a 48-thread system, and outperforms my old Intel 7700K with 8 threads. The Intel chip ran the parallelized code section below just fine, but on the AMD system it is really quite terrible. The AMD chip doesn't seem to "handle the logic" as well, or whatever. The "critical" section would be entered for less than 1% of the elements in the test data, so threads really shouldn't get in each other's way. Even if they do, on the Intel system this code runs much faster parallelized than single-threaded and completes "instantaneously", whereas the AMD system requires 21 seconds when parallelized.

The code is C++/CLI (Visual C++ for .NET), compiled in VS2019.

array<int, 2>^ result = gcnew array<int, 2>(Nfinds, 2);
Nfinds = 0;
//#pragma omp parallel for
for (int i = 0; i < data->GetLength(0); i++)
	for (int j = 0; j < data->GetLength(1); j++)
		if (finds[i, j])
		{
			//#pragma omp critical
			{
				result[Nfinds, 0] = i;//x
				result[Nfinds, 1] = j;//y
				Nfinds++;
			}
		}

Anonymous · 04-21-2021

Hi @ripthis, I don't have a 3960X or 7700K, but I tried your code on an AMD 3995WX and a dual Intel 8280 system (not exactly the same workload, since I don't have the values in finds[i, j]). Comparing 48-thread performance with single-thread performance, I saw about a 2X slowdown on the AMD 3990X and a 3X slowdown on the dual Intel 8280.

I think the slowdown is reasonable, given that "if (finds[i, j])" does little work besides fetching finds[i, j]. I understand the if statement is true for only 1% of the iterations, but the critical section can still turn the loop into a thread synchronization workload if the other 99% of the iterations have minimal work to do.
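One way to take the synchronization out of the hot loop, which is not something proposed in this thread, is to let each thread collect its hits in a private buffer and merge the buffers once at the end. The following is only a minimal sketch in plain native C++ rather than C++/CLI: the function name collect_finds_local, the row-major std::vector<char> layout for the map, and the final sort are all assumptions made for illustration, and whether this actually helps on a given machine would need to be measured.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper (not from the original post): returns the (row, col)
// index of every true element of a row-major Boolean map.
std::vector<std::pair<int, int>> collect_finds_local(const std::vector<char>& finds,
                                                     int rows, int cols)
{
    assert(rows >= 0 && cols >= 0);
    assert(finds.size() == static_cast<std::size_t>(rows) * static_cast<std::size_t>(cols));

    std::vector<std::pair<int, int>> result;

    #pragma omp parallel
    {
        // Each thread appends to its own buffer, so the scan needs no locking.
        std::vector<std::pair<int, int>> local;

        #pragma omp for nowait
        for (int i = 0; i < rows; i++)
            for (int j = 0; j < cols; j++)
                if (finds[static_cast<std::size_t>(i) * static_cast<std::size_t>(cols) + j])
                    local.emplace_back(i, j);

        // The critical section is entered once per thread, not once per hit.
        #pragma omp critical
        result.insert(result.end(), local.begin(), local.end());
    }

    // Threads merge in arbitrary order; sort to get a deterministic result.
    std::sort(result.begin(), result.end());
    return result;
}
```

If the compiler is not run with OpenMP enabled, the pragmas are ignored and the function simply runs on one thread, which makes it easy to compare the two cases.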

I'd suggest using the OpenMP clause num_threads() to run this loop with a lower number of threads; that may help with the performance.
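As a rough illustration of the num_threads() suggestion, here is a minimal sketch in plain native C++ rather than C++/CLI. The function name collect_finds, the threads parameter, the row-major std::vector<char> layout for the map, and the final sort are assumptions made for the example and do not come from the original code; the best thread count would have to be found by timing on the actual machine.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper (not from the original post): returns the (row, col)
// index of every true element of a row-major Boolean map, using at most
// `threads` OpenMP threads for the outer loop.
std::vector<std::pair<int, int>> collect_finds(const std::vector<char>& finds,
                                               int rows, int cols, int threads)
{
    assert(rows >= 0 && cols >= 0 && threads >= 1);
    assert(finds.size() == static_cast<std::size_t>(rows) * static_cast<std::size_t>(cols));

    std::vector<std::pair<int, int>> result;

    // num_threads() limits the team size for this one loop only; other
    // parallel regions in the program keep their default thread count.
    #pragma omp parallel for num_threads(threads)
    for (int i = 0; i < rows; i++)
        for (int j = 0; j < cols; j++)
            if (finds[static_cast<std::size_t>(i) * static_cast<std::size_t>(cols) + j])
            {
                // Serialized append, mirroring the critical section in the
                // original post.
                #pragma omp critical
                {
                    result.emplace_back(i, j);
                }
            }

    // Threads append in arbitrary order; sort to get a deterministic result.
    std::sort(result.begin(), result.end());
    return result;
}
```

Calling collect_finds(finds, rows, cols, 4) and then repeating with other thread counts while timing each call would show where the synchronization cost starts to outweigh the benefit of extra threads.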

Threadripper 3960X extremely poor performance with OpenMP parallelization