Greetings. I have been testing the performance of the new chip and have found a significant performance regression relative to the Intel 7700K I was using previously.
Take this example code. There is a 2D Boolean map called "finds", which has already been computed by searching a 2D array "data" for values. The number of true elements ("Nfinds") in the "finds" map is much smaller than the number of elements in the map itself (which has the same dimensions as "data"): less than 1%.
Please note the two commented-out lines giving the #pragma directives. If uncommented, the compiler will use OpenMP to parallelize the outer for loop. Inside the loop, if "finds" is true at an element, then some data needs to be stored and an indexer must be incremented. Note the "critical" directive, which enforces that only one thread at a time executes this section of code. Of course, this breaks complete parallelism; however, given that true elements in "finds" are rare, threads should seldom need to wait for each other to get through this section. On the Intel chip this wasn't a problem, and for 5k x 5k arrays (25 megapixels) this code completed essentially instantaneously when parallelized.
However, on the AMD chip this code section requires 21 seconds to complete when parallelized. If I comment out the OpenMP directives, as shown in the code below, then it executes in less than 1 second on a single thread.
Other parallelized code runs quickly, as you would expect for a 48-thread system, and outperforms my old Intel 7700K with 8 threads. The Intel chip ran the code section below just fine as parallelized, but on the AMD system it is really quite terrible. The AMD chip doesn't seem to "handle the logic" as well, or whatever. The "critical" section is entered for less than 1% of the elements in the test data, so threads really shouldn't get in each other's way. Even if they do, on the Intel system this code runs much faster parallelized than single-threaded and completes "instantaneously", whereas the AMD system requires 21 seconds.
The code is C++/CLI (Visual C++ .NET), compiled in VS2019.
array<int, 2>^ result = gcnew array<int, 2>(Nfinds, 2);
Nfinds = 0;
//#pragma omp parallel for
for (int i = 0; i < data->GetLength(0); i++)
    for (int j = 0; j < data->GetLength(1); j++)
        if (finds[i, j])
        {
            //#pragma omp critical
            {
                result[Nfinds, 0] = i;//x
                result[Nfinds, 1] = j;//y
                Nfinds++;
            }
        }
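
For anyone who wants to try this outside of .NET, here is a minimal sketch of the same pattern in standard C++ with OpenMP. It is an illustration under assumptions, not the code that was timed: the BoolMap struct and the function names collect_critical, collect_per_thread and sorted_pairs are hypothetical placeholders, and a flat std::vector stands in for the managed arrays. The second function is one variation that could be compared against the original: each thread gathers its hits in a private buffer and enters the critical section only once at the end. Neither version has been benchmarked on either machine.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical stand-in for the managed arrays: a flat, row-major Boolean map.
struct BoolMap {
    int rows;
    int cols;
    std::vector<char> cells;  // 1 = true, 0 = false
    bool at(int i, int j) const {
        return cells[static_cast<std::size_t>(i) * cols + j] != 0;
    }
};

// Same pattern as the code above: parallel outer loop, with the shared
// index guarded by "critical". nfinds must equal the number of true cells.
// The order of the output pairs depends on thread scheduling.
std::vector<std::pair<int, int>> collect_critical(const BoolMap& finds,
                                                  std::size_t nfinds) {
    std::vector<std::pair<int, int>> result(nfinds);
    std::size_t next = 0;
    #pragma omp parallel for
    for (int i = 0; i < finds.rows; i++)
        for (int j = 0; j < finds.cols; j++)
            if (finds.at(i, j)) {
                #pragma omp critical
                {
                    result[next] = {i, j};
                    next++;
                }
            }
    return result;
}

// Variation for comparison: each thread fills a private buffer and
// enters the critical section once, to append that buffer to the result.
std::vector<std::pair<int, int>> collect_per_thread(const BoolMap& finds) {
    std::vector<std::pair<int, int>> result;
    #pragma omp parallel
    {
        std::vector<std::pair<int, int>> local;  // private to this thread
        #pragma omp for nowait
        for (int i = 0; i < finds.rows; i++)
            for (int j = 0; j < finds.cols; j++)
                if (finds.at(i, j)) local.emplace_back(i, j);
        #pragma omp critical
        result.insert(result.end(), local.begin(), local.end());
    }
    return result;
}

// Output order is not deterministic under OpenMP, so sort before comparing.
std::vector<std::pair<int, int>> sorted_pairs(std::vector<std::pair<int, int>> v) {
    std::sort(v.begin(), v.end());
    return v;
}
```

Both functions should return the same set of (i, j) pairs, just in a different order. Timing each one with the OpenMP flag on and off would show whether the slowdown follows the critical section or the parallel loop itself.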