So far I agree, but I have broken it down to this:
Running without the -fopenmp option, my code takes 70 s.
Running it with -fopenmp and 1 thread, it takes 335 s. So this has nothing to do with AVX or anything similar; it is just a mess. It only appears with AOCC (4.0.0). GCC (11.3) works fine.