The problem is somewhat solved... It looks like the GPU was being flooded with draw commands.
However, after solving this, another problem arose with glMapBufferRange. Of course I provide GL_MAP_UNSYNCHRONIZED_BIT, and when the buffer gets full I map with GL_MAP_INVALIDATE_BUFFER_BIT (i.e., I orphan it).
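Concretely, the pattern looks something like this (a minimal sketch; streamVB, CAPACITY and appendBatch are just illustrative placeholders, not my actual code):

```cpp
#include <GL/glew.h>
#include <cstring>

// Sketch of the unsynchronized-map + orphan-on-wrap streaming pattern.
// streamVB, CAPACITY and appendBatch() are placeholders for illustration.
static GLuint     streamVB;                   // streaming vertex buffer
static GLsizeiptr writeOffset = 0;            // current write position
static const GLsizeiptr CAPACITY = 4 << 20;   // 4 MB, arbitrary size

void appendBatch(const void* data, GLsizeiptr batchSize)
{
    GLbitfield flags = GL_MAP_WRITE_BIT | GL_MAP_UNSYNCHRONIZED_BIT;
    if (writeOffset + batchSize > CAPACITY) {
        // Buffer is full: orphan it, so the driver can hand us fresh
        // storage while the GPU keeps reading the old allocation.
        flags = GL_MAP_WRITE_BIT | GL_MAP_INVALIDATE_BUFFER_BIT;
        writeOffset = 0;
    }
    glBindBuffer(GL_ARRAY_BUFFER, streamVB);
    void* ptr = glMapBufferRange(GL_ARRAY_BUFFER, writeOffset, batchSize, flags);
    memcpy(ptr, data, (size_t)batchSize);
    glUnmapBuffer(GL_ARRAY_BUFFER);
    writeOffset += batchSize;
}
```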
For a few seconds the performance is OK, but then it suddenly drops to 4 fps, the culprit being glMapBufferRange (as if it were waiting for something). Perhaps the multiple-buffer/fence-sync approach would be preferable... :/
I'm marking this answer as correct, but I didn't really "solve" the problem...
The GPU being flooded was "my fault" (or, more precisely, a consequence of the difference between DX and GL implementation logic).
The other problem I mentioned comes down to the driver ignoring certain GL flags (like GL_MAP_UNSYNCHRONIZED_BIT). Orphaning only made things worse.
So I did the following: I mapped the buffer with the unsynchronized flag, and when it got full I simply used glFenceSync + glWaitSync. With this approach the performance is still slower than DX9, but at least it doesn't drop...
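Roughly, the workaround looks like this (a sketch reusing the placeholder names from above; note I'm showing glClientWaitSync for the CPU-side wait here, which may differ from the exact call in my code):

```cpp
#include <GL/glew.h>
#include <cstdint>
#include <cstring>

// Sketch of the fence workaround: instead of orphaning on wrap, block
// until the GPU has finished reading the buffer, then reuse it.
static GLsync streamFence = 0;

void appendBatchFenced(const void* data, GLsizeiptr batchSize)
{
    if (writeOffset + batchSize > CAPACITY) {
        if (streamFence) {
            // Wait (client side) until every draw that reads the buffer
            // has completed, then recycle it from the start.
            glClientWaitSync(streamFence, GL_SYNC_FLUSH_COMMANDS_BIT, UINT64_MAX);
            glDeleteSync(streamFence);
            streamFence = 0;
        }
        writeOffset = 0;
    }
    glBindBuffer(GL_ARRAY_BUFFER, streamVB);
    void* ptr = glMapBufferRange(GL_ARRAY_BUFFER, writeOffset, batchSize,
                                 GL_MAP_WRITE_BIT | GL_MAP_UNSYNCHRONIZED_BIT);
    memcpy(ptr, data, (size_t)batchSize);
    glUnmapBuffer(GL_ARRAY_BUFFER);
    writeOffset += batchSize;
}

// Call after submitting the draws that read from streamVB:
void markBufferInFlight()
{
    streamFence = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
}
```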
Another thought (and I'm going to open another discussion): uniform buffer objects are INCREDIBLY SLOW (and I really mean it).
What is my reasoning? As I started to refactor the 3D part of the application, I noticed that using uniform buffers with _every_ possible optimization still yields performance roughly ten times slower than the original GL 2.1 glUniformXXX approach. This is ridiculous. No matter the round-robin approach, no matter the fences, the driver _still_blocks_in_that_<flower>_glMapBufferRange_ ...
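For context, the round-robin scheme I'm talking about looks roughly like this (a sketch; the triple-buffer count, binding point and names are my placeholders):

```cpp
#include <GL/glew.h>
#include <cstdint>
#include <cstring>

// Sketch of a round-robin (triple-buffered) UBO update with fences;
// NUM_UBOS, the binding point and the names are placeholders.
static const int NUM_UBOS = 3;
static GLuint ubos[NUM_UBOS];
static GLsync uboFences[NUM_UBOS] = {};
static int    currentUBO = 0;

void updateUniforms(const void* data, GLsizeiptr size)
{
    // Make sure the GPU is done with the buffer we are about to rewrite.
    if (uboFences[currentUBO]) {
        glClientWaitSync(uboFences[currentUBO], GL_SYNC_FLUSH_COMMANDS_BIT, UINT64_MAX);
        glDeleteSync(uboFences[currentUBO]);
        uboFences[currentUBO] = 0;
    }

    glBindBuffer(GL_UNIFORM_BUFFER, ubos[currentUBO]);
    void* ptr = glMapBufferRange(GL_UNIFORM_BUFFER, 0, size,
                                 GL_MAP_WRITE_BIT | GL_MAP_UNSYNCHRONIZED_BIT);
    memcpy(ptr, data, (size_t)size);
    glUnmapBuffer(GL_UNIFORM_BUFFER);

    glBindBufferBase(GL_UNIFORM_BUFFER, 0, ubos[currentUBO]);
}

// Call after the draws that consume the UBO:
void advanceUBO()
{
    uboFences[currentUBO] = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
    currentUBO = (currentUBO + 1) % NUM_UBOS;
}
```

Even with three buffers in flight and the fences in place, the map call still stalls for me.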