Ah, I think I understand now how one single instruction could be taking so much time.
Since I'm writing to random addresses, most of the writes probably miss the cache and have to wait on main memory.
I'm a beginner at profiling, so this is only my best guess so far.
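If that is the case, the slow instruction is probably the store at
0x1141F69, mov [edx+4],ebp, and not inc ebp at 0x1141F6C.
Sampling profilers often charge a stall to the instruction right after the one that caused it, which would explain why the time shows up on inc ebp.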