Ah, I think I understand now how one single instruction could be taking so much time.
Since I'm writing to random addresses, the cache is not utilized efficiently.
I'm a beginner at profiling, so this is only my best guess so far.
If that is the case, the problematic instruction should probably be the store at
0x1141F69 (mov [edx+4],ebp), and not the inc ebp at 0x1141F6C.
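
One way to check the cache guess outside the real program is a small sketch like the one below. It is not my actual code: the function names write_sequential and write_random and the xorshift generator are made up for illustration. Both functions perform the same number of 4-byte stores, one in address order and one at pseudo-random indices.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Small xorshift32 generator (illustrative choice, not from the original code). */
static uint32_t xorshift32(uint32_t *state) {
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    return x;
}

/* Stores i into buf[i] in address order; returns the number of stores. */
size_t write_sequential(uint32_t *buf, size_t n) {
    for (size_t i = 0; i < n; i++) {
        buf[i] = (uint32_t)i;
    }
    return n;
}

/* Performs n stores to pseudo-random indices of buf; returns the number of stores. */
size_t write_random(uint32_t *buf, size_t n, uint32_t seed) {
    assert(n > 0);                       /* the modulo below needs n != 0 */
    uint32_t state = seed ? seed : 1u;   /* xorshift state must be nonzero */
    for (size_t i = 0; i < n; i++) {
        size_t idx = xorshift32(&state) % n;
        buf[idx] = (uint32_t)i;
    }
    return n;
}
```

If the guess is right, timing both calls (for example with clock() from <time.h>) on a buffer much larger than the last-level cache should show write_random running noticeably slower, even though the loop bodies are almost the same.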
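From the snapshot, I assume you are running the "CPU: Time Based Sampling" profile. This profile type should be used for identifying hotspots; its samples are taken on a timer interrupt, so the reported address can land on an instruction just after the one that actually stalled.
To analyze instruction-level attribution, you need to run the "CPU: Instruction-based sampling" profile.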