I can find the global memory latency in the manual.
But I can't find the private memory latency there. NVIDIA's manual gives the latency as "zero cycles, or about 24 cycles for a read-after-write dependency".
What about AMD OpenCL?
There's no such latency on AMD. In every cycle it can read 3 regs and write 1 reg.
The only penalty I know of occurs when a vector instruction that writes to a scalar reg is followed by a scalar instruction. That can cost 1 cycle, but the compiler will avoid it anyway.
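
If you want to check this on your own device, a common approach is to time a dependent chain of multiply-adds against several independent chains. Below is a minimal host-side C sketch of the two access patterns only. The function names are hypothetical, and the timing and the port into an OpenCL kernel are left out. Each multiply-add reads 3 values and writes 1, which matches the per-cycle register access described above.

```c
#include <assert.h>

/* Dependent chain: each multiply-add reads the value written by the
   previous one, so a read-after-write latency (if the hardware had one)
   would be fully exposed here. */
float dependent_chain(float x, float a, float b, int n)
{
    assert(n >= 0);
    for (int i = 0; i < n; ++i)
        x = x * a + b;          /* reads x, a, b; writes x */
    return x;
}

/* Independent chains: four accumulators that never read each other's
   results, so consecutive instructions have no read-after-write
   dependency between them. */
void independent_chains(float x[4], float a, float b, int n)
{
    assert(n >= 0);
    for (int i = 0; i < n; ++i) {
        x[0] = x[0] * a + b;
        x[1] = x[1] * a + b;
        x[2] = x[2] * a + b;
        x[3] = x[3] * a + b;
    }
}
```

If both patterns run at the same rate per multiply-add inside a kernel, that would be consistent with there being no register read-after-write latency on that device.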