The APP guide contains this code snippet:
if (A > B) {
    C += D;
} else {
    C -= D;
}
It then explains:
In the first block of code, this translates into an IF/ELSE/ENDIF sequence of
conditional code, each taking ~8 cycles. If divergent, this code executes in
~36 clocks; otherwise, in ~28 clocks. A branch not taken costs four cycles
(one instruction slot); a branch taken adds four slots of latency to fetch
instructions from the instruction cache, for a total of 16 clocks. Since the
execution mask is saved, then modified, then restored for the branch, ~12
clocks are added when divergent, ~8 clocks when not.
However, I cannot arrive at either 36 or 28. How many cycles does each line take in each case?
Hi,
I'm curious: how did you get the clock cycle counts?
Thanks
It's from the AMD_Accelerated_Parallel_Processing_OpenCL_Programming_Guide, page 125.
I asked some people about it; I find it interesting too.
Any jump costs a minimum of 4 quad cycles (16 clocks) to fetch instructions from the instruction cache. In the if-else case there is a jump on each side of the branch, so you're guaranteed a 32-clock penalty if the code is divergent. Since there's an add on each side, that's another 8 clocks, for a total of 40 clocks, not counting the instructions that set up the conditional. If the code is non-divergent, you get only one jump and one add, for 20 clocks.
Courtesy: Jeff Golds
I think that's very clear. But the APP guide says 36 or 28. Is the guide wrong?
Unfortunately, yes. We will fix it in the guide soon.