What exactly does the "ALU" field in the OpenCL Profiler mean?
For example, is this the number of raw ALU operations or the number of 5-wide ALU operations?
What I mean is, if you have 5 ADDs, one in each stream core of one VLIW processor, does that count as 5 ALUs or just 1?
Thank you.
It is the number of 5-wide ALU ops processed. So it will report 1 in your example.
Ok, I was just curious because I have an app where, after some transformation, the ALU count increases but so does the performance. I assumed this was because the ALU packing also increased. Thank you.
Originally posted by: bpurnomo It is the number of 5-wide ALU ops processed. So it will report 1 in your example.
bpurnomo,
I think I misunderstood you the first time, so if the ALU increases that means that the number of VLIW instructions has increased (not necessarily the number of actual ALU ops), correct?
Like I said before, I'm curious where this performance increase has come from, because both the fetch and ALU counts have increased; however, so has the ALU Packing.
I think I misunderstood you the first time, so if the ALU increases that means that the number of VLIW instructions has increased (not necessarily the number of actual ALU ops), correct?
If both the ALU and ALUPacking increase, then the number of actual ALU ops (scalar ops) must increase too. To calculate the scalar ops, use the following formula: ALU scalar ops = 5 * ALU * ALUPacking/100.
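As a quick sketch of that formula in Python (the function name and the sample counter values below are made up for illustration; they are not taken from the profiler output or from the kernels discussed in this thread):

```python
def scalar_alu_ops(alu, alu_packing_pct, vliw_width=5):
    """Estimate scalar ALU ops from the profiler's ALU and ALUPacking counters.

    alu             -- the "ALU" counter (number of 5-wide VLIW ALU instructions)
    alu_packing_pct -- the "ALUPacking" counter, as a percentage (0..100)
    vliw_width      -- slots per VLIW instruction (5, per the formula above)
    """
    return vliw_width * alu * alu_packing_pct / 100.0


# The example from the first post: 5 ADDs packed into one VLIW instruction
# is ALU = 1 with 100% packing, i.e. 5 scalar ops.
print(scalar_alu_ops(alu=1, alu_packing_pct=100))

# Hypothetical "before" and "after" counters where both ALU and ALUPacking
# go up, so the scalar op count goes up as well.
before = scalar_alu_ops(alu=100, alu_packing_pct=40)
after = scalar_alu_ops(alu=110, alu_packing_pct=50)
print(before, after)
```

With both counters increasing, the estimated scalar op count necessarily increases too, which is the point made above.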
Like I said before, I'm curious where this performance increase has come from, because both the fetch and ALU counts have increased; however, so has the ALU Packing.
Hard to say without access to the code. Performance is affected by many factors.
Interesting. Well, everything else looks to be the same. The same number of GPRs, fetches, etc...
Obviously, the ALU Busy has increased too.
I'm not really sure where the performance gain comes from (~20% on large sizes, like 1024x1024).
The only other thing I know is that the CF decreases, but it would seem from another post by Micah that you only need ~5 WFs to hide all CF, and there are 16 WFs (as reported by the profiler).
If anyone has any idea I'd be glad to hear it.
So here's a very related question: can control flow latency be hidden in a fetch bound kernel?
So basically, I'm asking this: if the kernel is fetch bound and control flow is reduced (the ALU count is increased, but the kernel is still fetch bound), will more control flow latency be hidden, since there will be less clause switching?
Yes, I believe that you have mentioned the dummy registers before and cache thrashing.
I've seen some of this in some benchmarks where a decrease in register pressure actually decreases performance, particularly in the 5800s in CS mode.
However, like I said before, the GPR count remains the same for both kernels; the only differences are the ones I have mentioned. Both kernels run the same number of WFs (16), so I don't think cache thrashing is an issue here.
The algorithm has a 1.06 ALU Busy, which to me makes it seem like even with 16 WFs running the kernel is still probably fetch bound.
So what I'm asking is this: if the kernel is fetch bound at run time (despite the number of simultaneous WFs, meaning there aren't enough WFs to hide all the fetch latency), is it possible that a reduction in control flow can increase performance?
This makes sense to me but I was hoping to get some verification.
Micah,
I could IM or email you the code if you like; I'm not going to post it here. I'm actually somewhat confused looking at the profiling:
ALU, ALU Packing, ALU Busy, ALUFetchRatio and ALUStalledByLDS are the only profiler parameters that change. They all increase from Kernel 1 to Kernel 2, and the performance of Kernel 2 is about 20% better. There is less CF in Kernel 2.
What I can also say is that the improvement grows with the thread size, from ~9% at a small size (256) to ~20% at a larger one (1024).