Archives Discussions

ryta1203 · ‎03-24-2010

What exactly does the "ALU" field in the OpenCL Profiler mean?

For example, is this the number of raw ALU Operations OR is it the number of 5-wide ALU operations?

What I mean is if you have 5 ADDs, one in each stream core of one VLIW processor, is that 5 ALUs or just 1?

Thank you.

bpurnomo · ‎03-24-2010

It is the number of 5-wide ALU ops processed. So it will report 1 in your example.

ryta1203 · ‎03-24-2010

Ok, I was just curious because I have an app where after some transformation, the ALU count increases but so does the performance. I assumed this was the case because the ALU packing also increased. Thank you.

ryta1203 · ‎03-24-2010

Originally posted by: bpurnomo It is the number of 5-wide ALU ops processed. So it will report 1 in your example.

bpurnomo,

I think I misunderstood you the first time, so if the ALU increases that means that the number of VLIW instructions has increased (not necessarily the number of actual ALU ops), correct?

Like I said before, I'm curious where this performance increase has come from because both the fetch and ALU counts have increased; however, so has the ALU Packing.

bpurnomo · ‎03-24-2010

Originally posted by: ryta1203 I think I misunderstood you the first time, so if the ALU increases that means that the number of VLIW instructions has increased (not necessarily the number of actual ALU ops), correct?

If both the ALU and ALUPacking increase, then the number of actual ALU ops (scalar ops) must increase too. To calculate the scalar ops, use the following formula: ALU scalar ops = 5 * ALU * ALUPacking/100.
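As a rough illustration of the formula above, here is a minimal Python sketch. The function name and the counter values are hypothetical, chosen only to show the arithmetic; they are not taken from any real profile. It assumes ALUPacking is reported as a percentage, as the division by 100 in the formula suggests.

```python
def scalar_alu_ops(alu, alu_packing_percent, vliw_width=5):
    """Estimate scalar ALU ops from the profiler's ALU and ALUPacking counters.

    Implements: scalar ops = 5 * ALU * ALUPacking / 100
    (vliw_width=5 matches the 5-wide VLIW discussed in this thread).
    """
    return vliw_width * alu * alu_packing_percent / 100.0


# The question's example: one VLIW instruction with all 5 slots filled.
one_full_bundle = scalar_alu_ops(alu=1, alu_packing_percent=100)

# Hypothetical before/after counters: ALU and ALUPacking both go up,
# so the estimated scalar op count goes up as well.
kernel_1 = scalar_alu_ops(alu=100, alu_packing_percent=40)
kernel_2 = scalar_alu_ops(alu=120, alu_packing_percent=60)

print(one_full_bundle, kernel_1, kernel_2)
```

With these made-up numbers the ALU count rises by 20%, but the estimated scalar work rises by 80%, which is why the ALU count alone says little about how much arithmetic a kernel performs.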

Originally posted by: ryta1203 Like I said before, I'm curious where this performance increase has come from because both the fetch and ALU counts have increased; however, so has the ALU Packing.

Hard to say without access to the code. Performance is affected by many factors.

ryta1203 · ‎03-25-2010

Interesting. Well, everything else looks to be the same. The same number of GPRs, fetches, etc...

Obviously, the ALU Busy has increased too.

I'm not really sure why there is a performance gain (~20% on large sizes, like 1024x1024).

The only other thing I know is that the CF decreases, but it would seem from another post by Micah that you only need ~5 WFs to hide all CF, and there are 16 WFs (as reported by the profiler).

If anyone has any idea I'd be glad to hear it.

ryta1203 · ‎03-28-2010

So here's a very related question: can control flow latency be hidden in a fetch bound kernel?

So basically, I'm asking this: if the kernel is fetch bound and control flow is reduced (ALU has increased, but the kernel is still fetch bound), will more control flow latency be hidden since there will be less clause switching?

MicahVillmow · ‎03-28-2010

ryta,
In the case of a fetch bound kernel, the control flow latency would be hidden by having multiple wavefronts executed in parallel. This is also how you would hide the fetch latency. There are really two ways of dealing with control flow issues: decreasing the ratio of control flow to ALU instructions, or executing more wavefronts per SIMD. However, one of the problems with executing more wavefronts per SIMD is that you hit the memory in a different way, and that can cause bank conflicts or cache thrashing. In the SGEMM optimization slide set from the documentation page, we showed that increasing the number of wavefronts executed in parallel actually caused a fairly large performance hit, and how we got around that issue.

So back to your original question on possible performance improvements, the increase in SIMD utilization from a lower CF count could be a reason, but as Budi mentioned, it is hard to tell without looking at the ISA/Algorithm.

ryta1203 · ‎03-29-2010

Yes, I believe you have mentioned the dummy registers and cache thrashing before.

I've seen some of this in some benchmarks where a decrease in register pressure actually decreases performance, particularly in the 5800s in CS mode.

However, like I said before, the GPR count remains the same for both kernels; the only differences are the ones I have mentioned. Both kernels run the same number of WFs (16), so I don't think cache thrashing is an issue here.

The algorithm has a 1.06 ALU Busy, which to me makes it seem like even with 16 WFs running the kernel is still probably fetch bound.

So what I'm asking is this: if the kernel is fetch bound (AT RUN TIME, despite the number of simultaneous WFs, meaning there aren't enough WFs to hide all fetch latency), is it possible that a reduction in control flow can increase performance?

This makes sense to me but I was hoping to get some verification.

ryta1203 · ‎03-30-2010

Micah,

I could IM or email you the code if you like; I'm not going to post it here. I'm actually somewhat confused looking at the profiling:

ALU, ALU Packing, ALU Busy, ALUFetchRatio and ALUStalledByLDS are the only profiler parameters that change. They all increase from Kernel 1 to Kernel 2, and the performance of Kernel 2 is about 20% better. There is less CF in Kernel 2.

What I can also say is that the improvement grows with thread size, from ~9% at a small size (256) to ~20% at 1024.

MicahVillmow · ‎03-30-2010

ryta,
Please send the code via the streamdeveloper email alias and have them redirect it to me, and I'll take a look.


OpenCL Profiler ALU Field Question