
philips
Journeyman III

Stream Profiler interpretation

Can you see where my problem is?

Hi.

The code I ported from CUDA is very slow on an ATI 5870, slower than on an NVIDIA GTX 280.

Unfortunately I'm not familiar with ATI hardware, so I am not sure how to improve it.

I have now done a test run with the Stream Profiler. I have attached two lines from the profiler output.

The algorithm works in two passes. The first kernel launch is less complex than the second one. The two lines are the two passes from a representative iteration of the algorithm.

I was hoping you might be able to explain where the problem is.

Method  ExecutionOrder  GlobalWorkSize  GroupWorkSize  Time  LDSSize  DataTransferSize  GPRs  ScratchRegs  FCStacks  Wavefronts  ALUInsts  FetchInsts  WriteInsts  LDSFetchInsts  LDSWriteInsts  ALUBusy  ALUFetchRatio  ALUPacking  FetchSize  CacheHit  FetchUnitBusy  FetchUnitStalled  WriteUnitStalled  FastPath  CompletePath  PathUtilization  ALUStalledByLDS  LDSBankConflict
renderKernel_07F0BF28 9795 {  24832       2       1}  {   32     2     1} 1,886153072 52057761680,8938,32124,997,1516,3543,8638,811304,52,711,30,020363,7501002,440,53
renderKernel_07F0BF68 9800 { 393216       2       1}  {   32     2     1} 16,702523072 5905122881434,5339,37133,94824,8436,4441,6237614,191,072,980,1203084,2501003,671,3


 

 

Thank you for reading

0 Likes
8 Replies
n0thing
Journeyman III

Is your code vectorized?

Your ALU packing efficiency is low, which hints that vectorizing your code will improve your performance, if it isn't vectorized already.
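
To give an idea of what better packing looks like, here is a minimal sketch (the buffer names and the operation are made up, not taken from your kernel). The scalar version gives each VLIW instruction only one useful operation per work item, while the float4 version hands the compiler four independent lanes to pack into each instruction.

// Scalar version: one float per work item, so most ALU slots go unused.
__kernel void scale_scalar(__global const float* src,
                           __global float* dst,
                           float k)
{
    size_t i = get_global_id(0);
    dst[i] = src[i] * k + 1.0f;
}

// Vectorized version: a float4 per work item gives the compiler
// four independent operations to pack into each VLIW instruction.
__kernel void scale_vec4(__global const float4* src,
                         __global float4* dst,
                         float k)
{
    size_t i = get_global_id(0);
    dst[i] = src[i] * k + 1.0f;
}

The float4 version is launched with a quarter of the work items, since each one now covers four elements.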

 

0 Likes

In the first kernel, 16% ALUBusy implies to me that you have lots of IF statements and/or loops where each work item follows a different path from its neighbours.

The GPR count of 52 means 4 hardware threads can be supported per SIMD core.
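
(If I remember the numbers right, and you should check the programming guide rather than trust my memory: each SIMD lane has roughly 256 registers available, shared between the resident hardware threads, so 256 / 52 is about 4.9, which rounds down to 4.)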

Together these two things imply to me that the ALUs are mostly idle because the GPU spends most time working out which control flow path to take.

In ATI hardware, control flow incurs additional latency: it takes 40 cycles of extra latency for each branch point (so an if-then-else has 3 branch points and a basic for loop has 2). This latency cannot be hidden when there are only 4 hardware threads on the SIMD, because the maximum control-flow latency that 4 hardware threads can hide is only 32 cycles. Also, if your kernel has lots of branching, it is likely that there are only, say, 5 cycles of latency hiding per hardware thread between branch points.
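
As a concrete illustration of trading a branch point for straight-line ALU work (made-up helper and variable names, not your code):

// Branchy form: the if-then-else is a control-flow clause and
// pays the per-branch-point latency described above.
float pick_branchy(float t, float t_max, float hit, float miss)
{
    if (t < t_max) return hit;
    else           return miss;
}

// Branch-free form: select() compiles to plain ALU instructions,
// so there is no clause switch at all.
float pick_select(float t, float t_max, float hit, float miss)
{
    return select(miss, hit, t < t_max);
}

The compiler may already do this for trivial cases, but it shows the kind of rewrite that takes branch points out of an inner loop.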

0 Likes

Where is all this documented? Is there a writeup on how to interpret the Stream KernelAnalyzer results? Or do I have to dig through hardware specs to learn that 52 GPRs mean 4 hardware threads per SIMD core (are there 128 registers per core?)? Where can I read up on the whole latency hiding business?

0 Likes

Chapter 4 of the Stream SDK OpenCL Programming Guide rev 1.05, the most recently published version.

0 Likes

Thank you.

I don't quite understand all that yet, but it helps.

 

Unfortunately the code is not vectorized, and I have neither the time nor the skill to do so. How much speed can you gain by vectorizing on the GPU?

Since it's a raycaster, I think it would be rather complicated to vectorize, and you would have to handle a lot of those SIMD things manually (e.g. when one ray is finished and the other three are not).

Is there anything specific to the ATI architecture that would make it less suited for this kernel (just looking at the profiler info)? I mean besides the 64-wide SIMD and the vector thing.

0 Likes

Since a SIMD core is, in itself, a vectorised processing unit, you are already dealing with the issues of control flow incoherence.

There are loads of projects, published papers and discussions of GPU-based ray-tracing techniques out there.

Since you don't have time, I suggest you abandon the ATI implementation and just focus on the NVidia version.

0 Likes

Originally posted by: Jawed Since a SIMD core is, in itself, a vectorised processing unit, you are already dealing with the issues of control flow incoherence.

Sure, it's already SIMD, but if I wanted to vectorize it, I would have to do all this SIMD stuff manually, wouldn't I? (e.g. manually handle the branching when one of the four components doesn't follow the same path)
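
I imagine it would look something like this (just my sketch with made-up names and constants, so take it with a grain of salt): carry a per-lane mask and keep marching until every lane is finished, so finished rays simply stop updating.

// Four rays per work item, retired individually via an active mask.
__kernel void march4(__global float4* out_t)
{
    const float  step_size = 0.01f;            // made-up march step
    const float4 t_max     = (float4)(100.0f); // made-up termination distance

    float4 t      = (float4)(0.0f);   // per-ray parameter
    int4   active = (int4)(-1);       // all four lanes start active

    while (any(active)) {
        float4 t_next = t + step_size;                  // advance every lane
        int4   done   = isgreaterequal(t_next, t_max);  // lanes that just finished

        t      = select(t, t_next, active);             // only active lanes update
        active = active & ~done;                        // retire finished lanes
    }

    out_t[get_global_id(0)] = t;
}

That bookkeeping is roughly what the hardware already does per wavefront, just repeated inside the kernel, so I'm not sure it would buy much here.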

 

You are probably right about focusing on something else. The goal was to assess how well the algorithm is suited to different architectures (NVIDIA, ATI, CPU). It's unfortunate that I can't really say that much about the ATI and CPU performance without vectorization.

Maybe those two profiler lines above are enough to get a picture of how well the algorithm works on an ATI GPU. I would imagine vectorization would not really help with the latency hiding and the ALUs being mostly idle.

0 Likes

Agreed, it's unlikely vectorisation of the existing kernel would help.

The problem is too much control flow.

Perhaps you might try with only 16 work items per work group. For your amusement you might also want to try 1 and 4 work items per work group.
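
In case it helps, a host-side sketch of what that looks like (plain OpenCL 1.x C; the 24832x2 global size is taken from your profiler line, the handle names and local sizes are assumptions, and this only makes sense if the kernel doesn't bake the 32x2 group shape into its local-memory logic):

#include <CL/cl.h>

/* queue and renderKernel are assumed to have been created already. */
cl_int launch_pass(cl_command_queue queue, cl_kernel renderKernel)
{
    size_t global[2] = { 24832, 2 };   /* global work size from the profiler line */
    size_t local[2]  = { 16, 1 };      /* was {32, 2}; also try {4, 1} or {1, 1} */

    return clEnqueueNDRangeKernel(queue, renderKernel, 2,
                                  NULL,            /* no global offset */
                                  global, local,
                                  0, NULL, NULL);  /* no wait list, no event out */
}

Keep the global size divisible by the local size in each dimension, otherwise the enqueue fails on OpenCL 1.x.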

0 Likes