
philips
Journeyman III

Stream Profiler interpretation

Can you see where my problem is?

Hi.

The code I ported from CUDA is very slow on an ATI 5870, slower than on an NVIDIA GTX 280.

Unfortunately I'm not familiar with ATI hardware, so I am not sure how to improve it.

I have now done a test run with the Stream Profiler. I have attached two lines from the profiler output.

The algorithm works in two passes. The first kernel launch is less complex than the second one. The two lines are the two passes from a representative iteration of the algorithm.

I was hoping you might be able to explain where the problem is.

Method  ExecutionOrder  GlobalWorkSize  GroupWorkSize  Time  LDSSize  DataTransferSize  GPRs  ScratchRegs  FCStacks  Wavefronts  ALUInsts  FetchInsts  WriteInsts  LDSFetchInsts  LDSWriteInsts  ALUBusy  ALUFetchRatio  ALUPacking  FetchSize  CacheHit  FetchUnitBusy  FetchUnitStalled  WriteUnitStalled  FastPath  CompletePath  PathUtilization  ALUStalledByLDS  LDSBankConflict
renderKernel_07F0BF28 9795 {  24832       2       1}  {   32     2     1} 1,886153072 52057761680,8938,32124,997,1516,3543,8638,811304,52,711,30,020363,7501002,440,53
renderKernel_07F0BF68 9800 { 393216       2       1}  {   32     2     1} 16,702523072 5905122881434,5339,37133,94824,8436,4441,6237614,191,072,980,1203084,2501003,671,3


 

 

Thank you for reading

0 Likes
8 Replies
n0thing
Journeyman III

Is your code vectorized?

Your ALU packing efficiency is low, which hints that vectorizing your code will improve your performance, if it isn't vectorized already.
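
To give an idea of what better packing looks like, here is a minimal sketch (the buffer names and the operation are made up, not taken from your kernel). The scalar version gives each VLIW instruction only one useful operation per work item, while the float4 version hands the compiler four independent lanes to pack into each instruction.

// Scalar version: one float per work item, so most ALU slots go unused.
__kernel void scale_scalar(__global const float* src,
                           __global float* dst,
                           float k)
{
    size_t i = get_global_id(0);
    dst[i] = src[i] * k + 1.0f;
}

// Vectorized version: a float4 per work item gives the compiler
// four independent operations to pack into each VLIW instruction.
__kernel void scale_vec4(__global const float4* src,
                         __global float4* dst,
                         float k)
{
    size_t i = get_global_id(0);
    dst[i] = src[i] * k + 1.0f;
}

The float4 version is launched with a quarter of the work items, since each one now covers four elements.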

 

0 Likes

In the first kernel, 16% ALUBusy implies to me that you have lots of IF statements and/or loops where each work item follows a different path from its neighbours.

The GPR count of 52 means 4 hardware threads can be supported per SIMD core.
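
(If I remember the numbers right, and you should check the programming guide rather than trust my memory: each SIMD lane has roughly 256 registers available, shared between the resident hardware threads, so 256 / 52 is about 4.9, which rounds down to 4.)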

Together these two things imply to me that the ALUs are mostly idle because the GPU spends most time working out which control flow path to take.

In ATI hardware, control flow incurs additional latency: it takes 40 cycles of extra latency for each branch point (so an if-then-else has 3 branch points and a basic for loop has 2). This latency cannot be hidden when there are only 4 hardware threads on the SIMD, because the maximum control-flow latency that 4 hardware threads can hide is only 32 cycles. Also, if your kernel has lots of branching, it is likely that there are only, say, 5 cycles of latency hiding per hardware thread between branch points.
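
As a concrete illustration of trading a branch point for straight-line ALU work (made-up helper and variable names, not your code):

// Branchy form: the if-then-else is a control-flow clause and
// pays the per-branch-point latency described above.
float pick_branchy(float t, float t_max, float hit, float miss)
{
    if (t < t_max) return hit;
    else           return miss;
}

// Branch-free form: select() compiles to plain ALU instructions,
// so there is no clause switch at all.
float pick_select(float t, float t_max, float hit, float miss)
{
    return select(miss, hit, t < t_max);
}

The compiler may already do this for trivial cases, but it shows the kind of rewrite that takes branch points out of an inner loop.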

0 Likes

Where is all this documented? Is there a writeup on how to interpret the Stream KernelAnalyzer results? Or do I have to dig through hardware specs to learn that 52 GPRs mean 4 hardware threads per SIMD core (are there 128 registers per core?)? Where can I read up on the whole latency hiding business?

0 Likes

Chapter 4 of the Stream SDK OpenCL Programming Guide rev 1.05, the most recently published version.

0 Likes

Thank you.

I don't quite understand all that yet, but it helps.

 

Unfortunately the code is not vectorized, and I have neither the time nor the skill to do so. How much speed can you gain by vectorizing on the GPU?

Since it's a raycaster, I think it would be rather complicated to vectorize, and you would have to handle a lot of those SIMD things manually (e.g. when one ray is finished and the other three are not).

Is there anything specific to the ATI architecture that would make it less suited for this kernel (just looking at the profiler info)? I mean besides the 64-wide SIMD and the vector thing.

0 Likes

Since a SIMD core is, in itself, a vectorised processing unit, you are already dealing with the issues of control flow incoherence.

There are loads of projects, published papers and discussions of GPU-based ray-tracing techniques out there.

Since you don't have time, I suggest you abandon the ATI implementation and just focus on the NVidia version.

0 Likes

Originally posted by: Jawed Since a SIMD core is, in itself, a vectorised processing unit, you are already dealing with the issues of control flow incoherence.

Sure, it's already SIMD, but if I wanted to vectorize it, I would have to do all this SIMD stuff manually, wouldn't I? (e.g. manually handle the branching when one of the four components doesn't follow the same path)
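
I imagine it would look something like this (just my sketch with made-up names and constants, so take it with a grain of salt): carry a per-lane mask and keep marching until every lane is finished, so finished rays simply stop updating.

// Four rays per work item, retired individually via an active mask.
__kernel void march4(__global float4* out_t)
{
    const float  step_size = 0.01f;            // made-up march step
    const float4 t_max     = (float4)(100.0f); // made-up termination distance

    float4 t      = (float4)(0.0f);   // per-ray parameter
    int4   active = (int4)(-1);       // all four lanes start active

    while (any(active)) {
        float4 t_next = t + step_size;                  // advance every lane
        int4   done   = isgreaterequal(t_next, t_max);  // lanes that just finished

        t      = select(t, t_next, active);             // only active lanes update
        active = active & ~done;                        // retire finished lanes
    }

    out_t[get_global_id(0)] = t;
}

That bookkeeping is roughly what the hardware already does per wavefront, just repeated inside the kernel, so I'm not sure it would buy much here.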

 

You are probably right about focusing on something else. The goal was to assess how well the algorithm is suited to different architectures (NVIDIA, ATI, CPU). It's unfortunate that I can't really say that much about the ATI and CPU performance without vectorization.

Maybe those two profiler lines above are enough to get a picture of how well the algorithm works on an ATI GPU. I would imagine vectorization would not really help with the latency hiding and the ALUs being mostly idle.

0 Likes

Agreed, it's unlikely vectorisation of the existing kernel would help.

The problem is too much control flow.

Perhaps you might try with only 16 work items per work group. For your amusement you might also want to try 1 and 4 work items per work group.
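
In case it helps, a host-side sketch of what that looks like (plain OpenCL 1.x C; the 24832x2 global size is taken from your profiler line, the handle names and local sizes are assumptions, and this only makes sense if the kernel doesn't bake the 32x2 group shape into its local-memory logic):

#include <CL/cl.h>

/* queue and renderKernel are assumed to have been created already. */
cl_int launch_pass(cl_command_queue queue, cl_kernel renderKernel)
{
    size_t global[2] = { 24832, 2 };   /* global work size from the profiler line */
    size_t local[2]  = { 16, 1 };      /* was {32, 2}; also try {4, 1} or {1, 1} */

    return clEnqueueNDRangeKernel(queue, renderKernel, 2,
                                  NULL,            /* no global offset */
                                  global, local,
                                  0, NULL, NULL);  /* no wait list, no event out */
}

Keep the global size divisible by the local size in each dimension, otherwise the enqueue fails on OpenCL 1.x.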

0 Likes