Stream Profiler interpretation

Discussion created by philips on Aug 19, 2010
Latest reply on Aug 19, 2010 by Jawed
Can you see where my problem is?

Hi.

The code I ported from CUDA is very slow on an ATI 5870, slower even than on an NVIDIA GTX280.

Unfortunately, I'm not familiar with ATI hardware, so I am not sure how to improve it.

I have now done a test run with the Stream Profiler and have attached two lines from the profiler output.

The algorithm works in two passes. The first kernel launch is less complex than the second one. The two lines show both passes of one representative iteration of the algorithm.

I was hoping you might be able to explain where the problem is.

 

Method             renderKernel_07F0BF28    renderKernel_07F0BF68
ExecutionOrder     9795                     9800
GlobalWorkSize     {24832 2 1}              {393216 2 1}
GroupWorkSize      {32 2 1}                 {32 2 1}
Time               1,88615                  16,70252
LDSSize            3072                     3072
DataTransferSize
GPRs               52                       59
ScratchRegs        0                        0
FCStacks           5                        5
Wavefronts         776                      12288
ALUInsts           1680,89                  1434,53
FetchInsts         38,32                    39,37
WriteInsts         1                        1
LDSFetchInsts      24,99                    33,94
LDSWriteInsts      7,15                     8
ALUBusy            16,35                    24,84
ALUFetchRatio      43,86                    36,44
ALUPacking         38,81                    41,62
FetchSize          1304,5                   37614,19
CacheHit           2,71                     1,07
FetchUnitBusy      1,3                      2,98
FetchUnitStalled   0,02                     0,12
WriteUnitStalled   0                        0
FastPath           363,75                   3084,25
CompletePath       0                        0
PathUtilization    100                      100
ALUStalledByLDS    2,44                     3,67
LDSBankConflict    0,53                     1,3
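
For reading the GlobalWorkSize and GroupWorkSize columns, here is a minimal sketch in Python (not part of the kernel code) of how the launch sizes relate to the Wavefronts counter. It assumes a wavefront of 64 work-items on the 5870; the function name `launch_shape` is made up for illustration.

```python
from math import ceil, prod

# Assumption: the HD 5870 schedules work-items in wavefronts of 64.
WAVEFRONT_SIZE = 64


def launch_shape(global_size, group_size, wavefront_size=WAVEFRONT_SIZE):
    """Derive work-item, work-group and wavefront counts for one launch.

    global_size and group_size are the 3-tuples shown in the
    GlobalWorkSize and GroupWorkSize columns of the profiler output.
    """
    work_items = prod(global_size)
    items_per_group = prod(group_size)
    groups = work_items // items_per_group
    # A work-group smaller than one wavefront still occupies a full one.
    waves_per_group = ceil(items_per_group / wavefront_size)
    return {
        "work_items": work_items,
        "groups": groups,
        "waves_per_group": waves_per_group,
        "wavefronts": groups * waves_per_group,
    }


if __name__ == "__main__":
    first = launch_shape((24832, 2, 1), (32, 2, 1))
    second = launch_shape((393216, 2, 1), (32, 2, 1))
    print(first["wavefronts"], second["wavefronts"])
```

With a 32 x 2 x 1 work-group, each group fills exactly one wavefront under that assumption, so the wavefront count equals the work-group count for both passes.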


 

 

Thank you for reading.

 

 
