Archives Discussions

erman_amd · ‎07-13-2011

Hi,

I have a kernel with ALU:Fetch 4.13 in KernelAnalyzer (it is highlighted green for the Radeon HD 6450). For another GPU, it is highlighted red, with an ALU:Fetch value below 1.

What does this mean? Can anyone help explain it?

My other question: I compiled my kernel to assembly for the 5870. It looks like the listing below.

Where can I find information about the assembly code? I want to know the meaning of MEM_RAT_CACHELESS, VFETCH, TEX:, ALU:, etc.

Thanks

...
03 TEX: ADDR(178) CNT(1)
     23 VFETCH R0.__x_, R2.w fc173 MEGA(4)
        FETCH_TYPE(NO_INDEX_OFFSET)
04 ALU: ADDR(153) CNT(11)
     24 x: ASHR ___, R3.z, (0x00000017, 3.222986468e-44f).x
...
05 MEM_RAT_CACHELESS_STORE_RAW: RAT(11)[R2].x___, R0, ARRAY_SIZE(4) MARK VPM
END_OF_PROGRAM

maximmoroz · ‎07-13-2011

All the answers are here: AMD Accelerated Parallel Processing OpenCL Programming Guide

erman_amd · ‎07-13-2011

Thank you, maximmoroz.

But I cannot find why it is highlighted green or red, or what red and green mean. I only found the definition of ALU:Fetch.

For the assembly language, I found documentation on its format. I think that if I can understand the assembly, maybe I can understand why my kernel's performance is so poor.

maximmoroz · ‎07-13-2011

My guess is that the kernel is highlighted green if the number of ALU operations is most probably enough to cover global memory access latency on that particular GPU chip, and red if it is not. But I wouldn't rely on this color coding.

himanshu_gautam · ‎07-13-2011

Hi erman,

Generally it's good to have a high value for the ALU:Fetch ratio. A higher value (IMHO >10) ensures that the ALUs don't waste time waiting for data to arrive.

You should be able to find a lot in Chapter 4 of the OpenCL Programming Guide to understand why your kernel is slow. And you are always free to share it here; someone might give you some useful suggestions.

erman_amd · ‎07-13-2011

Hi,

I have a very simple kernel:

__kernel void getValFromTable(__global float *output, __global float *input, __global uint *index, uint row, uint col)
{
    uint x = get_global_id(0);
    // Gather: each work-item reads one table entry selected by index[x].
    output[x] = input[row * col + index[x]];
}

KernelAnalyzer reports: GPR = 2, ALU = 10, Fetch = 2, Write = 1, ALU:Fetch = 1.25.

I looked at the assembly code. It contains instructions like these:

VFETCH R0._x__, R0.z, fc173 MEGA(4)

FETCH_TYPE(NO_INDEX_OFFSET)

...

MEM_RAT_CACHELESS_STORE_RAW ...

I read the Global Memory Optimization section in Ch. 4 of the OpenCL guide.

VFETCH means vertex fetch; the load uses the L1 cache, and it also uses the FastPath. In the profiler, the PathUtilization counter is 100 and CacheHit is 86.64.

The question:

How does the memory read actually work?

output[x] = input[row * col + index[x]]

In this scenario, col is a constant (e.g. 512) and, for example, row = 2, index[0] = 1, index[1] = 1, index[2] = 123, ...

so

output[0] = input[1025]

output[1] = input[1025]

output[2] = input[1147]

...

My thought is that output[0] and output[1] read the same location, so there is a conflict. Is that correct? But the path utilization counter shows 100% (optimal). I'm not sure whether they are related. CMIIW.

Can anyone help explain how memory reads and writes work in this kernel?

Any comments and suggestions to improve the kernel are appreciated.
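The index arithmetic in the example above can be checked on the host with a short sketch. It is written in Python rather than OpenCL, the helper name gather_addresses is hypothetical, and it only reproduces the address computation row * col + index[x]. It does not model the cache, memory channels, or how the hardware handles two work-items reading the same address.

```python
from collections import Counter

def gather_addresses(index, row, col):
    """Return the input[] offset each work-item x would read: row * col + index[x]."""
    return [row * col + i for i in index]

# Values taken from the example in the post: col = 512, row = 2.
index = [1, 1, 123]
addresses = gather_addresses(index, row=2, col=512)
print(addresses)  # [1025, 1025, 1147]

# Offsets that more than one work-item reads (address -> number of readers).
shared = {addr: n for addr, n in Counter(addresses).items() if n > 1}
print(shared)  # {1025: 2}
```

Running the same check over the full index buffer would show how many work-items actually share an address, which is useful context when reading the CacheHit counter.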

maximmoroz · ‎07-14-2011

Again, read the programming guide. Your assumption about what path utilization means is wrong.

erman_amd · ‎07-14-2011

Thank you, maximmoroz.

About this part,

output[0] = input[1025]

output[1] = input[1025]

Are reads from the same address serialized or something?

maximmoroz · ‎07-14-2011

As far as I know, yes, reading from the same address in global memory (not local memory) means conflicts and serialized access.

But you have a rather high Cache Hit, 86.64%, which might be why these conflicts are rare. What are the global work size and the size of the buffer "input"?

erman_amd · ‎07-14-2011

The global work size is 65536. The size of input is also 65536.

ALU:Fetch in KernelAnalyzer