9 Replies Latest reply on Jul 14, 2011 6:29 AM by erman_amd

    ALU:Fetch in KernelAnalyzer

    erman_amd

      Hi, 

      I have a kernel whose ALU:Fetch ratio is 4.13 in KernelAnalyzer (highlighted green for the Radeon HD 6450). For another GPU, the ALU:Fetch value is below 1 and it is highlighted red.

      What does this mean? Can anyone help explain it?

       

      My other question: I compiled my kernel to Radeon HD 5870 assembly. The output is listed at the end of this post.

      Where can I find information about the assembly code? I want to know the meaning of MEM_RAT_CACHELESS, VFETCH, TEX:, ALU:, etc.

       

      Thanks

      ...
      03 TEX: ADDR(178) CNT(1)
            23  VFETCH R0.__x_, R2.w, fc173  MEGA(4)
                FETCH_TYPE(NO_INDEX_OFFSET)
      04 ALU: ADDR(153) CNT(11)
            24  x: ASHR  ___, R3.z, (0x00000017, 3.222986468e-44f).x
      ...
      05 MEM_RAT_CACHELESS_STORE_RAW: RAT(11)[R2].x___, R0, ARRAY_SIZE(4) MARK VPM
      END_OF_PROGRAM

          • ALU:Fetch in KernelAnalyzer
            erman_amd

            Thank you, maximmoroz.

            But I could not find why it is highlighted green or red, or what each color means. I only found the definition of ALU:Fetch.

            For the assembly language, I found documentation on its format. I think that if I can understand the assembly, I may be able to understand why my kernel's performance is so poor.

             

              • ALU:Fetch in KernelAnalyzer
                maximmoroz

                My guess is that the kernel is highlighted green if the number of ALU operations is most probably enough to cover global memory access latency on that particular GPU chip, and red if it is not. But I wouldn't rely on this color coding.

                  • ALU:Fetch in KernelAnalyzer
                    himanshu.gautam

                    Hi erman,

                    Generally it's good to have a high ALU:Fetch ratio. A higher value (IMHO >10) ensures that the ALUs don't waste time waiting for data to arrive.

                    You should be able to find a lot in Chapter 4 of the OpenCL Programming Guide to help you understand why your kernel is slow. You are also always free to share it here, and someone might give you some useful suggestions.

                      • ALU:Fetch in KernelAnalyzer
                        erman_amd

                        Hi,

                        I have a very simple kernel:

                        __kernel void getValFromTable(__global float *output, __global float *input, __global uint *index, uint row, uint col)

                        {

                           uint x = get_global_id(0);

                           output[x] = input[row * col + index[x]];

                        }

                        KernelAnalyzer reports: GPR = 2, ALU = 10, Fetch = 2, Write = 1, ALU:Fetch = 1.25.
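Note that the reported ratio is not simply ALU divided by Fetch (10 / 2 would be 5). A minimal sketch of one way to reproduce the figure, assuming each fetch instruction is weighted like four ALU instructions. That weight is only inferred from the numbers above (10 / (2 * 4) = 1.25) and is not confirmed anywhere in this thread; `FETCH_WEIGHT` and `alu_fetch_ratio` are hypothetical names, not part of KernelAnalyzer.

```c
#include <assert.h>

/* Assumption: one fetch instruction is weighted like four ALU instructions.
   This weight is inferred from the reported figures, not from documentation. */
#define FETCH_WEIGHT 4.0

/* Hypothetical helper: estimated ALU:Fetch ratio from instruction counts. */
double alu_fetch_ratio(int alu_count, int fetch_count)
{
    assert(fetch_count > 0); /* the ratio is undefined without any fetches */
    return (double)alu_count / ((double)fetch_count * FETCH_WEIGHT);
}
```

Under that assumption, `alu_fetch_ratio(10, 2)` matches the 1.25 shown for this kernel.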

                        I looked at the assembly code. It contains instructions such as:

                        VFETCH R0._x__, R0.z, fc173 MEGA(4)

                        FETCH_TYPE(NO_INDEX_OFFSET)

                        ...

                        MEM_RAT_CACHELESS_STORE_RAW ...

                        From the Global Memory Optimization section in Chapter 4 of the OpenCL guide, I read that VFETCH means vertex fetch, that the load uses the L1 cache, and that it also uses the FastPath. In the profiler, the PathUtilization counter is 100 and CacheHit is 86.64.

                        My question: how does the memory read actually work?

                        output[x] = input[row * col + index[x]]

                        In this scenario, col is a constant value (e.g. 512) and, for example, row = 2, index[0] = 1, index[1] = 1, index[2] = 123, ...

                        so  

                        output[0] = input[1025]

                        output[1] = input[1025]

                        output[2] = input[1147]

                        ...
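The address arithmetic above can be checked on the host. A minimal sketch in plain C that treats each array element as one work-item; `gather_offset` and `count_repeated_reads` are hypothetical helper names, not part of the kernel or of any AMD tool, and the sketch only counts repeated offsets without modelling how the hardware serves them.

```c
#include <assert.h>
#include <stddef.h>

/* Offset into `input` that work-item x reads: row * col + index[x]. */
unsigned gather_offset(unsigned row, unsigned col, unsigned index_x)
{
    return row * col + index_x;
}

/* Count work-items whose read offset equals that of the previous work-item.
   Only adjacent work-items are compared. */
size_t count_repeated_reads(const unsigned *index, size_t n,
                            unsigned row, unsigned col)
{
    size_t repeats = 0;
    assert(index != NULL || n == 0);
    for (size_t x = 1; x < n; ++x) {
        if (gather_offset(row, col, index[x]) ==
            gather_offset(row, col, index[x - 1])) {
            ++repeats;
        }
    }
    return repeats;
}
```

With row = 2, col = 512 and index = {1, 1, 123}, the offsets are 1025, 1025 and 1147, so exactly one adjacent pair of work-items reads the same location.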

                        My thought is that output[0] and output[1] read the same location, which causes a conflict. Is that correct? But the PathUtilization counter shows 100% (optimal). I'm not sure whether the two are related. CMIIW.

                        Can anyone help explain how the memory read/write process works in this kernel?

                        Any comments or suggestions for improving the kernel are appreciated.