2 Replies Latest reply on Mar 21, 2013 12:10 AM by himanshu.gautam

    Understanding ISA and IL Dumps

    aj_guillon

      Hi everyone,

       

      I'm trying to understand the ISA and IL dumps, in order to understand how the compiler has generated my programs.  Unfortunately, I haven't been able to find good tutorials or information.  I understand that the IL and ISA have documentation available, and I have looked at them.  What is missing, however, is an understanding of the way that data is passed into the kernels for the initial call.  If this were a CPU, I would expect function arguments to be on the stack.  I don't have a good way of connecting the arguments to the kernel with the setup code that appears to be required for a kernel to run.  This makes it very hard for me to trace my program.

       

      Are there any good tutorials on the ISA and IL to understand what is going on?  For the purposes of understanding register allocation, I think that the ISA is preferable to the IL.

        • Re: Understanding ISA and IL Dumps
          realhet

          Hi,

           

          I'm a few years into this, but I haven't found any tutorials yet. :S

           

          You can understand how your 'parameters' are passed to kernels from the ISA docs:

          They're basically buffers. The host sends them over the PCIe bus to GPU RAM, and later the GPU can access them in two ways:

          - with general memory reads through the L2 and L1 data caches (VFETCH instructions on Evergreen, buffer_load_* / tbuffer_load_* on GCN)

          - or with constant reads 

             * on Evergreen there is the KCache. Different sections of it can be mapped to each ALU clause, and the ALU instructions inside that clause can read from them (a small number of bits is enough to address data in the mapped sections).

             * on GCN you can read constants with the s_buffer_load_* scalar instructions (they go through a separate cache), and later use the loaded scalar registers in vector arithmetic.

           

          At the IL level, those two kinds of memory are accessed as UAVs and constant buffers.
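
          To make the "they're basically buffers" idea concrete, here is a minimal Python sketch that models kernel arguments as one packed byte buffer which a kernel-side constant read then indexes. The 16-byte slot size, the function names and the example argument list are all assumptions for illustration only; the real layout is whatever the driver and compiler choose, so check the IL dump of your own kernel.

```python
import struct

# ASSUMPTION: one fixed-size slot per argument, chosen only for illustration.
SLOT_BYTES = 16


def pack_kernel_args(args):
    """Pack (struct format, value) pairs into one buffer, one slot per argument."""
    buf = bytearray()
    for fmt, value in args:
        raw = struct.pack("<" + fmt, value)       # little-endian encoding
        buf += raw.ljust(SLOT_BYTES, b"\x00")     # zero-pad up to the slot size
    return bytes(buf)


def read_arg(buf, index, fmt):
    """Model a constant read: fetch argument `index` from its slot."""
    return struct.unpack_from("<" + fmt, buf, index * SLOT_BYTES)[0]


# Hypothetical kernel signature: (buffer address, element count, scale factor)
args = [("Q", 0x10000000), ("I", 1024), ("f", 0.5)]
cb = pack_kernel_args(args)

count = read_arg(cb, 1, "I")   # the kernel reads its second argument
```

          In this model the host only ever fills `cb`; the kernel never sees a stack, it just reads slots by index, which is roughly what the constant-read path described above amounts to.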

           

          Also, there is a command processor in GPUs which is largely undocumented (I've found some info in old R7xx documentation).

          It handles HOST<->GPU memory transfers (your parameters or the kernel binary code) and kernel execution.

           

          Register allocation:

          - on Evergreen registers are allocated per clause. There are also clause-temporary registers for short-lived values.

          - on GCN the registers are allocated for the whole kernel. Scalar registers are allocated separately, but in my experience it was never a bottleneck that I always allocate all 105 S registers. Only the V register allocation matters for kernel efficiency.
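
          A rough sketch of why the V register count matters on GCN: the more vector registers a kernel uses, the fewer wavefronts fit on one SIMD at the same time. The limits below (256 VGPRs per lane per SIMD, at most 10 wavefronts per SIMD, allocation in steps of 4) are assumptions taken from my reading of the GCN documentation, and the function name is made up for this example; verify the numbers for your specific chip.

```python
# ASSUMPTIONS (verify against the ISA manual for your chip):
MAX_WAVES_PER_SIMD = 10     # upper limit on resident wavefronts per SIMD
VGPRS_PER_SIMD_LANE = 256   # vector registers available per lane on one SIMD
VGPR_GRANULARITY = 4        # registers are handed out in multiples of this


def waves_per_simd(vgprs_used):
    """Estimate how many wavefronts fit on one SIMD for a given VGPR count."""
    if not 1 <= vgprs_used <= VGPRS_PER_SIMD_LANE:
        raise ValueError("VGPR count out of range")
    # Round the request up to the allocation granularity.
    allocated = -(-vgprs_used // VGPR_GRANULARITY) * VGPR_GRANULARITY
    return min(MAX_WAVES_PER_SIMD, VGPRS_PER_SIMD_LANE // allocated)


for n in (24, 25, 64, 128):
    print(n, "VGPRs ->", waves_per_simd(n), "waves per SIMD")
```

          Under these assumptions, going from 24 to 25 V registers already costs one resident wavefront, which is why the ISA dump is the right place to look when tuning register usage.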

           

          All these things are explained in detail in the ISA manuals; maybe they're just scary to read at first sight.

           

          "I think that the ISA is preferable to the IL." I prefer it too. Let me tell you why:

          IL is as far from ISA as OpenCL.

          A few years ago IL had the advantage over OpenCL that it had some extra instructions (for example the multimedia ones) which you couldn't reach from OpenCL.

          Also, there were some new ISA instructions that you couldn't issue from IL. (A year later they were implemented at the IL and OCL levels, but the hardware already supported them in machine code.)

          But nowadays I think OpenCL can issue almost all the extra things in IL. So IMO there's no benefit to using IL over OCL. And IL handles the Evergreen ISA really well.

          But GCN is different. There are many good things in it that you can only reach from ISA at the moment:

          - true jumps (like in x86) (no need to unroll EVERYTHING, you can reorganize code into subroutines to fit into the Instruction Cache.)

          - integer add with carry in and carry out in a single cycle (on Evergreen you had to use another instruction to get or add the carry; this is a real boost for high-precision math over the Evergreen ISA)

          - IO which wastes fewer ALU cycles:

             * you can start a 16-byte read/write from/to RAM in 1 cycle (that's not new)

             * you can start a 2*8-byte read/write from/to LDS

             * you can have an LDS parameter in a vector instruction

             * you can start reading 64 bytes via the constant cache with a single scalar instruction.

          - you can use registers as an array indexed by a scalar register (1 cycle); for example, you can make a little stack for subroutines
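
          To illustrate the add-with-carry point from the list above, here is a small Python model of a 32-bit add that takes a carry in and produces a carry out in one step, chained to add multi-word integers. The names `addc32` and `add_multiword` are invented for this sketch; it only models the arithmetic and is not ISA code.

```python
MASK32 = 0xFFFFFFFF


def addc32(a, b, carry_in):
    """Model one 32-bit add with carry in and carry out."""
    total = a + b + carry_in
    return total & MASK32, total >> 32   # (low 32 bits, carry out)


def add_multiword(x, y):
    """Add two equal-length little-endian lists of 32-bit words."""
    result = []
    carry = 0
    for a, b in zip(x, y):
        # One add-with-carry per word; no extra instruction to fetch the carry.
        word, carry = addc32(a, b, carry)
        result.append(word)
    return result, carry


# 0xFFFFFFFF_FFFFFFFF + 1 overflows both words and leaves a final carry.
words, carry_out = add_multiword([0xFFFFFFFF, 0xFFFFFFFF], [1, 0])
```

          With a fused carry in/out, an N-word addition is N instructions; when the carry has to be extracted or added by a separate instruction, as described for Evergreen above, the chain gets correspondingly longer.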