I'm a few years into this, but I haven't found any tutorials yet. :S
You can understand how your 'parameters' are passed to kernels from the ISA docs:
They're basically buffers. The host sends them over the PCIe bus to GPU RAM, and later the GPU can access them in 2 ways:
- with general memory reads through the L2 and L1 data caches (VFETCH instructions on Evergreen, buffer_load_* / tbuffer_load_* on GCN)
- or with constant reads:
* on Evergreen there is the KCache. Different sections of it can be mapped for each ALU clause, and the ALU instructions inside that clause can read from them (a small number of bits is enough to address data in the mapped sections).
* on GCN you can read constants with the s_buffer_load_* scalar instructions (they have a separate cache), and later you can use the loaded scalar registers in vector arithmetic.
At the IL level, those 2 kinds of memory are accessed through UAVs and constant buffers.
Also, there is a command processor in GPUs which is mostly undocumented (I've found some info in the old R7xx documentation).
It handles HOST<->GPU memory transfers (your parameters, the kernel binary code) and kernel execution.
- on Evergreen the registers are allocated per clause. There are also clause temporary registers for short-lived register needs.
- on GCN the registers are allocated for the whole kernel. Scalar registers are allocated separately, but in my experience that has never been a bottleneck, even though I always allocate all 105 S registers. Only the V register allocation matters for kernel efficiency.
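To see why the V register count matters, here is a rough sketch of how it limits the number of wavefronts a GCN SIMD can keep resident. The three constants are assumptions taken from public descriptions of first-generation GCN, not from this post, and the function name is made up for illustration; check the ISA manual for your exact chip.

```python
# Assumed first-generation GCN figures (not stated in the post above):
VGPRS_PER_LANE = 256      # vector registers available per lane of one SIMD
MAX_WAVES_PER_SIMD = 10   # hardware cap on resident wavefronts per SIMD
VGPR_GRANULARITY = 4      # allocation assumed to round up to a multiple of 4


def waves_per_simd(vgprs_used: int) -> int:
    """Estimate how many wavefronts fit on one SIMD for a given VGPR count."""
    if not 1 <= vgprs_used <= VGPRS_PER_LANE:
        raise ValueError("vgprs_used must be between 1 and 256")
    # Round the request up to the assumed allocation granularity.
    allocated = -(-vgprs_used // VGPR_GRANULARITY) * VGPR_GRANULARITY
    # The register file is shared by all resident wavefronts of the SIMD.
    return min(MAX_WAVES_PER_SIMD, VGPRS_PER_LANE // allocated)


if __name__ == "__main__":
    for count in (24, 25, 64, 84, 128, 256):
        print(count, "VGPRs ->", waves_per_simd(count), "waves per SIMD")
```

Under these assumptions, going from 24 to 25 V registers already costs one resident wavefront, which is why trimming a few V registers can be worth the effort.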
All these things are explained in detail in the ISA manuals; maybe they're just scary to read at first sight.
"I think that the ISA is preferable to the IL." I prefer it too. Let me tell you why:
IL is as far from ISA as OpenCL.
A few years ago IL had the advantage over OpenCL that it had some extra instructions (for example the multimedia ones) which you couldn't reach from OpenCL.
Also, there were some new ISA instructions that you couldn't issue from IL. (A year later they implemented those at the IL and OCL levels, but the hardware already supported them in machine code.)
But nowadays I think OpenCL can issue almost all the extra things in IL. So IMO there's no benefit to using IL over OCL. And IL handles the Evergreen ISA really well.
But GCN is different. There are many good things in it that you can only reach from ISA at the moment:
- true jumps (like on x86): no need to unroll EVERYTHING, and you can reorganize code into subroutines to fit into the instruction cache.
- integer adder with carry in and carry out in a single cycle (on Evergreen you had to use another instruction to get or add the carry; this is a real boost for high-precision math compared with the Evergreen ISA)
- IO which wastes fewer ALU cycles:
* you can start a read/write of 16 bytes from/to RAM in 1 cycle (that's not new)
* you can start a read/write of 2*8 bytes from/to LDS
* you can have an LDS parameter in a vector instruction
* you can start reading 64 bytes via the constant cache with a single scalar instruction.
- you can use registers as an array which is indexed by a scalar register (1 cycle) (for example: you can make a little stack for subroutines)
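To show why an adder with both carry in and carry out helps high-precision math, here is a small emulation sketch. It is plain Python, not GCN code; the function names are made up for illustration, and each call to add_with_carry stands for the single add-with-carry instruction described in the list above.

```python
MASK32 = 0xFFFFFFFF


def add_with_carry(a: int, b: int, carry_in: int) -> tuple[int, int]:
    """Emulate one 32-bit add that consumes a carry and produces a carry."""
    total = (a & MASK32) + (b & MASK32) + (carry_in & 1)
    return total & MASK32, total >> 32


def add_multiword(x: list[int], y: list[int]) -> tuple[list[int], int]:
    """Add two big integers stored as little-endian lists of 32-bit limbs."""
    if len(x) != len(y):
        raise ValueError("operands must have the same number of limbs")
    result = []
    carry = 0
    for a, b in zip(x, y):
        # One emulated instruction per limb: the carry chains straight through.
        limb, carry = add_with_carry(a, b, carry)
        result.append(limb)
    return result, carry


if __name__ == "__main__":
    # 0x1_FFFFFFFF + 1 = 0x2_00000000
    print(add_multiword([0xFFFFFFFF, 0x00000001], [0x00000001, 0x00000000]))
```

Without a carry input, each limb would need an extra step to fold in the previous carry, which is the extra instruction the Evergreen remark refers to.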
Thanks for the information. It was really interesting to learn everything that is possible at the ISA level.