
Raistmer
Adept II

How does the GPU handle its code?

I mean, how is the kernel binary passed to the GPU (on context creation, before kernel launch, and so forth)?

Where is it stored? (The CPU uses ordinary system memory for code storage but has a separate L1 instruction cache to speed up instruction fetching. What about the GPU? Is the kernel binary stored in GPU global memory or in a special limited-size buffer? How are instructions fetched? Do they go through a common cache shared with data, through the constant cache, or through something special instead?)

How big is the possible performance impact of code bloat? (I mean, if one uses a few different specialized kernels to do similar work instead of one slower/more complex but universal kernel, how will the increase in the total number of kernels and in total kernel binary size impact performance?)
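To make the trade-off concrete, here is a hypothetical OpenCL C sketch of the two approaches (the kernel names and parameters are invented for illustration):

// Variant A: several small specialized kernels, one per task.
__kernel void scale(__global float* d, float k)  { size_t i = get_global_id(0); d[i] *= k; }
__kernel void offset(__global float* d, float b) { size_t i = get_global_id(0); d[i] += b; }

// Variant B: one universal kernel that branches on a mode flag.
__kernel void universal(__global float* d, float k, float b, int mode)
{
    size_t i = get_global_id(0);
    if (mode == 0) d[i] *= k;   // same work as scale()
    else           d[i] += b;   // same work as offset()
}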

There is information about how the GPU handles data memory, but not much, if any, about how it handles code. Worth discussing?

3 Replies
realhet
Miniboss

Hi,

On GCN hardware it's quite similar to x86+SSE, except it has 2048-bit SSE.

The binary is stored in the GPU's global memory. The instruction cache is 32 KB. Instructions are encoded as either 32 bits or 64 bits; a 32-bit instruction can carry an extra 32-bit literal constant. Instructions are fetched by a scheduler that can handle approximately (in my measurements) 12 bytes of instruction stream per clock, and it can handle and dispatch many different kinds of instructions: vector, scalar, LDS, GDS, vector-memory, and special ones.

There are 3 low-level caches: code (32 KB), scalar data (32 KB), and vector data (16 KB). The code and scalar caches are read-only and shared across 4 CUs.

"How big possible performance impact for code bloating?" -> The only important thing is that the inner loops of your program where 99% of processing is done have to fit into the 32KB code cache. Rather one big multifunctional kernel than many tiny kernels. Kernel launches has penalties, it can take fractions of a milliseconds. I think long running worker kernels are the best. No need for complicated dispatches and every wavefront has to be launched and initialized only once, so the more time you've got to utilize the ALU.


Memory handling / code data handling: GCN has a 64-bit flat memory model. Optionally (and OpenCL mostly does this), it can use resource constants, with which you can access a block of the flat memory virtually, with range checking and easier 32-bit addressing.
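As a rough illustration of that last point (how a given access actually compiles is an assumption here, inferred from the description above):

__kernel void copy(__global const float* src, __global float* dst)
{
    size_t i = get_global_id(0);
    // A __global buffer argument like src/dst can be accessed through
    // a buffer resource constant: base + size, a 32-bit offset (i*4),
    // and hardware range checking.
    // The same access could instead be compiled to a plain 64-bit
    // flat load/store, without the range check.
    dst[i] = src[i];
}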


Thanks a lot for this overview.

Do you know some details about earlier architectures? (I still have no GCN at my disposal.)


The older VLIW architecture is somewhat more complicated:

There are 5 execution units: X, Y, Z, W for simple operations, and one transcendental T unit for the complicated ones like sin(). (On HD 6xxx there are only 4 universal units: XYZW.)
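To picture what those units do, consider this hypothetical OpenCL C fragment; how the compiler actually bundles it is an assumption for illustration:

__kernel void vliw_demo(__global const float4* a, __global float4* d)
{
    size_t i = get_global_id(0);
    float4 v = a[i];
    // The four independent multiplies below could be packed into one
    // VLIW bundle, one component per X/Y/Z/W slot...
    float4 r;
    r.x = v.x * 2.0f;
    r.y = v.y * 2.0f;
    r.z = v.z * 2.0f;
    r.w = v.w * 2.0f;
    // ...while a transcendental such as sin() goes to the T unit.
    r.x += sin(v.x);
    d[i] = r;
}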

Program code is divided into parts like these:

- ALU Instruction Word: 64-bit only.

- Very Long Instruction Word: 1..5 instructions telling the corresponding X/Y/Z/W/T units what to do, plus up to 2x64-bit literal constants. So the largest VLIW instruction can be 7*64 bits long.

- ALU Clause: a VLIW stream of up to 128*64 bits.

- Program flow: this is the highest level. It references ALU clauses and specifies loops and IF/THEN/ELSE blocks. In addition to the ALU clauses, it can handle various special things like texture fetches, memory exports, and interpolated vertex data imports.

The binary starts with the program flow code (instruction size here is 128 or 256 bits, as I remember), followed by the ALU clauses.

The instruction cache is 48 KB, in contrast to GCN's 32 KB. This is because GCN also has 32-bit compressed instructions, so its long-run average instruction size is around 48 bits, not 64 bits.

There is a dedicated constant cache: every ALU clause can use 2 specified pages from it. One page contains 256(?) constant registers, or something like that.

So this is quite unique hardware. I wouldn't dare to program it manually; that 3-memory-port thing with the vec_123 thing scared me, lol. But as far as I know, everything in the Evergreen architecture is exposed to OpenCL very well, so there is no need to do it in asm. There were times in the past when some rare multimedia instructions weren't accessible from a higher level, so they required some binary patching. IMO, VLIW assembly is extremely hard for humans (or at least for me).
