Archives Discussions

hisense1 · ‎01-17-2011

Which way for best performance?

So here is my story: I wrote my code in hand-optimised IL. The result is that my code is about 50% slower than the CAL++ IL output, which is huge and four times bigger. On a real assembly system my code would be about 30% faster, but as far as I can see IL is only a trick: the final ISA is generated from my code after it has first been optimised. So from my experiments, the point seems to be not to write optimised IL assembly but very messy, huge code, which gives the compiler a better chance of assembling it into ISA that fully utilises the x, y, z, w, t slots.

Now the question: how can I get real control over the hardware, and is there any way to optimise my code? In many cases the generated ISA assembly does everything in a very lazy way: it picks up too many "easy" instructions at the start and leaves only "hard" instructions for the end, which with integers each consume a clock cycle without fully utilising x, y, z, w, t. I think this optimisation approach is copied from GCC, because the ISA assembler seems to like, and make good use of, the GCC-optimised code in the CAL++ output.

I know it is easy to control how x, y, z, w, t are utilised in short code, but I am talking about code with 800-1200 instructions. Right now this is a big unknown for me, and more a matter of luck than of real coding and optimising, since the ISA assembler cannot be controlled. Also, can _prec and _precmask be used to control how the ISA assembler optimises the code? They are documented in a very weird way.

hazeman · ‎01-17-2011

I think the short answer is that there is no way to directly control the generated ISA.

I can only give you a few pieces of advice and clear up a few misunderstandings.

1. On a CPU, code written in assembler converts directly to the binary code the CPU executes (assembler is a second-generation programming language). This isn't the case with IL and ISA. For the GPU, IL is a high-level programming language, and there is no direct translation from IL to ISA.

2. Because IL is a high-level programming language, optimising IL register usage doesn't make sense. The IL compiler will pack the used (!) registers into hardware ISA registers. So you can write code with 100 IL registers that use only the x component, or code with 25 registers that use the xyzw components, and both will use exactly the same number of ISA registers.

3. The CAL++ overhead is usually extra mov instructions. The IL compiler (remember, IL is a high-level language) really has no problem removing those. The extra registers used are also of no importance (see point 2).

4. CAL++ doesn't use gcc for optimisation. Code written in CAL++ is emitted directly as IL.

5. I haven't seen a kernel that can't be written just as efficiently in CAL++. Since a CAL++ kernel is much easier to write, it really doesn't make sense to use IL directly (it is a huge waste of time).

6. Some IL instructions (for example ddiv) are converted into many ISA instructions (ddiv on 4xxx cards gives >40 ops).

7. The IL compiler is sometimes really bad/stupid. There are situations where it makes optimisations that cause a huge increase in ISA register usage.

8. Usually you can trick the IL compiler into generating more efficient code by changing the order of instructions.

9. There is a standard trick to increase instruction slot usage: do more work in one software thread. When you work on 5 elements at the same time, you are guaranteed full slot usage. (Usually it's enough to work on 2-4.)
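
To make point 9 a bit more concrete, here is a rough toy model in Python. It is only a sketch and not how the real IL compiler schedules anything. It assumes the worst case, where the work for one element is a pure dependency chain (one instruction per bundle at most), and that chains for different elements are fully independent. The names `SLOTS`, `bundles_needed` and `slot_usage` are made up for this illustration.

```python
# Toy model only: this is NOT the AMD IL compiler's scheduling algorithm.
SLOTS = 5  # the x, y, z, w, t slots of one VLIW bundle


def bundles_needed(chain_length, elements_per_thread):
    """Count bundles needed for several independent dependency chains.

    Assumption: instructions within one element's chain depend on each
    other, so each chain can issue at most one instruction per bundle.
    Chains of different elements are independent and may share a bundle.
    """
    remaining = [chain_length] * elements_per_thread
    bundles = 0
    while any(remaining):
        # Chains that still have work, longest first, at most SLOTS per bundle.
        ready = [i for i, left in enumerate(remaining) if left > 0]
        ready.sort(key=lambda i: remaining[i], reverse=True)
        for i in ready[:SLOTS]:
            remaining[i] -= 1
        bundles += 1
    return bundles


def slot_usage(chain_length, elements_per_thread):
    """Fraction of available slots that actually hold an instruction."""
    total = chain_length * elements_per_thread
    return total / (bundles_needed(chain_length, elements_per_thread) * SLOTS)


for n in (1, 2, 4, 5):
    print(n, "elements per thread:", f"{slot_usage(100, n):.0%}", "slot usage")
```

In this model one element per thread fills only one slot in five, and five elements fill every slot. Real kernels usually have some independent instructions inside a single element as well, which is presumably why 2-4 elements is often enough in practice.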

I hope this helps.
