
byly
Journeyman III

anybody using assembly language (ISA, non-IL)?

I am trying to assemble code that was disassembled with calclDisassembleImage().

I used calclAssembleObject() which resulted in the error string: "Function used properly but currently not supported".

Based on comments in the sample code (samples/cal/common/Samples.cpp), I tried switching from the Linux to the Windows (XP, 32-bit) Stream SDK, only to get the same error.

Simple IL programs (both ps and cs) gave the same result.

So I'm obviously doing something wrong. Can you help, please, if you have ATI assembly language experience?

P.S. According to the CAL Programming Guide (March 2010), section 2.2.2, this should be possible.

0 Likes
17 Replies

AFAIK, the assembler is supported only for the 4XXX family and below. When the 5XXX was released a year ago, ATI totally abandoned the assembler feature, so the lowest level possible now is IL. I asked at the time whether ATI was planning to add assembler support for 5XXX, but Micah answered that no such plans exist. I doubt that will change with 6XXX.

0 Likes

Thank you very much for your response, empty_knapsack, you saved my mental health.

But don't you think it's ridiculous to have fantastic hardware and no chance to use its full capability? <mode: rhetoric>

0 Likes

Can someone tell me what IL is, please?

0 Likes

http://en.wikipedia.org/wiki/Intermediate_language

0 Likes

The assembler only works correctly for 3XXX and below chips in pixel shader mode.
0 Likes

It would be great, Micah, to reflect this in the docs.

0 Likes

byly,

Since CAL support is deprecated, don't expect the documentation to get updated but do expect it to disappear.

0 Likes

Hi,

But in CAL version 1.4.1457 there is no support for 3XXX as a CALtarget, since in cal.h the CALtargetEnum starts from CAL_TARGET_600 (R600 GPU ISA). Is there any available version of CAL that supports calclAssembleObject for 3XXX? Please point me to a download link.

Thanks

0 Likes

The 3XXX chips use the R600 GPU ISA.

0 Likes

Thanks, Micah. But in CAL version 1.4.1457, calclAssembleObject still returns "Function used properly but currently not supported" even with CAL_TARGET_600 as the target ISA parameter. Which version of CAL, if any, has a working calclAssembleObject for R600?

0 Likes

Hi,

Try to compile it with calclCompile()! At least then you can locate where the error originates.

Also try it with a pixel shader, maybe it just doesn't like CS ->

MicahVillmow wrote:

The assembler only works correctly for 3XXX and below chips in pixel shader mode.

(PS instead of CS, and the g[] global buffer instead of uav <- all of these run perfectly on my R770, but maybe the R600 prefers the 'gameish' approach)

0 Likes

Thanks, realhet. As you mentioned, and as the CAL manual also claims, calclCompile should accept ISA as an input language. The manual states: "Only the ATI IL and the stream processor-specific Instruction Set Architecture (ISA) are supported as the runtime programming interfaces by calclCompile." (page 29, rev 2.01, March 2010).

But calclCompile returns an error when the assembly code extracted by calclDisassembleImage is used as the input. The error says "Failed to compile program with IL front-end compiler!", which suggests that calclCompile expects IL as its only input language, contrary to what the manual states. I tried every device target and fed in the assembly produced by calclDisassembleImage without any changes, but every attempt failed with the same error.

It seems the ATI tool-chain cannot generate a binary from ISA code, with either calclCompile or calclAssembleObject, despite the claims in the manuals. I would be grateful if someone could point to a working example of simple ISA-to-binary generation with any ATI tool-chain for any ATI target device.

Thanks!

0 Likes

Sorry for misunderstanding you, I really thought you wanted to compile IL.

As for the ISA compiler: I never knew that there was one. (Maybe because I started late, on R770, and afaik there was only a disassembler for that.)

I think that CAL's IL compiler does an excellent job when it comes to optimizing (I can learn tricks from it), but on the GCN architecture there are more possibilities: the S alu has computing power too, and it's inefficient to use it only for fetching constant values from memory and handing them to the V alus.

For example: I have an inner loop which uses lots of precalculated values. Those values are invariant across the whole loop and across the 16 vector lanes.

What does CalCl do?

It puts all those precalculated values into vector registers. And because more than 128 Vregs are used, only one wavefront can sit in each SIMD engine, which leads to long wait times between consecutive wavefronts (-30% perf).

What do I want to do?

Put the precalculated values into Sregs (there are 105 of them) and bring the number of Vregs down to 128 or fewer, so there will always be 2 waves in each SIMD engine.
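The register arithmetic above can be sketched in a few lines. This is only a rough model under stated assumptions: a budget of 256 vector registers per SIMD lane (consistent with the numbers in this thread, where more than 128 Vregs gives one wave and at most 64 gives four) and a cap of 10 waves per SIMD. The names are made up for illustration, and allocation granularity, SGPR limits and LDS limits are all ignored.

```python
# Rough occupancy model for one GCN SIMD, limited by vector registers only.
# Both constants are assumptions made for this sketch.
VGPR_BUDGET_PER_SIMD = 256  # assumed vector registers available per lane
MAX_WAVES_PER_SIMD = 10     # assumed hardware cap on resident wavefronts


def waves_per_simd(vgprs_per_wave: int) -> int:
    """Return how many wavefronts fit in one SIMD for a given VGPR count."""
    if vgprs_per_wave <= 0:
        raise ValueError("vgprs_per_wave must be positive")
    return min(MAX_WAVES_PER_SIMD, VGPR_BUDGET_PER_SIMD // vgprs_per_wave)


if __name__ == "__main__":
    for vgprs in (129, 128, 64):
        print(f"{vgprs} Vregs per wave: {waves_per_simd(vgprs)} wave(s) per SIMD")
```

Under these assumptions, dropping from 129 to 128 Vregs is what moves a kernel from one resident wave to two, which is why moving the loop-invariant values into Sregs matters.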

Further possibilities: if the 105 Sregs run out, it is also possible to calculate these scalar values inside the loop with the S alu (integer math only). For simple integer math the S alu runs at 1/8 the performance of the V alu, and there is mul32 at 1/16 performance (I hope it's not the DP mul unit). And with a properly ordered instruction stream, using the S alu should not increase the kernel's overall execution time.

So I really hope that low-level programming will not be 'deprecated', as the hardware gets more and more complicated (and this way more effective) over time. It also gets harder to write an automated compiler for it.

Another example (GCN isa): You can do 64byte reads with only 2 S instructions. It's 'super effective' for a bigint multiply algo, and while the V alu is busy with calculations, the S alu can do some other things.

0 Likes

"and there is mul32 at 1/16 performance (hope it's not the DP mul unit)"

Well it would make sense to be, particularly considering bottom of page 44 of this:

http://developer.amd.com/afds/assets/presentations/2620_final.pdf

24 BIT INT MUL/MULADD/LOGICAL/SPECIAL @ full SP rates

– Heavy use for Integer thread group address calculation

– 32-bit Integer MUL/MULADD @ DPFP Mul/FMA rate

Which at least means the higher end parts will be faster at it ...

IMHO GCN looks a lot easier to compile/assemble for than the VLIW, so I'm not sure why you would consider it the opposite. Actually that was another take-away from that particular talk (see page 19), and I get the impression it was one of the goals of GCN in the first place.

0 Likes

I did some testing and found out that the scalar unit's mul32 does indeed run at full SP rate.

v_mad_i32_i24 v0,v0,v1,v2
s_mul_i32 s0,s0,s1
v_mad_i32_i24 v0,v0,v1,v2
s_mul_i32 s0,s0,s1
v_mad_i32_i24 v0,v0,v1,v2
s_mul_i32 s0,s0,s1
v_mad_i32_i24 v0,v0,v1,v2

Interleaving 4 v_mad24 with 3 s_mul32 produces 2.33% more mul32 TOps/sec without slowing down the V alus. (The S alu's mul32 performance was 88 GOps/s on the 7970 at stock clock.)

A 4th s_mul can't be used in this situation, because it gives too much work to the instruction decode/arbitration units.

I think the maximum capacity of the instruction decoders is something like 2.5 dwords/clock. I guess that different numbers of decoders work on different types of instructions (SOP1, SOP2, VOP1, VOP2, VOP3, etc.), and that those are shared between the exec units. In this example there are too many 2-dword instructions (v_mad has a 64-bit encoding) to reach the ideal 1:1 V:S ratio (adding the 4th s_mul actually decreased the total performance by 6.3%). When those decoders are highly utilized, it's important to use no more than 64 Vregs, so that 4 waves can share the decoders with fewer stalls. But that's just my theory, I'd love to know exactly how it works.
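To put numbers on the instruction-stream size in this experiment, here is a small sketch that counts encoded dwords per vector instruction for the two mixes. It assumes v_mad_i32_i24 takes 2 dwords (the 64-bit encoding mentioned above) and s_mul_i32 takes 1 dword. The function names are made up for illustration. It does not model the decoders themselves, and the 2.5 dwords/clock figure remains a guess.

```python
# Encoded instruction sizes in dwords (assumptions for this sketch).
DWORDS_V_MAD_I32_I24 = 2  # 64-bit encoding, as noted in the post
DWORDS_S_MUL_I32 = 1      # assumed 32-bit scalar encoding


def stream_dwords(n_vector: int, n_scalar: int) -> int:
    """Total dwords in a block of n_vector v_mad and n_scalar s_mul."""
    return n_vector * DWORDS_V_MAD_I32_I24 + n_scalar * DWORDS_S_MUL_I32


def dwords_per_vector_op(n_vector: int, n_scalar: int) -> float:
    """Dwords the front end must fetch for each vector instruction issued."""
    return stream_dwords(n_vector, n_scalar) / n_vector


if __name__ == "__main__":
    for n_scalar in (3, 4):
        total = stream_dwords(4, n_scalar)
        ratio = dwords_per_vector_op(4, n_scalar)
        print(f"4 v_mad + {n_scalar} s_mul: {total} dwords, {ratio} per vector op")
```

Going from 3 to 4 scalar multiplies raises the stream from 11 to 12 dwords per 4 vector instructions, which is at least in line with the observation that the 4th s_mul tips the front end over its limit.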

0 Likes

But for my particular research I need to tweak the code at the ISA level. Is there any way to get a binary out of ISA with the current ATI tool-chain or any available previous version? BTW, the CAL manual says: "Only the ATI IL and the stream processor-specific Instruction Set Architecture (ISA) are supported as the runtime programming interfaces by calclCompile."

Thanks!

0 Likes

AMD does not provide any tools or support to do what you want. It is possible using tools like objdump and elfdump, but it is an unsupported path. In this case, the documentation is wrong. ISA as input is not supported.
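For anyone attempting the unsupported route, a first sanity check is to confirm that the dumped image really is an ELF container, which the mention of objdump and elfdump suggests but this thread does not spell out. The sketch below only reads the standard 16-byte ELF identification header. The function name is hypothetical, and nothing here is specific to CAL or AMD.

```python
# Minimal check of the standard ELF identification bytes (e_ident).
# Assumption: the image to be patched is an ELF file; this is not CAL-specific.
ELF_MAGIC = b"\x7fELF"
_CLASSES = {1: "ELF32", 2: "ELF64"}
_ENCODINGS = {1: "little-endian", 2: "big-endian"}


def describe_elf_ident(blob: bytes) -> dict:
    """Decode class, byte order and version from the first 16 bytes."""
    if len(blob) < 16 or blob[:4] != ELF_MAGIC:
        raise ValueError("not an ELF image")
    return {
        "class": _CLASSES.get(blob[4], "unknown"),
        "data": _ENCODINGS.get(blob[5], "unknown"),
        "version": blob[6],
    }


if __name__ == "__main__":
    # Synthetic 16-byte header for demonstration, not a real GPU binary.
    fake_header = ELF_MAGIC + bytes([1, 1, 1]) + bytes(9)
    print(describe_elf_ident(fake_header))
```

If the check passes, the usual ELF tools can list the sections. Which section holds the ISA bytes is not stated in this thread and would have to be found by inspection.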

0 Likes