
sgratton
Adept I

Assembler for HD69xx

A work-in-progress...

 

Hi there,

 

Over the past little while I've been developing an assembler for HD69XX cards.  Just recently I've noticed a post or two on the forum mentioning an interest in coding in assembler.  So, while my "asm69" is still a work-in-progress, I thought I'd mention it in case anybody would like to take a look.  Unfortunately I don't have much time to spend on this now so it may well remain a work-in-progress for some time to come!

 

It uses lex/yacc, and needs MinGW for use on Windows.  There isn't much documentation yet, I'm afraid.

 

Some points to note about the assembly are:

 

1/ No numbering of instructions is necessary

2/ One can use labels in looping constructs

3/ Some "vectorized" instructions are available, e.g. VECMULADD R3,R0,R1,R2.

 

One prepares two text files, one containing the assembly and the other containing "setup" info.  The assembler then takes these as input and constructs an elf file that one can load into one's own program using CAL.

 

I've been using SDK 2.4 and Catalyst 11.5 on Windows.

 

I'm afraid that things aren't quite working perfectly yet, and of course one should only try these things at one's own risk! 

 

So far only pixel shaders are supported; I've found it too difficult to understand how the setup stuff works for compute shaders!  Older cards could in principle be supported too, but the latest, being VLIW4 not VLIW5, seemed the simplest ones to start with.

Please see the link on:

 

http://www.ast.cam.ac.uk/~stg20/amdstream/index.html

 

 

Any feedback much appreciated! 

 

Best wishes,

Steven.

 

 

0 Likes
6 Replies
corry
Adept III

I wish you were farther along...

That said, the one suggestion I might make is to emit a binary that fits into the OpenCL interface, rather than conforming to the easier CAL interface.  It shouldn't be too hard from what I understand.  AMD is really trying their hardest to kill the low level interfaces: despite OpenCL relying on CAL, they are coding all of their tools to hook only the OpenCL calls, rather than the CAL calls which the OpenCL calls make, and to report errors when one tries to use CAL.

The frustration level with this is through the roof...  That said, I'm fairly close to the max theoretical speed with my kernel now, which is significantly faster than the only released competitor.  (I'm curious to see Intel's latest when it comes out; I think it *might* be able to come close to a 6990, assuming the x86 code I have for the same kernel scales by GHz and multiplies across cores...)

I want AMD's FSA to succeed, but I also want them to drop the attitude and hubris about their compiler, and let us do our jobs/hobbies.

0 Likes

 

Hi there,

 

I think it would in principle be possible to generate an OpenCL binary too; from what I remember from the last time I looked at this, the .text section of the OpenCL ELF file is itself a CAL ELF file.
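
One way to check that recollection on a given binary is sketched below in Python.  It is only a minimal sketch: it assumes the OpenCL binary is a 32-bit little-endian ELF image, and the helper names (find_section, inner_cal_image) are made up for illustration and are not part of asm69 or any SDK.  It walks the section headers, pulls out .text, and tests whether that payload itself starts with the ELF magic number.

```python
import struct

ELF_MAGIC = b"\x7fELF"

def find_section(image: bytes, wanted: str):
    """Return the raw bytes of the named section of an ELF32 LE image, or None."""
    if image[:4] != ELF_MAGIC:
        raise ValueError("not an ELF image")
    if image[4] != 1 or image[5] != 1:
        raise ValueError("this sketch only handles ELF32 little-endian")
    # Fields of the ELF32 header that follow the 16-byte e_ident array.
    (e_type, e_machine, e_version, e_entry, e_phoff, e_shoff, e_flags,
     e_ehsize, e_phentsize, e_phnum, e_shentsize, e_shnum,
     e_shstrndx) = struct.unpack_from("<HHIIIIIHHHHHH", image, 16)

    def section_header(index):
        # An ELF32 section header is ten 32-bit words.
        return struct.unpack_from("<10I", image, e_shoff + index * e_shentsize)

    # Section names live in the string table section numbered e_shstrndx.
    strtab_off, strtab_size = section_header(e_shstrndx)[4:6]
    strtab = image[strtab_off:strtab_off + strtab_size]

    for i in range(e_shnum):
        sh = section_header(i)
        name_off, offset, size = sh[0], sh[4], sh[5]
        end = strtab.index(b"\0", name_off)
        if strtab[name_off:end].decode("ascii") == wanted:
            return image[offset:offset + size]
    return None

def inner_cal_image(opencl_binary: bytes):
    """Return .text if it looks like a nested ELF image, else None."""
    text = find_section(opencl_binary, ".text")
    if text is not None and text[:4] == ELF_MAGIC:
        return text
    return None
```

If inner_cal_image returns something for a binary saved from an OpenCL program, that payload would be the natural thing to compare against what a CAL-targeted assembler emits.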

 

It is a shame about CAL.  Indeed, I am concerned that I might not even be able to get the assembler polished before CAL is removed!  Oh well!

 

I should perhaps have posted a bit of example code to give people a flavour of what assembly might look like.  Seeing as it can support vector operations, it isn't too much more difficult than IL.  Indeed, one of the main motivations for doing this was a feeling that the transformations the AMD shader compiler was doing to my IL code were actually hurting, not helping.  (Also, it didn't seem to want to emit burst write operations.)  One matrix multiplication code previously used about 92 registers; the assembler version uses 58.  This is very important of course in hiding latency, as fewer registers per thread leave room for more wavefronts in flight.
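
To put rough numbers on the register point, here is a back-of-the-envelope sketch in Python.  The budget of 256 GPRs per thread slot is an assumption used for illustration, not a figure from this thread, and the function name is made up; the real limit depends on the chip and on other resources such as the stack.

```python
def wavefronts_in_flight(gprs_per_thread: int, gpr_budget: int = 256) -> int:
    """Upper bound on resident wavefronts if registers are the only limit.

    gpr_budget is an ASSUMED per-thread register budget, chosen for
    illustration; substitute the right figure for your hardware.
    """
    return gpr_budget // gprs_per_thread

for gprs in (92, 58):
    print(gprs, "GPRs ->", wavefronts_in_flight(gprs), "wavefronts")
# prints:
# 92 GPRs -> 2 wavefronts
# 58 GPRs -> 4 wavefronts
```

Under that assumption, going from 92 to 58 registers doubles the number of wavefronts available to cover memory latency.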

 

Here are excerpts of the above-mentioned matrix-multiply code...

 

ALU KB0:0 KCADDR0:0 KC0_LOCK_1
FLOOR R1.x R0.x
FLOOR R1.y R0.y L

MULADD_IEEE R1.x PV.x ALU_SRC_LITERAL.x ALU_SRC_0_5
MULADD_IEEE R1.y PV.y ALU_SRC_LITERAL.y ALU_SRC_0_5 L
4.0f 8.0f

MOV R1.z ALU_SRC_0_5
MOV R1.w ALU_SRC_0_5 L

....

MEM RD_SCATTER R26 R40 BCNT:4
MEM RD_SCATTER R30 R41 BCNT:4
LOOP_START_DX10  ADDRLAB:2
LAB:1
ALU KB0:0 KCADDR0:0 KC0_LOCK_1
PRED_SETGE R0 KC0.0.x R1.z U UP L NOWRITE

TC VPM
SAMPLE R34 R1.zy00 T0 S0  0.0f,0.0f,0.0f
SAMPLE R35 R1.zy00 T0 S0  0.0f,1.0f,0.0f
....

VECMULADD R32 R41.w R56 R32
VECMULADD R33 R41.x R45 R33
VECMULADD R33 R41.y R49 R33
VECMULADD R33 R41.z R53 R33
VECMULADD R33 R41.w R57 R33
ALU
ADD R1.z R1.z ALU_SRC_LITERAL.x
ADD R1.w R1.w ALU_SRC_LITERAL.y L
1.0f 4.0f

LOOP_END ADDRLAB:1
LAB:2
ALU KB0:0 KCADDR0:0 KC0_LOCK_1
FLT_TO_INT R42.x R1.x
FLT_TO_INT R42.y R1.y L
....

MEM_EXPORT R30 R41  WRITE_IND BCNT:4 ES4 VPM MARK COMPMASK:F
EXPORT_DONE R0 PIX0
END NB

and here is the associated setup file:

USE_GLOB
USE_VWIN
USE_LOOPS
NUM_GPRS=58
STACK_SIZE=2
#USE_OUTPUTS: o0
USE_INPUTS: i0 i1
#USE_UAVS: uav0 uav2 uav7
USE_CBS: cb0[1]
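
For anyone wanting to script around such setup files, the layout above can be read with a few lines of Python.  This is a sketch based only on the example shown, not asm69's actual parser: it assumes a leading # comments a line out, that NAME=value lines hold integers, that NAME: lines hold whitespace-separated lists, and that bare names are flags.  The parse_setup name is hypothetical.

```python
def parse_setup(text: str) -> dict:
    """Parse a setup file laid out like the example into a dict.

    Assumptions (inferred from the example, not from asm69's source):
    '#' starts a commented-out line, NAME=value holds an integer,
    NAME: a b c holds a list, and a bare NAME is a boolean flag.
    """
    config = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if "=" in line:
            key, value = line.split("=", 1)
            config[key.strip()] = int(value)
        elif ":" in line:
            key, items = line.split(":", 1)
            config[key.strip()] = items.split()
        else:
            config[line] = True
    return config

EXAMPLE = """\
USE_GLOB
USE_VWIN
USE_LOOPS
NUM_GPRS=58
STACK_SIZE=2
#USE_OUTPUTS: o0
USE_INPUTS: i0 i1
#USE_UAVS: uav0 uav2 uav7
USE_CBS: cb0[1]
"""

config = parse_setup(EXAMPLE)
print(config["NUM_GPRS"], config["USE_INPUTS"])
```

A check such as comparing config["NUM_GPRS"] against the highest register actually used in the assembly file would then be easy to add.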

 

Finally, I should also mention that I've now posted a version that seems to work better on linux too; see:

 

http://www.ast.cam.ac.uk/~stg20/amdstream/index.html

 

Best wishes,

Steven.

 

 

 

 

0 Likes

Interesting.  I had wondered if there was some architecture-specific thing it was doing that would actually be faster despite the number of registers.  I had been debating starting a new topic about it, but no one seems to like us pesky low level programmers, so I figured I'd either hold off and post later, or just try "fixing" the ISA code the IL compiler generated and see if my "fixed" version was faster.  I suspect it would be, but it looks like I may have 2 separate architectures to support, with the possibility of more being added all the time, so I'm left trying to decide the balance of performance and maintainability...  Me, I say all performance, but those paying the bill probably want some balance 🙂  I may still decide to make use of your assembler for the Caymans though...

0 Likes

I probably should just get off these forums, but I was pretty happy to get some more information about the Caymans...  I don't know if you were following my IL curiosity thread or not; if not, check it out: http://forums.amd.com/devforum/messageview.cfm?catid=390&threadid=155639&enterthread=y  There is some really, really good information there about the lower level architecture of the Caymans.  It looks like getting groups of 4 instructions is meaningless.  Very, very interesting!

0 Likes

 

Hi there,

 

Thanks for the link; the SIMDed VLIW nature of these chips indeed makes them very interesting to think about! 

 

In one message you mentioned "fixing" the ISA code the compiler generates; unless one is literally changing one ALU instruction, this is actually very hard to do without an assembler like asm69.  You can't just add instructions or use extra registers, for three reasons.  First, within the ISA itself there are hardcoded addresses and alignment constraints.  Second, the "metadata" about the program stored in the calprograminfo note has fields for things like the length of the program, the number of registers used, the stack depth and so on.  Third, it all has to be properly packaged in an ELF image.  So one has to be careful.
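
The first of those reasons can be illustrated with a toy two-pass label resolver in Python.  This is a simplified sketch, not how asm69 is implemented: it borrows the LAB:n / ADDRLAB:n spelling from the excerpts earlier in the thread, counts addresses in whole lines rather than in real instruction words, ignores alignment, and the ADDR:n output notation is invented for illustration.

```python
def resolve_labels(lines):
    """Replace ADDRLAB:n operands with the address of the matching LAB:n line.

    Toy model: one address per instruction line, no alignment rules.
    """
    labels = {}
    instructions = []
    # Pass 1: record the address each label refers to.
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith("LAB:"):
            labels[line[4:]] = len(instructions)  # address of the next instruction
        else:
            instructions.append(line)
    # Pass 2: patch every label reference with the recorded address.
    resolved = []
    for ins in instructions:
        words = [f"ADDR:{labels[w[8:]]}" if w.startswith("ADDRLAB:") else w
                 for w in ins.split()]
        resolved.append(" ".join(words))
    return resolved

program = [
    "LOOP_START_DX10 ADDRLAB:2",
    "LAB:1",
    "ALU",
    "TC",
    "LOOP_END ADDRLAB:1",
    "LAB:2",
    "ALU",
]

print(resolve_labels(program)[0])   # LOOP_START_DX10 ADDR:4
patched = program[:4] + ["ALU"] + program[4:]  # add one clause inside the loop
print(resolve_labels(patched)[0])   # LOOP_START_DX10 ADDR:5
```

Adding a single clause inside the loop changes the address baked into LOOP_START_DX10, which is exactly the kind of fix-up that is tedious to do by hand on compiled ISA and that labels take care of.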

 

Having said that, an il program and its shader-compiled ISA provide a good starting point for any attempt with asm69...

 

Best wishes,

Steven.

0 Likes

Actually, the interesting part was that they *are not* SIMD, and haven't been for some time!  Very interesting indeed.  It can work as a SIMD, but the instruction decoder is more complex than that.  It's not even really MIMD; it's more of a 4-instruction parallel dispatcher assuming no dependencies, but that's not even really true.  In reality, I guess it sounds like in hardware it's a 256-instruction parallel dispatcher...

Yes, I spoke of "fixing" the ISA code, but that's really unnecessary because the system is not SIMD.  Even though I may have ixor r1, r1, r2 followed by iadd r3, r3, r4, r1's x and y may be calculated using 2 instruction slots of one VLIW word, along with r3's z and w; then the next VLIW instruction word can do the rest without penalty, so there is nothing to fix.  It's a pretty cool architecture...
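
The packing described above can be mimicked with a toy model in Python.  It is only a sketch under stated assumptions: every scalar operation is taken to be independent of the others, each VLIW word is assumed to hold at most one operation per destination channel (x, y, z, w), and the pack_vliw4 name is made up.  It is not a description of what the shader compiler really does.

```python
def pack_vliw4(ops):
    """Greedily pack (register, channel) scalar ops into 4-slot VLIW words.

    ASSUMPTIONS: all ops are mutually independent, and a word can hold
    at most one op per destination channel.
    """
    words = []
    for reg, chan in ops:
        for word in words:
            if chan not in word:      # channel slot still free in this word
                word[chan] = reg
                break
        else:
            words.append({chan: reg})  # no room anywhere: start a new word
    return words

# ixor r1, r1, r2 and iadd r3, r3, r4, split into per-channel scalar ops
# and listed in the interleaved order from the post above.
ops = [("r1", "x"), ("r1", "y"), ("r3", "z"), ("r3", "w"),
       ("r3", "x"), ("r3", "y"), ("r1", "z"), ("r1", "w")]

for word in pack_vliw4(ops):
    print(word)
```

Either way the eight scalar operations fit in two VLIW words, so in this toy model mixing components of r1 and r3 in one word costs nothing compared with issuing each vector instruction whole.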

I figured most of the hardware discussion there would help you with your assembler...knowing what's under the hood always helps 🙂

For now, I'll probably be sticking to IL...that may change, but I think for now it's pretty good knowing what Lee said about the hardware...

0 Likes