
foomanchoo · ‎01-06-2013

when is the gcn assembler coming?

opencl is driving me nuts again.

it thinks that precalculating 20 offsets into LDS in the outer loop is more important than staying below 128 registers.

and i didn't even enable optimization.

please can i at least disable all "optimizations"?

cguenther · ‎01-06-2013

As for whether the precalculation should be more important, I think you should check your kernel with the AMD CodeXL profiler and look at the diagrams that show which resource limits the number of active threads.

I support your point that the CL compiler performs various optimizations that cannot be influenced, which leads to unpredictable compiler output. So I would really suggest some mechanisms such as register reuse when variables are enclosed inside { }, for example.

I don't know why, but enclosing code in these braces gives very unpredictable usage of the SGPRs and VGPRs. So I can't find a way to control their usage with standard OpenCL without going to assembler.

heman · ‎01-06-2013

Hi foomanchoo,

It would be nice if you could attach a small example in which this unsuitable optimization happens. Please also provide details about the driver, APP SDK, CPU and GPU you are using.

Anyway, have you tried using "-cl-opt-disable" in the build options for the kernel? This might help in disabling the optimizations.

jacksonfurrier · ‎03-05-2013

Yeah, I agree. It would be really nice to have a PTX-like programming feature in OpenCL with AMD cards.
Is there any chance that AMD will do it?

himanshu_gautam · ‎03-05-2013

Hi,

Can you please give some key points to support your request? I can forward them to the relevant people, but I myself have no idea about PTX.

realhet · ‎03-06-2013

Hi,

This is another thread that is about inline assembly in OpenCL.

I'd also like to have this feature. Then I would only have to write a good inner loop in asm, and everything else could stay in easily maintainable OpenCL.

And the inlined asm wouldn't be AMD_IL, but rather VLIW or GCN asm, because only there you have full control over register usage and over pretty much everything (On GCN: s_registers, or calling subroutines in order to stay inside the Instruction Cache).

But thinking about it, how hard would it be to implement inline asm that goes through opencl -> llvmir -> amd_il and finally reaches the desired low level? All of these levels perform optimizations, and each would have to base its optimization decisions on the register usage of the inline asm sections...

jacksonfurrier · ‎03-06-2013

Yes, I'd vote for GCN ASM.
I know it would be really hard to implement, but I think the guys at AMD could solve it.
At my university they did the same with the old IBM Cell CPU: they wrote the "average" things in C and the other "algorithm" parts in Cell ASM, and now they have lots of prime records in their pocket.

realhet · ‎03-08-2013

Of course "the guys at AMD could solve it", but is that effort worth it for AMD? I doubt it...

BTW: I have a weird plan: Not inline asm, but inline high level sections inside the gcn asm.

With my script lang I can already 'unroll' and optimize arithmetic operations (constant calculation elimination that handles commutativity). But I still have to make it produce GCN V code, and that's not as straightforward as it is with SSE.

I'm afraid of things like MAD with clamp/negate double/quadruple/halve modifiers.

In my last project I used NASM-like macros to improve GCN asm. As soon as my asm code reached 200-300 lines I realized that I couldn't handle registers manually (I used aliases mapped to physical regs, but there's a good chance of making a mistake and reusing a register that is already in use). So I made enter/leave blocks with temporary register tracking and allocation inside the block.
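The enter/leave idea can be sketched in a few lines of Python. This is only an illustration of scoped register tracking, not the actual macro implementation described above; the `RegisterPool` class and its method names are made up for the example, and it assumes a flat pool of VGPRs handed out lowest-numbered first.

```python
from contextlib import contextmanager


class RegisterPool:
    """Hypothetical tracker for temporary VGPR aliases (illustrative only)."""

    def __init__(self, count):
        self.free = list(range(count))  # free physical registers, lowest first
        self.aliases = {}               # alias name -> physical register number

    def alloc(self, alias):
        # Refuse to reuse an alias that is still live; this is the mistake
        # that is easy to make when registers are assigned by hand.
        if alias in self.aliases:
            raise ValueError(f"alias {alias!r} already in use")
        if not self.free:
            raise RuntimeError("out of registers")
        reg = self.free.pop(0)
        self.aliases[alias] = reg
        return f"v{reg}"

    def release(self, alias):
        self.free.append(self.aliases.pop(alias))
        self.free.sort()

    @contextmanager
    def block(self):
        # "enter": remember which aliases were live before the block.
        before = set(self.aliases)
        try:
            yield self
        finally:
            # "leave": free every alias that was allocated inside the block.
            for alias in set(self.aliases) - before:
                self.release(alias)
```

Anything allocated inside `with pool.block():` is handed back when the block ends, so a later allocation can reuse the same physical register without the programmer tracking it.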

From there this high level function thing can be the next step but it's kinda complicated.

function add(a,b:int):int;begin result:=a+b; end;

This can be easily translated to v_add_i32 result, vcc, a, b

But what if:

- b is scalar or constant -> The compiler has to exchange the operands (and know whether it can exchange them or not)

- both a, b are scalar or constant -> The compiler has to insert a v_mov_b32 to satisfy VOP2's operand requirements.

And there are so many things like this that I don't even dare to think about them.
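A rough Python sketch of those two cases, assuming the VOP2 rule that the second source operand must be a VGPR while the first may be a VGPR, SGPR or constant. The function names, the `temp` register argument and the string-based operand check are all hypothetical, and this is not how any existing assembler does it.

```python
def is_vgpr(operand):
    """Treat names like 'v3' as vector registers; anything else is scalar/constant."""
    return operand.startswith("v") and operand[1:].isdigit()


def legalize_vop2(mnemonic, outputs, a, b, temp, commutative=True):
    """Return assembly lines computing `mnemonic outputs, a, b` with a legal src1.

    outputs     -- tuple of destination operands, e.g. ("v2", "vcc")
    temp        -- a free VGPR the caller is willing to sacrifice (assumption)
    commutative -- whether a and b may be exchanged
    """
    lines = []
    if not is_vgpr(b):
        if commutative and is_vgpr(a):
            # Case 1: only b is scalar/constant, so swap it into src0.
            a, b = b, a
        else:
            # Case 2: both are scalar/constant (or swapping is not allowed),
            # so copy b into a VGPR first.
            lines.append(f"v_mov_b32 {temp}, {b}")
            b = temp
    lines.append(f"{mnemonic} " + ", ".join(tuple(outputs) + (a, b)))
    return lines
```

For example, `legalize_vop2("v_add_i32", ("v2", "vcc"), "v0", "s4", temp="v9")` swaps the operands, while passing `"s4"` and `"s5"` emits a `v_mov_b32` first. For non-commutative operations the sketch always falls back to the extra move; a real assembler could instead pick a reversed-operand opcode where one exists.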

foomanchoo · ‎03-09-2013

wait - hetpas produces elf files - they should be executable on linux.

well that is all i need.

is there a chance to get the source code for the assembler so that i can port it to linux? (excluding the pascal part and the IDE)

realhet · ‎03-09-2013

A year ago I used it to develop cal.elf kernels on Win, and then executed them on Linux. It will probably work for ocl.elf too.

I don't want to put up the whole thing, and the assembler is rather badly entangled with the script engine. So I'm attaching only the relevant parts of the assembler; I guess you can still dig something useful out of it.

Note that a lot of stuff is missing: for example int64, float64, and images.

jacksonfurrier · ‎03-14-2013

If it is worth it for Nvidia, I think it would be worth it for AMD too. Inline ASM would be much better than high-level code inside ASM.

sayantandatta · ‎05-15-2013

Hello realhet,

I'm trying to manually modify the GCN binaries for the attached ISA. The instructions in the .isa file take up only 4168 bytes, but the net code length is 60668 bytes. I have checked that the 2nd .text section of the inner ELF (the 1st .text section is IL) is indeed 60668 bytes. I can identify the instructions in the binary by their bytecode, but what do the remaining (60668-4168) bytes account for? I'm guessing the rest is reserved for kernel setup, memory addresses, etc. So I might be able to manually replace or add new instructions in the proper place (assuming I have done the necessary editing of the ELF header), but how do I manually modify the rest of the section? Is there anything that explains what exactly is happening in the remaining bytes?

Regards,

Sayantan

realhet · ‎05-15-2013

Hi,

The disasm you attached is broken.

It seems like a bug. The disassembler seems to cut the textfile at 64KB... (It's just a rough guess of mine)

Here's another thread about this -> http://devgurus.amd.com/thread/159462

Disassembly in .isa intermediate file from clBuildProgram cut short

Unfortunately there was no solution in that thread.

Every GCN program must end with s_endpgm, but your example disasm seems to be cut off after 65536 bytes.
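A quick way to spot this kind of truncation is to look for the terminating instruction in the dump. The helper below is a minimal sketch; the name `has_endpgm` is made up, and it assumes the .isa dump is plain text with one instruction per line and `//` starting a trailing comment.

```python
def has_endpgm(isa_text):
    """Return True if any instruction line in the dump is s_endpgm."""
    for line in isa_text.splitlines():
        # Drop a trailing "// address: encoding" comment, then tokenize.
        tokens = line.split("//", 1)[0].split()
        if tokens and tokens[0] == "s_endpgm":
            return True
    return False
```

If this returns False for a dump that is suspiciously close to 64KB, the disassembly was most likely cut short rather than the kernel being malformed.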

Let's hope Himanshu can send this to the right people.

But there is a bigger, newly born problem: -save-temps is broken in 13.4 with binary kernels. That means you can patch whatever you want in your .elf, but you won't get any feedback on it from the world's most accurate disassembler.

And another thing: if you want to roll back to an earlier driver version, the OpenCL part will be broken. So I ended up staying on 13.4 and not using the disassembler at all.

sayantandatta · ‎05-15-2013

Oh, thanks for clarifying. I thought it was due to some other code generated by the compiler. I'll try CodeXL for disassembly.

Regards,

Sayantan

kd2 · ‎05-15-2013

I believe the ELF format has changed slightly over the course of new Catalyst driver releases (which is probably why AMD doesn't release an asm compiler for the instruction code, reserving the right to update it as needed... even though hacking together my own compiler to bypass clCreateProgramWithSource() didn't take long). I'll call the inner ELF file the ATI CAL ELF. The current drivers produce three programs within the ATI CAL ELF. See Section B2 of http://developer.amd.com/wordpress/media/2012/10/AMD_CAL_Programming_Guide_v2.0.pdf. The first program is just the 20-byte CALEncodingDictionaryEntry. The second program is the Note segment (see section B.3.1). It's a bunch of CALNoteHeader entries followed by 32-bit ints. Some of those ints are what you find at the bottom of your *.isa file. The third program is the compiled asm code, as you see at the far right of your *.isa file. It's actually fairly compact. As for the outer ELF, you just need the basic sections: .shstrtab (strings of the section headers), .strtab (required strings of the symbols: compile options, and the kernel name with "metadata", "kernel" and "header"), .symtab, .rodata (required, can't be omitted), and .text (the ATI CAL ELF). I delete the .llvmir and .comment sections.
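To see which of those sections a given binary actually contains, a small section lister is enough. The sketch below uses only Python's `struct` module and assumes a 32-bit little-endian ELF, which may not hold for every driver or platform; `list_sections` is a hypothetical helper name, not part of any AMD tool.

```python
import struct

# Standard ELF32 layouts: 52-byte file header, 40-byte section header.
ELF32_HEADER = struct.Struct("<16sHHIIIIIHHHHHH")
ELF32_SECTION = struct.Struct("<10I")


def list_sections(data):
    """Return [(name, file_offset, size), ...] for every section in `data`."""
    fields = ELF32_HEADER.unpack_from(data, 0)
    ident, shoff = fields[0], fields[6]
    shentsize, shnum, shstrndx = fields[11], fields[12], fields[13]
    # ident[4] == 1 means 32-bit, ident[5] == 1 means little-endian.
    if ident[:4] != b"\x7fELF" or ident[4] != 1 or ident[5] != 1:
        raise ValueError("not a 32-bit little-endian ELF (assumption of this sketch)")

    headers = [ELF32_SECTION.unpack_from(data, shoff + i * shentsize)
               for i in range(shnum)]

    # Section names live in the section-header string table (.shstrtab).
    str_offset, str_size = headers[shstrndx][4], headers[shstrndx][5]
    strtab = data[str_offset:str_offset + str_size]

    sections = []
    for header in headers:
        name_start = header[0]
        name_end = strtab.index(b"\0", name_start)
        name = strtab[name_start:name_end].decode("ascii")
        sections.append((name, header[4], header[5]))  # sh_offset, sh_size
    return sections
```

Slicing the outer file at the offset and size reported for .text gives the bytes of the inner ELF, which can be passed to the same function if it is also a 32-bit little-endian ELF.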

realhet · ‎05-15-2013

They have shortened the internal cal.elf file by taking out the binary CAL program from it.

The new cal elf contains only 6 sections not 10.

A friend also told me that with the new driver he failed to run old precompiled binary kernels. Maybe the new driver fails to load a cal.elf with 10 sections.

One more difference is that in the ISA Note Section they replaced the constbuffer allocation sizes with 0xBAADF00D constants, haha.

I think those are all the changes.

"even though hacking together my own compiler to bypass clCreateProgramWithSource()"

Did you look inside clCreateProgramWithSource() and learn from it how to upload a binary?

It seems fun, but it could change from version to version for sure... At least the elf is more static.

The most elegant way would be if the driver officially let us upload whatever binary we want. It doesn't have to be compatible with the OpenCL standard at all, and the kernel parameter passing doesn't have to be documented, as we can learn that from disassembled OpenCL programs. Just name a DLL function for this purpose.

But maybe they're afraid that everyone would program the card in binary and it could kill OpenCL or something? That's nonsense ;-D

kd2 · ‎05-15-2013

I haven't taken a look at your work yet, but I'm sure I don't do much different than you. I still feed the elf into clCreateProgramWithBinary(). I'm just interested in writing the kernel's code and not about to get into everything else opencl does (the command processing -- threads scheduling, memory buffers, queues, etc). That said, I never explored what, if anything, clCreateProgramWithBinary() does with my elf. I'm assuming not much since each minor change I do to the asm code produces the timing results that I expect from those changes (even inserting obviously wasteful register computations).

My take on their motivation is that it always comes back to backward compatibility. Backward compatibility of x86 assembly has been a drag on the x86 industry's progress over the years and the gpu manufacturers would rather not have the same constraint. But there is a real need to allow developers to tweak the kernel's assembly.

As far as "upload a binary", I just have a few kernels that I execute which are fairly similar in structure (roughly same order of v and s registers used, etc). It's not 100% complete yet, but I put them as subroutines in my asm code (subroutine is just done by using a register to store the program counter's return, branching to the beginning of that routine and at the end, branching back to the location in that stored register). The main loop in my asm code sleeps (s_sleep) and reads a byte update as to which subroutine to execute or to exit (which I feed in from a different memory buffer on a different queue -- be sure to set glc/slc in the buffer read as appropriate to bypass the cache).

realhet · ‎05-17-2013

"The main loop in my asm code sleeps (s_sleep) and reads a byte update as to which subroutine to execute or to exit"

Long kernels (that utilize the hardware to the max) with real-time interactions.

I'm also thinking about this kind of non-standard GPU programming: in my project, I'm going to need a global synchronization point between ALL the threads 400000 times every second. s_sleep + GDS atomics will be the key to this. Not sure if it will work, but I'll try.

There will be problematic cases when the OpenCL runtime decides not to schedule the kernel on all of the CUs. The kernel has to start with a function that detects all the active CUs and distributes the work cleverly.

While using many s_sleeps, did you notice reduced power consumption?