I'm proudly presenting the first preview release of my GCN ISA Assembler / AMD_IL errorchecker / scripter IDE featuring syntax_highlighting and code_insight (ctrl+space) for fast assembly development.
Here you can download it and get more info -> http://realhet.wordpress.com/2012/11/14/hello-world/
Note that this is a spare-time project, full of bugs. Don't plan to do anything serious with it, and use it at your own risk only!
Hardware I was able to test it on already: HD4850, HD5770, HD6970: CAL + AMD_IL; HD7970: CAL, OpenCL + GCN_ISA. I can only hope it works on other devices too.
That is really cool! I'll definitely play around with that over the weekend!
This can finally be a response to the missing mul24_hi in OpenCL. Do you already have some examples of how much faster some problem runs in an ISA implementation compared to OpenCL (e.g. your Mandelbrot)?
I think I could make good use of the HW carry and interleaved s_mul_u32.
What would be the best approach to move from an OpenCL implementation to ISA? Take the compiled result of my kernels and start optimizing? Or is it possible to mix OpenCL code with just a few ISA-ASM functions?
I really hope that it will work on your system too, not just here.
Also, you can inline instruction encodings that my assembler doesn't know with the DD (define dword) instruction.
>Examples on ISA vs. OpenCL
No, I haven't. But I can try to port that mandel demo to OpenCL.
I think there is a way to force OpenCL to compile to that 4 instruction main loop:
v_mul_f32 v0, X, X
v_mad_f32 Y2, X, Y, halfCy mul:2 ; yn=2xy+Cy
v_mad_f32 X, Y, Y, -v0
v_add_f32 X, -X, Cx ; xn=x^2-y^2+Cx
The only extra things here are the mul:2 output modifier and the neg() input modifiers. But those exist on any hardware, so I think OCL uses them. (Unfortunately I can't try it now, I gotta wait for a few weeks until I have HD7xxx access again.)
But I'm kinda lame in OpenCL; it would not be a representative test if I tried to optimize it, haha.
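For readers who want to check the arithmetic of that 4-instruction loop, here is a minimal Python sketch that mirrors each instruction and compares it with the textbook z = z^2 + c update. The function names are made up for illustration, and how Y2 is fed back into Y for the next iteration is not shown in the post, so the sketch simply returns it.

```python
def isa_style_step(x, y, cx, half_cy):
    """One Mandelbrot iteration, written instruction by instruction.

    half_cy is Cy / 2, matching the halfCy operand in the listing.
    """
    v0 = x * x                     # v_mul_f32 v0, X, X
    y2 = (x * y + half_cy) * 2.0   # v_mad_f32 Y2, X, Y, halfCy mul:2  -> 2xy + Cy
    x = y * y - v0                 # v_mad_f32 X, Y, Y, -v0            -> y^2 - x^2
    x = -x + cx                    # v_add_f32 X, -X, Cx               -> x^2 - y^2 + Cx
    return x, y2


def reference_step(x, y, cx, cy):
    """The usual formulation: (x + iy)^2 + (cx + i*cy)."""
    return x * x - y * y + cx, 2.0 * x * y + cy


if __name__ == "__main__":
    print(isa_style_step(0.5, 0.25, -0.75, 0.125 / 2))
    print(reference_step(0.5, 0.25, -0.75, 0.125))
```

The point of pre-halving Cy is that the doubling comes for free from the output modifier, so the imaginary part costs a single multiply-add.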
Actually I've found out that you don't have to strictly interleave one by one. For example, the instruction scheduler can handle situations like 8 V instructions followed by 4 S instructions. The key is the long-term ratio between V and S instruction dwords (2-dword instructions eat more). The more S you use, the more threads it will need, and then you have to lower register usage down to 84 or even 64 to get full V ALU utilization.
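The 84 and 64 register figures line up with a simple occupancy estimate. The sketch below is only that, an estimate under assumptions not stated in the thread: 256 VGPRs per SIMD lane, registers allocated in blocks of 4, and at most 10 wavefronts per SIMD. The function name is hypothetical.

```python
VGPRS_PER_LANE = 256       # assumption: VGPR file depth per SIMD lane on GCN
VGPR_GRANULARITY = 4       # assumption: VGPRs are allocated in blocks of 4
MAX_WAVES_PER_SIMD = 10    # assumption: wavefront slot limit per SIMD


def waves_per_simd(vgprs_used):
    """Estimate how many wavefronts fit on one SIMD for a given VGPR count."""
    if vgprs_used < 1:
        raise ValueError("a kernel uses at least one VGPR")
    # Round the request up to the allocation granularity.
    blocks = (vgprs_used + VGPR_GRANULARITY - 1) // VGPR_GRANULARITY
    allocated = blocks * VGPR_GRANULARITY
    return min(MAX_WAVES_PER_SIMD, VGPRS_PER_LANE // allocated)


if __name__ == "__main__":
    for regs in (128, 85, 84, 64, 24):
        print(regs, "VGPRs ->", waves_per_simd(regs), "waves per SIMD")
```

Under these assumptions 84 is the largest count that still allows 3 wavefronts per SIMD and 64 is the largest that allows 4, which is presumably why those two numbers come up.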
>OpenCL -> ISA
Well, I really don't know... I came from OpenGL's fragment shaders a long time ago, then I found early OpenCL on the HD4850; it was kinda unoptimal in those beta times. I saw what AMD_IL code it made and I decided to write the AMD_IL code myself. Then finally GCN came out a year ago, and I decided to go one more level deeper.
>OpenCL with few ISA.
That's quite impossible. You know, even if we could mark a special part in the OCL source and then patch it later with ISA, how would we know which register is which variable and stuff? Not to mention that OpenCL unrolls a lot -> duplicating repetitive code -> and that code will be optimized globally by the AMD_IL compiler.
Although thinking in GCN ISA is fun IMO: for example, whenever you do an IF, you have to think in 64-bit SRegs and bitwise operations. It's like an x86 with 2048-bit SSE that comes with very flexible memory read/write instructions.
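To make the "IF as 64-bit mask arithmetic" idea concrete, here is a small Python model of a 64-lane wavefront. It is a conceptual sketch, not GCN code: the helper names are invented, and the two masks stand in for what the hardware keeps in scalar registers (a compare result and the execution mask).

```python
WAVE_SIZE = 64
FULL_MASK = (1 << WAVE_SIZE) - 1


def compare_mask(values, predicate):
    """Return a 64-bit mask with bit i set where predicate(values[i]) is true."""
    mask = 0
    for lane, value in enumerate(values):
        if predicate(value):
            mask |= 1 << lane
    return mask


def masked_apply(values, exec_mask, fn):
    """Apply fn only in lanes whose bit is set in exec_mask."""
    return [fn(v) if (exec_mask >> lane) & 1 else v
            for lane, v in enumerate(values)]


def if_else(values, predicate, then_fn, else_fn, exec_mask=FULL_MASK):
    """Run both branches, each under its own lane mask."""
    cond = compare_mask(values, predicate)       # per-lane condition bits
    then_mask = exec_mask & cond                 # lanes taking the IF side
    else_mask = exec_mask & ~cond & FULL_MASK    # lanes taking the ELSE side
    values = masked_apply(values, then_mask, then_fn)
    return masked_apply(values, else_mask, else_fn)


if __name__ == "__main__":
    lanes = list(range(WAVE_SIZE))
    out = if_else(lanes, lambda v: v % 2 == 0, lambda v: v + 100, lambda v: -v)
    print(out[:6])
```

Both branches are executed by the whole wavefront; the mask only decides which lanes actually commit results, which is why a branch is cheap only when all 64 lanes agree.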
I thought you were probably working on something like this but I was really surprised to see the integrated development environment, which looks very nice.
I've been developing a somewhat different set of tools for working with GCN. It uses an ANSI C compiler ported to GCN and a separate full assembler, so that the entire GCN/hardware environment can be exposed at the C level. Something OpenCL is unlikely ever to do. Users work from an ordinary source file with sections for OpenCL C, ANSI C, GCN, and AMDIL. Later, I'll try to post some examples somewhere.
I'm not sure I'm as brave as you to open it for public use, but maybe I'll try. Like you said, there will always be a lot of bugs first time out, but I think most people around here understand that. Congrats again and good luck.
BTW I tried to compile one of your examples, but it was unable to open a temporary file (precompile_out or something like that) in the C root dir. Do I need to do anything special? Could it be a Windows permission/path thing?
Thx for positive vibes!
Sorry, I forgot to say that it writes some temp files in the C:\ dir (I have UAC disabled cause I still use XP mainly lol). Also, if you feed it an OpenCL source (with NewKernel()), it will redirect its temp files into C:\ .
It's really cool that you also made a solution where you can use all the languages (host and GPU) from a single file. I'm looking forward to seeing your examples.
And the reason I've published this is that I've reached the end of a job, and I have no more fear that someone will beat me at GCN asm in that particular field. My actual project is my own realtime video decoding/processing/VJing and stuff, so everyone feel free to beat me in that; I don't care and I'm up for the challenge.
I've tested HetPas a bit. On my machine with an HD5770 it works well (of course except the GCN part 🙂 ).
On my box with an HD7850, it crashes during startup. As hetpas is stripped code, I could not find out anything meaningful about why it dies (OK, it dies in a push, so obviously the stack pointer is bad, but I could not see when that happened).
I'm afraid it can be because I have to access this machine with remote desktop. Maybe I'll try teamviewer or something ... It'll be a while until I can get to the console. Normally, OpenCL applications have no problem with remote desktop on AMD (unlike nvidia).
If you could create a version of hetpas that is not optimized & stripped that much, then maybe I could find out what's going wrong. Or if you still happen to have the debug symbols of the build on your website ...
Hi and thx for checking it!
As you requested, I've tried to attach debug info and a stack tracer, but oddly Delphi's linker threw an internal error. Maybe I'm using generics too excessively or something. (I never included debug info before because I usually tested it inside the IDE and that needs only dcu files.) The problematic part can be somewhere in 50k lines.
So the best I can do is to put a detailed function/line map file next to the exe file, so at least I can investigate the problem's location from the exception's address. (Please redownload the zip from the website before you try it again.)
Remote desktop: I've tried it with VNC Viewer only. But it's really weird that it throws an error with the most official remote assistance software. Note that when the IDE starts it does nothing GPU related at all, only static linking of CAL's and OpenCL's DLLs.
HD7850: That's a big question mark for me, because the only hardware I was able to test was Tahiti, so there's a chance that my current attempt to inject ISA into OpenCL's ELF will fail on the smaller GCN chips. In a few weeks I'll have access to an HD7750, and I'll check. Hopefully the only difference will be the chip target IDs, because the chips only differ in the number of CUs and double-precision units per CU (as far as I know).
Hmm, not sure if that is correct, as I did some address magic in order to get from the runtime addresses to the addresses in the map file.
Do you have something called SelfTest, which can be related to a line number 2844 (het.Objects.pas)? Is there anything that could cause an access violation?
In the machine code above the exception, I see a "call 0058D8E4", which I translated to het.Objects.TSelfTest.SetName.
Does that help somehow? BTW, the problem does not depend on RDP; it happens the same way when running the UI directly.
I then followed the program by single-stepping. The call chain of the abort is as follows:
In InitUnits there is a loop that appears to initialize static objects. The initialization of object 0x68 throws the exception. The call address was 0x006401D0 (translated: 23F1D0), which is not in your map.
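The "address magic" can be reproduced from the one pair of numbers given above: 0x006401D0 minus 0x23F1D0 is 0x401000. A plausible reading, and it is only an assumption, is that the map file lists offsets from the start of the code section, which sits at RVA 0x1000 in an image loaded at the default base 0x400000. The helper names below are made up.

```python
IMAGE_BASE = 0x400000       # assumption: default Win32 EXE load address
CODE_SECTION_RVA = 0x1000   # assumption: code section starts 0x1000 into the image


def runtime_to_map(address):
    """Convert a debugger address to an offset as listed in the map file."""
    return address - IMAGE_BASE - CODE_SECTION_RVA


def map_to_runtime(offset):
    """Inverse of runtime_to_map, for setting breakpoints from map entries."""
    return offset + IMAGE_BASE + CODE_SECTION_RVA


if __name__ == "__main__":
    # The pair quoted in the post above.
    print(hex(runtime_to_map(0x006401D0)))
```

If the executable were relocated to a different base, the first constant would have to be replaced by the actual load address reported by the debugger.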
So it appears something very basic is missing on this machine. Do you require any frameworks/tools/addons/engines/whatever?
Thx again for testing!
Finally I've tracked down that debug info problem. (There was a WindowPlacement property I saved to the ini file. I did it with a class_helper, and when I made it published in the mainform, the linker dropped that internal error. Sad that class_helpers can't contribute to the Runtime Type Information, on which my script lang is pretty much based.)
I've uploaded the new exe, and it became 7MB bigger, so there is working debuginfo in it.
Also made a change: when the selftest fails, it will ask you if you want to continue anyway (bad choice), or just check the exception information and exit.
This error you've discovered on your machine is very weird. It's in the heart of the system, so if that test fails, then everything else could fail too (like the cl/device/kernel/buffer object hierarchy). This is my own OOP framework, which does automatic object lifetime management and also automatically broadcasts notifications of object/property changes.
"Do you require any frameworks/tools/addons/engines/whatever?"
Not at all, it only needs an XP environment and the cal,cl dll's from the Catalyst driver.
I can only think that one of these uncommon things is blocked on your system:
- It sometimes writes some temp files into the 'C:\' (for example the source file after macro precompilation)
- It uses WriteProcessMemory to be able to notify about property changes. (Replaces empty property.setter functions with custom code) Also there are some Variant related patches like case insensitive = operator for strings. <- Maybe your system hates self-modifying code.
The TSelfTest.SetName() function is an example of this:
In the source code it is just an empty function:
procedure TSelfTest.SetName(const Value: ansistring);begin end;
And in the executable it is patched automatically to become this:
procedure TSelfTest.SetName(const Value: ansistring);
if FName<>Value then begin
So after patch, in your debugger you can see a jmp instruction instead of an empty function.
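As a language-neutral illustration of what such a notifying setter does, here is a small Python analogue. It is not the HetPas implementation: apart from the "only act when the value differs" check, every name and detail here is an assumption, and Python needs no code patching to do this.

```python
class Observable:
    """Toy object whose 'name' setter reports changes to subscribers."""

    def __init__(self):
        self._name = ""
        self._listeners = []

    def subscribe(self, callback):
        # callback(property_name, old_value, new_value); hypothetical signature
        self._listeners.append(callback)

    @property
    def name(self):
        return self._name

    @name.setter
    def name(self, value):
        # Same guard as in the patched setter: do nothing if unchanged.
        if self._name != value:
            old = self._name
            self._name = value
            for callback in self._listeners:
                callback("name", old, value)


if __name__ == "__main__":
    obj = Observable()
    obj.subscribe(lambda prop, old, new: print(prop, repr(old), "->", repr(new)))
    obj.name = "test"
    obj.name = "test"   # no second notification
```

The guard matters: without it, every assignment would trigger listeners even when nothing changed.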
Can you pls specify the system you tried to run it on?
(All I know is that it runs on: Intel Core2 + WinXP-32, AMD Athlon2 + Win7-64, Intel Core(1) + Win7-64. I'll ask more friends to try and hopefully we can reproduce the error.)
Self-modifying code, evil-evil! I did not know that there is still any OS out there allowing for that, but as you have a list of platforms where it works, there may be options to configure that.
With the new binary, I get the SelfTest failed popup. Continuing brings up the UI, with an empty, grey left side. Anyway, I can load any of the examples into it. Compiling adds "Compiling OK (0.001 sec)" to the status line. Trying to run the code locks up HetPas and, strangely enough, also all other GPU computing applications (I was running a few trial-factoring programs).
My System is a Xeon X5650 (hex-core) 2.66GHz, hyperthr. enabled, 6GB, HD7850, W7SP1-64, UAC disabled, DEP enabled.
OK, as DEP (Data Execution Prevention) almost sounds like prohibiting self-modifying code, I disabled it, rebooted, and voila:
elapsed:0.0940783619880676 for GCN_OpenCL_mandel
elapsed: 5.347 ms
Cycles (including latency): 600 for GCN_OpenCL_latency_test
elapsed:0.000502757262438536 for GCN_OpenCL_Fibonacci_recursive
Really cool! Now I have something to play with ... and you can document that HetPas does not work with DEP enabled.
I'll try to find some more time soon to test my own kernels. Is HetPas creating binary kernel files that can be used by OpenCL's clCreateProgramWithBinary to load it into "normal" OpenCL programs? My ideal workflow (given that AMD does not want to support GCN-ASM) would be to write/use my normal OpenCL kernels, let it compile, try to optimize the resulting ASM, and finally use the optimized binary kernel ... some day.