Archives Discussions

corry
Adept III

more mysterious uav11?

Still waiting to hear why I'm not getting bursting from my other thread, but I went over my code probably 50 times (no lie) and can't see anything wrong. When I use uav11 for input, i.e. I map its memory, write data, unmap, etc., then have my IL program write back what it read, it's all 0s.  Not using caching, so that's not the issue... Everything worked with uav8, so what gives?   Is this more mysterious uav11 behavior?
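To make the shape of the failure concrete, the kernel side boils down to something like this IL. This is reconstructed from the description above, not the actual code, so the declarations, register names, and exact syntax may need adjusting:

```
il_cs_2_0
dcl_raw_uav_id(11)                    ; host-written input buffer
dcl_raw_uav_id(1)                     ; output buffer dumped back to the host
dcl_literal l0, 2, 0, 0, 0
ishl r0.x, vAbsTidFlat.x, l0.x        ; byte address = flat thread id * 4
uav_raw_load_id(11) r1, r0.x          ; reads back 0 even though the host wrote data
uav_raw_store_id(1) mem.x, r0.x, r1.x ; echo it so the host can inspect the dump
end
```

Swapping the load's `id(11)` for `id(10)` or below is, per the posts below, enough to make the same pattern return the host-written data.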

0 Likes
20 Replies
corry
Adept III

I reverted my code back several revisions, and it was still broken... then I realized I installed 11.10 preview 3 today... I think that is where the problem lies...

So what did you change in 11.10 with uav11???

0 Likes

I take that back... reverted and rebooted, and found a slight oversight that would have made the old code on uav8 work... uav11 still seems as if it's writable by the GPU program only... at least my program doesn't see anything I put there. It just reads it as 0 every time...

 

0 Likes

Ok... I've extensively tested this now... a simple change from uav11 to uav10 gives correct results... really, mapping to *anything* other than uav11 works fine.  I can *write* to uav11 *in my kernel* and get correct results, but I *cannot* write to uav11 from the host program, send it up to the GPU, have the GPU read uav11, and get correct results.  A dump of what it read from uav11 is always 0.

What gives?

 

0 Likes

As far as I know, uav11 is cached by default, and cached reads in CAL return 0 (some driver versions return random values).

I think that OpenCL uses some "special" magic to map memory to uav11 that isn't available to us CAL users, and that makes uav11 work properly.

You can read more ***** from AMD about it in this thread

0 Likes

Originally posted by: hazeman As far as I know, uav11 is cached by default, and cached reads in CAL return 0 (some driver versions return random values).

I think that OpenCL uses some "special" magic to map memory to uav11 that isn't available to us CAL users, and that makes uav11 work properly.

You can read more ***** from AMD about it in this thread

Interesting... I had read that thread before, but forgot about it... and yes, on 11.9 and 11.10 preview 3 this is still broken...

 

Maybe it's time to disassemble an OpenCL exe and see what undocumented calls they are making to tell the thing to flush the cache...

 

0 Likes

corry,
There are lots of restrictions on UAV caching that are undocumented and will remain so, as CAL is deprecated. OpenCL-generated IL complies with these restrictions. Your best bet is to generate IL in the OpenCL binary format and use the OpenCL API to execute your kernel. Doing it any other way is going to be very frustrating, as the CAL API has not been updated in over two years (SDK 1.4). OpenCL takes advantage of new hardware features; the public CAL API does not.
0 Likes

Remember I said I was starting to get frustrated with marketing-type answers?  That's what this is... you said nothing, with a whole lot of words... why can't you just say what these mysterious restrictions are...

I'm seriously starting to regret dissenting and pulling people over to using AMD GPUs now... I'm sorry if you take this personally as a compiler author, but the OpenCL compiler produces slow code.  It's simply not an option.  CUDA on NVIDIA cards trounces OpenCL on AMD cards, especially with inline PTX... Patching binaries might be fine for hobbyists and people doing personal programming projects, but it is 100% unacceptable in the real world (with very, very few exceptions, under incredibly strict guidelines, and this doesn't match them).  I happen to work as a professional, so patching doesn't fly.

See the private topic I invited you and Lee to, as so far you two have been the most helpful, but yeah, I'm starting to get really frustrated that you'd rather have the community reverse engineer your stuff than just give us the info...

0 Likes

Can you please provide an example and exact benchmark figures proving your claim? Or is it just that OpenCL is slower because it's not "low level"?

0 Likes

Wow....talk about flame bait....

Yes, of course I did my research; I don't make statements like that without having done the research.

I work for a large employer, so they own my code and consider it proprietary, so no, it cannot be posted.

0 Likes

Sure, but you are complaining about OpenCL generating slow code, yet at the same time you claim hand-written CAL/IL is faster. It's not flame bait; I just can't get the idea. I also don't think your employer is making a good choice betting on a technology that is being deprecated, but of course that's just my opinion. Even if you are going after CAL/IL, from my (limited) experience it's far better to code it in OpenCL, dump the IL, remove the ***** and make your modifications rather than reinventing the wheel. And the result is usually much better performing code.

0 Likes

Originally posted by: gat3way Sure, but you are complaining about OpenCL generating slow code, yet at the same time you claim hand-written CAL/IL is faster. It's not flame bait; I just can't get the idea. I also don't think your employer is making a good choice betting on a technology that is being deprecated, but of course that's just my opinion. Even if you are going after CAL/IL, from my (limited) experience it's far better to code it in OpenCL, dump the IL, remove the ***** and make your modifications rather than reinventing the wheel. And the result is usually much better performing code.


Ok, clearly you are lacking context.  You were in the BFI_INT discussions, so I take it this is what you think I mean.  That would be patching the ISA binary in memory, *NOT* patching the OpenCL "binary".  OpenCL compiles to IL, in case you hadn't heard, and the text of the IL is contained in the OpenCL "binary", thus the quotes.  Patching the OpenCL binary refers to replacing the existing .textil section with my own IL code, as has been half-heartedly recommended here a few times.  Get your facts straight.  Further, you admit to not writing OpenCL, but then claim it's faster.  Let me guess, you also think Java is just as fast as C++... do some independent testing and research, and don't believe everything you read on the internet.  This is why we did in fact test.

Last, you made the point of "betting on deprecated tech".  In the real world, procurement takes time.  At the time of order, CAL had *NOT* been deprecated, and we had no reason to believe it would be.  IMO, the sudden deprecation was a lot like NASA retiring the space shuttle: something with continued utility, old, clunky, in dire need of a replacement, sure, but axed before the replacement was available.  Always a bad idea.

Clear things up a bit?

0 Likes

Binary patching is not what I mean. What I mean is that getting your IL code out of the OpenCL kernel build process is a matter of setting an environment variable. You can apply your modifications, then load the result using your CAL host code. Another approach would be replacing the .textil section, yes.

I also never admitted I don't write OpenCL. I am doing exactly that. I had my experiments with CAL some time ago and decided that OpenCL suits my needs better. That was long before AMD decided to deprecate it. It's easier, and with some manual assistance to the compiler, it usually generates better code than handwritten IL. Besides that, with each new Catalyst version something breaks in CAL, and it's a ***** mess to keep up with that; you cannot constantly recommend your users downgrade drivers. And this is getting worse, as what AMD means by "deprecation" is that they are not fixing problems in CAL anymore. You always have to look out for (usually slower) workarounds, and that hurts.

I can understand your C++ and Java analogies; however, that is not the case here.


0 Likes

Originally posted by: gat3way

...

I also never admitted I don't write OpenCL. I am doing exactly that. I had my experiments with CAL some time ago and decided that OpenCL suits my needs better. That was long before AMD decided to deprecate it. It's easier, and with some manual assistance to the compiler, it usually generates better code than handwritten IL.



Really? That's like saying a C or C++ compiler generates better code than hand-written assembly. It might only be true if you are really bad at assembly.

Besides that, with each new Catalyst version something breaks in CAL, and it's a ***** mess to keep up with that; you cannot constantly recommend your users downgrade drivers. And this is getting worse, as what AMD means by "deprecation" is that they are not fixing problems in CAL anymore. You always have to look out for (usually slower) workarounds, and that hurts.


Sorry, but it looks like you don't know at all how AMD OpenCL and CAL work. You should know that 99% of the problems are with the IL compiler. And maybe it will be a shock for you, but OpenCL uses the same IL compiler as CAL. So when some Catalyst version breaks CAL, it also breaks OpenCL... The problem is that IL coders quite often try to use the more advanced features available in IL just to squeeze all the performance out of the card (like the BFI). And OpenCL, due to bad optimization, doesn't yet use those features.

And even if it's not obvious, OpenCL is written on top of CAL. The first versions of OpenCL used exactly the same API that is publicly available. Now they use a new hidden API.

I can understand your C++ and Java analogies, however this is not the case here. 


You are badly mistaken here. One of the more important examples is matrix multiplication. You can find a really long thread about it on this forum. The best OpenCL version was 30-50% slower than the available CAL/IL version.

You can find more examples in the CAL++ library. And I have a few kernels that are 2-3 times faster than what is possible in OpenCL (tricks with memory access, better code possible directly in IL, etc.).

0 Likes

 

Really? That's like saying a C or C++ compiler generates better code than hand-written assembly. It might only be true if you are really bad at assembly.


 

You would be surprised how often the code generated by gcc beats hand-written assembly, even at -O1. Even if it was written by someone who is experienced in assembly.

Sorry, but it looks like you don't know at all how AMD OpenCL and CAL work. You should know that 99% of the problems are with the IL compiler. So when some Catalyst version breaks CAL, it also breaks OpenCL... The problem is that IL coders quite often try to use the more advanced features available in IL just to squeeze all the performance out of the card (like the BFI). And OpenCL, due to bad optimization, doesn't yet use those features.


Did you succeed in getting your BFI using CAL and hand-written IL code? Also, how does this correlate with your previous statement:


As far as I know, uav11 is cached by default, and cached reads in CAL return 0 (some driver versions return random values).

I think that OpenCL uses some "special" magic to map memory to uav11 that isn't available to us CAL users, and that makes uav11 work properly.


Yeah.


You are badly mistaken here. One of the more important examples is matrix multiplication. You can find a really long thread about it on this forum. The best OpenCL version was 30-50% slower than the available CAL/IL version.



Just 30-50% faster than what the OpenCL compiler generated two years ago, on a 4xxx device with no image support and LDS emulated in global memory? Big win.

 

Overall, I do not disagree that using hand-written IL may allow you to achieve faster results; on many if not most occasions, though, this does not happen. Instead, people waste time trying to work around poorly documented problems, battling some mysterious bug related to host-device transfers that suddenly emerged with the latest Catalyst release, and often implementing things the suboptimal way. The end result is a kernel that performs worse than some simple OpenCL kernel you throw together in 15 minutes. Another thing is that while we OpenCL users might wait for months until some annoying problem in the runtime or the compiler gets fixed, you will likely never get your CAL issues resolved, because you are basically using a deprecated product.

 

0 Likes

One thing I have learned is that there is no discussing anything with fanboys. They are always right, and the thing they are fanboying over is always the best.

0 Likes

Originally posted by: gat3way  

 

You would be surprised how often the code generated by gcc beats hand-written assembly, even at -O1. Even if it was written by someone who is experienced in assembly.



Please stop inventing fiction to support your theory.

 

Just 30-50% faster than what the OpenCL compiler generated two years ago, on a 4xxx device with no image support and LDS emulated in global memory? Big win.


Again, please stop inventing fiction. If you don't have the decency to check the facts, then don't answer at all.

Your whole post is typical of a fanboy. I'm not feeding the troll anymore.

 

0 Likes

Check what facts? Facts like "there was a thread somewhere...", "I have faster kernels..." or "only very poor assembly developers write code that is slower than that generated by gcc"? Those are not facts, those are claims. If you cannot make the distinction between the two, you have no right to blame anyone for being a fanboy.

 

0 Likes

Originally posted by: hazeman Your whole post is typical of a fanboy. I'm not feeding the troll anymore.


Yeah, that's why I said flame bait...

Unfortunately, this sort of nonsense, requoted everywhere, means that in both personal collaborative programming projects and sometimes even at work (though I tend to work at places with other intelligent people, so I don't have this problem much), I end up having to clean up the messes of these individuals.  It's sad how modern programmers are always looking for the easy way out, and thus look for one idiot spewing "GCC is better than hand-optimized ASM!", then hop on board with them without doing any research (work).

If you suck at asm, or if you optimize for the wrong architecture, then yeah, the asm will be slow.  A huge example of this was around the time of the P4.  P6/K6-optimized assembly code was much slower on NetBurst, because pipelines were everything: code written for shallow pipes often stalled NetBurst's deep pipe, resulting in slower-than-original execution time.  Then authors went back through and optimized for deep pipes, and the asm was again faster.  (I'm not so sure what the Athlons were doing at the time... I never followed them as much, because NetBurst was so easy to hate.)  Then Intel brought back the P6 architecture, with updates, as the Core architecture, and we were back to shallower pipes, but with a lot of other new things to optimize for as well, so again gcc was beating the asm optimized for NetBurst.  Once the coders came back and optimized for Core, surprise, asm was faster again.

This illustrates the difficulty of programming in asm: your code has to be revised with hardware revisions, almost without fail.  There is no write-once-and-forget.  The reason anyone still does it even today is the speed advantage.  To give you an idea of just how widespread it is, go to

http://www.google.com/codesearch and for the term, use lang:^assembly$ mov rax 

I put the mov rax there to limit it to x64 assembly, so you can be sure it's relatively recent.  Start putting some core service apps in there; zlib came right up, for example.

However, with this, I'm done.  I've done my due diligence in attempting to show you the error of your ways, but hazeman is right, you're a fanboy, and science has shown people can be wired to be fanboys and not even realize it.  You aren't going to do your own research, so you're never going to learn.  Even in the face of vast amounts of information online (I gave you Google Code Search...), you are still going to illogically cling to your laziness/ignorance rather than attempt to learn.  It's actually quite sad...

0 Likes

corry/hazeman,
Just to clarify, is your main issue with coding in OpenCL the performance of the kernel, or the whole application?
If the whole application is the problem, which part is the bottleneck?
If the kernel is the problem, would OpenCL + IL help with this?

Basically, the CAL interface is deprecated and not coming back, but IL itself is not. Currently the way to use IL + OpenCL is to create a binary and replace the .text/.amdil section. Are there preferences/feedback on how you would like to see this improved?
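For what it's worth, the section swap described above can be sketched in host-side tooling. The helper below is a hypothetical illustration (the name `replace_elf_section` is mine, not an AMD API), assuming the OpenCL "binary" is a 64-bit little-endian ELF image and the replacement IL body is padded to exactly the old section's size, so no file offsets need fixing:

```python
import struct

def replace_elf_section(data: bytes, name: str, new_body: bytes) -> bytes:
    """Return a copy of a 64-bit little-endian ELF image with the named
    section's contents replaced by new_body (same size only, so no
    section/program header offsets need to be rewritten)."""
    assert data[:4] == b"\x7fELF", "not an ELF image"
    assert data[4] == 2, "only ELFCLASS64 is handled in this sketch"
    (e_shoff,) = struct.unpack_from("<Q", data, 0x28)  # section header table offset
    e_shentsize, e_shnum, e_shstrndx = struct.unpack_from("<HHH", data, 0x3A)

    def shdr(i):
        # sh_name, sh_type, sh_flags, sh_addr, sh_offset, sh_size
        return struct.unpack_from("<IIQQQQ", data, e_shoff + i * e_shentsize)

    # Section names live in the string table indexed by e_shstrndx.
    _, _, _, _, str_off, str_size = shdr(e_shstrndx)
    strtab = data[str_off:str_off + str_size]

    for i in range(e_shnum):
        name_off, _, _, _, offset, size = shdr(i)
        nul = strtab.index(b"\x00", name_off)
        if strtab[name_off:nul] == name.encode():
            if len(new_body) != size:
                raise ValueError("pad the new IL to the old section size first")
            return data[:offset] + new_body + data[offset + size:]
    raise KeyError(name)
```

The bytes would come from clGetProgramInfo(..., CL_PROGRAM_BINARIES, ...) and go back in through clCreateProgramWithBinary; the section name (.amdil vs. .textil) has varied across SDK versions, so check with readelf first.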

As for high level vs. low level, I've always found Paul Hsieh, while dated, to be insightful. Link here: http://www.azillionmonkeys.com...ptimize.html#asmdebate
0 Likes

Originally posted by: MicahVillmow corry/hazeman, Just to clarify, is your main issue with coding in OpenCL the performance of the kernel, or the whole application? If the whole application is the problem, which part is the bottleneck? If the kernel is the problem, would OpenCL + IL help with this? Basically, the CAL interface is deprecated and not coming back, but IL itself is not. Currently the way to use IL + OpenCL is to create a binary and replace the .text/.amdil section. Are there preferences/feedback on how you would like to see this improved? As for high level vs. low level, I've always found Paul Hsieh, while dated, to be insightful. Link here: http://www.azillionmonkeys.com...ptimize.html#asmdebate


In my code, right now, the application is doing little other than trivial conditioning of input data.  Running, hah, surprise, a hand-optimized x86 version of the conditioning gained me 20% over running it on the GPU, thanks to the CPU having a single instruction for the exact conditioning I needed to do 🙂  (Optimized by ordering loads/stores and knowing the number of execution units on the processor family coded for, much like the link describes 🙂  I also saw the same trick used with a different processor intrinsic, so I can't take all the credit 🙂 )  Beyond that, the host is sitting and waiting for the GPU.

For the next version, I'm trying to strike a balance before I even get to the IL.  I have some clever ways of manipulating my input data, so I think I can force aligned loads of 16 reads of 16 bytes, but then I need to modify a few individual bytes in the buffer.  The amount of data is small, so I'd rather initialize and condition on the host side, then repeatedly modify those few bytes and ingest the results in a loop on the GPU.  This may not be possible, and I may have to go back to the drawing board on this one.  Do too much on the CPU and I run the GPU out of RAM; too much on the GPU and everything slows down 😞  Gotta find that balancing point.

I've been saying all along that I hate having to do everything in ASM.  I'd love it if the system could either "link" some IL modules with OpenCL modules, or perhaps take inline IL in the OpenCL.  Either way, in the CPU implementations of this stuff I generally have a bunch of C code setting up the algorithm, with the algorithm itself in a separate .asm file.  I used to like MSVC's inline ASM, until I found it would shut the optimizer off, and if you didn't follow its prescribed rules for register use, things could get ugly... so I switched to external ASM files.  For me, that's a very familiar workflow; I'd love to have either option.  This junk I'm doing now with the setup and alignment, etc., I have in C (to align for SSE), with the actual algorithm written in ASM with SSE.  It'd be nice to be able to reuse more of that!

I know there would be some interesting aspects to linking IL in its current form into OpenCL (since OpenCL mangles its function names into numbers, so what happens on a collision... something new would have to be introduced, like function naming for IL code that will be compiled into an OpenCL binary).  I think it's doable, and heck, you might already be working on that, or be done with it, for all I know 🙂

0 Likes