Originally posted by: corry
*Bump*

Originally posted by: genaganna
It is already deprecated, so no further enhancements are exposed at the CAL level.
You'd think I didn't know AMD made a stupid decision in deprecating CAL by this cop-out of a post.
I know AMD has a hubris problem when it comes to their OpenCL compiler, and like most cases of hubris, it's misplaced. The OpenCL compiler is trash. My naive IL implementation of the same kernel is 33% faster. Do you think I'm *EVER* going to switch to OpenCL? Here's a hint: I'll switch to nVidia, who allow inline PTX in their CUDA code, first. I'll switch to a company less arrogant about its compiler, one that understands that, with current optimizing-compiler techniques, humans will always beat compilers. (I'll be glad to stop doing the hard work when it can be automated properly.) So please, don't give me the cop-out answer.
Either say "we here at AMD like to point weapons of all sorts at our own feet, so you're going to have to reverse engineer our stuff to find the information you are requesting", or just hand it over. Don't just make a blanket statement about CAL being deprecated. I already know that. I was hoping AMD would do the intelligent thing and let us have the information, even if unsupported.
Giving out information on a deprecated spec, which encourages use and further uptake by other users (not just yourself; that's arrogance), is NOT the "intelligent thing" for AMD.
You misunderstand. I am saying it's arrogant of them to deprecate the interface and try to force everyone to use the high-level interface, when even present-day optimizing compilers fail to beat a human optimizer. It's arrogant of them to say "just use the optimizing compiler, it is better than you." The fact is, it isn't.
Since I assume it was a business-level decision, not an engineering-level decision, to deprecate CAL, it would be nice if AMD engineers, or whatever the AMD guys who post on here are, would occasionally hand out a little bit of information.
I can, and so can many others, reverse engineer the APP Profiler, generate this information, and share it on the forums, and others would use it unsupported, just like what I'm asking for. All I want is the info, and not to have to spend hours figuring it out, especially when they ask for hours of my time generating test cases for their bugs. If I have to RE their stuff, that's less time I have to help them with their bugs... Seems to me it would be the smart move on all fronts to just release it unsupported...
So when I query counter 2, I'm given back 24 bytes of data. I tried some simple, basic tests to tease data out, but I don't really have time for this. Here is what I found, though, should anyone else want to take a crack at it... I varied the number of wavefronts from 96 to 1, and the number of times the kernel ran per launch from 262144 to 1. Why those numbers? I just kept bumping up the wavefronts until I saw speed decrease, and 262144 is a power of 2 (I'm partial to them) that makes the kernel run for 5+ seconds (to get real info about the speed of the kernel...)
96 Wavefronts, loop running
de 4a 57 9e 01 00 00 00 6c 04 b9 b4 3e 8f 01 00
98 96 17 54 40 8f 01 00 ff ff ff ff ff ff ff ff
96 wavefronts, no loop
50 da 00 00 00 00 00 00 68 f5 f2 39 5f 8f 01 00
05 58 aa 3a 5f 8f 01 00 ff ff ff ff ff ff ff ff
1 wavefront, loop running
35 0b 5e ff 00 00 00 00 f8 fe 11 49 79 8f 01 00
f4 5a 87 49 7a 8f 01 00 ff ff ff ff ff ff ff ff
1 wavefront, no loop
4a b6 00 00 00 00 00 00 7a d2 38 83 89 8f 01 00
10 1b 29 84 89 8f 01 00 ff ff ff ff ff ff ff ff
Performance counter 3 simply returns 32 bits of 0, and trying to create counter 4 returns an error.
Now, are you guys really going to make me/us/someone/? reverse engineer this stuff? With enough parameters varied, I'm sure we can tease out all the fields...
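For anyone picking this up: here's a quick, hedged decode. If the 24 returned bytes are read as three little-endian uint64s (the trailing ff..ff looks like an unused fourth field), then in both of the long runs above, the first field lands within about 0.5% of the difference between the second and third. That's consistent with an elapsed-tick counter plus a start/stop timestamp pair, though the field meanings here are purely my guesses, not anything documented:

```python
import struct

# Guess: the 24 returned bytes are three little-endian uint64 fields,
# and the trailing ff..ff 8 bytes are unused/sentinel.
def decode(dump_hex: str):
    raw = bytes.fromhex(dump_hex)
    return struct.unpack("<3Q", raw[:24])

# "96 Wavefronts, loop running" dump from above (ff padding dropped).
ticks, t0, t1 = decode(
    "de4a579e01000000" "6c04b9b43e8f0100" "98961754408f0100"
)
print(hex(ticks), hex(t1 - t0))   # both land in the 0x19e-0x19f billion range
print((t1 - t0 - ticks) / ticks)  # ~0.0025, i.e. the two agree within ~0.25%
```

The "1 wavefront, loop running" dump shows the same pattern; in the short no-loop runs, field 0 is far smaller than the timestamp gap, which would make sense if the gap there is dominated by launch overhead. Worth varying more parameters to confirm.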
Just curious, any luck?
Hmm... don't get me wrong, I understand why you're doing what you're doing, and it must certainly be very frustrating to have the best option taken away from you (good luck, by the way ;-) ).
I suspect AMD is trying to push the heterogeneous programming model quite strongly, as it's the main advantage over CUDA, and certainly the only way to topple CUDA, even if it makes a few unforgiving and seemingly selfish decisions along the way by deprecating end-user support for CAL and forcing users to switch 100% to OpenCL. OpenCL was never going to be the most resource-efficient solution for GPUs, but I guess they imagine that low-level, device-specific coding would fragment their consumer base, causing less uptake in the future (although you may argue the opposite is true, that it helps users; I guess the strength of OpenCL and GPGPU lies in the long term, where a few lost ALU cycles won't be noticed).
I could go on for hours... literally. What it boils down to is this: AMD had a loyal following because of their low-level interface. nVidia has a loyal following because of their high-level interface. nVidia sought to woo the low-level following away from AMD by offering better access to their low level (inline PTX code in CUDA), while maintaining their high-level interface. AMD, on the other hand, is trying to woo high-level developers with clunky OpenCL, while dropping support for their low-level interface.
Read that a few times, and you tell me who you think has the better strategy. Yes, I think nVidia is in trouble when it comes to FSA, though I wonder if they might try to make a fast ARM chip conforming to FSA, since Intel won't sell them an x86 license, and ARM said they're on board with FSA, but I digress...
I understand you might be a high-level guy, and I appreciate high-level stuff. I have worked on projects with line counts in the high hundreds of thousands, and appreciate the organization and shortcuts the high level provides. I've also worked on multi-thousand-line low-level projects, and appreciate the 30%-????% speed increase they give me. I always say: pick the right tool for the job. AMD has just decided to take away the right tool for the job, rather than offering both tools. Doesn't this sound, at best, misguided?
Like I said, though, I could go on for hours about it, but it comes down to that: offer more; don't take away stuff you had previously given out, especially while your largest competitor works to undermine your hold on what you are now taking away...
As for the counters, I haven't managed to get back to it. Differential analysis is time consuming, and I have a kernel to make run faster... I worked for a long time on Intel processors without VTune... thanks to the aforementioned stupidity, I'll just have to continue in that fashion with AMD GPUs now...
I'm just curious, but are you trying to make a kernel go faster, or a whole application? The reason I ask is that in many cases the bottleneck is not the kernel, but memory transfers.
If you are this interested in low-level programming, I would recommend looking at the OpenCL binary: use the OpenCL interface for memory transfers, and then optimize the OpenCL binary itself. You can use standard open-source tools to access our binaries, pull out the IL sections, modify them, and re-insert them. As long as you leave the infrastructure the same, the runtime should load and execute the binary and recompile your IL to device ISA (as long as the CAL binary has been removed).
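For anyone who wants to try this route: the program binary you get back from clGetProgramInfo(CL_PROGRAM_BINARIES) is, on the builds I've poked at, an ordinary little-endian ELF image, so "standard open source tools" can be as little as readelf -S plus a few lines of code. Here's a rough sketch of locating the sections yourself. The IL section name (".amdil" below) is a guess from inspecting binaries, not anything documented; check what readelf reports on yours:

```python
import struct

# Minimal ELF64 (little-endian) section lister. Assumes the OpenCL program
# binary is a plain ELF image; no third-party libraries needed.
def elf_sections(data: bytes):
    assert data[:4] == b"\x7fELF", "not an ELF image"
    assert data[4] == 2 and data[5] == 1, "expects little-endian ELF64"
    (e_shoff,) = struct.unpack_from("<Q", data, 0x28)
    e_shentsize, e_shnum, e_shstrndx = struct.unpack_from("<HHH", data, 0x3A)
    hdrs = [struct.unpack_from("<IIQQQQIIQQ", data, e_shoff + i * e_shentsize)
            for i in range(e_shnum)]
    strtab_off = hdrs[e_shstrndx][4]  # file offset of .shstrtab contents
    def name(n):
        return data[strtab_off + n:data.index(b"\x00", strtab_off + n)].decode()
    return {name(h[0]): (h[4], h[5]) for h in hdrs}  # name -> (offset, size)

# A same-size IL edit is then just a splice (hypothetical usage):
# off, size = elf_sections(blob)[".amdil"]
# patched = blob[:off] + new_il.ljust(size, b"\x00")[:size] + blob[off + size:]
```

If the edited IL changes size, splicing in place won't do; you'd have to rewrite the section offsets (or lean on a tool that does), so keeping patched IL the same length is the path of least resistance.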
The short answer is yes. The medium/long answer is... well... I tend to get a little wordy... I'll try to be brief though...
In the interest of brevity, I'll just cut to the chase. It sounds to me like you're asking: why do I want access to these counters anyway, and if I want them so badly, why don't I just patch binaries?
First, the counters. Right now, I'm simply planning for the future. I have an algorithm implemented, and am working on getting it through the ALU as fast as possible. In most cases, it will not read or write a whole lot of data, and I hope program size doesn't become an issue, so I can write even less data by inlining the other functions I plan to use with the algorithm. That said, there is at least one use case where that just won't be possible. I will basically be streaming a fair amount of data into the kernel, but still writing very little, assuming I can inline everything else... otherwise, yeah, it's going to be a bit of a nightmare! ;)
When we all first started programming, though, we debugged our C programs with printf("This variable is now %d!\n", someInt); then we learned about debuggers, stepping through code, watching variables, data breakpoints, code breakpoints, int 3 on x86, etc., and suddenly a several-hundred-thousand-line project doesn't seem so daunting to work on, and our debugging is taken to a whole new level. Well, with optimizing, we learned tricks like xoring an x86 reg with itself instead of moving 0 into it, instruction ordering/grouping, pushing loads as early as possible, using as many registers as we can (avoiding the stack, a.k.a. RAM, even if it is "likely" to be in L1 cache), prefetching, etc. Then we learn about performance counters: seeing consistently mispredicted branches and helping the system with cmovcc, bit twiddling to conditionally set variables without branching, seeing where we're violating spatial locality, etc.

The debugger, compared to printf, practically pointed at your bug and put blinking lights all around it. Well, performance counters do the same thing for bottlenecks. Right now, I want to make sure the algorithm I'm implementing isn't a bottleneck, or, well, that's not really possible, but I want it to be as wide a bottleneck as it can possibly be. I probably don't need performance counters for that, which is why I spent 20 minutes generating that post, and left it. However, the future is a different story, and I will likely end up wanting them, if not needing them.
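For anyone who hasn't seen the "conditionally set variables without branching" trick mentioned above, it's a mask select. I'll sketch it in Python just for readability; the payoff is obviously in C or asm, where the mask replaces a potentially mispredicted branch:

```python
# Branchless select: pick a or b depending on cond (0 or 1), no if/branch.
# The C equivalent: mask = -(uint64_t)cond; r = (a & mask) | (b & ~mask);
def select(cond: int, a: int, b: int) -> int:
    mask = -cond                     # cond=1 -> all-ones mask, cond=0 -> zero
    return (a & mask) | (b & ~mask)  # keeps a under all-ones, b under zeros
```

A mispredict costs a full pipeline flush, so on unpredictable data this can easily beat the branchy version even though it always does both ANDs.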
I'm not really sure how useful the counters in the AMD APP Profiler will really be, but at least I'll get some hard numbers, not the ones SKA produces out of thin air, which no one can tell me the accuracy of. (For the record, it's estimating I should see a throughput of, say, 200M threads/sec, but I'm only seeing about 70% of that. Of course, it shows the OpenCL version as getting about 70% of my 70%... which would not bode well for setting it up to actually run on some data. :)
So that's enough on performance counters...
So why not "just patch"? My father used to beat a saying into my, and my siblings', skulls. I assume he picked it up in his time as an engineer with the Navy. He would always say, "If a job is worth doing", and expect us to finish with "it's worth doing right". If compiling a blank OpenCL file, ripping out its IL text section, and replacing it with my own were the "right" way to do things, wouldn't the compiler have an option to just read IL text and insert it appropriately? Further, it may work for you on these forums, and work for me on my own time, writing code for myself or my own business/pseudo-business, and maybe even at AMD (though I would think that might explain a few things with ATI/AMD...). In every job I've worked, in every professional software shop, patching binaries will always be called the wrong way. Tools may exist, and it may work perfectly well, but I know that were I to suggest such a thing to anyone I work for, with, or otherwise, I'd instantly lose credibility as a professional engineer and slip into "hacker" territory (hacker in this case meaning being a hack, or hacking crap apart to work together, not hacking through security systems, for those who needed the explanation). That's a great title in certain sub-specialties. IMHO it's better to reverse engineer the counters and use the system natively than it is to patch binaries and leave the pseudo-public interface as a black box.
That said, yes, even if I have to do it on my own time, I'll disassemble the profiler and figure out what makes it tick. Maybe performance counter 3 is a total red herring (then again, perhaps I could cut down a tree with that herring!), and again, if I have to differential-analyze it to death on my own time to figure that out, I will... eventually.
Yes, many of us complained, with good reason, about the quality of the documentation. I don't know if I speak for all of the complainers here, but I'll say it: first, I've had to work with far worse, and second, I've had to work with none before. All of the above situations fall into the category of sucking. However, some documentation sucks less than none, and merely poor documentation sucks less than "some" documentation. I can give... non-specific examples... if requested (don't want to get sued for libel or anything, as I have some particularly great examples of particularly poor documentation ;) ). I don't know if the quality, and the prior complaints (by the time I started working with my card, and got on here to try to figure the whole thing out, CAL was already deprecated), had anything to do with the deprecation of CAL, and, to be honest, I have no idea what the reason for deprecating CAL was, but I wonder if the documentation didn't have something to do with it. I suspect that, and someone said "why maintain 2 interfaces", which of course sounds great on paper, but then, why maintain x86 asm and C++? See my previous posts for why!
Anyhow, I'd REALLY like to know why CAL was deprecated, especially when it's still under active development. That points me even more toward the documentation issue. If so, I think I know a pretty good formal writer (at least when he switches into formal writing mode) who might be willing to sign his life away in NDAs and work on contract to generate some documentation, pending his day job's HR approving him for the task, since most employers these days make you sign papers which specifically state that you will not take any other jobs without approval. (I'm sure AMD made you sign such a form in the mountains of paper you signed when you first started there. :)
With that, I'm off to bed... to start this whole thing again in less than 7 hours *_*