
mrolle
Adept II

Any optimization enthusiasts out there?

Looking for kindred spirits

I am doing a lot of hand-coded assembler optimization on all AMD CPU generations.  I would like to share ideas with other interested people.
Especially perplexing is a lack of precise details of the pipeline operation.  I don't want anything that is an AMD secret, but if any of you guys have figured some things out, or you know of some published reports, or patent applications, I'd like to hear about them.

In return, I have developed some very good and accurate code timing methods, using both IP sampling and RDTSC methods.  I'd be happy to share them with anyone.

7 Replies
avk
Adept III

I'm very interested in this topic. BTW, may I ask what kind of code you write?
0 Likes
0r
Journeyman III

Originally posted by: mrolle
I am doing a lot of hand-coded assembler optimization on all AMD CPU generations.  I would like to share ideas with other interested people.
In return, I have developed some very good and accurate code timing methods, using both IP sampling and RDTSC methods.  I'd be happy to share them with anyone.



Wow! That's great, because there is so little info on this subject.
What do I need to do to get the material you're offering to share? 😃

0 Likes
eduardoschardong
Journeyman III

One more hobbyist.
Optimizing for AMD processors isn't that difficult; they perform as you would expect. Loads and stores are the operations that hurt the most. In any case, the pipeline simulator in CodeAnalyst gives a fairly precise view.
0 Likes

eduardoschardong: The problem lies with software developers. They think that optimizing for AMD is not relevant, because these CPUs do not dominate the market. I must admit that their point of view is understandable. I think that AMD must help software developers not only by providing software tools, but also by optimizing the CPU-hungry parts of their software.
0 Likes
ajiva
Staff

Originally posted by: mrolle

I am doing a lot of hand-coded assembler optimization on all AMD CPU generations.  I would like to share ideas with other interested people.
Especially perplexing is a lack of precise details of the pipeline operation.  I don't want anything that is an AMD secret, but if any of you guys have figured some things out, or you know of some published reports, or patent applications, I'd like to hear about them.



In return, I have developed some very good and accurate code timing methods, using both IP sampling and RDTSC methods.  I'd be happy to share them with anyone.



Have you looked into Instruction Based Sampling (IBS) (http://forums.amd.com/devblog/...adid=87847&catid=271)? This might help you with accurate code timing. CodeAnalyst provides IBS profiling on Barcelona hardware and should at least get you part of the way there. AMD is also working on a user-space version of IBS called Light Weight Profiling (LWP, http://developer.amd.com/cpu/LWP/Pages/default.aspx) that should be available in a future hardware revision.

I'd be interested in what you've worked out with IP sampling and RDTSC, especially since RDTSC is an expensive instruction and is less useful in a multi-core environment where certain cores can be powered down or running at reduced frequencies.
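
For readers who have not used RDTSC for timing, here is a minimal sketch of one common approach, not mrolle's method (which is not shown in this thread). It assumes GCC or Clang on x86-64, where `__rdtsc()` and `_mm_lfence()` come from `<x86intrin.h>`; on other targets it falls back to `clock_gettime`. The names `read_ticks`, `min_ticks`, and `sample_work` are made up for illustration. It does nothing about the multi-core concerns above (thread migration, frequency changes); pinning the thread to one core would be a separate step.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L   /* for clock_gettime in the fallback path */
#endif
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__x86_64__)
#include <x86intrin.h>
/* Read the time-stamp counter. The fences keep the read from being
   reordered with the surrounding instructions. */
static inline uint64_t read_ticks(void) {
    _mm_lfence();
    uint64_t t = __rdtsc();
    _mm_lfence();
    return t;
}
#else
#include <time.h>
/* Fallback (assumption): nanoseconds from the monotonic clock. */
static inline uint64_t read_ticks(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000u + (uint64_t)ts.tv_nsec;
}
#endif

typedef void (*bench_fn)(void);

/* Call fn `runs` times and return the smallest tick count observed.
   Taking the minimum filters out runs disturbed by interrupts or cold caches. */
static uint64_t min_ticks(bench_fn fn, int runs) {
    assert(fn != NULL && runs > 0);
    uint64_t best = UINT64_MAX;
    for (int i = 0; i < runs; i++) {
        uint64_t start = read_ticks();
        fn();
        uint64_t stop = read_ticks();
        if (stop - start < best) best = stop - start;
    }
    return best;
}

/* Hypothetical workload: sum of squares 0..9999, stored to a volatile
   so the compiler cannot discard the result. */
static volatile uint64_t sink;
static void sample_work(void) {
    uint64_t s = 0;
    for (uint64_t i = 0; i < 10000; i++) s += i * i;
    sink = s;
}
```

Calling `min_ticks(sample_work, 100)` returns the best-case tick count for the workload. The result includes the overhead of the two counter reads, so for very short sequences it helps to time an empty function the same way and subtract.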

0 Likes

The most impactful optimizations are usually at the algorithm level. As the old Zen slogan goes, "the fastest instruction is the one that is never executed". And those optimizations tend to help all platforms.

Once you are using the most efficient algorithm, of course there may be more room for improvement on a given platform.
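
A toy illustration of the algorithm-level point, as a sketch only: summing 1..n with a loop versus Gauss's closed form. The function names are made up for this example. Some optimizing compilers can perform this particular transformation on their own, but the same idea applies to cases they cannot see through.

```c
#include <assert.h>
#include <stdint.h>

/* O(n): one addition per term. */
static uint64_t sum_to_n_loop(uint32_t n) {
    uint64_t s = 0;
    for (uint64_t i = 1; i <= n; i++) s += i;
    return s;
}

/* O(1): closed form n(n+1)/2, no loop at all.
   The product fits in 64 bits for every 32-bit n. */
static uint64_t sum_to_n_closed(uint32_t n) {
    return (uint64_t)n * ((uint64_t)n + 1) / 2;
}

/* Both versions must agree before any timing comparison is meaningful. */
static void check_sums_agree(uint32_t n) {
    assert(sum_to_n_loop(n) == sum_to_n_closed(n));
}
```

No amount of instruction scheduling on the loop version will catch up with the version that has no loop.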

As someone already pointed out, the AMD CodeAnalyst Performance Analyzer has a pipeline simulation mode for those who seek cycle-by-cycle analysis of inner loops. It also supports Instruction Based Sampling for more precise timing and event data. You can download CodeAnalyst free from this web site (look for the link) and it has a step-by-step tutorial in the /help section.

For the ASM programmer, the Optimization Guides for AMD processors include instruction latency and throughput details in Appendix C. The guide for "Family 10h" covers the latest generation of CPUs. Pay special attention to the "decode type" of each instruction: the most common instructions are the most optimized, while some of the more obscure ones are slower. There is also some good detail regarding the microarchitecture in the guide. The guide and other optimization papers are on this web site too.

In my experience, there is also no substitute for experimentation when you want to squeeze out every last bit of performance. I have usually obtained the best results by making a small test program, then using it to try all the different ideas and actually measure the results. Tweak, compile, benchmark, evaluate. Then try the next idea. It's like shampoo: lather, rinse, repeat.
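
A small test program of that kind might look like the sketch below. It is only an illustration under assumptions: the two variants (a plain summing loop and a version with four independent accumulators), the array size, and all names are made up here, and `clock()` from the C standard library is used for portability rather than precision.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

#define N 4096

/* Variant A: straightforward loop. */
static uint32_t sum_simple(const uint32_t *a, size_t n) {
    uint32_t s = 0;
    for (size_t i = 0; i < n; i++) s += a[i];
    return s;
}

/* Variant B: four independent accumulators to shorten the dependency chain. */
static uint32_t sum_unrolled4(const uint32_t *a, size_t n) {
    uint32_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];     s1 += a[i + 1];
        s2 += a[i + 2]; s3 += a[i + 3];
    }
    for (; i < n; i++) s0 += a[i];   /* leftover elements */
    return s0 + s1 + s2 + s3;
}

typedef uint32_t (*sum_fn)(const uint32_t *, size_t);

/* Return CPU seconds spent calling fn `reps` times; *out gets the last result. */
static double time_variant(sum_fn fn, const uint32_t *a, size_t n,
                           int reps, uint32_t *out) {
    assert(reps > 0);
    volatile uint32_t sink = 0;
    clock_t start = clock();
    for (int r = 0; r < reps; r++) sink = fn(a, n);
    clock_t stop = clock();
    *out = sink;
    return (double)(stop - start) / CLOCKS_PER_SEC;
}

/* Run both variants on the same data, print timings, return 1 if they agree. */
static int compare_variants(void) {
    static uint32_t data[N];
    for (size_t i = 0; i < N; i++) data[i] = (uint32_t)(i * 2654435761u);
    uint32_t ra, rb;
    double ta = time_variant(sum_simple, data, N, 10000, &ra);
    double tb = time_variant(sum_unrolled4, data, N, 10000, &rb);
    printf("simple:    %.6f s\nunrolled4: %.6f s\n", ta, tb);
    return ra == rb;
}
```

Checking that the variants agree comes first; a faster wrong answer is not an optimization. An optimizing compiler may inline the variants and hoist the repeated call out of the timing loop, so it is worth inspecting the generated assembly before trusting the numbers.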

-MW

0 Likes
avk
Adept III

mwall: I'm not a Zen worshipper, but let me add my own comment to that slogan: "But if an instruction's execution cannot be avoided, take care to make it faster than it is on the rival's CPU."
0 Likes