bgrainger

Application crashes only on "Llano" APUs -- is it a CPU bug?

Discussion created by bgrainger on Aug 16, 2011
Latest reply on Oct 4, 2011 by blaz

In early July, my company started getting bug reports from customers that an access violation was crashing our main application.

Investigation of the crash reports showed that all customers with the bug had an AMD A6-3400M, A8-3500M, or A8-3850 APU. (These are all "Llano" chips that were released in June, shortly before we first started getting the bug reports.) I purchased a Toshiba Satellite L745D (with an A6-3400M APU) and was immediately able to reproduce the problem on it. Neither I nor any of our customers have reproduced the crash on any other CPU.

The problem is described in detail at StackOverflow: http://stackoverflow.com/questions/7004728/is-this-should-not-happen-crash-an-amd-fusion-cpu-bug

In summary, a function pointer value from memory (that is always 0 in our app) is loaded into eax, then a "test eax, eax; je ..." instruction pair checks if the value is 0. If not, "call eax" is executed. "call eax" should never be executed (because the value should always be 0) but after 1-5 minutes of execution, the application crashes with an access violation trying to execute code at 00000000.
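For readers who want the pattern in source form, here is a minimal C sketch of the kind of code that typically compiles to this load / test / conditional-call sequence. It is an illustration only, not the application's actual code: the names `callback_fn`, `g_callback`, `example_callback`, and `invoke_if_set` are hypothetical, and the assembly in the comments is what a 32-bit compiler might plausibly emit, not a guarantee.

```c
#include <stddef.h>

/* Hypothetical names throughout; this is not the application's real code. */
typedef void (*callback_fn)(void);

/* The pointer lives in memory and, in the scenario described, is always NULL.
   volatile forces a fresh load from memory on every check. */
static callback_fn volatile g_callback = NULL;

/* Counts how many times the example callback actually ran. */
static int g_calls = 0;

static void example_callback(void)
{
    g_calls++;
}

/* Returns 1 if the callback was invoked, 0 if the pointer was NULL. */
int invoke_if_set(void)
{
    callback_fn fn = g_callback;   /* roughly: mov eax, [g_callback] */
    if (fn != NULL) {              /* roughly: test eax, eax / je skip */
        fn();                      /* roughly: call eax */
        return 1;
    }
    return 0;                      /* skip: */
}
```

With `g_callback` left at NULL, `invoke_if_set()` should take the skip path every time. The crash described corresponds to the call being reached even though the loaded value is 0.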

The stack is always the same when the crash occurs, although this (common) pattern is repeated throughout the application many times.

Curiously, if I modify the surrounding code slightly (e.g., replace an earlier "jne" with "nop"s so that the "test eax, eax" is *always* executed), the crash doesn't happen. In my testing, any modification--semantically equivalent or not--to the surrounding code prevents the crash, even though it's 100% reproducible with the original code. I've also tried to write a small test app (that uses the same assembly instructions) but haven't had any success at reproducing the problem.
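A small test app of the kind mentioned above might look roughly like the following C sketch, which repeats the NULL check in a tight loop. The names `hook_fn`, `g_hook`, and `stress_null_check` are hypothetical, and a C compiler is not guaranteed to emit the same instruction sequence as the original binary, so a loop like this would not be expected to reproduce the crash (consistent with the result reported above).

```c
#include <stddef.h>

/* Hypothetical names; a sketch of a stress loop, not the original test app. */
typedef void (*hook_fn)(void);

/* Always NULL; volatile forces a real memory load on every iteration. */
static hook_fn volatile g_hook = NULL;

/* Runs the load / test / conditional-call pattern `iterations` times.
   Returns how many times the non-NULL branch was entered, which should be 0.
   If the branch were wrongly entered with a NULL pointer, the call itself
   would fault at address 0, matching the reported symptom. */
unsigned long stress_null_check(unsigned long iterations)
{
    unsigned long unexpected = 0;
    unsigned long i;

    for (i = 0; i < iterations; i++) {
        hook_fn fn = g_hook;       /* load the pointer from memory */
        if (fn != NULL) {          /* should never be true */
            unexpected++;
            fn();
        }
    }
    return unexpected;
}
```

Calling `stress_null_check` with a large iteration count and checking that it returns 0 is the whole test; any other outcome (a nonzero count or an access violation) would be worth investigating.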

I've read the "Revision Guide for AMD Family 12h Processors" and although it doesn't describe an issue that is obviously the same, it does list errata that occur "under highly specific and detailed internal timing conditions", which--to my mind--describes the conditions under which this crash appears to occur. I know that programmers should be extremely reluctant to blame the CPU, but I really can't think of another probable cause at this point.

I don't need a fix; as mentioned above, I can stop the crash by modifying the original C code so that the compiled code is slightly different. However, if anyone from AMD is interested in following up on this, I'd be happy to provide more information about the problem.

Or if someone wants to suggest other tests I can run to eliminate/confirm the CPU as the cause of the problem, I'd be happy to do that, too. (I'm glad we've been able to ship an update to our software that stops the crash for our customers, but I'd really love to know what's causing this problem to satisfy my own curiosity.)
