In early July, my company started getting bug reports from customers that an access violation was crashing our main application.
Investigation of the crash reports showed that all customers with the bug had an AMD A6-3400M, A8-3500M, or A8-3850 APU. (These are all "Llano" chips that were released in June, shortly before we first started getting the bug reports.) I purchased a Toshiba Satellite L745D (with an A6-3400M APU) and was immediately able to reproduce the problem on it. Neither I nor any of our customers have reproduced the crash on any other CPU.
The problem is described in detail at StackOverflow: http://stackoverflow.com/questions/7004728/is-this-should-not-happen-crash-an-amd-fusion-cpu-bug
In summary, a function pointer value from memory (that is always 0 in our app) is loaded into eax, then a "test eax, eax; je ..." instruction pair checks if the value is 0. If not, "call eax" is executed. "call eax" should never be executed (because the value should always be 0) but after 1-5 minutes of execution, the application crashes with an access violation trying to execute code at 00000000.
The stack is always the same when the crash occurs, although this (common) pattern is repeated throughout the application many times.
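In C terms, the pattern described above looks roughly like the sketch below. The names (callback_fn, invoke_if_set, count_call) are made up for illustration and are not taken from the application; the exact instructions a compiler emits for this will vary.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callback type; illustrative only. */
typedef void (*callback_fn)(void);

/* Counts how often the hypothetical callback below has run. */
int g_calls = 0;
void count_call(void) { g_calls++; }

/* Load a function pointer from memory and call it only if it is non-null.
 * A 32-bit compiler may turn this into something like:
 *     mov  eax, [slot]
 *     test eax, eax
 *     je   skip
 *     call eax
 * Returns 1 if the call was made, 0 if it was skipped. */
int invoke_if_set(const callback_fn *slot)
{
    callback_fn fn = *slot;   /* the value that ends up in eax */
    if (fn != NULL) {         /* test eax, eax ; je ... */
        fn();                 /* call eax */
        return 1;
    }
    return 0;
}

/* Usage: a null slot must never reach the call. */
void demo_invoke(void)
{
    callback_fn slot = NULL;
    assert(invoke_if_set(&slot) == 0);

    slot = count_call;
    int before = g_calls;
    assert(invoke_if_set(&slot) == 1);
    assert(g_calls == before + 1);
}
```

With the slot always null, the call branch is unreachable on a correctly behaving CPU, which is why a crash executing code at 00000000 is so surprising.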
Curiously, if I modify the surrounding code slightly (e.g., replace an earlier "jne" with "nop"s so that the "test eax, eax" is *always* executed), the crash doesn't happen. In my testing, any modification--semantically equivalent or not--to the surrounding code prevents the crash, even though it's 100% reproducible with the original code. I've also tried to write a small test app (that uses the same assembly instructions) but haven't had any success at reproducing the problem.
I've read the "Revision Guide for AMD Family 12h Processors" and although it doesn't describe an issue that is obviously the same, it does list errata that occur "under highly specific and detailed internal timing conditions", which--to my mind--describes the conditions under which this crash appears to occur. I know that programmers should be extremely reluctant to blame the CPU, but I really can't think of another probable cause at this point.
I don't need a fix; as mentioned above, I can stop the crash by modifying the original C code so that the compiled code is slightly different. However, if anyone from AMD is interested in following up on this, I'd be happy to provide more information about the problem.
Or if someone wants to suggest other tests I can run to eliminate/confirm the CPU as the cause of the problem, I'd be happy to do that, too. (I'm glad we've been able to ship an update to our software that stops the crash for our customers, but I'd really love to know what's causing this problem to satisfy my own curiosity.)
I just wanted to say that we have recently started seeing a very similar problem in our application, and it happens exclusively on AMD Llano CPUs. We're in the process of purchasing a machine with one of these CPUs for further testing, but we're getting hundreds of crash dumps every day from customers exhibiting this problem.
We've seen the problem manifest in two parts of our code:
0112A9CC: 8D 94 24 0C 01 00 00 lea edx,[esp+10Ch]
0112A9D3: 39 02 cmp dword ptr [edx],eax ; (1) CRASH
0112A9D5: 7C 04 jl 0112A9DB
0112A9D7: 8D 54 24 1C lea edx,[esp+1Ch]
0112A9DB: 8B 12 mov edx,dword ptr [edx] ; (2) CRASH
From the crashes we've seen, (1) was the faulting instruction 80% of the time, and (2) the other 20%.
The exception and register contents at the point of the crash are:
Unhandled exception at 0x0112A9D3: 0xC0000005: Access violation reading location 0x00000000.
EAX = 00000003 EBX = 18EADB4C ECX = 14319948 EDX = 07DAF934 ESI = 18EADB40 EDI = 18EADB64
EIP = 0112A9D3 ESP = 07DAF828 EBP = 18EADB58 EFL = 00010246
Unhandled exception at 0x0112A9DB: 0xC0000005: Access violation reading location 0x00000000.
EAX = 00000005 EBX = 2472600C ECX = 17D12B60 EDX = 0730F7D0 ESI = 24726000 EDI = 24726024
EIP = 0112A9DB ESP = 0730F7B4 EBP = 24726018 EFL = 00010246
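One detail in these dumps can be checked with simple arithmetic: in both cases EDX equals ESP plus one of the offsets used by the lea instructions in the listing (10Ch in the first dump, 1Ch in the second), so EDX appears to hold a valid stack address even though the reported fault address is 0x00000000. The helper below is only an illustrative sketch; the function names are made up, and the values are copied from the dumps above.

```c
#include <assert.h>
#include <stdint.h>

/* Returns 1 if edx equals esp + offset, using 32-bit wraparound
 * arithmetic as the CPU would for a 32-bit lea. */
int edx_matches_stack_slot(uint32_t esp, uint32_t offset, uint32_t edx)
{
    return (uint32_t)(esp + offset) == edx;
}

/* Checks both register dumps against the offsets used in the listing. */
void check_dumps(void)
{
    /* Crash at (1): lea edx,[esp+10Ch] ; cmp dword ptr [edx],eax */
    assert(edx_matches_stack_slot(0x07DAF828u, 0x10Cu, 0x07DAF934u));

    /* Crash at (2): lea edx,[esp+1Ch] ; mov edx,dword ptr [edx] */
    assert(edx_matches_stack_slot(0x0730F7B4u, 0x1Cu, 0x0730F7D0u));
}
```

If both checks hold, neither faulting instruction was dereferencing a null EDX as far as the dumped register state shows, which is at odds with an access violation reading location 0x00000000.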
Our application is also multithreaded and puts a lot of load on the CPU and GPU (it's a video game).
This problem started occurring a few days ago. The code at this location used to look different because it was built with different compiler options. This is what the non-crashing code looked like:
01059070: 8D 4D FC lea ecx,[ebp-4]
01059073: 39 01 cmp dword ptr [ecx],eax ; (1)
01059075: 7C 03 jl 0105907A
01059077: 8D 4D B4 lea ecx,[ebp-4Ch]
0105907A: 8B 11 mov edx,dword ptr [ecx] ; (2)
The difference is EBP-relative addressing and the use of ECX instead of EDX.
I received confirmation from an AMD engineer (that I spoke to in person at Microsoft Build) that my problem was a known Llano erratum. See my update at: http://stackoverflow.com/questions/7004728/is-this-should-not-happen-crash-an-amd-fusion-cpu-bug/7642385#7642385
Unfortunately, the only workaround (apart from generating different code) was to wait for vendors to ship BIOS updates and have customers apply them.
I'll let him know that you've posted on this thread, in case he can help you.
Thank you very much for your reply! It does look like we're hitting that highly specific and detailed set of internal timing conditions that the erratum mentions.
We do have some IDIV instructions surrounding the faulting code (see (A) and (B) in the code listings).
I guess we'll resort to compiling this source file with the set of compiler options that was making it work.
NEW CODE STREAM
0112A9A1: 74 0B je 0112A9AE
0112A9A3: 8B C2 mov eax,edx
0112A9A5: 99 cdq
0112A9A6: F7 B9 38 04 00 00 idiv eax,dword ptr [ecx+00000438h] ; (A)
0112A9AC: EB 2F jmp 0112A9DD
0112A9AE: 8B 81 38 04 00 00 mov eax,dword ptr [ecx+00000438h]
0112A9B4: 48 dec eax
0112A9B5: 85 D2 test edx,edx
0112A9B7: 89 44 24 1C mov dword ptr [esp+1Ch],eax
0112A9BB: C7 84 24 0C 01 00 00 00 00 00 00 mov dword ptr [esp+0000010Ch],0
0112A9C6: 8D 54 24 18 lea edx,[esp+18h]
0112A9CA: 7F 07 jg 0112A9D3
0112A9CC: 8D 94 24 0C 01 00 00 lea edx,[esp+0000010Ch]
0112A9D3: 39 02 cmp dword ptr [edx],eax ; (1)
0112A9D5: 7C 04 jl 0112A9DB
0112A9D7: 8D 54 24 1C lea edx,[esp+1Ch]
0112A9DB: 8B 12 mov edx,dword ptr [edx] ; (2)
0112A9DD: 8B 89 3C 04 00 00 mov ecx,dword ptr [ecx+0000043Ch]
0112A9E3: 8B C2 mov eax,edx
0112A9E5: 99 cdq
0112A9E6: F7 F9 idiv eax,ecx ; (B)

OLD CODE STREAM
0105904F: 74 0B je 0105905C
01059051: 8B C2 mov eax,edx
01059053: 99 cdq
01059054: F7 BE 38 04 00 00 idiv eax,dword ptr [esi+00000438h] ; (A)
0105905A: EB 20 jmp 0105907C
0105905C: 8B 86 38 04 00 00 mov eax,dword ptr [esi+00000438h]
01059062: 48 dec eax
01059063: 89 4D FC mov dword ptr [ebp-4],ecx
01059066: 3B D1 cmp edx,ecx
01059068: 89 45 B4 mov dword ptr [ebp-4Ch],eax
0105906B: 8D 4D 0C lea ecx,[ebp+0Ch]
0105906E: 7F 03 jg 01059073
01059070: 8D 4D FC lea ecx,[ebp-4]
01059073: 39 01 cmp dword ptr [ecx],eax ; (1)
01059075: 7C 03 jl 0105907A
01059077: 8D 4D B4 lea ecx,[ebp-4Ch]
0105907A: 8B 11 mov edx,dword ptr [ecx] ; (2)
0105907C: 8B 8E 3C 04 00 00 mov ecx,dword ptr [esi+0000043Ch]
01059082: 8B C2 mov eax,edx
01059084: 99 cdq
01059085: F7 F9 idiv eax,ecx ; (B)
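For readers following the disassembly, the instructions between (A) and (B) in the new code stream read like a clamp done through pointers: pick the address of either the value or a zero, then pick the address of either that or the upper bound, then load through the chosen pointer. The C sketch below is only a guess at the shape of the source, not the actual code. It assumes the lower bound is the constant 0 written to [esp+10Ch], that [esp+18h] holds the same value that was tested in edx, and that the upper bound is the field at offset 438h minus one. All names are hypothetical.

```c
#include <assert.h>

/* Hypothetical reconstruction of the pointer-select clamp in the listing.
 * 'value' stands for the slot at [esp+18h], 'limit' for the field loaded
 * from [ecx+438h]. */
int clamp_index(int value, int limit)
{
    int zero = 0;            /* mov dword ptr [esp+10Ch],0 */
    int upper = limit - 1;   /* dec eax ; mov dword ptr [esp+1Ch],eax */
    const int *p;

    /* test edx,edx ; jg: keep &value when value > 0, otherwise take &zero */
    p = (value > 0) ? &value : &zero;

    /* (1) cmp dword ptr [p],eax ; jl: keep p when *p < upper,
     * otherwise take &upper */
    p = (*p < upper) ? p : &upper;

    /* (2) mov edx,dword ptr [p]: the load that faults in 20% of the dumps.
     * In the listing the result then feeds cdq ; idiv at (B), which is
     * omitted from this sketch. */
    return *p;
}

/* Usage: values are clamped to the range [0, limit - 1]. */
void demo_clamp(void)
{
    assert(clamp_index(5, 10) == 5);
    assert(clamp_index(-3, 10) == 0);
    assert(clamp_index(20, 10) == 9);
}
```

Every pointer the sketch can select refers to a local on the stack, which matches the register dumps showing EDX inside the stack frame at the time of the fault.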
We have been unable to reproduce this problem ourselves, but we have already ordered a machine with one of these CPUs for testing.
For now, all I can suggest is downloading the free trial of RIFT and just playing the game. Our players are crashing at random times between 30 seconds and 3 hours after logging in.