godsic
Journeyman III

AMD SIMD bug!

Hello.

I think I have found a bug in AMD K11 (or maybe in all other generations too).

I wrote some code and spent 4 hours hunting a bug in it. The code looked like this:

{

...

_mm_store_sd(&v[1],XMM0);

_mm_storeh_pd(&v[0],XMM0);

}

I always use this combination to store to memory when there is no better way to optimize (it also costs me less than _mm_shuffle_pd + _mm_store_pd), and on an x86 OS it works perfectly. But when I recompiled my program for x64 I noticed incorrect results, so I spent 4 hours looking for errors in my own code. In the end I simply changed the code to:

{

...

XMM1=_mm_shuffle_pd(XMM0,XMM0,_MM_SHUFFLE2(0,1));

_mm_store_pd(&v[0],XMM1);

}

After that change my program runs correctly. I suspect that this is a bug in the SIMD pipeline (probably the instruction decoder).
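For readers comparing the two store sequences above, here is a minimal sketch of what each form is meant to write. It does not reproduce the reported x64 problem; it only shows that both forms should leave the same values in memory. The helper names (`store_swapped_two_stores`, `store_swapped_shuffle`, `variants_agree`, `check_variants_agree`) are made up for illustration, and the sketch uses `_mm_storeu_pd` instead of `_mm_store_pd` so that it makes no assumption about the alignment of `v`.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Variant 1: two scalar stores. Low lane goes to v[1], high lane to v[0]. */
static void store_swapped_two_stores(double *v, __m128d x)
{
    _mm_store_sd(&v[1], x);   /* v[1] = low lane  */
    _mm_storeh_pd(&v[0], x);  /* v[0] = high lane */
}

/* Variant 2: swap the lanes in a register, then do one packed store. */
static void store_swapped_shuffle(double *v, __m128d x)
{
    __m128d swapped = _mm_shuffle_pd(x, x, _MM_SHUFFLE2(0, 1));
    _mm_storeu_pd(v, swapped);  /* unaligned store: assumption, not the original call */
}

/* Returns 1 when both variants write high to [0] and low to [1]. */
static int variants_agree(double low, double high)
{
    double a[2] = {0.0, 0.0};
    double b[2] = {0.0, 0.0};
    __m128d x = _mm_set_pd(high, low);  /* arguments are (high lane, low lane) */

    store_swapped_two_stores(a, x);
    store_swapped_shuffle(b, x);
    return a[0] == high && a[1] == low && b[0] == high && b[1] == low;
}

/* Hypothetical self-check with arbitrary lane values. */
static void check_variants_agree(void)
{
    assert(variants_agree(1.0, 2.0));
    assert(variants_agree(-3.5, 0.25));
}
```

If a build passes this check for the shuffle variant but fails it for the two-store variant, that points at the generated code for the two stores rather than at the data.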

Can anybody from AMD's CPU team look into this report?

0 Likes
25 Replies
stroia
Staff

Thank you for bringing this to our attention. I have passed it on to our engineering team for further review.

 

0 Likes

I compiled this code using VS2008 with the following flags:

/GL /O2 /Ob1 /Oi /Ot /fp:fast /arch:SSE2 /favor:blend /Zp1 /OPT:ref /openmp /D_CRT_SECURE_NO_DEPRECATE

Some flags are ignored by the x64 compiler, but they are present because I also compile for x86.

CPU: AMD Athlon X2 QL-62

0 Likes


You see, I think that it's extremely difficult to find a bug in a CPU. Maybe in your case it's just a compiler quirk? Please post a disassembly listing of the problem code here.

0 Likes
godsic
Journeyman III

Sorry, but I cannot do that, because CodeAnalyst does not support IBS on my CPU, and my program consists of many modules that use Tcl/Tk for the GUI. I use a script to build it. Also, I don't have any disassembler to do it with.

Maybe there are other ways?

0 Likes

Can AMD add support for offline disassembly to CodeAnalyst?

Can AMD add the ability to attach to a running process on the fly and profile it?

It would increase the usability of CA tenfold!

 

 

 

0 Likes

Nobody wants to see your whole program. Try to isolate the buggy code fragment and write a simple function or program that proves your suspicion. BTW, Visual Studio contains a disassembler; it can be accessed while debugging.

0 Likes
godsic
Journeyman III

I am modifying a GNU program, so I can show all the code, even all the sources. Nice idea. I know that VS has a disassembler, but I don't have a VS project for my program, so I can't use it :-(! OK, I will write a simple console program and post the disassembly in 10 minutes.

0 Likes

WIN32 code:

#include <windows.h>

#include <intrin.h>

#include <stdio.h>

void main()

{

00841000  push        ebp  

00841001  mov         ebp,esp 

00841003  and         esp,0FFFFFFF0h 

00841006  sub         esp,30h 

   __m128d XMM0,XMM1,XMM3;

   _declspec(align(16)) double A[2]={0.1,0.2};

00841009  movsd       xmm0,mmword ptr [__real@3fb999999999999a (8422E0h)] 

   _declspec(align(16)) double B[2]={0.2,0.3};

   _declspec(align(16)) double C[2]={0.0,0.0};

00841011  fldz             

00841013  movsd       mmword ptr [esp+20h],xmm0 

00841019  fst         qword ptr [esp] 

0084101C  movsd       xmm0,mmword ptr [__real@3fc999999999999a (8422D0h)] 

00841024  fstp        qword ptr [esp+8] 

00841028  movsd       mmword ptr [esp+28h],xmm0 

 

   XMM0 = _mm_load_pd(A);

0084102E  movapd      xmm1,xmmword ptr [esp+20h] 

00841034  movsd       mmword ptr [esp+10h],xmm0 

0084103A  movsd       xmm0,mmword ptr [__real@3fd3333333333333 (8422C8h)] 

00841042  movsd       mmword ptr [esp+18h],xmm0 

   XMM1 = _mm_load_pd(B);

00841048  movapd      xmm0,xmmword ptr [esp+10h] 

 

   XMM3 = _mm_add_pd(XMM0,XMM1);

0084104E  addpd       xmm0,xmm1 

   _mm_store_sd(&C[1],XMM3);

00841052  movsd       mmword ptr [esp+8],xmm0 

   _mm_storeh_pd(&C[0],XMM3);

 

   printf("test: %e %e\n",C[0],C[1]);

00841058  fld         qword ptr [esp+8] 

0084105C  sub         esp,10h 

0084105F  fstp        qword ptr [esp+8] 

00841063  movhpd      qword ptr [esp+10h],xmm0 

00841069  fld         qword ptr [esp+10h] 

0084106D  fstp        qword ptr [esp] 

00841070  push        offset string "test: %e %e\n" (8422B8h) 

00841075  call        dword ptr [__imp__printf (8420A8h)] 

0084107B  add         esp,14h 

 

}

0084107E  xor         eax,eax 

00841080  mov         esp,ebp 

00841082  pop         ebp  

00841083  ret              

 



0 Likes

WIN64 code:

000000013FAF1002  db          c4h  

000000013FAF1003  sub         rsp,58h 

   __m128d XMM0,XMM1,XMM3;

   _declspec(align(16)) double A[2]={0.1,0.2};

000000013FAF1007  movsd       xmm1,mmword ptr [__real@3fc999999999999a (13FAF22B8h)] 

000000013FAF100F  movsd       xmm0,mmword ptr [__real@3fb999999999999a (13FAF22B0h)] 

   _declspec(align(16)) double B[2]={0.2,0.3};

   _declspec(align(16)) double C[2]={0.0,0.0};

 

   XMM0 = _mm_load_pd(A);

   XMM1 = _mm_load_pd(B);

 

   XMM3 = _mm_add_pd(XMM0,XMM1);

   _mm_store_sd(&C[1],XMM3);

   _mm_storeh_pd(&C[0],XMM3);

 

   printf("test: %e %e\n",C[0],C[1]);

000000013FAF1017  lea         rcx,[string "test: %e %e\n" (13FAF2298h)] 

000000013FAF101E  movsd       mmword ptr [rax-18h],xmm0 

000000013FAF1023  movsd       xmm0,mmword ptr [__real@3fd3333333333333 (13FAF22A8h)] 

000000013FAF102B  movsd       mmword ptr [rax-28h],xmm1 

000000013FAF1030  movsd       mmword ptr [rax-10h],xmm1 

000000013FAF1035  xorpd       xmm1,xmm1 

000000013FAF1039  movsd       mmword ptr [rax-20h],xmm0 

000000013FAF103E  movapd      xmm0,xmmword ptr [rax-28h] 

000000013FAF1043  addpd       xmm0,xmmword ptr [rax-18h] 

000000013FAF1048  movsd       mmword ptr [rax-30h],xmm1 

000000013FAF104D  movsd       mmword ptr [rax-38h],xmm1 

000000013FAF1052  movsd       mmword ptr [rax-30h],xmm0 

000000013FAF1057  movhpd      qword ptr [rax-38h],xmm0 

000000013FAF105C  movsd       xmm2,mmword ptr [rax-30h] 

000000013FAF1061  movsd       xmm1,mmword ptr [rax-38h] 

000000013FAF1066  movd        r8,xmm2 

000000013FAF106B  movd        rdx,xmm1 

000000013FAF1070  call        qword ptr [__imp_printf (13FAF2140h)] 

 

}

000000013FAF1076  xor         eax,eax 

000000013FAF1078  add         rsp,58h 

000000013FAF107C  ret      



0 Likes

Is this what you call collaboration between AMD and Microsoft?

Also, the two previous pieces of code work perfectly on both x64 and x86!

 

0 Likes

This Win64 code works incorrectly:

#include <windows.h>

#include <intrin.h>

#include <stdio.h>

void main()

{

000000013FB21000  sub         rsp,48h 

   __m128d XMM0,XMM1,XMM3;

   _declspec(align(16)) double *v;

   v = (double*)malloc(sizeof(double)*2);

000000013FB21004  mov         ecx,10h 

000000013FB21009  call        qword ptr [__imp_malloc (13FB22138h)] 

   _declspec(align(16)) double A[2]={0.1,0.2};

   _declspec(align(16)) double B[2]={0.2,0.3};

   //_declspec(align(16)) double C[2]={0.0,0.0};

 

   XMM0 = _mm_load_pd(A);

   XMM1 = _mm_load_pd(B);

 

   XMM3 = _mm_add_pd(XMM0,XMM1);

   

   //XMM3 = _mm_shuffle_pd(XMM3,XMM3,_MM_SHUFFLE2(0,1));

   //_mm_store_pd(&v[0],XMM3);

 

   _mm_store_sd(&v[1],XMM3);

   _mm_storeh_pd(&v[0],XMM3);

 

   printf("test: %e %e\n",v[0],v[1]);

000000013FB2100F  lea         rcx,[string "test: %e %e\n" (13FB221B0h)] 

000000013FB21016  movsd       xmm0,mmword ptr [__real@3fb999999999999a (13FB221D0h)] 

000000013FB2101E  movsd       xmm1,mmword ptr [__real@3fc999999999999a (13FB221C8h)] 

000000013FB21026  movsd       mmword ptr [rsp+38h],xmm1 

000000013FB2102C  movsd       mmword ptr ,xmm1 

000000013FB21032  movsd       mmword ptr ,xmm0 

000000013FB21038  movsd       xmm0,mmword ptr [__real@3fd3333333333333 (13FB221C0h)] 

000000013FB21040  movsd       mmword ptr [rsp+28h],xmm0 

000000013FB21046  movapd      xmm1,xmmword ptr  

000000013FB2104C  addpd       xmm1,xmmword ptr  

000000013FB21052  movhpd      qword ptr [rax],xmm1 

000000013FB21056  movsd       xmm2,mmword ptr [rax+8] 

000000013FB2105B  movsd       xmm1,mmword ptr [rax] 

000000013FB2105F  movd        r8,xmm2 

000000013FB21064  movd        rdx,xmm1 

000000013FB21069  call        qword ptr [__imp_printf (13FB22128h)] 

 

}

000000013FB2106F  xor         eax,eax 

000000013FB21071  add         rsp,48h 

000000013FB21075  ret              



0 Likes

The same C++ code, compiled for x86, works perfectly:

00151002  in          al,dx 

00151003  and         esp,0FFFFFFF0h 

00151006  sub         esp,20h 

   __m128d XMM0,XMM1,XMM3;

   _declspec(align(16)) double *v;

   v = (double*)malloc(sizeof(double)*2);

00151009  push        10h  

0015100B  call        dword ptr [__imp__malloc (1520A4h)] 

   _declspec(align(16)) double A[2]={0.1,0.2};

00151011  fld         qword ptr [__real@3fb999999999999a (152118h)] 

00151017  fstp        qword ptr [esp+14h] 

   _declspec(align(16)) double B[2]={0.2,0.3};

   //_declspec(align(16)) double C[2]={0.0,0.0};

 

   XMM0 = _mm_load_pd(A);

   XMM1 = _mm_load_pd(B);

 

   XMM3 = _mm_add_pd(XMM0,XMM1);

   

   //XMM3 = _mm_shuffle_pd(XMM3,XMM3,_MM_SHUFFLE2(0,1));

   //_mm_store_pd(&v[0],XMM3);

 

   _mm_store_sd(&v[1],XMM3);

   _mm_storeh_pd(&v[0],XMM3);

 

   printf("test: %e %e\n",v[0],v[1]);

0015101B  sub         esp,0Ch 

0015101E  fld         qword ptr [__real@3fc999999999999a (152110h)] 

00151024  fst         qword ptr [esp+28h] 

00151028  movapd      xmm1,xmmword ptr [esp+20h] 

0015102E  fstp        qword ptr [esp+10h] 

00151032  fld         qword ptr [__real@3fd3333333333333 (152108h)] 

00151038  fstp        qword ptr [esp+18h] 

0015103C  movapd      xmm0,xmmword ptr [esp+10h] 

00151042  addpd       xmm0,xmm1 

00151046  movsd       mmword ptr [eax+8],xmm0 

0015104B  movhpd      qword ptr [eax],xmm0 

0015104F  fld         qword ptr [eax+8] 

00151052  fstp        qword ptr [esp+8] 

00151056  fld         qword ptr [eax] 

00151058  fstp        qword ptr [esp] 

0015105B  push        offset string "test: %e %e\n" (1520F4h) 

00151060  call        dword ptr [__imp__printf (15209Ch)] 

00151066  add         esp,14h 

 

}

00151069  xor         eax,eax 

0015106B  mov         esp,ebp 

0015106D  pop         ebp  

0015106E  ret             



0 Likes

Source:

#include <windows.h>

#include <intrin.h>

#include <stdio.h>

#include <stdlib.h>

void main()

{

   __m128d XMM0,XMM1,XMM3;

   _declspec(align(16)) double *v;

   v = (double*)malloc(sizeof(double)*2);

   _declspec(align(16)) double A[2]={0.1,0.2};

   _declspec(align(16)) double B[2]={0.2,0.3};

   //_declspec(align(16)) double C[2]={0.0,0.0};

 

   XMM0 = _mm_load_pd(A);

   XMM1 = _mm_load_pd(B);

 

   XMM3 = _mm_add_pd(XMM0,XMM1);

   

   //XMM3 = _mm_shuffle_pd(XMM3,XMM3,_MM_SHUFFLE2(0,1));

   //_mm_store_pd(&v[0],XMM3);

 

   _mm_store_sd(&v[1],XMM3);

   _mm_storeh_pd(&v[0],XMM3);

 

   printf("test: %e %e\n",v[0],v[1]);

 

}



0 Likes

Please describe what's wrong with the buggy code fragment. I mean: "It should produce X, but I got Y."

0 Likes
godsic
Journeyman III

I think the code is quite clear. We should get:

v[0]=A[1]+B[1], v[1]=A[0]+B[0];

In the x86 case printf prints: test: 5.00000000e-1, 3.00000000e-1!

In the x64 case printf prints: test: 5.00000000e-1, 0.00000000e-1!

Also, as I see in the asm, the VS compiler produces slower code for x64. Why does it split MOVAPD into 2xMOVSD?

I can only guess that the SIMD units are natively x86 (32-bit addresses)!
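The "should produce X, but I got Y" comparison above can also be automated. The following is only a sketch under assumptions: the function names are hypothetical, it uses unaligned loads so the input arrays need no special alignment, and it pre-fills the output with NaN sentinels so that a store that never happens (as in the x64 result) shows up as a mismatch instead of a stale value.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <math.h>       /* NAN */

/* Scalar reference: v[0] = A[1] + B[1], v[1] = A[0] + B[0]. */
static void swapped_sum_scalar(const double *A, const double *B, double *v)
{
    v[0] = A[1] + B[1];
    v[1] = A[0] + B[0];
}

/* SSE2 path mirroring the posted test program. */
static void swapped_sum_sse2(const double *A, const double *B, double *v)
{
    __m128d sum = _mm_add_pd(_mm_loadu_pd(A), _mm_loadu_pd(B));
    _mm_store_sd(&v[1], sum);   /* low lane  -> v[1] */
    _mm_storeh_pd(&v[0], sum);  /* high lane -> v[0] */
}

/* Returns the number of elements where the SSE2 result differs from the
   scalar reference. 0 means "X == Y". */
static int count_mismatches(const double *A, const double *B)
{
    double expected[2];
    double actual[2] = {NAN, NAN};  /* NaN compares unequal to everything */
    int bad = 0;
    int i;

    swapped_sum_scalar(A, B, expected);
    swapped_sum_sse2(A, B, actual);
    for (i = 0; i < 2; ++i) {
        if (actual[i] != expected[i]) {
            ++bad;
        }
    }
    return bad;
}

/* Hypothetical self-check using the inputs from the posted program. */
static void check_posted_inputs(void)
{
    const double A[2] = {0.1, 0.2};
    const double B[2] = {0.2, 0.3};
    assert(count_mismatches(A, B) == 0);
}
```

Exact comparison is used on purpose: both paths perform one double-precision addition per element, so they are expected to agree bit for bit when scalar math is also done in SSE2. Under x87 code generation or aggressive floating-point options, a small tolerance may be the safer choice.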

 

 

0 Likes

Maybe it's a bug in the VS code generation?

 

In the x64 asm:

000000013FB21056  movsd       xmm2,mmword ptr [rax+8] 

Where was [rax+8] written? There is no store to that address before the load.

 

0 Likes

godsic: Look, have you tried running your program on Intel CPUs? I guess that you'll get the same results. Please try it and then we'll continue.

0 Likes
godsic
Journeyman III

Basically, I am an AMD fan, so I don't have an Intel-based PC!

And I think eduardoschardong is right that VS generates wrong code! But could it be due to x64 limitations?

TO AMD: How do the SIMD units fetch from memory or from the L2/L3 cache?

This bug occurs only when I write to data allocated with malloc (or _aligned_malloc); if you declare static data, everything works perfectly.
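On the malloc point above, one detail worth separating out: in the posted source, `_declspec(align(16)) double *v;` aligns the pointer variable itself, not the heap block it points to. The sketch below is an assumption-laden illustration, not the original program. It uses `_mm_malloc`/`_mm_free` to request a 16-byte-aligned block, and the helper names are hypothetical. Note that `_mm_store_sd` and `_mm_storeh_pd` do not require 16-byte alignment, so alignment alone would not obviously explain the missing store. It matters mainly for `_mm_store_pd`.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2; GCC and Clang also pull in _mm_malloc/_mm_free here */
#include <stddef.h>     /* NULL */
#include <stdint.h>     /* uintptr_t */

/* Returns 1 if p is aligned on a 16-byte boundary. */
static int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}

/* Allocates two doubles on a 16-byte boundary. Release with _mm_free. */
static double *alloc_pair_aligned16(void)
{
    return (double *)_mm_malloc(2 * sizeof(double), 16);
}

/* Allocates, does an aligned packed store, checks the result, frees.
   Returns 1 on success, 0 on allocation failure or wrong contents. */
static int aligned_store_roundtrip(void)
{
    double *v = alloc_pair_aligned16();
    int ok;

    if (v == NULL) {
        return 0;
    }
    assert(is_aligned16(v));
    _mm_store_pd(v, _mm_set_pd(0.3, 0.5));  /* low lane 0.5 -> v[0], high lane 0.3 -> v[1] */
    ok = (v[0] == 0.5 && v[1] == 0.3);
    _mm_free(v);
    return ok;
}
```

Memory obtained from `_mm_malloc` must be released with `_mm_free`, not `free`.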

0 Likes

I think that it's extremely difficult to find a bug in a CPU.

Edit: Removed advertising from post

0 Likes

Thanks for the suggestion. AMD CodeAnalyst Performance Analyzer focuses on program profiling and performance analysis. It does disassemble code when it displays profile data. However, it is not intended to be a dedicated disassembly tool. We would recommend using dumpbin or one of the disassemblers that are available as freeware, shareware, or released products.

 

CodeAnalyst does not attach to a process in the way that debuggers attach to processes. For disassembly of short regions of code, a debugger would be an alternative to a disassembler.

 

CodeAnalyst collects data using system-wide sampling (using either timer-based sampling, event-based sampling, or Instruction-Based Sampling). Timer- and event-based sampling are available even on processors without Instruction-Based Sampling. Thus, if you would really like to get a disassembly using CodeAnalyst, please run and profile your program with CodeAnalyst, then drill down to the region of interest within the program.

0 Likes

To stroia:

I think AMD's biggest disadvantage is that it doesn't listen to its customers!

It would be very useful to use CA to measure the latency of a block of code offline, without executing it! After all, we need to see the disassembly to optimize C++ code, don't we?

If AMD created an all-in-one tool, it would be a great present to AMD fans!

I think that PerfAnalyzer is an all-in-one tool for OpenGL and DX developers focused on ATI GPUs! So why doesn't AMD want to make the same kind of tool for CPU developers?

For example, you could create a topic to ask AMD developers what they want to see in AMD tools!

Does that make sense?

 

0 Likes

Like I said, I think that you've run into a compiler quirk or bug. So try to find the root of the problem on Microsoft's forum.

0 Likes
godsic
Journeyman III

To avk: yes, you were right, but have a look at the disassembly!

It is ugly and inefficient! AMD claimed that they work with the MSVS developers to optimize the MSVC compiler for better performance on AMD, but from the disassembly I see that they neglect all the benefits of the x64 architecture. They still prefer to use memory to store data (to AMD: do you really think that developers optimize their code for cache? :-)) even when registers could be used, which results in slower code and probably cache pollution!

So I suspect that AMD doesn't work with Microsoft at all!

 

0 Likes

Nobody is perfect. Please don't jump to conclusions about the Microsoft and AMD cooperation. I think that AMD cannot directly influence the quality of Microsoft's compiler. I believe that AMD just provides some information for optimization and maybe some testbeds. So if you think that you've found an error in the compiler, then let its author (Microsoft) know about it.

0 Likes