25 Replies Latest reply on Mar 25, 2009 11:35 AM by indi123

    AMD SIMD bug!

    godsic

      Hello.

      I think I have found a bug in AMD K11 (or maybe in other generations as well).

      I wrote some code and spent 4 hours hunting for a bug in it. The code looked like this:

      {

      ...

      _mm_store_sd(&v[1],XMM0);

      _mm_storeh_pd(&v[0],XMM0);

      }

      I always use this combination to store to memory when there is no better way to optimize (it also costs me less than _mm_shuffle_pd + _mm_store_pd), and on an x86 OS it works perfectly. But when I recompiled my program for x64 I noticed incorrect results, so I spent 4 hours trying to find and fix my own errors. In the end I simply tried changing this code to:

      {

      ...

      XMM1=_mm_shuffle_pd(XMM0,XMM0,_MM_SHUFFLE2(0,1));

      _mm_store_pd(&v[0],XMM1);

      }

      After that change my program runs correctly. I suspect that this is a bug in the SIMD pipeline (probably the instruction decoder).

      Can anybody from AMD's CPU division take this report?
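
For reference, the two store sequences compared in the post above can be written as standalone functions. This is a minimal sketch, not code from the original program: the names `store_swapped_two_stores` and `store_swapped_shuffle` are made up for illustration, and `_mm_storeu_pd` is used in place of `_mm_store_pd` so the sketch does not assume a 16-byte-aligned destination. Both functions are meant to leave the high lane in `out[0]` and the low lane in `out[1]`.

```cpp
#include <cassert>
#include <emmintrin.h>  // SSE2 intrinsics (__m128d, _mm_store_sd, _mm_storeh_pd, ...)

// Hypothetical helper: two 64-bit stores, as in the first snippet of the post.
inline void store_swapped_two_stores(double* out, __m128d x) {
    assert(out != nullptr);
    _mm_store_sd(&out[1], x);   // low lane  -> out[1]
    _mm_storeh_pd(&out[0], x);  // high lane -> out[0]
}

// Hypothetical helper: swap the lanes in a register, then do one 128-bit store.
inline void store_swapped_shuffle(double* out, __m128d x) {
    assert(out != nullptr);
    // _MM_SHUFFLE2(0, 1): result low lane = x high lane, result high lane = x low lane.
    __m128d swapped = _mm_shuffle_pd(x, x, _MM_SHUFFLE2(0, 1));
    _mm_storeu_pd(out, swapped);  // unaligned store: an assumption made for this sketch
}
```

Calling either helper with `_mm_set_pd(2.0, 1.0)` (high lane 2.0, low lane 1.0) should give `out[0] == 2.0` and `out[1] == 1.0`. If the two helpers disagree on a particular build, the generated code for that build is worth inspecting.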

        • AMD SIMD bug!

          Thank you for bringing this to our attention.  I have passed it on to our engineering team for further review.

           

            • AMD SIMD bug!
              godsic

              I compile this code using VS2008 with the following flags:

              /GL /O2 /Ob1 /Oi /Ot /fp:fast /arch:SSE2 /favor:blend /Zp1 /OPT:ref /openmp /D_CRT_SECURE_NO_DEPRECATE

              Some flags are ignored by the x64 compiler, but they are present because I also compile for x86.

              CPU: AMD Athlon X2 QL-62


                  • AMD SIMD bug!
                    avk

                    You see, I think it's extremely difficult to find a bug in a CPU. Maybe in your case it's just a compiler quirk? Please post a disassembly listing of the problem code here.

                      • AMD SIMD bug!
                        godsic

                        Sorry, but I can't do that: CodeAnalyst does not support IBS on my CPU, and my program consists of many modules that use Tcl/Tk for the GUI. I use a script to build it. Also, I don't have a disassembler to do it with.

                        Are there other ways?

                          • AMD SIMD bug!
                            godsic

                            Can AMD add offline disassembly support to CodeAnalyst?

                            Can AMD add the ability to attach to a running process on the fly and profile it?

                            It would increase the usability of CA tenfold!

                             

                             

                             

                              • AMD SIMD bug!
                                avk

                                Nobody wants to see your whole program. Try to isolate the buggy code fragment and write a simple function or program that proves your suspicion. BTW, Visual Studio contains a disassembler; it can be accessed while debugging.

                                  • AMD SIMD bug!
                                    godsic

                                    I am modifying a GNU program, so I can show all the code, even all the sources. Nice idea. I know that VS has a disassembler, but I don't have a VS project for my program, so I can't use it :-(! OK, I'll write a simple console program and post the disassembly in 10 minutes.

                                      • AMD SIMD bug!
                                        godsic

                                        WIN32 code:

                                         

                                        #include <windows.h>

                                        #include <intrin.h>

                                        #include <stdio.h>

                                        void main()

                                        {

                                        00841000  push        ebp  

                                        00841001  mov         ebp,esp 

                                        00841003  and         esp,0FFFFFFF0h 

                                        00841006  sub         esp,30h 

                                           __m128d XMM0,XMM1,XMM3;

                                           _declspec(align(16)) double A[2]={0.1,0.2};

                                        00841009  movsd       xmm0,mmword ptr [__real@3fb999999999999a (8422E0h)] 

                                           _declspec(align(16)) double B[2]={0.2,0.3};

                                           _declspec(align(16)) double C[2]={0.0,0.0};

                                        00841011  fldz             

                                        00841013  movsd       mmword ptr [esp+20h],xmm0 

                                        00841019  fst         qword ptr [esp] 

                                        0084101C  movsd       xmm0,mmword ptr [__real@3fc999999999999a (8422D0h)] 

                                        00841024  fstp        qword ptr [esp+8] 

                                        00841028  movsd       mmword ptr [esp+28h],xmm0 

                                         

                                           XMM0 = _mm_load_pd(A);

                                        0084102E  movapd      xmm1,xmmword ptr [esp+20h] 

                                        00841034  movsd       mmword ptr [esp+10h],xmm0 

                                        0084103A  movsd       xmm0,mmword ptr [__real@3fd3333333333333 (8422C8h)] 

                                        00841042  movsd       mmword ptr [esp+18h],xmm0 

                                           XMM1 = _mm_load_pd(B);

                                        00841048  movapd      xmm0,xmmword ptr [esp+10h] 

                                         

                                           XMM3 = _mm_add_pd(XMM0,XMM1);

                                        0084104E  addpd       xmm0,xmm1 

                                           _mm_store_sd(&C[1],XMM3);

                                        00841052  movsd       mmword ptr [esp+8],xmm0 

                                           _mm_storeh_pd(&C[0],XMM3);

                                         

                                           printf("test: %e %e\n",C[0],C[1]);

                                        00841058  fld         qword ptr [esp+8] 

                                        0084105C  sub         esp,10h 

                                        0084105F  fstp        qword ptr [esp+8] 

                                        00841063  movhpd      qword ptr [esp+10h],xmm0 

                                        00841069  fld         qword ptr [esp+10h] 

                                        0084106D  fstp        qword ptr [esp] 

                                        00841070  push        offset string "test: %e %e\n" (8422B8h) 

                                        00841075  call        dword ptr [__imp__printf (8420A8h)] 

                                        0084107B  add         esp,14h 

                                         

                                        }

                                        0084107E  xor         eax,eax 

                                        00841080  mov         esp,ebp 

                                        00841082  pop         ebp  

                                        00841083  ret              

                                         



                                          • AMD SIMD bug!
                                            godsic

                                            WIN64 code:

                                             

                                            000000013FAF1002  db          c4h  

                                            000000013FAF1003  sub         rsp,58h 

                                               __m128d XMM0,XMM1,XMM3;

                                               _declspec(align(16)) double A[2]={0.1,0.2};

                                            000000013FAF1007  movsd       xmm1,mmword ptr [__real@3fc999999999999a (13FAF22B8h)] 

                                            000000013FAF100F  movsd       xmm0,mmword ptr [__real@3fb999999999999a (13FAF22B0h)] 

                                               _declspec(align(16)) double B[2]={0.2,0.3};

                                               _declspec(align(16)) double C[2]={0.0,0.0};

                                             

                                               XMM0 = _mm_load_pd(A);

                                               XMM1 = _mm_load_pd(B);

                                             

                                               XMM3 = _mm_add_pd(XMM0,XMM1);

                                               _mm_store_sd(&C[1],XMM3);

                                               _mm_storeh_pd(&C[0],XMM3);

                                             

                                               printf("test: %e %e\n",C[0],C[1]);

                                            000000013FAF1017  lea         rcx,[string "test: %e %e\n" (13FAF2298h)] 

                                            000000013FAF101E  movsd       mmword ptr [rax-18h],xmm0 

                                            000000013FAF1023  movsd       xmm0,mmword ptr [__real@3fd3333333333333 (13FAF22A8h)] 

                                            000000013FAF102B  movsd       mmword ptr [rax-28h],xmm1 

                                            000000013FAF1030  movsd       mmword ptr [rax-10h],xmm1 

                                            000000013FAF1035  xorpd       xmm1,xmm1 

                                            000000013FAF1039  movsd       mmword ptr [rax-20h],xmm0 

                                            000000013FAF103E  movapd      xmm0,xmmword ptr [rax-28h] 

                                            000000013FAF1043  addpd       xmm0,xmmword ptr [rax-18h] 

                                            000000013FAF1048  movsd       mmword ptr [rax-30h],xmm1 

                                            000000013FAF104D  movsd       mmword ptr [rax-38h],xmm1 

                                            000000013FAF1052  movsd       mmword ptr [rax-30h],xmm0 

                                            000000013FAF1057  movhpd      qword ptr [rax-38h],xmm0 

                                            000000013FAF105C  movsd       xmm2,mmword ptr [rax-30h] 

                                            000000013FAF1061  movsd       xmm1,mmword ptr [rax-38h] 

                                            000000013FAF1066  movd        r8,xmm2 

                                            000000013FAF106B  movd        rdx,xmm1 

                                            000000013FAF1070  call        qword ptr [__imp_printf (13FAF2140h)] 

                                             

                                            }

                                            000000013FAF1076  xor         eax,eax 

                                            000000013FAF1078  add         rsp,58h 

                                            000000013FAF107C  ret      



                                              • AMD SIMD bug!
                                                godsic

                                                Is this what is called collaboration between AMD and Microsoft?

                                                Also, the two previous pieces of code work perfectly on both x64 and x86!

                                                 

                                                  • AMD SIMD bug!
                                                    godsic

                                                    This Win64 code works incorrectly:

                                                     

                                                    #include <windows.h>

                                                    #include <intrin.h>

                                                    #include <stdio.h>

                                                    void main()

                                                    {

                                                    000000013FB21000  sub         rsp,48h 

                                                       __m128d XMM0,XMM1,XMM3;

                                                       _declspec(align(16)) double *v;

                                                       v = (double*)malloc(sizeof(double)*2);

                                                    000000013FB21004  mov         ecx,10h 

                                                    000000013FB21009  call        qword ptr [__imp_malloc (13FB22138h)] 

                                                       _declspec(align(16)) double A[2]={0.1,0.2};

                                                       _declspec(align(16)) double B[2]={0.2,0.3};

                                                       //_declspec(align(16)) double C[2]={0.0,0.0};

                                                     

                                                       XMM0 = _mm_load_pd(A);

                                                       XMM1 = _mm_load_pd(B);

                                                     

                                                       XMM3 = _mm_add_pd(XMM0,XMM1);

                                                       

                                                       //XMM3 = _mm_shuffle_pd(XMM3,XMM3,_MM_SHUFFLE2(0,1));

                                                       //_mm_store_pd(&v[0],XMM3);

                                                     

                                                       _mm_store_sd(&v[1],XMM3);

                                                       _mm_storeh_pd(&v[0],XMM3);

                                                     

                                                       printf("test: %e %e\n",v[0],v[1]);

                                                    000000013FB2100F  lea         rcx,[string "test: %e %e\n" (13FB221B0h)] 

                                                    000000013FB21016  movsd       xmm0,mmword ptr [__real@3fb999999999999a (13FB221D0h)] 

                                                    000000013FB2101E  movsd       xmm1,mmword ptr [__real@3fc999999999999a (13FB221C8h)] 

                                                    000000013FB21026  movsd       mmword ptr [rsp+38h],xmm1 

                                                    000000013FB2102C  movsd       mmword ptr [B],xmm1 

                                                    000000013FB21032  movsd       mmword ptr [A],xmm0 

                                                    000000013FB21038  movsd       xmm0,mmword ptr [__real@3fd3333333333333 (13FB221C0h)] 

                                                    000000013FB21040  movsd       mmword ptr [rsp+28h],xmm0 

                                                    000000013FB21046  movapd      xmm1,xmmword ptr [B] 

                                                    000000013FB2104C  addpd       xmm1,xmmword ptr [A] 

                                                    000000013FB21052  movhpd      qword ptr [rax],xmm1 

                                                    000000013FB21056  movsd       xmm2,mmword ptr [rax+8] 

                                                    000000013FB2105B  movsd       xmm1,mmword ptr [rax] 

                                                    000000013FB2105F  movd        r8,xmm2 

                                                    000000013FB21064  movd        rdx,xmm1 

                                                    000000013FB21069  call        qword ptr [__imp_printf (13FB22128h)] 

                                                     

                                                    }

                                                    000000013FB2106F  xor         eax,eax 

                                                    000000013FB21071  add         rsp,48h 

                                                    000000013FB21075  ret              



                                                      • AMD SIMD bug!
                                                        godsic

                                                        The same C++ code, compiled for x86, works perfectly:

                                                         

                                                        00151002  in          al,dx 

                                                        00151003  and         esp,0FFFFFFF0h 

                                                        00151006  sub         esp,20h 

                                                           __m128d XMM0,XMM1,XMM3;

                                                           _declspec(align(16)) double *v;

                                                           v = (double*)malloc(sizeof(double)*2);

                                                        00151009  push        10h  

                                                        0015100B  call        dword ptr [__imp__malloc (1520A4h)] 

                                                           _declspec(align(16)) double A[2]={0.1,0.2};

                                                        00151011  fld         qword ptr [__real@3fb999999999999a (152118h)] 

                                                        00151017  fstp        qword ptr [esp+14h] 

                                                           _declspec(align(16)) double B[2]={0.2,0.3};

                                                           //_declspec(align(16)) double C[2]={0.0,0.0};

                                                         

                                                           XMM0 = _mm_load_pd(A);

                                                           XMM1 = _mm_load_pd(B);

                                                         

                                                           XMM3 = _mm_add_pd(XMM0,XMM1);

                                                           

                                                           //XMM3 = _mm_shuffle_pd(XMM3,XMM3,_MM_SHUFFLE2(0,1));

                                                           //_mm_store_pd(&v[0],XMM3);

                                                         

                                                           _mm_store_sd(&v[1],XMM3);

                                                           _mm_storeh_pd(&v[0],XMM3);

                                                         

                                                           printf("test: %e %e\n",v[0],v[1]);

                                                        0015101B  sub         esp,0Ch 

                                                        0015101E  fld         qword ptr [__real@3fc999999999999a (152110h)] 

                                                        00151024  fst         qword ptr [esp+28h] 

                                                        00151028  movapd      xmm1,xmmword ptr [esp+20h] 

                                                        0015102E  fstp        qword ptr [esp+10h] 

                                                        00151032  fld         qword ptr [__real@3fd3333333333333 (152108h)] 

                                                        00151038  fstp        qword ptr [esp+18h] 

                                                        0015103C  movapd      xmm0,xmmword ptr [esp+10h] 

                                                        00151042  addpd       xmm0,xmm1 

                                                        00151046  movsd       mmword ptr [eax+8],xmm0 

                                                        0015104B  movhpd      qword ptr [eax],xmm0 

                                                        0015104F  fld         qword ptr [eax+8] 

                                                        00151052  fstp        qword ptr [esp+8] 

                                                        00151056  fld         qword ptr [eax] 

                                                        00151058  fstp        qword ptr [esp] 

                                                        0015105B  push        offset string "test: %e %e\n" (1520F4h) 

                                                        00151060  call        dword ptr [__imp__printf (15209Ch)] 

                                                        00151066  add         esp,14h 

                                                         

                                                        }

                                                        00151069  xor         eax,eax 

                                                        0015106B  mov         esp,ebp 

                                                        0015106D  pop         ebp  

                                                        0015106E  ret             



                                                          • AMD SIMD bug!
                                                            godsic

                                                            Source:

                                                             

                                                            #include <windows.h>

                                                            #include <intrin.h>

                                                            #include <stdio.h>

                                                            void main()

                                                            {

                                                               __m128d XMM0,XMM1,XMM3;

                                                               _declspec(align(16)) double *v;

                                                               v = (double*)malloc(sizeof(double)*2);

                                                               _declspec(align(16)) double A[2]={0.1,0.2};

                                                               _declspec(align(16)) double B[2]={0.2,0.3};

                                                               //_declspec(align(16)) double C[2]={0.0,0.0};

                                                             

                                                               XMM0 = _mm_load_pd(A);

                                                               XMM1 = _mm_load_pd(B);

                                                             

                                                               XMM3 = _mm_add_pd(XMM0,XMM1);

                                                               

                                                               //XMM3 = _mm_shuffle_pd(XMM3,XMM3,_MM_SHUFFLE2(0,1));

                                                               //_mm_store_pd(&v[0],XMM3);

                                                             

                                                               _mm_store_sd(&v[1],XMM3);

                                                               _mm_storeh_pd(&v[0],XMM3);

                                                             

                                                               printf("test: %e %e\n",v[0],v[1]);

                                                             

                                                            }



                                                              • AMD SIMD bug!
                                                                avk

                                                                Please describe what's wrong with the buggy code fragment. I mean: "It should produce X, but I've got Y."

                                                                  • AMD SIMD bug!
                                                                    godsic

                                                                    I think the code is quite clear. We expect:

                                                                    v[0]=A[1]+B[1], v[1]=A[0]+B[0];

                                                                    In the x86 case printf prints: test: 5.00000000e-1, 3.00000000e-1!

                                                                    In the x64 case printf prints: test: 5.00000000e-1, 0.00000000e-1!

                                                                    Also, as I see in the asm, the VS compiler produces slower code for x64. Why does it split MOVAPD into 2x MOVSD?

                                                                    I can only guess that the SIMD units are natively x86 (32-bit addressing)!
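
The expected result stated above can be checked mechanically by comparing the intrinsic stores against plain scalar arithmetic. This is a minimal sketch, assuming an SSE2-capable compiler: the names `Pair`, `simd_result`, `scalar_reference` and `matches` are hypothetical, and `_mm_loadu_pd` replaces `_mm_load_pd` so that no alignment attribute is needed on the inputs.

```cpp
#include <cassert>
#include <cmath>
#include <cstdlib>
#include <emmintrin.h>  // SSE2 intrinsics

struct Pair { double v0, v1; };  // hypothetical holder for v[0] and v[1]

// Same computation as the test program: add A and B, store the lanes swapped
// into a heap buffer, then read the buffer back.
inline Pair simd_result(const double A[2], const double B[2]) {
    double* v = static_cast<double*>(std::malloc(2 * sizeof(double)));
    if (v == nullptr) std::abort();  // allocation failed, nothing to test
    __m128d sum = _mm_add_pd(_mm_loadu_pd(A), _mm_loadu_pd(B));
    _mm_store_sd(&v[1], sum);   // low lane  (A[0]+B[0]) -> v[1]
    _mm_storeh_pd(&v[0], sum);  // high lane (A[1]+B[1]) -> v[0]
    Pair r = { v[0], v[1] };
    std::free(v);
    return r;
}

// Scalar statement of the expectation: v[0]=A[1]+B[1], v[1]=A[0]+B[0].
inline Pair scalar_reference(const double A[2], const double B[2]) {
    Pair r = { A[1] + B[1], A[0] + B[0] };
    return r;
}

// True when both components agree within an absolute tolerance.
inline bool matches(Pair a, Pair b, double tol = 1e-12) {
    assert(tol >= 0.0);
    return std::fabs(a.v0 - b.v0) <= tol && std::fabs(a.v1 - b.v1) <= tol;
}
```

With A = {0.1, 0.2} and B = {0.2, 0.3}, a correct build should report a match, with values close to 0.5 and 0.3. A mismatch in the second component would correspond to the x64 symptom described in this post.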

                                                                     

                                                                     

                                                                      • AMD SIMD bug!
                                                                        eduardoschardong

                                                                        Maybe it's a bug with VS code generation?

                                                                         

                                                                        In the x64 asm:

                                                                        000000013FB21056  movsd       xmm2,mmword ptr [rax+8] 

                                                                        Where was [rax+8] written?
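
To make the question above concrete, here is a plain C++ model of the two stores, with no intrinsics involved. It is only an illustration under assumptions: `Xmm`, `stores_as_written` and `stores_as_in_x64_listing` are hypothetical names, and the buffer is pre-filled with 0.0 to stand in for whatever the freshly allocated heap memory happened to contain. The second function mirrors the posted x64 listing, in which only the `movhpd` store to `[rax]` appears before the values are loaded for `printf`.

```cpp
#include <cassert>

struct Xmm { double lo, hi; };  // hypothetical model of one 128-bit register

// What the source code asks for: both lanes are written.
inline void stores_as_written(double* v, Xmm x) {
    assert(v != nullptr);
    v[1] = x.lo;  // _mm_store_sd(&v[1], x):  movsd  to v+8
    v[0] = x.hi;  // _mm_storeh_pd(&v[0], x): movhpd to v
}

// What the posted x64 listing contains: the movsd store to [rax+8] is absent.
inline void stores_as_in_x64_listing(double* v, Xmm x) {
    assert(v != nullptr);
    v[0] = x.hi;  // movhpd qword ptr [rax],xmm1
    // v[1] is never written, so it keeps its previous contents.
}
```

Starting from a buffer of {0.0, 0.0} and a register holding lo = 0.3, hi = 0.5, the first function yields {0.5, 0.3} and the second yields {0.5, 0.0}, which is the pattern godsic reported for x86 and x64 respectively.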

                                                                         

                                                                        • AMD SIMD bug!
                                                                          avk

                                                                          godsic: Look, did you try to run your program on Intel CPUs? I guess that you'll get the same results. Please do it and then we'll continue.

                                                                            • AMD SIMD bug!
                                                                              godsic

                                                                              Basically, I am an AMD fan, so I don't have an Intel-based PC!

                                                                              And I think eduardoschardong is right that VS generates wrong code! But could it be due to x64 limitations?

                                                                              To AMD: how do the SIMD units fetch from memory or the L2/L3 cache?

                                                                              This bug occurs only when I write to data allocated with malloc (_aligned_malloc); if you declare static data, everything works perfectly.

                                                                              • AMD SIMD bug!
                                                                                indi123

                                                                                I think that it's extremely difficult to find a bug in CPU.

                                                                                Edit: Removed advertising from post

                                                          • AMD SIMD bug!

                                                            Thanks for the suggestion. AMD CodeAnalyst Performance Analyzer focuses on program profiling and performance analysis. It does disassemble code when it displays profile data. However, it is not intended to be a dedicated disassembly tool. We would recommend using dumpbin or one of the disassemblers that are available as freeware, shareware, or released products.

                                                            CodeAnalyst does not attach to a process in the way that debuggers attach to processes. For disassembly of short regions of code, a debugger would be an alternative to a disassembler.

                                                            CodeAnalyst collects data using system-wide sampling (using either timer-based sampling, event-based sampling, or Instruction-Based Sampling). Timer- and event-based sampling are available even on processors without Instruction-Based Sampling. Thus, if you would really like to get a disassembly using CodeAnalyst, please run and profile your program with CodeAnalyst, then drill down to the region of interest within the program.

                                                              • AMD SIMD bug!
                                                                godsic

                                                                To storia:

                                                                I think AMD's biggest disadvantage is that it doesn't listen to its customers!

                                                                It would be very useful to use CA to measure the latency of a block of code offline, without executing it! We need to see the disassembly to optimize C++ code, right?

                                                                If AMD created an all-in-one tool, it would be a great present for AMD fans!

                                                                I think PerfAnalyzer is an all-in-one tool for OpenGL/DX developers targeting ATI GPUs! So why doesn't AMD want to make the same kind of tool for CPU developers?

                                                                For example, you could create a topic to ask AMD developers what they want to see in AMD tools!

                                                                Make sense?

                                                                 

                                                                  • AMD SIMD bug!
                                                                    avk

                                                                    Like I said, I think you've run into a compiler quirk or bug. So try to find the root of the problem on Microsoft's forum.

                                                                      • AMD SIMD bug!
                                                                        godsic

                                                                        To avk: yes, you were right, but have a look at the disassembly!

                                                                        It is ugly and inefficient! AMD claimed that they work with the MSVS developers to optimize the MSVC compiler for better performance on AMD, but from the disassembly I see that they neglected all the benefits of the x64 architecture. They still prefer to use memory to store data (to AMD: do you really think that developers optimize their code for cache? :-)) even when registers could be used, which results in slower code and probably cache pollution!

                                                                        So I suspect that AMD doesn't work with Microsoft at all!

                                                                         

                                                                          • AMD SIMD bug!
                                                                            avk

                                                                            Nobody is perfect. Please don't jump to conclusions about the cooperation between Microsoft and AMD. I think that AMD cannot directly influence the quality of Microsoft's compiler. I believe that AMD just provides some information for optimization and maybe some testbeds. So if you think you've found an error in the compiler, let its author (Microsoft) know about it.