Hello.
I think I have found a bug in the AMD K11 (or maybe in all other generations too).
I wrote some code and spent 4 hours hunting a bug in it. The code looked like this:
{
...
_mm_store_sd(&v[1],XMM0);
_mm_storeh_pd(&v[0],XMM0);
}
I always use this combination to store to memory when there is no other way to optimize (it also costs me less than _mm_shuffle_pd + _mm_store_pd), and on an x86 OS it works perfectly. But when I recompiled my program for x64 I noticed incorrect results, so I spent 4 hours trying to find and correct my own errors. In the end I just tried changing the code to:
{
...
XMM1=_mm_shuffle_pd(XMM0,XMM0,_MM_SHUFFLE2(0,1));
_mm_store_pd(&v[0],XMM1);
}
After that change my program runs correctly. I suspect that this is a bug in the SIMD pipeline (probably in the instruction decoder).
Can anybody from AMD's CPU division look into this?
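For reference, the two store patterns discussed above can be compared in isolation. The sketch below is not the poster's program: the helper names (store_swapped_two_stores, store_swapped_shuffle, swapped_stores_agree) are made up for illustration, and it uses _mm_storeu_pd so that the destination does not have to be 16-byte aligned. With a correct compiler and CPU, both patterns should leave the same two values in memory.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: write the two lanes of x to dst in swapped order
   using two scalar stores (the pattern from the first snippet). */
static void store_swapped_two_stores(double *dst, __m128d x)
{
    _mm_store_sd(&dst[1], x);   /* low lane  -> dst[1] */
    _mm_storeh_pd(&dst[0], x);  /* high lane -> dst[0] */
}

/* Hypothetical helper: swap the lanes in a register, then do one store
   (the pattern from the second snippet, with an unaligned store). */
static void store_swapped_shuffle(double *dst, __m128d x)
{
    __m128d swapped = _mm_shuffle_pd(x, x, _MM_SHUFFLE2(0, 1));
    _mm_storeu_pd(dst, swapped);
}

/* Returns 1 if both variants leave {hi, lo} in memory, 0 otherwise. */
static int swapped_stores_agree(double lo, double hi)
{
    double p[2] = {0.0, 0.0};
    double q[2] = {0.0, 0.0};
    __m128d x = _mm_set_pd(hi, lo);  /* lane 0 = lo, lane 1 = hi */

    store_swapped_two_stores(p, x);
    store_swapped_shuffle(q, x);

    return p[0] == q[0] && p[1] == q[1] && p[0] == hi && p[1] == lo;
}
```

Calling swapped_stores_agree(0.3, 0.5) from a small console program and printing the result is enough to see whether a given build treats the two patterns differently.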
Thank you for bringing this to our attention. I have passed it on to our engineering team for further review.
I compiled this code using VS2008 with the following flags:
/GL /O2 /Ob1 /Oi /Ot /fp:fast /arch:SSE2 /favor:blend /Zp1 /OPT:ref /openmp /D_CRT_SECURE_NO_DEPRECATE
Some flags are ignored by the x64 compiler, but they are present because I also compile for x86.
CPU: AMD Athlon X2 QL-62
You see, I think that it's extremely difficult to find a bug in a CPU. Maybe in your case it's just a compiler quirk? Please post a disassembly listing of the problem code here.
Sorry, but I cannot do that, because CodeAnalyst does not support IBS on my CPU, and my program consists of many modules that use Tcl/Tk for the GUI. I use a script to build it. Also, I do not have any disassembler to do it.
Maybe there are other ways?
Can AMD add support for offline disassembly to CodeAnalyst?
Can AMD add the ability to attach to a running process on the fly and profile it?
That would increase the usability of CA tenfold!
Nobody wants to see your whole program. Try to isolate the buggy code fragment and write a simple function or program that proves your suspicion. BTW, Visual Studio contains a disassembler; it can be accessed while debugging.
I am modifying a GNU program, so I can show all the code, even all the sources. Nice idea. I know that VS has a disassembler, but I do not have a VS project for my program, so I can't use it :-(! OK, I will write a simple console program and show the disassembly in 10 minutes.
WIN32 code:
#include <windows.h>
#include <intrin.h>
#include <stdio.h>
void main()
{
00841000 push ebp
00841001 mov ebp,esp
00841003 and esp,0FFFFFFF0h
00841006 sub esp,30h
__m128d XMM0,XMM1,XMM3;
_declspec(align(16)) double A[2]={0.1,0.2};
00841009 movsd xmm0,mmword ptr [__real@3fb999999999999a (8422E0h)]
_declspec(align(16)) double B[2]={0.2,0.3};
_declspec(align(16)) double C[2]={0.0,0.0};
00841011 fldz
00841013 movsd mmword ptr [esp+20h],xmm0
00841019 fst qword ptr [esp]
0084101C movsd xmm0,mmword ptr [__real@3fc999999999999a (8422D0h)]
00841024 fstp qword ptr [esp+8]
00841028 movsd mmword ptr [esp+28h],xmm0
XMM0 = _mm_load_pd(A);
0084102E movapd xmm1,xmmword ptr [esp+20h]
00841034 movsd mmword ptr [esp+10h],xmm0
0084103A movsd xmm0,mmword ptr [__real@3fd3333333333333 (8422C8h)]
00841042 movsd mmword ptr [esp+18h],xmm0
XMM1 = _mm_load_pd(B);
00841048 movapd xmm0,xmmword ptr [esp+10h]
XMM3 = _mm_add_pd(XMM0,XMM1);
0084104E addpd xmm0,xmm1
_mm_store_sd(&C[1],XMM3);
00841052 movsd mmword ptr [esp+8],xmm0
_mm_storeh_pd(&C[0],XMM3);
printf("test: %e %e\n",C[0],C[1]);
00841058 fld qword ptr [esp+8]
0084105C sub esp,10h
0084105F fstp qword ptr [esp+8]
00841063 movhpd qword ptr [esp+10h],xmm0
00841069 fld qword ptr [esp+10h]
0084106D fstp qword ptr [esp]
00841070 push offset string "test: %e %e\n" (8422B8h)
00841075 call dword ptr [__imp__printf (8420A8h)]
0084107B add esp,14h
}
0084107E xor eax,eax
00841080 mov esp,ebp
00841082 pop ebp
00841083 ret
WIN64 code:
000000013FAF1002 db c4h
000000013FAF1003 sub rsp,58h
__m128d XMM0,XMM1,XMM3;
_declspec(align(16)) double A[2]={0.1,0.2};
000000013FAF1007 movsd xmm1,mmword ptr [__real@3fc999999999999a (13FAF22B8h)]
000000013FAF100F movsd xmm0,mmword ptr [__real@3fb999999999999a (13FAF22B0h)]
_declspec(align(16)) double B[2]={0.2,0.3};
_declspec(align(16)) double C[2]={0.0,0.0};
XMM0 = _mm_load_pd(A);
XMM1 = _mm_load_pd(B);
XMM3 = _mm_add_pd(XMM0,XMM1);
_mm_store_sd(&C[1],XMM3);
_mm_storeh_pd(&C[0],XMM3);
printf("test: %e %e\n",C[0],C[1]);
000000013FAF1017 lea rcx,[string "test: %e %e\n" (13FAF2298h)]
000000013FAF101E movsd mmword ptr [rax-18h],xmm0
000000013FAF1023 movsd xmm0,mmword ptr [__real@3fd3333333333333 (13FAF22A8h)]
000000013FAF102B movsd mmword ptr [rax-28h],xmm1
000000013FAF1030 movsd mmword ptr [rax-10h],xmm1
000000013FAF1035 xorpd xmm1,xmm1
000000013FAF1039 movsd mmword ptr [rax-20h],xmm0
000000013FAF103E movapd xmm0,xmmword ptr [rax-28h]
000000013FAF1043 addpd xmm0,xmmword ptr [rax-18h]
000000013FAF1048 movsd mmword ptr [rax-30h],xmm1
000000013FAF104D movsd mmword ptr [rax-38h],xmm1
000000013FAF1052 movsd mmword ptr [rax-30h],xmm0
000000013FAF1057 movhpd qword ptr [rax-38h],xmm0
000000013FAF105C movsd xmm2,mmword ptr [rax-30h]
000000013FAF1061 movsd xmm1,mmword ptr [rax-38h]
000000013FAF1066 movd r8,xmm2
000000013FAF106B movd rdx,xmm1
000000013FAF1070 call qword ptr [__imp_printf (13FAF2140h)]
}
000000013FAF1076 xor eax,eax
000000013FAF1078 add rsp,58h
000000013FAF107C ret
Is this what they call collaboration between AMD and Microsoft?
Also, the two previous pieces of code work perfectly on both x64 and x86!
This Win64 code works incorrectly:
#include <windows.h>
#include <intrin.h>
#include <stdio.h>
void main()
{
000000013FB21000 sub rsp,48h
__m128d XMM0,XMM1,XMM3;
_declspec(align(16)) double *v;
v = (double*)malloc(sizeof(double)*2);
000000013FB21004 mov ecx,10h
000000013FB21009 call qword ptr [__imp_malloc (13FB22138h)]
_declspec(align(16)) double A[2]={0.1,0.2};
_declspec(align(16)) double B[2]={0.2,0.3};
//_declspec(align(16)) double C[2]={0.0,0.0};
XMM0 = _mm_load_pd(A);
XMM1 = _mm_load_pd(B);
XMM3 = _mm_add_pd(XMM0,XMM1);
//XMM3 = _mm_shuffle_pd(XMM3,XMM3,_MM_SHUFFLE2(0,1));
//_mm_store_pd(&v[0],XMM3);
_mm_store_sd(&v[1],XMM3);
_mm_storeh_pd(&v[0],XMM3);
printf("test: %e %e\n",v[0],v[1]);
000000013FB2100F lea rcx,[string "test: %e %e\n" (13FB221B0h)]
000000013FB21016 movsd xmm0,mmword ptr [__real@3fb999999999999a (13FB221D0h)]
000000013FB2101E movsd xmm1,mmword ptr [__real@3fc999999999999a (13FB221C8h)]
000000013FB21026 movsd mmword ptr [rsp+38h],xmm1
000000013FB2102C movsd mmword ptr ,xmm1
000000013FB21032 movsd mmword ptr ,xmm0
000000013FB21038 movsd xmm0,mmword ptr [__real@3fd3333333333333 (13FB221C0h)]
000000013FB21040 movsd mmword ptr [rsp+28h],xmm0
000000013FB21046 movapd xmm1,xmmword ptr
000000013FB2104C addpd xmm1,xmmword ptr
000000013FB21052 movhpd qword ptr [rax],xmm1
000000013FB21056 movsd xmm2,mmword ptr [rax+8]
000000013FB2105B movsd xmm1,mmword ptr [rax]
000000013FB2105F movd r8,xmm2
000000013FB21064 movd rdx,xmm1
000000013FB21069 call qword ptr [__imp_printf (13FB22128h)]
}
000000013FB2106F xor eax,eax
000000013FB21071 add rsp,48h
000000013FB21075 ret
The same C++ code compiled for x86 works perfectly:
00151002 in al,dx
00151003 and esp,0FFFFFFF0h
00151006 sub esp,20h
__m128d XMM0,XMM1,XMM3;
_declspec(align(16)) double *v;
v = (double*)malloc(sizeof(double)*2);
00151009 push 10h
0015100B call dword ptr [__imp__malloc (1520A4h)]
_declspec(align(16)) double A[2]={0.1,0.2};
00151011 fld qword ptr [__real@3fb999999999999a (152118h)]
00151017 fstp qword ptr [esp+14h]
_declspec(align(16)) double B[2]={0.2,0.3};
//_declspec(align(16)) double C[2]={0.0,0.0};
XMM0 = _mm_load_pd(A);
XMM1 = _mm_load_pd(B);
XMM3 = _mm_add_pd(XMM0,XMM1);
//XMM3 = _mm_shuffle_pd(XMM3,XMM3,_MM_SHUFFLE2(0,1));
//_mm_store_pd(&v[0],XMM3);
_mm_store_sd(&v[1],XMM3);
_mm_storeh_pd(&v[0],XMM3);
printf("test: %e %e\n",v[0],v[1]);
0015101B sub esp,0Ch
0015101E fld qword ptr [__real@3fc999999999999a (152110h)]
00151024 fst qword ptr [esp+28h]
00151028 movapd xmm1,xmmword ptr [esp+20h]
0015102E fstp qword ptr [esp+10h]
00151032 fld qword ptr [__real@3fd3333333333333 (152108h)]
00151038 fstp qword ptr [esp+18h]
0015103C movapd xmm0,xmmword ptr [esp+10h]
00151042 addpd xmm0,xmm1
00151046 movsd mmword ptr [eax+8],xmm0
0015104B movhpd qword ptr [eax],xmm0
0015104F fld qword ptr [eax+8]
00151052 fstp qword ptr [esp+8]
00151056 fld qword ptr [eax]
00151058 fstp qword ptr [esp]
0015105B push offset string "test: %e %e\n" (1520F4h)
00151060 call dword ptr [__imp__printf (15209Ch)]
00151066 add esp,14h
}
00151069 xor eax,eax
0015106B mov esp,ebp
0015106D pop ebp
0015106E ret
Source:
#include <windows.h>
#include <intrin.h>
#include <stdio.h>
void main()
{
__m128d XMM0,XMM1,XMM3;
_declspec(align(16)) double *v;
v = (double*)malloc(sizeof(double)*2);
_declspec(align(16)) double A[2]={0.1,0.2};
_declspec(align(16)) double B[2]={0.2,0.3};
//_declspec(align(16)) double C[2]={0.0,0.0};
XMM0 = _mm_load_pd(A);
XMM1 = _mm_load_pd(B);
XMM3 = _mm_add_pd(XMM0,XMM1);
//XMM3 = _mm_shuffle_pd(XMM3,XMM3,_MM_SHUFFLE2(0,1));
//_mm_store_pd(&v[0],XMM3);
_mm_store_sd(&v[1],XMM3);
_mm_storeh_pd(&v[0],XMM3);
printf("test: %e %e\n",v[0],v[1]);
}
Please describe what's wrong with the buggy code fragment. I mean: "It should produce X, but I got Y."
I think the code is quite clear. We expect:
v[0]=A[1]+B[1], v[1]=A[0]+B[0];
In the x86 case printf prints: test: 5.00000000e-1, 3.00000000e-1!
In the x64 case printf prints: test: 5.00000000e-1, 0.00000000e-1!
Also, as I see in the asm, the VS compiler produces slower code for x64. Why does it split MOVAPD into 2x MOVSD?
I can only guess that the SIMD units are natively x86 (32-bit addressing)!
Maybe it's a bug in VS code generation?
In the x64 asm:
000000013FB21056 movsd xmm2,mmword ptr [rax+8]
Where was [rax+8] written? The listing has a movhpd store to [rax], but no store to [rax+8] before this load.
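One way to make a dropped store visible from C, rather than relying on whatever an uninitialized malloc block happens to contain, is to pre-fill the destination with a sentinel byte pattern and check which slots changed. This is only a sketch under stated assumptions: the function name slots_written_by_swapped_store and the 0xA5 fill byte are arbitrary choices, and the input values must not themselves equal the sentinel pattern.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <string.h>     /* memset, memcpy, memcmp */

/* Hypothetical diagnostic: returns a bitmask telling which destination
   slots were overwritten by the two-store pattern.
   bit 0 set -> dst[0] changed, bit 1 set -> dst[1] changed. */
static int slots_written_by_swapped_store(double lo, double hi)
{
    double dst[2];
    unsigned char before[sizeof dst];
    int mask = 0;
    __m128d x;

    /* Fill the destination with a recognizable pattern and remember it. */
    memset(dst, 0xA5, sizeof dst);
    memcpy(before, dst, sizeof dst);

    x = _mm_set_pd(hi, lo);      /* lane 0 = lo, lane 1 = hi */
    _mm_store_sd(&dst[1], x);    /* should overwrite dst[1] */
    _mm_storeh_pd(&dst[0], x);   /* should overwrite dst[0] */

    if (memcmp(&dst[0], before, sizeof(double)) != 0)
        mask |= 1;
    if (memcmp(&dst[1], before + sizeof(double), sizeof(double)) != 0)
        mask |= 2;
    return mask;
}
```

A return value of 3 means both stores reached memory; a value of 1 would correspond to the situation in the x64 listing, where only the movhpd store is present.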
godsic: Look, have you tried running your program on Intel CPUs? I guess that you'll get the same results. Please do that and then we'll continue.
Basically, I am an AMD fan, so I do not have an Intel-based PC!
And I think eduardoschardong is right that VS generates wrong code! But could it be due to x64 limitations?
TO AMD: How do the SIMD units fetch from memory or from the L2/L3 cache?
This bug occurs only when I write to data allocated with malloc (or _aligned_malloc); if you declare static data, everything works perfectly.
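The heap-versus-static observation can be checked directly by running the same store sequence against a stack array and a malloc block and comparing the results. The sketch below is an illustration only; swapped_sum_store and heap_matches_stack are hypothetical names, the inputs are the A and B values from the test program, and unaligned loads are used so that no alignment attributes are needed.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdlib.h>     /* malloc, free */

/* Hypothetical helper: dst[0] = a[1] + b[1], dst[1] = a[0] + b[0],
   using the same two-store pattern as the test program. */
static void swapped_sum_store(double *dst, const double *a, const double *b)
{
    __m128d sum = _mm_add_pd(_mm_loadu_pd(a), _mm_loadu_pd(b));
    _mm_store_sd(&dst[1], sum);
    _mm_storeh_pd(&dst[0], sum);
}

/* Returns 1 if the heap and stack destinations hold the same values,
   0 if they differ, and -1 if the allocation failed. */
static int heap_matches_stack(void)
{
    const double a[2] = {0.1, 0.2};
    const double b[2] = {0.2, 0.3};
    double on_stack[2] = {0.0, 0.0};
    double *on_heap = (double *)malloc(2 * sizeof(double));
    int same;

    if (on_heap == NULL)
        return -1;
    on_heap[0] = 0.0;
    on_heap[1] = 0.0;

    swapped_sum_store(on_stack, a, b);
    swapped_sum_store(on_heap, a, b);

    same = (on_stack[0] == on_heap[0]) && (on_stack[1] == on_heap[1]);
    free(on_heap);
    return same;
}
```

If a particular compiler build returns 0 here, the difference lies in the generated code for the heap case, which is easier to report than a difference inside a large program.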
Thanks for the suggestion. AMD CodeAnalyst Performance Analyzer
focuses on program profiling and performance analysis. It does disassemble
code when it displays profile data. However, it is not intended to be a
dedicated disassembly tool. We would recommend using dumpbin or
one of the disassemblers that are available as freeware, shareware, or
released products.
CodeAnalyst does not attach to a process in the way that debuggers
attach to processes. For disassembly of short regions of code, a
debugger would be an alternative to a disassembler.
CodeAnalyst collects data using system-wide sampling (using
either timer-based sampling, event-based sampling, or Instruction-
Based Sampling). Timer- and event-based sampling are available
even on processors without Instruction-Based Sampling. Thus,
if you would really like to get a disassembly using CodeAnalyst,
please run and profile your program with CodeAnalyst, then
drill-down to the region of interest within the program.
to storia:
I think AMD's biggest disadvantage is that it doesn't listen to its customers!
It would be very useful to use CA to measure the latency of a block of code offline, without executing it! After all, don't we need to see the disassembly to optimize C++ code?
If AMD created an all-in-one tool, it would be a great present to AMD fans!
I think that PerfAnalyzer is an all-in-one tool for OpenGL and DX developers focused on ATI GPUs! So why doesn't AMD want to make the same kind of tool for CPU developers?
For example, you could create a topic asking AMD developers what they want to see in AMD tools!
Makes sense?
Like I said, I think that you've run into a compiler quirk or bug. So try to find the root of the problem on Microsoft's forum.
to avk: yes, you were right, but have a look at the disassembly!
It is ugly and inefficient! AMD claimed that they work with the MSVS developers to optimize the MSVC compiler for better performance on AMD, but from the disassembly I see that they have neglected all the benefits of the x64 architecture: they still prefer to use memory to store data (to AMD: do you really think that developers optimize their code for cache? :-)) even when registers could be used, which results in slower code and probably cache pollution!
So I conclude that AMD doesn't work with Microsoft at all!
Nobody is perfect. Please don't jump to conclusions about the cooperation between Microsoft and AMD. I think that AMD cannot directly influence the quality of Microsoft's compiler. I believe that AMD just provides some information for optimization and maybe some testbeds. So if you think that you've found an error in the compiler, then let its author (Microsoft) know about it.