I wrote an assembly function using SSE to calculate a vector-matrix multiply. It works well on an Intel CPU, taking only about 30% of the time of the FPU code generated by VC8. But on my AMD CPU (Athlon X2 3600+) it takes about twice as long as the FPU version. I also tried 3DNow!, which was even worse. Is SIMD on AMD just slow?
Can someone help me? Any suggestions are welcome.
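
For reference, here is a minimal sketch of the kind of routine I mean, written with C intrinsics rather than my actual assembly. The function names, the 4x4 row-major layout, and the use of unaligned loads are all assumptions for illustration only; the real code may differ.

```c
#include <xmmintrin.h>  /* SSE intrinsics (_mm_*_ps) */

/* Scalar (FPU-style) reference: out = v * m.
   Assumption: m is a 4x4 matrix stored row-major, v is a row vector. */
static void vec_mul_mat_scalar(const float v[4], const float m[16], float out[4])
{
    for (int col = 0; col < 4; ++col) {
        out[col] = v[0] * m[0 * 4 + col]
                 + v[1] * m[1 * 4 + col]
                 + v[2] * m[2 * 4 + col]
                 + v[3] * m[3 * 4 + col];
    }
}

/* SSE version: broadcast each vector component across a register,
   multiply it by the matching matrix row, and accumulate the four rows.
   Unaligned loads/stores are used so the sketch is safe with any pointer;
   aligned variants need 16-byte aligned data. */
static void vec_mul_mat_sse(const float v[4], const float m[16], float out[4])
{
    __m128 acc = _mm_mul_ps(_mm_set1_ps(v[0]), _mm_loadu_ps(m + 0));
    acc = _mm_add_ps(acc, _mm_mul_ps(_mm_set1_ps(v[1]), _mm_loadu_ps(m + 4)));
    acc = _mm_add_ps(acc, _mm_mul_ps(_mm_set1_ps(v[2]), _mm_loadu_ps(m + 8)));
    acc = _mm_add_ps(acc, _mm_mul_ps(_mm_set1_ps(v[3]), _mm_loadu_ps(m + 12)));
    _mm_storeu_ps(out, acc);
}
```

Both functions should give the same result for the same input, so timing them against each other on the Intel and AMD machines is one way to reproduce the comparison described above.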