I wrote an assembly function using SSE to calculate a vector-matrix multiplication. It works well on an Intel CPU, taking only 30% of the time of the FPU assembly generated by VC8. But on my AMD CPU (Athlon X2 3600+) it takes about twice as long as the FPU version. I also tried 3DNow!, which performed even worse. Is AMD's SIMD simply slow?
Can someone help me? Any suggestions are welcome.
I am no expert, but you probably have to take into account that the AMD K8's SSE unit is much slower than the Intel C2D's, since it can process only 64 bits per clock cycle.
Also, the memory access pattern can be a very influential factor.
I was toying with some asm routines in the Linux kernel and managed to speed them up on K8/K10 just by removing a couple of prefetches that were supposed to improve performance on Intel.
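For reference while the actual source is not posted, here is a rough sketch of what such a routine might look like when written with SSE intrinsics in C. This is not the original poster's code. It assumes a 4-element row vector and a 4x4 row-major float matrix, and the function names vec4_mul_mat4_sse and vec4_mul_mat4_scalar are made up for illustration. It has no prefetch instructions and uses unaligned loads, so it can serve as a plain baseline to time against the scalar version on both CPUs.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Hypothetical helper: out = v * m, with v a 4-element row vector and
   m a 4x4 row-major matrix (assumed layout, not taken from the post).
   Unaligned loads/stores are used so the buffers need no 16-byte alignment. */
static void vec4_mul_mat4_sse(const float v[4], const float m[16], float out[4])
{
    /* Load the four matrix rows. */
    __m128 r0 = _mm_loadu_ps(m + 0);
    __m128 r1 = _mm_loadu_ps(m + 4);
    __m128 r2 = _mm_loadu_ps(m + 8);
    __m128 r3 = _mm_loadu_ps(m + 12);

    /* Broadcast each vector component across a register, scale the
       matching row by it, and accumulate the four scaled rows. */
    __m128 acc = _mm_mul_ps(_mm_set1_ps(v[0]), r0);
    acc = _mm_add_ps(acc, _mm_mul_ps(_mm_set1_ps(v[1]), r1));
    acc = _mm_add_ps(acc, _mm_mul_ps(_mm_set1_ps(v[2]), r2));
    acc = _mm_add_ps(acc, _mm_mul_ps(_mm_set1_ps(v[3]), r3));

    _mm_storeu_ps(out, acc);
}

/* Scalar reference with the same layout, useful for checking results
   and as the baseline in a timing comparison. */
static void vec4_mul_mat4_scalar(const float v[4], const float m[16], float out[4])
{
    for (int j = 0; j < 4; ++j) {
        float sum = 0.0f;
        for (int i = 0; i < 4; ++i)
            sum += v[i] * m[i * 4 + j];
        out[j] = sum;
    }
}
```

Timing both functions over a large number of iterations on each machine should show whether the slowdown comes from the SSE arithmetic itself or from something else in the original routine, such as its memory access pattern.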
Could you describe the problem in detail and post source code if possible? We can take a look at it.
This response is provided for informational purposes only, is provided “AS IS” and does not obligate AMD to provide any of the services, technology, or programs described.
I think a part of the source code will be enough.