6 Replies Latest reply on Dec 17, 2009 6:42 AM by wilsncharle

    New bug found in AMD CPU Athlon 64 X2 5400+


      I wrote an image rotation function. On an AMD Athlon 64 X2 5400+ (HP CQ3009cx desktop PC), when the image height is exactly 1536, the function suddenly becomes about seven times slower, which is terrible. My OS is Windows XP SP3, the compiler is Visual Studio 2005, and the code runs normally on Intel CPUs. Below is some output from the Athlon X2 5400+. The problem also occurs on an Athlon X2 QL-66, and maybe on other AMD processors.


      Height = 1535. time: 312 ms.

      Height = 1536. time: 2219 ms.

      Height = 1537. time: 297 ms.

      I contacted both AMD support and HP support, but neither could help me confirm this problem, so I am posting it here. Can any AMD engineer confirm this problem for me?



      // TestRotation.cpp : Defines the entry point for the console application.
      //

      #include <stdio.h>
      #include <windows.h>

      #define WIDTH       2144
      #define HEIGHT      1536    // 1536, bad
      #define LOOP_COUNT  16      // we count from (1536 - 8) to (1536 + 8)

      #define BEGIN_LOG_TIME() \
          DWORD dwTickStart_, dwTickEnd_; \
          dwTickStart_ = GetTickCount();

      #define END_LOG_TIME_TRACE() \
          dwTickEnd_ = GetTickCount(); \
          printf("time: %d ms.\n", dwTickEnd_ - dwTickStart_);

      int CC_Rotation(BYTE *pBufIn, BYTE *pBufOut, int nHeight, int nWidth);

      //////////////////////////////////////////////////////////////////////////
      int main(int argc, char* argv[])
      {
          printf("Test Rotation for AMD CPU Athlon 64 X2 (5400+ or others ... ).\n");

          int nBufSize = WIDTH * (HEIGHT + LOOP_COUNT);
          BYTE *pIn, *pOut;
          pIn = new BYTE [nBufSize];
          pOut= new BYTE [nBufSize];

          for (int k=0; k<LOOP_COUNT; k++)
          {
              int nHeight = (HEIGHT - LOOP_COUNT/2) + k;
              printf("Height = %d. ", nHeight);

              BEGIN_LOG_TIME();
              for (int i=0; i<10; i++)
                  CC_Rotation(pIn, pOut, nHeight, WIDTH);
              END_LOG_TIME_TRACE();
          }

          printf("Press return key to end !");
          getchar();

          delete [] pIn;
          delete [] pOut;
          return 0;
      }

      int CC_Rotation(BYTE *pBufIn, BYTE *pBufOut, int nHeight, int nWidth)
      {
          int i,j;
          BYTE *p = pBufIn, *q = pBufOut;
          BYTE *pHead = NULL;

          p = pBufIn;
          pHead = pBufOut + nHeight;
          q = pHead;

          for(i=0; i<nHeight; i++)
          {
              q = --pHead;
              for (j=0; j<nWidth; j++)
              {
                  *q = *p++;
                  q += nHeight;
              }
          }
          return 0;
      }

        • New bug found in AMD CPU Athlon 64 X2 5400+

          Maybe the problem is in the Athlon X2 5400+'s TLB, or maybe in its relatively small L2 cache (0.5 MB per core, whereas the Intel Core 2 has 4 or 6 MB). BTW, there are many libraries on the Internet (AMD Framewave, Intel IPP and so on) that can perform image rotation (as well as many other functions); most of them are written in assembly language and work much faster than your C version.

          • New bug found in AMD CPU Athlon 64 X2 5400+

            Your stride size on q is probably large enough (nHeight) that you're going to see a LOT of misses there. I'm also unclear why you are using pointer arithmetic rather than array indexing; any decent compiler should have no problem optimizing out the address calculations so the performance should be similar and the code would be easier to read.

            Note that your application (image rotation) is a textbook example of where memory optimization can pay off. Using a blocking approach could result in a dramatic performance improvement (5x to 10x or more).
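A minimal sketch of what a blocking (tiled) version of the rotation could look like. It is an illustration only, not a tuned routine: the names `rotate_simple`, `rotate_blocked` and `blocked_matches_simple` are hypothetical, the tile size of 64 is an assumption that would need tuning per CPU, and `std::uint8_t` stands in for the `BYTE` type of the original post. `rotate_simple` reproduces the mapping of `CC_Rotation` with array indexing so that the two versions can be compared.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Same mapping as CC_Rotation in the original post, written with indices:
// input pixel (row i, column j) goes to output row j, column height-1-i.
void rotate_simple(const std::uint8_t* in, std::uint8_t* out, int height, int width) {
    for (int i = 0; i < height; ++i)
        for (int j = 0; j < width; ++j)
            out[static_cast<std::size_t>(j) * height + (height - 1 - i)] =
                in[static_cast<std::size_t>(i) * width + j];
}

// Tiled variant: process the image in block x block tiles so that the cache
// lines touched in both buffers are reused before moving on.
// The default tile size (64) is an assumed value, not a measured optimum.
void rotate_blocked(const std::uint8_t* in, std::uint8_t* out,
                    int height, int width, int block = 64) {
    assert(block > 0);
    for (int i0 = 0; i0 < height; i0 += block) {
        const int i1 = std::min(i0 + block, height);
        for (int j0 = 0; j0 < width; j0 += block) {
            const int j1 = std::min(j0 + block, width);
            for (int j = j0; j < j1; ++j) {
                // Output row j is contiguous; fill the part covered by this tile.
                std::uint8_t* dst = out + static_cast<std::size_t>(j) * height;
                for (int i = i0; i < i1; ++i)
                    dst[height - 1 - i] = in[static_cast<std::size_t>(i) * width + j];
            }
        }
    }
}

// Usage example: check that both versions produce identical output.
bool blocked_matches_simple(int height, int width) {
    const std::size_t n = static_cast<std::size_t>(height) * width;
    std::vector<std::uint8_t> in(n), a(n), b(n);
    for (std::size_t k = 0; k < n; ++k)
        in[k] = static_cast<std::uint8_t>(k * 31u + 7u);  // arbitrary test pattern
    rotate_simple(in.data(), a.data(), height, width);
    rotate_blocked(in.data(), b.data(), height, width);
    return a == b;
}
```

Inside a tile the inner loop writes consecutive bytes of one output row and reads from at most `block` input rows, so the large stride only appears between tiles. Whether this actually helps, and by how much, has to be measured on the target CPU.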

            As avk pointed out, there are already optimized routines that do this, so you should probably be using those. Assembly vs. C isn't the critical factor, since the Microsoft C compiler is pretty good; you might see a slight improvement by moving to hand-optimized assembly code, but it's not huge. The critical factor is making the best use of the CPU's cache and TLB, and secondarily using the right vector instructions (e.g. SSE3).

            There are a lot of good articles/books on cache optimization if you want/need to write your own algorithm, but typically for the best performance you want to have different optimized code paths for different CPU families (Athlon 64 vs. Core 2, for example) and possibly even major CPU versions (e.g. Phenom vs Athlon 64). Unless you are targeting a single CPU, I would recommend trying to leverage library functions as much as possible, since they can be optimized for different CPUs without you having to worry about writing multiple versions of your code.

              • New bug found in AMD CPU Athlon 64 X2 5400+

                bsoft, thank you for your answer. The reason I did not use array indexing is that the compiler (VS2005) generates a multiplication for it. I think a multiplication takes many more CPU cycles than an addition, but I have not compared which matters more: optimizing away the multiplication or optimizing data access for the CPU cache.

                Of course, my function is not optimized for the CPU cache or pipeline, since it is a simple routine for a simple application.

                The other question is: if a height of 1536 falls out of the cache, why doesn't 1537? We can see that performance at 1537 is normal.
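One possible explanation for the 1536 vs. 1537 question is cache set conflicts rather than cache size, though nothing in this thread confirms it. In a set-associative cache, an address maps to a set by its cache-line number modulo the number of sets, so a stride that shares a large factor with the number of sets keeps landing in the same few sets. The sketch below counts the distinct sets touched by the writes of one inner loop (2144 writes, `nHeight` bytes apart). The geometry is an assumption chosen to resemble a 512 KB, 16-way L2 with 64-byte lines (512 sets); `distinct_sets` is a hypothetical helper, not part of any library.

```cpp
#include <cassert>
#include <cstddef>
#include <set>

// Count how many distinct cache sets are touched by `count` accesses that are
// `stride` bytes apart, starting at address 0.
// line_bytes and num_sets describe an ASSUMED cache geometry.
std::size_t distinct_sets(std::size_t stride, std::size_t count,
                          std::size_t line_bytes = 64, std::size_t num_sets = 512) {
    assert(line_bytes > 0 && num_sets > 0);
    std::set<std::size_t> sets;
    for (std::size_t k = 0; k < count; ++k) {
        const std::size_t line = (k * stride) / line_bytes;  // cache-line number
        sets.insert(line % num_sets);                        // set index
    }
    return sets.size();
}
```

Under these assumptions, `distinct_sets(1536, 2144)` is 64, while `distinct_sets(1535, 2144)` and `distinct_sets(1537, 2144)` are both 512. With 16 ways, 64 sets hold only 1024 lines, fewer than the 2144 lines one inner loop touches, so lines could be evicted before the next row reuses them. At 1535 or 1537 the same lines spread over all 512 sets and fit easily.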