51 Replies Latest reply on Dec 21, 2009 6:14 PM by MicahVillmow

    HD 5870 and 5970 working :-)

    zpdixon

      I ordered an HD 5870 from Newegg on Tuesday (5 min after a script I wrote alerted me to its availability), received it today (Thursday), upgraded the ATI drivers on my 64-bit Linux GPGPU dev box to version 9.9, kept the SDK at version 1.4, and compiled a test program to measure the FLOPS rating:

      2662 GFLOPS, or 98% of the max theoretical 2720 GFLOPS
      This is 36% more FLOPS than my 4850 X2 cards, at 81% the power consumption. Everything just worked on the first attempt even though the card is not yet "officially supported" by the 9.9 Linux drivers - I love it :-)
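For reference, the 2720 GFLOPS figure is what you get from 1600 stream processors at 850 MHz, each retiring one MAD (2 floating-point ops) per clock. Below is a minimal sketch of that arithmetic; the function names are made up for illustration, and the card specifications are stock values assumed here rather than stated in the post.

```c
#include <assert.h>
#include <stdio.h>

/* Peak single-precision GFLOPS, assuming every stream processor
 * retires one MAD (2 floating-point ops) per clock. */
double peak_gflops(unsigned stream_processors, unsigned clock_mhz)
{
    return stream_processors * (double)clock_mhz * 2.0 / 1000.0;
}

/* Measured throughput as a percentage of the theoretical peak. */
double efficiency_pct(double measured_gflops, double peak)
{
    assert(peak > 0.0);
    return 100.0 * measured_gflops / peak;
}

/* Print the HD 5870 numbers quoted in the post above. */
void print_hd5870_efficiency(void)
{
    double peak = peak_gflops(1600, 850);   /* assumed stock specs */
    printf("peak %.0f GFLOPS, measured 2662 GFLOPS = %.1f%% of peak\n",
           peak, efficiency_pct(2662.0, peak));
}
```

The "36% more" claim is consistent with a 4850 X2 peak of 2000 GFLOPS (2 x 800 stream processors at 625 MHz, again an assumed stock specification), since 2720 / 2000 = 1.36.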

      Update: I have two HD 5970s working too.

        • HD 5870 working :-)
          empty_knapsack

          What does CAL return in info.target for calDeviceGetInfo(&info, deviceno)? Is it still 5 == CAL_TARGET_770?

            • HD 5870 working :-)
              zpdixon
              Originally posted by: empty_knapsack What does CAL return in info.target for calDeviceGetInfo(&info, deviceno)? Is it still 5 == CAL_TARGET_770?

              No, it returns 8, which probably corresponds to (as of today undefined) CAL_TARGET_8XX.
                • HD 5870 working :-)
                  empty_knapsack

                   

                  No, it returns 8, which probably corresponds to (as of today undefined) CAL_TARGET_8XX.


                  Thanks, zpdixon.

                  ...Although it means I'll need more kernel recompilations in the future.

                   

                  Are you using only Linux? Any chance you could run some win32 tests with your 5870?

                    • HD 5870 working :-)
                      zpdixon
                      I am only using Linux. Though if you can find me a good guide to boot Windows on my diskless GPGPU dev box, I may give it a try :-)
                        • HD 5870 working :-)
                          empty_knapsack

                           

                          Originally posted by: zpdixon I am only using Linux. Though if you can find me a good guide to boot Windows on my diskless GPGPU dev box, I may give it a try :-)


                          I guess it'll be easier to recompile my tests for Linux. Or even better -- to wait for some 5870s with win32 to turn up.

                           

                          Also, after some meditations with CAL I realized that 8 == CAL_TARGET_870 and 9 == CAL_TARGET_830.
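For anyone keeping track of these values, here is a small lookup sketch. The IDs are the ones reported in this thread; the names for 8 and 9 are guesses made here rather than anything taken from the SDK 1.4 headers, and cal_target_name, cal_target_is_known and describe_target are made-up helper names.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* info.target values mentioned in this thread. The table is deliberately
 * incomplete: any other ID is reported as "unknown". */
struct target_name { unsigned id; const char *name; };

static const struct target_name known_targets[] = {
    { 5, "CAL_TARGET_770" },
    { 8, "CAL_TARGET_870" },  /* unofficial name, reported for the HD 5870 */
    { 9, "CAL_TARGET_830" },  /* unofficial name */
};

const char *cal_target_name(unsigned id)
{
    size_t n = sizeof known_targets / sizeof known_targets[0];
    for (size_t i = 0; i < n; i++)
        if (known_targets[i].id == id)
            return known_targets[i].name;
    return "unknown";
}

/* 1 if the ID is one of the values listed above, 0 otherwise. */
int cal_target_is_known(unsigned id)
{
    return strcmp(cal_target_name(id), "unknown") != 0;
}

/* Format a short description such as "info.target = 8 (CAL_TARGET_870)". */
void describe_target(unsigned id, char *buf, size_t len)
{
    assert(buf != NULL && len > 0);
    snprintf(buf, len, "info.target = %u (%s)", id, cal_target_name(id));
}
```

A tool that ships precompiled kernels per target could use a check like cal_target_is_known() to decide whether it needs to recompile for a newly reported ID.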

                            • HD 5870 working :-)
                              zpdixon

                               

                              Originally posted by: empty_knapsack

                              I guess it'll be easier to recompile my tests for linux. Or even better -- to wait some 5870s with win32 around .

                               

                               

                              Also, after some meditations with CAL I realized that 8 == CAL_TARGET_870 and 9 == CAL_TARGET_830.



                              Well, I have good news for you. I set up my diskless box to boot 32-bit XP from an iSCSI SAN (using gPXE and an OpenSolaris ZFS volume iSCSI target in case anyone is interested).

                              And I tested your ighashgpu v0.61 bruteforcer -- I assume this is what you were about to ask me :-) So my 5870 with 9.9 drivers does about 2430 MHash/s for MD5 and 2670 MHash/s for MD4. -cpudontcare doesn't produce any noticeable difference. I assume that without it you are not busy-looping on calCtxIsEventDone, but then you are probably doing something wrong because without it it still uses about 60% of one of the cores of my 3.0GHz Core 2 Duo.

                              Also I found a bug (v0.61): according to the doc the tool supports -t:sha1 and -t:mysql5, however they don't seem implemented (prints a generic error message stating the arguments are invalid).

                              If you need me to test anything else, let me know.

                                • HD 5870 working :-)
                                  empty_knapsack

                                   

                                  Originally posted by: zpdixon

                                  Well I have a good news for you. I set up my diskless box to boot 32-bit XP from an iSCSI SAN (using gPXE and an OpenSolaris ZFS volume iSCSI target in case anyone is interested).  And I tested your ighashgpu v0.61 bruteforcer -- I assume this is what you were about to ask me :-) So my 5870 with 9.9 drivers does about 2430 MHash/s for MD5 and 2670 MHash/s for MD4.



                                  Thanks for the tests. I've already had some reports from 5870 owners; results are very close to the expected ones, but a bit lower for MD5.

                                   

                                  -cpudontcare doesn't produce any noticeable difference. I assume that without it you are not busy-looping on calCtxIsEventDone, but then you are probably doing something wrong because without it it still uses about 60% of one of the cores of my 3.0GHz Core 2 Duo.


                                  With the 5870, the MD5 kernel runs for only 3.5 ms. Without the -cpudontcare switch, ighashgpu tries to estimate the kernel's run time and calls Sleep() before calCtxIsEventDone to avoid CPU load. With such a short 3.5 ms execution we can sleep for only 2-3 ms, so for the remaining 0.5-1.5 ms we're aggressively polling calCtxIsEventDone, hence the >50% CPU usage. For other kernels (like mysql5) you should notice the difference in CPU usage.

                                  Actually it's a shame that ATI doesn't have a flag similar to nVidia's CU_CTX_BLOCKING_SYNC for creating a GPU context. Burning CPU time in a while (calCtxIsEventDone() == CAL_RESULT_PENDING); loop isn't smart behavior at all.
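The sleep-then-poll idea described above can be sketched roughly as follows. This is not ighashgpu's code: it uses POSIX nanosleep() instead of the Win32 Sleep(), the poll_fn callback is a hypothetical stand-in for a calCtxIsEventDone() check, and countdown_done() is a fake event used only to exercise the loop.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L
#endif
#include <assert.h>
#include <time.h>

/* Hypothetical stand-in for "calCtxIsEventDone() != CAL_RESULT_PENDING":
 * returns nonzero once the event has completed. */
typedef int (*poll_fn)(void *arg);

/* Sleep for most of the expected run time, then poll with short naps
 * instead of spinning. Returns the number of polls performed. */
unsigned wait_sleep_then_poll(poll_fn is_done, void *arg,
                              long expected_ns, long margin_ns, long nap_ns)
{
    assert(is_done != NULL && nap_ns > 0 && nap_ns < 1000000000L);

    long first = expected_ns - margin_ns;
    if (first > 0) {
        struct timespec ts = { .tv_sec = first / 1000000000L,
                               .tv_nsec = first % 1000000000L };
        nanosleep(&ts, NULL);
    }

    unsigned polls = 1;
    while (!is_done(arg)) {
        struct timespec nap = { .tv_sec = 0, .tv_nsec = nap_ns };
        nanosleep(&nap, NULL);
        polls++;
    }
    return polls;
}

/* Fake event for testing: reports "pending" *remaining more times. */
int countdown_done(void *arg)
{
    unsigned *remaining = arg;
    if (*remaining == 0)
        return 1;
    (*remaining)--;
    return 0;
}
```

The trade-off is latency: each nap can delay noticing completion by up to nap_ns, and the OS may round short sleeps up to its timer granularity, which matters when the whole kernel runs for only a few milliseconds.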

                                   

                                  Also I found a bug (v0.61): according to the doc the tool supports -t:sha1 and -t:mysql5, however they don't seem implemented (prints a generic error message stating the arguments are invalid).


                                  You probably mistyped some parameters; try the examples from the documentation. Both sha1 & mysql5 work (at least with HD4800).

                                   

                                  P.S. WTB "Preview" button on these forums!

                                   

                                    • HD 5870 working :-)
                                      zpdixon

                                      You are right about sha1/mysql5, I was accidentally passing a 128-bit hash instead of 160-bit... Doh!

                                      Results: 790 MHash/s for sha1 and 420 MHash/s for mysql5.
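The 128-bit versus 160-bit mix-up is easy to catch up front, because the hex string length gives the digest width (32 hex characters for MD4/MD5, 40 for SHA-1 and for MySQL 5 password hashes). Here is a minimal sketch of such a check. The helper names are invented for illustration, and the type strings merely mirror the -t: values mentioned in this thread ("md4" and "md5" are assumed spellings); this is not how ighashgpu itself validates input.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Digest width in bits for a hex string, or 0 if the string is empty
 * or contains a non-hex character. */
unsigned hex_digest_bits(const char *hex)
{
    assert(hex != NULL);
    size_t n = strlen(hex);
    for (size_t i = 0; i < n; i++)
        if (!isxdigit((unsigned char)hex[i]))
            return 0;
    return (unsigned)(n * 4);
}

/* Expected digest width for a hash type name, or 0 if unrecognized. */
unsigned expected_bits(const char *type)
{
    assert(type != NULL);
    if (strcmp(type, "md4") == 0 || strcmp(type, "md5") == 0)
        return 128;
    if (strcmp(type, "sha1") == 0 || strcmp(type, "mysql5") == 0)
        return 160;
    return 0;
}

/* 1 if the hex digest has the width the hash type expects. */
int digest_matches_type(const char *type, const char *hex)
{
    unsigned want = expected_bits(type);
    return want != 0 && hex_digest_bits(hex) == want;
}
```

With a check like this, passing an MD5-sized hash together with a sha1 type could be reported as a width mismatch rather than as a generic invalid-arguments error.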

                                        • HD 5870 working :-)
                                          empty_knapsack

                                          Results for single hashes are very close to the expected ones. The HD5870 should be 2.83 times faster than the 4770 at stock clocks (this obviously applies only to hash cracking), and in tests it's around 2.7-2.8x, so it's OK.
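The 2.83x figure looks like the plain ratio of (stream processors x core clock) between the two cards. A quick sketch of that arithmetic, assuming stock specifications of 1600 SPs at 850 MHz for the HD 5870 and 640 SPs at 750 MHz for the HD 4770 (these numbers are assumptions, not stated in the post), with a made-up function name:

```c
#include <assert.h>

/* Ratio of raw ALU throughput between two GPUs whose ALUs do the same
 * work per clock: (stream processors x clock) of the new card divided
 * by the same product for the old card. */
double alu_speedup(unsigned sp_new, unsigned mhz_new,
                   unsigned sp_old, unsigned mhz_old)
{
    assert(sp_old > 0 && mhz_old > 0);
    return ((double)sp_new * mhz_new) / ((double)sp_old * mhz_old);
}
```

Under those assumptions, alu_speedup(1600, 850, 640, 750) comes out at roughly 2.83. The estimate only holds for ALU-bound kernels such as single-hash cracking; it says nothing about kernels limited by memory fetches.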

                                          However, looking at the ISA generated for the HD5800, I noticed that the only difference is in the clauses for memory fetching; the ALU clauses are exactly the same as for the HD4800s. And when running a kernel that depends heavily on memory fetches (multihash MD5), the RV870's performance doesn't look good at all: it dropped by 50% from the expected value. It's hard to say whether this is a driver issue, an RV870 issue, or something specific to my kernel.

                                            • HD 5870 working :-)
                                              eduardoschardong

                                              Could you post the part of generated ISA that differs?

                                                • HD 5870 working :-)
                                                  empty_knapsack

                                                   

                                                  Originally posted by: eduardoschardong Could you post the part of generated ISA that differs?


                                                  Actually, after careful examination of the ISAs I've realized that the only difference is the new VFETCH thing:

                                                   

                                                   01 TEX: ADDR(3168) CNT(1)
                                                  3 VFETCH R0.xy__, R2.x, fc147 MEGA(8)
                                                    FETCH_TYPE(NO_INDEX_OFFSET)


                                                  There are some minor changes before this, but all other ALU clauses are exactly the same.

                                                  If this VFETCH is in fact a "prefetch" instruction, then it's probably the reason for the slowdown. Unfortunately (again, as always) the CAL compiler offers no control over this, so it's impossible to remove the VFETCH and check how that affects performance.

                              • HD 5870 working :-)
                                MicahVillmow
                                zpdixon,
                                This is expected as CAL is not released with the SDK but is released with the Catalyst drivers on a monthly basis. So although a GPU SDK has not been released since 1.4, CAL upgrades and improvements have been happening on a monthly basis. That being said, the only driver that is officially 'supported' and tested on is the driver that the SDK was released with. Drivers before or after that release are use at your own risk.
                                  • HD 5870 working :-)
                                    ryta1203

                                     

                                    Originally posted by: MicahVillmow zpdixon, This is expected as CAL is not released with the SDK but is released with the Catalyst drivers on a monthly basis. So although a GPU SDK has not been released since 1.4, CAL upgrades and improvements have been happening on a monthly basis. That being said, the only driver that is officially 'supported' and tested on is the driver that the SDK was released with. Drivers before or after that release are use at your own risk.


                                    Micah,

                                    You say there are CAL updates and improvements in the driver, but none of this information can be found in the release notes... can you guys please start putting this information in the release notes!?

                                  • HD 5870 working :-)
                                    c360

                                    I have not been able to get my HD 5870 running under linux with cat 9.9.  Any special changes you made to the system?


                                    Thanks!

                                      • HD 5870 working :-)
                                        zpdixon
                                        No changes. This is a standard Ubuntu 8.04 amd64 system with ATI drivers 9.9 and the Stream SDK 1.4. Can you modprobe fglrx? Does the Xorg driver detect your card?
                                          • HD 5870 working :-)
                                            c360

                                            I am using openSUSE 11.1 x64.  Have Stream SDK 1.4 and installed/upgraded to 9.9 from ATI's site.

                                            The 5870 is the first card and there is also a 3870 in the system.  Boots fine to console.  aticonfig --list-adapters only shows the 3870.

                                            When starting X (after editing xorg.conf with the adapter info), an AMD "not supported hardware" logo came up, and when I tried running a stream application the system froze in under 10 seconds.

                                            EDIT: since this was crippling Linux, I pulled the card and restored the config.  I still end up with the system locking up using 9.9.  Guess there are more issues afoot.  Will change my dev/test partition to Ubuntu 8.04 and see if there is a difference.

                                            • HD 5870 working :-)
                                              nnsan

                                              With exactly the same config as zpdixon (Ubuntu 8.04, Catalyst 9.9, CAL 1.4beta), the 5870 works fine on our system. I posted our benchmark results here. This is great!

                                                • HD 5870 working :-)
                                                  c360

                                                  Does aticonfig --list-adapters report 5870 as an available adapter?

                                                    • HD 5870 working :-)
                                                      curryml

                                                      For me, this is not true.  Here is the set of steps I have taken:

                                                       

                                                      1. Clean install of Ubuntu 8.04
                                                      2. Installation of ATI Catalyst 9.9
                                                      After this, there's no hope for doing anything with this card, as aticonfig complains of no supported adapters detected.
                                                      lspci says there's a VGA compatible controller: ATI Technologies Inc Unknown device 6898.  Inserting fglrx into xorg.conf does no good either.
                                                      Perhaps it's just in the transition from a working device to the new one, tricking the computer into accepting it, I suppose.  
                                                      Any tips?


                                                        • HD 5870 working :-)
                                                          nnsan

                                                          On my system, "aticonfig --list-adapters" reports "No supported adapters detected".

                                                          But xorg starts up with an "unsupported hardware" logo at the lower right corner of the screen.

                                                          Still, hellocal and other CAL programs work well.

                                                            • HD 5870 working :-)
                                                              c360

                                                               

                                                              Originally posted by: nnsan On my system, "aticonfig --list-adapters" reports "No supported adapters detected".

                                                              But xorg starts up with an "unsupported hardware" logo at the lower right corner of the screen.

                                                              Still, hellocal and other CAL programs work well.

                                                              Great!  That is what I get too. I just received a new build of the CAL application that was locking the system up with Cat 9.9, and it is working now.  It's not tuned for the 5870 yet, but will be shortly.

                                                              As a result, I am working under openSUSE 11.1/11.2 with the Cat 9.9 x86_64 drivers.

                                                  • HD 5870 working :-)
                                                    gpgpu_4870

                                                     

                                                    Originally posted by: zpdixon I ordered an HD 5870 from Newegg on Tuesday (5min after a script I wrote sent me an alert indicating its availability ), I received it today Thursday, upgraded the ATI drivers on my 64-bit Linux GPGPU dev box to version 9.9, kept the SDK to version 1.4, and compiled a test program to measure the FLOPS rating:
                                                    2662 GFLOPS, or 98% of the max theoretical 2720 GFLOPS
                                                    This is 36% more FLOPS than my 4850 X2 cards, at 81% the power consumption. Everything just worked on the first attempt even though the card is not yet "officially supported" by the 9.9 Linux drivers - I love it :-)


                                                    Hi zpdixon. I'm new to ATI Stream and I'm very interested to know what program you used to measure the GFLOPS performance of your card. Thanks in advance.

                                                      • HD 5870 working :-)
                                                        zpdixon

                                                         

                                                        Originally posted by: gpgpu_4870

                                                         

                                                        Hi zpdixon. I'm new to ATI Stream and I'm very interested to know what program you used to measure the GFLOPS performance of your card. Thanks in advance.

                                                         

                                                        It's a very simple tool I wrote: a CAL IL kernel running MAD instructions in a while loop. In my opinion AMD developers should include an equivalent tool in the SDK. I can post the source code if you want.

                                                         

                                                          • HD 5870 working :-)
                                                            gpgpu_4870

                                                            Thanks for the reply, zpdixon. It would be great if you could post the source code so that I can run the benchmark on my current 4870 card and my upcoming 5870.

                                                            Something else. Have you or anyone else noticed that with the latest Catalyst 9.11 beta drivers that support OpenCL, CAL performance (e.g. the simple_matmult sample) has decreased by a huge margin (I mean the GFLOPS value returned)?


                                                              • HD 5870 working :-)
                                                                zpdixon

                                                                Ok, here it is. I compile it under Linux with "gcc -std=c99 -pedantic -Wextra -Wall -Werror -I/usr/local/amdcal/include  -c -o ilperf ilperf.c". The file.il must end with a NUL byte due to my use of mmap().
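A note on the NUL-byte requirement: it presumably exists because the mmap()ed file contents are handed to the IL compiler as a C string (the part of the program that does this is not shown in the listing as posted here). One way to avoid having to pad the file, sketched under that assumption, is to read it into a buffer that gets its own terminator. read_text_file is a made-up helper name and is not part of ilperf.c.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Read a whole file into a freshly allocated, NUL-terminated buffer.
 * Returns NULL on any failure; the caller frees the result. */
char *read_text_file(const char *path)
{
    assert(path != NULL);
    FILE *f = fopen(path, "rb");
    if (f == NULL)
        return NULL;

    char *buf = NULL;
    if (fseek(f, 0, SEEK_END) == 0) {
        long size = ftell(f);
        if (size >= 0 && fseek(f, 0, SEEK_SET) == 0) {
            buf = malloc((size_t)size + 1);
            if (buf != NULL) {
                if (fread(buf, 1, (size_t)size, f) == (size_t)size) {
                    buf[size] = '\0';   /* the terminator the file lacks */
                } else {
                    free(buf);          /* short read: treat as failure */
                    buf = NULL;
                }
            }
        }
    }
    fclose(f);
    return buf;
}
```

The cost is one extra copy of a small text file, which is negligible next to the kernel compile time.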

                                                                ---BEGIN-ilperf.c---

                                                                #define _POSIX_C_SOURCE 199309L
                                                                #include <sys/mman.h>
                                                                #include <sys/types.h>
                                                                #include <sys/stat.h>
                                                                #include <sys/time.h>
                                                                #include <unistd.h>
                                                                #include <fcntl.h>
                                                                #include <stdio.h>
                                                                #include <string.h>
                                                                #include <stdlib.h>
                                                                #include <time.h>
                                                                #include <cal.h>
                                                                #include <calcl.h>
                                                                #include <stdbool.h>

                                                                #define NR_GROUPS 10
                                                                #define THREADS_PER_GRP 512
                                                                #define NR_ITERATIONS 0x100000
                                                                #define NR_MAD_INSN 209

                                                                typedef struct    gpu_state_s
                                                                {
                                                                    CALdevice        device;
                                                                    CALcontext        ctx;
                                                                    CALresource        outputRes; // <width> UINTs
                                                                    CALmem        outputMem;
                                                                    CALfunc        entry;
                                                                    CALprogramGrid    pg;
                                                                    CALevent        e;
                                                                    struct timeval    tv_end;
                                                                }        gpu_state_t;

                                                                void fatal(const char *func_name)
                                                                {
                                                                    const char *cal_msg = calGetErrorString();
                                                                    const char *comp_msg = calclGetErrorString();
                                                                    fprintf(stderr, "%s failed - cal:[%s] calcl:[%s]\n",
                                                                        func_name, cal_msg, comp_msg);
                                                                    // calGetErrorString error messages are prematurely truncated because of
                                                                    // stray NUL chars. print the message until 3 consecutive NUL chars are
                                                                    // encountered
                                                                    fprintf(stderr, "Full CAL error: ");
                                                                    for (int i = 0; i < 128; i++) {
                                                                        if (i > 1 && !cal_msg[i - 2] && !cal_msg[i - 1] && !cal_msg[i])
                                                                            break;
                                                                        if (cal_msg[i])
                                                                            fprintf(stderr, "%c", cal_msg[i]);
                                                                    }
                                                                    fprintf(stderr, "\n");
                                                                    exit(1);
                                                                }

                                                                void show_ver()
                                                                {
                                                                    CALuint major, minor, imp;
                                                                    if (CAL_RESULT_OK != calGetVersion(&major, &minor, &imp))
                                                                    fatal("calGetVersion");
                                                                    printf("CAL version %u.%u.%u\n", major, minor, imp);
                                                                }

                                                                void show_stats(CALuint devi, struct timeval *v0, struct timeval *v1)
                                                                {
                                                                    long long ms0 = v0->tv_sec * 1000 + v0->tv_usec / 1000;
                                                                    long long ms1 = v1->tv_sec * 1000 + v1->tv_usec / 1000;
                                                                    printf("Device %d: execution time %lld ms, achieved %lld GFLOPS\n",
                                                                        devi, ms1 - ms0,
                                                                        8 /* float op per MAD */ * (long long)NR_MAD_INSN *
                                                                        NR_ITERATIONS * THREADS_PER_GRP * NR_GROUPS /
                                                                        (ms1 - ms0) / (long long)1e6);
                                                                }

                                                                void prepare_run(CALuint devi, gpu_state_t *gs)
                                                                {
                                                                    // open device
                                                                    if (CAL_RESULT_OK != calDeviceOpen(&gs->device, devi))
                                                                    fatal("calDeviceOpen");
                                                                    if (CAL_RESULT_OK != calCtxCreate(&gs->ctx, gs->device))
                                                                    fatal("calCtxCreate");
                                                                    // allocate resources
                                                                    CALuint width = NR_GROUPS * THREADS_PER_GRP;
                                                                    if (CAL_RESULT_OK != calResAllocLocal2D(&gs->outputRes, gs->device,
                                                                        width, 1, CAL_FORMAT_UINT_1,
                                                                        CAL_RESALLOC_GLOBAL_BUFFER))
                                                                    fatal("calResAllocLocal2D");
                                                                    // init mem
                                                                    CALuint *outputPtr = 0;
                                                                    CALuint outputPitch = 0;
                                                                    if (CAL_RESULT_OK != calResMap((CALvoid**)&outputPtr, &outputPitch,
                                                                        gs->outputRes, 0))
                                                                    fatal("calResMap");
                                                                    for (unsigned i = 0; i < width; i++)
                                                                        outputPtr[i] = 10;
                                                                    if (CAL_RESULT_OK != calResUnmap(gs->outputRes))
                                                                    fatal("calResUnmap");
                                                                    // acquire mem handle
                                                                    if (CAL_RESULT_OK != calCtxGetMem(&gs->outputMem, gs->ctx, gs->outputRes))
                                                                    fatal("calCtxGetMem");
                                                                }

                                                                void load_module_and_execute(gpu_state_t *gs_base, CALuint count, CALimage img)
                                                                {
                                                                    CALuint devi;
                                                                    for (devi = 0; devi < count; devi++)
                                                                      {
                                                                    gpu_state_t *gs = gs_base + devi;
                                                                    CALmodule module;
                                                                    if (CAL_RESULT_OK != calModuleLoad(&module, gs->ctx, img))
                                                                        fatal("calModuleLoad");
                                                                    if (CAL_RESULT_OK != calModuleGetEntry(&gs->entry, gs->ctx, module,
                                                                            "main"))
                                                                        fatal("calModuleGetEntry");
                                                                    CALname outputName = 0;
                                                                    if (CAL_RESULT_OK != calModuleGetName(&outputName, gs->ctx, module,
                                                                            "g[]"))
                                                                        fatal("calModuleGetName");
                                                                    if (CAL_RESULT_OK != calCtxSetMem(gs->ctx, outputName, gs->outputMem))
                                                                        fatal("calCtxSetMem (output)");
                                                                    // init program grid
                                                                    CALprogramGrid pg = {
                                                                        .func = gs->entry,
                                                                        .gridBlock = { .width = THREADS_PER_GRP, .height = 1, .depth = 1 },
                                                                        .gridSize = { .width = NR_GROUPS, .height = 1, .depth = 1 },
                                                                        .flags = 0
                                                                    };
                                                                    gs->pg = pg;
                                                                    gs->e = 0;
                                                                      }
                                                                    // run
                                                                    struct timeval tv0;
                                                                    gettimeofday(&tv0, NULL);
                                                                    for (devi = 0; devi < count; devi++)
                                                                      {
                                                                    gpu_state_t *gs = gs_base + devi;
                                                                    if (CAL_RESULT_OK != calCtxRunProgramGrid(&gs->e, gs->ctx, &gs->pg))
                                                                        fatal("calCtxRunProgramGrid");
                                                                      }
                                                                    // non-busy wait
                                                                    unsigned waiting_for = 0;
                                                                    if (count > sizeof (waiting_for) * 8)
                                                                        fprintf(stderr, "Cannot deal with %u devices\n", count), exit(1);
                                                                    // one bit per device; unsigned constants keep the shift into bit 31 well-defined
                                                                    for (unsigned i = 0; i < count; i++)
                                                                        waiting_for |= (1u << i);
                                                                    CALresult res;
                                                                    // poll every 1 ms
                                                                    struct timespec req = { .tv_sec = 0, .tv_nsec = 1000000 };
                                                                    while (waiting_for)
                                                                      {
                                                                        for (devi = 0; devi < count; devi++)
                                                                          {
                                                                            gpu_state_t *gs = gs_base + devi;
                                                                            if (!(waiting_for & (1u << devi)))
                                                                                continue;
                                                                            res = calCtxIsEventDone(gs->ctx, gs->e);
                                                                            if (res == CAL_RESULT_OK)
                                                                              {
                                                                                gettimeofday(&gs->tv_end, NULL);
                                                                                waiting_for &= ~(1u << devi);
                                                                              }
                                                                            else if (res != CAL_RESULT_PENDING)
                                                                                fatal("calCtxIsEventDone");
                                                                          }
                                                                        nanosleep(&req, NULL);
                                                                      }
                                                                    // show stats
                                                                    for (devi = 0; devi < count; devi++)
                                                                        show_stats(devi, &tv0, &gs_base[devi].tv_end);
                                                                }

                                                                void finish_run(gpu_state_t *gs)
                                                                {
                                                                    // release output mem & handle
                                                                    if (CAL_RESULT_OK != calCtxReleaseMem(gs->ctx, gs->outputMem))
                                                                        fatal("calCtxReleaseMem");
                                                                    if (CAL_RESULT_OK != calResFree(gs->outputRes))
                                                                        fatal("calResFree");
                                                                    // close device
                                                                    if (CAL_RESULT_OK != calCtxDestroy(gs->ctx))
                                                                        fatal("calCtxDestroy");
                                                                    if (CAL_RESULT_OK != calDeviceClose(gs->device))
                                                                        fatal("calDeviceClose");
                                                                }

                                                                void display_attribs(CALdeviceattribs *a)
                                                                {
                                                                    printf(
                                                                        "target            %u\n"
                                                                        "localRAM          %u MB\n"
                                                                        "uncachedRemoteRAM %u MB\n"
                                                                        "cachedRemoteRAM   %u MB\n"
                                                                        "engineClock       %u MHz\n"
                                                                        "memoryClock       %u MHz\n"
                                                                        "wavefrontSize     %u\n"
                                                                        "numberOfSIMD      %u\n"
                                                                        "doublePrecision   %u\n"
                                                                        "localDataShare    %u\n"
                                                                        "globalDataShare   %u\n"
                                                                        "globalGPR         %u\n"
                                                                        "computeShader     %u\n"
                                                                        "memExport         %u\n"
                                                                        "pitch_alignment   %u\n"
                                                                        "surface_alignment %u\n"
                                                                        , a->target, a->localRAM, a->uncachedRemoteRAM, a->cachedRemoteRAM,
                                                                        a->engineClock, a->memoryClock, a->wavefrontSize, a->numberOfSIMD,
                                                                        a->doublePrecision, a->localDataShare, a->globalDataShare,
                                                                        a->globalGPR, a->computeShader, a->memExport, a->pitch_alignment,
                                                                        a->surface_alignment);
                                                                }

                                                                void cal_puts(const CALchar *msg)
                                                                {
                                                                    fputs(msg, stdout);
                                                                }

                                                                void compile_and_run(CALuint count, const char *src)
                                                                {
                                                                    CALuint devi;
                                                                    CALdeviceattribs attribs;
                                                                    // get attributes of the target devices
                                                                    for (devi = 0; devi < count; devi++)
                                                                      {
                                                                        attribs.struct_size = sizeof(CALdeviceattribs);
                                                                        if (CAL_RESULT_OK != calDeviceGetAttribs(&attribs, devi))
                                                                            fatal("calDeviceGetAttribs");
                                                                        printf("Device %u: target=%u\n", devi, attribs.target);
                                                                        //display_attribs(&attribs);
                                                                      }
                                                                    // compile the object, link it into an image
                                                                    // (attribs holds the last device's attributes at this point, so the
                                                                    // image is built for that target; all devices are assumed to share it)
                                                                    CALobject obj;
                                                                    if (CAL_RESULT_OK != calclCompile(&obj, CAL_LANGUAGE_IL, src, attribs.target))
                                                                        fatal("calclCompile");
                                                                    CALimage img;
                                                                    if (CAL_RESULT_OK != calclLink(&img, &obj, 1))
                                                                        fatal("calclLink");
                                                                    // disassemble the image
                                                                    //calclDisassembleImage(img, cal_puts);
                                                                    // prepare running it
                                                                    gpu_state_t *gs_base = malloc(count * sizeof (*gs_base));
                                                                    if (!gs_base)
                                                                        perror("malloc"), exit(1);
                                                                    for (devi = 0; devi < count; devi++)
                                                                        prepare_run(devi, gs_base + devi);
                                                                    // load module and execute it
                                                                    load_module_and_execute(gs_base, count, img);
                                                                    // free the per-GPU resources
                                                                    for (devi = 0; devi < count; devi++)
                                                                        finish_run(gs_base + devi);
                                                                    free(gs_base);
                                                                    // free the image and object
                                                                    if (CAL_RESULT_OK != calclFreeImage(img))
                                                                        fatal("calclFreeImage");
                                                                    if (CAL_RESULT_OK != calclFreeObject(obj))
                                                                        fatal("calclFreeObject");
                                                                }

                                                                void read_file(const char *fname, char **data)
                                                                {
                                                                    int fd;
                                                                    if (-1 == (fd = open(fname, O_RDONLY)))
                                                                        perror(fname), exit(1);
                                                                    struct stat st;
                                                                    if (-1 == fstat(fd, &st))
                                                                        perror("fstat"), exit(1);
                                                                    // check the size first: mmap() rejects a zero-length mapping
                                                                    if (st.st_size < 2)
                                                                        fprintf(stderr, "file should be at least length 2: %s\n", fname),
                                                                            exit(1);
                                                                    if (MAP_FAILED == (*data = mmap(NULL, st.st_size, PROT_READ,
                                                                            MAP_PRIVATE, fd, 0)))
                                                                        perror("mmap"), exit(1);
                                                                    // the NUL may be the last or the second-to-last byte of the file
                                                                    if ((*data)[st.st_size - 1] && (*data)[st.st_size - 2])
                                                                        fprintf(stderr, "file is not NUL-terminated: %s\n", fname), exit(1);
                                                                }

                                                                void usage(const char *name)
                                                                {
                                                                    fprintf(stderr, "Usage: %s <file.il>\n", name);
                                                                }

                                                                int main(int argc, char** argv)
                                                                {
                                                                    char *src;
                                                                    if (argc != 2)
                                                                        usage(argv[0]), exit(1);
                                                                    read_file(argv[1], &src);
                                                                    if (CAL_RESULT_OK != calInit())
                                                                        fatal("calInit");
                                                                    // presumably drops root privileges once CAL is initialized (uid 42 is
                                                                    // arbitrary); the result is ignored, so non-root runs carry on unchanged
                                                                    setuid(42);
                                                                    show_ver();
                                                                    CALuint count;
                                                                    if (CAL_RESULT_OK != calDeviceGetCount(&count))
                                                                        fatal("calDeviceGetCount");
                                                                    printf("Found %u device%s\n", count, count == 1 ? "" : "s");
                                                                    if (count >= 1)
                                                                        compile_and_run(count, src);
                                                                    calShutdown();
                                                                    return 0;
                                                                }
                                                                ---END-ilperf.c---
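
For reference, the GFLOPS figure a benchmark like this reports comes down to simple arithmetic on the kernel in file.il: each mad works on a 4-component register and counts as a multiply plus an add per component. The sketch below is only an illustration of that arithmetic. The helper names (kernel_flops, achieved_gflops, peak_gflops) are hypothetical and are not the show_stats() used by ilperf.c, the group count is whatever NR_GROUPS is set to, and the 1600 stream processors at 850 MHz are the published HD 5870 specifications rather than values queried from CAL.

```c
#include <assert.h>
#include <stdint.h>

/* Work per mad in the posted kernel: 4 register components, each doing
 * one multiply and one add (2 flops). */
#define LANES_PER_MAD  4
#define FLOPS_PER_LANE 2

/* Total floating-point operations issued by one kernel launch.
 * All four parameters are supplied by the caller (assumed to mirror
 * NR_GROUPS, THREADS_PER_GRP, NR_ITERATIONS and NR_MAD_INSN). */
static double kernel_flops(uint64_t nr_groups, uint64_t threads_per_grp,
                           uint64_t nr_iterations, uint64_t nr_mad_insn)
{
    return (double)nr_groups * (double)threads_per_grp *
           (double)nr_iterations * (double)nr_mad_insn *
           LANES_PER_MAD * FLOPS_PER_LANE;
}

/* Achieved rate in GFLOPS for a launch that took `seconds` of wall time. */
static double achieved_gflops(double flops, double seconds)
{
    assert(seconds > 0.0);
    return flops / seconds / 1e9;
}

/* Theoretical peak in GFLOPS, assuming every stream processor retires
 * one mad (2 flops) per engine clock. */
static double peak_gflops(unsigned stream_processors, double engine_mhz)
{
    return stream_processors * FLOPS_PER_LANE * engine_mhz / 1e3;
}
```

With 1600 stream processors at 850 MHz, peak_gflops() gives 2720, which matches the theoretical maximum quoted at the top of the thread; the measured 2662 GFLOPS is about 98% of that.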

                                                                ---BEGIN-file.il---

                                                                il_cs
                                                                  ; must be equal to THREADS_PER_GRP
                                                                  dcl_num_thread_per_group 512
                                                                  ; l0.x must be equal to NR_ITERATIONS
                                                                  dcl_literal l0, 0x100000, 0xffffffff, 42.0, 0x0

                                                                  ; nr of iterations
                                                                  mov r0.x, l0.x
                                                                  ; arbitrary values
                                                                  mov r1, l0.z
                                                                  mov r2, l0.z

                                                                  whileloop
                                                                    break_logicalz r0
                                                                    ; 209 mad instructions (must be equal to NR_MAD_INSN)
                                                                    mad r1, r1, r1, r1
                                                                    mad r2, r2, r2, r2
                                                                    mad r1, r1, r1, r1
                                                                    mad r2, r2, r2, r2
                                                                    mad r1, r1, r1, r1
                                                                    mad r2, r2, r2, r2
                                                                    mad r1, r1, r1, r1
                                                                    mad r2, r2, r2, r2
                                                                    mad r1, r1, r1, r1
                                                                    mad r2, r2, r2, r2
                                                                    mad r1, r1, r1, r1
                                                                    mad r2, r2, r2, r2
                                                                    mad r1, r1, r1, r1
                                                                    mad r2, r2, r2, r2
                                                                    mad r1, r1, r1, r1
                                                                    mad r2, r2, r2, r2
                                                                    mad r1, r1, r1, r1
                                                                    mad r2, r2, r2, r2
                                                                    mad r1, r1, r1, r1
                                                                    mad r2, r2, r2, r2
                                                                    mad r1, r1, r1, r1
                                                                    mad r2, r2, r2, r2
                                                                    mad r1, r1, r1, r1
                                                                    mad r2, r2, r2, r2
                                                                    mad r1, r1, r1, r1
                                                                    mad r2, r2, r2, r2
                                                                    mad r1, r1, r1, r1
                                                                    mad r2, r2, r2, r2
                                                                    mad r1, r1, r1, r1
                                                                    mad r2, r2, r2, r2
                                                                    mad r1, r1, r1, r1
                                                                    mad r2, r2, r2, r2
                                                                    mad r1, r1, r1, r1
                                                                    mad r2, r2, r2, r2
                                                                    mad r1, r1, r1, r1
                                                                    mad r2, r2, r2, r2
                                                                    mad r1, r1, r1, r1
                                                                    mad r2, r2, r2, r2
                                                                    mad r1, r1, r1, r1
                                                                    mad r2, r2, r2, r2
                                                                    mad r1, r1, r1, r1
                                                                    mad r2, r2, r2, r2
                                                                    mad r1, r1, r1, r1
                                                                    mad r2, r2, r2, r2
                                                                    mad r1, r1, r1, r1
                                                                    mad r2, r2, r2, r2
                                                                    mad r1, r1, r1, r1
                                                                    mad r2, r2, r2, r2
                                                                    mad r1, r1, r1, r1
                                                                    mad r2, r2, r2, r2
                                                                    mad r1, r1, r1, r1
                                                                    mad r2, r2, r2, r2
                                                                    mad r1, r1, r1, r1
                                                                    mad r2, r2, r2, r2
                                                                    mad r1, r1, r1, r1
                                                                    mad r2, r2, r2, r2
                                                                    mad r1, r1, r1, r1
                                                                    mad r2, r2, r2, r2
                                                                    mad r1, r1, r1, r1
                                                                    mad r2, r2, r2, r2
                                                                    mad r1, r1, r1, r1
                                                                    mad r2, r2, r2, r2
                                                                    mad r1, r1, r1, r1
                                                                    mad r2, r2, r2, r2
                                                                    mad r1, r1, r1, r1
                                                                    mad r2, r2, r2, r2
                                                                    mad r1, r1, r1, r1
                                                                    mad r2, r2, r2, r2
                                                                    mad r1, r1, r1, r1
                                                                    mad r2, r2, r2, r2
                                                                    mad r1, r1, r1, r1
                                                                    mad r2, r2, r2, r2
                                                                    mad r1, r1, r1, r1
                                                                    mad r2, r2, r2, r2
                                                                    mad r1, r1, r1, r1
                                                                    mad r2, r2, r2, r2
                                                                    mad r1, r1, r1, r1
                                                                    mad r2, r2, r2, r2
                                                                    mad r1, r1, r1, r1
                                                                    mad r2, r2, r2, r2
                                                                    mad r1, r1, r1, r1
                                                                    mad r2, r2, r2, r2
                                                                    mad r1, r1, r1, r1
                                                                    mad r2, r2, r2, r2
                                                                    mad r1, r1, r1, r1
                                                                    mad r2, r2, r2, r2
                                                                    mad r1, r1, r1, r1
                                                                    mad r2, r2, r2, r2
                                                                    mad r1, r1, r1, r1
                                                                    mad r2, r2, r2, r2
                                                                    mad r1, r1, r1, r1
                                                                    mad r2, r2, r2, r2
                                                                    mad r1, r1, r1, r1
                                                                    mad r2, r2, r2, r2
                                                                    mad r1, r1, r1, r1
                                                                    mad r2, r2, r2, r2
                                                                    mad r1, r1, r1, r1
                                                                    mad r2, r2, r2, r2
                                                                    mad r1, r1, r1, r1
                                                                    mad r2, r2, r2, r2
                                                                    mad r1, r1, r1, r1
                                                                    mad r2, r2, r2, r2
                                                                    mad r1, r1, r1, r1
                                                                    mad r2, r2, r2, r2
                                                                    mad r1, r1, r1, r1
                                                                    mad r2, r2, r2, r2
                                                                    mad r1, r1, r1, r1
                                                                    mad r2, r2, r2, r2
                                                                    mad r1, r1, r1, r1
                                                                    mad r2, r2, r2, r2
                                                                    mad r1, r1, r1, r1
                                                                    mad r2, r2, r2, r2
                                                                    mad r1, r1, r1, r1
                                                                    mad r2, r2, r2, r2
                                                                    mad r1, r1, r1, r1
                                                                    mad r2, r2, r2, r2
                                                                    mad r1, r1, r1, r1
                                                                    mad r2, r2, r2, r2
                                                                    mad r1, r1, r1, r1
                                                                    mad r2, r2, r2, r2
                                                                    mad r1, r1, r1, r1
                                                                    mad r2, r2, r2, r2
                                                                    mad r1, r1, r1, r1
                                                                    mad r2, r2, r2, r2
                                                                    mad r1, r1, r1, r1
                                                                    mad r2, r2, r2, r2
                                                                    mad r1, r1, r1, r1
                                                                    mad r2, r2, r2, r2
                                                                    mad r1, r1, r1, r1
                                                                    mad r2, r2, r2, r2
                                                                    mad r1, r1, r1, r1
                                                                    mad r2, r2, r2, r2
                                                                    mad r1, r1, r1, r1
                                                                    mad r2, r2, r2, r2
                                                                    mad r1, r1, r1, r1
                                                                    mad r2, r2, r2, r2
                                                                    mad r1, r1, r1, r1
                                                                    mad r2, r2, r2, r2
                                                                    mad r1, r1, r1, r1
                                                                    mad r2, r2, r2, r2
                                                                    mad r1, r1, r1, r1
                                                                    mad r2, r2, r2, r2
                                                                    mad r1, r1, r1, r1
                                                                    mad r2, r2, r2, r2
                                                                    mad r1, r1, r1, r1
                                                                    mad r2, r2, r2, r2
                                                                    mad r1, r1, r1, r1
                                                                    mad r2, r2, r2, r2
                                                                    mad r1, r1, r1, r1
                                                                    mad r2, r2, r2, r2
                                                                    mad r1, r1, r1, r1
                                                                    mad r2, r2, r2, r2
                                                                    mad r1, r1, r1, r1
                                                                    mad r2, r2, r2, r2
                                                                    mad r1, r1, r1, r1
                                                                    mad r2, r2, r2, r2
                                                                    mad r1, r1, r1, r1
                                                                    mad r2, r2, r2, r2
                                                                    mad r1, r1, r1, r1
                                                                    mad r2, r2, r2, r2
                                                                    mad r1, r1, r1, r1
                                                                    mad r2, r2, r2, r2
                                                                    mad r1, r1, r1, r1
                                                                    mad r2, r2, r2, r2
                                                                    mad r1, r1, r1, r1
                                                                    mad r2, r2, r2, r2
                                                                    mad r1, r1, r1, r1
                                                                    mad r2, r2, r2, r2
                                                                    mad r1, r1, r1, r1
                                                                    mad r2, r2, r2, r2
                                                                    mad r1, r1, r1, r1
                                                                    mad r2, r2, r2, r2
                                                                    mad r1, r1, r1, r1
                                                                    mad r2, r2, r2, r2
                                                                    mad r1, r1, r1, r1
                                                                    mad r2, r2, r2, r2
                                                                    mad r1, r1, r1, r1
                                                                    mad r2, r2, r2, r2
                                                                    mad r1, r1, r1, r1
                                                                    mad r2, r2, r2, r2
                                                                    mad r1, r1, r1, r1
                                                                    mad r2, r2, r2, r2
                                                                    mad r1, r1, r1, r1
                                                                    mad r2, r2, r2, r2
                                                                    mad r1, r1, r1, r1
                                                                    mad r2, r2, r2, r2
                                                                    mad r1, r1, r1, r1
                                                                    mad r2, r2, r2, r2
                                                                    mad r1, r1, r1, r1
                                                                    mad r2, r2, r2, r2
                                                                    mad r1, r1, r1, r1
                                                                    mad r2, r2, r2, r2
                                                                    mad r1, r1, r1, r1
                                                                    mad r2, r2, r2, r2
                                                                    mad r1, r1, r1, r1
                                                                    mad r2, r2, r2, r2
                                                                    mad r1, r1, r1, r1
                                                                    mad r2, r2, r2, r2
                                                                    mad r1, r1, r1, r1
                                                                    mad r2, r2, r2, r2
                                                                    mad r1, r1, r1, r1
                                                                    mad r2, r2, r2, r2
                                                                    mad r1, r1, r1, r1
                                                                    mad r2, r2, r2, r2
                                                                    mad r1, r1, r1, r1
                                                                    mad r2, r2, r2, r2
                                                                    mad r1, r1, r1, r1
                                                                    mad r2, r2, r2, r2
                                                                    mad r1, r1, r1, r1
                                                                    iadd r0.x, r0.x, l0.y ; counter--
                                                                  endloop

                                                                  ; use r1-r2 to compute something to prevent CAL from
                                                                  ; optimizing out the whole loop
                                                                  iadd r0, r1, r2
                                                                  mov g[vaTid.x], r0
                                                                end
                                                                ---END-file.il---

                                                                 

                                                                • HD 5870 working :-)
                                                                  aymankh

                                                                  zpdixon,

                                                                  Can you please measure the LDS bandwidth on the 5870 using ATI Stream LDS_read and LDS_write, with and without the neighbourExch (w) flag? I am interested to know whether anything has changed, since the numbers for the 4870 were really poor. I am trying to decide whether to buy the card, so please help me.

                                                            • HD 5870 working :-)
                                                              SteveLR

                                                              Dear zpdixon,

                                                              I am putting together a system similar to the one you listed here, and would be running Linux. I am in the parts-ordering phase. My main question: are the Radeon 5870s worth it, i.e. can you crunch anything with them? Or do they only reach their specs when running graphics, and drop down to 50-100 GFLOPS if you try to use them as GPGPUs? My other questions relate to buying cheaper Radeons versus buying a single FireStream or Nvidia Tesla c1070 for $1200, which boasts only 1 TFLOP peak performance but has been tested on GPGPU applications with double-precision floats (as opposed to single). I would like to run all 4 Radeons on an ASUS P6T7 with an i7 950 quad-core CPU. I am basically interested in what you accomplish with your system, success or failure, and where to throw my money.

                                                              Stephan Watkins  lloyd.riggs@gmx.ch

                                                                • HD 5870 working :-)
                                                                  empty_knapsack

                                                                  Stephan, I strongly recommend that you purchase a cheap ATI GPU first (for example, a 4850 for $100) and test whether it is suitable for your computations.

                                                                  There are TONS of problems with the ATI GPGPU SDK right now; it has been marked as beta for more than a year for a reason. At this moment CUDA is a far more mature environment for developing GPGPU applications (though buying a Tesla is probably an unnecessary step unless 4G of RAM is a must).

                                                                    • HD 5870 working :-)
                                                                      gpgpu_4870

                                                                      Hi zpdixon and the others,

                                                                      I have a problem with the results from your benchmark. I ran it in Windows Vista (after a few modifications) on my 4870 and, as expected, it got 1178 GFLOPS, which is pretty much the theoretical maximum of 1200 GFLOPS (time elapsed about 7.7 secs).

                                                                      The problem is that when I ran the same benchmark in Windows Vista on my brand new 5870, it returned a value of 1350 GFLOPS!!! That is about half of the theoretical value of 2720 GFLOPS (time elapsed about 6.5 secs). I am very frustrated with this, so I ran the code in Ubuntu 9.04 too and got exactly the same numbers.

                                                                      Finally, I ran the cosmology test nnsan posted and got about 2 TFLOPS, the same as he got with the non-optimized version of his kernel.

                                                                      So, how did you get about 2660 GFLOPS? Did you make any modification to the CAL IL code, e.g. to the number of threads?

                                                                       

                                                                      Thanks in advance

                                                                       

                                                                       

                                                                        • HD 5870 working :-)
                                                                          hazeman

                                                                           

                                                                          Originally posted by: gpgpu_4870 So, how you got about 2660Gflops? Is there any modification of the cal il code u made, eg to number of threads?


                                                                          Change #define NR_GROUPS 10 to #define NR_GROUPS 20 (the 5870 has 20 SIMD cores).
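
A quick sanity check of why the group count matters: a minimal Python sketch, assuming the commonly quoted specs (1600 stream processors at 850 MHz for the HD 5870, 800 at 750 MHz for the HD 4870) and counting one MAD as 2 FLOPs per stream processor per clock. The helper name peak_gflops is made up for illustration and is not part of the benchmark.

```python
def peak_gflops(stream_processors, clock_mhz, flops_per_clock=2):
    """Theoretical single-precision peak: one MAD (2 FLOPs) per SP per clock."""
    return stream_processors * flops_per_clock * clock_mhz / 1000.0

# Assumed specs: 20 SIMDs x 16 thread processors x 5 ALUs = 1600 SPs
hd5870 = peak_gflops(1600, 850)
# Assumed specs: 10 SIMDs x 16 thread processors x 5 ALUs = 800 SPs
hd4870 = peak_gflops(800, 750)

# With NR_GROUPS 10, only 10 of the 5870's 20 SIMDs receive a thread group
half_occupied = hd5870 * 10 / 20

print(hd5870, hd4870, half_occupied)
```

Under these assumptions the ceiling with only 10 of 20 SIMDs busy is about 1360 GFLOPS, which is consistent with the roughly 1350 GFLOPS reported above.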

                                                                            • HD 5870 working :-)
                                                                              gpgpu_4870

                                                                              Thanks! I got 2660 GFLOPS, but the elapsed time remained the same (about 6.5 secs). I tried a number of threads per SIMD in the range 64-256 (wavefront size = 64) and got much better times with fewer threads, but fewer GFLOPS as well. The best combination was 256 threads, where the time got close to half (3.4 secs) and the GFLOPS were a bit lower than the maximum (about 2600 GFLOPS). So, is it all about overhead, or is something wrong with the app / CAL initializations?

                                                                               

                                                                              Something else: I tried to run the same code on my 3850 AGP system (I changed NR_GROUPS to 4 because it has 4 SIMDs) and got the CAL error below. Is it because it does not have, I think, compute kernel support?


                                                                              calclCompile failed - cal:[No error] calcl:[ILScanILBinary: Unsupported opcode for architecture]


                                                                      • HD 5870 working :-)
                                                                        MicahVillmow
                                                                        gpgpu_4870,
                                                                        In order to run that code on the HD3850, you need to change a few things.
                                                                        il_cs <-- this must be il_ps
                                                                        ; must be equal to THREADS_PER_GRP
                                                                        dcl_num_thread_per_group 512 <-- this must be removed
                                                                        vaTid.x <-- this must be vObjIndex.x

                                                                        Also, you need to declare your vObjIndex in the kernel like some of the CAL samples do.
                                                                        You also need to change from calctxRunProgramGrid to calctxRunProgram.
                                                                        This won't work unmodified on that card because it does not have a hardware compute shader.
                                                                          • HD 5870 working :-)
                                                                            gpgpu_4870

                                                                            Thanks Micah.

                                                                            Managed to get 427Gflops out of my "old" HD3850 AGP at clock 720Mhz.

                                                                              • HD 5870 and 5970 working :-)
                                                                                zpdixon

                                                                                Micah: I see you took care of helping others run my program on HD 3xxx, thanks

                                                                                I received 2 HD 5970s from Newegg yesterday and they "just worked" too with no problems, on the same 64-bit Linux dev box, same 9.9 drivers, etc. I measured 4540 GFLOPS, or 98% of the max theoretical 4640 GFLOPS. Other than this bench, my GPGPU workloads are ALU-bound with very rare memory accesses. At full load I measure a power consumption of only about 185 Watt for a single 5970 (2.4A on the PCI-E slot's 12V rail, 5.0A on the 6-pin 12V power connector, 8.3A on the 8-pin 12V power connector), or 62% of the max theoretical 294 Watt TDP. This is very impressive: my perf/W increased by 2.7x compared to a 4850 X2. Definitely worth the ~$600 each; the power savings are going to recoup the hardware price in less than 4 months... A 5970 is roughly 4x faster than the competition (GTX 295) on my workloads, and the latter consumes more power and is about the same price +/- $100. Brilliant!
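
For readers who want to redo the arithmetic, here is a minimal Python sketch that turns the three 12V clamp-meter readings quoted above into watts and compares them with the TDP. The variable names are illustrative only, and the sketch assumes each rail sits at exactly 12.0 V, which a real supply will only approximate.

```python
RAIL_VOLTS = 12.0  # assumption: nominal voltage, not a measured value

# Clamp-meter readings quoted in the post (amps on each 12V feed)
amps_12v = {"PCI-E slot": 2.4, "6-pin connector": 5.0, "8-pin connector": 8.3}

watts_12v = sum(a * RAIL_VOLTS for a in amps_12v.values())

TDP_WATTS = 294.0
pct_of_tdp = 100.0 * watts_12v / TDP_WATTS

# Benchmark efficiency: measured vs. theoretical GFLOPS from the post
pct_of_peak = 100.0 * 4540.0 / 4640.0

print(round(watts_12v, 1), round(pct_of_tdp), round(pct_of_peak))
```

At a nominal 12.0 V the readings sum to roughly 188 W, within a few watts of the "about 185 Watt" figure; the small gap would be explained by rounding in the readings or by rails that are not at exactly 12 V.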

                                                                                  • HD 5870 and 5970 working :-)
                                                                                    riza.guntur

                                                                                    183 watts? Do you only program for a single GPU, or does your program utilize both GPUs on the board?

                                                                                      • HD 5870 and 5970 working :-)
                                                                                        zpdixon

                                                                                        I utilize both GPUs. I guess the power draw is so low in my case because:

                                                                                        • Contrary to games or other GPGPU workloads, I rarely access video memory. GDDR5, as opposed to the 4850 X2's GDDR3, enters very low power modes when it is not used.
                                                                                        • My workloads obviously don't stress parts of the chip that I don't use: floating-point, ROPs, etc. AMD made huge efforts to make R800 conserve power when these logic blocks are not used.

                                                                                        For those wondering how I measure the wattage: I connected the 5970 to the motherboard via a flexible PCI-E extender whose ribbon cable has the 5 wires used to carry the 12V rail separated from the other wires. This allows me to clamp a clamp-meter around them. I use the same clamp-meter to measure the draw on the 6-pin and 8-pin power connectors. Technically my measurements are slightly inaccurate because I am not taking into account the 3.3V rail on the slot. This is not important because in my experience video cards draw only ~2A from this rail (the PCI-E spec allows a max of 3.0A). This accounts for only ~7W (max 9.9W).
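
The size of the 3.3V rail that the clamp-meter misses can be checked the same way. A minimal Python sketch, assuming a nominal 3.3 V and the typical and maximum currents quoted above; the helper name rail_watts is made up for illustration.

```python
def rail_watts(volts, amps):
    """Power drawn from one supply rail: P = V * I."""
    return volts * amps

typical_3v3 = rail_watts(3.3, 2.0)  # ~2 A typical draw quoted above
maximum_3v3 = rail_watts(3.3, 3.0)  # 3.0 A maximum quoted above

# Share of the ~185 W measured on the 12V feeds that the 3.3V rail would add
typical_share_pct = 100.0 * typical_3v3 / 185.0

print(round(typical_3v3, 1), round(maximum_3v3, 1), round(typical_share_pct, 1))
```

Under these assumptions the unmeasured rail adds roughly 6.6 W, a few percent of the total, so leaving it out does not change the conclusion.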

                                                                                          • HD 5870 and 5970 working :-)
                                                                                            empty_knapsack

                                                                                            On the contrary.

                                                                                             

                                                                                            I've got some results from 5770 GPUs, and that VFETCH thing I mentioned above is true for them too. For my application (which isn't a purely synthetic MAD MAD MAD test but more of a real-world app), R800 is 10% slower than R700 on the same config (i.e. 5770 = 800SP @ 850MHz R800 and 4890 = 800SP @ 850MHz R700). As more memory read operations are added, R800's performance drops much faster than R700's; the difference can be as large as 50%.

                                                                                             

                                                                                            Now I'm curious whether it is just a poorly written IL compiler (which will hopefully be fixed in future drivers) or whether R800 has some (serious) problems in hardware.

                                                                                              • HD 5870 and 5970 working :-)
                                                                                                riza.guntur

                                                                                                The 5770 has only half the bandwidth of the 4890, so isn't it expected to have half the performance of the 4890 on lots of read operations?

                                                                                                  • HD 5870 and 5970 working :-)
                                                                                                    empty_knapsack

                                                                                                    riza.guntur,

                                                                                                     

                                                                                                    that 50% actually applied to the 5870, not the 5770. The comparison between the 4890 and the 5770 was done with a heavily ALU-bound kernel, so memory bandwidth shouldn't be a problem there at all.

                                                                                                      • HD 5870 and 5970 working :-)
                                                                                                        empty_knapsack

                                                                                                        And more about R800. The way it handles memory fetching differs a lot from R700. I'm using a simple IL construction like:

                                                                                                        dcl_resource_id(0)_type(1d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)
                                                                                                        dcl_resource_id(1)_type(1d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)
                                                                                                        dcl_resource_id(2)_type(1d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)
                                                                                                        dcl_resource_id(3)_type(1d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)

                                                                                                        ...

                                                                                                        sample_resource(1)_sampler(1) r11, r1.y000
                                                                                                        sample_resource(2)_sampler(2) r61, r1.y000

                                                                                                        ... etc

                                                                                                        And the test results aren't looking good at all. With a lot of memory reads, the performance of the 5770 dropped to 285M while the 4770 shows 357M. That's R800 at 850MHz with 800SP vs R700 at 750MHz with 640SP, while memory is clocked at 1200MHz on the 5770 vs 800MHz on the 4770 (both GDDR5 on a 128-bit bus). So the 5770 is about 20% slower than the 4770 (the 4770 is 25% faster), while in theory it should be about 40% faster.
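
The comparison above can be reproduced from the quoted figures. A minimal Python sketch: the "theoretical" ratio assumes ALU throughput scales with stream processors times clock, and the bandwidth estimate assumes GDDR5 moves 4 transfers per pin per memory clock on a 128-bit bus for both cards. The names are illustrative only.

```python
def alu_throughput(stream_processors, clock_mhz):
    """Relative ALU throughput, assuming it scales with SP count x clock."""
    return stream_processors * clock_mhz

def gddr5_bandwidth_gbs(bus_bits, mem_clock_mhz, transfers_per_clock=4):
    """Peak bandwidth in GB/s, assuming 4 transfers per pin per memory clock."""
    return bus_bits / 8 * mem_clock_mhz * transfers_per_clock / 1000.0

# Figures quoted in the post: 5770 = 800SP @ 850MHz, 4770 = 640SP @ 750MHz
theory_ratio = alu_throughput(800, 850) / alu_throughput(640, 750)

# Measured results quoted in the post: 285M (5770) vs 357M (4770)
measured_ratio = 285.0 / 357.0

bw_5770 = gddr5_bandwidth_gbs(128, 1200)
bw_4770 = gddr5_bandwidth_gbs(128, 800)

print(round(theory_ratio, 2), round(measured_ratio, 2), bw_5770, bw_4770)
```

Under these assumptions the 5770 should be about 1.42x the 4770 on ALU work and has about 1.5x its peak memory bandwidth, yet it measures at about 0.80x, so neither raw ALU rate nor raw bandwidth accounts for the drop.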

                                                                                                         

                                                                                                        Is anybody else getting similar results? Any explanations for this? If the memory model changed for R800, what's the best way to do memory fetches? The CAL examples that come with OpenCL beta 4 use the same sample_resource() constructions, but those examples are very old.

                                                                                                         

                                                                                        • HD 5870 and 5970 working :-)
                                                                                          MicahVillmow
                                                                                          empty_knapsack,
                                                                                          Our next release will have updated documentation that should cover all the newer hardware.
                                                                                          • HD 5870 and 5970 working :-)
                                                                                            MicahVillmow
                                                                                            empty_knapsack,
                                                                                            Binding a UAV surface should work just like binding a global buffer surface, except that instead of using g[] with calModuleGetName you use uav#.

                                                                                            So instead of r = calModuleGetName(&progName, *ctx, *module, "g[]"); you would use
                                                                                            r = calModuleGetName(&progName, *ctx, *module, "uav0");
                                                                                            r = calModuleGetName(&progName, *ctx, *module, "uav1");
                                                                                            Up to 8 UAVs are available on HD5XXX cards and 1 UAV on HD4XXX cards.

                                                                                              • HD 5870 and 5970 working :-)
                                                                                                empty_knapsack

                                                                                                Binding isn't a problem. Allocating the buffer itself is.

                                                                                                Or am I wrong, and is it possible to bind a resource created with calResCreate2D/calResAllocLocal2D/etc. as a UAV buffer? I wasn't successful with it, but I probably made some mistake; I was under the impression that calResAllocView is strictly required to allocate a UAV buffer.

                                                                                              • HD 5870 and 5970 working :-)
                                                                                                MicahVillmow
                                                                                                empty_knapsack,
                                                                                                calResAllocView is not used to create a resource; its usage is still experimental, which is why it is not exposed yet. The OpenCL runtime uses calResCreate*/calResAlloc* to create the UAV memory.
                                                                                                • HD 5870 and 5970 working :-)
                                                                                                  MicahVillmow
                                                                                                  Ok, allocating raw UAV should be no different than allocating global.