Archives Discussions

zpdixon · ‎09-25-2009

I ordered an HD 5870 from Newegg on Tuesday (5min after a script I wrote sent me an alert indicating its availability ), I received it today Thursday, upgraded the ATI drivers on my 64-bit Linux GPGPU dev box to version 9.9, kept the SDK to version 1.4, and compiled a test program to measure the FLOPS rating:

2662 GFLOPS, or 98% of the max theoretical 2720 GFLOPS

This is 36% more FLOPS than my 4850 X2 cards, at 81% the power consumption. Everything just worked on the first attempt even though the card is not yet "officially supported" by the 9.9 Linux drivers - I love it 🙂

Update: I have 2 HD 5970 working too

empty_knapsack · ‎09-25-2009

What info.target CAL returns for calDeviceGetInfo(&info, deviceno)? Is it still 5 == CAL_TARGET_770?

zpdixon · ‎09-26-2009

Originally posted by: empty_knapsack What info.target CAL returns for calDeviceGetInfo(&info, deviceno)? Is it still 5 == CAL_TARGET_770?

No, it returns 8, which probably corresponds to (as of today undefined) CAL_TARGET_8XX.

empty_knapsack · ‎09-27-2009

No, it returns 8, which probably corresponds to (as of today undefined) CAL_TARGET_8XX.

Thanks, zpdixon.

...Although it means I'll need more kernel recompilations in future .

Are you using only linux? Any chances you can run some win32 tests with your 5870?

zpdixon · ‎09-28-2009

I am only using Linux. Though if you can find me a good guide to boot Windows on my diskless GPGPU dev box, I may give it a try 🙂

empty_knapsack · ‎09-28-2009

Originally posted by: zpdixon I am only using Linux. Though if you can find me a good guide to boot Windows on my diskless GPGPU dev box, I may give it a try 🙂

I guess it'll be easier to recompile my tests for linux. Or even better -- to wait some 5870s with win32 around .

Also, after some meditations with CAL I realized that 8 == CAL_TARGET_870 and 9 == CAL_TARGET_830.

zpdixon · ‎10-12-2009

Originally posted by: empty_knapsack

I guess it'll be easier to recompile my tests for linux. Or even better -- to wait some 5870s with win32 around .

Also, after some meditations with CAL I realized that 8 == CAL_TARGET_870 and 9 == CAL_TARGET_830.

Well I have a good news for you. I set up my diskless box to boot 32-bit XP from an iSCSI SAN (using gPXE and an OpenSolaris ZFS volume iSCSI target in case anyone is interested).

And I tested your ighashgpu v0.61 bruteforcer -- I assume this is what you were about to ask me 🙂 So my 5870 with 9.9 drivers does about 2430 MHash/s for MD5 and 2670 MHash/s for MD4. -cpudontcare doesn't produce any noticeable difference. I assume that without it you are not busy-looping on calCtxIsEventDone, but then you are probably doing something wrong because without it it still uses about 60% of one of the cores of my 3.0GHz Core 2 Duo.

Also I found a bug (v0.61): according to the doc the tool supports -t:sha1 and -t:mysql5, however they don't seem implemented (prints a generic error message stating the arguments are invalid).

If you need me to test anything else, let me know.

empty_knapsack · ‎10-12-2009

Originally posted by: zpdixon

Well I have a good news for you. I set up my diskless box to boot 32-bit XP from an iSCSI SAN (using gPXE and an OpenSolaris ZFS volume iSCSI target in case anyone is interested). And I tested your ighashgpu v0.61 bruteforcer -- I assume this is what you were about to ask me 🙂 So my 5870 with 9.9 drivers does about 2430 MHash/s for MD5 and 2670 MHash/s for MD4.

Thanks for the tests, I've already got some reports from 5870's owners, results very close to expected ones but a bit lower for MD5.

-cpudontcare doesn't produce any noticeable difference. I assume that without it you are not busy-looping on calCtxIsEventDone, but then you are probably doing something wrong because without it it still uses about 60% of one of the cores of my 3.0GHz Core 2 Duo.

With 5870 MD5 kernel runs for only 3.5ms. Without -cpudontcare switch ighashgpu trying to compute kernel's run time and do Sleep() before calCtxIsEventDone to avoid cpu load. With such low 3.5ms execution we can sleep only for 2-3ms, so 1.5-0.5ms we're aggressively puling calCtxIsEventDone, thus >50% of CPU usage. For other kernels (like mysql5) you should notice the CPU usage difference.

Actually it's a shame that ATI don't have flag similar to nVidia's CU_CTX_BLOCKING_SYNC when creating GPU context. Burning CPU time in while (calCtxIsEventDone() == CAL_RESULT_PENDING); loop isn't smart behavior at all.

Also I found a bug (v0.61): according to the doc the tool supports -t:sha1 and -t:mysql5, however they don't seem implemented (prints a generic error message stating the arguments are invalid).

You're probably mistyped some parameters, try examples from documentation, both sha1 & mysql5 works (at least with HD4800 ).

P.S. WTB "Preview" button on these forums!

zpdixon · ‎10-12-2009

You are right about sha1/mysql5, I was accidentally passing a 128-bit hash instead of 160-bit... Doh!

Results: 790MHash/s for sha1 and 420Mhash/s for mysql5.

empty_knapsack · ‎10-13-2009

Results for single hashes very close to expected ones. HD5870 should be 2.83 times faster than 4770 on stock clocks (this is only applies to hash cracking obviously) and after tests it's around 2.7-2.8x, so it's OK.

However, looking at ISA generated for HD5800 I've noticed that only difference is clauses for memory fetching, ALU clauses are exactly the same as for HD4800s. And when running kernel which heavily depends on memory fetches (it's multihash MD5) performance of RV870 isn't looking good at all -- it dropped by 50% from expected value. Hard to say, is it driver issues or RV870 issues or my kernel specifics.

eduardoschardong · ‎10-13-2009

Could you post the part of generated ISA that differs?

empty_knapsack · ‎10-14-2009

Originally posted by: eduardoschardong Could you post the part of generated ISA that differs?

Actually after careful examination of ISAs I've realized that only difference is new VFETCH thing:

01 TEX: ADDR(3168) CNT(1)
3 VFETCH R0.xy__, R2.x, fc147 MEGA(8)
FETCH_TYPE(NO_INDEX_OFFSET)

There are some minor changes before this but all other ALU clauses are exactly the same.

If this VFETCH is in fact is a "prefetch" instruction then it's probably a reason of slowdown. Unfortunately (again, as always) there no control possible for CAL compiler, so it's impossible to remove VFETCH and check how it'll affect the performance.

MicahVillmow · ‎09-25-2009

zpdixon,
This is expected as CAL is not released with the SDK but is released with the Catalyst drivers on a monthly basis. So although a GPU SDK has not been released since 1.4, CAL upgrades and improvements have been happening on a monthly basis. That being said, the only driver that is officially 'supported' and tested on is the driver that the SDK was released with. Drivers before or after that release are use at your own risk.

ryta1203 · ‎09-29-2009

Originally posted by: MicahVillmow zpdixon, This is expected as CAL is not released with the SDK but is released with the Catalyst drivers on a monthly basis. So although a GPU SDK has not been released since 1.4, CAL upgrades and improvements have been happening on a monthly basis. That being said, the only driver that is officially 'supported' and tested on is the driver that the SDK was released with. Drivers before or after that release are use at your own risk.

Micah,

You say there are CAL updates and improvements in the driver but none of this information can be found in the release notes... can you guys please starting putting this information in the release notes!?

c360 · ‎09-28-2009

I have not been able to get my HD 5870 running under linux with cat 9.9. Any special changes you made to the system?

Thanks!

zpdixon · ‎09-29-2009

No changes. This is a standard Ubuntu 8.04 amd64 system with ATI drivers 9.9 and the Stream SDK 1.4. Can you modprobe fglrx? Does the Xorg driver detect your card?

c360 · ‎09-29-2009

I am using openSUSE 11.1 x64. Have Stream SDK 1.4 and installed/upgraded to 9.9 from ATI's site.

The 5870 is first card and there is a 3870 is in the system. Boots fine to console. aticonfig --list-adapters only shows 3870.

Starting X (after editing xorg.conf with adapter info) a logo (AMD) came up "not supported hardware" and when I tried running a stream application the system froze in under 10 seconds.

EDIT: due to crippling linux at this point I pulled the card and replaced the config. I still end up with the sytem locking using 9.9. Guess there are more issues afoot. Will change my dev/test partition to Ubuntu 8.04 and see if there is a difference.

nnsan · ‎10-05-2009

Exactly with the same config to zpdixon (Ubuntu 8.04, Catalyst 9.9, CAL 1.4beta), 5870 works fine on our system. I posted our benchmark results here. This is great!

c360 · ‎10-05-2009

Does aticonfig --list-adapters report 5870 as an available adapter?

curryml · ‎10-06-2009

For me, this is not true. Here is the set of steps I have taken:

Clean install of Ubuntu 8.04
Installation of ATI Catalyst 9.9

After this, there's no hope for doing anything with this card, as aticonfig complains of no supported adapters detected.

lspci says there's a VGA compatible controller: ATI Technologies Inc Unknown device 6898. Inserting fglrx into xorg.conf does no good either.

Perhaps it's just in the transition from a working device to the new one, tricking the computer into accepting it, I suppose.

Any tips?

nnsan · ‎10-06-2009

On my system, "aticonfig --list-adapoters" reports "No supported adapters detected".

But xorg starts up with "unsupported hardware" logo at lower right corner of the screen.

Still, hellocal and other CAL programs work well.

c360 · ‎10-06-2009

Originally posted by: nnsan On my system, "aticonfig --list-adapoters" reports "No supported adapters detected".

But xorg starts up with "unsupported hardware" logo at lower right corner of the screen.
Still, hellocal and other CAL programs work well.

Great! That is what I get and I just received a new compile of the CAL application that was locking system up with Cat 9.9 and it is working now. Not tuned to 5870 yet but shortly.

Result am working under opensuse 11.1/11.2 with Cat 9.9 x86_64 drivers.

gpgpu_4870 · ‎10-13-2009

Originally posted by: zpdixon I ordered an HD 5870 from Newegg on Tuesday (5min after a script I wrote sent me an alert indicating its availability ), I received it today Thursday, upgraded the ATI drivers on my 64-bit Linux GPGPU dev box to version 9.9, kept the SDK to version 1.4, and compiled a test program to measure the FLOPS rating:
2662 GFLOPS, or 98% of the max theoretical 2720 GFLOPS
This is 36% more FLOPS than my 4850 X2 cards, at 81% the power consumption. Everything just worked on the first attempt even though the card is not yet "officially supported" by the 9.9 Linux drivers - I love it 🙂

Hi zpdixon. I'm new in ATI Stream and I'm very interested to know what program you used to measure Gflops performance of your card. Thanks in advanceisgust;

zpdixon · ‎10-15-2009

Originally posted by: gpgpu_4870

Hi zpdixon. I'm new in ATI Stream and I'm very interested to know what program you used to measure Gflops performance of your card. Thanks in advanceisgust;

It's a very simple tool I wrote: a CAL IL kernel running MAD instructions in a while loop. In my opinion AMD developers should include an equilavent tool in the SDK. I can post the source code if you want.

gpgpu_4870 · ‎10-16-2009

Thanks for the reply zpdixon. It would be great to post the source code for that allowing to me to be able to run that benchmark to my current 4870 card and my upcoming 5870..

Something else. Have you or anyone else noticed that with latest catalyst 9.11 beta drivers that support OpenCL, CAL performance (e.g. simple_matmult sample) has decreased by a huge leap (I mean the gflops value returned)?

zpdixon · ‎10-17-2009

Ok, here it is. I compile it under Linux with "gcc -std=c99 -pedantic -Wextra -Wall -Werror -I/usr/local/amdcal/include -c -o ilperf ilperf.c". The file.il must end with a NUL byte due to my use of mmap().

---BEGIN-ilperf.c---

#define _POSIX_C_SOURCE 199309L
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/time.h>
#include <unistd.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <time.h>
#include <cal.h>
#include <calcl.h>
#include <stdbool.h>

#define NR_GROUPS 10
#define THREADS_PER_GRP 512
#define NR_ITERATIONS 0x100000
#define NR_MAD_INSN 209

typedef struct    gpu_state_s
{
    CALdevice        device;
    CALcontext        ctx;
    CALresource        outputRes; // <width> UINTs
    CALmem        outputMem;
    CALfunc        entry;
    CALprogramGrid    pg;
    CALevent        e;
    struct timeval    tv_end;
}        gpu_state_t;

void fatal(const char *func_name)
{
    const char *cal_msg = calGetErrorString();
    const char *comp_msg = calclGetErrorString();
    fprintf(stderr, "%s failed - cal:[%s] calcl:[%s]\n",
        func_name, cal_msg, comp_msg);
    // calGetErrorString error messages are prematurely truncated because of
    // stray NUL chars. print the message until 3 consecutive NUL chars are
    // encountered
    fprintf(stderr, "Full CAL error: ");
    for (int i = 0; i < 128; i++) {
    if (i > 1 && !cal_msg[i - 2] && !cal_msg[i - 1] && !cal_msg)
        break;
    if (cal_msg)
        fprintf(stderr, "%c", cal_msg);
    }
    fprintf(stderr, "\n");
    exit(1);
}

void show_ver()
{
    CALuint major, minor, imp;
    if (CAL_RESULT_OK != calGetVersion(&major, &minor, &imp))
    fatal("calGetVersion");
    printf("CAL version %u.%u.%u\n", major, minor, imp);
}

void show_stats(CALuint devi, struct timeval *v0, struct timeval *v1)
{
    long long ms0 = v0->tv_sec * 1000 + v0->tv_usec / 1000;
    long long ms1 = v1->tv_sec * 1000 + v1->tv_usec / 1000;
    printf("Device %d: execution time %lld ms, achieved %lld GFLOPS\n",
        devi, ms1 - ms0,
        8 /* float op per MAD */ * (long long)NR_MAD_INSN *
        NR_ITERATIONS * THREADS_PER_GRP * NR_GROUPS /
        (ms1 - ms0) / (long long)1e6);
}

void prepare_run(CALuint devi, gpu_state_t *gs)
{
    // open device
    if (CAL_RESULT_OK != calDeviceOpen(&gs->device, devi))
    fatal("calDeviceOpen");
    if (CAL_RESULT_OK != calCtxCreate(&gs->ctx, gs->device))
    fatal("calCtxCreate");
    // allocate resources
    CALuint width = NR_GROUPS * THREADS_PER_GRP;
    if (CAL_RESULT_OK != calResAllocLocal2D(&gs->outputRes, gs->device,
        width, 1, CAL_FORMAT_UINT_1,
        CAL_RESALLOC_GLOBAL_BUFFER))
    fatal("calResAllocLocal2D");
    // init mem
    CALuint *outputPtr = 0;
    CALuint outputPitch = 0;
    if (CAL_RESULT_OK != calResMap((CALvoid**)&outputPtr, &outputPitch,
        gs->outputRes, 0))
    fatal("calResMap");
    for (unsigned i = 0; i < width; i++)
    outputPtr = 10;
    if (CAL_RESULT_OK != calResUnmap(gs->outputRes))
    fatal("calResUnmap");
    // acquire mem handle
    if (CAL_RESULT_OK != calCtxGetMem(&gs->outputMem, gs->ctx, gs->outputRes))
    fatal("calCtxGetMem");
}

void load_module_and_execute(gpu_state_t *gs_base, CALuint count, CALimage img)
{
    CALuint devi;
    for (devi = 0; devi < count; devi++)
      {
    gpu_state_t *gs = gs_base + devi;
    CALmodule module;
    if (CAL_RESULT_OK != calModuleLoad(&module, gs->ctx, img))
        fatal("calModuleLoad");
    if (CAL_RESULT_OK != calModuleGetEntry(&gs->entry, gs->ctx, module,
            "main"))
        fatal("calModuleGetEntry");
    CALname outputName = 0;
    if (CAL_RESULT_OK != calModuleGetName(&outputName, gs->ctx, module,
            "g[]"))
        fatal("calModuleGetName");
    if (CAL_RESULT_OK != calCtxSetMem(gs->ctx, outputName, gs->outputMem))
        fatal("calCtxSetMem (output)");
    // init program grid
    CALprogramGrid pg = {
        .func = gs->entry,
        .gridBlock = { .width = THREADS_PER_GRP, .height = 1, .depth = 1 },
        .gridSize = { .width = NR_GROUPS, .height = 1, .depth = 1 },
        .flags = 0
    };
    gs->pg = pg;
    gs->e = 0;
      }
    // run
    struct timeval tv0;
    gettimeofday(&tv0, NULL);
    for (devi = 0; devi < count; devi++)
      {
    gpu_state_t *gs = gs_base + devi;
    if (CAL_RESULT_OK != calCtxRunProgramGrid(&gs->e, gs->ctx, &gs->pg))
        fatal("calCtxRunProgram");
      }
    // non-busy wait
    unsigned waiting_for = 0;
    if (count > sizeof (waiting_for) * 😎
    fprintf(stderr, "Cannot deal with %u devices\n", count), exit(1);
    for (unsigned i = 0; i < count; i++)
    waiting_for |= (1 << i);
    CALresult res;
    struct timespec req = { .tv_sec = 0, .tv_nsec = 1e6 };
    while (waiting_for)
      {
    for (devi = 0; devi < count; devi++)
    {
        gpu_state_t *gs = gs_base + devi;
        if (!(waiting_for & (1 << devi)))
        continue;
        res = calCtxIsEventDone(gs->ctx, gs->e);
        if (res == CAL_RESULT_OK)
          {
        gettimeofday(&gs->tv_end, NULL);
        waiting_for &= ~(1 << devi);
          }
        else if (res != CAL_RESULT_PENDING)
        fatal("calCtxIsEventDone");
    }
    nanosleep(&req, NULL);
      }
    // show stats
    for (devi = 0; devi < count; devi++)
    show_stats(devi, &tv0, &gs_base[devi].tv_end);
}

void finish_run(gpu_state_t *gs)
{
    // release output mem & handle
    if (CAL_RESULT_OK != calCtxReleaseMem(gs->ctx, gs->outputMem))
    fatal("calCtxReleaseMem");
    if (CAL_RESULT_OK != calResFree(gs->outputRes))
    fatal("calResFree");
    // close device
    if (CAL_RESULT_OK != calCtxDestroy(gs->ctx))
    fatal("calCtxDestroy");
    if (CAL_RESULT_OK != calDeviceClose(gs->device))
    fatal("calDeviceClose");
}

void display_attribs(CALdeviceattribs *a)
{
    printf(
        "target            %u\n"
        "localRAM          %u MB\n"
        "uncachedRemoteRAM %u MB\n"
        "cachedRemoteRAM   %u MB\n"
        "engineClock       %u MHz\n"
        "memoryClock       %u MHz\n"
        "wavefrontSize     %u\n"
        "numberOfSIMD      %u\n"
        "doublePrecision   %u\n"
        "localDataShare    %u\n"
        "globalDataShare   %u\n"
        "globalGPR         %u\n"
        "computeShader     %u\n"
        "memExport         %u\n"
        "pitch_alignment   %u\n"
        "surface_alignment %u\n"
        , a->target, a->localRAM, a->uncachedRemoteRAM, a->cachedRemoteRAM,
        a->engineClock, a->memoryClock, a->wavefrontSize, a->numberOfSIMD,
        a->doublePrecision, a->localDataShare, a->globalDataShare,
        a->globalGPR, a->computeShader, a->memExport, a->pitch_alignment,
        a->surface_alignment);
}

void cal_puts(const CALchar *msg)
{
    fputs(msg, stdout);
}

void compile_and_run(CALuint count, const char *src)
{
    CALuint devi;
    CALdeviceattribs attribs;
    // get attributes of the target devices
    for (devi = 0; devi < count; devi++)
      {
    attribs.struct_size = sizeof(CALdeviceattribs);
    if (CAL_RESULT_OK != calDeviceGetAttribs(&attribs, devi))
        fatal("calDeviceGetAttribs");
    printf("Device %u: target=%u\n", devi, attribs.target);
    //display_attribs(&attribs);
      }
    // compile the object, link it into an image
    CALobject obj;
    if (CAL_RESULT_OK != calclCompile(&obj, CAL_LANGUAGE_IL, src, attribs.target))
    fatal("calclCompile");
    CALimage img;
    if (CAL_RESULT_OK != calclLink(&img, &obj, 1))
    fatal("calclLink");
    // disassemble the image
    //calclDisassembleImage(img, cal_puts);
    // prepare running it
    gpu_state_t *gs_base = malloc(count * sizeof (*gs_base));
    if (!gs_base)
    perror("malloc"), exit(1);
    for (devi = 0; devi < count; devi++)
    prepare_run(devi, gs_base + devi);
    // load module and execute it
    load_module_and_execute(gs_base, count, img);
    // free the per-GPU resources
    for (devi = 0; devi < count; devi++)
    finish_run(gs_base + devi);
    // free the image and object
    if (CAL_RESULT_OK != calclFreeImage(img))
    fatal("calclFreeImage");
    if (CAL_RESULT_OK != calclFreeObject(obj))
    fatal("calclFreeObject");
}

void read_file(const char *fname, char **data)
{
    int fd;
    if (-1 == (fd = open(fname, O_RDONLY)))
    perror(fname), exit(1);
    struct stat st;
    if (-1 == fstat(fd, &st))
    perror("stat"), exit(1);
    if (MAP_FAILED == (*data = mmap(NULL, st.st_size, PROT_READ,
            MAP_PRIVATE, fd, 0)))
    perror("mmap"), exit(1);
    if (st.st_size < 2)
    fprintf(stderr, "file should be at least length 2: %s\n", fname),
        exit(1);
    if ((*data)[st.st_size - 1] && (*data)[st.st_size - 2])
    fprintf(stderr, "file is not NUL-terminated: %s\n", fname), exit(1);
}

void usage(const char *name)
{
    fprintf(stderr, "Usage: %s <file.il>\n", name);
}

int main(int argc, char** argv)
{
    char *src;
    if (argc != 2)
    usage(argv[0]), exit(1);
    read_file(argv[1], &src);
    if (CAL_RESULT_OK != calInit())
    fatal("calInit");
    setuid(42);
    show_ver();
    CALuint count;
    if (CAL_RESULT_OK != calDeviceGetCount(&count))
    fatal("calDeviceGetCount");
    printf("Found %u device%s\n", count, count > 1 ? "s" : "");
    if (count >= 1)
    compile_and_run(count, src);
    calShutdown();
    return 0;
}
---END-ilperf.c---

---BEGIN-file.il---

il_cs
; must be equal to THREADS_PER_GRP
dcl_num_thread_per_group 512
; l0.x must be equal to NR_ITERATIONS
dcl_literal l0, 0x100000, 0xffffffff, 42.0, 0x0

; nr of iterations
mov r0.x, l0.x
; arbitrary values
mov r1, l0.z
mov r2, l0.z

whileloop
    break_logicalz r0
    ; 209 mad instructions (must be equal to NR_MAD_INSN)
    mad r1, r1, r1, r1
    mad r2, r2, r2, r2
    mad r1, r1, r1, r1
    mad r2, r2, r2, r2
    mad r1, r1, r1, r1
    mad r2, r2, r2, r2
    mad r1, r1, r1, r1
    mad r2, r2, r2, r2
    mad r1, r1, r1, r1
    mad r2, r2, r2, r2
    mad r1, r1, r1, r1
    mad r2, r2, r2, r2
    mad r1, r1, r1, r1
    mad r2, r2, r2, r2
    mad r1, r1, r1, r1
    mad r2, r2, r2, r2
    mad r1, r1, r1, r1
    mad r2, r2, r2, r2
    mad r1, r1, r1, r1
    mad r2, r2, r2, r2
    mad r1, r1, r1, r1
    mad r2, r2, r2, r2
    mad r1, r1, r1, r1
    mad r2, r2, r2, r2
    mad r1, r1, r1, r1
    mad r2, r2, r2, r2
    mad r1, r1, r1, r1
    mad r2, r2, r2, r2
    mad r1, r1, r1, r1
    mad r2, r2, r2, r2
    mad r1, r1, r1, r1
    mad r2, r2, r2, r2
    mad r1, r1, r1, r1
    mad r2, r2, r2, r2
    mad r1, r1, r1, r1
    mad r2, r2, r2, r2
    mad r1, r1, r1, r1
    mad r2, r2, r2, r2
    mad r1, r1, r1, r1
    mad r2, r2, r2, r2
    mad r1, r1, r1, r1
    mad r2, r2, r2, r2
    mad r1, r1, r1, r1
    mad r2, r2, r2, r2
    mad r1, r1, r1, r1
    mad r2, r2, r2, r2
    mad r1, r1, r1, r1
    mad r2, r2, r2, r2
    mad r1, r1, r1, r1
    mad r2, r2, r2, r2
    mad r1, r1, r1, r1
    mad r2, r2, r2, r2
    mad r1, r1, r1, r1
    mad r2, r2, r2, r2
    mad r1, r1, r1, r1
    mad r2, r2, r2, r2
    mad r1, r1, r1, r1
    mad r2, r2, r2, r2
    mad r1, r1, r1, r1
    mad r2, r2, r2, r2
    mad r1, r1, r1, r1
    mad r2, r2, r2, r2
    mad r1, r1, r1, r1
    mad r2, r2, r2, r2
    mad r1, r1, r1, r1
    mad r2, r2, r2, r2
    mad r1, r1, r1, r1
    mad r2, r2, r2, r2
    mad r1, r1, r1, r1
    mad r2, r2, r2, r2
    mad r1, r1, r1, r1
    mad r2, r2, r2, r2
    mad r1, r1, r1, r1
    mad r2, r2, r2, r2
    mad r1, r1, r1, r1
    mad r2, r2, r2, r2
    mad r1, r1, r1, r1
    mad r2, r2, r2, r2
    mad r1, r1, r1, r1
    mad r2, r2, r2, r2
    mad r1, r1, r1, r1
    mad r2, r2, r2, r2
    mad r1, r1, r1, r1
    mad r2, r2, r2, r2
    mad r1, r1, r1, r1
    mad r2, r2, r2, r2
    mad r1, r1, r1, r1
    mad r2, r2, r2, r2
    mad r1, r1, r1, r1
    mad r2, r2, r2, r2
    mad r1, r1, r1, r1
    mad r2, r2, r2, r2
    mad r1, r1, r1, r1
    mad r2, r2, r2, r2
    mad r1, r1, r1, r1
    mad r2, r2, r2, r2
    mad r1, r1, r1, r1
    mad r2, r2, r2, r2
    mad r1, r1, r1, r1
    mad r2, r2, r2, r2
    mad r1, r1, r1, r1
    mad r2, r2, r2, r2
    mad r1, r1, r1, r1
    mad r2, r2, r2, r2
    mad r1, r1, r1, r1
    mad r2, r2, r2, r2
    mad r1, r1, r1, r1
    mad r2, r2, r2, r2
    mad r1, r1, r1, r1
    mad r2, r2, r2, r2
    mad r1, r1, r1, r1
    mad r2, r2, r2, r2
    mad r1, r1, r1, r1
    mad r2, r2, r2, r2
    mad r1, r1, r1, r1
    mad r2, r2, r2, r2
    mad r1, r1, r1, r1
    mad r2, r2, r2, r2
    mad r1, r1, r1, r1
    mad r2, r2, r2, r2
    mad r1, r1, r1, r1
    mad r2, r2, r2, r2
    mad r1, r1, r1, r1
    mad r2, r2, r2, r2
    mad r1, r1, r1, r1
    mad r2, r2, r2, r2
    mad r1, r1, r1, r1
    mad r2, r2, r2, r2
    mad r1, r1, r1, r1
    mad r2, r2, r2, r2
    mad r1, r1, r1, r1
    mad r2, r2, r2, r2
    mad r1, r1, r1, r1
    mad r2, r2, r2, r2
    mad r1, r1, r1, r1
    mad r2, r2, r2, r2
    mad r1, r1, r1, r1
    mad r2, r2, r2, r2
    mad r1, r1, r1, r1
    mad r2, r2, r2, r2
    mad r1, r1, r1, r1
    mad r2, r2, r2, r2
    mad r1, r1, r1, r1
    mad r2, r2, r2, r2
    mad r1, r1, r1, r1
    mad r2, r2, r2, r2
    mad r1, r1, r1, r1
    mad r2, r2, r2, r2
    mad r1, r1, r1, r1
    mad r2, r2, r2, r2
    mad r1, r1, r1, r1
    mad r2, r2, r2, r2
    mad r1, r1, r1, r1
    mad r2, r2, r2, r2
    mad r1, r1, r1, r1
    mad r2, r2, r2, r2
    mad r1, r1, r1, r1
    mad r2, r2, r2, r2
    mad r1, r1, r1, r1
    mad r2, r2, r2, r2
    mad r1, r1, r1, r1
    mad r2, r2, r2, r2
    mad r1, r1, r1, r1
    mad r2, r2, r2, r2
    mad r1, r1, r1, r1
    mad r2, r2, r2, r2
    mad r1, r1, r1, r1
    mad r2, r2, r2, r2
    mad r1, r1, r1, r1
    mad r2, r2, r2, r2
    mad r1, r1, r1, r1
    mad r2, r2, r2, r2
    mad r1, r1, r1, r1
    mad r2, r2, r2, r2
    mad r1, r1, r1, r1
    mad r2, r2, r2, r2
    mad r1, r1, r1, r1
    mad r2, r2, r2, r2
    mad r1, r1, r1, r1
    mad r2, r2, r2, r2
    mad r1, r1, r1, r1
    mad r2, r2, r2, r2
    mad r1, r1, r1, r1
    mad r2, r2, r2, r2
    mad r1, r1, r1, r1
    mad r2, r2, r2, r2
    mad r1, r1, r1, r1
    mad r2, r2, r2, r2
    mad r1, r1, r1, r1
    mad r2, r2, r2, r2
    mad r1, r1, r1, r1
    mad r2, r2, r2, r2
    mad r1, r1, r1, r1
    mad r2, r2, r2, r2
    mad r1, r1, r1, r1
    mad r2, r2, r2, r2
    mad r1, r1, r1, r1
    mad r2, r2, r2, r2
    mad r1, r1, r1, r1
    mad r2, r2, r2, r2
    mad r1, r1, r1, r1
    mad r2, r2, r2, r2
    mad r1, r1, r1, r1
    mad r2, r2, r2, r2
    mad r1, r1, r1, r1
    mad r2, r2, r2, r2
    mad r1, r1, r1, r1
    mad r2, r2, r2, r2
    mad r1, r1, r1, r1
    iadd r0.x, r0.x, l0.y ; counter--
endloop

; use r1-r2 to compute something to prevent CAL from
; optimizing out the whole loop
iadd r0, r1, r2
mov g[vaTid.x], r0
end
---END-file.il---

aymankh · ‎10-20-2009

zpdixon,

Can you please measure LDS bandwidth in 5870 using ATI stream LDS_read and LDS_write, with and without neighbourExch (w) flag? I am interested to know if something has changed? The numbers for 4870 were really poor? I am trying to decide wheather I will buy the card so please help me.

SteveLR · ‎10-21-2009

Dear zpdixon,

I was putting a system together similar to the one you listed here, and would be running Linux. In any case, I am in the ordering parts phase. My main question, are the Radeon 5870's worth it, ie can you crunch anything with them? I mean do they just run Graphics according to specks, and drop down to 50-100 Gflops if you try and use them as GPGPU's. My main questions were related to purchase of cheeper Radeon's Vs. bying a single Firestream or Nvidia tesla c1070 for 1200$, which boast only 1 Tflop peak performance, but has been tested on GPGPU applications with double floats (as opposed to the single). I mean, I would like to run all 4 Radeons on an ASUS P6T7 with an i7 950 Quadcore CPU. I basically am interested in what you accomplish with your system, success or failure, and where to throw my money.

Stephan Watkins lloyd.riggs@gmx.ch

empty_knapsack · ‎10-22-2009

Stephan, I'm strongly recommend you to purchase some cheap ATI GPU first (for example, $100 for 4850) and test, is it suitable for your computations or not.

There are TONS of problems with ATI GPGPU SDK right now, it marked as beta for more than a year for a reason. At this moment CUDA is a way more mature environment for developing GPGPU applications (though buying Tesla probably unneeded step unless 4G RAM is a must).

gpgpu_4870 · ‎10-25-2009

Hi zpdixon and the others,

I have a problem with results from your benchmark. Well, I ran it in Windows Vista(after a few modifications) on my 4870 and it get as expected 1178Gflops which is pretty much the maximum of thetheoretical 1200Gflops (time elapsed about 7.7secs).

The problem is that when i ran the same benchmark in Windows Vista on my brand new 5870 it returned a value of 1350Gflops!!! which is about the half of the theoretical value of 2720Gflops (time elapsed about 6.5secs). I am very very frustated with this and ran the code in Ubuntu 9.04 too and got exactly the same numbers.

To conclude, I ran the cosmology test nnsan posted and got about 2Teraflops as he got too in the not-optimized version of his kernel.

So, how you got about 2660Gflops? Is there any modification of the cal il code u made, eg to number of threads?

Thanks, in advance

hazeman · ‎10-25-2009

Originally posted by: gpgpu_4870 So, how you got about 2660Gflops? Is there any modification of the cal il code u made, eg to number of threads?

Change #define NR_GROUPS 10 to #define NR_GROUPS 20 ( as 5870 has 20 simd cores )

gpgpu_4870 · ‎10-27-2009

Thanks! Got the value of 2660Gflops but time elapsed remained the same(about 6.5secs). I tried with number of threads per SIMD in range 64-256 (wavefrontsize = 64) and i got much better time with fewer threads but fewer gflops as well. The best combination was with 256 threads where time was getting close to the half (3.4secs) and gflops a bit lower than maximum (about 2600Gflops). So, is it all about overhead or something;s wrong with the app / CAL intialiazations?

Something else: I tried to run the same code in my 3850 agp system (changed NR_GROUPS to 4 cause have 4 SIMDs) and got the CAL error below (Is it because it does not have, i think, compute kernel support?):

calclCompile failed - cal:[No error] calcl:[ILScanILBinary: Unsupported opcode for architecture]

MicahVillmow · ‎10-28-2009

gpgpu_4870,
In order to run that code on the HD3850, you need to change a few things.
il_cs <-- this must be il_ps
; must be equal to THREADS_PER_GRP
dcl_num_thread_per_group 512 <-- this must be removed
vaTid.x <-- this must be vObjIndex.x

Also, you need to declare your vObjIndex in the kernel like some of the cal samplers.
You also need to change from calctxRunProgramGrid to calctxRunProgram
This won't work unmodified on that card because it does not have hardware compute shader.

gpgpu_4870 · ‎10-28-2009

Thanks Micah.

Managed to get 427Gflops out of my "old" HD3850 AGP at clock 720Mhz.

zpdixon · ‎11-20-2009

Micah: I see you took care of helping others run my program on HD 3xxx, thanks

I received 2 HD 5970 from Newegg yesterday and they "just worked" too with no pb, on the same 64-bit Linux dev box, same 9.9 drivers, etc. I measured 4540 GFLOPS, or 98% of the max theoretical 4640 GFLOPS. Other than this bench, my GPGPU workloads are ALU-bound with very rare memory accesses. At full load I measure a power consumption of only about 185 Watt for a single 5970 (2.4A on the PCI-E slot's 12V rail, 5.0A on the 6-pin 12V power connector, 8.3A on the 8-pin 12V power conn.), or 62% of the max theoretical 294 Watt TDP. This is very impressive, my perf/W increased by 2.7x compared to a 4850 X2. Definitely worth the ~$600 each, the power savings are going to recoup the hw price in less than 4 months... A 5970 is roughly 4x faster than the competition (GTX 295) on my workloads, and the latter consumes more power and is about the same price +/- $100. Brilliant!

riza_guntur · ‎11-20-2009

183 watt? Do you only program for single GPU? or your program utilized both GPU on the board?

zpdixon · ‎11-20-2009

I utilize both GPUs. I guess the power draw is so low in my case because:

Contrary to games or other GPGPU workloads, I rarely access video memory. GDDR5, as opposed to the 4850 X2's GDDR3, enters very low power modes when it is not used.
My workloads obviously don't stress parts of the chip that I don't use: floating-point, ROPs, etc. AMD made huge efforts to make R800 conserve power when these logic blocks are not used.

For those wondering how I measure the wattage: I connected the 5970 to the motherboard via a flexible PCI-E extender whose ribbon cable has the 5 wires used to carry the 12V rail separated from the other wires. This allows me to clamp a clamp-meter around them. I use the same clamp-meter to measure the draw on the 6-pin and 8-pin power connectors. Technically my measurements are slightly inaccurate because I am not taking into account the 3.3V rail on the slot. This is not important because in my experience video cards draws only ~2A from this rail (the PCI-E specs allows a max of 3.0A). This accounts for only ~7W (max 9.9W).

empty_knapsack · ‎11-22-2009

In contrary .

I've got some results from 5770 GPUs and that VFETCH thing I've mention above true for them too. For my application (which isn't pure synthetic MAD MAD MAD test but more real word app) R800 10% slower than R700 on the same config (i.e. 5770 = 800SP @ 850Mhz R800 and 4890 = 800SP @ 850Mhz R700). With adding more memory read operations R800's performance drops way faster than R700, difference can be as large as 50%.

Now I'm curious is it just poorly written IL compiler issue (and hopefully it'll fixed with future drivers) or R800 having some (serious) problems in hardware.

riza_guntur · ‎11-22-2009

The 5770 only have half the bandwidth of 4890, so isn't it expected to have half the performance of 4890 on lots of read operations?

empty_knapsack · ‎11-22-2009

riza.guntur,

that 50% actually was applied to 5870 not 5770. Compare between 4890 & 5770 was done with heavily ALU bound kernel, memory bandwidth shouldn't be a problem there at all.

Archives Discussions

HD 5870 and 5970 working :-)