Ceq
Journeyman III

Brook 1.4 gather bug?

I think there is a bug that affects gather operations when using big 1D streams. The following example takes two streams and merges their data in an interleaved way, for example:

SIZE = 4

vIn1 = { 0, 1, 2, 3 }

vIn2 = { -0, -1, -2, -3}

vOut = { 0 , -0, 1, -1, 2, -2, 3, -3}
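For reference, the intended mapping can be written as a plain C loop on the host. This is only a minimal sketch of the expected result, not Brook+ code; the helper name `interleave_ref` is hypothetical and chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* CPU reference for the interleave described above:
 * out[2*k] = in1[k] and out[2*k + 1] = in2[k].
 * `interleave_ref` is a hypothetical helper, not part of Brook+. */
void interleave_ref(const float *in1, const float *in2, float *out, size_t n)
{
    assert(in1 != NULL && in2 != NULL && out != NULL);
    for (size_t pos = 0; pos < 2 * n; ++pos) {
        size_t i = pos >> 1;                      /* source index */
        out[pos] = (pos & 1u) ? in2[i] : in1[i];  /* odd -> in2, even -> in1 */
    }
}
```

With SIZE = 4 and the inputs above, this fills an 8-element output with the vOut pattern shown.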

 

However, if you use a big stream, such as 2^20 elements:

- Using Catalyst 9.2, Brook+ 1.4, MSVC 2005, Radeon 3870x2 or Radeon 4850, WinXP 64 or WinXP 32

SIZE = 1 << 20;

vIn1 = { 0, 1, 2, 3, ... }

vIn2 = { -0, -1, -2, -3, ...}

vOut = { 0 , 0, 1, 0, 2, 0, 3, 0, ...}

- Using Catalyst 9.5, Radeon 4850 and WinXP 64, it returns undefined values instead of 0, usually previous memory contents.

- Using Catalyst 9.5, Radeon 3870x2 and WinXP 32, the program aborts.

 

Test case:

#include <stdio.h>
#include <stdlib.h>

kernel void
ker(float in1[], float in2[], out float out1<>)
{
    int pos_x = instance().x;
    int i = pos_x >> 1;
    float t;
    if(pos_x & 0x01) {
        t = in2[i];
    } else {
        t = in1[i];
    }
    // Some operations with t
    out1 = t;
}


int main(int argc, char** argv) {
    const int SIZE = 1 << 20; // *** Wrong result using big sizes ***
    // const int SIZE = 1 << 10; // *** Right using small streams ***
    const int DSIZE = 2 * SIZE;
    unsigned int i;

    // Memory arrays (about 16 MB on the stack when SIZE = 1 << 20)
    float vIn1[SIZE], vIn2[SIZE];
    float vOut[DSIZE];

    // Init
    for(i = 0; i < SIZE; ++i) {
        vIn1[i] =  (float)i;
        vIn2[i] = -(float)i;
    }

    {
        // Stream arrays
        float sIn1<SIZE>, sIn2<SIZE>;
        float sOut<DSIZE>;
        // Load
        streamRead(sIn1, vIn1);
        streamRead(sIn2, vIn2);
        // Kernel
        ker(sIn1, sIn2, sOut);
        // Save
        streamWrite(sOut, vOut);
    }

    // Print
    for(i = 0; i < 8; i++)
        printf("vOut[%i] = (%7.3f);\n", i, vOut[i]);

    return 0;
}

0 Likes
10 Replies
emuller
Journeyman III

Isn't the max 1D resource size 2^13 = 8192?  For the 1<<20 case, did you check the streams allocate without error?

 

 

0 Likes

Thanks for trying to help, Emuller. According to the user guide (2.2.1.2), the maximum size for a stream is 2^26 elements. If you use a large 1D stream, Brook+ should automatically enable address virtualization.

In fact, there is a test in "samples/legacy/tests/address_translation" that exercises this feature, but it fails because address virtualization isn't working. Try running it with the command line "address_translation.exe -e -p -t -q -x 128 -y 128". This runs the test with 16384 elements.
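To make the idea concrete, here is a rough sketch of what address virtualization has to do conceptually: a 1D stream longer than the 1D width limit is backed by a 2D resource, and each linear index is translated to an (x, y) pair. The thread does not show how Brook+ lays this out internally, so the row-major mapping, the 8192 limit taken from the discussion, and the names `needs_translation`, `linear_to_2d` and `Coord2D` are all assumptions for illustration.

```c
#include <assert.h>

/* Hypothetical 2D coordinate of a stream element. */
typedef struct { unsigned x, y; } Coord2D;

/* 1D resource width limit mentioned in the thread (assumed here). */
enum { MAX_1D_WIDTH = 8192 };

/* Nonzero when a 1D stream of n elements exceeds the 1D limit and
 * would have to be backed by a 2D resource. */
int needs_translation(unsigned n)
{
    return n > MAX_1D_WIDTH;
}

/* Row-major translation of a linear index into a 2D resource
 * that is `width` elements wide. */
Coord2D linear_to_2d(unsigned index, unsigned width)
{
    assert(width > 0);
    Coord2D c = { index % width, index / width };
    return c;
}
```

Under these assumptions, a 16384-element stream needs translation, and in a 128-wide layout its last element (index 16383) lands at x = 127, y = 127.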

0 Likes

It is a regression in Catalyst 9.5 or 9.4: CAL reports max1DWidth to be more than 8192, but resource allocation fails with width > 8192. It works fine with Catalyst 9.2 and 9.3.

0 Likes

Thanks, Gaurav. Using Catalyst 9.2, the test code I wrote in the first post returns a bad result. If I run it in GPU mode I get:

vOut = { 0, 0, 1, 0, 2, 0, 3, 0, ... }

However, if I use the CPU backend I get the right result:

vOut = { 0 , -0, 1, -1, 2, -2, 3, -3, ...}

Any hint on this? Do you get the same results?

Note: In my installations, setting the environment variable "BRT_RUNTIME = CPU" to select the CPU backend no longer works. I had to add the following code before the "// Init" comment to force CPU mode. Could it be that Brook+ 1.4 doesn't read the BRT_RUNTIME variable at runtime?

...
unsigned int count;
Device* device;
device = getDevices("cpu", &count);
useDevices(device, 1, NULL);
// Init
...

WinXP 32, MSVC 2005, Radeon 3870x2, Brook+ 1.4, Catalyst 9.2

0 Likes

Check error() and errorLog() on your streams and see if they give any information.

BRT_RUNTIME still works, but if you have called useDevices(), it takes precedence over BRT_RUNTIME.

0 Likes

Using Catalyst 9.2, there are no error messages, but the result is wrong.

Using Catalyst 9.5, it reports a memory allocation failure if width > 8192, as you said.

 

Here is an updated test case. It now runs first on the CPU backend, prints the results, and then runs on the GPU backend. It also checks and prints Brook+ stream errors.

 

File "ker.br"

---------------------------------

kernel void
ker(float in1[], float in2[], out float out1<>)
{
    int pos_x = instance().x;
    int i = pos_x >> 1;
    float t;
    if(pos_x & 0x01) {
        t = in2[i];
    } else {
        t = in1[i];
    }
    // Some operations with t
    out1 = t;
}

 

File "main.cpp"

---------------------------------

#include <stdio.h>   // printf, puts
#include <stdlib.h>  // malloc, free, exit
#include "brook/Stream.h"
#include "brook/Device.h"
#include "built/ker.h"

using namespace std;
using namespace brook;

float *vIn1, *vIn2, *vOut;

void test(const char *backend, unsigned int size) {
    unsigned int i, count, dsize = 2 * size;
    Device* device = getDevices(backend, &count);
    useDevices(device, 1, NULL);
    printf("\nUsing %s backend\n", backend);

    // Stream arrays
    Stream<float> sIn1(1, &size);
    Stream<float> sIn2(1, &size);
    Stream<float> sOut(1, &dsize);
    
    // Load
    sIn1.read(vIn1);
    sIn2.read(vIn2);
    if(sIn1.error()) puts(sIn1.errorLog());
    if(sIn2.error()) puts(sIn2.errorLog());

    // Kernel
    ker(sIn1, sIn2, sOut);
    if(sIn1.error()) puts(sIn1.errorLog());
    if(sIn2.error()) puts(sIn2.errorLog());
    if(sOut.error()) puts(sOut.errorLog());

    // Save
    sOut.write(vOut);

    // Print
    for(i = 0; i < 8; i++)
        printf("vOut[%i] = (%7.3f);\n", i, vOut[i]);
}

int main(int argc, char** argv) {

    unsigned int i, SIZE = 1 << 20; // *** Wrong result using big sizes ***
    // unsigned int i, SIZE = 1 << 10; // *** Right using small streams ***

    // Memory arrays
    vIn1 = (float*)malloc(    SIZE * sizeof(float));
    vIn2 = (float*)malloc(    SIZE * sizeof(float));
    vOut = (float*)malloc(2 * SIZE * sizeof(float));
    if(!vIn1 || !vIn2 || !vOut) {
        printf("Enlarge project heap memory first\n");
        exit(1); // non-zero exit code signals the allocation failure
    }

    // Init
    for(i = 0; i < SIZE; ++i) {
        vIn1[i] =  (float)i;
        vIn2[i] = -(float)i;
    }

    test("cpu", SIZE);
    test("gpu", SIZE);

    // Free memory
    free(vIn1); free(vIn2); free(vOut);
    return 0;
}

 

// WinXP 32, MSVC 2005, Radeon 3870x2, Brook+ 1.4, Catalyst 9.2

0 Likes
Ceq
Journeyman III

It looks like the test "/samples/legacy/test/domain" also fails, at least using Catalyst 9.5 (it works with 9.2).

Can anybody please confirm the output of the previous post? I just want to make sure that it isn't related to my installation or a mistake on my side.

Thanks.

0 Likes

Domain test works fine for me with Catalyst 9.5. Are you using the default dimensions?

0 Likes

That's strange... Yes, I was using the default dimensions and building in x64 mode. Note that it is the legacy test that fails; the CPP version works fine. I reverted to Catalyst 9.2 (I needed address translation) and this test is OK now.

Does the code I wrote return the right result for you? If so, maybe there is something wrong with my installation because of going back and forth between Catalyst drivers to test programs. I'm currently using WinXP 32 and Catalyst 9.2, and the CPU and CAL outputs for that code differ. This is the output:


Using cpu backend
vOut[0] = (  0.000);
vOut[1] = ( -0.000);
vOut[2] = (  1.000);
vOut[3] = ( -1.000);
vOut[4] = (  2.000);
vOut[5] = ( -2.000);
vOut[6] = (  3.000);
vOut[7] = ( -3.000);

Using gpu backend
vOut[0] = (  0.000);
vOut[1] = (  0.000);
vOut[2] = (  1.000);
vOut[3] = (  0.000);
vOut[4] = (  2.000);
vOut[5] = (  0.000);
vOut[6] = (  3.000);
vOut[7] = (  0.000);

 

WinXP 32 SP3, MSVC 2005, Radeon 3870x2, Brook+ 1.4, Catalyst 9.2
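Printing only the first 8 values makes it hard to see where the outputs diverge, and index 1 cannot show the problem at all because 0 and -0 compare equal. A small host-side check over the whole output array could report the first bad index instead. This is a minimal sketch assuming the same initialization as the test case (vIn1[k] = k, vIn2[k] = -k); `first_mismatch` is a hypothetical helper name, not part of Brook+.

```c
#include <assert.h>
#include <stddef.h>

/* Scans an interleaved output of 2*n floats and returns the index of the
 * first element that breaks the pattern out[2k] = k, out[2k+1] = -k.
 * Returns 2*n when every element matches.
 * `first_mismatch` is a hypothetical helper for illustration. */
size_t first_mismatch(const float *out, size_t n)
{
    assert(out != NULL);
    for (size_t pos = 0; pos < 2 * n; ++pos) {
        float k = (float)(pos >> 1);
        float expected = (pos & 1u) ? -k : k;
        if (out[pos] != expected)   /* note: -0.0f == 0.0f is true */
            return pos;
    }
    return 2 * n;
}
```

Applied to the first 8 values of the two listings above, the CPU output passes completely, while the GPU output first fails at index 3, where -1 was expected and 0 was returned.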

0 Likes

Hi, this is just to report that I've tried Catalyst 9.6, but address virtualization is still not working.

Originally posted by: gaurav.garg

It is a regression with Catalyst 9.5 or 9.4 in which CAL reports max1DWidth to be more than 8192, but resource allocation fails with width > 8192. It works fine with Catalyst 9.2 and 9.3.


0 Likes