cancel
Showing results for 
Search instead for 
Did you mean: 

Archives Discussions

yarr
Journeyman III

Could use some help with arrays

weird index behavior

Hello everyone!

Trying to get into brook+ GPU programming, I encountered a weird error with array indexes. Here's my test kernel:

kernel void testkernel(uint2 ker_in<>, uint ker_key[2], uint ker_s87[256], out uint2 ker_out<>)
{
 uint ker_n1, ker_n2, t, z, mhm;
 uint2 test;
 
 ker_n1 = (uint)ker_in.x;
 ker_n2 = (uint)ker_in.y;

 t = (ker_n1+ker_key[0]) & (uint)0xFFFFFFFF;
 mhm=t>>(uint)24 &(uint)255;
 z = ker_s87[mhm];

 test.x = z;
 test.y = mhm;
 ker_out = (uint2)test;
}

Everything works okay, except of the highlited part.
Somehow "z" just returns zero.

"mhm" itself returns proper values: from zero to 255, and also if I just manually substitute the index in the "z" expression, for example:

z = ker_s87[158];

I get proper result, with z returning the needed element.

I can't really understand what's wrong here and why doesnt it work in proper way. 😞

Thanks in advance for any ideas or suggestions!

0 Likes
43 Replies
rick_weber
Adept II

When indexing into a stream, I usually use floating point indices. I'm not sure you can index using uint, though I could be wrong. Thus, try changing z = ker_s87[mhm] to z = ker_s87[(float)mhm].

0 Likes

Na-ah, still a no-go.

However, stating "z = ker_s87[(uint)169]" does return a proper value. So, I guess, indexing by (u)ints seems to work.

0 Likes

Sometimes the compiler doesn't generates what you expect... I spent a couple of hours wondering what was going on with a piece of code just to discover that the compiled IL was not correct.

Well, i don't know if this is the case here, but you can maybe try the following equivalent code to yours and check the behaviour...

kernel void test2(uint2 ker_in<>, uint ker_key[2], uint ker_s87[256], out uint2 ker_out<>
{
 uint t, z, mhm;
 
 t = ker_in.x + ker_key[0];

 mhm = (t>>(uint)24) & (uint)255;

 z = ker_s87[mhm];

 ker_out.x = z;
 ker_out.y = mhm;
}

0 Likes

That's how initial code looked like pretty much. ^^ Still works the same way tho: "mhm" operation returns proper results and "z" just wont work.

Actually it looked like this at first: "z = ker_s87[(t>>(uint)24) & (uint)255]" and the "mhm" itself was put in there to check whether there's maybe a problem with "t" or the "t>>(uint)24) & (uint)255" operation. But well, they seem to work just fine and it indeed looks like the compiler wont generate proper code for "z = ker_s87['something different than plain number']". 😞

0 Likes

use -c flag and run again. Mostly it will work

0 Likes

Using "-c" causes it to even fail building this kernel with the following error:

error C2664: 'void __test2::operator ()(brook::Stream,brook::Stream,brook::Stream,brook::Stream)' : cannot convert parameter 2 from 'uint [2]' to 'brook::Stream'
with [T=uint2] and [T=uint]
No constructor could take the source type, or constructor overload resolution was ambiguous

Also, I did some more testing; if I just state the parameter in the kernel and pass the variable to "z":

mhm = 139;
z = ker_s87[mhm];

I get the proper result returned. So I guess its not about passing the index as variable. The problem is that I need it to be computed inside the kernel, so I'm stuck here again... anyone got some ideas?

0 Likes
yarr
Journeyman III

Did some more testing, wrote this simple kernel:

kernel void test3(uint2 ker_in<>, uint ker_s87[256], out uint2 ker_out<>, uint index_test)
{
ker_out.x=index_test;
ker_out.y=ker_s87[index_test];
}

and behavior is very similar to the previous kernel: "ker_out.y" just returns zero.

However, with:

kernel void test4(uint2 ker_in<>, uint ker_s87[256], out uint2 ker_out<>)
{
uint index_test = (uint)255;
ker_out.x=index_test;
ker_out.y=ker_s87[index_test];
}

I get the last array element returned.
So it looks like those arrays only accept constants as indices?

Bug?



0 Likes

Yarr,

I have written following test case based on your discussions.

could you please conform  whether test case reproduces the problem you are facing?

kernel void testkernel(uint2 ker_in<>, uint ker_s87[256], out uint2 ker_out<>,uint index_test)

{

    ker_out.x=index_test;

    ker_out.y=ker_s87[index_test];

}

 

int main(int argc, char** argv)

{

    uint *i0 = NULL;

    uint  *i1 = NULL;

    uint *o0 = NULL;

    uint2 streami0<4>;

    uint2 streamo0<4>;

    uint c = 200;

    int i = 0, j = 0;

    int mismatched = 0;

   

    i0 = (uint*)malloc(4 * 2 * sizeof(uint));

    i1 = (uint*)malloc(256 * sizeof(uint));

    o0 = (uint*)malloc(4 * 2 * sizeof(uint));

 

    for(j = 0; j < 256; j++)

    {

        i1 = (uint)j;

    }

 

 

    //! Brook code

    streamRead(streami0, i0);

    testkernel(streami0, i1, streamo0, c);

    streamWrite(streamo0, o0);

 

    for(i = 0; i < 4; i++)

    {

        if(o0[i * 2 + 1] != i1)

        {

            mismatched = 1;

            break;

        }

    }

 

    if(mismatched)

        printf("Failed!!\n");

    else

        printf("Passed!!\n");

 

    free(i0);

    free(i1);

    free(o0);

}

 

 

0 Likes

error C2664: 'void __test2:perator ()(brook::Stream,brook::Stream,brook::Stream,brook::Stream)' : cannot convert parameter 2 from 'uint [2]' to 'brook::Stream' with [T=uint2] and [T=uint] No constructor could take the source type, or constructor overload resolution was ambiguous

 

Compiling brook kernel with -c flag means it changes a conatant array into stream. So, it requires you to pass this parameter as Stream and not as uint [2] (as mentioned in error)

0 Likes

Thanks alot for responding, that indeed made the program run properly.

 But, according to manual (also if I understood it correctly :)), not using constant buffer thingie is should be way slower? Rephrasing, I drastically reduce performance by swapping from constant buffer to just passing the data as another stream?
And this also means, that constant buffer behavior is somewhat borked because I couldnt get elemets from there?

Also, let me explain why performance is so important to me:
I'm a student and me + some of my buddies picked the task of implementing russian GOST 28147-89 (http://en.wikipedia.org/wiki/GOST_28147-89) block cipher algorythm on various "unusual" devices as our graduation work. I picked AMD GPU, while 2 another guys picked nvidia GPU and CELL BE processor in sony PS3 respectively. So basically it resulted in some kind of "race" here of who gets the cipher to run fastest. 🙂

Anyways, I already got some running program, but its embarrasingly slow compared to other "competitors". Running tests with 128mb of input data produces speeds of around 50-60mb/s (which is around the performance I get on my CPU) while the cell guy has ~150mb/s per core and nvidia guy (he's using g92 card) has around 2gb/sec.

Judging by theoretical speed of my card (I'm using 4870 512Mb running at 775/4000 btw), it should easily outperform both of those, but it simply doesnt. 🙂

This brings me to thinking something's REALLY wrong with my program (although it really looks alike to CUDA version for example). So could you please look through it and point out maybe some obvious bottlenecks which can be fixed to improve the working speed?

I'll just paste kernels I'm using and shortly explain what they do:
kernel uint sbox_gpu(uint t, uint ker_s[4][256]) //this is subkernel for applying the substitution boxes
//works byte-by-byte
{
uint x;
x=ker_s[0][t>>(uint)24 & (uint)255] << (uint)24 | ker_s[1][t>>(uint)16 & (uint)255] << (uint)16 | ker_s[2][t>>

(uint)8 & (uint)255] << (uint)8 | ker_s[3][t & (uint)255];
return x<<(uint)11 | x>>(uint)21;
}

//here's the main kernel, it accepts the input of: 1) 64 bit blocks, streamed
//2) array of 256 bit cipher key sliced into 8 pieces
//3) array of s-boxes to pass 'em to subkernel
kernel void gostcrypt_gpu(uint2 ker_in<>, uint ker_key[], uint ker_s[][], out uint2 ker_out<>)
{
uint n1, n2;

n1 = (uint)ker_in.x;
n2 = (uint)ker_in.y;

//swapping names each round instead of swapping blocks
//-------------First block of 8 iterations-------------
n2 = n2^sbox_gpu((n1+ker_key[0]) & (uint)0xFFFFFFFF, ker_s);
n1 = n1^sbox_gpu((n2+ker_key[1]) & (uint)0xFFFFFFFF, ker_s);
n2 = n2^sbox_gpu((n1+ker_key[2]) & (uint)0xFFFFFFFF, ker_s);
n1 = n1^sbox_gpu((n2+ker_key[3]) & (uint)0xFFFFFFFF, ker_s);
n2 = n2^sbox_gpu((n1+ker_key[4]) & (uint)0xFFFFFFFF, ker_s);
n1 = n1^sbox_gpu((n2+ker_key[5]) & (uint)0xFFFFFFFF, ker_s);
n2 = n2^sbox_gpu((n1+ker_key[6]) & (uint)0xFFFFFFFF, ker_s);
n1 = n1^sbox_gpu((n2+ker_key[7]) & (uint)0xFFFFFFFF, ker_s);

//-------------Second block of 8 iterations-------------
// repeat the above part, I skipped it to conserve space.
//-------------Third block of 8 iterations-------------
// same deal, repeat the first part.
//-------------Last (reversed) block of 8 iterations-------------
n2 = n2^sbox_gpu((n1+ker_key[7]) & (uint)0xFFFFFFFF, ker_s);
n1 = n1^sbox_gpu((n2+ker_key[6]) & (uint)0xFFFFFFFF, ker_s);
n2 = n2^sbox_gpu((n1+ker_key[5]) & (uint)0xFFFFFFFF, ker_s);
n1 = n1^sbox_gpu((n2+ker_key[4]) & (uint)0xFFFFFFFF, ker_s);
n2 = n2^sbox_gpu((n1+ker_key[3]) & (uint)0xFFFFFFFF, ker_s);
n1 = n1^sbox_gpu((n2+ker_key[2]) & (uint)0xFFFFFFFF, ker_s);
n2 = n2^sbox_gpu((n1+ker_key[1]) & (uint)0xFFFFFFFF, ker_s);
n1 = n1^sbox_gpu((n2+ker_key[0]) & (uint)0xFFFFFFFF, ker_s);

// There is no swap after the last round
ker_out.x = n2;
ker_out.y = n1;
}

You can see the whole sourcecode here: http://annihilator.comtv.ru/source.txt

I'll be really grateful for any ideas about the ways of optimizing this code, huge thanks in advance.



0 Likes

Hi yarr,

Some easy changes that I think can help you-

1. change ker_in and ker_out into 2D stream of 4096, 4096 (It should not affect your algorithm).

2. Change ker_key into constant buffer. You can keep both constant buffer and gather array(remove specified sizes from kernel declaration) in your program.

kernel void gostcrypt_gpu(uint2 ker_in<>, uint ker_key[8], uint ker_s[][], out uint2 ker_out<>

Compile it without -c. It should work.

3. Convert uint2 ker_in and ker_out streams into uint4 stream. It reduces your number of threads to half and increases your ALU usage while keeping only one fetch instuction.

4. Last step might be to increase number of outputs in a single kernel. Writing to multiple outputs helps you decrease total threads and increase ALU intensity of kernel. But, looks like this step won't be very helpful in your case.

Please follow each step one by one and see performance after each step. Let me know about your results.

0 Likes

Thanks gaurav, your advices really seemed to speed it up somewhat. 🙂

Posting execution times for those kernels:
First (original) one with 1D array of 128mb input data: ~2.32 secs.
1) After changing those to 2D of 4096x4096: down to ~1.93.
2) After changing keys array to be in constant buffer: down to ~1.46.
That should be around 87mb/sec, which is still really really slow, but way better than what I had previously. 🙂

Third step is not that trivial (req's some tinkering with sboxes) and might even result in the kernel being slowed down instead, but I'll try to rewrite it anyways.

Btw, tried the modified "fastest" kernel with sbox subkernel just returning the input value (thus, the sbox routine being pretty much disabled, I guess) and got results of around 0.4 secs, which still looks quite slow...

Getting to the fourth step I get only one idea of maybe writing n1 and n2 to 2 separate outputs, but that'll require some extra time afterwards to rearrange them. Still worth a shot, I guess, so I'm on it. 🙂

EDIT: 4) Okay, tried 2 separate outputs: got slower by approx 0.2 to 0.3 secs, guess that wont really work.

EDIT2: 3) Tried some pretty straightforward, "brute" conversion to uint4 in/out streams by pretty much doubling kernel instructions:

n1 = (uint)ker_in.x;
n2 = (uint)ker_in.y;
n3 = (uint)ker_in.z;
n4 = (uint)ker_in.w;

n2 = n2^sbox_gpu((n1+ker_key[0]) & (uint)0xFFFFFFFF, ker_s);
n4 = n4^sbox_gpu((n3+ker_key[0]) & (uint)0xFFFFFFFF, ker_s);
n1 = n1^sbox_gpu((n2+ker_key[1]) & (uint)0xFFFFFFFF, ker_s);
n3 = n3^sbox_gpu((n4+ker_key[1]) & (uint)0xFFFFFFFF, ker_s);
[...]
and so on.

And well I get around 4.2 secs for executing this.
Removing s-boxes I get around 2.2 secs.

Even considering I doubled the input data (256 mb now), it's still a slowdown.
Either I need more "graceful" vector conversion, or it simply wont work.
The main problem with proper vectorization are those substitutions.

P.S. I'm running Vista ultimate x64 SP1 if that matters much.
SDK I'm using is of latest (1.3) version.

Thanks again.

0 Likes

Originally posted by: gaurav.garg



1. change ker_in and ker_out into 2D stream of 4096, 4096 (It should not affect your algorithm).



I have seen this mentioned before, is this true for all 1D streams? That they would get better performance if put into 2D streams?

And by how much (aka: is it worth the coding time)?

0 Likes

First of all, wavefronts are created from quads(2X2 thhreads). So, if you have 1D stream, you are actually using only half of SIMD cores.

Second reason is some overhead of Brook+ virualization. With CAL you can allocate 1D resource of max size 8192(2^13), but Brook+ allows you 1D streams of size 2^26. It has to generate some extra code for mapping from 1D stream to 2D texture space and vice-versa. Keep in mind that this overhead only appears if your stream is of size > 8192.

0 Likes

This is true and in most cases it is a fairly large performance increase(as long as you are not using global buffer). There are two reasons for this. One is that 1D streams have a limit on the size that is a magnitude smaller than 2D streams and if the 1D stream is over that hardware limit, then it requires address translation via a generic algorithm. If you can program your algorithm with 2D streams then you loose this overhead increasing performance.

Also, even though our hardware is becoming more generalized, it is still a piece of hardware optimized for graphics which is inherently 2D. So by going the 2D stream route and using vector data types then you hit the fast paths that are highly optimized, thereby improving performance.
0 Likes

Originally posted by: MicahVillmow

Also, even though our hardware is becoming more generalized, it is still a piece of hardware optimized for graphics which is inherently 2D. So by going the 2D stream route and using vector data types then you hit the fast paths that are highly optimized, thereby improving performance.


By vector types you mean float4 or int4, etc?

Another question:

How much of a difference do you get between these two pieces of code for float4s:

out = in;

AND

out.w = in.w;
out.x = in.x;
out.y = in.y;
out.z = in.z;

Is the former treated as a vector and the latter treated as scalars (even though they are of vector type)?
0 Likes

Sorry, double post.
0 Likes

Nevermind. Thanks. Sorry to waste space.
0 Likes

ryta,
That is correct on vector types.

The cal compiler should generate equivalent code for both of those examples if there is not complicated code in between each of the assignments.
0 Likes

Have a few questions about 2D arrays:

1. streamRead/streamWrite won't take a double pointer? How do you copy over a dynamic 2D array into a 2D stream of the same size?

2. Looking at the following kernel I am having some problems porting it from 1D to 2D.

Here is the 1D code which works fine:

int x, y, idx;
idx = instance().x;
x = idx%gx;
y = (int)floor((float)idx/(float)gx);
Fs1to4 = Fin1to4[idx];
Fs5to8 = Fin5to8[idx];
Fs9 = Fin9[idx];
if ((y > bk-1) && (y <= my-bk+1))
{
if (idx=gx*y) //x==0
{
Fs1to4 = Fin1to4[(mx-1)+gx*y];
Fs5to8 = Fin5to8[(mx-1)+gx*y];
Fs9.w = Fin9[(mx-1)+gx*y].w;
}
if (idx == ((mx+1)+(gx*y))) //x==mx+1
{
Fs1to4 = Fin1to4[2+gx*y];
Fs5to8 = Fin5to8[2+gx*y];
Fs9.w = Fin9[2+gx*y].w;
}
}

And here is the 2D code which does not work:

int x, y, idx;
x = instance().x;
y = instance().y;
idx = x+gx*y;
Fs1to4 = Fin1to4;
Fs5to8 = Fin5to8;
Fs9 = Fin9;
if ((y > bk-1) && (y <= my-bk+1))
{
if (x==0)
{
Fs1to4 = Fin1to4[(mx-1)];
Fs5to8 = Fin5to8[(mx-1)];
Fs9.w = Fin9[(mx-1)].w;
}
if (x==(mx+1))
{
Fs1to4 = Fin1to4[2];
Fs5to8 = Fin5to8[2];
Fs9.w = Fin9[2].w;
}
}

Now, I know the x==0 part works because I have tried it in the first code with no problems; however, the second part seems to not work properly even if I substitute it in the 1D code. I think I am missing something as far as how the 2D domain and instance() work.
0 Likes

1. Thers is no way you can pass double pointer to streamRead. It has same behavior as cpu memcpy.

2. When you index into 2D gather array, it has to be C-style indexing.

a where k - row number and j - column number.

0 Likes

Originally posted by: gaurav.garg

1. Thers is no way you can pass double pointer to streamRead. It has same behavior as cpu memcpy.




2. When you index into 2D gather array, it has to be C-style indexing.




a where k - row number and j - column number.



1. So there is no way to use dynamically allocated multidimensional arrays in Brook+? That's a pretty big limitation no?

2. Yes, I understand this. I'm not sure I understand how this helps my 2nd question at all, could you elaborate?

0 Likes

As in your 1D code you have written -

x = idx%gx;
y = (int)floor((float)idx/(float)gx);

That made me thought x is column index and y is row index. Is that right?

If it's right, you need to change your code to - Fin1to4;

0 Likes

Originally posted by: gaurav.garg

As in your 1D code you have written -


x = idx%gx;
y = (int)floor((float)idx/(float)gx);


That made me thought x is column index and y is row index. Is that right?


If it's right, you need to change your code to - Fin1to4;


So from this I could just swap the instance of:

x = instance().y;
y = instance().x;

Does that make sense? Just wondering because that doesn't seem to work. So are arrays in kernels in [column][row]? Are they put that way in streamRead?

ALSO, how do you pass a dynamically allocated 2D array into a stream since you can't use streamRead with double pointers???

0 Likes

Originally posted by: ryta1203 ALSO, how do you pass a dynamically allocated 2D array into a stream since you can't use streamRead with double pointers???


 

Stream* xGPU[32/8];
float* x = (float*)malloc(sizeof(x[0])*npt);
 int coordDim[] = {8192, 8}; 

for(int curBlock = 0; curBlock < 32; curBlock+=8)
{
xGPU[curBlock/8] = new Stream(2,coordDim); 
xGPU[curBlock/8]->read(&x[8*8192*i]);
}
This is something I cooked up a few months ago. Alternatively, you can call streamRead(*xGPU, x);

 

 

 



 

0 Likes

Did you try changing other gather array access. Try this-

int x, y, idx;
x = instance().x;
y = instance().y;
idx = x+gx*y;
Fs1to4 = Fin1to4;
Fs5to8 = Fin5to8;
Fs9 = Fin9;
if ((y > bk-1) && (y <= my-bk+1))
{
if (x==0)
{
Fs1to4 = Fin1to4[(mx-1)];
Fs5to8 = Fin5to8[(mx-1)];
Fs9.w = Fin9[(mx-1)].w;
}
if (x==(mx+1))
{
Fs1to4 = Fin1to4[2];
Fs5to8 = Fin5to8[2];
Fs9.w = Fin9[2].w;
}
}

0 Likes

Gaurav,

This doesn't work (I tried) and I wouldn't expect it too either. The conditions are based on the indexes of the columns and rows but have not been switched. I tried to switch them also and I get the same results.

The results I get are as if the kernel is not called at all, meaning that it's either not making it into the first conditional or that it's not making it into the other two.

This was one of the easiest kernels to implement in my 1D solution but for 2D it seems much trickier, although it really shouldn't be since the original code (CPU) is in 2D also.

What I want to know is this:

1. If you take a 2D array in C [row][column] and call StreamRead, is it read in as [column][row]? If so, then does the dimensions of the stream need to be in [column][row] to fit? Currently my column and row are the same size but this might not always be the case.

2. If I understand you correctly, instance().x returns the column index and instance().y returns the row index, correct?

In my 1D case, if I use idx==(mx+1)+gx*y then I get the correct results; however if I use x == mx+1 OR y == mx+1 I get incorrect results, so it might be that this problem is not specific to the 2D case just that I can't access the 2D using the 1D solution.


EDIT: I thought it might be useful to post the original C code I am converting:

for(y=BK;y<=MY-BK+1;y++)
for(k=0 ; k < Q ; k + + )
{
F[0]=F[MX-1];
F[MX+1]=F[2];
}


EDIT2: As a side note, I have seen significant speedup using 2D just through 2 (out of the 6) kernels and that solution is using a lot more streamRead/streamWrites than the full solution (as in the 1D version), so this is great if I can get it working!!
0 Likes

Can you post your kernel signature and runtime code for 2D case?

0 Likes

kernel signature? runtime code?

I'm not familar with the workings of the code created by the brcc. Do you mean the generated IL?
0 Likes

Kernel signature - function signature for you brook kernel giving information about kernel parameters

runtime code - Part of code where you declare streams and call kernel.

0 Likes

void mcollid()
{
int step, x=0, y=0; float Norm1, Norm2, error1, error2;
float4 Fs1to4_1< gx, gy>;
float4 Fs1to4_2< gx, gy>;
float4 Fs5to8_1< gx, gy>;
float4 Fs5to8_2< gx, gy>;
float4 Fs9_1< gx, gy>;
float4 Fs9_2< gx, gy>;

float4 fs1to4_1< gx, gy>;
float4 fs5to8_1< gx, gy>;
float4 fs9_1< gx, gy>;
float4 fs1to4_2< gx, gy>;
float4 fs5to8_2< gx, gy>;
float4 fs9_2< gx, gy>;
float GEOs< gx, gy>;
float ss< 9>;
float rs< 9>;
int es< 18>;

streamRead(fs1to4_1, f1to4);
streamRead(fs5to8_1, f5to8);
streamRead(fs9_1, f9);
streamRead(Fs1to4_1, F1to4);
streamRead(Fs5to8_1, F5to8);
streamRead(Fs9_1, F9);
streamRead(ss, s);
streamRead(GEOs, GEO);
streamRead(rs, r);
streamRead(es, e);

mcollid_s(Fs1to4_1, Fs5to8_1, Fs9_1, fs1to4_1, fs5to8_1, fs9_1, GEOs, ss, G, Fs9_2, Fs5to8_2, Fs1to4_2);
Fs9_2.error();

advection1_s(Fs1to4_2, Fs5to8_2, Fs9_2, GEOs, es, gx, mx, my, rs, Fs9_1, Fs5to8_1, Fs1to4_1);
Fs9_1.error();

advection2_s(Fs1to4_1, Fs5to8_1, Fs9_1, gx, mx, my, bk, Fs9_2, Fs5to8_2, Fs1to4_2);
Fs9_2.error();

streamWrite(Fs1to4_2, F1to4);
streamWrite(Fs5to8_2, F5to8);
streamWrite(Fs9_2, F9);
}




kernel void advection2_s(float4 Fin1to4[][], float4 Fin5to8[][], float4 Fin9[][], int gx,
int mx, int my, int bk, out float4 Fs9<>, out float4 Fs5to8<>, out float4 Fs1to4<>)
{
int x, y;
x = instance().x;
y = instance().y;
Fs1to4 = Fin1to4;
Fs5to8 = Fin5to8;
Fs9 = Fin9;
if ((y > bk-1) && (y <= my-bk+1))
{
if (x==0)
{
Fs1to4 = Fin1to4[(mx-1)];
Fs5to8 = Fin5to8[(mx-1)];
Fs9.w = Fin9[(mx-1)].w;
}
if (x==(mx+1))
{
Fs1to4 = Fin1to4[2];
Fs5to8 = Fin5to8[2];
Fs9.w = Fin9[2].w;
}
}
}
0 Likes

I assume in float4 Fs5to8_2< gx, gy>;  gx is height and gy is width.

The only other part I have doubt is your gather array accesing. All of your indexing have y(current row index) as second component. Not sure if you wanted to do this.

1. If you take a 2D array in C [row][column] and call StreamRead, is it read in as [column][row]? If so, then does the dimensions of the stream need to be in [column][row] to fit? Currently my column and row are the same size but this might not always be the case.


I give you an example that shows how to do this-

float a<height, width>;
float aPtr[height][width]; // or float* aPtr = new float[height*width];
streamRead(a, aPtr);

0 Likes

yes, gx is row and gy is column but this isn't relevant since gx=gy in my example.

Here is how I have it:

float4 F5to8[gx][gy];
float4 Fs5to8< gx,gy >;
streamRead(Fs5to8, F5to8);

This should be right since this works for the first two kernels, no?

If you look at the original code, the domain is y=bk to y<=my-bk+1 and inside this domain only the rows that equal 0 and mx-1 are set; otherwise everything else stays the same.

I'm just not sure why instance().x == 0 and instance().x == mx-1 OR instance().y ==0 and instance().y == mx-1 don't work unless 1) instance() doesn't work as advertised OR (and the greater possibility) 2) I'm using instance() incorrectly.
0 Likes

Originally posted by: gaurav.garg
1. If you take a 2D array in C [row][column] and call StreamRead, is it read in as [column][row]? If so, then does the dimensions of the stream need to be in [column][row] to fit? Currently my column and row are the same size but this might not always be the case.


I give you an example that shows how to do this-

float a; float aPtr[height][width]; // or float* aPtr = new float[height*width]; streamRead(a, aPtr);



From my experience that is not working in a consistent way when using GPU or CPU backend. Just an example, when defining a stream

  unsigned int dimc[] = {height,width};

::brook::Stream<double2> s(2,dimc);

 and then invoking a kernel

kernel void test (*some optional input here*, out double2 s<>){

    s.x = instance().x;

    s.y = instance().y

}

one sees at least with the CPU backend that instance().x is actually running from 0 to height-1, and instance().y from 0 to width-1. With the GPU-backend it is different (didn't get it working yet), but the indexing appears to be borked. To sum it up, with the CPU backend an array defined as

double t_a[height][width];

has to be fitted with a stream



::brook::Stream t_s (2,{width, height});

otherwise it is not working over here.



 



PS: Why is the forum software appending empty lines to my post? I can't edit it away, it only adds another two mor lines everytime I edit my post :confused;

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 



 

0 Likes

unsigned int dimc[] = {height,width};


When using C++ style constructor it has to be -

unsigned int dimc[] = {width, height};

0 Likes

Is this a bug? The aforementioned code I posted works if BRT_RUNTIME=cpu, for example:

kernel void advection2_s(float4 Fin1to4[], float4 Fin5to8[], float4 Fin9[], int gx,
int mx, int my, int bk, out float4 Fs9<>, out float4 Fs5to8<>, out float4 Fs1to4<>)
{
int x, y, idx;
idx = instance().x;
x = idx%gx;
y = (int)floor((float)idx/(float)gx);
Fs1to4 = Fin1to4[idx];
Fs5to8 = Fin5to8[idx];
Fs9 = Fin9[idx];
if ((y > bk-1) && (y <= my-bk+1))
{
if (x==0)//(idx == gx*y)
{
Fs1to4 = Fin1to4[(mx-1)+gx*y];
Fs5to8 = Fin5to8[(mx-1)+gx*y];
Fs9.w = Fin9[(mx-1)+gx*y].w;
}
if (x==mx+1)//(idx == (mx+1)+gx*y)
{
Fs1to4 = Fin1to4[2+gx*y];
Fs5to8 = Fin5to8[2+gx*y];
Fs9.w = Fin9[2+gx*y].w;
}
}
}

This works when BRT_RUNTIME=CPU just fine but fails (gives incorrect results) when BRT_RUNTIME=CAL.

I'm not familiar with the large differences between these two runtimes but I would think they would be built to act the same, no?
0 Likes

Did you check errorLog on your output stream? Something like-

if(Fs9.error())
{
    std::cout << Fs9.errorLog();
}

 

0 Likes

Absolutely! There is no error on the output stream, no log is being printed hence the conditional is not being taken.

I think it's more of a logical error, but I found it odd that in my 1D example above if(x==mx+1) works for CPU but not CAL.
0 Likes

Assign one output with x, y and idx values and see if you can find out the problem.

My guess is brcc is having issue with x and y calculation. e.g. calculate x like this-

x = idx-y*gx;

Try different combinations for x & y calculation.

0 Likes