Turn on suggestions

Auto-suggest helps you quickly narrow down your search results by suggesting possible matches as you type.

Showing results for

- AMD Community
- Communities
- Developers
- Devgurus Archives
- Archives Discussions
- Could use some help with arrays

- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Mute
- Printer Friendly Page

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

01-12-2009
06:57 PM

Could use some help with arrays

Hello everyone!

Trying to get into brook+ GPU programming, I encountered a weird error with array indexes. Here's my test kernel:

kernel void testkernel(uint2 ker_in<>, uint ker_key[2], uint ker_s87[256], out uint2 ker_out<>)

{

uint ker_n1, ker_n2, t, z, mhm;

uint2 test;

ker_n1 = (uint)ker_in.x;

ker_n2 = (uint)ker_in.y;

t = (ker_n1+ker_key[0]) & (uint)0xFFFFFFFF;

mhm=t>>(uint)24 &(uint)255;** z = ker_s87[mhm];**

test.x = z;

test.y = mhm;

ker_out = (uint2)test;

}

Everything works okay, except of the highlited part.

Somehow "z" just returns zero.

"mhm" itself returns proper values: from zero to 255, and also if I just manually substitute the index in the "z" expression, for example:

*z = ker_s87[158];*

I get proper result, with z returning the needed element.

I can't really understand what's wrong here and why doesnt it work in proper way. 😞

Thanks in advance for any ideas or suggestions!

43 Replies

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

01-12-2009
07:02 PM

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

01-12-2009
07:07 PM

Na-ah, still a no-go.

However, stating "z = ker_s87[(uint)169]" *does* return a proper value. So, I guess, indexing by (u)ints seems to work.

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

01-12-2009
08:00 PM

Sometimes the compiler doesn't generates what you expect... I spent a couple of hours wondering what was going on with a piece of code just to discover that the compiled IL was not correct.

Well, i don't know if this is the case here, but you can maybe try the following equivalent code to yours and check the behaviour...

*kernel void test2(uint2 ker_in<>, uint ker_key[2], uint ker_s87[256], out uint2 ker_out<>{ uint t, z, mhm; t = ker_in.x + ker_key[0]; mhm = (t>>(uint)24) & (uint)255; z = ker_s87[mhm]; ker_out.x = z; ker_out.y = mhm;}*

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

01-13-2009
02:29 AM

That's how initial code looked like pretty much. ^^ Still works the same way tho: "mhm" operation returns proper results and "z" just wont work.

Actually it looked like this at first: "z = ker_s87[(t>>(uint)24) & (uint)255]" and the "mhm" itself was put in there to check whether there's maybe a problem with "t" or the "t>>(uint)24) & (uint)255" operation. But well, they seem to work just fine and it indeed looks like the compiler wont generate proper code for "z = ker_s87['something different than plain number']". 😞

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

01-13-2009
08:47 AM

use -c flag and run again. Mostly it will work

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

01-13-2009
02:10 PM

Using "-c" causes it to even fail building this kernel with the following error:

error C2664: 'void __test2::operator ()(brook::Stream,brook::Stream,brook::Stream,brook::Stream)' : cannot convert parameter 2 from 'uint [2]' to 'brook::Stream'

with [T=uint2] and [T=uint]

No constructor could take the source type, or constructor overload resolution was ambiguous

Also, I did some more testing; if I just state the parameter in the kernel and pass the variable to "z":

mhm = 139;

z = ker_s87[mhm];

I get the proper result returned. So I guess its not about passing the index as variable. The problem is that I need it to be computed inside the kernel, so I'm stuck here again... anyone got some ideas?

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

01-13-2009
05:01 PM

Did some more testing, wrote this simple kernel:

kernel void test3(uint2 ker_in<>, uint ker_s87[256], out uint2 ker_out<>, uint index_test)

{

ker_out.x=index_test;

ker_out.y=ker_s87[index_test];

}

and behavior is very similar to the previous kernel: "ker_out.y" just returns zero.

However, with:

kernel void test4(uint2 ker_in<>, uint ker_s87[256], out uint2 ker_out<>)

{

uint index_test = (uint)255;

ker_out.x=index_test;

ker_out.y=ker_s87[index_test];

}

I get the last array element returned.

So it looks like those arrays only accept constants as indices?

Bug?

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

01-30-2009
03:14 PM

Yarr,

I have written following test case based on your discussions.

could you please conform whether test case reproduces the problem you are facing?

kernel void testkernel(uint2 ker_in<>, uint ker_s87[256], out uint2 ker_out<>,uint index_test)

{

ker_out.x=index_test;

ker_out.y=ker_s87[index_test];

}

int main(int argc, char** argv)

{

uint *i0 = NULL;

uint *i1 = NULL;

uint *o0 = NULL;

uint2 streami0<4>;

uint2 streamo0<4>;

uint c = 200;

int i = 0, j = 0;

int mismatched = 0;

i0 = (uint*)malloc(4 * 2 * sizeof(uint));

i1 = (uint*)malloc(256 * sizeof(uint));

o0 = (uint*)malloc(4 * 2 * sizeof(uint));

for(j = 0; j < 256; j++)

{

i1

}

//! Brook code

streamRead(streami0, i0);

testkernel(streami0, i1, streamo0, c);

streamWrite(streamo0, o0);

for(i = 0; i < 4; i++)

{

if(o0[i * 2 + 1] != i1

{

mismatched = 1;

break;

}

}

if(mismatched)

printf("Failed!!\n");

else

printf("Passed!!\n");

free(i0);

free(i1);

free(o0);

}

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

01-14-2009
04:55 AM

error C2664: 'void __test2:perator ()(brook::Stream,brook::Stream,brook::Stream,brook::Stream)' : cannot convert parameter 2 from 'uint [2]' to 'brook::Stream' with [T=uint2] and [T=uint] No constructor could take the source type, or constructor overload resolution was ambiguous

Compiling brook kernel with -c flag means it changes a conatant array into stream. So, it requires you to pass this parameter as Stream and not as uint [2] (as mentioned in error)

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

01-19-2009
02:38 PM

Thanks alot for responding, that indeed made the program run properly.

But, according to manual (also if I understood it correctly :)), not using constant buffer thingie is should be way slower? Rephrasing, I drastically reduce performance by swapping from constant buffer to just passing the data as another stream?

And this also means, that constant buffer behavior is somewhat borked because I couldnt get elemets from there?

Also, let me explain why performance is so important to me:

I'm a student and me + some of my buddies picked the task of implementing russian GOST 28147-89 (http://en.wikipedia.org/wiki/GOST_28147-89) block cipher algorythm on various "unusual" devices as our graduation work. I picked AMD GPU, while 2 another guys picked nvidia GPU and CELL BE processor in sony PS3 respectively. So basically it resulted in some kind of "race" here of who gets the cipher to run fastest. 🙂

Anyways, I already got some running program, but its embarrasingly slow compared to other "competitors". Running tests with 128mb of input data produces speeds of around 50-60mb/s (which is around the performance I get on my CPU) while the cell guy has ~150mb/s per core and nvidia guy (he's using g92 card) has around 2gb/sec.

Judging by theoretical speed of my card (I'm using 4870 512Mb running at 775/4000 btw), it should easily outperform both of those, but it simply doesnt. 🙂

This brings me to thinking something's REALLY wrong with my program (although it really looks alike to CUDA version for example). So could you please look through it and point out maybe some obvious bottlenecks which can be fixed to improve the working speed?

I'll just paste kernels I'm using and shortly explain what they do:

kernel uint sbox_gpu(uint t, uint ker_s[4][256]) //this is subkernel for applying the substitution boxes

//works byte-by-byte

{

uint x;

x=ker_s[0][t>>(uint)24 & (uint)255] << (uint)24 | ker_s[1][t>>(uint)16 & (uint)255] << (uint)16 | ker_s[2][t>>

(uint)8 & (uint)255] << (uint)8 | ker_s[3][t & (uint)255];

return x<<(uint)11 | x>>(uint)21;

}

//here's the main kernel, it accepts the input of: 1) 64 bit blocks, streamed

//2) array of 256 bit cipher key sliced into 8 pieces

//3) array of s-boxes to pass 'em to subkernel

kernel void gostcrypt_gpu(uint2 ker_in<>, uint ker_key[], uint ker_s[][], out uint2 ker_out<>)

{

uint n1, n2;

n1 = (uint)ker_in.x;

n2 = (uint)ker_in.y;

//swapping names each round instead of swapping blocks

//-------------First block of 8 iterations-------------

n2 = n2^sbox_gpu((n1+ker_key[0]) & (uint)0xFFFFFFFF, ker_s);

n1 = n1^sbox_gpu((n2+ker_key[1]) & (uint)0xFFFFFFFF, ker_s);

n2 = n2^sbox_gpu((n1+ker_key[2]) & (uint)0xFFFFFFFF, ker_s);

n1 = n1^sbox_gpu((n2+ker_key[3]) & (uint)0xFFFFFFFF, ker_s);

n2 = n2^sbox_gpu((n1+ker_key[4]) & (uint)0xFFFFFFFF, ker_s);

n1 = n1^sbox_gpu((n2+ker_key[5]) & (uint)0xFFFFFFFF, ker_s);

n2 = n2^sbox_gpu((n1+ker_key[6]) & (uint)0xFFFFFFFF, ker_s);

n1 = n1^sbox_gpu((n2+ker_key[7]) & (uint)0xFFFFFFFF, ker_s);

//-------------Second block of 8 iterations-------------

// repeat the above part, I skipped it to conserve space.

//-------------Third block of 8 iterations-------------

// same deal, repeat the first part.

//-------------Last (reversed) block of 8 iterations-------------

n2 = n2^sbox_gpu((n1+ker_key[7]) & (uint)0xFFFFFFFF, ker_s);

n1 = n1^sbox_gpu((n2+ker_key[6]) & (uint)0xFFFFFFFF, ker_s);

n2 = n2^sbox_gpu((n1+ker_key[5]) & (uint)0xFFFFFFFF, ker_s);

n1 = n1^sbox_gpu((n2+ker_key[4]) & (uint)0xFFFFFFFF, ker_s);

n2 = n2^sbox_gpu((n1+ker_key[3]) & (uint)0xFFFFFFFF, ker_s);

n1 = n1^sbox_gpu((n2+ker_key[2]) & (uint)0xFFFFFFFF, ker_s);

n2 = n2^sbox_gpu((n1+ker_key[1]) & (uint)0xFFFFFFFF, ker_s);

n1 = n1^sbox_gpu((n2+ker_key[0]) & (uint)0xFFFFFFFF, ker_s);

// There is no swap after the last round

ker_out.x = n2;

ker_out.y = n1;

}

You can see the whole sourcecode here: http://annihilator.comtv.ru/source.txt

I'll be really grateful for any ideas about the ways of optimizing this code, huge thanks in advance.

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

01-19-2009
03:22 PM

Hi yarr,

Some easy changes that I think can help you-

1. change ker_in and ker_out into 2D stream of 4096, 4096 (It should not affect your algorithm).

2. Change ker_key into constant buffer. You can keep both constant buffer and gather array(remove specified sizes from kernel declaration) in your program.

kernel void gostcrypt_gpu(uint2 ker_in<>, uint ker_key[8], uint ker_s[][], out uint2 ker_out<>

Compile it without -c. It should work.

3. Convert uint2 ker_in and ker_out streams into uint4 stream. It reduces your number of threads to half and increases your ALU usage while keeping only one fetch instuction.

4. Last step might be to increase number of outputs in a single kernel. Writing to multiple outputs helps you decrease total threads and increase ALU intensity of kernel. But, looks like this step won't be very helpful in your case.

Please follow each step one by one and see performance after each step. Let me know about your results.

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

01-19-2009
04:19 PM

Thanks gaurav, your advices really seemed to speed it up somewhat. 🙂

Posting execution times for those kernels:

First (original) one with 1D array of 128mb input data: ~2.32 secs.

1) After changing those to 2D of 4096x4096: down to ~1.93.

2) After changing keys array to be in constant buffer: down to ~1.46.

That should be around 87mb/sec, which is still really really slow, but way better than what I had previously. 🙂

Third step is not that trivial (req's some tinkering with sboxes) and might even result in the kernel being slowed down instead, but I'll try to rewrite it anyways.

Btw, tried the modified "fastest" kernel with sbox subkernel just returning the input value (thus, the sbox routine being pretty much disabled, I guess) and got results of around 0.4 secs, which still looks quite slow...

Getting to the fourth step I get only one idea of maybe writing n1 and n2 to 2 separate outputs, but that'll require some extra time afterwards to rearrange them. Still worth a shot, I guess, so I'm on it. 🙂

**EDIT:** 4) Okay, tried 2 separate outputs: got slower by approx 0.2 to 0.3 secs, guess that wont really work.

**EDIT2:** 3) Tried some pretty straightforward, "brute" conversion to uint4 in/out streams by pretty much doubling kernel instructions:

*n1 = (uint)ker_in.x;n2 = (uint)ker_in.y;n3 = (uint)ker_in.z;n4 = (uint)ker_in.w;*

*n2 = n2^sbox_gpu((n1+ker_key[0]) & (uint)0xFFFFFFFF, ker_s);n4 = n4^sbox_gpu((n3+ker_key[0]) & (uint)0xFFFFFFFF, ker_s);n1 = n1^sbox_gpu((n2+ker_key[1]) & (uint)0xFFFFFFFF, ker_s);n3 = n3^sbox_gpu((n4+ker_key[1]) & (uint)0xFFFFFFFF, ker_s);[...]and so on.*

*And well I get around 4.2 secs for executing this.Removing s-boxes I get around 2.2 secs.*

*Even considering I doubled the input data (256 mb now), it's still a slowdown.Either I need more "graceful" vector conversion, or it simply wont work.The main problem with proper vectorization are those substitutions.*

P.S. I'm running Vista ultimate x64 SP1 if that matters much.

SDK I'm using is of latest (1.3) version.

Thanks again.

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

01-19-2009
05:09 PM

Originally posted by:gaurav.garg

1. change ker_in and ker_out into 2D stream of 4096, 4096 (It should not affect your algorithm).

I have seen this mentioned before, is this true for all 1D streams? That they would get better performance if put into 2D streams?

And by how much (aka: is it worth the coding time)?

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

01-19-2009
05:24 PM

First of all, wavefronts are created from quads(2X2 thhreads). So, if you have 1D stream, you are actually using only half of SIMD cores.

Second reason is some overhead of Brook+ virualization. With CAL you can allocate 1D resource of max size 8192(2^13), but Brook+ allows you 1D streams of size 2^26. It has to generate some extra code for mapping from 1D stream to 2D texture space and vice-versa. Keep in mind that this overhead only appears if your stream is of size > 8192.

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

01-19-2009
05:19 PM

Also, even though our hardware is becoming more generalized, it is still a piece of hardware optimized for graphics which is inherently 2D. So by going the 2D stream route and using vector data types then you hit the fast paths that are highly optimized, thereby improving performance.

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

01-19-2009
07:40 PM

Originally posted by:MicahVillmow

Also, even though our hardware is becoming more generalized, it is still a piece of hardware optimized for graphics which is inherently 2D. So by going the 2D stream route and using vector data types then you hit the fast paths that are highly optimized, thereby improving performance.

By vector types you mean float4 or int4, etc?

Another question:

How much of a difference do you get between these two pieces of code for float4s:

out = in;

AND

out.w = in.w;

out.x = in.x;

out.y = in.y;

out.z = in.z;

Is the former treated as a vector and the latter treated as scalars (even though they are of vector type)?

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

01-23-2009
12:50 PM

Sorry, double post.

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

01-23-2009
12:53 PM

Nevermind. Thanks. Sorry to waste space.

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

01-19-2009
08:28 PM

That is correct on vector types.

The cal compiler should generate equivalent code for both of those examples if there is not complicated code in between each of the assignments.

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

01-20-2009
01:56 PM

1. streamRead/streamWrite won't take a double pointer? How do you copy over a dynamic 2D array into a 2D stream of the same size?

2. Looking at the following kernel I am having some problems porting it from 1D to 2D.

Here is the 1D code which works fine:

int x, y, idx;

idx = instance().x;

x = idx%gx;

y = (int)floor((float)idx/(float)gx);

Fs1to4 = Fin1to4[idx];

Fs5to8 = Fin5to8[idx];

Fs9 = Fin9[idx];

if ((y > bk-1) && (y <= my-bk+1))

{

if (idx=gx*y) //x==0

{

Fs1to4 = Fin1to4[(mx-1)+gx*y];

Fs5to8 = Fin5to8[(mx-1)+gx*y];

Fs9.w = Fin9[(mx-1)+gx*y].w;

}

if (idx == ((mx+1)+(gx*y))) //x==mx+1

{

Fs1to4 = Fin1to4[2+gx*y];

Fs5to8 = Fin5to8[2+gx*y];

Fs9.w = Fin9[2+gx*y].w;

}

}

And here is the 2D code which does not work:

int x, y, idx;

x = instance().x;

y = instance().y;

idx = x+gx*y;

Fs1to4 = Fin1to4

Fs5to8 = Fin5to8

Fs9 = Fin9

if ((y > bk-1) && (y <= my-bk+1))

{

if (x==0)

{

Fs1to4 = Fin1to4[(mx-1)]

Fs5to8 = Fin5to8[(mx-1)]

Fs9.w = Fin9[(mx-1)]

}

if (x==(mx+1))

{

Fs1to4 = Fin1to4[2]

Fs5to8 = Fin5to8[2]

Fs9.w = Fin9[2]

}

}

Now, I know the x==0 part works because I have tried it in the first code with no problems; however, the second part seems to not work properly even if I substitute it in the 1D code. I think I am missing something as far as how the 2D domain and instance() work.

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

01-20-2009
02:21 PM

1. Thers is no way you can pass double pointer to streamRead. It has same behavior as cpu memcpy.

2. When you index into 2D gather array, it has to be C-style indexing.

a

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

01-20-2009
02:42 PM

Originally posted by:gaurav.garg

1. Thers is no way you can pass double pointer to streamRead. It has same behavior as cpu memcpy.

2. When you index into 2D gather array, it has to be C-style indexing.

awhere k - row number and j - column number.

1. So there is no way to use dynamically allocated multidimensional arrays in Brook+? That's a pretty big limitation no?

2. Yes, I understand this. I'm not sure I understand how this helps my 2nd question at all, could you elaborate?

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

01-20-2009
02:52 PM

As in your 1D code you have written -

x = idx%gx;

y = (int)floor((float)idx/(float)gx);

That made me thought x is column index and y is row index. Is that right?

If it's right, you need to change your code to - Fin1to4

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

01-20-2009
05:49 PM

Originally posted by:gaurav.garg

As in your 1D code you have written -

x = idx%gx;

y = (int)floor((float)idx/(float)gx);

That made me thought x is column index and y is row index. Is that right?

If it's right, you need to change your code to - Fin1to4;

So from this I could just swap the instance of:

x = instance().y;

y = instance().x;

Does that make sense? Just wondering because that doesn't seem to work. So are arrays in kernels in [column][row]? Are they put that way in streamRead?

ALSO, how do you pass a dynamically allocated 2D array into a stream since you can't use streamRead with double pointers???

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

01-20-2009
07:18 PM

Originally posted by:ryta1203ALSO, how do you pass a dynamically allocated 2D array into a stream since you can't use streamRead with double pointers???

* *

*Stream* xGPU[32/8]; float* x = (float*)malloc(sizeof(x[0])*npt); int coordDim[] = {8192, 8}; *

{

xGPU[curBlock/8] = new Stream

xGPU[curBlock/8]->read(&x[8*8192*i]);

}

This is something I cooked up a few months ago. Alternatively, you can call streamRead(*xGPU*, x);*

* *

* *

* *

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

01-21-2009
04:49 AM

Did you try changing other gather array access. Try this-

int x, y, idx;

x = instance().x;

y = instance().y;

idx = x+gx*y;

Fs1to4 = Fin1to4

Fs5to8 = Fin5to8

Fs9 = Fin9

if ((y > bk-1) && (y <= my-bk+1))

{

if (x==0)

{

Fs1to4 = Fin1to4

Fs5to8 = Fin5to8

Fs9.w = Fin9

}

if (x==(mx+1))

{

Fs1to4 = Fin1to4

Fs5to8 = Fin5to8

Fs9.w = Fin9

}

}

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

01-21-2009
11:16 AM

This doesn't work (I tried) and I wouldn't expect it too either. The conditions are based on the indexes of the columns and rows but have not been switched. I tried to switch them also and I get the same results.

The results I get are as if the kernel is not called at all, meaning that it's either not making it into the first conditional or that it's not making it into the other two.

This was one of the easiest kernels to implement in my 1D solution but for 2D it seems much trickier, although it really shouldn't be since the original code (CPU) is in 2D also.

What I want to know is this:

1. If you take a 2D array in C [row][column] and call StreamRead, is it read in as [column][row]? If so, then does the dimensions of the stream need to be in [column][row] to fit? Currently my column and row are the same size but this might not always be the case.

2. If I understand you correctly, instance().x returns the column index and instance().y returns the row index, correct?

In my 1D case, if I use idx==(mx+1)+gx*y then I get the correct results; however if I use x == mx+1 OR y == mx+1 I get incorrect results, so it might be that this problem is not specific to the 2D case just that I can't access the 2D using the 1D solution.

EDIT: I thought it might be useful to post the original C code I am converting:

for(y=BK;y<=MY-BK+1;y++)

for(k=0 ; k < Q ; k + + )

{

F[0]

F[MX+1]

}

EDIT2: As a side note, I have seen significant speedup using 2D just through 2 (out of the 6) kernels and that solution is using a lot more streamRead/streamWrites than the full solution (as in the 1D version), so this is great if I can get it working!!

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

01-21-2009
11:50 AM

Can you post your kernel signature and runtime code for 2D case?

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

01-21-2009
11:55 AM

I'm not familar with the workings of the code created by the brcc. Do you mean the generated IL?

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

01-21-2009
12:04 PM

Kernel signature - function signature for you brook kernel giving information about kernel parameters

runtime code - Part of code where you declare streams and call kernel.

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

01-21-2009
12:07 PM

{

int step, x=0, y=0; float Norm1, Norm2, error1, error2;

float4 Fs1to4_1< gx, gy>;

float4 Fs1to4_2< gx, gy>;

float4 Fs5to8_1< gx, gy>;

float4 Fs5to8_2< gx, gy>;

float4 Fs9_1< gx, gy>;

float4 Fs9_2< gx, gy>;

float4 fs1to4_1< gx, gy>;

float4 fs5to8_1< gx, gy>;

float4 fs9_1< gx, gy>;

float4 fs1to4_2< gx, gy>;

float4 fs5to8_2< gx, gy>;

float4 fs9_2< gx, gy>;

float GEOs< gx, gy>;

float ss< 9>;

float rs< 9>;

int es< 18>;

streamRead(fs1to4_1, f1to4);

streamRead(fs5to8_1, f5to8);

streamRead(fs9_1, f9);

streamRead(Fs1to4_1, F1to4);

streamRead(Fs5to8_1, F5to8);

streamRead(Fs9_1, F9);

streamRead(ss, s);

streamRead(GEOs, GEO);

streamRead(rs, r);

streamRead(es, e);

mcollid_s(Fs1to4_1, Fs5to8_1, Fs9_1, fs1to4_1, fs5to8_1, fs9_1, GEOs, ss, G, Fs9_2, Fs5to8_2, Fs1to4_2);

Fs9_2.error();

advection1_s(Fs1to4_2, Fs5to8_2, Fs9_2, GEOs, es, gx, mx, my, rs, Fs9_1, Fs5to8_1, Fs1to4_1);

Fs9_1.error();

advection2_s(Fs1to4_1, Fs5to8_1, Fs9_1, gx, mx, my, bk, Fs9_2, Fs5to8_2, Fs1to4_2);

Fs9_2.error();

streamWrite(Fs1to4_2, F1to4);

streamWrite(Fs5to8_2, F5to8);

streamWrite(Fs9_2, F9);

}

kernel void advection2_s(float4 Fin1to4[][], float4 Fin5to8[][], float4 Fin9[][], int gx,

int mx, int my, int bk, out float4 Fs9<>, out float4 Fs5to8<>, out float4 Fs1to4<>)

{

int x, y;

x = instance().x;

y = instance().y;

Fs1to4 = Fin1to4

Fs5to8 = Fin5to8

Fs9 = Fin9

if ((y > bk-1) && (y <= my-bk+1))

{

if (x==0)

{

Fs1to4 = Fin1to4[(mx-1)]

Fs5to8 = Fin5to8[(mx-1)]

Fs9.w = Fin9[(mx-1)]

}

if (x==(mx+1))

{

Fs1to4 = Fin1to4[2]

Fs5to8 = Fin5to8[2]

Fs9.w = Fin9[2]

}

}

}

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

01-21-2009
12:29 PM

I assume in float4 Fs5to8_2< gx, gy>; gx is height and gy is width.

The only other part I have doubt is your gather array accesing. All of your indexing have y(current row index) as second component. Not sure if you wanted to do this.

1. If you take a 2D array in C [row][column] and call StreamRead, is it read in as [column][row]? If so, then does the dimensions of the stream need to be in [column][row] to fit? Currently my column and row are the same size but this might not always be the case.

I give you an example that shows how to do this-

float a<height, width>;

float aPtr[height][width]; // or float* aPtr = new float[height*width];

streamRead(a, aPtr);

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

01-21-2009
12:49 PM

Here is how I have it:

float4 F5to8[gx][gy];

float4 Fs5to8< gx,gy >;

streamRead(Fs5to8, F5to8);

This should be right since this works for the first two kernels, no?

If you look at the original code, the domain is y=bk to y<=my-bk+1 and inside this domain only the rows that equal 0 and mx-1 are set; otherwise everything else stays the same.

I'm just not sure why instance().x == 0 and instance().x == mx-1 OR instance().y ==0 and instance().y == mx-1 don't work unless 1) instance() doesn't work as advertised OR (and the greater possibility) 2) I'm using instance() incorrectly.

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

01-21-2009
09:31 PM

Originally posted by:gaurav.garg1. If you take a 2D array in C [row][column] and call StreamRead, is it read in as [column][row]? If so, then does the dimensions of the stream need to be in [column][row] to fit? Currently my column and row are the same size but this might not always be the case.

I give you an example that shows how to do this-

float a; float aPtr[height][width]; // or float* aPtr = new float[height*width]; streamRead(a, aPtr);

From my experience that is not working in a consistent way when using GPU or CPU backend. Just an example, when defining a stream

unsigned int dimc[] = {height,width};

::brook::Stream<double2> s(2,dimc);

and then invoking a kernel

kernel void test (*some optional input here*, out double2 s<>){

s.x = instance().x;

s.y = instance().y

}

one sees at least with the CPU backend that instance().x is actually running from 0 to height-1, and instance().y from 0 to width-1. With the GPU-backend it is different (didn't get it working yet), but the indexing appears to be borked. To sum it up, with the CPU backend an array defined as

double t_a[height][width];

has to be fitted with a stream

::brook::Stream t_s (2,{width, height});

otherwise it is not working over here.

PS: Why is the forum software appending empty lines to my post? I can't edit it away, it only adds another two mor lines everytime I edit my post :confused;

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

01-22-2009
02:48 AM

unsigned int dimc[] = {height,width};

When using C++ style constructor it has to be -

unsigned int dimc[] = {width, height};

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

01-22-2009
09:56 AM

kernel void advection2_s(float4 Fin1to4[], float4 Fin5to8[], float4 Fin9[], int gx,

int mx, int my, int bk, out float4 Fs9<>, out float4 Fs5to8<>, out float4 Fs1to4<>)

{

int x, y, idx;

idx = instance().x;

x = idx%gx;

y = (int)floor((float)idx/(float)gx);

Fs1to4 = Fin1to4[idx];

Fs5to8 = Fin5to8[idx];

Fs9 = Fin9[idx];

if ((y > bk-1) && (y <= my-bk+1))

{

if (x==0)//(idx == gx*y)

{

Fs1to4 = Fin1to4[(mx-1)+gx*y];

Fs5to8 = Fin5to8[(mx-1)+gx*y];

Fs9.w = Fin9[(mx-1)+gx*y].w;

}

if (x==mx+1)//(idx == (mx+1)+gx*y)

{

Fs1to4 = Fin1to4[2+gx*y];

Fs5to8 = Fin5to8[2+gx*y];

Fs9.w = Fin9[2+gx*y].w;

}

}

}

This works when BRT_RUNTIME=CPU just fine but fails (gives incorrect results) when BRT_RUNTIME=CAL.

I'm not familiar with the large differences between these two runtimes but I would think they would be built to act the same, no?

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

01-22-2009
10:17 AM

Did you check errorLog on your output stream? Something like-

if(Fs9.error())

{

std::cout << Fs9.errorLog();

}

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

01-22-2009
10:57 AM

I think it's more of a logical error, but I found it odd that in my 1D example above if(x==mx+1) works for CPU but not CAL.

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

01-22-2009
11:36 AM

Assign one output with *x, y* and *idx* values and see if you can find out the problem.

My guess is brcc is having issue with x and y calculation. e.g. calculate x like this-

x = idx-y*gx;

Try different combinations for x & y calculation.