Hello everyone!
Trying to get into brook+ GPU programming, I encountered a weird error with array indexes. Here's my test kernel:
kernel void testkernel(uint2 ker_in<>, uint ker_key[2], uint ker_s87[256], out uint2 ker_out<>)
{
uint ker_n1, ker_n2, t, z, mhm;
uint2 test;
ker_n1 = (uint)ker_in.x;
ker_n2 = (uint)ker_in.y;
t = (ker_n1+ker_key[0]) & (uint)0xFFFFFFFF;
mhm=t>>(uint)24 &(uint)255;
z = ker_s87[mhm];
test.x = z;
test.y = mhm;
ker_out = (uint2)test;
}
Everything works okay, except of the highlited part.
Somehow "z" just returns zero.
"mhm" itself returns proper values: from zero to 255, and also if I just manually substitute the index in the "z" expression, for example:
z = ker_s87[158];
I get proper result, with z returning the needed element.
I can't really understand what's wrong here and why doesnt it work in proper way. 😞
Thanks in advance for any ideas or suggestions!
When indexing into a stream, I usually use floating point indices. I'm not sure you can index using uint, though I could be wrong. Thus, try changing z = ker_s87[mhm] to z = ker_s87[(float)mhm].
Na-ah, still a no-go.
However, stating "z = ker_s87[(uint)169]" does return a proper value. So, I guess, indexing by (u)ints seems to work.
Sometimes the compiler doesn't generates what you expect... I spent a couple of hours wondering what was going on with a piece of code just to discover that the compiled IL was not correct.
Well, i don't know if this is the case here, but you can maybe try the following equivalent code to yours and check the behaviour...
kernel void test2(uint2 ker_in<>, uint ker_key[2], uint ker_s87[256], out uint2 ker_out<>
{
uint t, z, mhm;
t = ker_in.x + ker_key[0];
mhm = (t>>(uint)24) & (uint)255;
z = ker_s87[mhm];
ker_out.x = z;
ker_out.y = mhm;
}
That's how initial code looked like pretty much. ^^ Still works the same way tho: "mhm" operation returns proper results and "z" just wont work.
Actually it looked like this at first: "z = ker_s87[(t>>(uint)24) & (uint)255]" and the "mhm" itself was put in there to check whether there's maybe a problem with "t" or the "t>>(uint)24) & (uint)255" operation. But well, they seem to work just fine and it indeed looks like the compiler wont generate proper code for "z = ker_s87['something different than plain number']". 😞
use -c flag and run again. Mostly it will work
Using "-c" causes it to even fail building this kernel with the following error:
error C2664: 'void __test2::operator ()(brook::Stream,brook::Stream,brook::Stream,brook::Stream)' : cannot convert parameter 2 from 'uint [2]' to 'brook::Stream'
with [T=uint2] and [T=uint]
No constructor could take the source type, or constructor overload resolution was ambiguous
Also, I did some more testing; if I just state the parameter in the kernel and pass the variable to "z":
mhm = 139;
z = ker_s87[mhm];
I get the proper result returned. So I guess its not about passing the index as variable. The problem is that I need it to be computed inside the kernel, so I'm stuck here again... anyone got some ideas?
Did some more testing, wrote this simple kernel:
kernel void test3(uint2 ker_in<>, uint ker_s87[256], out uint2 ker_out<>, uint index_test)
{
ker_out.x=index_test;
ker_out.y=ker_s87[index_test];
}
and behavior is very similar to the previous kernel: "ker_out.y" just returns zero.
However, with:
kernel void test4(uint2 ker_in<>, uint ker_s87[256], out uint2 ker_out<>)
{
uint index_test = (uint)255;
ker_out.x=index_test;
ker_out.y=ker_s87[index_test];
}
I get the last array element returned.
So it looks like those arrays only accept constants as indices?
Bug?
Yarr,
I have written following test case based on your discussions.
could you please conform whether test case reproduces the problem you are facing?
kernel void testkernel(uint2 ker_in<>, uint ker_s87[256], out uint2 ker_out<>,uint index_test)
{
ker_out.x=index_test;
ker_out.y=ker_s87[index_test];
}
int main(int argc, char** argv)
{
uint *i0 = NULL;
uint *i1 = NULL;
uint *o0 = NULL;
uint2 streami0<4>;
uint2 streamo0<4>;
uint c = 200;
int i = 0, j = 0;
int mismatched = 0;
i0 = (uint*)malloc(4 * 2 * sizeof(uint));
i1 = (uint*)malloc(256 * sizeof(uint));
o0 = (uint*)malloc(4 * 2 * sizeof(uint));
for(j = 0; j < 256; j++)
{
i1
}
//! Brook code
streamRead(streami0, i0);
testkernel(streami0, i1, streamo0, c);
streamWrite(streamo0, o0);
for(i = 0; i < 4; i++)
{
if(o0[i * 2 + 1] != i1
{
mismatched = 1;
break;
}
}
if(mismatched)
printf("Failed!!\n");
else
printf("Passed!!\n");
free(i0);
free(i1);
free(o0);
}
error C2664: 'void __test2:perator ()(brook::Stream,brook::Stream,brook::Stream,brook::Stream)' : cannot convert parameter 2 from 'uint [2]' to 'brook::Stream' with [T=uint2] and [T=uint] No constructor could take the source type, or constructor overload resolution was ambiguous
Compiling brook kernel with -c flag means it changes a conatant array into stream. So, it requires you to pass this parameter as Stream and not as uint [2] (as mentioned in error)
Thanks alot for responding, that indeed made the program run properly.
But, according to manual (also if I understood it correctly :)), not using constant buffer thingie is should be way slower? Rephrasing, I drastically reduce performance by swapping from constant buffer to just passing the data as another stream?
And this also means, that constant buffer behavior is somewhat borked because I couldnt get elemets from there?
Also, let me explain why performance is so important to me:
I'm a student and me + some of my buddies picked the task of implementing russian GOST 28147-89 (http://en.wikipedia.org/wiki/GOST_28147-89) block cipher algorythm on various "unusual" devices as our graduation work. I picked AMD GPU, while 2 another guys picked nvidia GPU and CELL BE processor in sony PS3 respectively. So basically it resulted in some kind of "race" here of who gets the cipher to run fastest. 🙂
Anyways, I already got some running program, but its embarrasingly slow compared to other "competitors". Running tests with 128mb of input data produces speeds of around 50-60mb/s (which is around the performance I get on my CPU) while the cell guy has ~150mb/s per core and nvidia guy (he's using g92 card) has around 2gb/sec.
Judging by theoretical speed of my card (I'm using 4870 512Mb running at 775/4000 btw), it should easily outperform both of those, but it simply doesnt. 🙂
This brings me to thinking something's REALLY wrong with my program (although it really looks alike to CUDA version for example). So could you please look through it and point out maybe some obvious bottlenecks which can be fixed to improve the working speed?
I'll just paste kernels I'm using and shortly explain what they do:
kernel uint sbox_gpu(uint t, uint ker_s[4][256]) //this is subkernel for applying the substitution boxes
//works byte-by-byte
{
uint x;
x=ker_s[0][t>>(uint)24 & (uint)255] << (uint)24 | ker_s[1][t>>(uint)16 & (uint)255] << (uint)16 | ker_s[2][t>>
(uint)8 & (uint)255] << (uint)8 | ker_s[3][t & (uint)255];
return x<<(uint)11 | x>>(uint)21;
}
//here's the main kernel, it accepts the input of: 1) 64 bit blocks, streamed
//2) array of 256 bit cipher key sliced into 8 pieces
//3) array of s-boxes to pass 'em to subkernel
kernel void gostcrypt_gpu(uint2 ker_in<>, uint ker_key[], uint ker_s[][], out uint2 ker_out<>)
{
uint n1, n2;
n1 = (uint)ker_in.x;
n2 = (uint)ker_in.y;
//swapping names each round instead of swapping blocks
//-------------First block of 8 iterations-------------
n2 = n2^sbox_gpu((n1+ker_key[0]) & (uint)0xFFFFFFFF, ker_s);
n1 = n1^sbox_gpu((n2+ker_key[1]) & (uint)0xFFFFFFFF, ker_s);
n2 = n2^sbox_gpu((n1+ker_key[2]) & (uint)0xFFFFFFFF, ker_s);
n1 = n1^sbox_gpu((n2+ker_key[3]) & (uint)0xFFFFFFFF, ker_s);
n2 = n2^sbox_gpu((n1+ker_key[4]) & (uint)0xFFFFFFFF, ker_s);
n1 = n1^sbox_gpu((n2+ker_key[5]) & (uint)0xFFFFFFFF, ker_s);
n2 = n2^sbox_gpu((n1+ker_key[6]) & (uint)0xFFFFFFFF, ker_s);
n1 = n1^sbox_gpu((n2+ker_key[7]) & (uint)0xFFFFFFFF, ker_s);
//-------------Second block of 8 iterations-------------
// repeat the above part, I skipped it to conserve space.
//-------------Third block of 8 iterations-------------
// same deal, repeat the first part.
//-------------Last (reversed) block of 8 iterations-------------
n2 = n2^sbox_gpu((n1+ker_key[7]) & (uint)0xFFFFFFFF, ker_s);
n1 = n1^sbox_gpu((n2+ker_key[6]) & (uint)0xFFFFFFFF, ker_s);
n2 = n2^sbox_gpu((n1+ker_key[5]) & (uint)0xFFFFFFFF, ker_s);
n1 = n1^sbox_gpu((n2+ker_key[4]) & (uint)0xFFFFFFFF, ker_s);
n2 = n2^sbox_gpu((n1+ker_key[3]) & (uint)0xFFFFFFFF, ker_s);
n1 = n1^sbox_gpu((n2+ker_key[2]) & (uint)0xFFFFFFFF, ker_s);
n2 = n2^sbox_gpu((n1+ker_key[1]) & (uint)0xFFFFFFFF, ker_s);
n1 = n1^sbox_gpu((n2+ker_key[0]) & (uint)0xFFFFFFFF, ker_s);
// There is no swap after the last round
ker_out.x = n2;
ker_out.y = n1;
}
You can see the whole sourcecode here: http://annihilator.comtv.ru/source.txt
I'll be really grateful for any ideas about the ways of optimizing this code, huge thanks in advance.
Hi yarr,
Some easy changes that I think can help you-
1. change ker_in and ker_out into 2D stream of 4096, 4096 (It should not affect your algorithm).
2. Change ker_key into constant buffer. You can keep both constant buffer and gather array(remove specified sizes from kernel declaration) in your program.
kernel void gostcrypt_gpu(uint2 ker_in<>, uint ker_key[8], uint ker_s[][], out uint2 ker_out<>
Compile it without -c. It should work.
3. Convert uint2 ker_in and ker_out streams into uint4 stream. It reduces your number of threads to half and increases your ALU usage while keeping only one fetch instuction.
4. Last step might be to increase number of outputs in a single kernel. Writing to multiple outputs helps you decrease total threads and increase ALU intensity of kernel. But, looks like this step won't be very helpful in your case.
Please follow each step one by one and see performance after each step. Let me know about your results.
Thanks gaurav, your advices really seemed to speed it up somewhat. 🙂
Posting execution times for those kernels:
First (original) one with 1D array of 128mb input data: ~2.32 secs.
1) After changing those to 2D of 4096x4096: down to ~1.93.
2) After changing keys array to be in constant buffer: down to ~1.46.
That should be around 87mb/sec, which is still really really slow, but way better than what I had previously. 🙂
Third step is not that trivial (req's some tinkering with sboxes) and might even result in the kernel being slowed down instead, but I'll try to rewrite it anyways.
Btw, tried the modified "fastest" kernel with sbox subkernel just returning the input value (thus, the sbox routine being pretty much disabled, I guess) and got results of around 0.4 secs, which still looks quite slow...
Getting to the fourth step I get only one idea of maybe writing n1 and n2 to 2 separate outputs, but that'll require some extra time afterwards to rearrange them. Still worth a shot, I guess, so I'm on it. 🙂
EDIT: 4) Okay, tried 2 separate outputs: got slower by approx 0.2 to 0.3 secs, guess that wont really work.
EDIT2: 3) Tried some pretty straightforward, "brute" conversion to uint4 in/out streams by pretty much doubling kernel instructions:
n1 = (uint)ker_in.x;
n2 = (uint)ker_in.y;
n3 = (uint)ker_in.z;
n4 = (uint)ker_in.w;
n2 = n2^sbox_gpu((n1+ker_key[0]) & (uint)0xFFFFFFFF, ker_s);
n4 = n4^sbox_gpu((n3+ker_key[0]) & (uint)0xFFFFFFFF, ker_s);
n1 = n1^sbox_gpu((n2+ker_key[1]) & (uint)0xFFFFFFFF, ker_s);
n3 = n3^sbox_gpu((n4+ker_key[1]) & (uint)0xFFFFFFFF, ker_s);
[...]
and so on.
And well I get around 4.2 secs for executing this.
Removing s-boxes I get around 2.2 secs.
Even considering I doubled the input data (256 mb now), it's still a slowdown.
Either I need more "graceful" vector conversion, or it simply wont work.
The main problem with proper vectorization are those substitutions.
P.S. I'm running Vista ultimate x64 SP1 if that matters much.
SDK I'm using is of latest (1.3) version.
Thanks again.
Originally posted by: gaurav.garg
1. change ker_in and ker_out into 2D stream of 4096, 4096 (It should not affect your algorithm).
First of all, wavefronts are created from quads(2X2 thhreads). So, if you have 1D stream, you are actually using only half of SIMD cores.
Second reason is some overhead of Brook+ virualization. With CAL you can allocate 1D resource of max size 8192(2^13), but Brook+ allows you 1D streams of size 2^26. It has to generate some extra code for mapping from 1D stream to 2D texture space and vice-versa. Keep in mind that this overhead only appears if your stream is of size > 8192.
Originally posted by: MicahVillmow
Also, even though our hardware is becoming more generalized, it is still a piece of hardware optimized for graphics which is inherently 2D. So by going the 2D stream route and using vector data types then you hit the fast paths that are highly optimized, thereby improving performance.
1. Thers is no way you can pass double pointer to streamRead. It has same behavior as cpu memcpy.
2. When you index into 2D gather array, it has to be C-style indexing.
a
Originally posted by: gaurav.garg
1. Thers is no way you can pass double pointer to streamRead. It has same behavior as cpu memcpy.
2. When you index into 2D gather array, it has to be C-style indexing.
awhere k - row number and j - column number.
As in your 1D code you have written -
x = idx%gx;
y = (int)floor((float)idx/(float)gx);
That made me thought x is column index and y is row index. Is that right?
If it's right, you need to change your code to - Fin1to4
Originally posted by: gaurav.garg
As in your 1D code you have written -
x = idx%gx;
y = (int)floor((float)idx/(float)gx);
That made me thought x is column index and y is row index. Is that right?
If it's right, you need to change your code to - Fin1to4;
Originally posted by: ryta1203 ALSO, how do you pass a dynamically allocated 2D array into a stream since you can't use streamRead with double pointers???
Stream* xGPU[32/8];
float* x = (float*)malloc(sizeof(x[0])*npt);
int coordDim[] = {8192, 8};
Did you try changing other gather array access. Try this-
int x, y, idx;
x = instance().x;
y = instance().y;
idx = x+gx*y;
Fs1to4 = Fin1to4
Fs5to8 = Fin5to8
Fs9 = Fin9
if ((y > bk-1) && (y <= my-bk+1))
{
if (x==0)
{
Fs1to4 = Fin1to4
Fs5to8 = Fin5to8
Fs9.w = Fin9
}
if (x==(mx+1))
{
Fs1to4 = Fin1to4
Fs5to8 = Fin5to8
Fs9.w = Fin9
}
}
Can you post your kernel signature and runtime code for 2D case?
Kernel signature - function signature for you brook kernel giving information about kernel parameters
runtime code - Part of code where you declare streams and call kernel.
I assume in float4 Fs5to8_2< gx, gy>; gx is height and gy is width.
The only other part I have doubt is your gather array accesing. All of your indexing have y(current row index) as second component. Not sure if you wanted to do this.
1. If you take a 2D array in C [row][column] and call StreamRead, is it read in as [column][row]? If so, then does the dimensions of the stream need to be in [column][row] to fit? Currently my column and row are the same size but this might not always be the case.
I give you an example that shows how to do this-
float a<height, width>;
float aPtr[height][width]; // or float* aPtr = new float[height*width];
streamRead(a, aPtr);
Originally posted by: gaurav.garg 1. If you take a 2D array in C [row][column] and call StreamRead, is it read in as [column][row]? If so, then does the dimensions of the stream need to be in [column][row] to fit? Currently my column and row are the same size but this might not always be the case.
I give you an example that shows how to do this-
float a; float aPtr[height][width]; // or float* aPtr = new float[height*width]; streamRead(a, aPtr);
From my experience that is not working in a consistent way when using GPU or CPU backend. Just an example, when defining a stream
unsigned int dimc[] = {height,width};
::brook::Stream<double2> s(2,dimc);
and then invoking a kernel
kernel void test (*some optional input here*, out double2 s<>){
s.x = instance().x;
s.y = instance().y
}
one sees at least with the CPU backend that instance().x is actually running from 0 to height-1, and instance().y from 0 to width-1. With the GPU-backend it is different (didn't get it working yet), but the indexing appears to be borked. To sum it up, with the CPU backend an array defined as
double t_a[height][width];
has to be fitted with a stream
::brook::Stream t_s (2,{width, height});
otherwise it is not working over here.
PS: Why is the forum software appending empty lines to my post? I can't edit it away, it only adds another two mor lines everytime I edit my post :confused;
unsigned int dimc[] = {height,width};
When using C++ style constructor it has to be -
unsigned int dimc[] = {width, height};
Did you check errorLog on your output stream? Something like-
if(Fs9.error())
{
std::cout << Fs9.errorLog();
}
Assign one output with x, y and idx values and see if you can find out the problem.
My guess is brcc is having issue with x and y calculation. e.g. calculate x like this-
x = idx-y*gx;
Try different combinations for x & y calculation.