In my program - a factoring tool for large integers - I have to multiply two 500 bit unsigned (emulated by 16 x 32-bit unsigned integers).
Becauce brook doesn't support local arrays and it is impossible to return more then a 128 bit value of a inlined kernel function, I had to write everythin manually - unrolling every single loop! This results in a .br-file with about 500.000 lines of code!
The problem is: Using "brcc -P cal" results in one single .h file of about 380 megabyte - my gcc is not able to compile this ( virtual memory exhausted ) - I am using a 3 ghz Intel Core2Duo, 4gb of system memory and a Radeon 4850 on Ubuntu 9.04 (32 bit - but ggc always uses only 2,4 gbyte before breaking - so missing memory doesn't seem to be a problem).
Is there a way to split the .h file so I can compile?
Another way to reduce the size is to disable address translation (use -r flag to compile .br file). But, you should use this flag only if you are not using any 1D stream with width > 2^13 or using a 3D stream.
Afaik there is no GPU in the world that could process such a huge (shader) program anyway.
Is it possible to split your program into several subsequent tasks, each of which can be executed in a separate kernel program? That might help you to solve your problem.
Well, the program itself is not sooooo large.
In detail: I have 10 modular multiplikations of large integers. Each needs 3 loops of the form
for (int i = 0; i < 16, i++) {
for (int j = 0; j < 16, j++) {
Within this loops are only some if's for comparison and 4 integer multiplications .... all in all somehow 25 lines of code ....
So if I compile this in a normal cpu program the size never exeeds a few kb of data, but in this case - brook doesn't have local array support I have to unroll every loop - resulting in more than 20.000 lines of code for just one modular multiplication (compare: same cpu code has below 100 lines!) - that more than 10 times and program size blows up
All in all this code should not be a big problem for my GPU, but the missing local array feature makes stream nearly unusable for my needs
Ps: The -r param helped a lot - now size of the .h file is below 190 megabyte ... but still to large for my gcc compiler.
I think you need to rethink your problem into a stream model!?
Well, what exactly do you mean?
Except of two global constants (the number to factor and one precomputed number) all of my inputs are normal streams - I am generating one 512 bit unsigned by using four uint4 - streams ... only output has to be done by scather, because only 8 streams can be used in full. But that doesn't matter - the process needs several hours on cpu, so I think wirting to scatter once in 7 hours my be ok
But every Kernel writes to his own memory and reads from its own input - so I think it's not far away of a normal stream model - I just have to emulate integers of 512 bit and more because architecture is not providing them - everything else is like a normal GPU program ^^
Can you post the non-unrolled kernel?
Sorry, I meant the non-unrolled kernel.
The Code behaves like this:
kernel factor ( uint4 input10<>, ..., uint4 input14<>, uint4 input20<>, ..., uint4 input24<>, uint4 n[4], uint primeslist[], out uint4 return[][]) {
uint A[16];
uint B[16];
uint C[16];
A[0] = input10.w;
....... // copy all inputs into working variables
for (i = 0; i < pl; i++ ) { //pl stores the length of my primes list - this loop has not to be unrolled
curprime = primeslist;
..........
for (j = 0; j < 16; j++) { // this two loops are ten times within the code .... and have to be unrolled !
tmp1 = A
for (k= 0; k < 16; k++) {
tmp2 = B
tmpRes.w = A
tmpRes.x = tmp1 * tmp2;
tmpRes.y = tmp1 * B
tmpRes.z = tmp2 * A
// From here some not very interesting calkulations ... adding integers and proof if there are overflows ... all together i think 20 lines of simple code (adding, ifs ... )
}
}
.......
}
In the end of my calkulations I am writing my results into the scatter - because I expect calkulation to need some hours I don't think this will affect my performance much.
}
All together this is a relative normal program - but has to be executed a few hundred times - thats why I wanted it to be done on gpu - the many independend cores and the arithmetik performance.
But unroling the two loops rapidly blows up the code - I just started to rewrite my program to work mainly on CPU so only the mass arithmetik is done on gpu - but thats much less effective than if it would be possible to run all the code on gpu.
So you have 3 for loops for EACH element in return?
Why do you need to copy the inputs into working variables?
They don't have to be copied.... but the code is much more simple if I am doing this because the first loop allways works with intermed. results ... and when its runnig the first time with my params. The 3 loops realy are needed - because of the algorithm working on my input.
In detail: I have got an elliptic curve and a point on it (x,y) - kooridnates which satisfies an equation like y^2 = x^3 + ay + b .... this point have a mathematical group structure - doesn't matter how this works in detail. now for every prime p and exponent q so that p^q < B1 < p^(p+1) for an calculation limit B1 I calculate (p^q)*P ... afterwards taking next prime p' and calculate (p'^q')*(p^q)P and so on ...
The outer loop is just the loop over the primes.
I thought this would work fine on gpu, because for the whole algorithm I have to do so with thousand of curves - so I thought it would be possible for every kernel to work on his own curve with own points - every single curve needs some hours on cpu, so it would be great if a gpu can do 160 curves in slightly larger time.
While writing the kernel I had a look on memory-use of my kernels ... ervery one handels his operations within 1kbyte, so I expect my variables to fit into the cache of the stream processors - its all written to be very computation-expensive without using much memory. Thats why I think if array where possible it should work.
Instead of the unrolled loop you're using, you can create a circular buffer, then use a normal loop to run the "head" of the buffer through a call to a function that does the partial product stuff to multiply the two 512-bit numbers.
Rotate the circular buffer "one notch" per loop:
a0.x=a0.y;a0.y=a0.z ;
a0.z=a0.w;a0.w=a1.x ;
etc.
Obviously you need to rotate C and B in step in the inner loop, i. And rotate A in the j loop. There might be some futzing with the alignment of C as you iterate through A - haven't really thought it all through
Then instead of using the literal "16" in your for loops, make it a constant passed into the kernel. This stops the compiler seeing the literal and unrolling your code 16 times
Obviously this adds a few dozen instructions in a pretty bad place, the inner loop. But I think there's a decent chance you will at least get something that compiles.
Alternatively, IL supports indexed registers (called indexed_temp_array). So you can code a partial-product manipulator as a function (really a subroutine). Then code your pair of nested loops. Easier said than done... particularly as:
http://forums.amd.com/forum/me...id=328&threadid=110498
Another advantage of IL is that functions can "return" more than 128-bits to the caller. That's because all registers are global, there's no scope.
IL's pretty fiddly though.
Although thinking about this, I'm sure in the past I've coded a struct that allowed a function to return more than 128 bits - but I can't find the code. Hmm...
Jawed