cancel
Showing results for 
Search instead for 
Did you mean: 

Archives Discussions

WTrei
Journeyman III

How to compile large GPU Programs

In my program - a factoring tool for large integers - I have to multiply two 500 bit unsigned (emulated by 16 x 32-bit unsigned integers).

Becauce brook doesn't support local arrays and it is impossible to return more then a 128 bit value of a inlined kernel function, I had to write everythin manually  - unrolling every single loop! This results in a .br-file with about 500.000 lines of code!

The problem is: Using "brcc -P cal" results in one single .h file of about 380 megabyte - my gcc is not able to compile this ( virtual memory exhausted ) - I am using a 3 ghz Intel Core2Duo, 4gb of system memory and a Radeon 4850 on Ubuntu 9.04 (32 bit - but ggc always uses only 2,4 gbyte before breaking - so missing memory doesn't seem to be a problem).

Is there a way to split the .h file so I can compile?

0 Likes
10 Replies
gaurav_garg
Adept I

Another way to reduce the size is to disable address translation (use -r flag to compile .br file). But, you should use this flag only if you are not using any 1D stream with width > 2^13 or using a 3D stream.

 

0 Likes

Afaik there is no GPU in the world that could process such a huge (shader) program anyway.

Is it possible to split your program into several subsequent tasks, each of which can be executed in a separate kernel program? That might help you to solve your problem.

0 Likes

Well, the program itself is not sooooo large.

In detail: I have 10 modular multiplikations of large integers. Each needs 3 loops of the form

for (int i = 0; i < 16, i++) {

     for (int j = 0; j < 16, j++) {

 

Within this loops are only some if's for comparison and 4 integer multiplications .... all in all somehow 25 lines of code ....

So if I compile this in a normal cpu program the size never exeeds a few kb of data, but in this case - brook doesn't have local array support I have to unroll every loop - resulting in more than 20.000 lines of code for just one modular multiplication (compare: same cpu code has below 100 lines!) - that more than 10 times and program size blows up

All in all this code should not be a big problem for my GPU, but the missing local array feature makes stream nearly unusable for my needs

 

Ps: The -r param helped a lot - now size of the .h file is below 190 megabyte ... but still to large for my gcc compiler.

 

0 Likes

I think you need to rethink your problem into a stream model!?

0 Likes

Well, what exactly do you mean?

Except of two global constants (the number to factor and one precomputed number) all of my inputs are normal streams - I am generating one 512 bit unsigned by using four uint4 - streams ... only output has to be done by scather, because only 8 streams can be used in full. But that doesn't matter - the process needs several hours on cpu, so I think wirting to scatter once in 7 hours my be ok

But every Kernel writes to his own memory and reads from its own input - so I think it's not far away of a normal stream model - I just have to emulate integers of 512 bit and more because architecture is not providing them - everything else is like a normal GPU program ^^

0 Likes

Can you post the non-unrolled kernel?

Sorry, I meant the non-unrolled kernel.

0 Likes

The Code behaves like this:

kernel factor ( uint4 input10<>, ...,   uint4 input14<>, uint4 input20<>, ...,   uint4 input24<>, uint4 n[4], uint primeslist[], out uint4 return[][]) {

uint A[16];

uint B[16];

uint C[16];

A[0] = input10.w;

.......                                        // copy all inputs into working variables

for (i = 0; i < pl; i++ ) {               //pl stores the length of my primes list - this loop has not to be unrolled
        curprime = primeslist;

        ..........

        for (j = 0; j < 16; j++) {                                            // this two loops are ten times within the code .... and have to be unrolled !

                 tmp1 = A >> 16;                

                for (k= 0; k < 16; k++) {

                             tmp2 = B >> 16;

                             tmpRes.w = A * B;

                             tmpRes.x = tmp1 * tmp2;

                             tmpRes.y = tmp1 * B;

                             tmpRes.z = tmp2 * A;

                             // From here some not very interesting calkulations ... adding integers and proof if there are overflows ... all together i think 20 lines of simple code (adding, ifs ... )

                }

       }

 

      .......

}

 

In the end of my calkulations I am writing my results into the scatter - because I expect calkulation to need some hours I don't think this will affect my performance much.

}

 

 

All together this is a relative normal program - but has to be executed a few hundred times - thats why I wanted it to be done on gpu - the many independend cores and the arithmetik performance.

But unroling the two loops rapidly blows up the code - I just started to rewrite my program to work mainly on CPU so only the mass arithmetik is done on gpu - but thats much less effective than if it would be possible to run all the code on gpu.

0 Likes

So you have 3 for loops for EACH element in return?

Why do you need to copy the inputs into working variables?

0 Likes

They don't have to be copied.... but the code is much more simple if I am doing this because the first loop allways works with intermed. results ... and when its runnig the first time with my params. The 3 loops realy are needed - because of the algorithm working on my input.

In detail: I have got an  elliptic curve and a point on it (x,y) - kooridnates which satisfies an equation like y^2 = x^3 + ay + b .... this point have a mathematical group structure - doesn't matter how this works in detail. now for every prime p and exponent q so that p^q < B1 < p^(p+1) for an calculation limit B1 I calculate (p^q)*P ... afterwards taking next prime p' and calculate (p'^q')*(p^q)P and so on ...

The outer loop is just the loop over the primes.

I thought this would work fine on gpu, because for the whole algorithm I have to do so with thousand of curves - so I thought it would be possible for every kernel to work on his own curve with own points - every single curve needs some hours on cpu, so it would be great if a gpu can do 160 curves in slightly larger time.

While writing the kernel I had a look on memory-use of my kernels ... ervery one handels his operations within 1kbyte, so I expect my variables to fit into the cache of the stream processors - its all written to be very computation-expensive without using much memory. Thats why I think if array where possible it should work.

0 Likes

Instead of the unrolled loop you're using, you can create a circular buffer, then use a normal loop to run the "head" of the buffer through a call to a function that does the partial product stuff to multiply the two 512-bit numbers.

Rotate the circular buffer "one notch" per loop:

   a0.x=a0.y;a0.y=a0.z ;   
   a0.z=a0.w;a0.w=a1.x ;

etc.

Obviously you need to rotate C and B in step in the inner loop, i. And rotate A in the j loop. There might be some futzing with the alignment of C as you iterate through A - haven't really thought it all through

Then instead of using the literal "16" in your for loops, make it a constant passed into the kernel. This stops the compiler seeing the literal and unrolling your code 16 times

Obviously this adds a few dozen instructions in a pretty bad place, the inner loop. But I think there's a decent chance you will at least get something that compiles.

Alternatively, IL supports indexed registers (called indexed_temp_array). So you can code a partial-product manipulator as a function (really a subroutine). Then code your pair of nested loops. Easier said than done... particularly as:

http://forums.amd.com/forum/me...id=328&threadid=110498

Another advantage of IL is that functions can "return" more than 128-bits to the caller. That's because all registers are global, there's no scope.

IL's pretty fiddly though.

Although thinking about this, I'm sure in the past I've coded a struct that allowed a function to return more than 128 bits - but I can't find the code. Hmm...

Jawed

0 Likes