cancel
Showing results for 
Search instead for 
Did you mean: 

Archives Discussions

WTrei
Journeyman III

How to compile large GPU Programs

In my program - a factoring tool for large integers - I have to multiply two 500 bit unsigned (emulated by 16 x 32-bit unsigned integers).

Becauce brook doesn't support local arrays and it is impossible to return more then a 128 bit value of a inlined kernel function, I had to write everythin manually  - unrolling every single loop! This results in a .br-file with about 500.000 lines of code!

The problem is: Using "brcc -P cal" results in one single .h file of about 380 megabyte - my gcc is not able to compile this ( virtual memory exhausted ) - I am using a 3 ghz Intel Core2Duo, 4gb of system memory and a Radeon 4850 on Ubuntu 9.04 (32 bit - but ggc always uses only 2,4 gbyte before breaking - so missing memory doesn't seem to be a problem).

Is there a way to split the .h file so I can compile?

0 Likes
10 Replies
gaurav_garg
Adept I

How to compile large GPU Programs

Another way to reduce the size is to disable address translation (use -r flag to compile .br file). But, you should use this flag only if you are not using any 1D stream with width > 2^13 or using a 3D stream.

 

0 Likes
karx11erx
Journeyman III

How to compile large GPU Programs

Afaik there is no GPU in the world that could process such a huge (shader) program anyway.

Is it possible to split your program into several subsequent tasks, each of which can be executed in a separate kernel program? That might help you to solve your problem.

0 Likes
WTrei
Journeyman III

How to compile large GPU Programs

Well, the program itself is not sooooo large.

In detail: I have 10 modular multiplikations of large integers. Each needs 3 loops of the form

for (int i = 0; i < 16, i++) {

     for (int j = 0; j < 16, j++) {

 

Within this loops are only some if's for comparison and 4 integer multiplications .... all in all somehow 25 lines of code ....

So if I compile this in a normal cpu program the size never exeeds a few kb of data, but in this case - brook doesn't have local array support I have to unroll every loop - resulting in more than 20.000 lines of code for just one modular multiplication (compare: same cpu code has below 100 lines!) - that more than 10 times and program size blows up

All in all this code should not be a big problem for my GPU, but the missing local array feature makes stream nearly unusable for my needs

 

Ps: The -r param helped a lot - now size of the .h file is below 190 megabyte ... but still to large for my gcc compiler.

 

0 Likes
ryta1203
Journeyman III

How to compile large GPU Programs

I think you need to rethink your problem into a stream model!?

0 Likes
WTrei
Journeyman III

How to compile large GPU Programs

Well, what exactly do you mean?

Except of two global constants (the number to factor and one precomputed number) all of my inputs are normal streams - I am generating one 512 bit unsigned by using four uint4 - streams ... only output has to be done by scather, because only 8 streams can be used in full. But that doesn't matter - the process needs several hours on cpu, so I think wirting to scatter once in 7 hours my be ok

But every Kernel writes to his own memory and reads from its own input - so I think it's not far away of a normal stream model - I just have to emulate integers of 512 bit and more because architecture is not providing them - everything else is like a normal GPU program ^^

0 Likes
ryta1203
Journeyman III

How to compile large GPU Programs

Can you post the non-unrolled kernel?

Sorry, I meant the non-unrolled kernel.

0 Likes
WTrei
Journeyman III

How to compile large GPU Programs

The Code behaves like this:

kernel factor ( uint4 input10<>, ...,   uint4 input14<>, uint4 input20<>, ...,   uint4 input24<>, uint4 n[4], uint primeslist[], out uint4 return[][]) {

uint A[16];

uint B[16];

uint C[16];

A[0] = input10.w;

.......                                        // copy all inputs into working variables

for (i = 0; i < pl; i++ ) {               //pl stores the length of my primes list - this loop has not to be unrolled
        curprime = primeslist;

        ..........

        for (j = 0; j < 16; j++) {                                            // this two loops are ten times within the code .... and have to be unrolled !

                 tmp1 = A >> 16;                

                for (k= 0; k < 16; k++) {

                             tmp2 = B >> 16;

                             tmpRes.w = A * B;

                             tmpRes.x = tmp1 * tmp2;

                             tmpRes.y = tmp1 * B;

                             tmpRes.z = tmp2 * A;

                             // From here some not very interesting calkulations ... adding integers and proof if there are overflows ... all together i think 20 lines of simple code (adding, ifs ... )

                }

       }

 

      .......

}

 

In the end of my calkulations I am writing my results into the scatter - because I expect calkulation to need some hours I don't think this will affect my performance much.

}

 

 

All together this is a relative normal program - but has to be executed a few hundred times - thats why I wanted it to be done on gpu - the many independend cores and the arithmetik performance.

But unroling the two loops rapidly blows up the code - I just started to rewrite my program to work mainly on CPU so only the mass arithmetik is done on gpu - but thats much less effective than if it would be possible to run all the code on gpu.

0 Likes
ryta1203
Journeyman III

How to compile large GPU Programs

So you have 3 for loops for EACH element in return?

Why do you need to copy the inputs into working variables?

0 Likes
WTrei
Journeyman III

How to compile large GPU Programs

They don't have to be copied.... but the code is much more simple if I am doing this because the first loop allways works with intermed. results ... and when its runnig the first time with my params. The 3 loops realy are needed - because of the algorithm working on my input.

In detail: I have got an  elliptic curve and a point on it (x,y) - kooridnates which satisfies an equation like y^2 = x^3 + ay + b .... this point have a mathematical group structure - doesn't matter how this works in detail. now for every prime p and exponent q so that p^q < B1 < p^(p+1) for an calculation limit B1 I calculate (p^q)*P ... afterwards taking next prime p' and calculate (p'^q')*(p^q)P and so on ...

The outer loop is just the loop over the primes.

I thought this would work fine on gpu, because for the whole algorithm I have to do so with thousand of curves - so I thought it would be possible for every kernel to work on his own curve with own points - every single curve needs some hours on cpu, so it would be great if a gpu can do 160 curves in slightly larger time.

While writing the kernel I had a look on memory-use of my kernels ... ervery one handels his operations within 1kbyte, so I expect my variables to fit into the cache of the stream processors - its all written to be very computation-expensive without using much memory. Thats why I think if array where possible it should work.

0 Likes