
niravshah00
Journeyman III

Multithreaded Brook+ algorithm from a nested for loop

Hi,

I am new to Brook+ programming. I read the Brook+ programming guide but could not figure out how to create multiple threads in the kernel to take advantage of the GPU.

The goal is to convert an algorithm with 4 nested for loops into a multithreaded Brook+ program so as to improve performance.

Is there anywhere I can learn how to do this?

 

Thanks,

Nirav


Also, can anyone tell me how to return results from my kernel? The thing is, I don't want to use the output stream, as barely 1 or 2 threads would produce a result, so I don't want to filter the whole output stream in the host code, since the number of threads would possibly scale (at least that is what I am trying to achieve).

Here is my kernel:

kernel void threadABC(int startRange, out int a<>)
{
    int X, Y, Z;
    int A, B, C;
    int gcdAB, gcdAC, gcdBC;
    A = instance().x + startRange;
    B = instance().y + startRange;
    C = instance().z + startRange;
    gcdAB = findGcd(A, B);
    gcdAC = findGcd(A, C);
    gcdBC = findGcd(B, C);
    if (gcdAB == 1 && gcdAC == 1 && gcdBC == 1) {
        //threadXY(instance().x+1000,instance().y+1000,instance()+1000.z,a);
        for (X = 3; X < 10; X++) {
            for (Y = 3; Y < 10; Y++) {
                float sum = pow((float)A, (float)X) + pow((float)B, (float)Y);
                float Z = (log((float)sum) / log((float)C));
                float epsillon = 10E-4f;
                if (isWithinRange(Z, epsillon)) {
                }
            }
        }
    }
}

And here is my host code:

int main(int argc, char ** argv)
{
    // int i,j;
    unsigned int dim[] = {2, 1, 1};
    brook::Stream<int> aStream(2, dim);
    //int a<10,10,10>;
    threadABC.domainOffset(uint4(0, 0, 0, 0));
    threadABC.domainSize(uint4(2*2, 3*1, 2*1));
    threadABC(1000, aStream);
    return 0;
}


I am very close to closing this.


Domain of execution (calling domainOffset and domainSize) is not supported for 3D streams as of now. If you check error() or errorLog() on your stream, you should get an error saying this feature is unsupported.


Well, when I am executing the project I don't get any error as such, only "indexof called on bogus address" on the command prompt.

Is there any other way I can get what I want?

Also, can you tell me how I can return results from these threads?
It is possible that none of these threads, or only a few of them, will produce an answer.

I don't want to filter a huge array when I know only a few elements will actually contain a solution.


I have a question for niravshah00. Why do you use Brook+? It's unsupported, it's slow, and the idea of using streams for parallel programming simply failed. Except for some simple cases, it's much harder to write efficient code in Brook+ than in CUDA or OpenCL (both are almost the same).

If you don't have hardware for OpenCL, you can use CAL++ (it's quite similar to OpenCL and works on all cards supported by CAL).


Well, when I am executing the project I don't get any error as such, only "indexof called on bogus address" on the command prompt.

Is there any other way I can get what I want?

You will not see these errors on the command line. You need to check the error state of your stream, something like:

if(outputStream.error())
{
    std::cout << outputStream.errorLog();
}


Originally posted by: hazeman
I have a question for niravshah00. Why do you use Brook+? It's unsupported, it's slow, and the idea of using streams for parallel programming simply failed. Except for some simple cases, it's much harder to write efficient code in Brook+ than in CUDA or OpenCL (both are almost the same).

If you don't have hardware for OpenCL, you can use CAL++ (it's quite similar to OpenCL and works on all cards supported by CAL).

Hi,

Well, I don't have the hardware on my laptop, but in the lab I have an AMD FireStream 9170.
The reason I started using Brook+ is that I started this project in May 2009, when there was no support for OpenCL, and now I am stuck with Brook+ because I want to finish this by May 2010 in order to graduate by August 2010.
Also, OpenCL does not support the FireStream 9170!
I know my question might make it look as if I don't know anything about programming, but there are very limited (in fact, no) resources that could help me learn Brook+.
Secondly, the equation I am solving is very trivial, so I thought it would be simpler to do with Brook+.

I tried reading the CAL tutorials in the SDK, but they all looked like Latin to me.
I am open to suggestions and guidance.


Originally posted by: gaurav.garg
Well, when I am executing the project I don't get any error as such, only "indexof called on bogus address" on the command prompt.

Is there any other way I can get what I want?

You will not see these errors on the command line. You need to check the error state of your stream, something like:

if(outputStream.error())
{
    std::cout << outputStream.errorLog();
}

Gaurav,

There is no other way for me to create more threads.
So can I argue that the best way to solve my problem is to use a 2D array with a domain size, where each thread in turn runs sequential loops over C, x and y? (If you remember, I have 5 variables: A, B, C with a large range and x, y with a smaller range.)
As of now that seems to be the only solution to me.


The samples aren't very helpful


Then how do we go about it?


Hi Gaurav,

So do you think I should switch from Brook+ to OpenCL or CAL++?
Since you know my requirements, do you think I can accomplish what I want in Brook+, or should I switch?
I would like to finish this as soon as possible; your help would mean a lot.


Sorry for the delay in answering. I was busy with something else and was not checking my mail.

I am not sure I understand your algorithm very well. It would be good if you could post your host algorithm.

You need to understand that you have to adapt your algorithm to the GPU architecture and its limitations.

I would suggest you first try a basic Brook+ implementation and then go for optimizations.

IIUC (if I understand correctly), you are doing something like this:

for a 1000:10000
    for b 1000:10000
        for c 1000:10000
            for x 3:10
                for y 3:10
                    for z 3:10

First, you can try to write a kernel that encapsulates the last 3 loops (for 'x', 'y', and 'z'). You can create a 2D stream for the implicit loop on 'b' and 'c' (if there are size limitations, you can do the processing in tiles). And you can keep the loop on 'a' on the host side.


Hi,

Thanks for your reply.
I can send you the code I have written in Java.
So far your understanding has been correct. My equation is A^x + B^y = C^z.

I am solving for z, so there are basically 5 variables. Since the range has to be flexible, I want to utilize the GPU as much as I can.

In my lab I have a machine with four FireStream 9170 cards (and dual quad-core processors).

Secondly, I need to figure out a way to send back results, i.e. all 6 variables, only when z is within 10^-8 of an integer. I don't want to scan the entire stream on the host, since only a few of the threads would return results.

Let me know if you would like to see my Java (serial) code.

Thanks very much


Any suggestions on how I can return my results from the kernel code to the host code?


Here is the Java code for my algorithm:

import java.io.File;
import java.io.PrintStream;



public class FindPossibleCounterExamples
{

    /**
     * @param args
     */
    public static void main(String[] args)
    {
        FindPossibleCounterExamples finder = new FindPossibleCounterExamples();
        try
        {
            pOStream = new PrintStream(file);
        }
        catch(Exception e)
        {
            System.out.println(e.getMessage());
        }
        finder.findSuitableC();
    }
    private float BASE_MIN = 1000;
    private float BASE_MAX = 1006;
    private int POW_MAX = 10;
    private int POW_MIN = 3;
    private static final File file = new File("BealsPossibleCounterExamples.txt");
    private static PrintStream pOStream=null;

    private void findSuitableC()
    {

        for(float iA=BASE_MIN; iA<BASE_MAX; iA++)
        {
            for(float iB=BASE_MIN; iB<BASE_MAX; iB++)
            {
                if(iB>iA && gcd(iA,iB)>1.0)
                    continue;
                for(float iC =BASE_MIN; iC < BASE_MAX; iC++)
                {
                    // Beal's conjecture: if A^X + B^Y = C^Z then A, B, C have a common prime factor.
                    // If the gcd of each pair is one, they don't share a common prime factor.
                    if(gcd(iB,iC)==1 && gcd(iC,iA)==1 && gcd(iA,iB)==1)
                    {
                        // For all C's that don't have a common factor with A and B,
                        // run through values of X, Y and find a value for Z.
                        findZ(iA,iB,iC);
                    }

                }
            }
        }
        pOStream.flush();
        pOStream.close();
        //oStream.close();

    }

    private void findZ(float A, float B, float C)
    {
        for(int X = POW_MIN; X<POW_MAX; X++)
        {
            for(int Y = POW_MIN; Y<POW_MAX; Y++)
            {
                double sum =  Math.pow((double)A, (double)X)+Math.pow((double)B, (double)Y);
                double Z = (Math.log((double)sum)/Math.log((double)C));
                double epsillon = 10E-4f;
                if(isWithinRange(Z,epsillon))
                {   
                    String toPrint = ""+A+"^"+X+" + "+B+"^"+Y+" = "+C+"^"+Z+"\n";
                    pOStream.append(toPrint);
                    System.out.print(toPrint);
                }
            }
        }
    }
    private double gcd(double x, double y)
    {
        if (y==0) return x;
        return gcd(y,x%y);
    }

    private boolean isWithinRange(double z, double epsillon)
    {
        // Accept z if it lies within epsillon of the nearest integer, on either side;
        // checking only ceil(z) - z would miss values that land just above an integer.
        double nearest = Math.rint(z);
        return Math.abs(z - nearest) <= epsillon;
    }

}
