
niravshah00
Journeyman III

How to return results from kernel?

I have a kernel running and want to return a set of six integers back to the host code.

I don't know in advance how many sets I will get. It is possible that I won't get any.

Any thoughts on how I can do this? I am attaching the kernel code here:

kernel void threadABC(int startRange,out int a<> )
{
    int X,Y,Z;
    int A,B,C;
    int gcdAB,gcdAC,gcdBC;
    float N = 4093.0f;   
   
    //using the index of the output stream as the values for A,B,C
    A = instance().x+startRange;
    B = instance().y+startRange;
    C = instance().z+startRange;
   
    // initialising a to 0 so that when we filter the results we know that 0 means
    // that location does not have a result.
    a=0;

     gcdAB = findGcd(A,B);
     gcdAC = findGcd(A,C);
     gcdBC = findGcd(B,C);
    if(gcdAB==1 && gcdAC==1 && gcdBC==1){
        for( X = 3; X < 10; X++)
        {
            for( Y = 3; Y < 10; Y++)
            {
                for( Z = 3; Z < 10; Z++)
                {
                    float sum =  modulusPower((float)A,X)+modulusPower((float)B, Y);
                    float cpowerZ = modulusPower((float)C,Z);
                    sum = fmod(sum,N);
                    if(cpowerZ == sum){
                        // here the possible solution should be stored and returned to host code
                        //have to figure out the way to return the values of A,B,C,X,Y,Z to host
                      
                    }
                   
                }
            }
        }

    }
}

0 Likes
19 Replies
niravshah00
Journeyman III

Is this question so difficult or is it so stupid ?
I don't get it

0 Likes

Hi niravshah00, I would like to help you, but I'm afraid that there is no way to directly implement what you want in Brook+. Even in OpenCL your problem looks complicated (that's why it is an unproved theorem).

Computing the values can be easy, but returning them requires some kind of variable length queue or list, which is not possible in Brook+. You can at most return a few solutions per thread (up to eight float4 values). Maybe if you could use RV870 append/consume buffers there would be a way, but I think you should look for another way.

 

There is something that may be helpful though:

- Create a big 2D stream, for example 4096x4096. Each thread of your kernel will be assigned a unique identifier and an output address.

- Using the thread identifier, assign to each thread a sub-domain of the original problem (not a single element), proportional to the problem domain divided by the output stream size (you can use kernel scalar parameters to configure the offset and size of the execution).

- Instead of storing the solutions in the thread's output, you store whether or not there is a solution in each sub-domain. Now you have a stream that records whether a particular sub-domain has a solution, and you can save or graph it.

- If you want more detail about a sub-domain that has solutions according to the output stream, you can use the CPU or launch your GPU kernel again on that sub-domain, which will be divided into smaller portions again.

 

I think that using this workaround you could have a GPU implementation that can be used to analyze large domains in much less time than using the CPU alone.
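To make the idea concrete, here is a minimal CPU-side sketch in plain C++ (not Brook+). It assumes the problem domain has been flattened to a 1D range of element ids, and `ElementTest`, `flagSubdomains` and `refine` are hypothetical names made up for this illustration. Each pass of the outer loop in `flagSubdomains` plays the role of one GPU thread writing a single flag.

```cpp
#include <algorithm>
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical stand-in for "element e of the flattened domain has a solution".
using ElementTest = std::function<bool(std::int64_t)>;

// One simulated launch over [begin, end): the range is split into `cells`
// sub-domains and each "thread" writes 1 if its sub-domain holds a solution.
std::vector<int> flagSubdomains(std::int64_t begin, std::int64_t end, int cells,
                                const ElementTest& hasSolution) {
    std::vector<int> flags(cells, 0);
    const std::int64_t span = (end - begin + cells - 1) / cells;  // ceiling division
    for (int cell = 0; cell < cells; ++cell) {
        const std::int64_t lo = begin + cell * span;
        const std::int64_t hi = std::min(lo + span, end);
        for (std::int64_t e = lo; e < hi; ++e) {
            if (hasSolution(e)) { flags[cell] = 1; break; }
        }
    }
    return flags;
}

// Relaunch on every flagged sub-domain until a sub-domain is small enough
// to be listed element by element.
void refine(std::int64_t begin, std::int64_t end, int cells,
            const ElementTest& hasSolution, std::vector<std::int64_t>& found) {
    if (cells < 2 || end - begin <= cells) {
        for (std::int64_t e = begin; e < end; ++e) {
            if (hasSolution(e)) found.push_back(e);
        }
        return;
    }
    const std::vector<int> flags = flagSubdomains(begin, end, cells, hasSolution);
    const std::int64_t span = (end - begin + cells - 1) / cells;
    for (int cell = 0; cell < cells; ++cell) {
        if (flags[cell] == 0) continue;  // nothing to refine here
        const std::int64_t lo = begin + cell * span;
        const std::int64_t hi = std::min(lo + span, end);
        refine(lo, hi, cells, hasSolution, found);
    }
}
```

In a real run the flag stream would have something like 4096x4096 cells instead of the handful used here, and only the flagged sub-domains would be sent back to the GPU or handed to the CPU.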

0 Likes

Create 6 output buffers instead of one, each holding a single integer. See attached code.

When you want to get fancy you can pack these outputs, e.g. make one output stream called XYZ and another output stream called ABC, and define both as int3. Then you can do this:

XYZ.x = X ;
XYZ.y = Y ;
XYZ.z = Z ;
ABC.x = A ;
ABC.y = B ;
ABC.z = C ;

The advantage of this style of packing is it allows you to return more data (should you need to). Brook+ has a limit of 32 integers or floats, packed into 8 output streams. So in this alternative code I have set up 2 output streams, each containing 3 values.

Jawed

kernel void threadABC(int startRange, out int x<>, out int y<>, out int z<>, out int a<>, out int b<>, out int c<>)
{
    int X,Y,Z;
    int A,B,C;
    int gcdAB,gcdAC,gcdBC;
    float N = 4093.0f;

    //using the index of the output stream as the values for A,B,C
    A = instance().x+startRange;
    B = instance().y+startRange;
    C = instance().z+startRange;

    // initialising the outputs to 0 so that when we filter the results we know that 0 means
    // that location does not have a result.
    x=y=z=a=b=c=0;

    gcdAB = findGcd(A,B);
    gcdAC = findGcd(A,C);
    gcdBC = findGcd(B,C);
    if(gcdAB==1 && gcdAC==1 && gcdBC==1){
        for( X = 3; X < 10; X++)
        {
            for( Y = 3; Y < 10; Y++)
            {
                for( Z = 3; Z < 10; Z++)
                {
                    float sum = modulusPower((float)A,X)+modulusPower((float)B, Y);
                    float cpowerZ = modulusPower((float)C,Z);
                    sum = fmod(sum,N);
                    if(cpowerZ == sum){
                        // store the possible solution so it is returned to the host code
                        x = X ;
                        y = Y ;
                        z = Z ;
                        a = A ;
                        b = B ;
                        c = C ;
                    }
                }
            }
        }
    }
}

0 Likes

Hi  Jawed,

Thanks for the reply.

The solution you suggested has a problem.

In my kernel code the stream 'a' is a 3D stream whose indices serve as the values for A, B, C, and for the other three variables I use nested loops.

Basically each instance of stream 'a' is a thread that loops over the values of X, Y, Z.

So one instance of A, B, C could have many possible solutions.

For example, say for A=1000, B=1001, C=1002

there might be solutions like
X=2,Y=3,Z=4
X=3,Y=4,Z=5

so these solutions will be written to the same location.

For the same instance of A, B, C we have 2 solutions.

I hope I have explained this clearly.

There has to be a workaround for this.

Thanks very much for the reply.

Hopefully you will reply again

 

0 Likes

If you're expecting multiple solutions for each element in the domain of execution then one approach is to launch the kernel multiple times.

On the first launch, the kernel returns the first solution it finds for each element in the domain of execution.

On the second launch the kernel returns the second solution, and so on. Simply pass in a constant to the kernel which tells it how many solutions it should skip before finally returning.

So the first kernel launch would have skip=0, second call would have skip=1 etc.

If you want to get clever, then you can make the domain of execution index into a stream which contains the explicit values of A, B, C that you want to evaluate (i.e. an input stream: uint3 evaluateABC<>). Use this instead of the instance() based technique in that earlier kernel.

This way the second kernel launch runs on a smaller domain (presuming that some elements in the first kernel's domain had no solution). And if a third launch is required, that will have an even smaller domain. etc. Just keep going until there are no more solutions.
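A rough CPU-side sketch of the skip approach, in plain C++ instead of Brook+. The bodies of findGcd and modulusPower are not shown in this thread, so `gcdOf` and `powMod` below are assumed integer stand-ins, and `Triple`, `Solution`, `kernelWithSkip` and `collectAll` are hypothetical names. The vector `work` plays the role of the evaluateABC input stream and shrinks after every launch.

```cpp
#include <utility>
#include <vector>

struct Triple { int a, b, c; };
struct Solution { int a, b, c, x, y, z; bool found; };

// Assumed stand-ins for findGcd and modulusPower (their bodies are not in the thread).
int gcdOf(int m, int n) {
    while (n != 0) { int t = m % n; m = n; n = t; }
    return m;
}
int powMod(int base, int exp, int mod) {
    long long result = 1, b = base % mod;
    for (int i = 0; i < exp; ++i) result = (result * b) % mod;
    return static_cast<int>(result);
}

// One simulated thread: returns solution number `skip` (0-based) for (a, b, c).
Solution kernelWithSkip(int a, int b, int c, int skip) {
    Solution out{a, b, c, 0, 0, 0, false};
    if (gcdOf(a, b) != 1 || gcdOf(a, c) != 1 || gcdOf(b, c) != 1) return out;
    const int N = 4093;
    int seen = 0;
    for (int x = 3; x < 10; ++x)
        for (int y = 3; y < 10; ++y)
            for (int z = 3; z < 10; ++z) {
                if ((powMod(a, x, N) + powMod(b, y, N)) % N != powMod(c, z, N)) continue;
                if (seen++ == skip) {  // the first `skip` matches are ignored
                    out.x = x; out.y = y; out.z = z; out.found = true;
                    return out;
                }
            }
    return out;
}

// Host loop: relaunch with skip = 0, 1, 2, ... and keep only the triples
// that still produced a solution, so each launch runs on a smaller domain.
std::vector<Solution> collectAll(std::vector<Triple> work) {
    std::vector<Solution> all;
    for (int skip = 0; !work.empty(); ++skip) {
        std::vector<Triple> next;
        for (const Triple& t : work) {
            const Solution s = kernelWithSkip(t.a, t.b, t.c, skip);
            if (s.found) { all.push_back(s); next.push_back(t); }
        }
        work = std::move(next);
    }
    return all;
}
```

The cost of this scheme is that every launch repeats the search from the start for each remaining element, so it pays off mainly when most elements have few or no solutions.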

Jawed

0 Likes

Hi Jawed ,

Can you give me an example? I could not find any such examples in the samples in the SDK.

It would be great if I get this working .

 

Thanks

Nirav

0 Likes

My first suggestion is so simple it's trivial and the second suggestion has no effect on the kernel.

0 Likes

Well, by first solution you mean the one with 6 output streams?

I did not understand the second solution with multiple kernels.

I will send you my host code as well so that you have a better idea.

host code:

#include "brookgenfiles\beals.h"
#include "conio.h"
#include "brook\stream.h"
using namespace brook;

int main(int argc, char ** argv)
{
   
    int i,j,k,range;   
    int startRange =1000;
    int endRange = 10000;

    unsigned int dim[] = {10,10,10};
   

   
    // each loop advances its own counter at the end of the body
    for(i=0;i<(endRange - startRange);)
    {
        if((endRange - startRange-i)<8192)
            dim[0] = endRange - startRange-i;
        else
            dim[0] = 8192;
        for(j=0;j<(endRange - startRange);)
        {
            if((endRange - startRange-j)<90)
                dim[1] = endRange - startRange-j;
            else
                dim[1] = 90;
            for(k=0;k<(endRange - startRange);)
            {
                if((endRange - startRange-k)<90)
                    dim[2] = endRange - startRange-k;
                else
                    dim[2] = 90;
                Stream<int>  aStream(3,dim);
                threadABC(startRange+i,startRange+j,startRange+k,aStream);
                // results from the kernel to be written to a file
                // want to do this in parallel
                k+=90;
            }
            j+=90;
        }
        i+=8192;
    }


   
    //display the result
    //streamWrite(aStream,solution);
   
    /*for(i=0;i<10;i++)
        for(j=0;j<10;j++)
            for(k=0;k<10;k++)
            {
                //check for non zero values since the stream is intialized to zero.
                if(solution!=0)
                printf("a =%d,b =%d,c =%d,z =%d\n" ,i+1000,j+1000,k+1000,solution
);
            }*/
    getch();
    return 0;
}

0 Likes

Hi Jawed ,

There is one more way to deal with my problem, but I am not sure whether it is possible and, if it is, how to do it.

Instead of creating a thread for each value of A, B, C, if I could somehow create a thread for each A, B, C, X, Y, Z, that would be great, because then I would not have to deal with multiple solutions within a single thread.

Hoping that you will reply.

 

0 Likes

Hi niravshah00, I don't think that creating a thread for each variable would solve anything, because you still have to perform a recombination step. If you want to do it anyway you could do something like this:

kernel void
thread6D(int3 baseABC, int3 baseXYZ, ... )
{
    int2 pos2D = instance( ).xy;


    // Identifiers for ABC
    // (mask first, then shift: >> binds tighter than &, so the parentheses matter)
    int idA = baseABC.x + (pos2D.x & 0x000f);
    int idB = baseABC.y + ((pos2D.x & 0x00f0) >> 4);
    int idC = baseABC.z + ((pos2D.x & 0x0f00) >> 8);


    // Identifiers for XYZ
    int idX = baseXYZ.x + (pos2D.y & 0x000f);
    int idY = baseXYZ.y + ((pos2D.y & 0x00f0) >> 4);
    int idZ = baseXYZ.z + ((pos2D.y & 0x0f00) >> 8);
   
    // Now, if you use a 4096x4096 texture you have a small 6D domain
    // of 16A x 16B x 16C x 16X x 16Y x 16Z threads
    // You can decompose the domain in several subdomains, you just have
    // to launch several executions changing baseABC and baseXYZ

...   

}

0 Likes

Hi Ceq,

Well, what I was suggesting was to make a 6D stream instead of the 3D stream I am making right now, so that each thread could do something like

   a = 1 ;     (a is my 6D stream, and the position would give me the values of the variables)

i.e. if (a == 1) then i,j,k,x,y,z is the solution.


Then tell me what the best solution for this problem is. I have run out of options and time.

Please tell me.

0 Likes

I think the dimension of the stream isn't that relevant. As you know, you can store a 2D matrix in memory allocated with a single malloc; all you have to do is the proper index translation.

I think that Brook+ does not support 6D streams, and even 3D streams may carry a performance penalty because slow automatic address translation code will be generated. However, by manipulating the index values yourself, you can store 6D data in a 2D texture without problems.
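As a small illustration of that index translation, here is a C++ sketch that packs six coordinates into a 2D position and back. It assumes the same layout as the thread6D kernel earlier in the thread (4 bits per coordinate, A/B/C in the x index and X/Y/Z in the y index of a 4096x4096 stream). `Pos2D`, `encode6D` and `decode6D` are hypothetical helper names.

```cpp
#include <array>

struct Pos2D { int x, y; };

// Pack six coordinates, each in [0, 15], into a position inside a 4096x4096
// stream: (a, b, c) go into x and (x, y, z) go into y, one nibble each.
Pos2D encode6D(const std::array<int, 6>& v) {
    Pos2D p;
    p.x = v[0] | (v[1] << 4) | (v[2] << 8);
    p.y = v[3] | (v[4] << 4) | (v[5] << 8);
    return p;
}

// Inverse translation: recover the six coordinates from the 2D position.
std::array<int, 6> decode6D(Pos2D p) {
    std::array<int, 6> v = {
        p.x & 0xF, (p.x >> 4) & 0xF, (p.x >> 8) & 0xF,
        p.y & 0xF, (p.y >> 4) & 0xF, (p.y >> 8) & 0xF
    };
    return v;
}
```

On the host, decoding the position of every non-zero element of the output stream gives back the six offsets, which are then added to whatever base values were passed to that launch.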

Note that since the whole domain is too big, I think you should divide it into small blocks that can be processed independently. That's probably why you'll need a parameter with some kind of offset in your kernel.

I would like to be more helpful, but I'm currently quite busy with work because of a deadline, so for the moment this is all I can do, sorry.

0 Likes

Thanks a lot Ceq.

I really appreciate all your help.

I can totally understand how it feels to be on a deadline.

Thanks once again

0 Likes

Can someone give me a way to solve this problem?
I mean, there has to be some way I can do this.

Isn't there global memory that the threads can write to just by acquiring a lock on the memory?

I mean, that would be great.

0 Likes

The only solution I can think of is to have the nested loops for X, Y, Z in the host code, so that each thread returns just one result. Then I can put the results in the output stream and not have to worry about multiple results.

I don't know how much it would hurt my performance, but that's the only way I can see to do it.
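Roughly, that host-side loop could look like the following C++ sketch, with the kernel simulated on the CPU. `Check` is a hypothetical callable standing in for the per-thread test of one (A, B, C, X, Y, Z) combination, and `launchFixedExponents` and `sweepExponents` are made-up names. Each simulated launch writes one flag per (A, B, C) element, so nothing is overwritten.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

struct Triple { int a, b, c; };
struct Hit { int a, b, c, x, y, z; };

// Hypothetical per-thread test for a single (A, B, C, X, Y, Z) combination.
using Check = std::function<bool(int, int, int, int, int, int)>;

// One simulated launch: the exponents are fixed by the host, and every
// "thread" writes a single 0/1 flag for its own (A, B, C).
std::vector<int> launchFixedExponents(const std::vector<Triple>& domain,
                                      int x, int y, int z, const Check& check) {
    std::vector<int> flags(domain.size(), 0);
    for (std::size_t i = 0; i < domain.size(); ++i) {
        flags[i] = check(domain[i].a, domain[i].b, domain[i].c, x, y, z) ? 1 : 0;
    }
    return flags;
}

// Host code: the nested X, Y, Z loops live here, one launch per combination.
std::vector<Hit> sweepExponents(const std::vector<Triple>& domain, const Check& check) {
    std::vector<Hit> hits;
    for (int x = 3; x < 10; ++x)
        for (int y = 3; y < 10; ++y)
            for (int z = 3; z < 10; ++z) {
                const std::vector<int> flags = launchFixedExponents(domain, x, y, z, check);
                for (std::size_t i = 0; i < flags.size(); ++i) {
                    if (flags[i] != 0) {
                        hits.push_back(Hit{domain[i].a, domain[i].b, domain[i].c, x, y, z});
                    }
                }
            }
    return hits;
}
```

With X, Y, Z each running from 3 to 9 this means 343 launches per block of A, B, C, and the host scans the whole flag stream after each one, which is where the performance concern comes from.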

0 Likes

Can we return a variable-length array from a kernel in OpenCL?

0 Likes

Thanks Ceq

Well i don't completely understand your suggestion but let me spend some time on what you have suggested and then get back to you.

I really appreciate your help.

I will get back to you in 2 days.

 

 

0 Likes

Variable length arrays are not supported in OpenCL. This is specified in section 6.8.d of the OpenCL spec.
0 Likes

Well, what I meant was that I need the kernel to give me an array with the results, but I would also want to know how many results there are in the array,

so that I need not scan the whole array in the host code to get the results.

Some counter which each thread could access and update when it finds a solution, and of course acquire a lock on the counter as well.

If you have this in Brook+ that would be great. Something like global memory.
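To show what I mean, here is a small sketch in plain C++ on the CPU. It is not Brook+ code and I am not claiming Brook+ offers this. `Result` and `ResultBuffer` are hypothetical names, and `std::atomic` stands in for whatever shared counter the GPU would provide. Each thread reserves a slot by atomically incrementing the counter, so no explicit lock is needed, and the host reads the counter to know how many results are valid.

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <vector>

struct Result { int a, b, c, x, y, z; };

// Fixed-capacity result array plus a shared counter.
class ResultBuffer {
public:
    explicit ResultBuffer(std::size_t capacity) : slots_(capacity), count_(0) {}

    // Reserve the next free slot with an atomic increment, then write into it.
    // Returns false when the buffer is already full (the result is dropped).
    bool append(const Result& r) {
        const std::size_t slot = count_.fetch_add(1);
        if (slot >= slots_.size()) return false;
        slots_[slot] = r;
        return true;
    }

    // Number of valid results, clamped to the capacity.
    std::size_t size() const { return std::min(count_.load(), slots_.size()); }

    const Result& at(std::size_t i) const { return slots_[i]; }

private:
    std::vector<Result> slots_;
    std::atomic<std::size_t> count_;
};
```

The host would then copy back only the first size() entries instead of scanning the whole array. The order of the entries depends on which thread reaches the counter first, so it is not deterministic.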

0 Likes