
How to return results from kernel?
niravshah00 May 28, 2010 3:31 AM (in response to niravshah00)
Is this question so difficult, or is it so stupid?
I don't get it.
Ceq May 28, 2010 2:15 PM (in response to niravshah00)
Hi niravshah00, I would like to help you, but I'm afraid there is no way to directly implement what you want in Brook+. Even in OpenCL your problem looks complicated (that's why it is an unproved theorem).
Computing the values can be easy, but returning them requires some kind of variable length queue or list, which is not possible in Brook+. You can at most return a few solutions per thread (up to eight float4 values). Maybe if you could use RV870 append/consume buffers there would be a way, but I think you should look for another way.
There is something that may be helpful though:
 Create a big 2D stream, for example 4096x4096. Each thread of your kernel will be assigned a unique identifier and an output address.
 Using the thread identifier, assign each thread a subdomain of the original problem (not a single element), with a size equal to the problem domain divided by the output stream size (you can use kernel scalar parameters to configure the offset and size of the execution).
 Instead of storing the solutions in the thread's output, store whether or not each subdomain contains a solution. You now have a stream that records which subdomains have a solution, and you can save or graph it.
 If you want more detail about a subdomain that has solutions according to the output stream, you can use the CPU or launch your GPU kernel again on that subdomain, which will be divided into smaller portions again.
I think that using this workaround you could have a GPU implementation that can be used to analyze large domains in much less time than using the CPU alone.
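The steps above can be sketched on the CPU in C++ (not Brook+). This is only an illustration of the idea under stated assumptions: the domain is one-dimensional, each loop iteration over a cell stands in for one GPU thread, and the names `scanSubdomains`, `refine`, and `cellSize` are hypothetical rather than part of any SDK.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

using Predicate = std::function<bool(std::int64_t)>;

// Elements per cell, rounded up so the cells cover all of [begin, end).
std::int64_t cellSize(std::int64_t begin, std::int64_t end, std::size_t cells) {
    const std::int64_t n = static_cast<std::int64_t>(cells);
    return (end - begin + n - 1) / n;
}

// One flag per subdomain: 1 if any element in that cell is a solution.
// On the GPU, each iteration of the outer loop would be one thread
// writing one element of the output stream.
std::vector<std::uint8_t> scanSubdomains(std::int64_t begin, std::int64_t end,
                                         std::size_t cells,
                                         const Predicate& isSolution) {
    std::vector<std::uint8_t> flags(cells, 0);
    const std::int64_t perCell = cellSize(begin, end, cells);
    for (std::size_t cell = 0; cell < cells; ++cell) {
        const std::int64_t lo = begin + static_cast<std::int64_t>(cell) * perCell;
        const std::int64_t hi = std::min(lo + perCell, end);
        for (std::int64_t v = lo; v < hi; ++v) {
            if (isSolution(v)) { flags[cell] = 1; break; }
        }
    }
    return flags;
}

// Relaunches the scan on every flagged cell until cells hold one element,
// at which point a set flag identifies an individual solution.
std::vector<std::int64_t> refine(std::int64_t begin, std::int64_t end,
                                 std::size_t cells,
                                 const Predicate& isSolution) {
    assert(cells >= 2);  // with a single cell the range would never shrink
    std::vector<std::int64_t> found;
    if (end <= begin) return found;
    const std::vector<std::uint8_t> flags =
        scanSubdomains(begin, end, cells, isSolution);
    const std::int64_t perCell = cellSize(begin, end, cells);
    for (std::size_t cell = 0; cell < cells; ++cell) {
        if (!flags[cell]) continue;
        const std::int64_t lo = begin + static_cast<std::int64_t>(cell) * perCell;
        const std::int64_t hi = std::min(lo + perCell, end);
        if (perCell == 1) { found.push_back(lo); continue; }
        const std::vector<std::int64_t> sub = refine(lo, hi, cells, isSolution);
        found.insert(found.end(), sub.begin(), sub.end());
    }
    return found;
}
```

The output of each pass has a fixed size (one flag per cell), which is what makes it fit a stream model with no variable-length output; the price is that flagged cells are scanned again at every level of refinement.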

Jawed May 28, 2010 11:05 PM (in response to Ceq)
Create 6 output buffers instead of one, each holding a single integer. See attached code.
When you want to get fancy you can pack these outputs, e.g. make one output stream called XYZ and another output stream called ABC, and define both as int3. Then you can do this:
XYZ.x = X ;
XYZ.y = Y ;
XYZ.z = Z ;
ABC.x = A ;
ABC.y = B ;
ABC.z = C ;
The advantage of this style of packing is that it allows you to return more data (should you need to). Brook+ has a limit of 32 integers or floats, packed into 8 output streams. So in this alternative code I have set up 2 output streams, each containing 3 values.
Jawed
kernel void threadABC(int startRange, out int x<>, out int y<>, out int z<>,
                      out int a<>, out int b<>, out int c<>)
{
    int X, Y, Z;
    int A, B, C;
    int gcdAB, gcdAC, gcdBC;
    float N = 4093.0f;

    // Use the index of the output stream as the values for A, B, C.
    A = instance().x + startRange;
    B = instance().y + startRange;
    C = instance().z + startRange;

    // Initialise the outputs to 0 so that, when we filter the results,
    // 0 means that location does not have a result.
    x = y = z = a = b = c = 0;

    gcdAB = findGcd(A, B);
    gcdAC = findGcd(A, C);
    gcdBC = findGcd(B, C);

    if (gcdAB == 1 && gcdAC == 1 && gcdBC == 1) {
        for (X = 3; X < 10; X++) {
            for (Y = 3; Y < 10; Y++) {
                for (Z = 3; Z < 10; Z++) {
                    float sum = modulusPower((float)A, X) + modulusPower((float)B, Y);
                    float cpowerZ = modulusPower((float)C, Z);
                    sum = fmod(sum, N);
                    if (cpowerZ == sum) {
                        // A possible solution: return the values of
                        // A, B, C, X, Y, Z to the host through the outputs.
                        x = X;
                        y = Y;
                        z = Z;
                        a = A;
                        b = B;
                        c = C;
                    }
                }
            }
        }
    }
}

niravshah00 Jun 2, 2010 2:46 AM (in response to Jawed)
Hi Jawed,
Thanks for the reply.
The solution you suggested has a problem.
In my kernel code the stream 'a' is a 3D stream whose indices serve as the values of A, B, C, and for the other three variables I use nested loops.
Basically, each instance of stream 'a' is a thread that loops over the values of X, Y, Z.
So one instance of A, B, C could have many possible solutions.
For example, say for A=1000, B=1001, C=1002
there might be solutions like X=2, Y=3, Z=4
and X=3, Y=4, Z=5.
Both solutions will be written to the same location, so the second overwrites the first.
That is, for the same instance of A, B, C we have 2 solutions but room for only one.
I hope I have explained this clearly.
There has to be a work around for this.
Thanks very much for the reply.
Hopefully you will reply again

Jawed Jun 2, 2010 8:17 PM (in response to niravshah00)
If you're expecting multiple solutions for each element in the domain of execution, then one approach is to launch the kernel multiple times.
On the first launch, the kernel returns the first solution it finds for each element in the domain of execution.
On the second launch the kernel returns the second solution, and so on. Simply pass in a constant to the kernel which tells it how many solutions it should skip before finally returning.
So the first kernel launch would have skip=0, the second launch would have skip=1, etc.
If you want to get clever, then you make the domain of execution index into a stream which contains the explicit values of A, B, C that you want to evaluate (i.e. an input stream: uint3 evaluateABC<>). Use this instead of the instance() based technique in that earlier kernel.
This way the second kernel launch runs on a smaller domain (presuming that some elements in the first kernel's domain had no solution). And if a third launch is required, that will have an even smaller domain. etc. Just keep going until there are no more solutions.
Jawed
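A CPU-side C++ sketch of this skip-count scheme may make it more concrete. It is not Brook+ code: `kernelOne` stands in for a single kernel thread, the inner loop of `collectAll` stands in for one kernel launch, and `isSolution` is a toy test (X = Y = Z and X divides A) used here only so that some elements have several solutions and others none. All of these names are hypothetical.

```cpp
#include <vector>

struct Triple { int a, b, c; };
struct Solution { int a, b, c, x, y, z; bool found; };

// Toy stand-in for the real check (an assumption for illustration only).
bool isSolution(const Triple& t, int x, int y, int z) {
    return x == y && y == z && t.a % x == 0;
}

// One "thread": returns the (skip+1)-th solution for element t, if any.
// The output is fixed-size, as a stream output has to be.
Solution kernelOne(const Triple& t, int skip) {
    int seen = 0;
    for (int x = 3; x < 10; ++x)
        for (int y = 3; y < 10; ++y)
            for (int z = 3; z < 10; ++z)
                if (isSolution(t, x, y, z)) {
                    if (seen == skip) return {t.a, t.b, t.c, x, y, z, true};
                    ++seen;
                }
    return {t.a, t.b, t.c, 0, 0, 0, false};
}

// Host loop: launch with skip = 0, 1, 2, ... and keep only the elements
// that still produced a solution, so each launch has a smaller domain.
std::vector<Solution> collectAll(std::vector<Triple> domain) {
    std::vector<Solution> all;
    for (int skip = 0; !domain.empty(); ++skip) {
        std::vector<Triple> next;
        for (const Triple& t : domain) {  // one simulated kernel launch
            const Solution s = kernelOne(t, skip);
            if (s.found) {
                all.push_back(s);
                next.push_back(t);
            }
        }
        domain.swap(next);
    }
    return all;
}
```

Note that every launch repeats the search from the start for each surviving element, so the scheme trades extra computation for a fixed-size output per thread.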

niravshah00 Jun 3, 2010 12:07 AM (in response to Jawed)
Hi Jawed,
Can you give me an example? I could not find any such examples among the samples in the SDK.
It would be great if I could get this working.
Thanks
Nirav

Jawed Jun 4, 2010 7:05 AM (in response to niravshah00)
My first suggestion is so simple it's trivial, and the second suggestion has no effect on the kernel.

niravshah00 Jun 4, 2010 3:55 PM (in response to Jawed)
By the first solution, do you mean the one with 6 output streams?
I did not understand the second solution with multiple kernels. I will send you my host code as well so that you have a better idea.
host code:
#include "brookgenfiles\beals.h"
#include "conio.h"
#include "brook\stream.h"
using namespace brook;

int main(int argc, char ** argv)
{
    int i, j, k, range;
    int startRange = 1000;
    int endRange = 10000;
    unsigned int dim[] = {10, 10, 10};

    // Tile the A, B, C ranges into blocks of at most 8192 x 90 x 90.
    for (i = 0; i < (endRange - startRange);)
    {
        if ((endRange - startRange - i) < 8192)
            dim[0] = endRange - startRange - i;
        else
            dim[0] = 8192;
        for (j = 0; j < (endRange - startRange);)
        {
            if ((endRange - startRange - j) < 90)
                dim[1] = endRange - startRange - j;
            else
                dim[1] = 90;
            for (k = 0; k < (endRange - startRange);)
            {
                if ((endRange - startRange - k) < 90)
                    dim[2] = endRange - startRange - k;
                else
                    dim[2] = 90;
                Stream<int> aStream(3, dim);
                threadABC(startRange + i, startRange + j, startRange + k, aStream);
                // results from the kernel to be written to a file
                // want to do this in parallel
                k += 90;
            }
            j += 90;
        }
        i += 8192;
    }
    // display the result
    // streamWrite(aStream, solution);
    /*for(i=0;i<10;i++)
        for(j=0;j<10;j++)
            for(k=0;k<10;k++)
            {
                // check for non-zero values since the stream is initialized to zero.
                if(solution[i][j][k]!=0)
                    printf("a =%d,b =%d,c =%d,z =%d\n" ,i+1000,j+1000,k+1000,solution[i][j][k]);
            }*/
    getch();
    return 0;
}
niravshah00 Jun 8, 2010 1:51 AM (in response to niravshah00)
Hi Jawed,
There is one more way to deal with my problem, but I am not sure whether it is possible and, if it is, how to do it.
Instead of creating a thread for each value of A, B, C, if I could somehow create a thread for each combination of A, B, C, X, Y, Z, that would be great, because then I would not have to deal with multiple solutions within a single thread.
Hoping that you will reply.

Ceq Jun 8, 2010 9:25 AM (in response to niravshah00)
Hi niravshah00, I don't think that creating a thread for each variable would solve anything, because you still have to perform a recombination step. If you want to do it anyway, you could do something like this:
kernel void
thread6D(int3 baseABC, int3 baseXYZ, ... )
{
    int2 pos2D = instance( ).xy;

    // Identifiers for ABC: mask first, then shift
    // (>> binds tighter than & in C-style precedence)
    int idA = baseABC.x + (pos2D.x & 0x000f);
    int idB = baseABC.y + ((pos2D.x & 0x00f0) >> 4);
    int idC = baseABC.z + ((pos2D.x & 0x0f00) >> 8);

    // Identifiers for XYZ
    int idX = baseXYZ.x + (pos2D.y & 0x000f);
    int idY = baseXYZ.y + ((pos2D.y & 0x00f0) >> 4);
    int idZ = baseXYZ.z + ((pos2D.y & 0x0f00) >> 8);

    // Now, if you use a 4096x4096 texture you have a small 6D domain
    // of 16A x 16B x 16C x 16X x 16Y x 16Z threads.
    // You can decompose the domain into several subdomains; you just have
    // to launch several executions changing baseABC and baseXYZ...
}
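The same decoding can be checked on the CPU with a small C++ sketch. It assumes a 4096x4096 domain, so that each coordinate carries three 4-bit fields; `Int3`, `Ids6`, `nibble`, and `decode6D` are hypothetical names introduced for this illustration and are not Brook+ types or functions.

```cpp
#include <cassert>

struct Int3 { int x, y, z; };
struct Ids6 { int a, b, c, x, y, z; };

// Extracts 4-bit field number `field` (0 = lowest bits) from a coordinate.
// Shifting first and masking afterwards avoids any precedence pitfalls.
inline int nibble(int value, int field) {
    return (value >> (4 * field)) & 0xF;
}

// Maps one 2D thread position to six identifiers: the x coordinate packs
// the offsets for A, B, C and the y coordinate the offsets for X, Y, Z.
Ids6 decode6D(int posX, int posY, Int3 baseABC, Int3 baseXYZ) {
    assert(posX >= 0 && posX < 4096 && posY >= 0 && posY < 4096);
    return {
        baseABC.x + nibble(posX, 0),
        baseABC.y + nibble(posX, 1),
        baseABC.z + nibble(posX, 2),
        baseXYZ.x + nibble(posY, 0),
        baseXYZ.y + nibble(posY, 1),
        baseXYZ.z + nibble(posY, 2),
    };
}
```

For instance, with bases (1000, 1000, 1000) and (3, 3, 3), the position (0x321, 0xCBA) decodes to A=1001, B=1002, C=1003 and X=13, Y=14, Z=15.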
niravshah00 Jun 8, 2010 9:14 PM (in response to Ceq)
Hi Ceq,
What I was suggesting was a 6D stream instead of the 3D stream I am making right now, so that each thread could do something like
a = 1 ; (a is my 6D stream, and the position gives me the values of the variables)
i.e. if (a[i][j][k][x][y][z] == 1) then i, j, k, x, y, z is the solution.
Then please tell me what the best solution for this problem is. I have run out of options and time.

Ceq Jun 8, 2010 9:47 PM (in response to niravshah00)
I think the dimension of the stream isn't that relevant. As you know, you can store a 2D matrix in memory allocated with a single malloc; all you have to do is the proper index translation.
I think that Brook+ does not support 6D streams, and even using 3D streams may result in a performance penalty, as slow automatic address translation code will be generated. However, by manipulating the index values, you can perfectly well store 6D data in a 2D texture.
Note that as the whole domain is too big, I think you should divide it into small blocks that can be processed independently. That's why you'll probably need a parameter with some kind of offset in your kernel.
I would like to be more helpful, but I'm currently quite busy with work because of a deadline, so for the moment this is all I can do, sorry.
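The index translation mentioned above can be sketched in C++ as a row-major mapping between a 6D index and a linear offset, followed by a split of that offset into 2D texture coordinates. This is a generic illustration rather than Brook+ code; `Index6`, `flatten`, `unflatten`, `Texel`, and `toTexel` are hypothetical names, and the extents in the usage note are arbitrary.

```cpp
#include <array>
#include <cstddef>

using Index6 = std::array<std::size_t, 6>;

// Row-major flattening: the last dimension varies fastest.
std::size_t flatten(const Index6& idx, const Index6& extent) {
    std::size_t linear = 0;
    for (std::size_t d = 0; d < 6; ++d) {
        linear = linear * extent[d] + idx[d];
    }
    return linear;
}

// Inverse of flatten: peel off dimensions starting from the fastest one.
Index6 unflatten(std::size_t linear, const Index6& extent) {
    Index6 idx{};
    for (std::size_t d = 6; d-- > 0;) {
        idx[d] = linear % extent[d];
        linear /= extent[d];
    }
    return idx;
}

// Position of a linear offset inside a 2D texture of the given width.
struct Texel { std::size_t x, y; };

Texel toTexel(std::size_t linear, std::size_t width) {
    return {linear % width, linear / width};
}
```

With extents {16, 16, 16, 7, 7, 7}, the index {1, 2, 3, 4, 5, 6} flattens to 100050, which lands at texel (1746, 24) in a texture 4096 wide. Unlike fixed bit fields, this form also handles extents that are not powers of two; a per-launch block offset can simply be added to each component before use.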

niravshah00 Jun 9, 2010 3:57 AM (in response to Ceq)
Thanks a lot, Ceq.
I really appreciate all your help.
I can totally understand how it feels to be on a deadline.
Thanks once again

niravshah00 Jun 11, 2010 6:37 PM (in response to niravshah00)
Can someone suggest a way to solve this problem?
There has to be some way I can do this. Isn't there global memory that the threads can write to just by acquiring a lock on it?
That would be great.

niravshah00 Jun 12, 2010 2:49 PM (in response to niravshah00)
The only solution I can think of is to move the nested loops over X, Y, Z into the host code, so that each thread returns just one result; then I can put the results in the output stream and not have to worry about multiple results.
I don't know how much this would hurt my performance, but it is the only way I can see to do it.

niravshah00 Jun 22, 2010 11:07 PM (in response to niravshah00)
Can we return a variable-length array from a kernel in OpenCL?

niravshah00 Jun 2, 2010 2:39 AM (in response to Ceq)
Thanks, Ceq.
I don't completely understand your suggestion, but let me spend some time on it and then get back to you.
I really appreciate your help.
I will get back to you in 2 days.

MicahVillmow Jun 23, 2010 12:18 AM (in response to niravshah00)
Variable-length arrays are not supported in OpenCL. This is specified in 6.8.d of the OpenCL spec.
niravshah00 Jun 23, 2010 2:32 AM (in response to MicahVillmow)
What I meant was that I need the kernel to give me an array of results, but I would also like to know how many results are in the array, so that I need not scan the whole array in the host code to get them.
I am thinking of some counter that each thread could access and update when it finds a solution, acquiring a lock on the counter, of course.
If Brook+ has something like this, such as global memory, that would be great.
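The counter idea described here is commonly realised with an atomic increment rather than a lock: each thread atomically bumps a shared counter and uses the old value as its private slot in a fixed-capacity results array. The C++ sketch below illustrates the pattern on the CPU only. Whether Brook+ offers an equivalent is not established in this thread; in OpenCL the analogous operation would be an atomic increment on a counter in global memory. `ResultBuffer` and `findMultiplesOf7` are hypothetical names, the "is a multiple of 7" test is a toy, and the simulated threads run one after another here to keep the example simple.

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <vector>

// Fixed-capacity results array plus a shared counter of reserved slots.
struct ResultBuffer {
    std::vector<int> slots;
    std::atomic<std::size_t> count{0};

    explicit ResultBuffer(std::size_t capacity) : slots(capacity) {}

    // Reserves the next free slot atomically, so two concurrent callers
    // can never receive the same index. Returns false when the buffer is full.
    bool append(int value) {
        const std::size_t i = count.fetch_add(1);
        if (i >= slots.size()) return false;
        slots[i] = value;
        return true;
    }

    // Number of valid results; the counter may overshoot the capacity.
    std::size_t size() const {
        return std::min(count.load(), slots.size());
    }
};

// Each simulated thread w scans v = w, w + workers, w + 2*workers, ...
// and appends the values that pass the toy test.
std::size_t findMultiplesOf7(int n, int workers, ResultBuffer& out) {
    for (int w = 0; w < workers; ++w) {
        for (int v = w; v < n; v += workers) {
            if (v % 7 == 0) out.append(v);
        }
    }
    return out.size();
}
```

After the run, the host reads `size()` and then only that many entries instead of scanning the whole array. The order of the entries depends on which thread reserved which slot, and the capacity still has to be chosen in advance, so this does not give a truly variable-length result.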
