Hi ,
Don't get me wrong here .
I am not asking anyone to do my project. I just need someone who can be my guide in the project so that I can learn and complete my project.
Considering that there is virtually no help for brook+ apart from this forum I see no other way to finish my project within the deadline .
I need some one to guide me with my project.
This might sound a little wierd but trust me I have put in a lot of effort trying to go through the examples(basically i am a slow learner).
I don't know how to put this in a better way .
Thanks
If you are having trouble with Stream SDK document. I guess these links can help you.
http://www.nada.kth.se/~tomaso/Stream2008/M1.pdf
Change M1.pdf to M2.pdf, M3.pdf etc. to get all the modules.
Hi niravshah00, I think it would be good to explain a bit more about your project. The kind of work you are doing, the mathematical algorithms, the current CPU execution times and your expected results... any information can help.
Note that it may be possible that you could be having difficulties not because of your skills, but because Brook+ programming model could not be the better approach. Stream programming has some restrictions, so depending on the algorithm it may not be easy or the final performance may not very good. That's why I ask you about the algorithms you're implementing.
A good start is to implement an OpenMP version of the program under the assumption that you are going to have thousands of threads. If you can't dispatch work to every one of them you'll have to rethink the parallelization strategy, on the other hand if you use a lot of interthread communication or atomic/critical sections you may need a more flexible language like OpenCL/CUDA.
A bit about my project .
I have to solve the equation A^x + B^y = C^z.
So i am sloving for 'z' like z = log(A^x + B^y)/log(C) .
Now the value ranges for A,B,C would be flexible and might run for 1000  10,000 or even more and x and y for smaller range like 3  10.
So my current approach was to start a kernel with a 3D stream so as to get all possbile combination for A,B,C and then each thread running the for loops for x and y.
My idea was to create as many threads as possible so as to take advantage of the processing power of GPU.May be this approach is wrong.
But currently i am stuck as the strean size is limited to 8192*8192.
Secondly each thread will return the corressponding values of A,B,C,x,y,z only is z is with some specified range. So I am not sure how do i do that from kernel since i dont want to have a outputstream having 6 numbers and then read the stream on host code and then filter the whole stream . Possible only very few like 34 thread would find a solution. I dont want to filter the stream of 10,000 or 20,000 .
Few suggestion i got from the forum is to use domainSize but its doesn work for 3D stream.
I am not sure if by creating huge number of threads would improve my performance.
Another option(also from forum) is I might have to work in Tiles like solve for A,B,C form 1000 2000, then 2000 3000 and so on .
I have the code in Java and also in brook+(currently working ) if u need i will attach the code.
Thanks
Hi niravshah00, I'm not sure to understand well what you want to do.
The arithmetic intensity of that equation is quite small, however looks like you want to obtain all the solutions having 5 unrestricted variables, so the number of combinations is huge and the problem is no longer trivial.
If you were using less variables or you can lock the value of some variables to a certain number, I think the best approach would be to use numerical analysis instead of using brute force.
From your equations, I think you are trying to obtain a program to solve Diophantine equations, however in other post you said that z was not an integer but a floating point value and that confused me a bit.
Returning whether there is a solution for a certain (A, B, C) is easy, however returning a list of solutions does not seem that flexible/efficient in Brook+. If you only need a boolean value to know whether there is a solution, you could assign not one but a range of computations to each GPU thread in the output stream, thus eliminating the size limit restriction. Each thread may even return the first solution it found, if any.
Hi ,
The equation i am solving is beals conjecture.
About Z being float value , I am sloving the equation for Z and then only if Z in within the range of 10^12 (which is as close to interger ) i would return that combination as possible solution.
Secondly i did not understand ur suggestion for returning the result i.e 6 variables.
Also any suggestion for the approach i should use .
I am attaching the code with this post
//Kernel Code #include<stdio.h> kernel int findGcd(int u,int v) { int gcd = 1; int r ; int num1=u; int num2 =v; while (1) { if (num2 == 0) { gcd = num1; break; } else { r = num1 % num2; num1 = num2; num2 = r; } } return gcd; } kernel int isWithinRange(float z, float epsillon) { float fractional = frac(z); //float floor = Math.floor((double)z); if(fractional<=epsillon ) return 1; else return 0; } kernel void threadABC(int startRange,out int a<>) { int X,Y,Z; int A,B,C; int gcdAB,gcdAC,gcdBC; A = instance().x+startRange; B = instance().y+startRange; C = instance().z+startRange; gcdAB = findGcd(A,B); gcdAC = findGcd(A,C); gcdBC = findGcd(B,C); if(gcdAB==1 && gcdAC==1 && gcdBC==1){ //threadXY(instance().x+1000,instance().y+1000,instance()+1000.z,a); for( X = 3; X < 10; X++) { for( Y = 3; Y < 10; Y++) { // will have to use modulo since the values might go out of range float sum = pow((float )A, (float )X)+pow((float )B, (float )Y); float Z = (log((float )sum)/log((float)C)); float epsillon = 10E4f; if(isWithinRange(Z,epsillon)){ // here the possible solution should be stored and returned to host code } } } } } //host code #include "brookgenfiles\beals.h" #include "conio.h" int main(int argc, char ** argv) { // just testing the if code works for A,B,C for 100 unsigned int dim[] = {100,100,100}; brook::Stream<int> aStream(2,di threadABC(1000,aStream); getch(); return 0; }
Ok, now I understand a bit more about the problem, looks like Beal's conjecture is a well documented problem and there are several examples on the net.
I hadn't much time to look at your code, but I have a question: how are you going to deal with arbitrary precission arithmetic? The maximum representable float without losing precision is 16777216, and some Brook+ intrinsic functions don't work with double data type. Did you have any solution in mind?
Well for the precision problem i was thinking of using the Mod N where N would be the Largest prime number for float ,
Like (A^X mod N + B^Y mod N )Mod N = C^Z Mod N. (not sure if i get this correct)
I hope this is correct way to do since i have to use log which wont work on float values only.
I have got suggestions of moving to OpenCL ,do you think moving to OpenCL would be helpful tomy project .I have to take a call soon.
Well, I think either OpenCL or CUDA will be more flexible, however looks like the problem is still unproved and there could be many difficult parts in the implementation. Maybe some other more specialized people can give you a more accurate opinion.
well Cuda implementation is already some one elses project (by th way Does CUDA support AMD Firestream cards).
What abt the solution for precision i suggested is it ok?
Also any suggestion for returning the result?
I really appreciate you guys helping me out here.
I know I can me annoying at times asking for small silly things.
I am still ready to pay and my email address is niravshah99@gmail.com if it inconvenient to contact me on the form openly .
i really need to finsih this
I think so!
Originally posted by: niravshah00 Hi ,
The equation i am solving is beals conjecture.
About Z being float value , I am sloving the equation for Z and then only if Z in within the range of 10^12 (which is as close to interger ) i would return that combination as possible solution.
Secondly i did not understand ur suggestion for returning the result i.e 6 variables.
Also any suggestion for the approach i should use .
I am attaching the code with this post
Huafeihua116 i did not understand what you mean by "i think so"
Sorry but am i mising anything.I need some suggestion so that i can work on my way for this.
Help anyone?
Originally posted by: niravshah00 Help anyone?
I see some problem in your code. In host code, aStream has a rank 2 but you are using instance().z in your kernel.
Please do following to complete your sample
1. Implemet reference C code
2. Write corresponding kernel code
3. Run your application on CPU by setting evironment variable BRT_RUNTIME=cpu and compare results with reference C implementation
4. If you are getting expected result, now try to run on GPU by setting environment variable BRT_RUNTIME=gpu
5. If you are facing any problem, past complete code here. Make sure code contains reference c code, kernel code and host code.
1. I have already implemented reference code in Java
2. Have implemented the kernel using the same reference (slightly modified for Brook+)
3. Currently running on cpu only dont have the hardware to run it.(we get the hardware with ready by monday machine with dual quadcore procesesor and 2 AMD firestream 9170 )
4.Cannot check the result as i need to figure out a way to return the result to my host code (need help in that)
5. Attaching the reference java code as well as implemeted kernel code )
See the comment for the areas where i need help.
********************** Java Reference code********************** import java.io.File; import java.io.PrintStream; public class FindPossibleCounterExamples { /** * @param args */ public static void main(String[] args) { FindPossibleCounterExamples finder = new FindPossibleCounterExamples(); try { pOStream = new PrintStream(file); } catch(Exception e) { System.out.println(e.getMessage()); } finder.findSuitableC(); } private float BASE_MIN = 1000; private float BASE_MAX = 1006; private int POW_MAX = 10; private int POW_MIN = 3; private static final File file = new File("BealsPossibleCounterExamples.txt"); private static PrintStream pOStream=null; private void findSuitableC() { for(float iA=BASE_MIN; iA<BASE_MAX; iA++) { for(float iB=BASE_MIN; iB<BASE_MAX; iB++) { if(iB>iA && gcd(iA,iB)>1.0) continue; for(float iC =BASE_MIN; iC < BASE_MAX; iC++) { // Beal says if A^X+B^Y = C^Z then A,B,C have a common prime factor. // if the gcd is one for each, it means they dont have a common prime factor. if(gcd(iB,iC)==1 && gcd(iC,iA)==1 && gcd(iA,iB)==1) { // for all C's that dont have a common factor with A and B, // run through values of X,Y and find a value for Z. findZ(iA,iB,iC); } } } } pOStream.flush(); pOStream.close(); //oStream.close(); } private void findZ(float A, float B, float C) { for(int X = POW_MIN; X<POW_MAX; X++) { for(int Y = POW_MIN; Y<POW_MAX; Y++) { double sum = Math.pow((double)A, (double)X)+Math.pow((double)B, (double)Y); double Z = (Math.log((double)sum)/Math.log((double)C)); double epsillon = 10E4f; if(isWithinRange(Z,epsillon)) { String toPrint = ""+A+"^"+X+" + "+B+"^"+Y+" = "+C+"^"+Z+"\n"; pOStream.append(toPrint); System.out.print(toPrint); } } } } private double gcd(double x, double y) { if (y==0) return x; return gcd(y,x%y); } private boolean isWithinRange(double z, double epsillon) { double ceil = Math.ceil((double)z); //float floor = Math.floor((double)z); if((ceil  z)<=epsillon ) return true; else return false; } } ************************************************************************************ ****************************Kernel Code *************************************** #include<stdio.h> /* Equation to solve is A^X + B^Y = C^Z */ kernel int findGcd(int u,int v) { int gcd = 1; int r ; int num1=u; int num2 =v; while (1) { if (num2 == 0) { gcd = num1; break; } else { r = num1 % num2; num1 = num2; num2 = r; } } return gcd; } //Function to keep the values in the range of float kernel float modulusPower(float number,int exponent) { //biggest prime number less than 2^12 //N is taken as less than 2^12 even though float can store 2^24 as max integer //this is because if N is greater it gives wrong result for the mod (don't know why) float N = 3079.0f; float base = number; int counter = exponent; float result =1.0f; while (counter > 0) { if (counter & 1) { result = fmod((result * base),N); } counter = counter >>1; base = fmod((base * base) ,N); } return result; } kernel void threadABC(int startRange,out int a<>) { int X,Y,Z; int A,B,C; int gcdAB,gcdAC,gcdBC; float N = 3079.0f; //using the index of the output stream as the values for A,B,C A = instance().x+startRange; B = instance().y+startRange; C = instance().z+startRange; gcdAB = findGcd(A,B); gcdAC = findGcd(A,C); gcdBC = findGcd(B,C); if(gcdAB==1 && gcdAC==1 && gcdBC==1){ for( X = 3; X < 10; X++) { for( Y = 3; Y < 10; Y++) { for( Z = 3; Z < 10; Z++) { float sum = modulusPower((float)A,X)+modulusPower((float)B, Y); float cpowerZ = modulusPower((float)C,Z); sum = fmod(sum,N); if(cpowerZ == sum){ // here the possible solution should be stored and returned to host code //have to figure out the way to return the values of A,B,C,X,Y,Z to host } } } } } } **************************************************************************************** ********************************* host code ********************************************** #include "brookgenfiles\beals.h" #include "conio.h" int main(int argc, char ** argv) { int i,j,k; //currently working on small numbers unsigned int dim[] = {10,10,10}; brook::Stream<int> aStream(3,dim);//rank changed as you pointed out. //need to figure out a solution for the restriction on stream size 8192x8192. //The stream size corresspond to the values of A,B,C in the equation. //They would be a very large range.(like 1000  10000) //Might have to work in tiles as suggested by friends on the same forum. //How should I group the tiles (so as to use the gpu to optimum)?????? threadABC(1000,aStream); //display the result getch(); return 0; } ***************************************************************************
where else do i have to look for help
Come on people i need som replies some rela quick ones
Originally posted by: niravshah00 1. I have already implemented reference code in Java
2. Have implemented the kernel using the same reference (slightly modified for Brook+)
3. Currently running on cpu only dont have the hardware to run it.(we get the hardware with ready by monday machine with dual quadcore procesesor and 2 AMD firestream 9170 )
4.Cannot check the result as i need to figure out a way to return the result to my host code (need help in that)
5. Attaching the reference java code as well as implemeted kernel code ) See the comment for the areas where i need help.
Sorry for late reply.
1. Use kernel having 6 output streams to pass values of A, B, C, X, Y, Z to host. see below
kernel void threadABC(int startRange,out int a<>, out int b<>,
out int c<>, out int x<>, out int y<>, out int z<> )
2. I have few questions
a. why A, B, C calculated from instance().x, instance().y and instance().z respectively
b. what values you are going assign if(gcdAB==1 && gcdAC==1 && gcdBC==1)
c. why X, Y and Z are varied from 3 to 10 in for loops of threadABC function
d. how many Values of A, B, C, X, Y and Z from one instance of kernel
e. Please paste C code of your reference implementaion.
Originally posted by: genaganna Originally posted by: niravshah00 1. I have already implemented reference code in Java
2. Have implemented the kernel using the same reference (slightly modified for Brook+)
3. Currently running on cpu only dont have the hardware to run it.(we get the hardware with ready by monday machine with dual quadcore procesesor and 2 AMD firestream 9170 )
4.Cannot check the result as i need to figure out a way to return the result to my host code (need help in that)
5. Attaching the reference java code as well as implemeted kernel code ) See the comment for the areas where i need help.
Sorry for late reply.
1. Use kernel having 6 output streams to pass values of A, B, C, X, Y, Z to host. see below
kernel void threadABC(int startRange,out int a<>, out int b<>,
out int c<>, out int x<>, out int y<>, out int z<> )
2. I have few questions
a. why A, B, C calculated from instance().x, instance().y and instance().z respectively
b. what values you are going assign if(gcdAB==1 && gcdAC==1 && gcdBC==1)
c. why X, Y and Z are varied from 3 to 10 in for loops of threadABC function
d. how many Values of A, B, C, X, Y and Z you want to send to host from one instance of kernel. If you use streams as output, you have send only one values of A, B, C, X, Y and Z values to host.
e. Please paste C code of your reference implementaion.
Originally posted by: genaganna Originally posted by: niravshah00 1. I have already implemented reference code in Java
2. Have implemented the kernel using the same reference (slightly modified for Brook+)
3. Currently running on cpu only dont have the hardware to run it.(we get the hardware with ready by monday machine with dual quadcore procesesor and 2 AMD firestream 9170 )
4.Cannot check the result as i need to figure out a way to return the result to my host code (need help in that)
5. Attaching the reference java code as well as implemeted kernel code ) See the comment for the areas where i need help.
Sorry for late reply.
1. Use kernel having 6 output streams to pass values of A, B, C, X, Y, Z to host. see below
kernel void threadABC(int startRange,out int a<>, out int b<>,
out int c<>, out int x<>, out int y<>, out int z<> )
REPLY 
My idea was if i could write the 3 integer in the same output stream then from the index of 3d array i can find the values of A,B,C and the int3 values would be for the values of x,y,z
but it does not allow me to write in the output streami tried something like this a[AstartRange][BstartRange][CstartRange] = z (considering i need only z values) later i can change the stream to be of type int3 to write values of X,Y,Z in it if this works. this gave me error.

2. I have few questions
a. why A, B, C calculated from instance().x, instance().y and instance().z respectively
REPLY I have used the output stream which is 3 dimension because the index of this stream will server as
the possible values of A,B,C since 3d stream will have all possible combinatination for A,B,Clike A=0,B=0,C=1
A=0,B=0,C=2
A=0,B=0,C=3 so on and i will just add the start range to this index to get the values of A,B,CEasiet way to use the index of the 3d array to serve for vlaues of A,B,C
b. what values you are going assign if(gcdAB==1 && gcdAC==1 && gcdBC==1)
REPLY 
The nested loops for X,Y,Z gets executed ony if A,B,C are all co primes
c. why X, Y and Z are varied from 3 to 10 in for loops of threadABC function
REPLY just for trial purpose later would pass the range for these to the kernel as a parameter , the range for A,B,C and X,Y,Z are different X,Y,Z will have smaller range.
d. how many Values of A, B, C, X, Y and Z you want to send to host from one instance of kernel. If you use streams as output, you have send only one values of A, B, C, X, Y and Z values to host.
only those values of which has a possible solution which id determined from the condition
if(cpowerZ == sum)
e. Please paste C code of your reference implementaion.
I have attached the java serial code in the previous reply already. I have java code as my reference.
The comments in the code i have attached are the doubts,question , concern that I have . Hope you have read those and not considered then as program comments .
I want to create more thread so as to take full advantage of the processing power of the device.
Please let me know if u need to understand the project from the start . I am ready to do that.
1. Use kernel having 6 output streams to pass values of A, B, C, X, Y, Z to host. see below
kernel void threadABC(int startRange,out int a<>, out int b<>,
out int c<>, out int x<>, out int y<>, out int z<> )
REPLY  My idea was if i could write the 3 integer in the same output stream then from the index of 3d array i can find the values of A,B,C and the int3 values would be for the values of x,y,z but it does not allow me to write in the output stream
i tried something like this a[AstartRange][BstartRange][CstartRange] = z (considering i need only z values) later i can change the stream to be of type int3 to write values of X,Y,Z in it if this works. this gave me error.

I am sure it should be possible.
If a is stream then you should write as follows
a = z (considering you need only Z values)
if a is int3 stream then you shoud write as follows
a.x = x;
a.y = y;
a.z = z;
how many Values of A, B, C, X, Y and Z you want to send to host from one instance of kernel. If you use streams as output, you have send only one values of A, B, C, X, Y and Z values to host.
only those values of which has a possible solution which id determined from the condition if(cpowerZ == sum)
In this case, you need to use one more output(say int s<> to know whether if(cpowerZ == sum) is true or not.
At host you can make decision whether output values are valued based on the Value output s.
Note : You need to change your algorithm such that each instance of kernel should return only one value of A, B, C, X, Y and Z.
e. Please paste C code of your reference implementaion.
I have attached the java serial code in the previous reply already. I have java code as my reference.
It takes more time to me to read java code.
The comments in the code i have attached are the doubts,question , concern that I have . Hope you have read those and not considered then as program comments .
I want to create more thread so as to take full advantage of the processing power of the device.
Please let me know if u need to understand the project from the start . I am ready to do that.
Please give pseudo code or c implementation or reference of algorithm.
Please explain algorithm in detail.
You say a = z would work but remember my a is a 3d output stream and z is just a variable.
So by doing a = z where in the 3d array would it write the values .
Also there would be a problem when i read it in the host code since only few of the values would be there and even then i would have to scan the whole 3d array.
Right now i m using only 3d array of size 10,10,10 later i would be running the kernel with 8192,90,90 stream size (since there is a limitation on the stream size i cannot go beyond this ) I wished i could figure a solution for this as well where i can create more threads to so that i dont have to wait till the kernel finishes and send the next lot of data to be operated.
Also i will send u the c  reference code very soon in 1 days time.
**********************************************************
Well let em explain you the project .
The equation is A^X + B^Y = C^Z for X,Y ,Z > 2 then A,B,C have a common factor .
So we are seraching for counter examples which would that A,B, C are co primes and hence the initial check for gcdAB,gcdAC,gcdBC.
So we are implementing this on GPU so as to parallelize it and can run for larger range of values.
Now the way threads are created in brook+ is through streams so i used 3d stream so i could directly get the values for 3 variables A,B,C and then running 3 nested loops for X,Y,Z (only solution i could think of)
So currently stuck with stream size of 8192*8192 that the max stream size allowed by Brook+ and since i am using 3d stream that reduces it to 8192*90*90.
i got suggestion like using domainsize but that doesn work with 3d streams (not sure)
plus the problem of sending back the results .
I hope I have made my self clear.
Thank you very much i really appreciate your help.
Originally posted by: niravshah00 You say a = z would work but remember my a is a 3d output stream and z is just a variable. So by doing a = z where in the 3d array would it write the values .
This way Brook+ works. a = z is valid statement irrespctive whether a is 1D stream, 2D stream or 3D stream.
Also i will send u the c  reference code very soon in 1 days time. **********************************************************
I am waiting for this.
So we are implementing this on GPU so as to parallelize it and can run for larger range of values. Now the way threads are created in brook+ is through streams so i used 3d stream so i could directly get the values for 3 variables A,B,C and then running 3 nested loops for X,Y,Z (only solution i could think of)
One thing i am not clear here.
How many values for X, Y, Z need to be transferd to host from 3 nested loop for a given A, B and C?
Please change your algorithm such that each kernel instance calculates only one value of A, B, C, X, Y and Z.
This way Brook+ works. a = z is valid statement irrespctive whether a is 1D stream, 2D stream or 3D stream.
Reply
But then the value of z would be written at which location in the 3D array ,
Is there anyway where in I just make a new array and only pass the array that has the solution I am a bit reluctant to pass the whole 3d array because that would mean that I have to filter the whole array on the host code.
I want to pass an array with only as much size as many there are solution.what i mean is if say for 1000 values of A,B,C there are only 10 solution then i want to pass an array with size 10 which will have the vlaues of the varaibles rather returning the whole 3D array and then scan the whole 3d array to see where those 100 values are within in the array.
One thing i am not clear here.
How many values for X, Y, Z need to be transferd to host from 3 nested loop for a given A, B and C?
Reply
From the nested loop i would pass the values that satisfy my equation I would not know in advance how many values.
Please change your algorithm such that each kernel instance calculates only one value of A, B, C, X, Y and Z.
Reply
But than how do I parallelize things what the use if i write a kernel what would calculate only for one value of A, B, C, X, Y and Z.
The whole idea is to make the kernel run for mutiple vlaues of A, B, C, X, Y and Z at the same time.
I dont think i understand your point here
Here is the C reference code :
// BealsSerial.cpp : Defines the entry point for the console application.
//
#include "stdafx.h"
#include "conio.h"
int gcd(int x, int y)
{
if (y==0) return x;
return gcd(y,x%y);
}
//modular exponentiation to keep A^X in the range
int modulusPower(int base, int exponent, int modulus) {
int result = 1;
while (exponent > 0) {
if (exponent & 1) {
result = (result * base) % modulus;
}
exponent >>= 1;
base = (base * base) % modulus;
}
return result;
}
void findZ(int A, int B, int C,int N)
{
int sum,X,Y,Z,POW_MIN,POW_MAX,cpowerZ;
POW_MIN=3;
POW_MAX=10;
for( X = POW_MIN; X<POW_MAX; X++)
{
for(Y = POW_MIN; Y<POW_MAX; Y++)
{
for(Z = POW_MIN; Z<POW_MAX; Z++)
{
sum = modulusPower(A,X,N)+modulusPower(B,Y,N);
cpowerZ = modulusPower(C,Z,N);
sum = sum % N;
if(cpowerZ == sum)
{
printf("%d^%d + %d^%d = %d^%d \n",A,X,B,Y,C,Z);
}
}
}
}
}
int _tmain(int argc, _TCHAR* argv[])
{
int iA,iB,iC,BASE_MIN,BASE_MAX,N;
BASE_MIN = 1000;
BASE_MAX = 1010;
N = 4093;
//loop for A,B,C
for(iA=BASE_MIN; iA<BASE_MAX; iA++)
{
for(iB=BASE_MIN; iB<BASE_MAX; iB++)
{
for( iC =BASE_MIN; iC < BASE_MAX; iC++)
{
if(gcd(iB,iC)==1 && gcd(iC,iA)==1 && gcd(iA,iB)==1)
{
//function which loops over the values of X,Y,Z
findZ(iA,iB,iC,N);
}
}
}
}
getch();
}
***********************************************
Also i implemented the solution of returning values from kernel to host by doing
a = Z;
Now the problem here is that it will overwrite the values of since for a particular value of A,B,C there might be many different combination of X,Y,Z
So they all will be written to the same location.
Examle
1001^5 + 1007^4 = 1009^9 (suppose)
1001^4 + 1007^3 = 1009^8 (suppose again)
Now if we use a =Z this both solution will be written to exact same location in a since the values of A,B,C are the same.
Also i would want a more efficient solution
Here is my current implementation
Kernel Code :
#include<stdio.h>
/*
Equation to solve is A^X + B^Y = C^Z
*/
kernel int findGcd(int u,int v)
{
int gcd = 1;
int r ;
int num1=u;
int num2 =v;
while (1)
{
if (num2 == 0)
{
gcd = num1;
break;
}
else
{
r = num1 % num2;
num1 = num2;
num2 = r;
}
}
return gcd;
}
//Function to keep the values in the range of float
kernel float modulusPower(float number,int exponent)
{
//biggest prime number less than 2^12
//N is taken as less than 2^12 even though float can store 2^24 as max integer
//this is because if N is greater it gives wrong result for the mod (don't know why)
float N = 4093.0f;
float base = number;
int counter = exponent;
float result =1.0f;
while (counter > 0) {
if (counter & 1) {
result = fmod((result * base),N);
}
counter = counter >>1;
base = fmod((base * base) ,N);
}
return result;
}
kernel void threadABC(int startRange,out int a<>
{
int X,Y,Z;
int A,B,C;
int gcdAB,gcdAC,gcdBC;
float N = 4093.0f;
//using the index of the output stream as the values for A,B,C
A = instance().x+startRange;
B = instance().y+startRange;
C = instance().z+startRange;
// intialising a to 0 so that when we filter the reuslts we can know that 0 means that
//location does not have a reuslt.
a=0;
gcdAB = findGcd(A,B);
gcdAC = findGcd(A,C);
gcdBC = findGcd(B,C);
if(gcdAB==1 && gcdAC==1 && gcdBC==1){
for( X = 3; X < 10; X++)
{
for( Y = 3; Y < 10; Y++)
{
for( Z = 3; Z < 10; Z++)
{
float sum = modulusPower((float)A,X)+modulusPower((float)B, Y);
float cpowerZ = modulusPower((float)C,Z);
sum = fmod(sum,N);
if(cpowerZ == sum){
// here the possible solution should be stored and returned to host code
//have to figure out the way to return the values of A,B,C,X,Y,Z to host
a= Z;
}
}
}
}
}
}
**********************************************************
Host Code:
#include "brookgenfiles\beals.h"
#include "conio.h"
#include "brook\stream.h"
using namespace brook;
int main(int argc, char ** argv)
{
int i,j,k;
int solution[10][10][10];
unsigned int dim[] = {10,10,10};
Stream<int> aStream(3,dim);//rank changed as you pointed out.
//need to figure out a solution for the restriction on stream size 8192x8192.
//The stream size corresspond to the values of A,B,C in the equation.
//They would be a very large range.(like 1000  10000)
//Might have to work in tiles as suggested by friends on the same forum.
//How should I group the tiles (so as to use the gpu to optimum)??????
threadABC(1000,aStream);
//display the result
streamWrite(aStream,solution);
for(i=0;i<10;i++)
for(j=0;j<10;j++)
for(k=0;k<10;k++)
{
//check for non zero values since the stream is intialized to zero.
if(solution
printf("a =%d,b =%d,c =%d,z =%d\n" ,i+1000,j+1000,k+1000,solution
}
getch();
return 0;
}
I am expecting some help here. Specially in returning the results
Thanks.
Niravshah00,
Now i understand your algorithm. Your algorithm is so complex to implement in Brook+.
Please remember following things to port your algorithm.
1. You can write atmost 32 output values per kernel if output type is stream.(there is one exception: actually kernel can have more than 8 outputs, Presently you assume kernel can have atmost 8 output of any type(float, float2, float3 or float4))
ex: kernel void simple(float4 a, out float4 o0<>, out float4 o1<>, out float4 o2<>, out float4 o3<>, out float4 o4<>, out float4 o5<>, out float4 o6<>, out float4 o0<>
{
o0 = a;
o1 = a;
o2 = a;
o3 = a;
o4 = a;
o5 = a;
o6 = a;
o7 = a;
}
// in the about kernel you are sending 32 floats(8 outputs * 4 floats) to host.
I will implement this algorithm as follows
Steps
1. Find iA, iB and iC from BASE_MIN and BASE_MAX using one kernel
2. Gather vaild values iA, iB and iC on host to pass to next stage
3. Find A, B, C, X, Y and Z from iA, iB, iC, N, POW_MIN and POW_MAX using another kernel
1. Find iA, iB and iC from BASE_MIN and BASE_MAX using one kernel
// status will tell us whether iA, iB and iC have vaild values or invalid values
kernel void calculateiABC(..., out int iA<>, out int iB<>, out int iC<>, out int status<>
{
//Here get ia, ib, ic from BASE_MAX and BASE_MIN
if(gcd(ib, ic) == 1 && gcd(ic, ia) == 1 && gcd(ia, ib) == 1)
{
iA = ia;
iB = ib;
iC = ic;
status = 1;
}
else
status = 0;
}
Hint : if you understand the relations ship b/w size of outputs(iA, iB and iC) and input constants(BASE_MIN and BASE_MAX), it is easy to implement this kernel(calculateiABC)
2. Gather vaild values iA, iB and iC on host to pass to next stage
I hope you understand this step
3. Find A, B, C, X, Y and Z from iA, iB, iC, N, POW_MIN and POW_MAX using another kernel
// status will tell us whether A, B, C, X, Y and Z have valid values or not
kernel void calculateABCXYZ(..., out int A<>, out int B<>, out int C<>, out int X<>, out int Y<>, out int Z<>, out int status<>
{
//Here Get a, b, c, x, y, z and N
sum = modulusPower(a,x,N)+modulusPower(b,y,N);
cpowerZ = modulusPower(c,z,N);
sum = sum % N;
if(cpowerZ == sum)
{
A = a;
B = b;
C = c;
X = x;
Y = y;
Z = z;
status = 1;
}
else
status = 0;
}
Hint : if you understand the relationship b/w size of outputs(A, B, C, X, Y and Z) and input sizes(iA, iB, iC) + input constant( N, POW_MIN and POW_MAX), it is easy to implement this kernel(calculateABCXYZ)
Final note : Above solution is not a optimal solution.
Well ,
I dont think i understand step 3 and also they way u implemented it how do i get all the combination of A,B,C from the first step.
Infact i m not sure what do u mean by first kernel and second kernel.
I still dont understand how does your way of implementing this algorithm would parallelize it.
Look My idea was to create a 3d stream to give be all possible combinatin on the 3 variables the same effect of 3 nested for loops. I dont see that happening in your implementation.
How did u return the result , here the result is a pair A,B,C,X,Y,Z.
I dont see any threads being created.
Would it be advisible to switch to OpenCL considering i have to be done by next month as 20th June ?
Originally posted by: niravshah00 Well ,
I dont think i understand step 3 and also they way u implemented it how do i get all the combination of A,B,C from the first step.
Infact i m not sure what do u mean by first kernel and second kernel.
I still dont understand how does your way of implementing this algorithm would parallelize it.
Look My idea was to create a 3d stream to give be all possible combinatin on the 3 variables the same effect of 3 nested for loops. I dont see that happening in your implementation.
How did u return the result , here the result is a pair A,B,C,X,Y,Z.
I dont see any threads being created.
Would it be advisible to switch to OpenCL considering i have to be done by next month as 20th June ?
You can go for OpenCL. Before going to OpenCL first see your device is listed in OpenCL supported devices.
But to be on the safer side i would need to make brook+ working.
I just need a way with my current implementation to return results
something like making a array with only the reuslt and where each entry in array would be just a set of 6 integers.
If i get that to work i think i am safe and bargain that i have a working code if not the most optimum one.
Secondly would you think with OpenCL it would be a lot better.I am taking it for granted that u will help with OpenCL as well.
Well even for that i would have to understand they way u suggested implenting this algorithm.
Just tell me how can i return the result from my current kernel
Please friends i am running out of time guys
