cancel
Showing results for 
Search instead for 
Did you mean: 

Archives Discussions

roger512
Adept II

multi recursive function

Hello,

I know recursivity in opencl is a thorny question since we dont have a call stack, still it is possible to emulate it with a private stack.

I already know how to implement a recursive algorithm that would look like :

rec_func(){

     rec_func();

}

But concerning a recursive algorithm like that, i dont know how to do it.

rec_func(){

     rec_func();     //call 1

     //some random code

     rec_func(); //call 2

}

the problem is after a call of rec_func(),rec_func() is called again (recursive call 1) and it is computed, then the program need to get back to the instructions after call 1 in order to call rec_func() properly (call 2).

I dont know if it is achievable to emulate that kind of behaviour in opencl. Maybe a goto could be usefull here, to get back at the right position in the code after the stack is poped, but there is no goto unfortunatly.

Maybe i'm not posing the problem in a correct way. If anyone got an idea about this ?

0 Likes
1 Solution

Hi,

In OpenCL (and also in the AMD_IL) there is no such thing as goto/call/return. There are only loops because VLIW hardware doesn't supports any kind of goto. It supports only if blocks and loop blocks.

You made it well in that while loop you posted, emulating recursive stuff, but I'm affraid that's all that you can do.

However on the new GCN there are several ways to alter the instruction pointer:

S_BRANCH, S_CBRANCH_xxx – conditional/unconditional branches

S_SETPC, S_SWAPPC, S_GETPC : directly manipulate IP

And some more, you can check in the Southern Islands ISA manual. Let's hope these will be in the reach of OpenCL in the future.

But this jumping is done with the scalar ALU, so you have 64 workitems in a wavefront, and only 1 'tree branch' needs to recurse, then all remaining 63 workitems will be idle. The same way as with while loop.

I think the best you can do is that you maintain an internal work queue in the LDS after N stages of recursion, and then process that queue with 100% workitem utilization. N depends on the actual job: for example if statistically the first recursion is called only for 2% of the workitems, then maybe that is more effective if you start collecting the work of the second recursion into a queue and process them later. This will use more LDS bandwidth (depending on how much data needs to be stored per queued task) but let the ALUs utilized more.

View solution in original post

0 Likes
4 Replies
himanshu_gautam
Grandmaster

Can you tell us how you handled "tail recursion" first?

i.e

rec_func(){

      rec_func();

}

0 Likes

I'm sorry i can't really show you my own code but the following is what i had in mind when i wrote

rec_func(){

      rec_func();

}

this is not my own code :

              

                    int traversalStack[STACK_SIZE + 3];

               bool searchingLeaf = true;

            while (nodeAddr >= 0 && nodeAddr != EntrypointSentinel)

            {

                // Fetch AABBs of the two child nodes.

                float4 n0xy = FETCH_TEXTURE(nodesA, nodeAddr, float4);      // (c0.lo.x, c0.hi.x, c0.lo.y, c0.hi.y)

                float4 n1xy = FETCH_TEXTURE(nodesB, nodeAddr, float4);      // (c1.lo.x, c1.hi.x, c1.lo.y, c1.hi.y)

                float4 nz   = FETCH_TEXTURE(nodesC, nodeAddr, float4);      // (c0.lo.z, c0.hi.z, c1.lo.z, c1.hi.z)

                float4 cnodes=FETCH_TEXTURE(nodesD, nodeAddr, float4);

                // Intersect the ray against the child nodes.

                float c0lox = n0xy.x * aux->idirx - oodx;

                float c0hix = n0xy.y * aux->idirx - oodx;

                float c0loy = n0xy.z * aux->idiry - oody;

                float c0hiy = n0xy.w * aux->idiry - oody;

                float c0loz = nz.x   * aux->idirz - oodz;

                float c0hiz = nz.y   * aux->idirz - oodz;

                float c1loz = nz.z   * aux->idirz - oodz;

                float c1hiz = nz.w   * aux->idirz - oodz;

                float c0min = max4(fminf(c0lox, c0hix), fminf(c0loy, c0hiy), fminf(c0loz, c0hiz), aux->tmin);

                float c0max = min4(fmaxf(c0lox, c0hix), fmaxf(c0loy, c0hiy), fmaxf(c0loz, c0hiz), hitT);

                float c1lox = n1xy.x * aux->idirx - oodx;

                float c1hix = n1xy.y * aux->idirx - oodx;

                float c1loy = n1xy.z * aux->idiry - oody;

                float c1hiy = n1xy.w * aux->idiry - oody;

                float c1min = max4(fminf(c1lox, c1hix), fminf(c1loy, c1hiy), fminf(c1loz, c1hiz), aux->tmin);

                float c1max = min4(fmaxf(c1lox, c1hix), fmaxf(c1loy, c1hiy), fmaxf(c1loz, c1hiz), hitT);

                // Decide where to go next.

                // Differs from "while-while" because this just happened to produce better code here.

                bool traverseChild0 = (c0max >= c0min);

                bool traverseChild1 = (c1max >= c1min);

                nodeAddr           = __float_as_int(cnodes.x);      // stored as int

                int nodeAddrChild1 = __float_as_int(cnodes.y);      // stored as int

                if (!traverseChild1) { nodeAddrChild1 = EntrypointSentinel; }

                if (!traverseChild0) { nodeAddr = nodeAddrChild1; nodeAddrChild1 = EntrypointSentinel; }

                // Neither child was intersected => pop.

                if (nodeAddr == EntrypointSentinel)

                {

                    nodeAddr = traversalStack[stackPtr];

                    --stackPtr;

                }

                // Both children were intersected => push the farther one.

                else if (nodeAddrChild1 != EntrypointSentinel)

                {

                    if (c1min < c0min)

                        swap(nodeAddr, nodeAddrChild1);

                    ++stackPtr;

                    traversalStack[stackPtr] = nodeAddrChild1;

                }

                // First leaf => postpone and continue traversal.

                if (nodeAddr < 0 && leafAddr >= 0)

                {

                    searchingLeaf = false;

                    leafAddr = nodeAddr;

                    nodeAddr = traversalStack[stackPtr];

                    --stackPtr;

                }

                // All SIMD lanes have found a leaf => process them.

                if(!__any(searchingLeaf))

                    break;

            }

the code above is a tree traversal in case of raycasting, node are visited untill a leaf is reached

0 Likes

Hi,

In OpenCL (and also in the AMD_IL) there is no such thing as goto/call/return. There are only loops because VLIW hardware doesn't supports any kind of goto. It supports only if blocks and loop blocks.

You made it well in that while loop you posted, emulating recursive stuff, but I'm affraid that's all that you can do.

However on the new GCN there are several ways to alter the instruction pointer:

S_BRANCH, S_CBRANCH_xxx – conditional/unconditional branches

S_SETPC, S_SWAPPC, S_GETPC : directly manipulate IP

And some more, you can check in the Southern Islands ISA manual. Let's hope these will be in the reach of OpenCL in the future.

But this jumping is done with the scalar ALU, so you have 64 workitems in a wavefront, and only 1 'tree branch' needs to recurse, then all remaining 63 workitems will be idle. The same way as with while loop.

I think the best you can do is that you maintain an internal work queue in the LDS after N stages of recursion, and then process that queue with 100% workitem utilization. N depends on the actual job: for example if statistically the first recursion is called only for 2% of the workitems, then maybe that is more effective if you start collecting the work of the second recursion into a queue and process them later. This will use more LDS bandwidth (depending on how much data needs to be stored per queued task) but let the ALUs utilized more.

0 Likes

Hi,

realhet a écrit:

I think the best you can do is that you maintain an internal work queue in the LDS after N stages of recursion, and then process that queue with 100% workitem utilization.

That's the trick i' am using to bypass the recursion problem, if it does the job for the moment, it's not the same thing.For exemple I can't call another recursive function depending on the result of a previous call of the same stage.

rec_func(){

     bool more_rec = rec_func();     //call 1

     //some random code

     if(more_rec)

          rec_func(); //call 2

}

realhet a écrit:

But this jumping is done with the scalar ALU, so you have 64 workitems in a wavefront, and only 1 'tree branch' needs to recurse, then all remaining 63 workitems will be idle. The same way as with while loop.

Yes, I understood that. My algorithm is designed to avoid branch divergence as much possible.

thank you for the interesting informations about GCN never heard about it before , I'll try to check Southern Islands ISA manual !

0 Likes