Hi all,
I have used an array inside a kernel to perform an iterative traversal of a tree. The program executes correctly with BRT_RUNTIME=cpu, but when executing with CAL the results are erroneous. I declare it like this (yeah, that's obvious):
float3 stackArray[64];
I would greatly appreciate any help or recommendations. I may put more detail if necessary.
Thanks for the quick reply, ryta.
I need some way to emulate a stack. Even if this sounds like 50 float3-s and a massive bunch of IF-s. I see you have more in-depth knowledge of this Brook+ language. Is there a way to have something like this happen:
float3 elem1;float3 elem2; float3 elem3;...; float3 elem50;
float stackTop = 0.f;
float adressOfFirstElem = getAddressOfElem1();
...
float3 currStackElem = adressOfFirstElem + stackTop;
---------
I want to believe there is a way of getting to the required element, even if it's a hacky way.
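To make the "50 float3-s and a massive bunch of IF-s" idea concrete, here is a minimal sketch in plain C (not Brook+, and not tested on any GPU back end). It assumes a depth of only four, and the names ScalarStack, scalar_push and scalar_pop are made up for illustration. The element is selected with a switch on the stack top instead of runtime indexing:

```c
typedef struct { float x, y, z; } float3;

/* Hypothetical fixed-depth stack built from named variables only, no array. */
typedef struct {
    float3 e0, e1, e2, e3;  /* would be elem1..elem50 in the post above */
    int top;                /* number of elements currently stored */
} ScalarStack;

/* Store v in the slot selected by top; returns 0 if the stack is full. */
static int scalar_push(ScalarStack *s, float3 v) {
    switch (s->top) {
    case 0: s->e0 = v; break;
    case 1: s->e1 = v; break;
    case 2: s->e2 = v; break;
    case 3: s->e3 = v; break;
    default: return 0;
    }
    s->top++;
    return 1;
}

/* Read the topmost slot into *out; returns 0 if the stack is empty. */
static int scalar_pop(ScalarStack *s, float3 *out) {
    switch (s->top) {
    case 1: *out = s->e0; break;
    case 2: *out = s->e1; break;
    case 3: *out = s->e2; break;
    case 4: *out = s->e3; break;
    default: return 0;
    }
    s->top--;
    return 1;
}
```

Every push and pop costs one branch per possible depth, which is why this only looks reasonable for a small, fixed maximum depth.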
Also, a stack is quite an important structure.
I believe that if AMD has thought about it, they are welcome to share their ideas or any info on whether this will be realized in the future.
Ryta, if you have some sort of idea for emulating a stack, even with a limited depth, please share.
And, since the compiler allows me to write such structures, what's really happening with that array and all the assignments?
You can implement a stack using streams. It's kind of perverse, but shouldn't be as bad as the constant thing you suggested. Basically, what you do is make a mask stream for determining whether or not the push/pop actually occurs in that kernel, a current position stream, a next position stream, and the data stream.
All the streams have the same dimensionality as the rest of the domain, except for the data stream, which is of dimension n+1. Your functions should look something like:
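kernel void pop(int mask<>, float3 data[], float curIndex<>, out float nextIndex<>, out float3 returnedData<>)
{
// curIndex counts the stored elements, so the top of the stack sits at curIndex - 1
float2 index;
index.x = indexof(returnedData).x;
index.y = curIndex - 1.0f;
// defaults: nothing popped, stack position unchanged
nextIndex = curIndex;
returnedData = float3(0.0f, 0.0f, 0.0f); // placeholder for a "nothing popped" marker such as NaN
if(mask == 1)
{
if(curIndex >= 1.0f) // pop only from a non-empty stack
{
nextIndex = curIndex - 1.0f;
returnedData = data[index];
}
}
}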
kernel void push(int mask<>, float3 toStore<>, float curIndex<>, out float nextIndex<>, out float3 updatedData[])
{
// x: position of this element in the domain, y: first free slot of its stack
float2 index;
index.x = indexof(nextIndex).x;
index.y = curIndex;
if(mask == 1)
{
nextIndex = curIndex + 1.0f;
updatedData[index] = toStore;
}
else
{
nextIndex = curIndex;
}
}
Unfortunately, this solution requires that the stacks be manipulated from your host code by making top-level kernel calls, since you're using streams. Arrays seem tricky to implement efficiently in Brook+ because they can be interpreted several ways. For instance, you may want to unroll some operation or reduce the amount of typing you have to do. In this case, it would be more appropriate to map the array onto r# in CAL, so there's no runtime indexing. On the other hand, a stack would truly need an array, which would be implemented using the x#[] buffers in CAL.
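The masked push/pop scheme can be emulated on the CPU to check the bookkeeping before porting it. The following is a sketch in plain C, not Brook+: the sizes N and DEPTH and the names push_pass and pop_pass are assumptions for illustration, the indices are integers rather than floats, and each function stands in for one top-level kernel call over the whole domain.

```c
#define N 4      /* number of stream elements (hypothetical) */
#define DEPTH 8  /* maximum stack depth per element (hypothetical) */

typedef struct { float x, y, z; } float3;

/* One pass over the domain: element i pushes toStore[i] only if mask[i] == 1. */
static void push_pass(const int mask[N], const float3 toStore[N],
                      const int curIndex[N], int nextIndex[N],
                      float3 data[DEPTH][N]) {
    for (int i = 0; i < N; ++i) {
        if (mask[i] == 1 && curIndex[i] < DEPTH) {
            data[curIndex[i]][i] = toStore[i];  /* write to the first free slot */
            nextIndex[i] = curIndex[i] + 1;
        } else {
            nextIndex[i] = curIndex[i];         /* masked out or full: unchanged */
        }
    }
}

/* One pass over the domain: element i pops only if mask[i] == 1 and its stack is non-empty. */
static void pop_pass(const int mask[N], float3 data[DEPTH][N],
                     const int curIndex[N], int nextIndex[N],
                     float3 returned[N]) {
    const float3 nothing = {0.0f, 0.0f, 0.0f};  /* placeholder for "nothing popped" */
    for (int i = 0; i < N; ++i) {
        if (mask[i] == 1 && curIndex[i] >= 1) {
            nextIndex[i] = curIndex[i] - 1;
            returned[i] = data[nextIndex[i]][i];  /* top of stack is at curIndex - 1 */
        } else {
            nextIndex[i] = curIndex[i];
            returned[i] = nothing;
        }
    }
}
```

After each pass the host would swap the roles of the current and next position arrays, the same way the current and next position streams would be swapped between kernel calls.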
OK, I need to take a closer look at IL and CAL to understand what you mean. Thank you very much for the example. I also need to think for a while to see if I can organize the traversal into separate kernels using the stack implementation you suggested.
This is a new field for me, so I may need to get the translated IL from the StreamKernel Analyzer and add the stack support by hand.
Is this a common way to proceed when adding some missing Brook+ functionality?
And thanks again for all you shared.
I'll go and see how these x#[] scratch registers are used.
Rick, if you can show some starting point, I would appreciate it.
@ ryta: even if there were arrays as elements of the struct, how can you read and write them in the same kernel, while using ONE kernel only, so the load balancing of the GPU works better?
I need a stack for each thread. Ray tracing is the topic. That's why I wondered why the Brook+ compiler did not complain about me having declared local arrays :^)
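For readers who have not seen the pattern, this is roughly what a per-thread stack is used for. The sketch below is plain C, not Brook+. The Node layout and the name sum_tree are hypothetical, and summing node values stands in for the real per-node work such as ray/box tests:

```c
#define STACK_SIZE 64  /* same depth as stackArray[64] in the first post */

/* Hypothetical flattened binary tree node; children are array indices, -1 means none. */
typedef struct {
    int left;
    int right;
    int value;
} Node;

/* Visit every node reachable from root without recursion, using a local stack. */
static int sum_tree(const Node *nodes, int root) {
    int stack[STACK_SIZE];  /* private to this call, as it would be per thread */
    int top = 0;
    int sum = 0;

    if (root < 0) {
        return 0;
    }
    stack[top++] = root;
    while (top > 0) {
        const Node *n = &nodes[stack[--top]];  /* pop */
        sum += n->value;
        /* Push children; the bounds check silently drops nodes if the stack is full. */
        if (n->right >= 0 && top < STACK_SIZE) {
            stack[top++] = n->right;
        }
        if (n->left >= 0 && top < STACK_SIZE) {
            stack[top++] = n->left;
        }
    }
    return sum;
}
```

The stack index here changes at runtime on every iteration, which is exactly the kind of access that cannot be mapped onto plain r# registers.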
Is there a way to mix Brook+ kernels with CAL kernels, i.e. sort of inline assembler in C or calling a kernel written in IL inside .br file? That would be cool
Nevertheless, I will try some sort of recursion, but I am not sure what the result will be.
My tests proved that a single kernel is much better than multiple, although with split kernel functionality all rays are in the same state through the tree. Originally, I supposed that memory read coherence and better locality would play a role, but things get worse.
Since there is no way to mask inactive stream elements with some hardware mask, and since there is no stack, I believe GPUs are not that good at processing stream elements with diverse states. Or I am still too much of a serial programmer.
Most of the papers on that topic suck IMHO. All the reported data and speeds are somewhat mystified. How can a CUDA raytracer be 5 times faster than mine? Well, until I find a real demo that runs on my PC with a reasonably complex scene (500+k triangles), I do not believe any results (except for the bunny scene).
I have read a number of papers, but their results seem unlikely.
Ok, I have done this:
I manually insert CAL code into the file that Brook has generated ("*_gpu.h"). I use the x#[] registers now.
This does not appear to work correctly; the results are incorrect for some reason. As a reminder, my GPU is a FireGL V7700.
"dcl_indexed_temp_array x0[64]\n"
"mov r276.xyz_,x0[r278.x].xyz0\n" // get the element from the scratch register
"mov x0[r284.x].xyz_,r285.xyz0\n" // write the passed parameter to the scratch register
I currently cannot publish the whole kernel due to company reasons, but I will appreciate some comments on the usage of x# registers.
r285 holds the float3 that is to be pushed, while r276 is the result variable's register. The traversal appears to be working, but the screen becomes a mess of white point noise and black, and the black is due to the "shading".
So, I may have fibbed a little, because I just realized that updatedData[index] = toStore is not normally possible. You might still be able to do the push as a scatter operation, but I'm not sure. The x#[] registers are 128 bits wide and are addressed by integers. You can read about the available instructions and how to write IL kernels in the IL Language Specification. The Stream Programming guide tells you how to use CAL to interface with an IL kernel, and there are several examples in the install directory. If you do use the x#[] registers, each stack can be local to each kernel instance.
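Since the x#[] registers are addressed by integers, one thing that may be worth checking is whether the index register actually holds an integer. This is only a guess, as the full kernel is not posted. If the stack top is kept as a float and its raw bits are used as the index, the address is far out of range. The plain C sketch below shows the difference between reinterpreting and converting; the helper names are made up for illustration:

```c
#include <stdint.h>
#include <string.h>

/* Reinterpret the 32 bits of a float as an integer, with no conversion. */
static int32_t float_bits_as_int(float f) {
    int32_t i;
    memcpy(&i, &f, sizeof i);
    return i;
}

/* Convert the numeric value of a float to an integer (truncating). */
static int32_t float_to_int(float f) {
    return (int32_t)f;
}
```

With IEEE 754 single precision, float_bits_as_int(1.0f) is 1065353216 (0x3F800000), while float_to_int(1.0f) is 1. Only the second is a usable index into a 64-entry array.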