
lust
Journeyman III

Question about arrays in kernels

Hi all,

I have used an array inside a kernel in order to perform iterative traversal of a tree. The program executes correctly with BRT_RUNTIME=cpu, but when executing with CAL the results are erroneous. I declare it like this (yeah, that's obvious):

float3 stackArray[64];

I would greatly appreciate any help or recommendations. I may put more detail if necessary.

 

 





0 Likes
14 Replies
ryta1203
Journeyman III

lust,

Kernel local arrays are not currently supported by Brook+. I REALLY hope that the Brook+ team gets them into the next release, as they are quite important.
0 Likes
lust
Journeyman III

Thanks for the quick reply, ryta.

I need some way to emulate a stack, even if that means 50 float3 variables and a massive bunch of ifs. I see you have more in-depth knowledge of the Brook+ language. Is there a way to do something like this:

float3 elem1;float3 elem2; float3 elem3;...; float3 elem50;

float stackTop = 0.f;

float adressOfFirstElem = getAddressOfElem1();

...

float3 currStackElem = adressOfFirstElem + stackTop;

---------

I want to believe there is a way of getting to the required element, even if it's a hacky way.

Also, a stack is quite an important structure.

I believe that if AMD has thought about it, they are welcome to share their ideas or any info on this being realized in the future.

Ryta, if you have some sort of idea for emulating a stack, even with a limited depth, please share.
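For illustration, the "named elements plus a bunch of ifs" idea can be sketched in plain C (not Brook+). This is only a rough sketch under assumptions: the type `Stack4`, the functions `stack4_init`, `stack4_push` and `stack4_pop`, and the depth of 4 are all made up for the example, and the stack top is an integer here rather than a float.

```c
/* Plain-C sketch of a fixed-depth stack built from named slots.
   All names here are hypothetical; depth 4 keeps the example short. */
typedef struct { float x, y, z; } float3;

typedef struct {
    float3 e0, e1, e2, e3; /* named slots instead of an array */
    int top;               /* number of stored elements, 0..4 */
} Stack4;

static void stack4_init(Stack4 *s) { s->top = 0; }

/* Push selects the destination slot with an if-chain, so no
   run-time array indexing is needed. A push on a full stack is dropped. */
static void stack4_push(Stack4 *s, float3 v) {
    if (s->top == 0)      s->e0 = v;
    else if (s->top == 1) s->e1 = v;
    else if (s->top == 2) s->e2 = v;
    else if (s->top == 3) s->e3 = v;
    else return;
    s->top += 1;
}

/* Pop returns the most recently pushed element, or zeros if empty. */
static float3 stack4_pop(Stack4 *s) {
    float3 v = {0.0f, 0.0f, 0.0f};
    if (s->top <= 0) return v;
    s->top -= 1;
    if (s->top == 0)      v = s->e0;
    else if (s->top == 1) v = s->e1;
    else if (s->top == 2) v = s->e2;
    else                  v = s->e3;
    return v;
}
```

Every push and pop walks the whole if-chain, so the cost grows with the stack depth, which is why this approach is expected to hurt at 50 elements.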

0 Likes
lust
Journeyman III

And, since the compiler allows me to write such structures, what's really happening with that array and all the assignments?

0 Likes

I'm not sure, but do you need a stack for every instance of the domain?

EDIT: I really have no idea, other than to do it the way you mention, which is really going to be a pain and will probably hurt performance.

I know that CAL supports local kernel arrays, so hopefully the Brook+ people (Viz Experts) will get this in soon.

Also, it doesn't look like you can have a struct that has members that are arrays. Also, I couldn't get the struct to work as a Stream Element UNLESS it was declared in the .br file itself... I find this a little annoying, particularly for larger projects.
0 Likes

You can implement a stack using streams. It's kind of perverse, but shouldn't be as bad as the constant thing you suggested. Basically, what you do is make a mask stream for determining whether or not the push/pop actually occurs in that kernel, a current position stream, a next position stream, and the data stream.

All the streams have the same dimensionality as the rest of the domain, except for the data stream, which is of dimension n+1. Your functions should look something like:

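kernel void pop(int mask<>, float3 data[], float curIndex<>, out float nextIndex<>, out float3 returnedData<>)
{
    float2 index;
    index.x = indexof(returnedData).x;
    index.y = curIndex - 1.0f; // top element sits one slot below the next free position

    if(mask == 1 && curIndex >= 1.0f)
    {
        // stack is non-empty: return the top element and move the index down
        nextIndex = curIndex - 1.0f;
        returnedData = data[index];
    }
    else
    {
        // masked out or empty stack: leave the index unchanged
        nextIndex = curIndex;
        returnedData = NaN; // placeholder for "no valid element"
    }
}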

kernel void push(int mask<>, float3 toStore<>, float curIndex<>, out float nextIndex<>, out float3 updatedData[])
{
    float2 index;
    index.x = indexof(nextIndex).x; // position of this element in the domain
    index.y = curIndex;             // next free slot

    if(mask == 1)
    {
        nextIndex = curIndex + 1.0f;
        updatedData[index] = toStore; // scatter write
    }
    else
    {
        nextIndex = curIndex;
    }
}

Unfortunately, this solution requires that the stacks be manipulated from your host code by making top-level kernel calls, since you're using streams. Arrays seem tricky to implement efficiently in Brook+ because they can be interpreted several ways. For instance, you may want to unroll some operation or reduce the amount of typing you have to do. In this case, it would be more appropriate to map the array onto the r# registers in CAL, so there's no runtime indexing. On the other hand, a stack truly needs an array, which would be implemented using the x#[] buffers in CAL.
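To make the data layout concrete, here is a rough plain-C emulation of the stream-based stack described above (not Brook+, and not tested on a GPU). Each loop iteration stands in for one kernel instance, and each call stands in for one top-level kernel call from the host. The names `StreamStacks`, `stacks_init`, `push_pass`, `pop_pass`, `DOMAIN` and `DEPTH` are assumptions made for the example. The index is an integer updated in place, whereas the kernels use separate current and next index streams.

```c
/* Plain-C emulation of one stack per domain element, stored as a
   (DEPTH x DOMAIN) data array plus a per-element index.
   All names and sizes are hypothetical. */
#define DOMAIN 4 /* number of stream elements (e.g. rays) */
#define DEPTH  8 /* maximum stack depth per element */

typedef struct { float x, y, z; } float3;

typedef struct {
    float3 data[DEPTH][DOMAIN]; /* the "n+1"-dimensional data stream */
    int index[DOMAIN];          /* next free slot for each element */
} StreamStacks;

static void stacks_init(StreamStacks *s) {
    for (int i = 0; i < DOMAIN; ++i) s->index[i] = 0;
}

/* One "push kernel call": only elements whose mask is 1 push. */
static void push_pass(StreamStacks *s, const int mask[DOMAIN],
                      const float3 toStore[DOMAIN]) {
    for (int i = 0; i < DOMAIN; ++i) {
        if (mask[i] == 1 && s->index[i] < DEPTH) {
            s->data[s->index[i]][i] = toStore[i];
            s->index[i] += 1;
        }
    }
}

/* One "pop kernel call": valid[i] reports whether element i
   actually received data (mask set and stack non-empty). */
static void pop_pass(StreamStacks *s, const int mask[DOMAIN],
                     float3 out[DOMAIN], int valid[DOMAIN]) {
    for (int i = 0; i < DOMAIN; ++i) {
        valid[i] = 0;
        out[i].x = out[i].y = out[i].z = 0.0f;
        if (mask[i] == 1 && s->index[i] > 0) {
            s->index[i] -= 1;
            out[i] = s->data[s->index[i]][i];
            valid[i] = 1;
        }
    }
}
```

The `valid` flag replaces the NaN sentinel so the caller can tell an empty pop from real data without inspecting the payload.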

0 Likes

Ok, I need to look more closely at IL and CAL in order to understand what you mean. Thank you very much for the example. I also need to think for a while to see if I can organize the traversal in separate kernels using the stack implementation you suggested.

This is a new field for me, so I may need to get the translated IL from the StreamKernel Analyzer and add the stack support by hand.

Is this a common way to proceed when adding some missing Brook+ functionality?

And thanks again for all you shared.

I'll go and see how these x#[] scratch registers are used.

Rick, if you can show something to get me started, I would appreciate it.

0 Likes
lust
Journeyman III

@ ryta: even if there were arrays as elements of the struct, how can you read and write them in the same kernel, while using ONE kernel only, so the load balancing of the GPU works better?

0 Likes

Originally posted by: lust

@ ryta: even if there were arrays as elements of the struct, how can you read and write them in the same kernel, while using ONE kernel only, so the load balancing of the GPU works better?



Yes, there's no way to pass a variable as a parameter to a kernel and have it be read/write, I was thinking locally. Ideally though, it's very hard to guess since I'm not exactly sure what you want to do. Are you trying to implement 1 stack for all instances or do you need 1 stack/instance?
0 Likes

I need a stack for each thread. Ray tracing is the topic. That's why I wondered why the Brook+ compiler did not complain about me having declared local arrays :^)

Is there a way to mix Brook+ kernels with CAL kernels, i.e. sort of inline assembler in C or calling a kernel written in IL inside .br file? That would be cool

Nevertheless, I will try some sort of recursion, but I am not sure what the result will be.

My tests proved that a single kernel is much better than multiple, although with split kernel functionality all rays are in the same state through the tree. Originally, I supposed that memory read coherence and better locality would play a role, but things get worse.

Since there is no way to mask inactive stream elements with some hardware mask, and since there is no stack, I believe GPUs are not that good at processing stream elements with diverse states. Or I am still too much of a serial programmer.

0 Likes

Ray Tracing on GPU:

http://graphics.stanford.edu/papers/rtongfx/


This is a good paper, IMO, and if you are trying to do ray tracing on a stream processor it might be worth your time, since that's what they do.

Another:

http://graphics.stanford.edu/p.../gpu-kd-i3d.pdf


There is plenty of literature out there about this topic, have you done some reading on this already?

EDIT: Neither recursion nor local arrays are supported in Brook+. Not sure if I mentioned that already or if you knew that, just saying.

0 Likes

Most of the papers on that topic suck IMHO. All reported data and speeds are somewhat mystified. How can a CUDA raytracer be 5 times faster than mine? Well, until I find a real demo that runs on my PC with a reasonably complex scene (500k+ triangles), I do not believe any results (except for the bunny scene).

I have read a number of papers, but their results seem unlikely.

0 Likes

Ok, good luck!
0 Likes

Ok, I have done this:

I manually inserted CAL code into the file that Brook generated ("*_gpu.h") and now use the x#[] registers, but the results are incorrect for some reason. As a reminder, my GPU is a FireGL V7700.

"dcl_indexed_temp_array x0[64]\n"

"mov r276.xyz_,x0[r278.x].xyz0\n" // get the element from the scratch register

"mov x0[r284.x].xyz_,r285.xyz0\n" // write the passed parameter to the scratch register

I currently cannot publish the whole kernel for company reasons, but I would appreciate some comments on the usage of the x# registers.

r285 holds the float3 that is to be pushed, while r276 is the result variable's register. The traversal appears to be working, but the screen becomes a mess of white point noise and black; the black is due to the "shading".

 





0 Likes

So, I may have fibbed a little, because I just realized that updatedData[index] = toStore is not normally possible. You might still be able to do pushing as a scatter operation, but I'm not sure. The x#[] registers are 128-bit registers addressed by integers. You can read about the available instructions and how to write IL kernels in the IL Language Specification. The Stream Programming Guide tells you how to use CAL to interface with an IL kernel, and there are several examples in the install directory. If you do use the x#[] registers, you can have each stack be local to each kernel instance.
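Since the x#[] registers are addressed by integers, one thing that may be worth checking is whether the index registers (r278.x and r284.x in the snippet above) hold an integer or a float. This is only a guess, not something confirmed in this thread. The plain-C sketch below, which assumes IEEE-754 single precision and uses the made-up helper names `float_bits_as_int` and `float_to_index`, shows what happens when the bits of a float stack counter are read as an integer without conversion.

```c
#include <stdint.h>
#include <string.h>

/* Read the raw bits of a float as a 32-bit integer, the way an
   integer-addressed register file would see an unconverted float.
   Assumes IEEE-754 single precision. Hypothetical helper. */
static int32_t float_bits_as_int(float f) {
    int32_t i;
    memcpy(&i, &f, sizeof i);
    return i;
}

/* Convert a float counter to the integer index it is meant to be.
   Hypothetical helper. */
static int32_t float_to_index(float f) {
    return (int32_t)f;
}
```

With these helpers, `float_bits_as_int(1.0f)` is 0x3F800000 (1065353216), far outside a 64-entry array, while `float_to_index(1.0f)` is 1. If the stack top is kept as a float, an explicit float-to-integer conversion before indexing would rule this out.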

0 Likes