dukeleto

gather kernel with more than one output stream

is it possible?

Hi,
I'm solving a system of 4 coupled partial differential equations with a finite-difference-style method. In single precision, I can do everything in a single kernel with float4 inputs and outputs. In double precision, I have currently split my kernel in two and use double2 input and output for each. I would like to use a single kernel, but that would require two output streams. Is this possible?
If it is possible, will the indexof() function return the same position for each stream, or are the positions independent?
Thanks!

It is possible to use multiple output streams, but only one scatter stream. Optimized_nlm_denoise in the samples folder uses 4 output streams.
dukeleto

Thanks Micah. I am using the Linux version of the SDK; I have no optimized_nlm_denoise sample. Are you referring to the NLM_Denoise one? If so, I cannot see more than one output stream!
Thanks

dukeleto,

I have the Windows version of the SDK and I don't have the optimized_nlm_denoise sample either. Micah has referred to this sample concerning multiple output streams, but it didn't come with the SDK.

Micah, maybe you are using a newer version than we are?

Also, I am very curious about the positions of streams of different sizes in a kernel. For example, if I have an "index" stream of size 256 and an out stream of size 512, how does that work when accessing indexof(index) and indexof(out)?

Ryta,
Your indexof operator should be used on your output stream, as that determines your domain of execution. Essentially, your kernel is executed once for every element in your output stream.

Originally posted by: MicahVillmow

Ryta,

Your indexof operator should be used on your output stream, as that determines your domain of execution. Essentially, your kernel is executed once for every element in your output stream.


I'm not sure what you mean. The kernel executes once per output stream? That doesn't sound right, so that must not be what you are saying.

For example:

kernel void foo(float4 in1[120], float4 in2[120], out float4 out1[240], out float4 out2[360])
{
out1[indexof(out1)] = in1[indexof(out1)];
out2[indexof(out2)] = in2[indexof(out1)];
}

How does this work? When the kernel runs, does out1 stop when it hits its end point while the kernel execution for the assignment of out2 continues because it still has an index?

This is fairly confusing to me, I must admit.

Also, what about using indexof(index) for some functions?

For example:

kernel void foo(float index<>, float4 in[255], out float4 out[255*8])
{
int idx = indexof(index);
out[idx] = in[idx];
}

I think I have seen samples do this, where they use indexof() on the input streams. Is that not ok?

Ryta,
When dealing with the streaming model, your execution domain is determined by your output stream size: one output data point equals one execution of the kernel. I have not tested multiple output streams of different sizes, but if they are all the same size it works. Also, only one scatter stream is allowed, so your example is not possible. Ignoring the scatter example for a second: in a normal output stream, the location is implicit. The only way to know your current location inside the execution domain is to use the indexof operator on the output stream. It is possible to use the indexof operator on an input stream, but that gives you not the location in the execution domain but the location in the input stream that is implicitly mapped to that output position.
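
To make the "one output data point equals one execution" idea concrete, here is a minimal CPU-side sketch in Python. It is only a model of the description above, not Brook+ code, and the names run_kernel, kernel, out_a and out_b are hypothetical. It assumes two ordinary (non-scatter) output streams of the same size, which is the only multi-output case reported to work in this thread.

```python
def run_kernel(kernel, domain_len):
    """Model of the streaming runtime: call the kernel once per domain position."""
    for pos in range(domain_len):
        kernel(pos)

n = 6
in_a = [float(i) for i in range(n)]
in_b = [10.0 * i for i in range(n)]
out_a = [None] * n   # first output stream
out_b = [None] * n   # second output stream, same size as the first
calls = []           # records the position seen by each kernel invocation

def kernel(pos):
    # 'pos' plays the role of indexof() on an output stream in this model.
    calls.append(pos)
    out_a[pos] = in_a[pos] * 2.0
    out_b[pos] = in_b[pos] + 1.0

# The execution domain is taken from the output stream size.
run_kernel(kernel, len(out_a))
print(calls)
print(out_a)
print(out_b)
```

In this model a single position addresses both outputs because they share one execution domain; whether Brook+ guarantees the same for indexof() on each output is exactly what would need testing on the real SDK.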

For the case of scatter, indexof used on the scatter stream gives you the location in the execution domain, which is based on either the stream size or the number of threads set by the execDomain function.

For your example:
indexof(index) will return values that depend on the length of the input stream relative to either the execDomain or the output scatter stream size.
For the case where index is of length 255 and out is 255*8, each element of index is duplicated 8 times, and therefore indexof(index) should return the same value for 8 consecutive threads. This is obviously not what was intended.
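
As a rough illustration of that duplication, the following Python sketch models which input element each thread would see. The function name implicit_input_index is hypothetical, and the sketch assumes the domain length is an exact multiple of the input length and that the stretch is a plain integer replication, as described above.

```python
def implicit_input_index(thread_id, input_len, domain_len):
    """Model of the implicit stretch: the input position mapped to a thread."""
    # Assumption: domain_len is an exact multiple of input_len.
    repeat = domain_len // input_len
    return thread_id // repeat

input_len = 255
domain_len = 255 * 8
indices = [implicit_input_index(t, input_len, domain_len)
           for t in range(domain_len)]

# The first 8 threads all see input position 0, the next 8 see position 1, ...
print(indices[:10])
```

So writing out[idx] with idx taken from the input stream would touch only 255 distinct positions of the 2040-element output in this model.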

So in Brook+, how might you go about taking a 2D array (flattened to 1D) and storing it into a subset of a 3D array (also flattened to 1D)? Is this even possible?

The code that I sent you had examples of where I was attempting to use this, for example:

in is an array of size 128*128;
out is an array of size 128*128*9;

kernel void foo(float4 in<>, out float4 out[])
{
out[indexof(in)+128*128*0] = in*2;
out[indexof(in)+128*128*1] = in*3;
out[indexof(in)+128*128*2] = in*4;
....
....
}

I believe this works but from what you are saying it doesn't sound like it should. Is that right?

Well, there is a formula to map from 3D indices to a 1D index:
idx = (z * (width * height)) + (y * width) + x
Or to go in the other direction (using integer division):
z = idx / (width * height)
y = (idx % (width * height)) / width
x = idx % width
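
For reference, here is the same mapping as a small Python sketch, with a round-trip check. The function names flatten and unflatten are just illustrative, and `//` is used for the integer division the formulas rely on.

```python
def flatten(x, y, z, width, height):
    """Map 3D indices to a 1D index, with x varying fastest."""
    return z * (width * height) + y * width + x

def unflatten(idx, width, height):
    """Inverse mapping: recover (x, y, z) from a 1D index."""
    z = idx // (width * height)
    y = (idx % (width * height)) // width
    x = idx % width
    return x, y, z

# Round trip on a small 4 x 3 x 2 volume.
width, height, depth = 4, 3, 2
for idx in range(width * height * depth):
    x, y, z = unflatten(idx, width, height)
    assert flatten(x, y, z, width, height) == idx

print(flatten(1, 2, 1, width, height))
print(unflatten(21, width, height))
```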

So the equivalent would be to create a 9 * height * width stream and just place each element from each dimension in sequential locations

i.e.
(x, y, z)
0, 0, 0 <-- location 0
0, 0, 1 <-- location 1
0, 0, 2 <-- location 2
0, 0, 3 <-- location 3
0, 0, 4 <-- location 4
0, 0, 5 <-- location 5
0, 0, 6 <-- location 6
0, 0, 7 <-- location 7
0, 0, 8 <-- location 8
1, 0, 0 <-- location 9
etc...

So the way to address this is to apply the 2D-to-1D transform y * width + x to get the index within a single 2D slice, then multiply by 9 to account for the 9 slices. After this, to access a different slice you just add 0-8 to your index.
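
A short Python sketch of this interleaved addressing, next to the planar addressing used in the question (indexof(in) + 128*128*k), may help compare the two. The function names and the parameter k (the slice number, 0-8) are illustrative only.

```python
WIDTH, HEIGHT, SLICES = 128, 128, 9

def interleaved_index(x, y, k, width=WIDTH, slices=SLICES):
    """Layout described above: the 9 values for one (x, y) sit side by side."""
    return (y * width + x) * slices + k

def planar_index(x, y, k, width=WIDTH, height=HEIGHT):
    """Layout from the question: 9 full width*height planes, one after another."""
    return (y * width + x) + width * height * k

print(interleaved_index(0, 0, 8))   # last slice of the first (x, y) point
print(interleaved_index(1, 0, 0))   # first slice of the next point
print(planar_index(0, 0, 1))        # first element of the second plane
```

Both layouts cover every position of the 128*128*9 output exactly once; they differ only in which values end up adjacent in memory.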

Now, for your execution, unless you use execDomain and specify it to only run 16K threads, it will run 144K threads (128 * 128 * 9).

execDomain allows a disconnection between the execution domain and the output domain.
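
The thread counts quoted above can be checked with a few lines of Python. This is plain arithmetic and does not call any Brook+ API; the variable names are illustrative.

```python
WIDTH, HEIGHT, SLICES = 128, 128, 9

# Default case: one thread per output element.
output_threads = WIDTH * HEIGHT * SLICES
# Restricted case: one thread per input element, each writing 9 outputs.
input_threads = WIDTH * HEIGHT

print(output_threads, output_threads // 1024)   # count, and count in units of K
print(input_threads, input_threads // 1024)
print(output_threads // input_threads)          # outputs written per thread
```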


Hope this answers your questions.
dukeleto

Thanks for the responses, Micah, and thanks for the really interesting questions, Ryta!
Regarding the multiple-output kernel declaration you indicated, Micah, that certainly looks like something I would like to do. However, I'm still not quite sure I understand from your answers whether I would have to declare a separate index for each output, or whether an index to the first output would also work correctly for the other outputs. Sorry if you've already answered, but in that case I didn't understand the answer!
Thanks a lot