Hello,
I'm trying to run the following simple kernel code:
Attribute[GroupSize(64)]
kernel void sum(int4 a<>, int4 b<>, out int4 c[]){
shared int4 lds[64];
int index = instance().x;
lds[index] = instance();
c[index] = lds[index];
}
while I'm using the following code for the main function:
#include <iostream>
#include "brookgenfiles/kernels.h"
using namespace std;
int main(void){
int4 cpu_a[64];
int4 cpu_b[64];
int4 cpu_c[64];
unsigned int streamSize[] = {64};
unsigned int rank = 1;
brook::Stream<int4> gpu_a(rank, streamSize);
brook::Stream<int4> gpu_b(rank, streamSize);
brook::Stream<int4> gpu_c(rank, streamSize);
gpu_a.read(cpu_a);
gpu_b.read(cpu_b);
sum(gpu_a, gpu_b, gpu_c);
if(gpu_c.error()){
cout << gpu_c.errorLog();
return 1;
}
gpu_c.write(cpu_c);
for(int i = 0; i < 64; i++)
    cout << "(" << cpu_c[i].x << ", " << cpu_c[i].y << ", " << cpu_c[i].z << ", " << cpu_c[i].w << ")\n";
cout << "\n\n";
return 0;
}
The result I'm getting is the tuple (0, 0, 0, 0) for every element of stream c, which is obviously unexpected (see the lds lines of the kernel code).
Am I doing something wrong?
The existing hardware exposes LDS as per-thread memory (Brook+ exposes it as per-group for future use). Other threads within a group can read the LDS memory assigned to another thread, but they cannot write to it.
If you convert your kernel to the following, it should work:
Attribute[GroupSize(64)]
kernel void sum(int4 a<>, int4 b<>, out int4 c[])
{
shared int4 lds[64];
int index = instance().x;
lds[1 * instanceInGroup().x + 0] = instance();
c[index] = lds[index];
}
This write ensures that each thread writes only to its own LDS memory and not to another thread's.
I'm still getting the same results!
Try
syncGroup();
after setting LDS?
Jawed
Originally posted by: Jawed Try
syncGroup();
after setting LDS?
Jawed
I don't think syncGroup() is necessary here, since each thread reads the same location that it writes; no thread needs to wait for the other threads to finish their writes.
Nevertheless, I tried it but it leads to the same results.
What is your system configuration and driver version?
My system specs are the following:
GPU: HD4830
CPU: core i7 920
driver version: 8.6
catalyst version: 9.4
os: windows vista 32 bit
Thanks for your responses!
After further investigation, I can conclude that there is a bug: Brook+ does not handle int4 shared memory consistently.
I converted my kernel to the following:
Attribute[GroupSize(64)]
kernel void sum(float4 a<>, float4 b<>, out float4 c[]){
shared float4 lds[64];
int index = instance().x;
float4 tmp;
tmp.x = (float)instance().x;
tmp.y = (float)instance().y;
tmp.z = (float)instance().z;
tmp.w = (float)instance().w;
lds[1 * instanceInGroup().x + 0] = tmp;
c[index] = lds[index];
}
and now it works!
I can confirm this bug: LDS of uint4 type does not work. Please see the attached pybrook code.
Output:
In [3]: execfile('lds_test.py')
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8
8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8]
A pretty fundamental bug. Could this be addressed please?
Casting from uint to float is not a viable work-around when I need all the bits of the uint ... for example, for the random number generator I'm working on ... or is there another work-around I'm not seeing?
import os
os.environ['BRT_ADAPTER'] = '1'
import kernels
import stream
import math
import numpy

def fX(dims, x):
    b = numpy.array(dims)[::-1]
    b[-1] *= x
    return tuple(b)

def f4(dims):
    return fX(dims, 4)

def f2(dims):
    return fX(dims, 2)

dims = (64,)
h_out = numpy.zeros(f4(dims), dtype=numpy.uint32)
h_in = numpy.ones(f4(dims), dtype=numpy.uint32) * 8
d_in = stream.Stream_uint4(dims)
d_out = stream.Stream_uint4(dims)
d_out.read(h_out)
d_in.read(h_in)

k = kernels.__lds_test_uint()
k.run(d_in, d_out)
if d_out.error() != stream.BRerror.BR_NO_ERROR:
    print "Error:", d_out.errorLog()
d_out.write(h_out)
print h_out[0::4]

k = kernels.__lds_test_float()
k.run(d_in, d_out)
if d_out.error() != stream.BRerror.BR_NO_ERROR:
    print "Error:", d_out.errorLog()
d_out.write(h_out)
print h_out[0::4]

*************** end *****************

*********** kernels ***************

Attribute[GroupSize(64)]
kernel void lds_test_uint(uint4 in_s[], out uint4 out_s[])
{
    shared uint4 lds[64];
    int local_id = instanceInGroup().x;
    lds[1 * instanceInGroup().x + 0] = in_s[instance().x];
    syncGroup();
    out_s[instance().x] = lds[1 * instanceInGroup().x + 0];
}

Attribute[GroupSize(64)]
kernel void lds_test_float(uint4 in_s[], out uint4 out_s[])
{
    shared float4 lds[64];
    int local_id = instanceInGroup().x;
    lds[1 * instanceInGroup().x + 0] = (float4)in_s[instance().x];
    syncGroup();
    out_s[instance().x] = (uint4)lds[1 * instanceInGroup().x + 0];
}
Note, the CPU backend, i.e.
os.environ['BRT_RUNTIME']='cpu'
appears to be working:
In [3]: execfile('lds_test.py')
[8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8
8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8]
[8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8
8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8]
Looking at the IL code in the kernels_gpu.h file, one can detect and fix the problem in lds_test_uint: there is an unnecessary ftou after the lds_read_vec. Removing the ftou results in a working lds_test_uint.
Specifically, something like these two lines (r#s likely differ):
"lds_read_vec r370.xyzw,r426.xy00\n"
"ftou r427.xyzw,r370.xyzw\n"
should actually be:
"lds_read_vec r427.xyzw,r426.xy00\n"
Manually making this change in the generated IL results in correct behavior for lds_test_uint.
So if brcc could be made to not generate this extraneous ftou after the lds_read_vec, shared uint4 support would be working.
It would be much appreciated if this could be fixed in the brook+ sourceforge svn repo.
Indeed, looking in astbuiltinfunctions.cpp, one finds
BEGIN_BUILTIN("ReadLds", ...)
...
The EffectiveDataType is hardcoded to return float4.
Changing it to uint4 produces the desired result, but then it works only for uint4.
There should be a global variable, or something similar, that holds the user-declared LDS type for the kernel, and EffectiveDataType should return that type.
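The proposed dispatch can be sketched as follows (a Python illustration of the idea only; the actual change would live in brcc's C++ code, and the names here are hypothetical, not real brcc symbols):

```python
# Hypothetical sketch of the proposed fix: instead of hardcoding float4,
# EffectiveDataType consults the LDS element type the user declared in
# the kernel. "declared_lds_type" stands in for the suggested global state.

def effective_data_type(declared_lds_type=None):
    # Default to float4 so existing kernels, which all use float4 LDS,
    # keep compiling exactly as before.
    return declared_lds_type if declared_lds_type is not None else "float4"

assert effective_data_type() == "float4"        # backwards compatible
assert effective_data_type("uint4") == "uint4"  # the currently broken case
assert effective_data_type("int4") == "int4"
```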
@brook developers: Shall I make this change and submit a patch, or would you prefer to take care of it in the next few days? I am presently blocked by the lack of this feature.
@Micah
Presently, in the brook decl "shared Vector4Type lds
Are you concerned about it breaking hand-written hlsl code?
You propose adding a ReadLdsX instead of fixing ReadLds for the presently broken cases, and then I should change the brook->hlsl code generator to use ReadLdsX/WriteLdsX based on the user-defined Vector4Type?
Is hlsl really used outside of brook+?
If not, I think a type dispatch in ReadLds based on Vector4Type is cleaner than many ReadLdsX variants, and all present code would continue to work, since it is all float4.
If yes, is there an hlsl test suite? Do you suspect a dispatched ReadLds would not pass all the tests?
@ Micah
An afterthought...
I guess your point is that in hlsl there is no global way (as there is in brook) to declare the type of the shared memory. So you're saying the user should choose the representation of what is written to and read from shared memory by choosing the read and write routines appropriately, and brook should then pick the correct read/write pair based on the user-declared Vector4Type for shared memory. Is that correct?
Originally posted by: MicahVillmow Another solution, at the user level, is to do a bitcast between float4 and uint4 instead of a conversion.
How does one do a bitcast in a brook+ kernel?
Say:
uint x = 0xffffffff;
float y;
How do I get all the bits of x into y? 0xffffffff is not exactly representable as a float, so when it comes back to uint via utof/ftou it is corrupted.
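The distinction can be illustrated on the host side (a Python sketch using the struct module; this is not Brook+ code, and note that some GPU hardware canonicalizes NaN bit patterns, so even a true bitcast of arbitrary integer bits through a float path can be lossy on the device):

```python
import struct

x = 0xffffffff

# Numeric conversion (what utof/ftou do): 2**32 - 1 is not exactly
# representable in a 32-bit float, so it rounds to 2**32 and the
# round trip does not return x.
x_as_float32 = struct.unpack('<f', struct.pack('<f', float(x)))[0]
assert int(x_as_float32) != x        # value corrupted by rounding

# Bitcast: reinterpret the same 32 bits as a float and back. No
# rounding occurs, so every bit survives (here the bits happen to
# form a quiet-NaN pattern).
y = struct.unpack('<f', struct.pack('<I', x))[0]
x_back = struct.unpack('<I', struct.pack('<f', y))[0]
assert x_back == x
```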
OK, I'm working on a patch. I'm trying the ReadLds Vector4Type dispatch option. For this to work, the shared-memory type needs to be defined in the hlsl file, but presently the brook shmem type is not forwarded to hlsl. To remedy this, I've added a LocalDataShareType(Vector4Type) attribute to the hlsl [] declaration and modified parser.y accordingly. If the attribute is absent, the type defaults to float4, so backwards compatibility is maintained. I'm having one problem: my version of flex (even from the flex_old package on Ubuntu 9.10) seems to be much newer than the one originally used, and the resulting code does not compile; the shader_on/shader_off defines end up too far down in the file. How can I get them to occur higher up?
OK, I submitted a patch to the sourceforge site. Until it is incorporated into the trunk, the patch is available for download with instructions on how to apply it.