19 Replies Latest reply on Nov 5, 2009 11:17 AM by emuller

    accessing LDS memory problem

    chrisanonym

      Hello,

      I 'm trying to run the following simple kernel code:

      Attribute[GroupSize(64)]
      kernel void sum(int4 a<>, int4 b<>, out int4 c[]){
          shared int4 lds[64];

          int index = instance().x;
          lds[index] = instance();
          c[index] = lds[index];
      }

       

      while I 'm using the following code for the main function:

       

      #include <iostream>
      #include "brookgenfiles/kernels.h"

      using namespace std;

      int main(void){
          int4 cpu_a[64];
          int4 cpu_b[64];
          int4 cpu_c[64];
          unsigned int streamSize[] = {64};
          unsigned int rank = 1;

          brook::Stream<int4> gpu_a(rank, streamSize);
          brook::Stream<int4> gpu_b(rank, streamSize);
          brook::Stream<int4> gpu_c(rank, streamSize);

          gpu_a.read(cpu_a);
          gpu_b.read(cpu_b);

          sum(gpu_a, gpu_b, gpu_c);
          if(gpu_c.error()){
              cout << gpu_c.errorLog();
              return 1;
          }
          gpu_c.write(cpu_c);
          for(int i=0;i<64;i++)
              cout << "(" << cpu_c.x << ", " << cpu_c.y << ", " << cpu_c.z << ", " << cpu_c.w << ")\n";
          cout << "\n\n";
          return 0;
      }

       

      The results tha I 'm getting is the tupple (0, 0, 0, 0) for every element of stream C, which are obviously unexpected (see bold lines of the kernel code).

      Am I doing something wrong?

       

        • accessing LDS memory problem
          gaurav.garg

          The existing hardware exposes LDS as a memory per thread (Brook+ has exposed it as per group for future purpose). Other threads within a group can read the LDS memory assigned to other threads but they cannot write to it.

          If you convert your kernel to the following, it should work-

          Attribute[GroupSize(64)]
          kernel void sum(int4 a<>, int4 b<>, out int4 c[])
          {
              shared int4 lds[64];

              int index = instance().x;
              lds[1 * instanceInGroup().x + 0] = instance();
              c[index] = lds[index];
          }

          The write instruction makes sure that you are writing to its own memory and not to others.

            • accessing LDS memory problem
              chrisanonym

              I 'm still getting the same results!

                • accessing LDS memory problem
                  Jawed

                  Try

                     syncGroup();

                  after setting LDS?

                  Jawed

                    • accessing LDS memory problem
                      chrisanonym

                       

                      Originally posted by: Jawed Try

                       

                         syncGroup();

                       

                      after setting LDS?

                       

                      Jawed

                       

                      I think, there is no necessity to call syncGroup since each thread reads the same location that writes. Thus, there is no need for a thread to wait all other threads to execute their write instructions.

                      Nevertheless, I tried it but it leads to the same results.

                        • accessing LDS memory problem
                          gaurav.garg

                          What is your system configuration and driver version?

                            • accessing LDS memory problem
                              chrisanonym

                              My system specs are the following:

                              GPU: HD4830

                              CPU: core i7 920

                              driver version: 8.6

                              catalyst version: 9.4

                              os: windows vista 32 bit

                                • accessing LDS memory problem
                                  chrisanonym

                                  Thanks, for your responses!

                                  After extended research, I can conclude that there is a bug. Specifically, it seems that brook+ is not able to handle int4 shared memory consistently.

                                  I convert my kernel to the following:

                                  Attribute[GroupSize(64)]
                                  kernel void sum(float4 a<>, float4 b<>, out float4 c[]){
                                      shared float4 lds[64];
                                      int index = instance().x;
                                      float4 tmp;

                                      tmp.x = (float)instance().x;
                                      tmp.y = (float)instance().y;
                                      tmp.z = (float)instance().z;
                                      tmp.w = (float)instance().w;

                                      lds[1 * instanceInGroup().x + 0] = tmp;

                                      c[index] = lds[index];
                                  }

                                   

                                  and now it works!

                                    • accessing LDS memory problem
                                      emuller

                                      I confirm this bug... lds of uint4 type does not work.  Please see attached pybrook code. 

                                      Output:

                                      In [3]: execfile('lds_test.py')
                                      [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
                                       0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
                                      [8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8
                                       8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8]

                                      A pretty fundamental bug.  Could this be addressed please?

                                      Casting from uint to float is not a viable work-around when I need all the bits of the uint ... for example, for the random number generator I'm working on ... or is there another work-around I'm not seeing?

                                       

                                       

                                      import os os.environ['BRT_ADAPTER']='1' import kernels import stream import math import numpy def fX(dims,x): b = numpy.array(dims)[::-1] b[-1]*=x return tuple(b) def f4(dims): return fX(dims,4) def f2(dims): return fX(dims,2) dims =(64,) h_out = numpy.zeros(f4(dims),dtype=numpy.uint32) h_in = numpy.ones(f4(dims),dtype=numpy.uint32)*8 d_in = stream.Stream_uint4(dims) d_out = stream.Stream_uint4(dims) d_out.read(h_out) d_in.read(h_in) k = kernels.__lds_test_uint() k.run(d_in,d_out) if d_out.error()!=stream.BRerror.BR_NO_ERROR: print "Error:", s1.errorLog() d_out.write(h_out) print h_out[0::4] k = kernels.__lds_test_float() k.run(d_in,d_out) if d_out.error()!=stream.BRerror.BR_NO_ERROR: print "Error:", s1.errorLog() d_out.write(h_out) print h_out[0::4] *************** end ***************** *********** kernels *************** Attribute[GroupSize (64)] kernel void lds_test_uint(uint4 in_s[], out uint4 out_s[]) { shared uint4 lds[64]; int local_id = instanceInGroup().x; lds[1 * instanceInGroup().x + 0] = in_s[instance().x]; syncGroup(); out_s[instance().x] = lds[1 * instanceInGroup().x + 0]; } Attribute[GroupSize (64)] kernel void lds_test_float(uint4 in_s[], out uint4 out_s[]) { shared float4 lds[64]; int local_id = instanceInGroup().x; lds[1 * instanceInGroup().x + 0] = (float4)in_s[instance().x]; syncGroup(); out_s[instance().x] = (uint4)lds[1 * instanceInGroup().x + 0]; }

                                        • accessing LDS memory problem
                                          emuller

                                          Note, the CPU backend, i.e.

                                          os.environ['BRT_RUNTIME']='cpu'

                                          appears to be working:

                                          In [3]: execfile('lds_test.py')
                                          [8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8
                                           8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8]
                                          [8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8
                                           8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8]

                                            • accessing LDS memory problem
                                              emuller

                                              Looking at the IL code in the kernels_gpu.h file, one can detect and fix the problem in lds_test_uint.  There is an unecessary ftou after the lds_read_vec.  Removing the ftou results in a working lds_test_uint.

                                              Specifically, something like these two lines (r#s likely differ):

                                                 "lds_read_vec  r370.xyzw,r426.xy00\n"
                                                 "ftou r427.xyzw,r370.xyzw\n"

                                              should acctually be:

                                                   "lds_read_vec  r427.xyzw,r426.xy00\n"

                                              Manually making this change in the generated IL results in correct behavior for lds_test_uint. 

                                              So if brcc could be made to not generate this extraneous ftou after the lds_read_vec, shared uint4 support would be working.

                                              It would be much appreciated if this could be fixed in the brook+ sourceforge svn repo.

                                               

                                                • accessing LDS memory problem
                                                  emuller

                                                  Indeed looking in the astbuiltinfunctions.cpp, one finds

                                                  BEGIN_BUILTIN("ReadLds", ...)

                                                  ...

                                                  The EffectiveDataType is hardcoded to return float4.

                                                  Changing to uint4 produces the desired result ... but then works only for uint4.

                                                  There should be a global variable, or something, which contains the type of the user defined lds type for the kernel, and EffectiveDataType should return that type.

                                                  @brook developers: Shall I make this change and submit a patch, or would you prefer to take care of it in the next few days?  I am presently blocked by the lack of this feature.

                                                   

                                                   

                                                    • accessing LDS memory problem
                                                      MicahVillmow
                                                      emeuller,
                                                      A better approach is to generate a new intrinsic that is ReadLdsUint instead of ReadLds which is for float4. That way the change doesn't break other people's code.
                                                        • accessing LDS memory problem
                                                          emuller

                                                          @Micah

                                                          Presently, in the brook decl "shared Vector4Type lds[x]" the Vector4Type has no effect.  Shared mem is always treated as having type float4, but as far as I can tell, only on the lds_read_vec side.  What I propose is that ReadLds should acctually take the user specified Vector4Type into account.  This should not break any brook+ code, as anything other than a float4 for Vector4Type is producing garbage,results.

                                                          Are you concerned about it breaking hand written hlsl code?

                                                          You propose adding a ReadLdsX instead of fixing ReadLds for presently broken cases.  And then I should change the brook->hlsl code generator to use ReadLdsX, WiteLdsX based on the user defined Vector4Type?

                                                          Is hlsl really used outside of brook+?

                                                          If not, I think a type dispatch in ReadLds based on Vector4Type is cleaner than many ReadLdsX ... and all present code should continue to work, as its all float4.

                                                          If yes, is there a hlsl testsuite?  Do you guess a dispatched ReadLds would not pass all the tests?

                                                           

                                                           

                                                            • accessing LDS memory problem
                                                              MicahVillmow
                                                              emuller,
                                                              I see. I did not know that the type was based on the type defined in the kernel. Your solution is actually better that it should be based on the Vector4Type and not hard coded to float4 or having different function names for different types. Go ahead and submit a patch if you have one.
                                                              • accessing LDS memory problem
                                                                emuller

                                                                @ Micah

                                                                An afterthought...

                                                                I guess your point is that in hlsl there is not a global way (like there is in brook) to declare the type of the shared memory.  So you're saying the user should choose the representation of what's written in and read from shmem by chosing the read and write routine appropriately.  Then brook should choose the correct read/write pair based on the user declared Vector4Type for sh mem.  Is that correct?

                                                                 

                                                                  • accessing LDS memory problem
                                                                    MicahVillmow
                                                                    emuller,
                                                                    I'm not too familiar with brook+ or hlsl as I didn't work on those, but what should happen is that below the brook+ language level, it should specify different read/store routines that are not visible to the Brook+ developer. The brook+ compiler should generate different builtins based on the datatype of the function, i.e. ReadLds_float4, ReadLds_uint4, ReadLds_int4, etc... so that there is no conflicts in datatypes or unwanted conversions as is this case. The ReadLds function should then be mapped to one of these underlying functions transparently, and this would fix the problem. Now if doing it at the HLSL level or at some level below or above that is the correct solution, i'm sure the Brook+ developers know the location where it should occur better than myself. But there definately at some point in the compiler stack should be divergent code flows for each data type.

                                                                    Another solution at the user space is to do a bitcast between float4 to uint4 instead of a conversion.
                                                                      • accessing LDS memory problem
                                                                        emuller

                                                                         

                                                                        Originally posted by: MicahVillmow  Another solution at the user space is to do a bitcast between float4 to uint4 instead of a conversion.


                                                                        How does one do a bitcast in a brook+ kernel ?

                                                                        Say:

                                                                        uint x = 0xffffffff;

                                                                        float y;

                                                                        How do I get all the bits of x into y?  0xffffffff is not properly represented in a float such that when it comes back to uint via ftou utof it is corrupted.

                                                                         

                                                                         

                                                                         

                                                                          • accessing LDS memory problem
                                                                            emuller

                                                                            OK, I'm working on a patch.  I'm trying the ReadLds Vector4Type dispatch option ... for this to work, the shared mem type needs to be defined in the hlsl file.  Presently the brook shmem type is not forwarded to hlsl.  To remedy this, I've added a LocalDataShareType(Vector4Type) attribute to the hlsl [] declaration ... and modified parser.y accordingly.  If this option is absent, the type will default to float4, so backwards compatability will be maintained.  I'm having the problem that my version of flex (even the flex_old package) (Ubuntu 9.10) seems to be much newer than those origianlly used, and the resulting code does not compile ... the shader_on shader_off defines end up too far down in the file.  How can I get them to occur higher up?