78 Replies Latest reply on Jan 18, 2009 11:17 AM by zpdixon

    SDK 1.3 Feedback

    MicahVillmow

      Now that 1.3 has been released to the public, we would like feedback on it in order to improve future releases of the SDK. We would appreciate your help in providing feedback in this thread so that the information does not get buried in other threads. Please make sure you label each item as a 'Feature Request', 'Bug Report', 'Documentation' or 'Other'. As always, you can send an email to 'streamcomputing@amd.com' for general requests or 'streamdeveloper@amd.com' for development-related requests.

      If you wish to file a Feature Request, please include a description of the feature request and the part of the SDK that this request applies to.

      If you wish to file a Bug Report, please include the hardware you are running on, operating system, SDK version, driver/Catalyst version, and if possible either a detailed description of how to reproduce the problem or a test case. A test case is preferable, as it can help reduce the time it takes to determine the cause of the issue.

      If you wish to file a Documentation request, please specify the document, what you believe is in error or what you believe should be added and which SDK the document is from.

      Thank you for your feedback.
      AMD Stream Computing Team

        • what's the meaning of "compute kernel"?
          wgbljl

          Hi Micah,

              On page 151 of the User_Guide in the 1.3 beta SDK, it says that the FireStream 9250 supports "Compute Kernel", which the FireStream 9170 does not. What is the meaning of "Compute Kernel"?

             Thanks.

          • Bug Reports
            rveldema

            - double constants in expressions cause internal compiler error in brcc:

            A test case:

            -------------------------------------------------



            kernel void gpgpu_laplacalc(float __brook_iter_space<>, int __looplen, int Z,
                                        int firstcol, int lastcol, double loop_invar_0,
                                        double t1a[], out double t1b[], double t1c[], double t1d[])
            {
              int   iindex;
              double xx;

              iindex= 4;

              xx = 4.0;

              t1b[iindex]= t1a[iindex] - (4.0 * t1a[iindex]);
            }
            ------------------------

            Replacing the 4.0 in the last line with xx allows brcc to compile the kernel.

            This is under Linux with brcc from 1.3.

            • SDK 1.3 Feedback
              godsic

              Does the HD3450 support ATI Stream 1.3 with Catalyst 8.12?

              • Feature Request
                rveldema

                feature request.

                Brook+ 1.3 does not support multiple gather streams, so it is illegal to write:

                ------------------


                kernel void multiscatter(double input<>, out double output1[], out double output2[])
                {
                    int index = 33;
                    output1[index] = input * 2.0;
                    output2[index] = input * 3.0;
                }
                -----------------

                Unfortunately, this pattern can be quite common, requiring the programmer to split the code above into multiple kernels and do duplicate work.

                  • Feature Request
                    pbhani

                    rveldema,

                    Thanks for your feedback. I assume you mean multiple scatter streams! Yes, this is a restriction of the underlying architecture that Brook+ does not virtualize. It would be possible to support this feature using multiple passes and we do have this on our TODO list.

                  • bug report
                    rveldema

                    bug report

                    Like the other report, this also gives a strange internal compiler error, but seemingly for a different reason.

                    -------------
                    kernel void singlescatter(double input<>, out double output1[])
                    {
                        int index = 33;
                        double xx = 2.0;
                        output1[index] = input * 2.0;
                    }

                    -----------

                    Replacing the 2.0 with xx in the last line again fixes the problem, fortunately.

                     

                    • Else bug
                      maligor

                      I've been testing Brook+ as a raw converter for photos. This is converted from the old 1.2.1 code.

                      The 'if / else if' chain seems to corrupt the 'mode' variable in some way, and the kernel then reads way out of bounds. The variable seems to pick up the y index somehow. The CPU target doesn't do this. Removing the 'else' keywords makes it behave.

                      Broken Code:

                      kernel void convert_surface_simple( float img[][], out float3 image_out<> )
                      {
                          int2 ind = instance().xy;
                          int mode;   
                          mode = (int)(fmod((float)ind.x * 1.0f, 2.0f) + fmod((float)ind.y * 1.0f, 2.0f) * 2.0f);

                          if( mode == 0 ) { // Blue
                              image_out.x = img[ind.y+1][ind.x+1];
                              image_out.y = img[ind.y+1][ind.x];
                              image_out.z = img[ind.y][ind.x];
                          }
                          else if( mode == 1 ) { // Green1
                              image_out.x = img[ind.y+1][ind.x];
                              image_out.y = img[ind.y][ind.x];
                              image_out.z = img[ind.y][ind.x+1];
                          }
                          else if( mode == 2 ) { // Green2
                              image_out.x = img[ind.y][ind.x+1];
                              image_out.y = img[ind.y][ind.x];
                              image_out.z = img[ind.y+1][ind.x];
                          }
                          else if( mode == 3 ) { // Red
                              image_out.x = img[ind.y][ind.x];
                              image_out.y = img[ind.y+1][ind.x];
                              image_out.z = img[ind.y+1][ind.x+1];
                          }
                      }

                      This is just a version I use for testing the input/output; I have a more complex version that doesn't work quite right yet, but the speed is impressive.
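When chasing a miscompile like this, a CPU reference for the same pattern selection makes it easier to see which pixels the GPU target gets wrong. The following C++ sketch is an illustration, not SDK code: the names `Rgb`, `bayerMode`, and `convertPixel` are hypothetical, the image is assumed to be stored row-major, and the caller is assumed to keep `(x + 1, y + 1)` in bounds. It computes `mode` with integer arithmetic, which gives the same value as the `fmod` expression for non-negative indices.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CPU reference for the kernel above; not part of the SDK.
struct Rgb { float x, y, z; };

// Same value as the kernel's fmod expression for non-negative indices:
// 0 = Blue, 1 = Green1, 2 = Green2, 3 = Red.
inline int bayerMode(int x, int y) {
    return (x % 2) + (y % 2) * 2;
}

// img is row-major with `width` columns; the caller must guarantee that
// (x + 1, y + 1) is still inside the image.
inline Rgb convertPixel(const std::vector<float>& img, int width, int x, int y) {
    auto at = [&](int yy, int xx) {
        return img[static_cast<std::size_t>(yy) * width + xx];
    };
    switch (bayerMode(x, y)) {
        case 0:  return { at(y + 1, x + 1), at(y + 1, x), at(y, x) };         // Blue
        case 1:  return { at(y + 1, x),     at(y, x),     at(y, x + 1) };     // Green1
        case 2:  return { at(y, x + 1),     at(y, x),     at(y + 1, x) };     // Green2
        default: return { at(y, x),         at(y + 1, x), at(y + 1, x + 1) }; // Red
    }
}
```

Comparing `convertPixel` against the GPU output for each of the four modes should show whether the wrong reads are tied to a particular branch.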

                      • Feature Request
                        rveldema

                        bug report

                          gather array of structures

                        ----------------------


                        typedef struct foo
                        {
                          double field;
                        } foo;

                        kernel void struct_use(double input<>, out double output1[], foo z[])
                        {
                            int index = 33;
                            output1[index] = input * z[0].field;
                        }

                        ----------------------

                        This causes brcc (1.3, Linux) to abort, saying something about failing to determine which function overload was intended (??).

                        Removing " * z[0].field" from the last line makes it compile again.

                         

                         

                         

                        • SDK 1.3 Feedback
                          MicahVillmow
                          Thank you very much for the bug reports. I've filed them against the Brook+ team so that they can be fixed.
                          • feature request
                            rveldema

                            feature request

                             explicit memory management for GPU memory:

                            I'd very much like some version of brook_malloc(int) and brook_free(void*), coupled of course with memory indirection in kernel code.

                            This would allow us to implement complex data structures in GPU memory.

                            Example:

                            ----------------------

                            typedef struct tree {
                              double val;
                              struct tree *left, *right;
                            } tree;

                            kernel void access_tree(tree *t, reduce double sum<>)
                            {
                              sum = t->val + t->left->val; // etc
                            }

                            void hostcode() {
                              tree *a = brook_malloc(sizeof(tree)); // etc
                              stream b; // etc
                              access_tree(a, b);
                              brook_free(a); // etc
                            }

                            ----------------------

                             

                            This would make the work needed to 'massage' our data structures into arrays superfluous.
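For context, the 'massaging' mentioned here usually means replacing pointers with array indices so that each field can live in an ordinary gather stream. The following host-side C++ sketch shows one way to do that; `FlatTree`, `addNode`, and `valPlusLeft` are hypothetical names, and using -1 for a missing child is an assumption of this example rather than anything defined by Brook+.

```cpp
#include <cassert>
#include <vector>

// Hypothetical index-based layout: node i is described by val[i],
// left[i] and right[i]; -1 means "no child".
struct FlatTree {
    std::vector<double> val;
    std::vector<int> left;
    std::vector<int> right;
};

// Appends a node and returns its index.
inline int addNode(FlatTree& t, double v, int l = -1, int r = -1) {
    t.val.push_back(v);
    t.left.push_back(l);
    t.right.push_back(r);
    return static_cast<int>(t.val.size()) - 1;
}

// Equivalent of `t->val + t->left->val`, using indices instead of pointers.
// A missing left child contributes 0.
inline double valPlusLeft(const FlatTree& t, int node) {
    const int l = t.left[node];
    return t.val[node] + (l >= 0 ? t.val[l] : 0.0);
}
```

The three vectors can then be copied into three separate streams, and a kernel can follow `left[i]` and `right[i]` as gather indices.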

                             

                            • SDK 1.3 Feedback
                              Ceq
                              I think I've found a bug in SDK 1.3: every time you try to use a structure
                              in a subkernel, brcc aborts due to an unhandled exception.
                              (BRCC source file "express.cpp", function "semanticCheck", line 1845)

                              To check this, open the following sample program from the Brook+ directory:
                              "BROOK\samples\legacy\tests\struct"


                              The Brook+ code should be like the following:
                              ----------------------------------------------------------------------------------------------
                              typedef struct PairRec
                              {
                              float first;
                              float second;
                              } Pair;


                              kernel void struct_gather(float index< >, Pair pairs[ ], out float result< > )
                              {
                              Pair p = pairs[ index ];
                              result = p.first + p.second;
                              }


                              Now create another kernel and call it from the first one:
                              ----------------------------------------------------------------------------------------------
                              kernel void auxKer(Pair p< >, out float result< >) { result = p.first + p.second; }

                              kernel void struct_gather(float index< >, Pair pairs[ ], out float result< > )
                              {
                              Pair p = pairs[ index ];
                              auxKer(p, result);
                              }


                              According to the Brook+ documentation, calling a subkernel from another kernel should work.
                              Note that this doesn't happen if you use base types.
                                • SDK 1.3 Feedback
                                  pbhani

                                  Ceq,

                                  Thanks for the bug report. Looks like this code-path is busted! We'll track this as well. Please use the workaround of using base types, as you suggested, for now.

                                    • SDK 1.3 Feedback
                                      beldoy

                                      Hi,

                                      I have the AGP version of the HD 3850. After installing the 8.12 drivers and the 1.3 SDK, the Avivo transcoder does not seem to offer H.264/AVC encoding, or at least I can't see it anywhere. With any of the available encoding options, the time taken is almost twice as long as with normal CPU encoding, so it seems the stream part is not enabled, at least with my setup.

                                      My setup:

                                      AMD 3000+ CPU single core
                                      1.5GB Memory
                                      HD 3850 AGP Card

                                      Any help appreciated.

                                        • SDK 1.3 Feedback
                                          Remotion

                                          Hi,

                                          The new Brook+ runtime looks much better but unfortunately still has many problems.

                                          It seems that only this simple reduction kernel returns the proper value.

                                          reduce void
                                          ReduceK(float input<>, reduce float output<> )
                                          {
                                              output += input;
                                          }

                                          This one, and other variations on the input, already return a totally wrong answer (INF) on the HD 4870.

                                          reduce void
                                          ReduceK(float input<>, reduce float output<> )
                                          {
                                              output += (input * input);
                                          }

                                          It is strange that one now needs to create the environment variable BRT_PERMIT_READ_WRITE_ALIASING to use one stream as both input and output.

                                          This worked well with the older SDK.

                                          Is there a way to use kernels like this one?

                                          kernel void Add(float input<>, out float output<> )
                                          {
                                              output += input; // this is not reduction!
                                          }

                                           

                                          • SDK 1.3 Feedback
                                            Remotion

                                            Why is it now necessary to define BRT_PERMIT_READ_WRITE_ALIASING to use the same stream as input and output?
                                            Using the same stream as input and output greatly simplifies and accelerates my programs and works well with the HD 4870.

                                            The biggest problem is memory leaks and slowdown on kernel calls when using VS2008.

                                              • SDK 1.3 Feedback
                                                pbhani

                                                Remotion,

                                                GPUs have separate read and write caches. Using the same Stream as input as well as output might work for simple cases, but is NOT guaranteed to work in the general purpose case (e.g. use of gather and scatter streams). The new runtime simply checks for this condition. If you feel that your application is not sensitive to this issue, please use the env variable... as long as you are aware of the underlying issues.

                                                • SDK 1.3 Feedback
                                                  gaurav.garg

                                                  Hi Remotion,

                                                   

                                                  Thanks for your feedback.

                                                  1. Issues with reduction - I think the issue might be that the result exceeds the maximum floating-point value that the GPU can represent. You can try putting a bound on your input array values (maybe something < 5) and testing again.

                                                  2. Why BRT_PERMIT_READ_WRITE_ALIASING - Under SIMD parallelism you might get incorrect results if you use the same stream as input and output. Consider this example -

                                                  kernel void test(float a[], out float b<>)
                                                  {
                                                      b = a[0];
                                                  }

                                                  If you call this kernel with the same stream as input and output, you might get undefined results, as Brook+ doesn't guarantee the order in which stream elements are processed.

                                                  3. Memory leaks and slow-down with VS2008 - Do you see these memory leaks only with VS2008? I am using the pre-built Brook+ library and tried running some samples with up to 1000 iterations, and I don't see increasing memory usage in my application or any slow-down.
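The ordering hazard described in point 2 can be reproduced on the CPU. The C++ sketch below is an illustration rather than Brook+ code: it uses a shifted gather, `b[i] = a[(i + 1) % n]`, instead of the `a[0]` kernel, so that the effect is visible even when elements are processed one at a time, and the function names are hypothetical. With a separate output buffer the processing order cannot matter; with input and output aliased, ascending and descending order give different answers, and neither matches the out-of-place result.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Reference: separate input and output, so element order cannot matter.
inline std::vector<float> shiftOutOfPlace(const std::vector<float>& a) {
    std::vector<float> b(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) b[i] = a[(i + 1) % a.size()];
    return b;
}

// Same computation with input and output aliased, visiting elements in
// ascending (forward = true) or descending (forward = false) order.
inline std::vector<float> shiftInPlace(std::vector<float> a, bool forward) {
    const std::size_t n = a.size();
    for (std::size_t k = 0; k < n; ++k) {
        const std::size_t i = forward ? k : n - 1 - k;
        a[i] = a[(i + 1) % n];
    }
    return a;
}
```

For the input `{1, 2, 3, 4}` the out-of-place version yields `{2, 3, 4, 1}`, the ascending in-place version `{2, 3, 4, 2}`, and the descending one `{1, 1, 1, 1}`. A GPU makes no promise about which order, if any, it follows.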

                                                    • SDK 1.3 Feedback
                                                      Remotion

                                                      Hi, and thanks for your reply.

                                                      I have tested this reduction kernel with streams filled with 1.0 and still got wrong results.

                                                      Yes, I know about this problem; it is the same as using multiple CPU cores to do the work, and all my kernels are resistant to it.

                                                      I am using WinXP 64-bit and calling kernels from another DLL, which is compiled with VS2008, and I have this problem.

                                                      The Brook+ runtime is compiled with VS2008 too.
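Whether overflow is a plausible explanation for an INF result can be checked on the host. The C++ sketch below is an illustration with a hypothetical helper name: it accumulates the sum of squares in double precision and compares it with `FLT_MAX`. For a stream filled with 1.0 the sum equals the element count, which is far below `FLT_MAX` for any realistic stream size, so overflow alone would not explain the result in that case.

```cpp
#include <cassert>
#include <cfloat>
#include <vector>

// Sums input[i]^2 in double precision and reports whether the result
// would still fit in a single-precision float.
inline bool sumOfSquaresFitsInFloat(const std::vector<float>& input) {
    double sum = 0.0;
    for (float x : input) sum += static_cast<double>(x) * x;
    return sum <= static_cast<double>(FLT_MAX);
}
```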

                                                        • SDK 1.3 Feedback
                                                          gaurav.garg

                                                          Could you also post the runtime part of your code for the reduction issue?

                                                          We will try to reproduce the memory leak issue on our end with VS2008. It would be great if you can post a test case.

                                                          Thanks.

                                                            • SDK 1.3 Feedback
                                                              Remotion

                                                              I have just modified the reduce_kernel sample.

                                                              reduce void
                                                              reduceGPU(float input<>, reduce float output<>)
                                                              {
                                                                  output += input * 0.1f;
                                                              }

                                                              Even this code returns the wrong result.

                                                              I will try to create a simple project that shows the memory leak issue and send it via e-mail later.

                                                                • SDK 1.3 Feedback
                                                                  gaurav.garg

                                                                  What dimensions are you using for the input and output streams for the reduction?

                                                                  Did you try error checking on the output stream? errorLog on the output stream can give some useful information.

                                                                   

                                                                  • SDK 1.3 Feedback
                                                                    gaurav.garg

                                                                    Hi Remotion,

                                                                    Reduction doesn't work if you have any expression on the right side.

                                                                    A reduction is defined as a single, two-input operator. I think this constraint on reductions is not new; it has been the same since BrookGPU.
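One way to see why an expression on the right-hand side breaks a reduction is to assume that the runtime reduces in several passes and reuses the kernel body to combine partial results. That is an assumption about the implementation, not something stated in this thread. Under that assumption the scale factor is applied again at every level of the tree. The C++ sketch below emulates such a pairwise reduction on the CPU; `treeReduce`, `scaledAdd`, and `plainAdd` are hypothetical names, and 0.5f replaces 0.1f so that the arithmetic is exact.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Pairwise (tree) reduction: each pass combines neighbours with `op`
// until one value is left. Requires a non-empty input.
inline float treeReduce(std::vector<float> v,
                        const std::function<float(float, float)>& op) {
    while (v.size() > 1) {
        std::vector<float> next;
        for (std::size_t i = 0; i + 1 < v.size(); i += 2)
            next.push_back(op(v[i], v[i + 1]));
        if (v.size() % 2 == 1) next.push_back(v.back());  // odd element carries over
        v.swap(next);
    }
    return v.front();
}

// Kernel body from the thread, with 0.5f instead of 0.1f.
inline float scaledAdd(float acc, float x) { return acc + x * 0.5f; }

// A plain two-input operator, as reductions expect.
inline float plainAdd(float acc, float x) { return acc + x; }
```

For four inputs of 1.0, `treeReduce` with `scaledAdd` gives 2.25 in this emulation, while scaling each element first and then reducing with `plainAdd` gives the intended 2.0. This only illustrates the class of problem; it does not by itself account for an INF result. The practical workaround is to do the scaling in an ordinary kernel and keep the reduction kernel to `output += input`.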

                                                                     

                                                                    • SDK 1.3 Feedback
                                                                      eduardoschardong
                                                                      Originally posted by: Remotion

                                                                      reduce void
                                                                      reduceGPU(float input<>, reduce float output<>)
                                                                      {
                                                                          output += input * 0.1f;
                                                                      }

                                                                      Reduction kernels are expected to be commutative; neither this one nor the previous one is, so they return incorrect results.


                                                                      BTW, hey AMD, could you simplify a little?
                                                                      Drop the <> in kernels, make non-scatter output streams readable, and allow a return value as the default output. For example, in kernel void sum(float a, out float b){ b+=a;/**/} why can't b be read and then written at the same location? Doesn't the reduce parameter in reduction kernels work this way? As for the <>, note that, as in the code above, the size of the a stream doesn't matter to the kernel, nor does it matter whether a is a constant rather than a stream, so why force the programmer to write those useless <>? And for the default output, say we have a kernel float sum(float a, float b) {return a + b;/**/}: is it so difficult for the compiler to translate it to kernel void sum(float a, float b, out float c) {c = a + b;/**/}? The first form is more readable, and for all those simple streams I wouldn't have to rewrite, copy-paste, or wrap simple functions that already exist for CPUs...
                                                            • SDK 1.3 Feedback
                                                              bronson

                                                              Is anybody working on AVT for Linux?  I'd like to offload h.264 encoding to the GPU...  possible?

                                                        • SDK 1.3 Feedback
                                                          rick.weber

                                                          I read in the "what's new" notes that this version of Brook+ supports arrays of streams on the host and dynamic stream allocation. I couldn't find how to do either in the documentation, and float myStream<10>[10]; doesn't pass brcc's syntax checking. First, how do I do this? Second, my request is that the documentation be updated to explain it.

                                                            • SDK 1.3 Feedback
                                                              rick.weber

                                                               


                                                               

                                                              Oh wait, the runtime C++ API was updated to support this, which I'm guessing means I have to write the kernel in Brook+, cross-compile with brcc, and then modify the .cpp file, changing the ::StreamOperator: (or whatever corresponds to the desired one in the .br file) to make an array of them.

                                                                • SDK 1.3 Feedback
                                                                  gaurav.garg

                                                                  Hi Rick,

                                                                   

                                                                  Brook+ 1.3 exposes Stream as a class, and you can call operators on it from your C++ file. You can take a look at the samples under samples\CPP\apps, which use the C++ runtime API.

                                                                    • SDK 1.3 Feedback
                                                                      rick.weber

                                                                       


                                                                      That precisely answers my question. Thank you.

                                                                • SDK 1.3 Feedback
                                                                  Ceq
                                                                    • SDK 1.3 Feedback
                                                                      gaurav.garg

                                                                      Brook+ has to convert user-defined structs into hardware-supported data formats. If you specify these typedefs in a .br file, brcc parses this information and generates methods that help the runtime get information about the formats used in the struct.

                                                                      So, you have to define these typedefs in a .br file -

                                                                      struct.br-

                                                                      typedef struct Ostr {
                                                                      float4 a;
                                                                      float4 b;
                                                                      } Odef;

                                                                       

                                                                      cpp file -

                                                                      #include "struct.h" // generated header file from .br file

                                                                      #include "brook/Stream.h"
                                                                      using namespace brook;

                                                                      int main(int argc, char *argv[ ] ) {
                                                                      unsigned int dim[1] = { 1 };
                                                                      Stream<Odef> s1( 1, dim);
                                                                      return 0;
                                                                      }

                                                                       

                                                                      I hope it helps.

                                                                    • SDK 1.3 Feedback
                                                                      dar

                                                                      bug report

                                                                      Scientific Linux 5.1 (RHEL clone) x86_64

                                                                      amdstream-cal-1.3.0_beta.x86_64.run does not contain/install lib/ or lib64/  directories and thus does not install the required shared libraries.

                                                                      • SDK 1.3 Feedback
                                                                        dar

                                                                        bug report

                                                                        Scientific Linux 5.1 (RHEL 5.1 clone) x86_64

                                                                        problems with int4, questioncolon - there are three related issues.

                                                                        consider the simple kernel below:

                                                                        kernel void test_int4_gpu_kern( int n, int4 s_src<>, out int4 s_dst<> )
                                                                        {

                                                                           const int4 zero4 = int4(0,0,0,0);
                                                                           int4 imask = int4(n+2,n-2,n,n);
                                                                           int4 tmp = s_src;
                                                                         
                                                                           /* fails */
                                                                        // tmp = (imask == tmp)? zero4 : tmp;
                                                                         
                                                                           /* works with brtvector.hpp patch */
                                                                           tmp = ((int4)(imask == tmp))? zero4 : tmp;
                                                                         
                                                                           s_dst = tmp; 
                                                                        }

                                                                        First, Brook+ produces the following error for this kernel:

                                                                        g++ -O3 -I/usr/local/amdbrook/sdk/include -c test_int4.cpp
                                                                        /usr/local/amdbrook/sdk/include/brook/CPU/brtvector.hpp: In function ‘T singlequestioncolon(const B&, const T&, const T&) [with T = int, B = int]’:
                                                                        /usr/local/amdbrook/sdk/include/brook/CPU/brtvector.hpp:479:   instantiated from ‘vec<typename BRT_TYPE::TYPE, LUB<BRT_TYPE::size,tsize>::size> vec<VALUE, tsize>::questioncolon(const BRT_TYPE&, const BRT_TYPE&) const [with BRT_TYPE = __BrtInt4, VALUE = int, unsigned int tsize = 4u]’
                                                                        test_int4.cpp:18:   instantiated from here
                                                                        /usr/local/amdbrook/sdk/include/brook/CPU/brtvector.hpp:59: error: request for member ‘questioncolon’ in ‘a’, which is of non-class type ‘const int’
                                                                        make: *** [test_int4.o] Error 1
                                                                        rm test_int4.cpp

                                                                        However, this can be fixed with the following patch,

                                                                        --- brtvector.hpp~      2008-12-16 12:00:52.000000000 -0500
                                                                        +++ brtvector.hpp       2008-12-16 12:04:54.000000000 -0500
                                                                        @@ -58,6 +58,17 @@
                                                                                                                                   const T&c){
                                                                             return a.questioncolon(b,c);
                                                                         };
                                                                        +
                                                                        +
                                                                        +
                                                                        +/* XXX added by DAR */
                                                                        +template <> inline int singlequestioncolon (const int &a,
                                                                        +                                              const int &b,
                                                                        +                                              const int &c) {
                                                                        +    return a?b:c;
                                                                        +}
                                                                        +
                                                                        +
                                                                         template <> inline float singlequestioncolon (const char &a,
                                                                                                                       const float &b,
                                                                                                                       const float &c) {

                                                                        The second issue is that it should not be necessary to perform a cast in,

                                                                        tmp = ((int4)(imask == tmp))? zero4 : tmp;

                                                                        When the (int4) cast is removed, the following error is generated,

                                                                        /usr/local/amdbrook/sdk/include/brook/CPU/brtvector.hpp:59: error: request for member ‘questioncolon’ in ‘a’, which is of non-class type ‘const char’

                                                                        The equivalent expression for float4 does not require the equivalent cast.

                                                                        The third issue is minor.  The compiler produces the following warning,

                                                                        test_int4.br(37) : WARN--1: conditional expression must have scalar type. On short vectors, assumes x components as condition
                                                                                         Statement: (int4 ) (imask == tmp) in tmp = ((int4 ) (imask == tmp)) ? (zero4) : (tmp)

                                                                        However, this warning does not appear to be correct.  For float4 the conditional expression is correctly applied component-wise, and with the patch above the same is true for int4.  In the simple kernel above, values are masked out to zero component-wise.
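To make the difference between the two readings of the conditional concrete, here is a minimal CPU-side sketch in plain C++ (not Brook+). `Int4`, `equal_mask`, `select_componentwise`, `select_on_x` and `mask_kernel` are hypothetical names made up for illustration and are not part of the SDK; the sketch only mirrors the arithmetic of the kernel above under those assumptions.

```cpp
#include <array>
#include <cstddef>

// Hypothetical stand-in for Brook+'s int4 (assumption, not the SDK type).
using Int4 = std::array<int, 4>;

// Component-wise equality: each lane is 1 where a and b match, else 0.
Int4 equal_mask(const Int4& a, const Int4& b) {
    Int4 m{};
    for (std::size_t i = 0; i < 4; ++i) m[i] = (a[i] == b[i]) ? 1 : 0;
    return m;
}

// Component-wise select: every lane consults its own mask lane.
Int4 select_componentwise(const Int4& mask, const Int4& a, const Int4& b) {
    Int4 r{};
    for (std::size_t i = 0; i < 4; ++i) r[i] = mask[i] ? a[i] : b[i];
    return r;
}

// Scalar select as the warning text describes it: only the x lane decides.
Int4 select_on_x(const Int4& mask, const Int4& a, const Int4& b) {
    return mask[0] ? a : b;
}

// Same arithmetic as the kernel body, using the component-wise reading.
Int4 mask_kernel(int n, const Int4& src) {
    const Int4 zero4{0, 0, 0, 0};
    const Int4 imask{n + 2, n - 2, n, n};
    return select_componentwise(equal_mask(imask, src), zero4, src);
}
```

With n = 5 and an input of (7, 1, 5, 9), the component-wise reading zeroes only the lanes that match the mask, while the x-only reading would zero the whole vector because the x lanes match.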

                                                                         

                                                                        • SDK 1.3 Feedback
                                                                          Ceq
                                                                          Hi, previously gaurav.garg told me how to use structs in Brook+, however I've found a problem:


                                                                          1. Open example in BROOK\samples\CPP\tutorials\SimpleKernel
                                                                          (I'm using Visual Studio 2005)


                                                                          2. Edit file "copy.br" and add the following lines:

                                                                          typedef struct PairRec {
                                                                          float first;
                                                                          float second;
                                                                          } Pair;


                                                                          3. Rebuild

                                                                          1>simple_kernel.cpp
                                                                          1>c:\hd1\brook\samples\cpp\tutorials\simplekernel\brookgenfiles/copy.h(37) : error C2146: syntax error : missing ';' before identifier 'first'
                                                                          1>c:\hd1\brook\samples\cpp\tutorials\simplekernel\brookgenfiles/copy.h(37) : error C4430: missing type specifier - int assumed. Note: C++ does not support default-int
                                                                          1>c:\hd1\brook\samples\cpp\tutorials\simplekernel\brookgenfiles/copy.h(37) : error C4430: missing type specifier - int assumed. Note: C++ does not support default-int
                                                                          ...


                                                                          What's wrong? Is this a bug? Is there any workaround?
                                                                            • SDK 1.3 Feedback
                                                                              gaurav.garg

                                                                              Hi Ceq,

                                                                               

                                                                              It looks like a bug in code-generation where a header file required for compilation is missing.

                                                                              As a workaround you can include "brook\CPU\brtvector.hpp" in simple_kernel.cpp before including "brookgenfiles/copy.h".

                                                                               

                                                                              Hope it helps.

                                                                                • SDK 1.3 Feedback
                                                                                  gaurav.garg

                                                                                  Hi Ceq,

                                                                                  Thanks for pointing out the memory leak issues. It looks like there are memory leaks for reduction kernels. I tried regular kernels and everything looks OK, but reduction kernels show the behavior you mentioned.

                                                                              • SDK 1.3 Feedback
                                                                                Ceq
                                                                                • SDK 1.3 Feedback
                                                                                  nberger
                                                                                  BUG REPORT: Kernel calls getting slower
                                                                                  Hi!
                                                                                  With some effort I have managed to move my partial wave analysis framework to the 1.3 SDK and
                                                                                  the good news is that it now works (as opposed to the attempts with the 1.2 and 1.1 versions) and
                                                                                  produces correct results. I found however that if I call a kernel multiple times, it becomes slower.
                                                                                  In an attempt to make sure that the problem is not somewhere with my code, I went to the simpleKernel
                                                                                  example and just placed a loop around the kernel call - and also here the execution time increases for every
                                                                                  additional kernel call. Unfortunately this behavior just about kills my application - any tips for workarounds
                                                                                  or patches are warmly welcome.
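A minimal sketch of how such a slowdown could be measured from the host side, in plain C++. `time_calls` and `slowdown_ratio` are hypothetical helper names, not SDK functions. The callable passed in is assumed to block until the kernel has really finished (for example by reading the output stream back); otherwise only the launch overhead is timed.

```cpp
#include <chrono>
#include <cstddef>
#include <numeric>
#include <vector>

// Run `call` n times and record the wall-clock time of each run in milliseconds.
template <typename F>
std::vector<double> time_calls(F&& call, std::size_t n) {
    using clock = std::chrono::steady_clock;
    std::vector<double> ms;
    ms.reserve(n);
    for (std::size_t i = 0; i < n; ++i) {
        const auto t0 = clock::now();
        call();  // assumed to return only after the work is complete
        const auto t1 = clock::now();
        ms.push_back(std::chrono::duration<double, std::milli>(t1 - t0).count());
    }
    return ms;
}

// Mean time of the last half of the runs divided by the mean of the first half.
// A value well above 1 suggests the calls are getting slower over time.
double slowdown_ratio(const std::vector<double>& ms) {
    const std::size_t half = ms.size() / 2;
    if (half == 0) return 1.0;
    const double first = std::accumulate(ms.begin(), ms.begin() + half, 0.0) / half;
    const double last = std::accumulate(ms.end() - half, ms.end(), 0.0) / half;
    return first > 0.0 ? last / first : 1.0;
}
```

Posting the per-iteration times together with the ratio makes it easier to compare results across operating systems and driver versions.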

                                                                                  Thanks

                                                                                  Nik
                                                                                  • SDK 1.3 Feedback
                                                                                    nberger
                                                                                    No. The scary thing is that this behavior is also seen with the simplest of all kernels, namely the copy kernel from the
                                                                                    simpleKernel example, which just does output = input...

                                                                                    Cheers

                                                                                    Nik
                                                                                      • SDK 1.3 Feedback
                                                                                        Remotion

                                                                                        These are exactly my problems too.

                                                                                        Slowdown and memory leaks, even without reduction kernels.

                                                                                        Using domains changes the behavior a bit: the slowdown is not as bad now, but the leaks are still there.

                                                                                         

                                                                                         

                                                                                      • SDK 1.3 Feedback
                                                                                        bayoumi
                                                                                        • SDK 1.3 Feedback
                                                                                          bayoumi
                                                                                          if someone has 64b Linux 5.2 RHEL, does the inputspeed & outputspeed precompiled binaries under amdcal/bin/lnx64 give consistent results?
                                                                                          • SDK 1.3 Feedback
                                                                                            bayoumi
                                                                                            has anyone seen the slowdown with time in Windows XP (32 or 64) as well, or any OS other than Scientific Linux?
                                                                                            • SDK 1.3 Feedback
                                                                                              Ceq
                                                                                              Yes bayoumi, I'm working on an application that requires five kernels, two of which are reductions.
                                                                                              Each iteration takes longer than the one before: the first takes around 0.20 seconds, while the last one (iteration 4800) takes more than a second.
                                                                                              I think it is related to the memory leak issues, because by the end of execution the application requires more than 1 GB.
                                                                                              Reductions are strongly affected, but normal kernels can slow down too.
                                                                                              As a workaround, declare as many streams as you can as local variables inside functions, so that unneeded streams are destroyed when the function returns. This worked fine for me.
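The scoping idea behind this workaround can be sketched in plain C++. `ScratchStream` is a hypothetical stand-in for a stream object, not the Brook+ stream class, and `iterate_once` is an invented example function; the point is only that an object declared inside a function is destroyed when the function returns, so nothing accumulates across iterations.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a temporary stream (assumption, not the SDK class).
class ScratchStream {
public:
    explicit ScratchStream(std::size_t n) : data_(n) { ++live_; }
    ~ScratchStream() { --live_; }
    ScratchStream(const ScratchStream&) = delete;
    ScratchStream& operator=(const ScratchStream&) = delete;

    std::vector<float>& data() { return data_; }
    static int live() { return live_; }  // number of objects currently alive

private:
    std::vector<float> data_;
    static int live_;
};

int ScratchStream::live_ = 0;

// One iteration of work: the temporary exists only for the duration of the call.
float iterate_once(float input, std::size_t n) {
    ScratchStream tmp(n);
    for (float& v : tmp.data()) v = input;
    float sum = 0.0f;
    for (float v : tmp.data()) sum += v;
    return sum;
}  // tmp is destroyed here, releasing its storage
```

After any number of calls to `iterate_once`, `ScratchStream::live()` is back at zero, which is the behavior the workaround relies on.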

                                                                                              By the way, I would like to congratulate AMD people working in the forum and behind the stream SDK because 1.3 is a huge improvement.
                                                                                              Previously on 1.21 I wasn't able to implement my algorithm, but now I got more than 100x the CPU performance.
                                                                                              I'm working on a shallow water system simulation based on a finite volume scheme. When it is finished I will open another thread to show some results.


                                                                                              QUESTION:
                                                                                              ------------------------------------------------------------------------------------------------------------
                                                                                              In a normal kernel, if you use two streams of different sizes, the runtime issues a warning that auto-stride / auto-replication will be deprecated in future versions. Is this true? Why? I think it is a nice feature if you work with regular patterns.
                                                                                              • SDK 1.3 Feedback
                                                                                                bayoumi
                                                                                                thanks Ceq for your reply. I would expect AMD to post a patch soon.
                                                                                                Can't wait to try the 1.3 version.
                                                                                                BTW, I see you're using XP x64. Which libraries did you use for Brook+ (brook.lib or brook_d.lib), and which option: /MDd, /MTd, /MD or /MT? I always end up with an access violation error at runtime on XP x64 (the SDK's sample binaries run OK).
                                                                                                • SDK 1.3 Feedback
                                                                                                  rick.weber

                                                                                                  Bug Report:

                                                                                                  Swizzling parts of arrays doesn't seem to work in kernels.

                                                                                                  E.g.

                                                                                                  kernel void foo(float a<>, float4 b<>, float4 c<>)

                                                                                                  {

                                                                                                  float4 r[2];
                                                                                                  float4 tmp1, tmp2;

                                                                                                  float val = a*a;
                                                                                                  r[0].x = val;
                                                                                                  r[0].y = val;
                                                                                                  r[0].z = val;
                                                                                                  r[0].w = val;
                                                                                                  r[1].x = val;
                                                                                                  r[1].y = val;
                                                                                                  r[1].z = val;
                                                                                                  r[1].w = val;

                                                                                                  b = r[0];
                                                                                                  c = r[1]; 

                                                                                                  }

                                                                                                  This kernel does not work. However, if I assign tmp1 and tmp2 to val in a similar fashion, the kernel does work.
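For comparison against the GPU output, here is a minimal CPU reference in plain C++ of what the kernel above is expected to compute. `Float4` and `foo_reference` are hypothetical names used only for this sketch; they are not SDK types or functions.

```cpp
// Hypothetical stand-in for Brook+'s float4 (assumption, not the SDK type).
struct Float4 {
    float x, y, z, w;
};

// CPU reference for the kernel: fill a local array of two vectors
// component by component, then copy each element to an output.
void foo_reference(float a, Float4& b, Float4& c) {
    Float4 r[2];
    const float val = a * a;
    r[0].x = val;
    r[0].y = val;
    r[0].z = val;
    r[0].w = val;
    r[1].x = val;
    r[1].y = val;
    r[1].z = val;
    r[1].w = val;
    b = r[0];
    c = r[1];
}
```

Every component of both outputs should equal a*a; any other value on the GPU points at the array swizzle code path.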

                                                                                                  • SDK 1.3 Feedback
                                                                                                    Ceq
                                                                                                    BUG REPORT:

                                                                                                    Please have a look at these two kernels. Although they do the same thing, the output is different; it looks
                                                                                                    like the code generated for the condition in the second case is wrong.
                                                                                                    By the way, should I compare the value with 0.1f instead of 0.0f to stay on the safe side? (Although the results are the same.)

                                                                                                    // Good result
                                                                                                    // [ 0 1 2 ] + [ 00 00 00 ] = [ 00 01 02 ]
                                                                                                    // [ 3 4 5 ] + [ 00 10 20 ] = [ 03 14 25 ]
                                                                                                    // [ 6 7 8 ] + [ 30 40 50 ] = [ 36 47 58 ]

                                                                                                    kernel void fun1(float Center<>, float Down[][], out float Sol<>) {
                                                                                                    float2 pos = indexof(Center).xy;
                                                                                                    Sol = Center;
                                                                                                    if(pos.y > 0.0f) { // <----------
                                                                                                    float2 dD = { 0.0f, -1.0f };
                                                                                                    Sol += Down[pos + dD];
                                                                                                    }
                                                                                                    }

                                                                                                    // Bad result
                                                                                                    // [ 00 02 04 ]
                                                                                                    // [ 03 14 25 ]
                                                                                                    // [ 36 47 58 ]

                                                                                                    kernel void fun2(float Center<>, float Down[][], out float Sol<>) {
                                                                                                    float2 pos = indexof(Center).xy;
                                                                                                    float D = 0.0f;
                                                                                                    if(pos.y > 0.0f) { // <----------
                                                                                                    float2 dD = { 0.0f, -1.0f };
                                                                                                    D = Down[pos + dD];
                                                                                                    }
                                                                                                    Sol = Center + D;
                                                                                                    }

                                                                                                    #define streamPrint(_str, _ptr, _x, _y) streamWrite(_str, _ptr); print(#_str, _ptr, _x, _y)

                                                                                                    void print(char *name, float *ptr, int x, int y) {
                                                                                                    int i, j, pos;
                                                                                                    printf("\n%s:\n", name);
                                                                                                    for(pos = 0, i = 0; i < y; i++) {
                                                                                                    for(j = 0; j < x; j++, pos++)
                                                                                                    printf("%6.2f ", ptr[pos]);
                                                                                                    printf("\n");
                                                                                                    }
                                                                                                    }

                                                                                                    int main(int argc, char *argv[]) {
                                                                                                    const int NUMX = 3;
                                                                                                    const int NUMY = 3;
                                                                                                    const int SIZE = NUMX * NUMY;
                                                                                                    float A[SIZE], B[SIZE], S[SIZE];
                                                                                                    int i, j, pos;
                                                                                                    for(i = 0; i < SIZE; i++) {
                                                                                                    A[i] = 1.0f * i;
                                                                                                    B[i] = 10.0f * i;
                                                                                                    }
                                                                                                    {
                                                                                                    float sA < NUMY, NUMX > ;
                                                                                                    float sB < NUMY, NUMX > ;
                                                                                                    float sS < NUMY, NUMX > ;
                                                                                                    streamRead(sA, A);
                                                                                                    streamRead(sB, B);
                                                                                                    fun1(sA, sB, sS);
                                                                                                    streamPrint(sS, S, NUMX, NUMY);
                                                                                                    fun2(sA, sB, sS);
                                                                                                    streamPrint(sS, S, NUMX, NUMY);
                                                                                                    }
                                                                                                    }
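As a cross-check of the "good result" listed above, here is a minimal CPU reference in plain C++ for what both kernels are meant to compute: each element plus the element one row above it in the gather array, with nothing added for the first row. `add_row_above` is a hypothetical name for this sketch, and row-major layout with NUMX elements per row is assumed, matching the print routine.

```cpp
#include <cstddef>
#include <vector>

// CPU reference for fun1/fun2: sol(y, x) = center(y, x) + down(y - 1, x) for y > 0,
// and sol(0, x) = center(0, x). Arrays are row-major with numX elements per row.
std::vector<float> add_row_above(const std::vector<float>& center,
                                 const std::vector<float>& down,
                                 std::size_t numX, std::size_t numY) {
    std::vector<float> sol(numX * numY);
    for (std::size_t y = 0; y < numY; ++y) {
        for (std::size_t x = 0; x < numX; ++x) {
            const std::size_t idx = y * numX + x;
            float d = 0.0f;
            if (y > 0) {
                d = down[(y - 1) * numX + x];  // element directly above
            }
            sol[idx] = center[idx] + d;
        }
    }
    return sol;
}
```

With the 3x3 inputs from main (A[i] = i, B[i] = 10 * i) this yields the rows (0, 1, 2), (3, 14, 25), (36, 47, 58), matching the output reported for fun1.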


                                                                                                    EDIT:
                                                                                                    -----------------------------------------------------------------
                                                                                                    Another test, change:
                                                                                                    if(pos.y > 0.0f) {
                                                                                                    float2 dD = { 0.0f, -1.0f };
                                                                                                    ...

                                                                                                    By the following:
                                                                                                    if(pos.x > 0.0f) {
                                                                                                    float2 dD = { -1.0f, 0.0f };
                                                                                                    ...

                                                                                                    With this change both kernels fail, returning some negative values.


                                                                                                    EDIT2:
                                                                                                    -----------------------------------------------------------------
                                                                                                    Using the new gather array notation will fix the problem
                                                                                                    gstream[posx][posy];


                                                                                                    EDIT3:
                                                                                                    -----------------------------------------------------------------
                                                                                                    ...but looks like not always, it fails if you use this typedef
                                                                                                    struct as gather type (even changing fields order changes
                                                                                                    the results).

                                                                                                    typedef struct float5S {
                                                                                                    float dt;
                                                                                                    float4 fl;
                                                                                                    } float5;

                                                                                                    kernel void fun3(float5 C<>, float5 L[][], float5 U[][],
                                                                                                    out float dt<>, out float4 dvar<>) {
                                                                                                    int2 pos = instance().xy;
                                                                                                    dvar = float4(0.0f, 0.0f, 0.0f, 0.0f);
                                                                                                    dt = C.dt;
                                                                                                    if(pos.x > 0) {
                                                                                                    float5 datL = L[pos.y][pos.x - 1];
                                                                                                    dvar -= datL.fl;
                                                                                                    dt += datL.dt;
                                                                                                    }
                                                                                                    if(pos.y > 0) {
                                                                                                    float5 datU = U[pos.y - 1][pos.x];
                                                                                                    dvar -= datU.fl;
                                                                                                    dt += datU.dt;
                                                                                                    }
                                                                                                    }
                                                                                                      • SDK 1.3 Feedback
                                                                                                        lust

                                                                                                         

                                                                                                        This is the code that I use for some testing. Each frame takes more and more time.

                                                                                                         



                                                                                                        kernel void krnShitIntersectTriangle( float3 rayOrigs<>,
                                                                                                                                              float3 rayDirs<>,
                                                                                                                                              out float4 outHits<> )
                                                                                                        {
                                                                                                            float3 v0 = float3(0.f,0.f,0.f);
                                                                                                            float3 v1 = float3(100.f,0.f,0.f);
                                                                                                            float3 v2 = float3(0.f,100.f,100.f);
                                                                                                            float3 rayOrigin = rayOrigs;
                                                                                                            float3 rayDir = rayDirs;
                                                                                                            float4 currentHit = float4(9999999.0f, -1.f, -1.f, -1.f );

                                                                                                            float3 edge1 = v1 - v0;
                                                                                                            float3 edge2 = v2 - v0;

                                                                                                            float3 tvec = rayOrigin - v0;

                                                                                                            float3 qvec = cross( tvec, edge1 );
                                                                                                            float3 pvec = cross(rayDir, edge2);
                                                                                                            float det = dot(edge1, pvec);
                                                                                                            float inv_det = 1.0f / det;
                                                                                                            float value1, value2;

                                                                                                            float4 triangHit;

                                                                                                            triangHit.x = dot( edge2, qvec ) * inv_det;
                                                                                                            triangHit.z = dot( rayDir, qvec ) * inv_det;
                                                                                                            triangHit.y = dot( tvec, pvec ) * inv_det;
                                                                                                            triangHit.w = 0.0f;

                                                                                                            outHits = currentHit;

                                                                                                            value2 = (triangHit.x <= currentHit.x) && (triangHit.z >= 0.0f) && (triangHit.y >= 0.0f) && (triangHit.x >= 0.0f) && ((triangHit.y + triangHit.z) <= 1.0f);

                                                                                                            if( value2 )
                                                                                                            {
                                                                                                                outHits = triangHit;
                                                                                                            }
                                                                                                        }
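For checking the GPU output, a CPU reference of the same arithmetic can be handy. The following is a minimal C++ sketch that mirrors the kernel above step by step (the computation has the shape of the Möller-Trumbore ray/triangle test). The names `Vec3`, `Hit`, and `intersectReference` are made up for this illustration and are not part of Brook+. Like the kernel, it does not guard against `det` being zero.

```cpp
#include <cassert>
#include <cmath>

struct Vec3 { float x, y, z; };
// Same packing as the kernel's float4 result: x = t, y = u, z = v, w = 0.
struct Hit { float t, u, v, w; };

static Vec3 sub(Vec3 a, Vec3 b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
static float dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }
static Vec3 cross(Vec3 a, Vec3 b) {
    return {a.y * b.z - a.z * b.y, a.z * b.x - a.x * b.z, a.x * b.y - a.y * b.x};
}

// CPU reference for one ray against the hard-coded triangle from the kernel.
Hit intersectReference(Vec3 rayOrigin, Vec3 rayDir) {
    const Vec3 v0{0.0f, 0.0f, 0.0f};
    const Vec3 v1{100.0f, 0.0f, 0.0f};
    const Vec3 v2{0.0f, 100.0f, 100.0f};
    const Hit noHit{9999999.0f, -1.0f, -1.0f, -1.0f};

    const Vec3 edge1 = sub(v1, v0);
    const Vec3 edge2 = sub(v2, v0);
    const Vec3 tvec = sub(rayOrigin, v0);
    const Vec3 qvec = cross(tvec, edge1);
    const Vec3 pvec = cross(rayDir, edge2);
    const float invDet = 1.0f / dot(edge1, pvec);  // no zero check, as in the kernel

    Hit h;
    h.t = dot(edge2, qvec) * invDet;   // distance along the ray
    h.v = dot(rayDir, qvec) * invDet;  // second barycentric coordinate
    h.u = dot(tvec, pvec) * invDet;    // first barycentric coordinate
    h.w = 0.0f;

    const bool inside = (h.t <= noHit.t) && (h.v >= 0.0f) && (h.u >= 0.0f) &&
                        (h.t >= 0.0f) && ((h.u + h.v) <= 1.0f);
    return inside ? h : noHit;
}
```

Feeding a handful of rays through both this function and the kernel and comparing the four components is usually enough to rule out an arithmetic problem before looking at timing.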

                                                                                                        I keep the streams as members of a class:

                                                                                                        "Stream<float3> _origins;"

                                                                                                        "Stream<float3>_dirs;"

                                                                                                        "Stream<float4>_hits;"

                                                                                                        Unlike 1.2.1, 1.3 has no default constructor, so I construct them with a small size and later reassign them like this: _dirs = Stream( rank2, dimsWH );

                                                                                                        Each time I measure the kernel execution time, it gets larger:

                                                                                                         

                                                                                                        const int MAX_ITERS = 645;

                                                                                                        DWORD timeATStart;

                                                                                                        static float timeDurationF[MAX_ITERS];

                                                                                                         

                                                                                                        for( int i=0; i<MAX_ITERS; i++ )

                                                                                                        {

                                                                                                        PerfCounter0.Start();

                                                                                                        krnShitIntersectTriangle( traceContextGPU._origins, traceContextGPU._dirs, traceContextGPU._Hits );

                                                                                                         

                                                                                                        krnShadeNdotL_x3( traceContextGPU._origins, traceContextGPU._dirs, traceContextGPU._Hits,

                                                                                                        _treeFaces_x3,

                                                                                                        traceContextGPU._colors

                                                                                                        );

                                                                                                         

                                                                                                        PerfCounter0.Stop();

                                                                                                        timeDurationF[i] = PerfCounter0.GetElapsedTime();

                                                                                                        PerfCounter0.Reset();

                                                                                                        }

                                                                                                         

                                                                                                        timeDurationF[i] grows, for example:

                                                                                                        0.000175, 0.000213, 0.000222, 0.000230, 0.000238, 0.000249, 0.000257 ...

                                                                                                        This is for 128x128 rays. If I set the resolution to 1024x1024, the execution time stays below 0.0009xx for the first 100 iterations, then jumps to 0.0xxxx and stays around 0.0xxxxx, which is bad IMHO. I do not know whether this is due to the address virtualization or whether I need to think of some load-balancing strategy.

                                                                                                        This is new in 1.3; the previous version was fine.

                                                                                                        I am running this test on Windows XP x64, using the x64 build target with VS2005.

                                                                                                        Any idea why this happens and how to cure it would be very welcome.






                                                                                                      • SDK 1.3 Feedback
                                                                                                        Ceq
                                                                                                          • SDK 1.3 Feedback
                                                                                                            gaurav.garg

                                                                                                            Hi All,

                                                                                                            Thanks for pointing out all the slowdown issues. The cause is that Brook+ 1.3 uses some kind of caching for different execution events.

                                                                                                            Calling a kernel in a big for loop exposes the problem. As a workaround, call error() on the output stream after each kernel call. I have tested this with the bug report sent by nberger:

                                                                                                            for(int j=0; j < 10; j++){
                                                                                                                clock_t before = clock();
                                                                                                                for(int i=0; i < 1000; i++){
                                                                                                                    copy(inputStream, outputStream);
                                                                                                                }
                                                                                                                outputStream.error();
                                                                                                                clock_t after = clock();
                                                                                                                cout << "1000 Calls: " << (after-before) << " ticks = " << (float)(after-before)/(float)CLOCKS_PER_SEC << " s" << endl;
                                                                                                            }

                                                                                                             

                                                                                                            To fix the slowdown, change it to:

                                                                                                            for(int j=0; j < 10; j++){
                                                                                                                clock_t before = clock();
                                                                                                                for(int i=0; i < 1000; i++){
                                                                                                                    copy(inputStream, outputStream);
                                                                                                                    outputStream.error(); // Change here
                                                                                                                }
                                                                                                                clock_t after = clock();
                                                                                                                cout << "1000 Calls: " << (after-before) << " ticks = " << (float)(after-before)/(float)CLOCKS_PER_SEC << " s" << endl;
                                                                                                            }

                                                                                                            Let me know if you still face any issues. I have filed a bug for this, and it should be fixed in the next release.

                                                                                                          • SDK 1.3 Feedback
                                                                                                            Ceq
                                                                                                            In the last post of the previous page I wrote a bug report (I've now fixed two copy-paste errors in it).
                                                                                                            I would be very grateful if somebody at AMD could check whether I'm doing something wrong
                                                                                                            or confirm that it really is a bug, because I'm in a hurry due to a deadline for my project.
                                                                                                            Not using structs in gather forces me to use too many base-type streams (hence too many in-kernel
                                                                                                            memory fetches, which hurt performance), and the compiler warns that the kernel requires two passes.

                                                                                                            Thanks
                                                                                                              • SDK 1.3 Feedback
                                                                                                                gaurav.garg

                                                                                                                It looks like a bug on Brook+ side.

                                                                                                                As a side note, it's better not to use structs in Brook+. Brcc expands these structs into multiple base-type streams, so structs don't save you any memory fetches. On the other hand, structs add some performance overhead during data transfer, because the runtime has to distribute the data across the different base streams and copy it element by element.
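To picture the expansion described above, here is a small host-side C++ sketch. It is not brcc output and not the Brook+ runtime: a hypothetical two-field struct is split into one array per field, element by element, which is roughly the kind of per-element copy that makes struct transfers more expensive than a plain bulk copy. The names `Sample`, `SampleStreams`, and `splitFields` are invented for the illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical struct element, standing in for a user-defined stream type.
struct Sample {
    float fl;
    float dt;
};

// One base-type array per struct field (the "expanded" representation).
struct SampleStreams {
    std::vector<float> fl;
    std::vector<float> dt;
};

// Split an array of structs into per-field arrays, one element at a time.
SampleStreams splitFields(const std::vector<Sample>& in) {
    SampleStreams out;
    out.fl.reserve(in.size());
    out.dt.reserve(in.size());
    for (const Sample& s : in) {
        out.fl.push_back(s.fl);
        out.dt.push_back(s.dt);
    }
    return out;
}
```

If the data is already kept as separate per-field arrays on the host, this splitting step disappears and each array can be transferred as one contiguous block.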

                                                                                                              • SDK 1.3 Feedback
                                                                                                                Ceq
                                                                                                                that was a really fast answer, thanks a lot for the notes on structs gaurav!
                                                                                                                • SDK 1.3 Feedback
                                                                                                                  bayoumi
                                                                                                                  gaurav
                                                                                                                  I confirm the workaround works on 64-bit Linux SL5.2, with a FireStream 9170 and driver 8.561.
                                                                                                                  Questions:
                                                                                                                  1. Are there any similar slowdown issues with CAL? I had inconsistent results (including inf) when running the inputspeed & outputspeed, input_IL, output_IL precompiled binaries on the same platform.
                                                                                                                  2. Is there a performance penalty (CPU-GPU transfer overhead) when using outstream.error()?
                                                                                                                  3. Do we need inputstream.error() inside the loop before the kernel call?
                                                                                                                    • SDK 1.3 Feedback
                                                                                                                      gaurav.garg

                                                                                                                      1. Can you post your command line options and the results?

                                                                                                                      2. The error() call synchronizes all pending events associated with the stream. It doesn't have any data transfer overhead.

                                                                                                                      3. inputstream.error() would probably just synchronize streamRead, and there are no issues with the data transfer synchronization implementation, so you need not call error() on the input stream.

                                                                                                                      As a side note, error() is a very useful API for detecting any issues with your stream. If error() reports a problem, you can check errorLog() on the stream.
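Why synchronizing after every call can help is easier to see with a toy model. The C++ sketch below is not the Brook+ runtime and does not use its API. It only assumes, for illustration, that each kernel launch has to walk every event still pending on the output stream, and that a call playing the role of error() drains those events. `ToyStream` and `bookkeepingWork` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>

// Toy model only: this is NOT the Brook+ runtime or its API.
class ToyStream {
public:
    // Record one asynchronous kernel launch. ASSUMPTION: every launch has to
    // visit all events that are still pending on the stream.
    void launch() {
        ++pending_;
        eventsVisited_ += pending_;
    }

    // Plays the role the thread describes for error(): synchronize (drain)
    // all pending events, then report a status code (0 means "no error").
    int error() {
        pending_ = 0;
        return 0;
    }

    std::size_t eventsVisited() const { return eventsVisited_; }

private:
    std::size_t pending_ = 0;
    std::size_t eventsVisited_ = 0;
};

// Total bookkeeping work for `calls` launches, with or without a
// synchronization after each launch.
std::size_t bookkeepingWork(std::size_t calls, bool syncEachCall) {
    ToyStream out;
    for (std::size_t i = 0; i < calls; ++i) {
        out.launch();
        if (syncEachCall) {
            out.error();
        }
    }
    out.error();  // final synchronization in both variants
    return out.eventsVisited();
}
```

Under this assumption, 1000 calls cost 1000 event visits when synchronizing after every call and 500500 visits when synchronizing only once at the end, which would match the pattern of per-call time creeping upward in a long loop.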

                                                                                                                    • SDK 1.3 Feedback
                                                                                                                      bayoumi
                                                                                                                      Thank you for your reply.
                                                                                                                      Here is the case for the CAL SDK 1.3 precompiled binaries (BTW, the precompiled binaries for XP x64 give consistent results):
                                                                                                                      Location: /usr/local/amdcal/bin/lnx64
                                                                                                                      64-bit Linux SL5.2, with FireStream 9170 and driver 8.561
                                                                                                                      Terminal output:

                                                                                                                      [lnx64]$ exportspeed
                                                                                                                      Supported CAL Runtime Version: 1.3.145
                                                                                                                      Found CAL Runtime Version: 1.3.145
                                                                                                                      Program: exportspeed              Kernel  System
                                                                                                                      WxH       In-Out  Src  Dst  Iter  GB/sec  GB/sec
                                                                                                                      256x 256   1   1  4    4    2        inf  0.15
                                                                                                                      256x 256   1   2  4    4    2       2.93  0.21
                                                                                                                      256x 256   1   3  4    4    2       7.81  0.29
                                                                                                                      256x 256   1   4  4    4    2        inf  0.38
                                                                                                                      256x 256   1   5  4    4    2       2.93  0.37
                                                                                                                      256x 256   1   6  4    4    2       6.84  0.51
                                                                                                                      256x 256   1   7  4    4    2       7.81  0.54
                                                                                                                      256x 256   1   8  4    4    2       8.79  0.61

                                                                                                                      Press enter to exit...

                                                                                                                      --------------------------------------------------------------------------------
                                                                                                                      $ inputspeed
                                                                                                                      Supported CAL Runtime Version: 1.3.145
                                                                                                                      Found CAL Runtime Version: 1.3.145
                                                                                                                      Program: inputspeed               Kernel  System
                                                                                                                      WxH       In-Out  Src  Dst  Iter  GB/sec  GB/sec
                                                                                                                      256x 256   1   1  4    4    2       3.91  0.13
                                                                                                                      256x 256   2   1  4    4    2        inf  0.18
                                                                                                                      256x 256   3   1  4    4    2       7.81  0.20
                                                                                                                      256x 256   4   1  4    4    2        inf  0.23
                                                                                                                      256x 256   5   1  4    4    2      11.72  0.27
                                                                                                                      256x 256   6   1  4    4    2      13.67  0.29
                                                                                                                      256x 256   7   1  4    4    2        inf  0.33
                                                                                                                      256x 256   8   1  4    4    2      17.58  0.34
                                                                                                                      256x 256   9   1  4    4    2       9.77  0.36
                                                                                                                      256x 256  10   1  4    4    2      21.48  0.36
                                                                                                                      256x 256  11   1  4    4    2      23.44  0.39
                                                                                                                      256x 256  12   1  4    4    2      25.39  0.40
                                                                                                                      256x 256  13   1  4    4    2      13.67  0.41
                                                                                                                      256x 256  14   1  4    4    2      29.30  0.41
                                                                                                                      256x 256  15   1  4    4    2      15.62  0.42
                                                                                                                      256x 256  16   1  4    4    2      33.20  0.44

                                                                                                                      Press enter to exit...
                                                                                                                      --------------------------------------------------------------------------------
                                                                                                                      $ input_IL
                                                                                                                      Supported CAL Runtime Version: 1.3.145
                                                                                                                      Found CAL Runtime Version: 1.3.145
                                                                                                                      Program: input_IL                 Kernel  System
                                                                                                                      WxH       In-Out  Src  Dst  Iter  GB/sec  GB/sec
                                                                                                                      256x 256   1   1  4    4    2        inf  0.13

                                                                                                                      Press enter to exit...
                                                                                                                      --------------------------------------------------------------------------------
                                                                                                                      $ output_IL
                                                                                                                      Supported CAL Runtime Version: 1.3.145
                                                                                                                      Found CAL Runtime Version: 1.3.145
                                                                                                                      Program: output_IL                Kernel  System
                                                                                                                      WxH       In-Out  Src  Dst  Iter  GB/sec  GB/sec
                                                                                                                      256x 256   0   1  4    4    2        inf  0.07

                                                                                                                      Press enter to exit...

                                                                                                                      Thanks
                                                                                                                      • SDK 1.3 Feedback
                                                                                                                        titanius

                                                                                                                        Bug Report (perhaps) in brook+ Samples

                                                                                                                        I am using a 4830 on a Core 2 Duo machine with 4GB RAM on 64-bit Debian, with the latest driver and SDK. I know neither the 4830 nor Debian is officially supported, but...

                                                                                                                        I am able to run all the Brook+ examples for a few iterations, but when I try to run them for more iterations, say 100 or even 20, I end up getting the following error:

                                                                                                                        "Error occured
                                                                                                                        Kernel Execution : Uninitialized or Allocation failed Input streams.
                                                                                                                        Stream Write : Uninitialized stream"

                                                                                                                        It happens with all the matmult samples for larger sizes like 1024 or so. For sizes like 512 I can go up to 50 iterations.

                                                                                                                        Other optimization feature in CAL samples

                                                                                                                        I am trying to find the fastest matrix-matrix multiplication code (including sgemm and dgemm). The CAL simple_matmult sample is really fast (320 GFLOPS vs. 200 GFLOPS via sgemm), but the bottleneck in that sample seems to be the way data is copied between the CPU and the GPU: copyTo is called via copyToGPU, and copyFrom via copyFromGPU (all in amdcal/samples/common/Samples.cpp).

                                                                                                                        Right now the data seems to be copied iteratively in both directions so that the padding is preserved. Restructuring the data in memory before copying it back might speed this up quite a bit.
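To make the padding issue concrete, here is a minimal sketch in plain C of copying a tightly packed host matrix into a buffer whose rows are padded. It is not the code from Samples.cpp: the function name copy_to_pitched and the parameter pitch (row stride in elements) are hypothetical, and no CAL calls are used. It only illustrates that a single bulk copy is possible when there is no padding, while a padded layout needs one copy per row.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of the CAL samples.
 * Copies a tightly packed width x height matrix of floats from src into
 * dst, where consecutive rows of dst are `pitch` elements apart
 * (pitch >= width). The padding elements in dst are left untouched. */
void copy_to_pitched(float *dst, const float *src,
                     size_t width, size_t height, size_t pitch)
{
    if (pitch == width) {
        /* No padding: one bulk copy covers the whole matrix. */
        memcpy(dst, src, width * height * sizeof(float));
        return;
    }
    /* Padding present: copy row by row so the pad is skipped. */
    for (size_t y = 0; y < height; ++y) {
        memcpy(dst + y * pitch, src + y * width, width * sizeof(float));
    }
}
```

If the host data were laid out with the same row stride as the padded buffer in the first place, the first branch would apply and the per-row loop would disappear, which is roughly the restructuring suggested above.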

                                                                                                                        Documentation feature inclusion

                                                                                                                        Would it be possible to explain swizzling in the computing guide? Explanations can be found elsewhere on the web ( http://www.nada.kth.se/~tomaso/Stream2008/M3.pdf ), but the guide makes an abrupt jump here, since it never explains what a swizzle does.
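For readers hitting the same gap: a swizzle chooses, for each of the four components of a result, which component of the source vector to read, so components can be reordered or replicated. The following is only a plain C emulation of that idea, not IL and not anything from the guide; the names float4, swz, pick and swizzle are made up for illustration.

```c
/* Hypothetical types and names, used only to emulate a swizzle in C. */
typedef struct { float x, y, z, w; } float4;
typedef enum { SW_X = 0, SW_Y = 1, SW_Z = 2, SW_W = 3 } swz;

/* Return the component of v selected by c. */
static float pick(float4 v, swz c)
{
    switch (c) {
    case SW_X: return v.x;
    case SW_Y: return v.y;
    case SW_Z: return v.z;
    default:   return v.w;
    }
}

/* Build a new vector whose four components are the selected
 * components of v, in the order given. */
float4 swizzle(float4 v, swz c0, swz c1, swz c2, swz c3)
{
    float4 r = { pick(v, c0), pick(v, c1), pick(v, c2), pick(v, c3) };
    return r;
}
```

Under this emulation, swizzle(v, SW_X, SW_X, SW_Y, SW_Y) plays the role of a .xxyy suffix on a source operand: the result holds v.x twice followed by v.y twice.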

                                                                                                                         

                                                                                                                        thanks.

                                                                                                                         

                                                                                                                        • SDK 1.3 Feedback
                                                                                                                          nberger
                                                                                                                          Quick question on the .error() workaround: If I have multiple output streams, do I have to call .error() on all of them?
                                                                                                                          Thanks
                                                                                                                          Nik
                                                                                                                          • SDK 1.3 Feedback
                                                                                                                            nberger
                                                                                                                            Thanks for the quick answer. Now things are working fine...
                                                                                                                            • SDK 1.3 Feedback
                                                                                                                              Jetto

                                                                                                                              Hello,

                                                                                                                              I tried the SDK on Ubuntu 8.10 amd64 with a Q6600 and an HD 4850 512MB.

                                                                                                                              I use the standard libxcb-xlib, so I get the annoying "locking assertion failure" backtrace.

                                                                                                                              Some timing results from the Brook+ sample code do not look consistent:

                                                                                                                              $ ./mandelbrot -p -q
                                                                                                                              Width   Height  Iterations      CPU Total Time  GPU Total Time  Speedup        
                                                                                                                              64      64      1               0.000000        0.010000        0.000000

                                                                                                                              oops CPU is faster

                                                                                                                              $ ./mandelbrot -p -q -i 1000 2>/dev/null
                                                                                                                              Width   Height  Iterations      CPU Total Time  GPU Total Time  Speedup        
                                                                                                                              64      64      1000            0.105000        0.313000        0.335463

                                                                                                                              humm CPU is still faster

                                                                                                                              ./mandelbrot -p -q -i 10000 2>/dev/null
                                                                                                                              Width   Height  Iterations      CPU Total Time  GPU Total Time  Speedup        
                                                                                                                              64      64      10000           1.045000        23.088000       0.045262

                                                                                                                              OMG how can we explain that ?

                                                                                                                              With a larger matrix it's better:

                                                                                                                              $ ./mandelbrot -p -x 1024 -y 1024 -q.
                                                                                                                              Width   Height  Iterations      CPU Total Time  GPU Total Time  Speedup        
                                                                                                                              1024    1024    1               0.023000        0.010000        2.300000

                                                                                                                              This is ok but

                                                                                                                              $  ./mandelbrot -p -x 8192 -y 8192 -i 10 -q
                                                                                                                              Width   Height  Iterations      CPU Total Time  GPU Total Time  Speedup        
                                                                                                                              8192    8192    10              14.849000       0.001000        14849.000000

                                                                                                                              GPU became 100 times faster!

                                                                                                                              ./mandelbrot -e -p -x 8192 -y 8192 -i 1 -q
                                                                                                                              -e Verify correct output.
                                                                                                                              Computing Mandelbrot set on CPU ... Done
                                                                                                                              ./mandelbrot: Failed!

                                                                                                                              Humm maybe the matrix is too big

                                                                                                                              I used the binaries from the SDK and did not try to compile the samples myself.

                                                                                                                              BR.

                                                                                                                                • SDK 1.3 Feedback
                                                                                                                                  gaurav.garg

                                                                                                                                   

                                                                                                                                  $  ./mandelbrot -p -x 8192 -y 8192 -i 10 -q
                                                                                                                                  Width   Height  Iterations      CPU Total Time  GPU Total Time  Speedup
                                                                                                                                  8192    8192    10              14.849000       0.001000        14849.000000

                                                                                                                                  GPU became 100 times faster!

                                                                                                                                  ./mandelbrot -e -p -x 8192 -y 8192 -i 1 -q
                                                                                                                                  -e Verify correct output.
                                                                                                                                  Computing Mandelbrot set on CPU ... Done
                                                                                                                                  ./mandelbrot: Failed!

                                                                                                                                  Humm maybe the matrix is too big

                                                                                                                                   

                                                                                                                                  I think you are running the examples from the legacy folder. Try running the CPP samples instead. They check streams for errors, so if Brook+ is not able to allocate a stream on the GPU, they will report an error rather than print these false numbers.
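Separately from stream error checking, the Speedup column in these tables is simply CPU time divided by GPU time, so it is only meaningful when both timings are real measurements. Below is a small sketch in C of a guard that refuses to report a ratio when either timing is at or below the timer resolution. The function name checked_speedup and the idea of passing a resolution are assumptions for illustration; this is not how the samples compute the column.

```c
/* Hypothetical helper, not taken from the Brook+ samples.
 * Stores cpu_s / gpu_s in *speedup and returns 1 when both timings are
 * strictly above resolution_s (all values in seconds). Returns 0 and
 * leaves *speedup untouched when either timing is too small to trust. */
int checked_speedup(double cpu_s, double gpu_s,
                    double resolution_s, double *speedup)
{
    if (cpu_s <= resolution_s || gpu_s <= resolution_s) {
        return 0;
    }
    *speedup = cpu_s / gpu_s;
    return 1;
}
```

With an assumed resolution of 0.001 s, the 64x64, 1000-iteration run (0.105 s vs. 0.313 s) passes the check, while the 8192x8192 run with a GPU time of 0.001 s is rejected instead of being reported as a huge speedup.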

                                                                                                                                    • SDK 1.3 Feedback
                                                                                                                                      zpdixon

                                                                                                                                      I noticed that dcl_resource_id(...) statements that are commented out in IL kernels are actually *interpreted* by the CAL compiler. How to reproduce: write a kernel with a commented out dcl_resource statement, and run calCtxRunProgram() without defining i0:

                                                                                                                                      ; dcl_resource_id(0)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)

                                                                                                                                      Notice how calCtxRunProgram() will return an error "Symbol "i0" used as INPUT does not have a memory association.".

                                                                                                                                      Another bug: calGetErrorString() returns strings that are prematurely truncated. For example while debugging the above problem, for me printf("[%s]", calGetErrorString()) was displaying

                                                                                                                                      [Symbol "]

                                                                                                                                      After dumping the memory around that string, I noticed that it was actually

                                                                                                                                      [Symbol "\x00i0\x00" used as \x00INPUT\x00 does not have a memory association.]

                                                                                                                                      with 4 NUL bytes around "i0" and "INPUT". My platform is 64-bit linux if that matters...
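The truncation follows from how C strings work: printf("%s") stops at the first NUL byte, so everything after it is invisible. Here is a minimal sketch in C of dumping a buffer with embedded NULs made visible. The helper name dump_with_nuls is made up, and the sketch assumes the caller knows (or can bound) how many bytes of the buffer are valid, which calGetErrorString() itself does not tell you.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper for debugging only.
 * Writes len bytes of buf to out, printing each NUL byte as the four
 * characters \x00 instead of stopping there. Returns the number of NUL
 * bytes seen. The caller must supply a valid length. */
size_t dump_with_nuls(FILE *out, const char *buf, size_t len)
{
    size_t nuls = 0;
    for (size_t i = 0; i < len; ++i) {
        if (buf[i] == '\0') {
            fputs("\\x00", out);
            ++nuls;
        } else {
            fputc(buf[i], out);
        }
    }
    return nuls;
}
```

Called on a buffer laid out like the one described above, this prints the whole message with the NUL bytes shown inline, whereas printf("[%s]", ...) stops right after the opening quote.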

                                                                                                                                       

                                                                                                                                  • bug in haar_wavelet ?
                                                                                                                                    Jetto

                                                                                                                                     /usr/local/amdbrook/samples/bin/legacy/lnx_x86_64/haar_wavelet  -e -i 2 -t -p -q 2>/dev/null
                                                                                                                                    Width   Height  Iterations      GPU Total Time 
                                                                                                                                    64      64      2               0.031000       

                                                                                                                                    -e Verify correct output.
                                                                                                                                    Computing Haar Wavelet Transform on CPU ... Done
                                                                                                                                    /usr/local/amdbrook/samples/bin/legacy/lnx_x86_64/haar_wavelet: Failed!

                                                                                                                                    -p Compare performance with CPU.
                                                                                                                                    Width   Height  Iterations      CPU Total Time  GPU Total Time  Speedup        
                                                                                                                                    64      64      2               0.000000        0.031000        0.000000

                                                                                                                                    It works with -i 1.