39 Replies Latest reply on Nov 20, 2008 9:03 PM by bubu

    SDK 1.2 Feedback

    MicahVillmow
      Hello everyone. As you know, we released Version 1.2 of the Stream SDK. To further improve future releases of the SDK, we would appreciate your feedback in this thread so that the information does not get buried in other threads. Please label each item as a 'Feature Request', 'Bug Report', 'Documentation' or 'Other'. As always, you can send an email to 'streamcomputing@amd.com' for general requests or 'streamdeveloper@amd.com' for development-related requests.

      If you wish to file a Feature Request, please include a description of the feature request and the part of the SDK that this request applies to.

      If you wish to file a Bug Report, please include the hardware you are running on, operating system, SDK version, driver/Catalyst version, and if possible either a detailed description of how to reproduce the problem or a test case. A test case is preferable as it can help reduce the time it takes to determine the cause of the issue.

      If you wish to file a Documentation request, please specify the document, what you believe is in error or what you believe should be added and which SDK the document is from.

      Thank you for your feedback.
      AMD Stream Computing Team

        • SDK 1.2 Feedback
          rahulgarg
          Documentation Request for "Stream Computing User Guide" in CAL v1.2:

          Please add an overview of "Compute Shaders" as well as LDS functionality in the CAL portion of the user guide.
          a) What functionality is offered by compute shaders?
          b) How does one access LDS and shared registers from compute shaders?
          c) How are thread groups and thread blocks of compute shaders mapped to hardware or allocated on a SIMD? Do many thread groups execute concurrently on a SIMD? Looking at the disassembly of compute shaders, one can check the number of wavefronts allocated per SIMD by the compiler. The question is: how are these allocations computed?
          d) LDS: the difference between wavefrontRel and wavefrontAbs. An example would be appreciated.
          • SDK 1.2 Feedback
            rahulgarg
            For GPU kernels, maximizing the cache hit rate is critical to performance. However, the CAL documentation provides very little information about the cache hierarchy. I have two requests related to caches for CAL v1.2.

            Feature request:

            For RV670, we can use the CAL counter extensions to record the input cache hit rate. However, RV770 does not have the same cache hierarchy, and when I try to record the cache hit rates using the extension on RV770, I get 0.0 as the result. I therefore request that, if cache hit rate counters do exist in the RV770 hardware, they be exposed in CAL.

            Documentation Request:
            The cache hierarchy on RV770 is not properly documented in CAL. For example, figure 3.3 in the Stream Computing User Guide provides a generic overview of the stream processor hardware, but it's not clear whether the diagram applies to both RV670 and RV770 or is specific to one family. The article on Rage3d does provide an overview of the cache hierarchy, but it is likely not quotable in any academic setting. Further, we have no information about the sizes of the caches, and in some cases it's not at all clear whether a read/write will be cached. Consider a resource r1 allocated in linear memory (i.e. using the CAL_RESALLOC_GLOBAL_BUFFER flag). Let r1 be bound to name "i0" (and not g[]) in a context. Now if I sample from i0, is the read cached?

            edit: Therefore I request that more information be provided about the caches.
              • SDK 1.2 Feedback
                jopakastner

                Bug report:

                SDK 1.2, Brook+
                Catalyst 8.8
                HD 4850
                Linux 64 bit (openSuSE 10.3, Athlon X2)

                As discussed in the thread (with code example)
                http://forums.amd.com/devforum/messageview.cfm?catid=328&threadid=99991
                repeated invocation of a reduction kernel results in a segmentation fault (at least for HD4850 & Linux 64)

                edit: OK, it's not a bug; I didn't realize that the length of the output stream for a reduction kernel has to match the length of the input stream. However, I think reduction is not so useful in this case ... Thus:

                Feature request:

                - Reduction of a 1D stream to a _real_ single value
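To make the request concrete, here is a minimal CPU-side sketch of what "reduction to a real single value" means: repeated pairwise passes that halve the live length until one element is left. It is plain C++, not Brook+; the function name `reduce_to_scalar` is hypothetical and nothing here reflects how the SDK implements reductions.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CPU reference for a multi-pass pairwise sum reduction (illustrative only).
// Each pass folds the upper half of the live range onto the lower half,
// so the live length shrinks n -> ceil(n / 2) until a single value remains.
float reduce_to_scalar(std::vector<float> data) {
    assert(!data.empty() && "need at least one element");
    std::size_t n = data.size();
    while (n > 1) {
        std::size_t half = (n + 1) / 2;  // live length after this pass
        for (std::size_t i = 0; i + half < n; ++i) {
            data[i] += data[i + half];
        }
        n = half;
    }
    return data[0];
}
```

On a GPU each pass would correspond to one kernel invocation over a shorter stream; the sketch only shows the data flow.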

              • SDK 1.2 Feedback
                MicahVillmow
                These have been reported and should be fixed in the next major release. Also bumping this thread so it doesn't fall off the first page.
                  • SDK 1.2 Feedback
                    ryta1203
                    Can we please have local arrays for Brook+ kernels?

                    Also, can the multi-kernel scatter out problem get fixed ALONG with the multi-out scatter for 1 kernel?
                      • SDK 1.2 Feedback
                        lpw

                        Feature Requests

                        - calMemCopy for domains.

                        - Write Query and Write Mask interfaces for CAL (similar to the GPU backends for Brook+, as seen in the source code).

                         

                        Documentation

                        - persistent (reduction) buffers, scratch buffers

                        - additional documentation for CAL extensions

                         

                    • SDK 1.2 Feedback
                      MicahVillmow
                      lpw & ryta,
                      I've added these to our tracker database so that the people in charge can make decisions.

                      Also, scratch buffers are documented a little in the CAL IL documentation as temp arrays.

                      Lpw, can you expand on what you mean by write query/write mask?
                      • SDK 1.2 Feedback
                        kos

                        Feature request

                        Description here: http://forums.amd.com/devforum/messageview.cfm?catid=328&threadid=100614&enterthread=y

                        • SDK 1.2 Feedback
                          kos

                          Question

                          Can I use OpenSolaris with the Stream Computing SDK if I can get the ATI driver running on it?

                          • Cal Compiler Feature Request: read-combining
                            sgratton

                            Hi Micah,

                            I'd really like to see "read-combining" or something similar to boost read speeds of global buffers, as you mentioned could be coming in this topic.

                            Assuming rv770 can read global memory quickly, this would be a real help for any algorithm that has to make multiple passes over the data; I think slow reading is a significant bottleneck for CAL at the moment.

                            Best,
                            Steven.
                            • SDK 1.2 Feedback
                              plaicy

                              Bug Report

                              It would be nice if the SDK detected when the current graphics card is not supported by CAL (I tested it with an X700 on Linux):

                              $ ./bin/lnx32/FindNumDevices
                              XIO: fatal IO error 0 (Success) on X server ":0.1"
                              after 9 requests (9 known processed) with 0 events remaining.

                              With DISPLAY=:0.0 I get the same result. If I unset DISPLAY the tool FindNumDevices works correctly:

                              $ DISPLAY= ./bin/lnx32/FindNumDevices
                              CAL initialized.

                              Finding out number of devices :-
                              Device Count = 0

                              CAL shutdown successful.

                              Press enter to exit...

                               

                              Greetings from Hamburg

                              • SDK 1.2 Feedback
                                Ceq
                                Bug Report 1
                                -----------------------------------------

                                Using indexof on an undefined variable causes a strange assertion failure instead of an error message. Example:

                                kernel void test(float a<>, out float b<>) { b = a + indexof( bx ); }

                                Assertion failed: index >=0 && index <= AsInt(paramResource.size()), file h:\hd1\brook\platform\brcc\src\cgprogram.cpp, line 1019



                                Bug Report 2
                                -----------------------------------------

                                Some reductions, when repeated or run inside a loop, abort program execution. For example:

                                Open samples/tests/reduction/reduction.br and duplicate line 211:

                                matrix_mult(result1, quadresult);
                                sum(matrices, sum_res[0]); // line 211
                                sum(matrices, sum_res[0]); // duplicated line
                                sum(quadresult, sum_res[1]);

                                • SDK 1.2 Feedback
                                  MicahVillmow
                                  Ryta, this is possible via the C++ interface that is generated after your source code is run through brcc.
                                  • SDK 1.2 Feedback
                                    Ceq
                                    Hi ryta, about freeing streams I think this could be useful:

                                    Since Brook+ ignores preprocessor commands you can easily take advantage of it to avoid editing the generated file:

                                    #define streamFree(_stream) _stream.~stream();

                                    And now you can type in your code:

                                    streamFree(streamName);

                                    The same trick could be used in other situations since BRCC doesn't complain about undefined functions
                                      • SDK 1.2 Feedback
                                        jean-claude

                                        Hi Guys,

                                        It seems that a lot of questions are left unanswered on this forum.

                                        I would strongly suggest that our friends at AMD be more present on the forum and provide adequate answers/hints to the issues raised.

                                        The point is that Brook+ is definitely still far from being a professional-grade environment.

                                        Anyway, there are enough motivated beta testers here who trust that Brook is worth the time spent understanding it and developing advanced programs on the GPU.

                                        So please, AMD, show your dedication to Brook by paying more attention to feedback and questions from your early users.

                                        Thanks.

                                        Jean-Claude

                                        • SDK 1.2 Feedback
                                          josopait

                                           

                                          Originally posted by: Ceq
                                          Hi ryta, about freeing streams I think this could be useful:
                                          Since Brook+ ignores preprocessor commands you can easily take advantage of it to avoid editing the generated file:
                                          #define streamFree(_stream) _stream.~stream();
                                          And now you can type in your code:
                                          streamFree(streamName);
                                          The same trick could be used in other situations since BRCC doesn't complain about undefined functions


                                          Ceq,

                                          Ugh, what you are doing there looks a bit scary. If you destruct a stream by calling the ~stream() destructor explicitly, it will be destructed a second time when the stream goes out of scope, with possibly undefined behavior. Well, if it works...
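One way to release an object early without the destructor running twice is to let a wrapper own it. The sketch below uses a hypothetical stand-in type, `FakeStream`, that only counts destructor calls; it is not the Brook+ stream class, and whether the generated Brook+ code can be wrapped this way is an assumption that would need checking.

```cpp
#include <cassert>
#include <optional>

// Hypothetical stand-in for a stream object; it only counts how many
// times its destructor runs. Not part of Brook+.
struct FakeStream {
    static int destroyed;
    ~FakeStream() { ++destroyed; }
};
int FakeStream::destroyed = 0;

// Releases the object before the end of its scope and reports how many
// destructor calls happened in total.
int release_once() {
    FakeStream::destroyed = 0;
    {
        std::optional<FakeStream> s;
        s.emplace();   // construct the object in place
        s.reset();     // destructor runs here, exactly once
        assert(!s.has_value());
    }                  // the optional is empty, so nothing is destroyed again
    return FakeStream::destroyed;
}
```

An inner `{ ... }` block around the stream declaration achieves the same effect with no wrapper at all, at the cost of restructuring the code.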

                                          • SDK 1.2 Feedback
                                            Yadovit

                                            Why don't you develop and promote AMD Stream the way CUDA is promoted? On the internet there is only information about SDK and device releases, but there are no interesting articles.

                                              • SDK 1.2 Feedback
                                                kos

                                                QUESTION 1: What happens to the domain of execution on "mov o0, r1" followed by "ret_dyn"? Will there be any value for that thread?

                                                QUESTION 2: How many output registers o[n] can I use?

                                                  • SDK 1.2 Feedback
                                                    tonald

                                                    When I use Brook+ 1.21 I get strange behavior. I finally narrowed it down to the code below, where I try to use the GPU twice:

                                                     /////////////////////////////////////////////////////////////////////////
                                                     // Brook code block
                                                     /////////////////////////////////////////////////////////////////////////
                                                        {
                                                            float inputStream<Length>;
                                                            float outputStream<Length>;
                                                            float res<1>;

                                                            streamRead(inputStream, input);
                                                            hello_brook_check(inputStream, outputStream, (float)Length / 3.0f);
                                                            hello_brook_sum(outputStream, res);
                                                            streamWrite(res, &result);
                                                        }

                                                     /////////////////////////////////////////////////////////////////////////
                                                     // Brook code block
                                                     /////////////////////////////////////////////////////////////////////////
                                                        {
                                                            float inputStream1<Length>;
                                                            float outputStream1<Length>;
                                                            float res1<1>;

                                                            streamRead(inputStream1, input);
                                                            hello_brook_check(inputStream1, outputStream1, (float)Length / 3.0f);
                                                            hello_brook_sum(outputStream1, res1);
                                                            streamWrite(res1, &result);
                                                        }

                                                     

                                                    The program halts at "hello_brook_check(inputStream1, outputStream1, (float)Length / 3.0f);" and gives the error message "unhandled exception at .....".

                                                    But if I write it like this:

                                                    /////////////////////////////////////////////////////////////////////////
                                                     // Brook code block
                                                     /////////////////////////////////////////////////////////////////////////

                                                            float inputStream1<Length>;
                                                            float outputStream1<Length>;
                                                            float res1<1>;


                                                        {
                                                            float inputStream<Length>;
                                                            float outputStream<Length>;
                                                            float res<1>;

                                                            streamRead(inputStream, input);
                                                            hello_brook_check(inputStream, outputStream, (float)Length / 3.0f);
                                                            hello_brook_sum(outputStream, res);
                                                            streamWrite(res, &result);
                                                        }

                                                     /////////////////////////////////////////////////////////////////////////
                                                     // Brook code block
                                                     /////////////////////////////////////////////////////////////////////////
                                                        {

                                                            streamRead(inputStream1, input);
                                                            hello_brook_check(inputStream1, outputStream1, (float)Length / 3.0f);
                                                            hello_brook_sum(outputStream1, res1);
                                                            streamWrite(res1, &result);
                                                        }

                                                     

                                                    there is no error and the program runs correctly.

                                                    Does anybody know what is happening here?

                                              • SDK 1.2 Feedback
                                                kos

                                                Feature request

                                                HLSL extension in AMDhlslCompiler: gather or random read via global buffer. Not critical, but desirable.

                                                • SDK 1.2 Feedback
                                                  MicahVillmow
                                                  kos, the exact syntax for setting up a global buffer is as follows:
                                                  global float4 random[];

                                                  Usage is the same as for a C array:
                                                  random[0].z = 4294967296;
                                                    • SDK 1.2 Feedback
                                                      kos

                                                      Could I type code like that in GPU Shader Analyzer? Will it only work on R670+ GPUs?

                                                        • SDK 1.2 Feedback
                                                          kos

                                                          And if you are here right now, please answer the following questions:

                                                          1) Can I estimate the GPU load parameter the way Catalyst does, and how (under Linux and Windows)?

                                                          2) Can I use the Stream Computing SDK under Sun Solaris 10 if the Linux driver works properly?

                                                          3) Can you provide a GPU analyzer library for Linux?

                                                      • SDK 1.2 Feedback
                                                        MicahVillmow
                                                        kos,
                                                        GPU Shader Analyzer does not currently support AMD HLSL, which is shipped with the CAL SDK. You can estimate GPU load by looking at the ISA and comparing the number of ALU instructions you are executing with the number of texture instructions. I don't know the exact heuristics/equations GSA uses, so I can't tell you how it does it. You can find some performance equations in the slides here: http://coachk.cs.ucf.edu/courses/CDA6938/
                                                        Solaris is not a supported platform at this time and is not something we test, so I can't answer this. GPU Shader Analyzer is currently Windows-only, but if you send them an email requesting Linux support, they will better understand their users' needs and can make decisions about Linux support based on that information.
                                                          • SDK 1.2 Feedback
                                                            kos

                                                            Thank you, Micah! Do you mean that the performance counter in the Catalyst Control Center Overdrive (or simply overclocking) panel calculates the ALU/TEX ratio? I saw GPU load monitoring in other programs (RivaTuner) and thought there was some standard interface for getting the GPU load, so that, for example, I could run my CAL application and watch that performance counter dynamically. I've sent an email to the RivaTuner author, but I still think there must be a standard interface for reading the GPU load.

                                                          • SDK 1.2 Feedback
                                                            Ceq
                                                            Hmm, you're right, Josopait. I'm not sure what would happen. Anyway, you can use the preprocessor that way to call C++ functions, which otherwise wouldn't be allowed in Brook+ without modifying the compiler output.
                                                            • SDK 1.2 Feedback
                                                              lpw

                                                              Feature Request

                                                              A blocking version of calCtxIsEventDone would be nice (without busy wait).

                                                              Cheers,

                                                              L
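Until a truly blocking call exists, a common workaround is to poll with a sleep between checks so the CPU is not pinned. The sketch below is generic C++ and deliberately does not include the CAL headers: `wait_until_done` and its parameters are hypothetical, and in a real program the callback would presumably wrap calCtxIsEventDone. It still polls, so it only reduces the busy wait rather than removing it.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <functional>
#include <thread>

// Polls `is_done` and sleeps between polls, doubling the sleep time up to
// `max_sleep`. Returns the number of polls that were made.
// Hypothetical helper; not part of CAL.
int wait_until_done(const std::function<bool()>& is_done,
                    std::chrono::microseconds first_sleep = std::chrono::microseconds(50),
                    std::chrono::microseconds max_sleep = std::chrono::microseconds(2000)) {
    assert(is_done && "a poll callback is required");
    std::chrono::microseconds sleep = first_sleep;
    int polls = 1;
    while (!is_done()) {
        std::this_thread::sleep_for(sleep);      // yield the CPU between polls
        sleep = std::min(sleep * 2, max_sleep);  // exponential backoff, capped
        ++polls;
    }
    return polls;
}
```

The backoff trades a little latency (at most `max_sleep` after the event completes) for much lower CPU use while waiting on long kernels.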

                                                              • SDK 1.2 Feedback
                                                                sgratton

                                                                Hi there,

                                                                I'd like to report a probable documentation error in the CAL 1.2.1 SDK:

                                                                Intermediate Language Spec, the "sample" instruction on p. 6-28. It says the range of the "aoffimmi" offset is -64 to 63.5 (i.e. the offsets are in S7.1 format). I think it should rather be -8 to 7.5. The latter would be consistent with the r600isa.pdf document, which, in describing tex_dword2, says offsets are S3.1 or [-8,8), and also with my experience in debugging a kernel.

                                                                Best,
                                                                Steven.
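The two ranges in question follow from simple fixed-point arithmetic. The sketch below computes the range of a two's-complement value with one fractional bit; the total widths used in the comments (5 and 8 bits) are inferred from the ranges quoted in the post, not taken from either specification, and the helper name is hypothetical.

```cpp
#include <cassert>
#include <cmath>

struct Range { double lo; double hi; };

// Range of a two's-complement fixed-point number with `total_bits` bits
// overall, of which `frac_bits` are fractional:
//   lo = -2^(total_bits - 1 - frac_bits),  hi = -lo - 2^(-frac_bits)
Range fixed_point_range(int total_bits, int frac_bits) {
    assert(frac_bits >= 0 && total_bits > frac_bits);
    double step = std::ldexp(1.0, -frac_bits);                  // smallest increment
    double lo = -std::ldexp(1.0, total_bits - 1 - frac_bits);   // most negative value
    return {lo, -lo - step};
}

// fixed_point_range(5, 1) gives -8 to 7.5   (the range suggested in the post)
// fixed_point_range(8, 1) gives -64 to 63.5 (the range printed in the IL spec)
```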



                                                                • SDK 1.2 Feedback
                                                                  sgratton

                                                                  Hi there,

                                                                  A feature request: a full-precision IL dsqrt instruction. (Presumably it'd need to compile into multiple gpuisa instructions.)

                                                                  Thanks,
                                                                  Steven.
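As an illustration of why a full-precision square root would take several operations, here is a CPU-side sketch that starts from a single-precision estimate and refines it with Newton-Raphson steps. It is ordinary C++, says nothing about how the CAL compiler would actually lower such an instruction, and assumes the input lies within the normal range of a float.

```cpp
#include <cassert>
#include <cmath>

// Refines a single-precision square-root estimate towards double precision
// using Newton-Raphson: x <- 0.5 * (x + a / x). Each step roughly doubles
// the number of correct bits, so two steps suffice from a ~24-bit seed.
// Assumes `a` is within float range (no overflow/underflow in the seed).
double refined_sqrt(double a, int steps = 2) {
    assert(a >= 0.0);
    if (a == 0.0) return 0.0;
    double x = static_cast<double>(std::sqrt(static_cast<float>(a)));  // low-precision seed
    for (int i = 0; i < steps; ++i) {
        x = 0.5 * (x + a / x);
    }
    return x;
}
```

Note that the division inside the loop is itself a multi-instruction operation in double precision on most GPUs, which is consistent with the guess that dsqrt would expand into a sequence of ISA instructions.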


                                                                  • SDK 1.2 Feedback
                                                                    MicahVillmow
                                                                    bubu, Can you expand on what exactly you mean here, thanks?

                                                                    Kos,
                                                                    1) Probably zeros will get written out.
                                                                    2) There are between 8 and 16 outputs depending on the graphics card.
                                                                      • SDK 1.2 Feedback
                                                                        bubu

                                                                         

                                                                        Originally posted by: MicahVillmow bubu, Can you expand on what exactly you mean here, thanks?


                                                                        Somebody there had the wonderful idea of copy-protecting these documents:

                                                                        Stream_Computing_User_Guide.pdf (rev. 1.2.1)

                                                                        R600isa.pdf (rev 0.31)

                                                                        Intermediate_Language_Specification--Stream_Processor.pdf (v2.0)

                                                                        ... so you cannot copy (for copy-paste) the example code, nor the function names and C constants/flags.

                                                                        I was starting to learn the CAL functions and wanted to copy the CAL_RESALLOC_CACHED flag from there into my code, but I cannot because of the security restrictions on the PDF. I also tried to copy the CAL init example on page 3-9, but I cannot because it's copy-protected.

                                                                        That's what I'm referring to. The protection is ridiculous, because if I can print the document I can run OCR on it, or download a PDF crack tool from some dubious security web page. So please remove that protection or DRM.