SDK 1.2 Feedback

Hello everyone, as you know we released Version 1.2 of the stream SDK. In order to further improve future releases of the SDK, we would appreciate your help in providing feedback in this thread so that the information does not get buried in other threads. Please make sure you label each item as a 'Feature Request', 'Bug Reports', 'Documentation' or 'Other'. As always, you can send an email to 'streamcomputing@amd.com' for general requests or 'streamdeveloper@amd.com' for development related requests.

If you wish to file a Feature Request, please include a description of the feature request and the part of the SDK that this request applies to.

If you wish to file a Bug Report, please include the hardware you are running on, operating system, SDK version, driver/catalyst version, and if possible either a detailed description on how to reproduce the problem or a test case. A test case is preferable as it can help reduce the time it takes to determine the cause of the issue.

If you wish to file a Documentation request, please specify the document, what you believe is in error or what you believe should be added and which SDK the document is from.

Thank you for your feedback.
AMD Stream Computing Team

0 Likes
39 Replies
rahulgarg
Adept II

Documentation Request for "Stream Computing User Guide" in CAL v1.2:

Please add overview of "Compute Shaders" as well as LDS functionality in the CAL portion of the user guide.
a) What functionality is offered by compute shaders?
b) How to access LDS and shared registers from compute shaders?
c) How are thread groups and thread blocks of compute shaders mapped to hardware or allocated on SIMD? Do many thread groups execute concurrently on a SIMD? Looking at disassembly of compute shaders, one can check out the number of wavefronts allocated per SIMD by the compiler. The question is : how are these allocations computed?
d) LDS : Difference between wavefrontRel and wavefrontAbs. An example will be appreciated.
0 Likes
rahulgarg
Adept II

For GPU kernels, maximizing cache hit rate is critical to performance. However the CAL documentation provides very little info about the cache hierarchy. I have 2 requests related to caches for CAL v1.2

Feature request :

For RV670, we can use the CAL counter extensions to record the input cache hit rate. However, RV770 does not have the same cache hierarchy, and when I try to record the cache hit rates using the extension on RV770, I get 0.0 as the result. I therefore request that, if cache hit rate counters do exist in the RV770 hardware, they be exposed in CAL.

Documentation Request:
The cache hierarchy on RV770 is not properly documented in CAL. For example, figure 3.3 in the Stream Computing User Guide provides a generic overview of the stream processor hardware, but it's not clear whether the diagram applies to both RV670 and RV770 or is specific to one family. The article on Rage3d does provide an overview of the cache hierarchy, but it is unlikely to be quotable in an academic setting. Further, we have no clue about the sizes of the caches, and in some cases it's not at all clear whether a read/write will be cached. Consider a resource r1 allocated in linear memory (i.e. using the CAL_RESALLOC_GLOBAL_BUFFER flag). Let r1 be bound to the name "i0" (and not g[]) in a context. Now if I sample from i0, is the read cached?

edit : Therefore I request that more info be provided for caches.
0 Likes

Bug report:

SDK 1.2, Brook+
Catalyst 8.8
HD 4850
Linux 64 bit (openSuSE 10.3, Athlon X2)

As discussed in the thread (with code example)
http://forums.amd.com/devforum/messageview.cfm?catid=328&threadid=99991
repeated invocation of a reduction kernel results in a segmentation fault (at least for HD4850 & Linux 64)

edit: OK, it's not a bug; I didn't realize that the length of the output stream for a reduction kernel has to match the length of the input stream. However, I think reduction is not so useful in this case ... Thus:

Feature request:

- Reduction of a 1D stream to a _real_ single value
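
To illustrate the requested behavior, here is a host-side C++ sketch (not Brook+ code, and not a description of how the SDK implements reductions). It folds a 1D array down to one value in pairwise passes, the way multi-pass GPU reductions are commonly organized. The name `reduceToScalar` is hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: reduce a 1D buffer to a single sum using
// pairwise passes. Each pass roughly halves the active length,
// mimicking a multi-pass GPU reduction on the CPU.
float reduceToScalar(std::vector<float> data) {
    if (data.empty()) return 0.0f;
    std::size_t n = data.size();
    while (n > 1) {
        std::size_t half = (n + 1) / 2;      // active length after this pass
        for (std::size_t i = 0; i < n / 2; ++i) {
            data[i] += data[i + half];       // combine element with its partner
        }
        n = half;                            // an odd middle element carries over
    }
    return data[0];
}
```

The caller gets one scalar back regardless of the input length, which is the semantic asked for above.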

0 Likes

These have been reported and should be fixed in the next major release. Also bumping this thread so it doesn't fall off the first page.
0 Likes

Can we please have local arrays for Brook+ kernels?

Also, can the multi-kernel scatter out problem get fixed ALONG with the multi-out scatter for 1 kernel?
0 Likes

Feature Requests

- calMemCopy for domains.

- Write Query and Write Mask interfaces for CAL (similar to the GPU backends for Brook+, as seen in the source code).

 

Documentation

- persistent (reduction) buffers, scratch buffers

- additional documentation for CAL extensions

 

0 Likes

lpw&ryta,
I've added these to our tracker database so that the proper people in charge can make decisions.

Also, scratch buffers are documented a little bit in cal IL as temp arrays.

Lpw, can you expand on what you mean by write query/write mask?
0 Likes
kos
Journeyman III

Feature request

Description here: http://forums.amd.com/devforum/messageview.cfm?catid=328&threadid=100614&enterthread=y

0 Likes

Hello!

I have a Radeon 3650. Does this card support AMD Stream?

Thanks!

(Sorry for my english)

0 Likes

Yadovit, yes, it is supported.

0 Likes
kos
Journeyman III

Question

Can I use OpenSolaris with the Stream Computing SDK if I can get the ATI driver running on it?

0 Likes
sgratton
Adept I


Hi Micah,

I'd really like to see "read-combining" or something similar to boost read speeds of global buffers, as you mentioned could be coming in this topic.

Assuming rv770 can read global memory quickly, this would be a real help for any algorithm that has to make multiple passes over the data; I think slow reading is a significant bottleneck for CAL at the moment.

Best,
Steven.
0 Likes
plaicy
Journeyman III

Bug Report

It would be nice if the SDK detected when the current graphics card is not supported by CAL (I tested it with an X700 on Linux):

$ ./bin/lnx32/FindNumDevices
XIO: fatal IO error 0 (Success) on X server ":0.1"
after 9 requests (9 known processed) with 0 events remaining.

With DISPLAY=:0.0 I get the same result. If I unset DISPLAY the tool FindNumDevices works correctly:

$ DISPLAY= ./bin/lnx32/FindNumDevices
CAL initialized.

Finding out number of devices :-
Device Count = 0

CAL shutdown successful.

Press enter to exit...

 

Greetings from Hamburg

0 Likes
Ceq
Journeyman III

Bug Report 1
-----------------------------------------

Using indexof on an undefined variable causes a strange assertion failure instead of an error message. Example:

kernel void test(float a<>, out float b<>) { b = a + indexof( bx ); }

Assertion failed: index >=0 && index <= AsInt(paramResource.size()), file h:\hd1\brook\platform\brcc\src\cgprogram.cpp, line 1019



Bug Report 2
-----------------------------------------

Some reductions, when repeated or run inside a loop, abort program execution. For example:

Open samples/tests/reduction/reduction.br and duplicate line 211:

matrix_mult(result1, quadresult);
sum(matrices, sum_res[0]); // line 211
sum(matrices, sum_res[0]); // duplicated line
sum(quadresult, sum_res[1]);

0 Likes

If there is not already a way to explicitly free a stream in Brook+, then it would be great to be able to do this, since streams are unidirectional.
0 Likes

Ryta, this is possible via the C++ interface that is generated after your source code is run through brcc.
0 Likes

Originally posted by: MicahVillmow

Ryta, this is possible via the C++ interface that is generated after your source code is run through brcc.


Ok, let me be more specific, I would like to do it directly without having to do this, that would be nice.
0 Likes

Also, I would really like better documentation covering the major differences between BRT_RUNTIME=CPU and BRT_RUNTIME=CAL
0 Likes
Ceq
Journeyman III

Hi ryta, about freeing streams I think this could be useful:

Since Brook+ ignores preprocessor commands you can easily take advantage of it to avoid editing the generated file:

#define streamFree(_stream) _stream.~stream();

And now you can type in your code:

streamFree(streamName);

The same trick could be used in other situations since BRCC doesn't complain about undefined functions
0 Likes

Hi Guys,

Seems that there are a lot of questions left unanswered on this forum.

I would strongly suggest that our friends at AMD be more present on the forum and provide adequate answers/hints to the issues raised.

The point is that Brook+ is definitely still far from being a professional-grade environment.

Anyway, there are enough motivated beta testers here that trust Brook is worth spending some time understanding it and developing advanced programs on GPU.

So please AMD, show your dedication to Brook by allocating more interest to feedback and questions from your early users.

Thanks.

Jean-Claude

0 Likes

Originally posted by: Ceq

Hi ryta, about freeing streams I think this could be useful:

Since Brook+ ignores preprocessor commands you can easily take advantage of it to avoid editing the generated file:

#define streamFree(_stream) _stream.~stream();

And now you can type in your code:

streamFree(streamName);

The same trick could be used in other situations since BRCC doesn't complain about undefined functions


Ceq,

Ugh, what you are doing there looks a bit scary. If you destruct a stream by calling the ~stream() destructor explicitly, it will be destructed a second time when the stream goes out of scope, which is undefined behavior. Well, if it works...
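
A minimal C++ sketch of the concern, using a hypothetical `MockStream` class rather than the real Brook+ stream type. It counts destructor calls to show that each object is destroyed exactly once when its lifetime ends via a nested scope or `std::unique_ptr::reset()`, rather than via a hand-called destructor on an automatic variable. Whether the brcc-generated stream type can be used this way is an assumption the thread does not confirm.

```cpp
#include <memory>

// Counts how many MockStream objects have been destroyed.
int g_destroyed = 0;

// Hypothetical stand-in for a stream type; NOT the Brook+ class.
struct MockStream {
    ~MockStream() { ++g_destroyed; }
};

// Option 1: limit the lifetime with a nested scope.
int freeByScope() {
    g_destroyed = 0;
    {
        MockStream s;
        // ... use s ...
    }   // s is destroyed here, exactly once
    return g_destroyed;
}

// Option 2: own the object through a smart pointer and release it early.
int freeByReset() {
    g_destroyed = 0;
    {
        std::unique_ptr<MockStream> s(new MockStream);
        s.reset();   // destroys the object now
    }   // the pointer is already empty, so nothing is destroyed again
    return g_destroyed;
}
```

Both functions return 1: one destruction per object, with no second destructor call at scope exit.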

0 Likes

Why don't you develop and promote AMD Stream the way CUDA is promoted? On the internet there is only information about SDK and device releases, but no interesting articles.

0 Likes

Question 1: What happens to the domain of execution on "mov o0, r1" followed by "ret_dyn"? Will there be any value for that thread?

Question 2: How many output (o) registers can I use?

0 Likes
tonald
Journeyman III

When I use Brook+ 1.21 I get strange behavior. I eventually narrowed it down to the code below, where I try to use the GPU twice:

 /////////////////////////////////////////////////////////////////////////
 // Brook code block
 /////////////////////////////////////////////////////////////////////////
    {
        float inputStream<Length>;
        float outputStream<Length>;
        float res<1>;

        streamRead(inputStream, input);
        hello_brook_check(inputStream, outputStream, (float)Length / 3.0f);
        hello_brook_sum(outputStream, res);
        streamWrite(res, &result);
    }

 /////////////////////////////////////////////////////////////////////////
 // Brook code block
 /////////////////////////////////////////////////////////////////////////
    {
        float inputStream1<Length>;
        float outputStream1<Length>;
        float res1<1>;

        streamRead(inputStream1, input);
        hello_brook_check(inputStream1, outputStream1, (float)Length / 3.0f);
        hello_brook_sum(outputStream1, res1);
        streamWrite(res1, &result);
    }

 

The program halts at "hello_brook_check(inputStream1, outputStream1, (float)Length / 3.0f);" and gives the error message "unhandled exception at .....".

But if I write it like this:

/////////////////////////////////////////////////////////////////////////
 // Brook code block
 /////////////////////////////////////////////////////////////////////////

        float inputStream1<Length>;
        float outputStream1<Length>;
        float res1<1>;


    {
        float inputStream<Length>;
        float outputStream<Length>;
        float res<1>;

        streamRead(inputStream, input);
        hello_brook_check(inputStream, outputStream, (float)Length / 3.0f);
        hello_brook_sum(outputStream, res);
        streamWrite(res, &result);
    }

 /////////////////////////////////////////////////////////////////////////
 // Brook code block
 /////////////////////////////////////////////////////////////////////////
    {

        streamRead(inputStream1, input);
        hello_brook_check(inputStream1, outputStream1, (float)Length / 3.0f);
        hello_brook_sum(outputStream1, res1);
        streamWrite(res1, &result);
    }

 

There is no error, and the program runs correctly.

 

 

Does anybody know what is happening here?

0 Likes
kos
Journeyman III

Feature request

HLSL extension in AMDhlslCompiler: gather or random read via global buffer. Not critical, but desirable.

0 Likes

kos, this is already available via the global keyword.
0 Likes

Oops, I've made a mistake: gather != read, it's random write. But I know that random read in HLSL = texture.sample(coord). Could you provide any samples or give me a link to some? Thank you.

0 Likes

kos, the exact syntax for setting up a global buffer is as follows:
global float4 random[];

usage is same as using a C array.
random[0].z = 4294967296;
0 Likes

Could I type code like that in GPU Shader Analyzer? Will it only work on R670+ GPUs?

0 Likes
kos
Journeyman III

And if you are here right now, please answer the following questions:

1) Can I estimate the GPU load parameter just like Catalyst does, and how (under Linux and Windows)?
2) Can I use the Stream Computing SDK under Sun Solaris 10 if the Linux driver is working properly?
3) Could you provide a GPU analyzer lib for Linux?

0 Likes

kos,
GPU Shader Analyzer does not currently support the AMD HLSL that ships with the CAL SDK. You can estimate GPU load by looking at the ISA and comparing the number of ALU instructions you execute with the number of texture instructions. I don't know the exact heuristics/equations GSA uses, so I can't tell you how it does it. You can find some performance equations in the slides here: http://coachk.cs.ucf.edu/courses/CDA6938/
Solaris is not a supported platform at this time and is not something we test, so I can't answer this. GPU Shader Analyzer is currently Windows-only, but if you send them an email requesting Linux support, they will be better able to understand their users' needs and can make decisions about Linux support based on that information.
0 Likes

THANK YOU MICAH! Do you mean that the performance counter in the Catalyst Control Center Overdrive (or simply overclocking) panel calculates the ALU/TEX ratio? I saw GPU load monitoring in other programs (RivaTuner) and thought that there is some standard interface to get the GPU load; for example, I could run my CAL application and watch that performance counter, all dynamically. I've sent an email to the RivaTuner author, but again, I thought there must be a standard interface to get the GPU load characteristic.

0 Likes
Ceq
Journeyman III

Umh, you're right, Josopait. I'm not sure what would happen. Anyway, you can use the preprocessor that way to call C++ functions, which otherwise wouldn't be allowed in Brook+ without modifying the compiler output.
0 Likes
lpw
Journeyman III

Feature Request

A blocking version of calCtxIsEventDone would be nice (without busy wait).
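
Until such a call exists, a common workaround is to poll with a sleep between checks so that a CPU core is not pinned. Below is a generic sketch, assuming the caller supplies an `isDone` callable; for CAL this would presumably wrap `calCtxIsEventDone` and test for a non-pending result. `waitUntilDone` and its parameters are hypothetical names, not part of CAL.

```cpp
#include <algorithm>
#include <chrono>
#include <thread>

// Hypothetical helper: poll isDone() with exponentially growing sleeps
// instead of spinning. Returns the number of polls performed.
template <typename Pred>
int waitUntilDone(Pred isDone,
                  std::chrono::microseconds initial = std::chrono::microseconds(50),
                  std::chrono::microseconds maxDelay = std::chrono::microseconds(2000)) {
    int polls = 1;
    std::chrono::microseconds delay = initial;
    while (!isDone()) {
        std::this_thread::sleep_for(delay);   // give up the CPU between polls
        // Back off, but cap the delay so completion is noticed promptly.
        delay = std::min<std::chrono::microseconds>(delay * 2, maxDelay);
        ++polls;
    }
    return polls;
}
```

This is still polling rather than a true blocking wait, so it trades a little completion latency for much lower CPU usage.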

Cheers,

L

0 Likes
sgratton
Adept I


Hi there,

I'd like to report a probable...

documentation error in the CAL 1.2.1 SDK:

Intermediate Language Spec, the "sample" instruction on p 6-28. It says the range of the "aoffimmi" offset is -64->63.5 (i.e. the offsets are S7.1 format). I think it should rather be -8->7.5. The latter would be consistent with the r600isa.pdf document which, in describing tex_dword2, says offsets are S3.1 or [-8,8) and also with my experience in debugging a kernel.
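
The two candidate ranges follow directly from the bit widths. Here is a small sketch, assuming the offset is a two's-complement fixed-point value with one fractional bit, and that "S3.1" means a sign bit plus three integer bits plus one fractional bit (5 bits in total). The helper `fixedRange` is hypothetical.

```cpp
#include <cassert>
#include <utility>

// Range of a two's-complement fixed-point value with `totalBits` bits,
// of which `fracBits` are fractional. Returns {min, max}.
std::pair<double, double> fixedRange(int totalBits, int fracBits) {
    assert(fracBits >= 0 && totalBits > fracBits && totalBits < 31);
    const double scale = static_cast<double>(1 << fracBits);
    const int rawMin = -(1 << (totalBits - 1));       // most negative raw value
    const int rawMax = (1 << (totalBits - 1)) - 1;    // most positive raw value
    return {rawMin / scale, rawMax / scale};
}
```

Under these assumptions, `fixedRange(5, 1)` gives [-8, 7.5], matching the S3.1 reading from r600isa.pdf, while `fixedRange(8, 1)` gives [-64, 63.5], the range quoted in the IL spec.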

Best,
Steven.



0 Likes
sgratton
Adept I


Hi there,

A feature request: a full-precision IL dsqrt instruction. (Presumably it'd need to compile into multiple gpuisa instructions.)

Thanks,
Steven.


0 Likes

Pls, unlock security in the PDF. It's a P.A.I.N to remember all the error codes, function names, etc.

0 Likes

bubu, Can you expand on what exactly you mean here, thanks?

Kos,
1) Probably zeros will get written out.
2) There are between 8 and 16 outputs depending on the graphics card.
0 Likes

Originally posted by: MicahVillmow bubu, Can you expand on what exactly you mean here, thanks?


Somebody there had the wonderful idea of copy-protecting the: Stream_Computing_User_Guide.pdf ( rev.1.2.1)

R600isa.pdf(rev 0.31)

Intermediate_Language_Specification--Stream_Processor.pdf(v2.0)

... so you cannot copy(for copy-paste) the example code and neither the functions names and C constants/flags...

I was starting to learn CAL functions... I wanted to copy the CAL_RESALLOC_CACHED flag from there to my code.... but I cannot copy due to the security restrictions on the PDF. Also I tried to copy the Cal-init example on page 3-9 ... but I cannot because it's copy protected...

That's what I'm referring to... and that protection is ridiculous... because if I can print the document I can perform an OCR... or download a PDF crack tool from noob security web pages... So, pls, remove that protection or DRM.

0 Likes