Archives Discussions

MicahVillmow · ‎12-10-2008

Now that 1.3 has been released to the public, we would like feedback on it in order to further improve future releases of the SDK. We would appreciate your help in providing feedback in this thread so that the information does not get buried in other threads. Please make sure you label each item as a 'Feature Request', 'Bug Report', 'Documentation' or 'Other'. As always, you can send an email to 'streamcomputing@amd.com' for general requests or 'streamdeveloper@amd.com' for development-related requests.

If you wish to file a Feature Request, please include a description of the feature request and the part of the SDK that this request applies to.

If you wish to file a Bug Report, please include the hardware you are running on, operating system, SDK version, driver/catalyst version, and if possible either a detailed description on how to reproduce the problem or a test case. A test case is preferable as it can help reduce the time it takes to determine the cause of the issue.

If you wish to file a Documentation request, please specify the document, what you believe is in error or what you believe should be added and which SDK the document is from.

Thank you for your feedback.
AMD Stream Computing Team

wgbljl · ‎12-10-2008

Hi Micah,

On page 151 of the User_Guide in the 1.3beta SDK, it says that the FireStream 9250 supports "Compute Kernel", which the FireStream 9170 does not. What does "Compute Kernel" mean?

Thanks.

pbhani · ‎12-11-2008

Compute Kernels (aka Compute Shaders) are a more generic way of launching kernels on R7xx GPUs. The best way to understand them is to recall that modern GPUs have graphics-specific pipelines and that CAL currently launches kernels using Pixel Shaders by default. You can think of these shaders as being invoked during the rasterization of screen-aligned quadrilaterals. However, this involves various kinds of interaction with, and dependence on, the graphics pipeline. For example, if you look closely, each CAL kernel has a title il_ps_* (for pixel shader mode) and uses vPos/vWinCoord for the current fragment position.

Compute Kernels basically relax these dependencies. You have direct access to thread IDs in the kernels and you can control various thread group parameters. Compute Kernels also allow you to access R7xx's on-chip per-SIMD shared memory.
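To make the difference concrete, here is a minimal host-side C++ sketch, not CAL or IL code. It contrasts deriving a linear index from a 2D fragment position (pixel-shader style) with splitting a flat thread ID into a thread group and an index inside that group (compute style). The struct and function names are hypothetical and exist only for illustration.

```cpp
#include <cassert>

// Hypothetical illustration only: none of these names are CAL or IL APIs.
struct Pos2D { int x; int y; };             // what a pixel-shader kernel sees (vPos-style)
struct ThreadId { int group; int local; };  // what a compute kernel might see

// Pixel-shader mode: the kernel derives a linear index from its 2D position.
int linear_index(Pos2D p, int domain_width) {
    assert(domain_width > 0 && p.x >= 0 && p.x < domain_width && p.y >= 0);
    return p.y * domain_width + p.x;
}

// Compute mode: a flat thread ID splits into (thread group, index inside group).
ThreadId split_thread_id(int flat_id, int group_size) {
    assert(group_size > 0 && flat_id >= 0);
    return ThreadId{flat_id / group_size, flat_id % group_size};
}
```

In this sketch the group size is a free parameter, which mirrors the point above that compute mode lets you control thread group parameters.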

wgbljl · ‎12-11-2008

Hi pbhani, how do I use the on-chip shared memory in Brook+? Or do I have to use it from IL?

Thanks.

gaurav_garg · ‎12-11-2008

Currently shared memory is exposed only through IL.

pbhani · ‎12-12-2008

As Gaurav said, shared memory is not currently available via Brook+ and neither is compute shader support. We have the RFE on our list of TODOs for future releases... so stay tuned.

rveldema · ‎12-11-2008

- double constants in expressions cause an internal compiler error in brcc:

A test case:

-------------------------------------------------

kernel void gpgpu_laplacalc(float __brook_iter_space<>, int __looplen, int Z,
                            int firstcol, int lastcol, double loop_invar_0,
                            double t1a[], out double t1b[], double t1c[], double t1d[])
{
    int iindex;
    double xx;

    iindex = 4;

    xx = 4.0;

    t1b[iindex] = t1a[iindex] - (4.0 * t1a[iindex]);
}
------------------------

Replacing the 4.0 in the last line with xx allows brcc to compile the kernel.

This is under linux with brcc from 1.3.

godsic · ‎12-11-2008

Does the HD 3450 support ATI Stream 1.3 with Catalyst 8.12?

rveldema · ‎12-11-2008

feature request.

Brook 1.3 does not support multiple gather streams,

so it is illegal to write:

------------------

kernel void multiscatter(double input<>, out double output1[], out double output2[])
{
    int index = 33;
    output1[index] = input * 2.0;
    output2[index] = input * 3.0;
}
-----------------

Unfortunately, this can be quite common, requiring the programmer to split the code above into multiple kernels and do duplicate work.

pbhani · ‎12-12-2008

rveldema,

Thanks for your feedback. I assume you mean multiple scatter streams! Yes, this is a restriction of the underlying architecture that Brook+ does not virtualize. It would be possible to support this feature using multiple passes and we do have this on our TODO list.
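As a rough illustration of the multi-pass idea, the following host-side C++ sketch emulates the two-output kernel from the request as two passes, each with a single scatter output. It is not Brook+ code and not how the runtime would implement it. `scatter_pass` and `multiscatter_two_pass` are hypothetical names, and the sequential loop stands in for a kernel launch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One pass = one emulated kernel launch with a single scatter output.
// `scale` stands in for the per-output expression (input * 2.0, input * 3.0).
void scatter_pass(const std::vector<double>& input, std::size_t index,
                  double scale, std::vector<double>& output) {
    assert(index < output.size());
    for (double v : input) {
        // Every element writes to the same index, as in the original request.
        // In this sequential emulation the last element wins; on a GPU the
        // winner would not be defined.
        output[index] = v * scale;
    }
}

// Two passes over the same input replace the single two-output kernel.
void multiscatter_two_pass(const std::vector<double>& input,
                           std::vector<double>& output1,
                           std::vector<double>& output2) {
    const std::size_t index = 33;
    scatter_pass(input, index, 2.0, output1);
    scatter_pass(input, index, 3.0, output2);
}
```

The cost is visible in the sketch: the input is read once per output, which is the duplicate work the request complains about.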

rveldema · ‎12-11-2008

bug report

Like the other report, this gives a strange internal compiler error too, but seemingly for a different reason.

-------------
kernel void singlescatter(double input<>, out double output1[])
{
    int index = 33;
    double xx = 2.0;
    output1[index] = input * 2.0;
}

-----------

Fortunately, replacing the 2.0 with xx in the last line again fixes the problem.

maligor · ‎12-11-2008

Been testing Brook+ as a raw converter for photos. This is converted from the old 1.2.1 stuff.

More or less 'if else' seems to corrupt the 'mode' variable in some way and it runs way out of bounds. It seems to grab y index inside itself somehow. The cpu target doesn't do this. Removing the 'else' statements in it makes it behave.

Broken Code:

kernel void convert_surface_simple( float img[][], out float3 image_out<> )
{
    int2 ind = instance().xy;
    int mode;
    mode = (int)(fmod((float)ind.x * 1.0f, 2.0f) + fmod((float)ind.y * 1.0f, 2.0f) * 2.0f);

    if( mode == 0 ) { // Blue
        image_out.x = img[ind.y+1][ind.x+1];
        image_out.y = img[ind.y+1][ind.x];
        image_out.z = img[ind.y][ind.x];
    }
    else if( mode == 1 ) { // Green1
        image_out.x = img[ind.y+1][ind.x];
        image_out.y = img[ind.y][ind.x];
        image_out.z = img[ind.y][ind.x+1];
    }
    else if( mode == 2 ) { // Green2
        image_out.x = img[ind.y][ind.x+1];
        image_out.y = img[ind.y][ind.x];
        image_out.z = img[ind.y+1][ind.x];
    }
    else if( mode == 3 ) { // Red
        image_out.x = img[ind.y][ind.x];
        image_out.y = img[ind.y+1][ind.x];
        image_out.z = img[ind.y+1][ind.x+1];
    }
}

This is just a version I use for testing the in/out. I have a more complex version that doesn't work quite right yet, but the speed is impressive.
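For comparison with the GPU output, the mode selection can be checked on the host. The C++ sketch below is an integer reformulation of the fmod expression in the kernel, assuming non-negative pixel coordinates. `bayer_mode` is a hypothetical helper, and this is a sanity check for expected values, not a fix for the reported corruption.

```cpp
#include <cassert>

// Position inside the 2x2 Bayer cell: 0 = Blue, 1 = Green1, 2 = Green2, 3 = Red
// (labels taken from the comments in the kernel above).
int bayer_mode(int x, int y) {
    assert(x >= 0 && y >= 0);
    return (x % 2) + (y % 2) * 2;
}
```

Any mode value outside 0..3 on the GPU target would point at the compiler or runtime rather than at the arithmetic.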

rveldema · ‎12-11-2008

bug report

gather array of structures

----------------------

typedef struct foo
{
double field;
} foo;

kernel void struct_use(double input<>, out double output1[], foo z[])
{
int index = 33;
output1[index] = input * z[0].field;
}

----------------------

This causes brcc (1.3, Linux) to abort, saying something about failing to determine which function overload was intended.

Removing " * z[0].field" from the last line makes it compile again.

MicahVillmow · ‎12-11-2008

Thank you very much for the bug reports. I've filed them against the brook+ team so that they can be fixed.

rveldema · ‎12-11-2008

feature request

explicit memory management for GPU memory:

I'd very much like some version of brook_malloc(int) and brook_free(void*)

coupled of course with memory indirection in kernel codes.

This would allow us to implement complex data structures in GPU memory.

Example:

----------------------

typedef struct tree {
    double val;
    struct tree *left, *right;
} tree;

kernel void access_tree(tree *t, reduce double sum<>)
{
    sum = t->val + t->left->val; // etc
}

void hostcode() {
    tree *a = brook_malloc(sizeof(tree)); // etc
    stream b; // etc
    access_tree(a, b);
    brook_free(a); // etc
}

----------------------

This would make the work needed to 'massage' our data structures into arrays superfluous.

MicahVillmow · ‎12-11-2008

sla · ‎12-11-2008

IL still has no integer sub ('isub') instruction, and 'iadd' with 'neg' modifier gets compiled into 2 instructions:
iadd    r0.x,r0.x,r3.y_neg(y)

SUB_INT     ____, 0.0f, PV2.x
ADD_INT     ____, R3.x, PV3.z

rveldema · ‎12-12-2008

Yes, I know, this is a well-known transform (creating an array for each structure field and then having a semi-runtime pass that translates pointers to indexes). With a little effort, something like a limited 'new' might be possible even inside kernels.

Luckily, it's also easy to implement in the compiler! It's a choice between changing every app each time or the compiler once.

Cheers,

Ronald.
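For readers unfamiliar with the transform mentioned here, this is a minimal C++ sketch of it: one array per field of the tree struct, with child pointers replaced by integer indexes. The names `TreeArrays`, `add_node` and `val_plus_left`, and the use of -1 as a null index, are assumptions made for illustration and are not part of Brook+.

```cpp
#include <cassert>
#include <vector>

// Hypothetical layout: one array per field of `tree`; child "pointers" are
// stored as indexes into the same arrays, with -1 meaning "no child".
struct TreeArrays {
    std::vector<double> val;
    std::vector<int> left;
    std::vector<int> right;

    // Appends a leaf node and returns its index (the stand-in for a pointer).
    int add_node(double v) {
        val.push_back(v);
        left.push_back(-1);
        right.push_back(-1);
        return static_cast<int>(val.size()) - 1;
    }
};

// Equivalent of `t->val + t->left->val` from the feature request, with a
// check for a missing left child.
double val_plus_left(const TreeArrays& t, int node) {
    assert(node >= 0 && node < static_cast<int>(t.val.size()));
    double sum = t.val[node];
    const int l = t.left[node];
    if (l != -1) {
        sum += t.val[l];
    }
    return sum;
}
```

Each of the three vectors maps naturally onto a gather stream, which is why the transform fits the existing kernel model.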

Ceq · ‎12-11-2008

I think I've found a bug in SDK 1.3, every time you try to use a structure
in a subkernel brcc aborts due to an unhandled exception.
(BRCC source file "express.cpp", function "semanticCheck", line 1845)

To check this open the following sample program from Brook+ directory:
"BROOK\samples\legacy\tests\struct"

The Brook+ code should be like the following:
----------------------------------------------------------------------------------------------
typedef struct PairRec
{
float first;
float second;
} Pair;

kernel void struct_gather(float index< >, Pair pairs[ ], out float result< > )
{
Pair p = pairs[ index ];
result = p.first + p.second;
}

Now create another kernel and call it from the first one:
----------------------------------------------------------------------------------------------
kernel void auxKer(Pair p< >, out float result< >) { result = p.first + p.second; }

kernel void struct_gather(float index< >, Pair pairs[ ], out float result< > )
{
Pair p = pairs[ index ];
auxKer(p, result);
}

According to the Brook+ documentation, calling a subkernel from another kernel should work.
Note that if you use base types this doesn't happen.

pbhani · ‎12-12-2008

Ceq,

Thanks for the bug report. Looks like this code-path is busted! We'll track this as well. Please use the workaround of using base types, as you suggested, for now.

beldoy · ‎12-12-2008

Hi,

I have the AGP version of the HD 3850. After installing the 8.12 drivers and the 1.3 SDK, the Avivo transcoder does not seem to offer H.264/AVC encoding, or at least I can't see it anywhere. With any of the available encoding options, the time taken is almost twice as long as normal CPU encoding, so it seems the stream part is not enabled, at least with my setup.

My setup:

AMD 3000+ CPU single core
1.5GB Memory
HD 3850 AGP Card

Any help appreciated.

Remotion · ‎12-12-2008

Hi,

The new Brook+ runtime looks much better but unfortunately still has many problems.

It seems that only this simple reduction kernel will return proper value.

reduce void
ReduceK(float input<>, reduce float output<> )
{
output += input;
}

This one, and other variations on the input, already return a totally wrong answer (INF) on the HD 4870.

reduce void
ReduceK(float input<>, reduce float output<> )
{
output += (input * input);
}

It is strange that one now needs to create the env var BRT_PERMIT_READ_WRITE_ALIASING to use one stream as both input and output.

This was working well with the older SDK.

Is there a way to use kernels like this one?

kernel void Add(float input<>, out float output<> )

{

output += input; // this is not reduction!

}

Remotion · ‎12-12-2008

Why is it now necessary to define BRT_PERMIT_READ_WRITE_ALIASING to use the same stream as input and output?
Using the same stream as input and output greatly simplifies and accelerates my programs and works well with the HD 4870.

The biggest problem is memory leaks and slowdown on kernel calls when using VS2008.

pbhani · ‎12-12-2008

Remotion,

GPUs have separate read and write caches. Using the same Stream as input as well as output might work for simple cases, but is NOT guaranteed to work in the general purpose case (e.g. use of gather and scatter streams). The new runtime simply checks for this condition. If you feel that your application is not sensitive to this issue, please use the env variable... as long as you are aware of the underlying issues.

gaurav_garg · ‎12-12-2008

Hi Remotion,

Thanks for your feedback.

1. Issues with reduction - I think the issue might be that the result exceeds the maximum floating-point value that the GPU can represent. You can try putting a bound on your input array values (maybe something < 5) and test it.

2. Why BRT_PERMIT_READ_WRITE_ALIASING - Under SIMD parallelism you might get incorrect results if you use the same stream as input and output. Consider this example -

kernel void test(float a[], out float b<>)
{
    b = a[0];
}

If you call this kernel with the same stream as input and output, you might get undefined results, as Brook+ doesn't guarantee the order of execution over the input stream.

3. Memory leaks and slow-down with VS2008 - Do you see these memory leaks only with VS2008? I am using the pre-built Brook+ library and tried running some samples with up to 1000 iterations, and I don't see increasing memory usage in my application or any slow-down.
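The aliasing hazard in point 2 can be reproduced on the CPU. The C++ sketch below is not Brook+ code. It runs a neighbour gather, out[i] = in[(i + 1) % n], with an explicit processing order that stands in for the unspecified GPU execution order. `shift_gather` and the `order` parameter are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Neighbour gather: out[i] = in[(i + 1) % n]. `order` lists the element
// indexes in the order they are processed. `in` and `out` may refer to the
// same vector, which models using one stream as both input and output.
void shift_gather(const std::vector<float>& in, std::vector<float>& out,
                  const std::vector<std::size_t>& order) {
    const std::size_t n = in.size();
    assert(out.size() == n);
    for (std::size_t i : order) {
        assert(i < n);
        out[i] = in[(i + 1) % n];
    }
}
```

With separate buffers the result does not depend on `order`. With the same vector passed as both arguments, a forward order and a reverse order give different results, because later elements read values that earlier ones have already overwritten.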

Remotion · ‎12-12-2008

Hi, and thanks for your reply,

I have tested this reduction kernel with streams filled with 1.0 and still got wrong results.

Yes, I know about this problem; it is the same as using multiple CPU cores to do the work, and all my kernels are resistant to it.

I am using WinXP 64-bit and calling kernels from another DLL, which is compiled with VS2008, and have this problem.

The Brook+ runtime is compiled with VS2008 too.

gaurav_garg · ‎12-12-2008

Could you also post your runtime part of the code for reduction issue?

We will try to reproduce the memory leak issue on our end with VS2008. It would be great if you can post a test case.

Thanks.

Remotion · ‎12-12-2008

I have just modified the reduce_kernel sample.

reduce void
reduceGPU(float input<>, reduce float output<>)
{
output += input * 0.1f;
}

Even this code returns a wrong result.

I will try to create a simple project with the memory leak issue and send it via e-mail later.

gaurav_garg · ‎12-13-2008

What dimensions are you using for the input and output streams for the reduction?

Did you try error checking on the output stream? errorLog on the output stream can give some useful information.

gaurav_garg · ‎12-13-2008

Hi Remotion,

Reduction doesn't work if you have any expression on the right side.

Reduction is defined as a single, two-input operator. I think this constraint on reduction is not new; it has been the same since BrookGPU.
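One way to see why an expression on the right side breaks a reduction is to model a multi-pass reduction on the CPU. The C++ sketch below assumes a simple pairwise tree in which each pass applies the kernel body to pairs of partial results. This is an assumed model, not the actual Brook+ runtime, and `tree_reduce` and `sum_of_squares` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Assumed model of a multi-pass reduction: each pass halves the data by
// applying the kernel body to pairs of partial results.
float tree_reduce(std::vector<float> data,
                  const std::function<float(float, float)>& body) {
    assert(!data.empty());
    while (data.size() > 1) {
        std::vector<float> next;
        for (std::size_t i = 0; i + 1 < data.size(); i += 2) {
            // data[i] plays the role of `output`, data[i + 1] of `input`.
            next.push_back(body(data[i], data[i + 1]));
        }
        if (data.size() % 2 == 1) {
            next.push_back(data.back());  // odd element carries over unchanged
        }
        data = next;
    }
    return data[0];
}

// Work-around sketch: square element-wise in an ordinary pass first, then
// reduce with a plain two-input sum.
float sum_of_squares(std::vector<float> data) {
    for (float& v : data) {
        v *= v;
    }
    return tree_reduce(data, [](float acc, float x) { return acc + x; });
}
```

Under this model, a body such as `acc + x * x` squares partial sums again on every pass, so the values grow very quickly with stream size. That would be consistent with the INF reported earlier in the thread, although the real runtime may order its passes differently.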

eduardoschardong · ‎12-15-2008

Originally posted by: Remotion

reduce void
reduceGPU(float input<>, reduce float output<>)
{
output += input * 0.1f;
}

Reduction kernels are expected to be commutative; neither this one nor the previous one is, so they return incorrect results.

BTW, hey AMD, could you simplify a little?
Wipe out the <> in kernels, make non-scatter output streams readable, and allow a return value as the default output? For example, in kernel void sum(float a, out float b){ b+=a;/**/}, why couldn't b be read and then written at the same location? Doesn't the reduce parameter in reduction kernels work this way? As for the <>, note that, as in the code above, the size of the a stream does not matter to the kernel, nor does it matter whether a is a constant rather than a stream, so why force the programmer to write those useless <>? And for the default output, say we have a kernel float sum(float a, float b) {return a + b;/**/}; is it so difficult for the compiler to translate it to kernel void sum(float a, float b, out float c) {c = a + b;/**/}? The first form is more readable, and for all those simple streams I wouldn't have to rewrite, copy and paste, or wrap simple functions that already exist for CPUs...
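The requested translation from a returning function to an out parameter can be pictured with a small C++ sketch. It is only an illustration of the idea, not something Brook+ or brcc provides. `map2` is a hypothetical wrapper, and std::vector stands in for a stream.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// A plain scalar function, as it might already exist for the CPU.
float sum(float a, float b) { return a + b; }

// Hypothetical wrapper: applies a scalar function element-wise and stores its
// return value in the output, i.e. the `out float c` form generated for you.
template <typename F>
void map2(F f, const std::vector<float>& a, const std::vector<float>& b,
          std::vector<float>& c) {
    assert(a.size() == b.size());
    c.resize(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) {
        c[i] = f(a[i], b[i]);
    }
}
```

A call such as `map2(sum, a, b, c)` then reuses the CPU function unchanged, which is the convenience being asked for.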

bronson · ‎12-15-2008

Is anybody working on AVT for Linux? I'd like to offload h.264 encoding to the GPU... possible?

rick_weber · ‎12-13-2008

I read in the what's new that this version of Brook+ supports arrays of streams on the host and dynamic stream allocation. I couldn't find how to do either in the documentation, and float myStream<10>[10]; doesn't pass brcc's syntax checking. Firstly, how do I do this? Secondly, my request is that the documentation be updated to explain how to do this.

rick_weber · ‎12-13-2008

Oh wait, the runtime C++ API was updated to support this, which I'm guessing means I have to write the kernel in Brook+, cross compile with brcc, and then modify the .cpp file, changing the ::StreamOperator: or whatever corresponds to the desired one in the .br file to make an array of them.

gaurav_garg · ‎12-13-2008

Hi Rick,

Brook+ 1.3 exposes Stream as a class, and you can call operators on it from your C++ file. You can take a look at the samples under samples\CPP\apps, which use the C++ runtime API.

rick_weber · ‎12-13-2008

That precisely answers my question. Thank you.

Ceq · ‎12-14-2008

gaurav_garg · ‎12-15-2008

Brook+ has to convert user-defined structs into hardware-supported data formats. If you specify these typedefs in a .br file, brcc parses this information and generates methods that help the runtime get information about the formats used in the struct.

So, you have to define these typedefs in a br file -

struct.br-

typedef struct Ostr {
float4 a;
float4 b;
} Odef;

cpp file -

#include "struct.h" // generated header file from .br file

#include "brook/Stream.h"
using namespace brook;

int main(int argc, char *argv[ ] ) {
unsigned int dim[1] = { 1 };
Stream<Odef> s1( 1, dim);
return 0;
}

I hope it helps.

dar · ‎12-16-2008

bug report

Scientific Linux 5.1 (RHEL clone) x86_64

amdstream-cal-1.3.0_beta.x86_64.run does not contain/install lib/ or lib64/ directories and thus does not install the required shared libraries.

gaurav_garg · ‎12-16-2008

Hi dar,

With SDK 1.3, the libraries are shipped with Catalyst. You need to upgrade your drivers to 8.12.

SDK 1.3 Feedback