Now that 1.4 has been released to the public, we would like feedback on it in order to further improve future releases of the SDK. We would appreciate your help in providing feedback in this thread so that the information does not get buried in other threads. Please make sure you label each item as a 'Feature Request', 'Bug Report', 'Documentation' or 'Other'. As always, you can send an email to 'streamcomputing@amd.com' for general requests or 'streamdeveloper@amd.com' for development related requests.
If you wish to file a Feature Request, please include a description of the feature request and the part of the SDK that this request applies to.
If you wish to file a Bug Report, please include the hardware you are running on, operating system, SDK version, driver/Catalyst version, and if possible either a detailed description of how to reproduce the problem or a test case. A test case is preferable, as it can reduce the time it takes to determine the cause of the issue.
If you wish to file a Documentation request, please specify the document, what you believe is in error or what should be added, and which SDK the document is from.
Thank you for your feedback.
AMD Stream Computing Team
Other:
Does HD 4770:
1) work with SDK 1.4?
2) have double precision (FP64) support?
The 4770 will get the official supported designation in June/July. It is really a matter of getting it made part of our testing matrix. Unofficially, I haven't heard of any issues from our engineering team about it, though.
It will have DPFP support and will be annotated (or rather not annotated since we only mark the cards without DPFP support) appropriately on the sys req page for the SDK.
Michael.
If I have code:
::brook::Stream<float>* divG;
which holds some data, and I want to make a temp variable with the same data. If I write
::brook::Stream<float> *temp=new ::brook::Stream<float>(*divG);
which I thought was a copy constructor, I can't use this temp variable in a call like this:
calculate_and_add_divergence_gpu_ati(cols,rows,*Gx,*Gy,*temp,*divG)
where divG is output stream. The runtime reports the error: "Input stream is the same as the output stream."
I get the same error even after code like this:
temp->assign(divG);
or
*temp=*divG;
If I could just get the dimension of divG, I could make a temp variable like
::brook::Stream<float> *temp = new ::brook::Stream<float>(1, &dimension);
and then use the assign method.
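A host-side sketch of that workaround, assuming the application tracks the dimension itself (here cols*rows, the size divG was presumably created with), since Stream cannot report it:

```cpp
// Assumption: cols and rows are the sizes divG was originally created with.
unsigned int dimension = cols * rows;
::brook::Stream<float>* temp = new ::brook::Stream<float>(1, &dimension);
temp->assign(divG);   // copy the data into a genuinely distinct stream
calculate_and_add_divergence_gpu_ati(cols, rows, *Gx, *Gy, *temp, *divG);
```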
Here is the kernel which I am calling:
kernel void calculate_and_add_divergence_gpu_ati(int cols, int rows, float Gx[], float Gy[], float temp<>, out float divG<>) {
float divGx, divGy;
int idx=instance().x;
int ky = idx % cols;
int kx=idx-ky*cols;
if(kx == 0)
divGx = Gx[idx];
else
divGx = Gx[idx] - Gx[idx-1];
if(ky == 0)
divGy = Gy[idx];
else
divGy = Gy[idx] - Gy[idx - cols];
divG = temp + divGx + divGy;
}
So a simple getDimensions method on the Stream class would be really nice.
Thank you for your support.
Bug Report: brcc/brcc_d (Linux/Windows XP SP3) takes infinite time to compile the following program instead of reporting a compile error.
The Stream KernelAnalyzer reports "Error - Brook+ timed out whilst compiling.".
kernel float func(float i[][]) {
int2 pos = instance().xy;
return i[pos.y][pos.x];
}
kernel void func2(float i[][], out float o<>) {
o = func(i);
}
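A possible workaround while this is open (an untested guess): pass the fetched element instead of the whole gather array, so the inner kernel no longer takes a 2D gather parameter:

```
kernel float func(float v) {
    return v;
}

kernel void func2(float i[][], out float o<>) {
    int2 pos = instance().xy;
    o = func(i[pos.y][pos.x]);
}
```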
Feature Request: make the counters more accessible. It makes no sense that anyone who wants to use the counters has to recode every line of code from the sample.
When will we be able to use the ATI Stream SDK with gcc-4.3 on x64 Linux (Ubuntu 9.04 / Fedora 11 / openSUSE 11.*) or OpenSolaris?
Documentation: Can we please have better documentation? We've been asking for this for almost a year now, and while it has gotten better, it's still not where it needs to be by any extent.
Feature Request: 4x 4870x2 support under Linux. 2x 4870x2 works, but half my hardware is still sitting in boxes. If I have to start 2 X servers with 2 cards each, that would be ok, but presently fglrx only supports 1 X server.
Feature Request: CAL without X (like CUDA standalone module) for a GPU cluster.
Feature request:
Not allowing local arrays inside a kernel turns out to be a big problem for certain applications I am writing. Are there plans to support local array creation in Brook+ soon? Or are there ways to circumvent this limitation?
Also: make the AMD HLSL compiler in the SDK, with IL as output, reachable from code.
Feature request: possibility to use without a running X server.
Having to configure X on a dedicated compute server makes things really messy; and if you accidentally use "ssh -X", you are left wondering why things go wrong.
[Edit:]
Quoting from "Running ATI Stream Applications Remotely" (ATI Knowledge base):
"If your X server console is not the active console, your remote CAL applications wait for the X server to become active again."
You can't really be serious about that?!
I'm currently planning to set up a GPU cluster for CFD computations, and that misfeature is definitely going to influence my decision on which hardware to buy.
BUG
kernel void Generate_v_alt_full (out double v_alt<>, double orig_x[][], int aout_full[] )
{
int2 ind = instance().xy;
v_alt = orig_x[ind.y][aout_full[ind.x]];
}
The above compiles fine.
kernel void Generate_v_alt_full (out double v_alt<>, double orig_x[][], int aout_full[] )
{
int2 ind = instance().xy;
v_alt = orig_x[aout_full[ind.x]][ind.y];
}
Interchanging the indices spews a bunch of parser errors:
While processing <buffer>:176
In compiler at zzerror()[parser.y:112]
message = syntax error
ERROR: Parse error. Invalid expression.
While processing <buffer>:176
In compiler at zzparse()[parser.y:301]
(yyvsp[0]).ToString() = ")"
Aborting...
Problem compiling brookgenfiles/misc_kernels_Copy_Stream_kv.hlsl:
Error--:cal back end failed to compile kernel "Copy_Stream_kv"
NOTICE: Parse error
In compiler at zzerror()[parser.y:112]
message = syntax error
ERROR: Parse error. Expected declaration.
In compiler at zzparse()[parser.y:198]
(yyvsp[0]) = ")"
Declaring a temp variable for aout_full[ind.x] and substituting it in the last kernel makes the parse errors go away.
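For reference, the rewritten kernel with the temp variable just described:

```
kernel void Generate_v_alt_full (out double v_alt<>, double orig_x[][], int aout_full[])
{
    int2 ind = instance().xy;
    int row = aout_full[ind.x];      // temp variable for the inner gather
    v_alt = orig_x[row][ind.y];      // no parse error in this form
}
```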
SDK 1.4, Debian Unstable x64.
Better documentation,
easier multi-GPU API / automatic scalability over multiple GPUs,
kernel call spooling on the hardware, so repeated kernel calls don't take that long; for example:
for(int epoch = 0; epoch < num_of_epoch;epoch++)
{
for(int i = 0; i < (int) yB; i++)
{
Stream<float4> myu_min(rank[2], streamSizeMinOfVecCluster);
Stream<float4> myu_max_of_min(rank[2], streamSizeMaxOfMin);
myufy(i,fuzzy_number,vec_ref,myu);
minimum_myu_cluster(myu,myu_min);
max_of_min_myu(myu_min,myu_max_of_min);
}
}
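Until spooling exists, one partial mitigation (a sketch under the assumption that the stream sizes are loop-invariant, as they appear to be above) is hoisting the stream construction out of the loops so GPU buffers aren't reallocated every iteration; the kernel-launch overhead itself remains:

```
Stream<float4> myu_min(rank[2], streamSizeMinOfVecCluster);   // constructed once
Stream<float4> myu_max_of_min(rank[2], streamSizeMaxOfMin);
for (int epoch = 0; epoch < num_of_epoch; epoch++)
{
    for (int i = 0; i < (int) yB; i++)
    {
        myufy(i, fuzzy_number, vec_ref, myu);
        minimum_myu_cluster(myu, myu_min);
        max_of_min_myu(myu_min, myu_max_of_min);
    }
}
```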
My BIG WISH for 1.5 is global thread synchronization.
Because the time to launch a simple function on the GPU via ATI Stream is very long, ten times slower than launching a function on the GPU via CUDA.
Because the time to launch a simple function on the GPU via ATI Stream is very long, ten times slower than launching a function on the GPU via CUDA.
The main reason for this overhead is the online compilation of the generated IL at kernel-call time. nvcc, on the other hand, directly generates GPU assembly and uses it in the kernel call.
A better support for offline compilation should be added in CAL.
Originally posted by: MicahVillmow Gaurav, This is already possible in CAL with calImageRead and calclImageWrite. You can compile an IL program to a calImage, write that image to a CALvoid buffer, and then save it offline. The only issue is that we do not use a custom compilation step to do this, and a separate tool needs to be written that does this. The samples used to do this around version 0.9 or beta 1.0 of CAL.
When it will be integrated with Brook+ compile step? I don't want to rewrite my code in CAL.
Originally posted by: MicahVillmow
Gaurav,
This is already possible in CAL with calImageRead and calclImageWrite. You can compile an IL program to a calImage, write that image to a CALvoid buffer, and then save it offline. The only issue is that we do not use a custom compilation step to do this, and a separate tool needs to be written that does this. The samples used to do this around version 0.9 or beta 1.0 of CAL.
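A sketch of the flow Micah describes; the exact parameter lists are assumptions based on the cal/calcl headers of that era, so treat this as an outline rather than a drop-in tool:

```c
/* Offline step: compile the IL once and persist the binary image. */
CALobject obj;
CALimage  image;
calclCompile(&obj, CAL_LANGUAGE_IL, ilSource, target);   /* IL -> object */
calclLink(&image, &obj, 1);                              /* object -> image */
calclImageWrite(fileBuffer, imageSize, image);           /* save image to a buffer */

/* Runtime step: load the precompiled image, skipping online compilation. */
calImageRead(&image, fileBuffer, imageSize);
calModuleLoad(&module, ctx, image);
```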
Multi-monitor and CrossFire combined.
Get it fixed, or else.
Just wanted to report that with the 1.4 SDK, performance of BarsWF dropped by a factor of 2.5.
That means I gave up on this, and the distributed version will support nVidia only.
That is really sad, because AMD cards showed much higher potential (up to 2.6 billion hashes per second on a 5870). nVidia needed at least 4 chips to get there.
Originally posted by: BarsMonster Just wanted to report that with the 1.4 SDK, performance of BarsWF dropped by a factor of 2.5.
That means I gave up on this, and the distributed version will support nVidia only.
Which version were you using previously? Is it possible for you to find out which part of the code is showing the drop in performance? If you can point to 1 or 2 kernels, we can try to help you.
Originally posted by: gaurav.garg Originally posted by: BarsMonster Just wanted to report that with the 1.4 SDK, performance of BarsWF dropped by a factor of 2.5.
That means I gave up on this, and the distributed version will support nVidia only.
Which version were you using previously? Is it possible for you to find out which part of the code is showing the drop in performance? If you can point to 1 or 2 kernels, we can try to help you.
1.2 or 1.3... I don't remember; the one requiring Catalyst <= 8.12.
I have only one kernel, and it's pretty huge; it takes around 2 seconds for the first launch (i.e. compilation), and then runs for ~150 ms (the amount of work is varied to reach that timing). So it's not a "1 ms kernel speed" issue.
I probably need to downgrade the display driver & SDK, compare the CAL sources, and double-check the performance increase in the older version.
Another option is to migrate it to OpenCL.
I've already done the OpenCL migration for nVidia, and the speed drop was under 5%.
If AMD's OpenCL is not much worse, it might be the best option.
There are a few things that are slower with the Brook+ 1.3 or 1.4 SDK. If you can avoid those features, you can get much better performance.
1. Are you using the stream re-size feature? i.e., is there any size mismatch between the stream arguments of your kernel?
2. Are you using the domain operator?
3. Are you using a scatter stream in your kernel?
Is it possible for you to post the kernel signature and the kernel call from the host?
Also, the size of each stream passed to the kernel?
Originally posted by: gaurav.garg There are a few things that are slower with the Brook+ 1.3 or 1.4 SDK. If you can avoid those features, you can get much better performance.
1. Are you using the stream re-size feature? i.e., is there any size mismatch between the stream arguments of your kernel?
2. Are you using the domain operator?
3. Are you using a scatter stream in your kernel?
Is it possible for you to post the kernel signature and the kernel call from the host?
Also, the size of each stream passed to the kernel?
1) No
2) No
3) No
kernel void hello_brook_check(int k1, int k2, int k3, int k4, int len, int _ta, int _tb, int _tc, int _td, int charset[], int charset_len, out int output<> )
Launched as simple as
hello_brook_check(data->data_h[0], data->data_h[1], data->data_h[2], data->data_h[3], perm::pwd_len8, g->hash_i[0], g->hash_i[1], g->hash_i[2], g->hash_i[3], data->charset[0], perm::charset_len, data->output[0]);
output stream size is 8000
charset size is ~80.
What the kernel does is just calculate around 5*5,000 iterations of MD5 and compare the resulting hash with a target value (_ta, _tb, _tc, _td).
In the old version there were also 20 more input streams, but I optimized them away with almost no performance boost.
The CPU part (preparing work units) is able to give out ~120 times more work; there were no changes in the CPU part, so I doubt it got any slower.
Install the previous version of Brook+ on the existing Catalyst and see if you get any performance improvement. It will help us track down whether the issue is due to Brook+ changes or Catalyst changes.
Originally posted by: gaurav.garg Install the previous version of Brook+ on the existing Catalyst and see if you get any performance improvement. It will help us track down whether the issue is due to Brook+ changes or Catalyst changes.
Well, the previous Brook+ required Catalyst 8.12 or earlier (because of the DLL renaming).
Some guys on my forum were able to hack newer drivers to work with the older Brook+, but I don't think it's reliable 🙂
I am not asking you to ship it as a product, but to test it so that we can find the main cause of the performance drop. You can rename aticalrt.dll and aticalcl.dll (available under Windows/system32 and Windows/SysWOW64) to amdcalrt.dll and amdcalcl.dll.
After looking at the .il code, I've realized that 2/3 of the instructions are movs.
With that many registers on the card, I don't see why the compiler would use mov that often :-S
Ahh, it is supposed to be non-optimized, I see...
Originally posted by: MicahVillmow BarsMonster, Please look at the ISA and not the IL, as the underlying CAL compiler will optimize the unnecessary move instructions away.
Unfortunately I cannot check what the ISA looks like, because the Stream KernelAnalyzer causes brcc to crash (probably because SKA is 1.3 and the SDK is 1.4). Is there any other way to check the ISA code?
The CAL sampler parameter extensions are documented in the Stream 2.0 Computing User Guide, but the function pointer types and relevant enumerations aren't in the cal_ext.h header file. (I checked the 1.4 and 2.0 Stream SDK releases, and neither one has the sampler extension info in cal_ext.h.)
The functions seem to be supported in the CAL libraries - if I add one to the last CALextid in the SDK header file cal_ext.h and call calExtGetProc() with "calCtxGetSamplerParams" or "calCtxSetSamplerParams", calExtGetProc() returns successfully. Calls to the returned function pointers seem to work as I would expect (although I've only tried Windows XP). So the fix might only require additions to the header files.
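For anyone hitting the same gap, a sketch of the lookup described above; the extension id value and the function-pointer signature here are assumptions (precisely the pieces missing from cal_ext.h), so treat them as placeholders rather than official definitions:

```c
/* Assumed extension id -- the real value is exactly what's missing from cal_ext.h. */
#define CAL_EXT_SAMPLER_PARAMS ((CALextid)0x8009)

/* Assumed signature; the Stream 2.0 Computing User Guide documents the real parameters. */
typedef CALresult (*PFNCALCTXGETSAMPLERPARAMS)(CALcontext ctx, CALname name,
                                               CALuint param, CALvoid* value);

PFNCALCTXGETSAMPLERPARAMS pCalCtxGetSamplerParams = NULL;
if (calExtGetProc((CALextproc*)&pCalCtxGetSamplerParams,
                  CAL_EXT_SAMPLER_PARAMS,
                  "calCtxGetSamplerParams") == CAL_RESULT_OK) {
    /* The returned pointer is callable, as observed above. */
}
```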
Jeremy Furtek