
SDK 1.3 Feedback

Now that 1.3 has been released to the public, we would like feedback on it in order to further improve future releases of the SDK. We would appreciate your help in providing feedback in this thread so that the information does not get buried in other threads. Please make sure you label each item as a 'Feature Request', 'Bug Report', 'Documentation' or 'Other'. As always, you can send an email to 'streamcomputing@amd.com' for general requests or 'streamdeveloper@amd.com' for development related requests.

If you wish to file a Feature Request, please include a description of the feature request and the part of the SDK that this request applies to.

If you wish to file a Bug Report, please include the hardware you are running on, operating system, SDK version, driver/catalyst version, and if possible either a detailed description on how to reproduce the problem or a test case. A test case is preferable as it can help reduce the time it takes to determine the cause of the issue.

If you wish to file a Documentation request, please specify the document, what you believe is in error or what you believe should be added and which SDK the document is from.

Thank you for your feedback.
AMD Stream Computing Team

0 Likes
78 Replies
dar
Journeyman III

bug report

Scientific Linux 5.1 (RHEL 5.1 clone) x86_64

Problems with int4 and the questioncolon (?:) operator; there are three related issues.

Consider the simple kernel below:

kernel void test_int4_gpu_kern( int n, int4 s_src<>, out int4 s_dst<> )
{

   const int4 zero4 = int4(0,0,0,0);
   int4 imask = int4(n+2,n-2,n,n);
   int4 tmp = s_src;
 
   /* fails */
// tmp = (imask == tmp)? zero4 : tmp;
 
   /* works with brtvector.hpp patch */
   tmp = ((int4)(imask == tmp))? zero4 : tmp;
 
   s_dst = tmp; 
}

First, Brook+ will produce the following error for this kernel:

g++ -O3 -I/usr/local/amdbrook/sdk/include -c test_int4.cpp
/usr/local/amdbrook/sdk/include/brook/CPU/brtvector.hpp: In function ‘T singlequestioncolon(const B&, const T&, const T&) [with T = int, B = int]’:
/usr/local/amdbrook/sdk/include/brook/CPU/brtvector.hpp:479:   instantiated from ‘vec<typename BRT_TYPE::TYPE, LUB<BRT_TYPE::size, tsize>::size> vec<VALUE, tsize>::questioncolon(const BRT_TYPE&, const BRT_TYPE&) const [with BRT_TYPE = __BrtInt4, VALUE = int, unsigned int tsize = 4u]’
test_int4.cpp:18:   instantiated from here
/usr/local/amdbrook/sdk/include/brook/CPU/brtvector.hpp:59: error: request for member ‘questioncolon’ in ‘a’, which is of non-class type ‘const int’
make: *** [test_int4.o] Error 1
rm test_int4.cpp

However, this can be fixed with the following patch:

--- brtvector.hpp~      2008-12-16 12:00:52.000000000 -0500
+++ brtvector.hpp       2008-12-16 12:04:54.000000000 -0500
@@ -58,6 +58,17 @@
                                                           const T&c){
     return a.questioncolon(b,c);
 };
+
+
+
+/* XXX added by DAR */
+template <> inline int singlequestioncolon (const int &a,
+                                              const int &b,
+                                              const int &c) {
+    return a?b:c;
+}
+
+
 template <> inline float singlequestioncolon (const char &a,
                                               const float &b,
                                               const float &c) {

The second issue is that it should not be necessary to perform a cast in

tmp = ((int4)(imask == tmp))? zero4 : tmp;

When the (int4) cast is removed, the following error is generated:

/usr/local/amdbrook/sdk/include/brook/CPU/brtvector.hpp:59: error: request for member ‘questioncolon’ in ‘a’, which is of non-class type ‘const char’

The equivalent expression for float4 does not require the equivalent cast.

The third issue is minor: the compiler produces the warning

test_int4.br(37) : WARN--1: conditional expression must have scalar type. On short vectors, assumes x components as condition
                 Statement: (int4 ) (imask == tmp) in tmp = ((int4 ) (imask == tmp)) ? (zero4) : (tmp)

However, this does not appear correct. For float4 the conditional expression is correctly applied component-wise, and with the patch above, the same is true for int4. In the simple kernel, values are masked out component-wise to zero.
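For reference, the component-wise semantics the patched int4 path follows (matching the existing float4 behaviour) can be sketched in plain C++; `int4`, `eq4` and `select4` here are illustrative stand-ins, not SDK types:

```cpp
#include <array>

// Stand-in for a 4-component integer vector (not the Brook+ type).
using int4 = std::array<int, 4>;

// Scalar helper mirroring what the patch adds to brtvector.hpp:
// for ints, (a ? b : c) is just the plain ternary.
inline int select_scalar(int cond, int b, int c) { return cond ? b : c; }

// Component-wise equality producing a mask (nonzero where equal).
inline int4 eq4(const int4& x, const int4& y) {
    int4 r;
    for (int i = 0; i < 4; ++i) r[i] = (x[i] == y[i]) ? 1 : 0;
    return r;
}

// Component-wise ?: — each output component is chosen independently,
// which is what the masking in the test kernel relies on.
inline int4 select4(const int4& cond, const int4& b, const int4& c) {
    int4 r;
    for (int i = 0; i < 4; ++i) r[i] = select_scalar(cond[i], b[i], c[i]);
    return r;
}
```

With these helpers, `select4(eq4(imask, tmp), zero4, tmp)` zeroes exactly the components where `imask` equals `tmp`, rather than applying a single scalar condition to the whole vector.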

 

0 Likes

Thanks dar for the bug report. I will send this patch to the concerned engineer to look at.

0 Likes
Ceq
Journeyman III

Hi, previously gaurav.garg told me how to use structs in Brook+; however, I've found a problem:


1. Open example in BROOK\samples\CPP\tutorials\SimpleKernel
(I'm using Visual Studio 2005)


2. Edit file "copy.br" and add the following lines:

typedef struct PairRec {
float first;
float second;
} Pair;


3. Rebuild

1>simple_kernel.cpp
1>c:\hd1\brook\samples\cpp\tutorials\simplekernel\brookgenfiles/copy.h(37) : error C2146: syntax error : missing ';' before identifier 'first'
1>c:\hd1\brook\samples\cpp\tutorials\simplekernel\brookgenfiles/copy.h(37) : error C4430: missing type specifier - int assumed. Note: C++ does not support default-int
1>c:\hd1\brook\samples\cpp\tutorials\simplekernel\brookgenfiles/copy.h(37) : error C4430: missing type specifier - int assumed. Note: C++ does not support default-int
...


What's wrong? Is this a bug? Is there any workaround?
0 Likes

Hi Ceq,

 

It looks like a bug in code-generation where a header file required for compilation is missing.

As a workaround you can include "brook\CPU\brtvector.hpp" in simple_kernel.cpp before including "brookgenfiles/copy.h".

 

Hope it helps.

0 Likes

Hi Ceq,

Thanks for pointing out the memory leak issues. It looks like there are memory leaks for reduction kernels. I tried regular kernels and everything looks OK, but reduction kernels show the behavior you mentioned.

0 Likes
Ceq
Journeyman III

0 Likes
nberger
Adept I

BUG REPORT: Kernel calls getting slower
Hi!
With some effort I have managed to move my partial wave analysis framework to the 1.3 SDK and
the good news is that it now works (as opposed to the attempts with the 1.2 and 1.1 versions) and
produces correct results. I found however that if I call a kernel multiple times, it becomes slower.
In an attempt to make sure that the problem is not somewhere with my code, I went to the simpleKernel
example and just placed a loop around the kernel call - and also here the execution time increases for every
additional kernel call. Unfortunately this behavior just about kills my application - any tips for workarounds
or patches are warmly welcome.

Thanks

Nik
0 Likes

0 Likes

Hi nberger,

 

Are you using a reduction kernel? It seems that there are some memory leaks in the reduction path.

0 Likes
nberger
Adept I

No. The scary thing is that this behavior is also seen with the simplest of all kernels, namely the copy kernel from the simpleKernel example, which just does output = input...

Cheers

Nik
0 Likes

These are exactly my problems too.

Slowdown and memory leaks even without reduction kernels.

Using domains changes the behavior a bit, and the slowdown is not so bad now, but the leaks are still there.

 

 

0 Likes
bayoumi
Journeyman III

0 Likes
bayoumi
Journeyman III

if someone has 64-bit RHEL 5.2 Linux, do the inputspeed & outputspeed precompiled binaries under amdcal/bin/lnx64 give consistent results?
0 Likes
bayoumi
Journeyman III

has anyone seen the slowdown with time in Windows XP (32 or 64) as well, or any OS other than Scientific Linux?
0 Likes
Ceq
Journeyman III

Yes bayoumi, I'm working on an application that requires five kernels, two of them reductions.
Each iteration was taking more time: the first one takes around 0.20 seconds, and the last one (iteration 4800) takes more than a second.
I think it is related to the memory leak issues, because at the end of execution it requires more than 1 GB.
Reductions are badly affected, but normal kernels can slow down too.
As a workaround, use as many streams as you can as private function variables, so unneeded streams are destroyed at the end of the function; for me this worked fine.
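The scoping workaround above can be illustrated with plain C++ RAII; `FakeStream` here is a hypothetical stand-in for a Brook+ stream, not an SDK class:

```cpp
#include <cstddef>

// Counts "allocated" streams so the effect of scoping is observable.
static std::size_t g_liveStreams = 0;

// Hypothetical stand-in for brook::Stream: the destructor releases the
// backing resources, so streams declared as function-local variables are
// freed as soon as the function (or an inner { } block) ends instead of
// accumulating for the lifetime of the program.
struct FakeStream {
    FakeStream()  { ++g_liveStreams; }   // acquire (stand-in for allocation)
    ~FakeStream() { --g_liveStreams; }   // release at end of scope
};

void one_iteration() {
    FakeStream temp1, temp2;   // scratch streams local to this iteration
    // ... kernel calls using temp1/temp2 would go here ...
}   // temp1/temp2 destroyed here, before the next iteration allocates again
```

Declaring scratch streams at the narrowest possible scope keeps the number of live allocations bounded by one iteration's worth, instead of growing with the iteration count.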

By the way, I would like to congratulate the AMD people working in the forum and behind the Stream SDK, because 1.3 is a huge improvement.
Previously on 1.2.1 I wasn't able to implement my algorithm, but now I get more than 100x the CPU performance.
I'm working on shallow water system simulation based on a finite volume scheme; when finished I will open another thread to show some results.


QUESTION:
------------------------------------------------------------------------------------------------------------
In a normal kernel, if you use two streams of different sizes, the runtime issues a warning that auto-stride / auto-replication will be deprecated in future versions... is this true? Why? I think it is a nice feature if you work with regular patterns.
0 Likes
bayoumi
Journeyman III

thanks Ceq for your reply. I would expect AMD to post a patch soon.
Can't wait to try the 1.3 version.
BTW, I see you're using XP x64. Which libraries did you use for Brook+ (brook.lib or brook_d.lib), and which option: /MDd, /MTd, /MD or /MT? I always end up with an access violation error at runtime on XP x64 (the SDK binary samples run OK).
0 Likes
rick_weber
Adept II

Bug Report:

Swizzling parts of arrays doesn't seem to work in kernels.

E.g.

kernel void foo(float a<>, out float4 b<>, out float4 c<>)

{

float4 r[2];
float4 tmp1, tmp2;

float val = a*a;
r[0].x = val;
r[0].y = val;
r[0].z = val;
r[0].w = val;
r[1].x = val;
r[1].y = val;
r[1].z = val;
r[1].w = val;

b = r[0];
c = r[1]; 

}

This kernel does not work. However, if I assign val to tmp1 and tmp2 in a similar fashion, the kernel does work.

0 Likes

Hi Rick,

 

Local arrays are not supported in Brook+. I am surprised brcc didn't complain about it.

0 Likes

Then, I guess the real bug is that brcc didn't complain. In actuality, I was using arrays to make for a more terse expression. So as a feature request, macro expansion (i.e. you can use arrays, but they are really just n different variables rather than n contiguous elements in memory) would be useful.

0 Likes
Ceq
Journeyman III

BUG REPORT:

Please have a look at these two kernels: although they do the same thing, the output is different. It looks like the code generated for the condition in the second case is wrong.
By the way, should I compare the value with 0.1f instead of 0.0f to stay on the safe side? (Although the results are the same.)

// Good result
// [ 0 1 2 ] + [ 00 00 00 ] = [ 00 01 02 ]
// [ 3 4 5 ] + [ 00 10 20 ] = [ 03 14 25 ]
// [ 6 7 8 ] + [ 30 40 50 ] = [ 36 47 58 ]

kernel void fun1(float Center<>, float Down[][], out float Sol<>) {
float2 pos = indexof(Center).xy;
Sol = Center;
if(pos.y > 0.0f) { // <----------
float2 dD = { 0.0f, -1.0f };
Sol += Down[pos + dD];
}
}

// Bad result
// [ 00 02 04 ]
// [ 03 14 25 ]
// [ 36 47 58 ]

kernel void fun2(float Center<>, float Down[][], out float Sol<>) {
float2 pos = indexof(Center).xy;
float D = 0.0f;
if(pos.y > 0.0f) { // <----------
float2 dD = { 0.0f, -1.0f };
D = Down[pos + dD];
}
Sol = Center + D;
}

#define streamPrint(_str, _ptr, _x, _y) streamWrite(_str, _ptr); print(#_str, _ptr, _x, _y)

void print(char *name, float *ptr, int x, int y) {
int i, j, pos;
printf("\n%s:\n", name);
for(pos = 0, i = 0; i < y; i++) {
for(j = 0; j < x; j++, pos++)
printf("%6.2f ", ptr[pos]);
printf("\n");
}
}

int main(int argc, char *argv[]) {
const int NUMX = 3;
const int NUMY = 3;
const int SIZE = NUMX * NUMY;
float A[SIZE], B[SIZE], S[SIZE];
int i, j, pos;
for(i = 0; i < SIZE; i++) {
A[ i] = 1.0f * i;
B[ i] = 10.0f * i;
}
{
float sA < NUMY, NUMX > ;
float sB < NUMY, NUMX > ;
float sS < NUMY, NUMX > ;
streamRead(sA, A);
streamRead(sB, B);
fun1(sA, sB, sS);
streamPrint(sS, S, NUMX, NUMY);
fun2(sA, sB, sS);
streamPrint(sS, S, NUMX, NUMY);
}
}
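For comparison, the expected ("good") output can be pinned down with a plain C++ CPU reference; this is a sketch of the intended semantics of fun1, and fun2 should produce exactly the same values, which is what makes the divergence look like a code-generation bug:

```cpp
// CPU reference for the intended semantics of fun1/fun2 (row-major):
//   Sol[y][x] = Center[y][x] + (y > 0 ? Down[y-1][x] : 0)
// since dD = (0, -1) offsets the gather one row up when pos.y > 0.
void fun1_ref(const float* Center, const float* Down, float* Sol,
              int numx, int numy) {
    for (int y = 0; y < numy; ++y)
        for (int x = 0; x < numx; ++x) {
            float d = (y > 0) ? Down[(y - 1) * numx + x] : 0.0f;
            Sol[y * numx + x] = Center[y * numx + x] + d;
        }
}
```

Run against the same A (0..8) and B (0, 10, ..., 80) inputs as the host code above, this reproduces the "good result" matrix [0 1 2; 3 14 25; 36 47 58].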


EDIT:
-----------------------------------------------------------------
Another test, change:
if(pos.y > 0.0f) {
float2 dD = { 0.0f, -1.0f };
...

By the following:
if(pos.x > 0.0f) {
float2 dD = { -1.0f, 0.0f };
...

And both cases will fail, returning some negative values.


EDIT2:
-----------------------------------------------------------------
Using the new gather array notation will fix the problem
gstream[posx][posy];


EDIT3:
-----------------------------------------------------------------
...but it looks like not always: it fails if you use this typedef
struct as the gather type (even changing the field order changes
the results).

typedef struct float5S {
float dt;
float4 fl;
} float5;

kernel void fun3(float5 C<>, float5 L[][], float5 U[][],
out float dt<>, out float4 dvar<>) {
int2 pos = instance().xy;
dvar = float4(0.0f, 0.0f, 0.0f, 0.0f);
dt = C.dt;
if(pos.x > 0) {
float5 datL = L[pos.y][pos.x - 1];
dvar -= datL.fl;
dt += datL.dt;
}
if(pos.y > 0) {
float5 datU = U[pos.y - 1][pos.x];
dvar -= datU.fl;
dt += datU.dt;
}
}
0 Likes
lust
Journeyman III

 

This is the code that I use for some testing. The kernel calls take more and more time each frame.

 



kernel void krnShitIntersectTriangle( float3 rayOrigs<>,
                                      float3 rayDirs<>,
                                      out float4 outHits<> )
{
    float3 v0 = float3(0.f,0.f,0.f);
    float3 v1 = float3(100.f,0.f,0.f);
    float3 v2 = float3(0.f,100.f,100.f);
    float3 rayOrigin = rayOrigs;
    float3 rayDir = rayDirs;
    float4 currentHit = float4(9999999.0f, -1.f, -1.f, -1.f );

    float3 edge1 = v1 - v0;
    float3 edge2 = v2 - v0;

    float3 tvec = rayOrigin - v0;

    float3 qvec = cross( tvec, edge1 );
    float3 pvec = cross(rayDir, edge2);
    float det = dot(edge1, pvec);
    float inv_det = 1.0f / det;
    float value1, value2;

    float4 triangHit;

    triangHit.x = dot( edge2, qvec ) * inv_det;
    triangHit.z = dot( rayDir, qvec ) * inv_det;
    triangHit.y = dot( tvec, pvec ) * inv_det;
    triangHit.w = 0.0f;

    outHits = currentHit;

    value2 = (triangHit.x <= currentHit.x) && (triangHit.z >= 0.0f) && (triangHit.y >= 0.0f) && (triangHit.x >= 0.0f) && ((triangHit.y + triangHit.z) <= 1.0f);

    if( value2 )
    {
        outHits = triangHit;
    }
}

I keep the streams as members of a class:

Stream<float3> _origins;
Stream<float3> _dirs;
Stream<float4> _hits;

Since there is no default constructor, unlike 1.2.1, I construct them with a small size, and later on assign them like this: _dirs = Stream( rank2, dimsWH );

Each time I measure the kernel execution times they get bigger and bigger:

const int MAX_ITERS = 645;
DWORD timeATStart;
static float timeDurationF[MAX_ITERS];

for( int i=0; i < MAX_ITERS; i++ )
{
    PerfCounter0.Start();
    krnShitIntersectTriangle( traceContextGPU._origins, traceContextGPU._dirs, traceContextGPU._Hits );

    krnShadeNdotL_x3( traceContextGPU._origins, traceContextGPU._dirs, traceContextGPU._Hits,
                      _treeFaces_x3,
                      traceContextGPU._colors );

    PerfCounter0.Stop();
    timeDurationF[i] = PerfCounter0.GetElapsedTime();
    PerfCounter0.Reset();
}

 

timeDurationF[i] grows, for example:

0.000175, 0.000213, 0.000222, 0.000230, 0.000238, 0.000249, 0.000257 ...

This is for 128x128 rays. If I set the resolution to 1024x1024, the first 100 times the execution time is less than 0.0009xx, then it jumps to 0.0xxxx and then stays like 0.0xxxxx , which is bad IMHO. I do not know whether this is due to the address virtualization, or I need to think of some load balancing strategy.

This is new to 1.3, previous version was fine.

I am running this test on WindowsXP x64, using x64 build target with VS2005.

 Any idea why this is so and how to cure would be very warmly greeted



0 Likes
Ceq
Journeyman III

0 Likes

Hi All,

Thanks for pointing out all the slow-down issues. The issue is that Brook+ 1.3 uses a kind of caching for different execution events.

Calling a kernel in a big for loop exposes it. As a workaround you should call error() on an output stream after the kernel call. I have tested the bug report sent by nberger:

for(int j=0; j < 10; j++){
    clock_t before = clock();
    for(int i=0; i < 1000; i++){
        copy(inputStream, outputStream);
    }
    outputStream.error();
    clock_t after = clock();
    cout << "1000 Calls: " << (after-before) << " ticks = " << (float)(after-before)/(float)CLOCKS_PER_SEC << " s" << endl;
}

 

To fix the slowdown, change it to:

for(int j=0; j < 10; j++){
    clock_t before = clock();
    for(int i=0; i < 1000; i++){
        copy(inputStream, outputStream);
        outputStream.error(); // Change here
    }
    clock_t after = clock();
    cout << "1000 Calls: " << (after-before) << " ticks = " << (float)(after-before)/(float)CLOCKS_PER_SEC << " s" << endl;
}

Let me know if you still face any issues. I have filed a bug for this and it should be fixed in the next release.

0 Likes
Ceq
Journeyman III

In the last post of the previous page I wrote a bug report (I've fixed two copy-paste errors now).
I would be very grateful if somebody at AMD could check whether I'm doing something wrong
or confirm it is really a bug, because I'm in a hurry due to a deadline for my project.
Not using structs in gathers forces me to use too many base-type streams (hence too many in-kernel
memory fetches, which hurt performance), and the compiler warns that the kernel required two passes.

Thanks
0 Likes

It looks like a bug on the Brook+ side.

As a side note, it's better not to use structs in Brook+. Brcc expands these structs into multiple base-type streams, so you can't save on memory fetches. On top of that, you incur some performance overhead during data transfer when using structs, as the runtime has to transfer data to different base streams and copy it element by element.
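As a rough illustration (not brcc's actual code) of the expansion described above, assuming a two-field struct like the Pair example from earlier in the thread, the runtime effectively has to split an array of structs into one base-type array per field, element by element, before transfer:

```cpp
#include <cstddef>
#include <vector>

// Illustrative struct (mirrors the Pair typedef earlier in the thread).
struct Pair { float first; float second; };

// Sketch of the AoS -> SoA split a struct stream implies: one base-type
// buffer per field, filled with a per-element copy. This per-element
// traffic is the transfer overhead described above.
void split_aos_to_soa(const std::vector<Pair>& in,
                      std::vector<float>& firsts,
                      std::vector<float>& seconds) {
    firsts.resize(in.size());
    seconds.resize(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        firsts[i]  = in[i].first;
        seconds[i] = in[i].second;
    }
}
```

So a single struct stream still becomes multiple base streams under the hood, which is why using structs does not reduce the number of in-kernel fetches.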

0 Likes
Ceq
Journeyman III

that was a really fast answer, thanks a lot for the notes on structs gaurav!
0 Likes
bayoumi
Journeyman III

gaurav,
I confirm the workaround works with 64-bit Linux SL5.2, with a FireStream 9170 & driver 8.561.
Questions:
1. Are there any similar slowdown issues with CAL? I had inconsistent results (including inf) when running the inputspeed & outputspeed, input_IL, output_IL precompiled binaries on the same platform.
2. Is there a performance penalty (CPU-GPU transfer overheads) when using outstream.error()?
3. Do we need inputstream.error() inside the loop before the kernel call?
0 Likes

1. Can you post your command line options and the results?

2. The error() call synchronizes all the pending events associated with the stream. It doesn't have any data transfer overhead.

3. inputstream.error() would just synchronize streamRead; there are no issues with the data transfer synchronization implementation, so you need not call error() on input streams.

As a side note, error() is a very useful API for finding any issues with your stream, and in case there is an error you can check errorLog() on the stream.

0 Likes
bayoumi
Journeyman III

thank you for your reply.
Here is the case for the CAL SDK 1.3 precompiled binaries (BTW, the precompiled binaries for XP x64 give consistent results):
location: /usr/local/amdcal/bin/lnx64
64-bit Linux SL5.2, with FireStream 9170 & driver 8.561
terminal output:

[lnx64]$ exportspeed
Supported CAL Runtime Version: 1.3.145
Found CAL Runtime Version: 1.3.145
Program: exportspeed Kernel System
WxH In-Out Src Dst Iter GB/sec GB/sec
256x 256 1 1 4 4 2 inf 0.15
256x 256 1 2 4 4 2 2.93 0.21
256x 256 1 3 4 4 2 7.81 0.29
256x 256 1 4 4 4 2 inf 0.38
256x 256 1 5 4 4 2 2.93 0.37
256x 256 1 6 4 4 2 6.84 0.51
256x 256 1 7 4 4 2 7.81 0.54
256x 256 1 8 4 4 2 8.79 0.61

Press enter to exit...

--------------------------------------------------------------------------------
$ inputspeed
Supported CAL Runtime Version: 1.3.145
Found CAL Runtime Version: 1.3.145
Program: inputspeed Kernel System
WxH In-Out Src Dst Iter GB/sec GB/sec
256x 256 1 1 4 4 2 3.91 0.13
256x 256 2 1 4 4 2 inf 0.18
256x 256 3 1 4 4 2 7.81 0.20
256x 256 4 1 4 4 2 inf 0.23
256x 256 5 1 4 4 2 11.72 0.27
256x 256 6 1 4 4 2 13.67 0.29
256x 256 7 1 4 4 2 inf 0.33
256x 256 8 1 4 4 2 17.58 0.34
256x 256 9 1 4 4 2 9.77 0.36
256x 256 10 1 4 4 2 21.48 0.36
256x 256 11 1 4 4 2 23.44 0.39
256x 256 12 1 4 4 2 25.39 0.40
256x 256 13 1 4 4 2 13.67 0.41
256x 256 14 1 4 4 2 29.30 0.41
256x 256 15 1 4 4 2 15.62 0.42
256x 256 16 1 4 4 2 33.20 0.44

Press enter to exit...
--------------------------------------------------------------------------------
$ input_IL
Supported CAL Runtime Version: 1.3.145
Found CAL Runtime Version: 1.3.145
Program: input_IL Kernel System
WxH In-Out Src Dst Iter GB/sec GB/sec
256x 256 1 1 4 4 2 inf 0.13

Press enter to exit...
--------------------------------------------------------------------------------
$ output_IL
Supported CAL Runtime Version: 1.3.145
Found CAL Runtime Version: 1.3.145
Program: output_IL Kernel System
WxH In-Out Src Dst Iter GB/sec GB/sec
256x 256 0 1 4 4 2 inf 0.07

Press enter to exit...

Thanks
0 Likes
titanius
Adept II

Bug Report (perhaps) in brook+ Samples

I am using a 4830 on a Core2 Duo machine with 4GB RAM on Debian 64-bit, with the latest driver and SDK. I know neither the 4830 nor Debian is officially supported but...

So I am able to run all the Brook+ examples for a few iterations, but when I try to run them for more iterations, say 100 or even 20, I end up getting the following error:

"Error occured
Kernel Execution : Uninitialized or Allocation failed Input streams.
Stream Write : Uninitialized stream"

It happens with all the matmult samples for larger sizes like 1024 or so. For sizes like 512 I can go up to 50 iterations.

Other optimization feature in CAL samples

So I am trying to find the best mat x mat mult code (including by using sgemm or dgemm). The CAL simple_matmult is really fast (320 Gflops vs 200 Gflops via sgemm), but the bottleneck in that CAL sample seems to be the way the data is copied between CPU and GPU: copyTo called via copyToGPU and copyFrom called via copyFromGPU (all in amdcal/samples/common/Samples.cpp).

Right now the data seems to be copied iteratively, to and fro, so that padding is preserved. Perhaps restructuring the data in memory before copying might speed things up quite a bit.
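The suggested restructuring could look something like this sketch, assuming row-major data with a per-row pitch as padded allocations have (the function name and signature are hypothetical, not the actual Samples.cpp helpers):

```cpp
#include <cstddef>
#include <cstring>

// Instead of copying element by element, pack each padded row into a
// contiguous buffer with one memcpy per row before the CPU<->GPU transfer.
//   padded      - source buffer with `pitchFloats` floats per row
//   contiguous  - destination with exactly `width` floats per row
void unpad_rows(const float* padded, std::size_t pitchFloats,
                float* contiguous, std::size_t width, std::size_t height) {
    for (std::size_t y = 0; y < height; ++y)
        std::memcpy(contiguous + y * width,      // dense destination row
                    padded + y * pitchFloats,    // padded source row
                    width * sizeof(float));      // copy only the payload
}
```

One bulk copy per row (rather than per element) keeps the padding intact on the padded side while handing the transfer path a single contiguous block.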

Documentation feature inclusion

Is it possible to include an explanation of the swizzle functionality in the computing guide? It can be found elsewhere on the web ( http://www.nada.kth.se/~tomaso/Stream2008/M3.pdf ), but it appears as an abrupt jump in the guide, as there is no explanation of what swizzling does.

 

thanks.

 

0 Likes
nberger
Adept I

Quick question on the .error() workaround: If I have multiple output streams, do I have to call .error() on all of them?
Thanks
Nik
0 Likes

Calling it on any one output stream is fine.

0 Likes
nberger
Adept I

Thanks for the quick answer. Now things are working fine...
0 Likes
Jetto
Journeyman III

Hello,

I am trying the SDK on Ubuntu 8.10 amd64 on a Q6600 with an HD 4850 512MB.

I use standard libxcb-xlib so I have the annoying "locking assertion failure" backtrace.

I have some timing results from the Brook+ sample code that do not look consistent:

$ ./mandelbrot -p -q
Width   Height  Iterations      CPU Total Time  GPU Total Time  Speedup        
64      64      1               0.000000        0.010000        0.000000

oops CPU is faster

$ ./mandelbrot -p -q -i 1000 2>/dev/null
Width   Height  Iterations      CPU Total Time  GPU Total Time  Speedup        
64      64      1000            0.105000        0.313000        0.335463

humm CPU is still faster

./mandelbrot -p -q -i 10000 2>/dev/null
Width   Height  Iterations      CPU Total Time  GPU Total Time  Speedup        
64      64      10000           1.045000        23.088000       0.045262

OMG how can we explain that ?

If I use a larger matrix it's better:

$ ./mandelbrot -p -x 1024 -y 1024 -q.
Width   Height  Iterations      CPU Total Time  GPU Total Time  Speedup        
1024    1024    1               0.023000        0.010000        2.300000

This is OK, but

$  ./mandelbrot -p -x 8192 -y 8192 -i 10 -q
Width   Height  Iterations      CPU Total Time  GPU Total Time  Speedup        
8192    8192    10              14.849000       0.001000        14849.000000

The GPU became 100 times faster!

./mandelbrot -e -p -x 8192 -y 8192 -i 1 -q
-e Verify correct output.
Computing Mandelbrot set on CPU ... Done
./mandelbrot: Failed!

Hmm, maybe the matrix is too big.

I use the binaries from the SDK and did not try to compile them myself.

BR.

0 Likes

 

$  ./mandelbrot -p -x 8192 -y 8192 -i 10 -q Width   Height  Iterations      CPU Total Time  GPU Total Time  Speedup         8192    8192    10              14.849000       0.001000        14849.000000

 

GPU became 100 time faster !

 

./mandelbrot -e -p -x 8192 -y 8192 -i 1 -q -e Verify correct output. Computing Mandelbrot set on CPU ... Done ./mandelbrot: Failed!

 

Humm maybe the matrix is too big

 

I think you are running the examples from the legacy folder. Try running the CPP samples; they have error checking on streams, and in case Brook+ is not able to allocate a stream on the GPU, they will show an error rather than these false numbers.

0 Likes

I noticed that dcl_resource_id(...) statements that are commented out in IL kernels are actually *interpreted* by the CAL compiler. How to reproduce: write a kernel with a commented out dcl_resource statement, and run calCtxRunProgram() without defining i0:

; dcl_resource_id(0)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)

Notice how calCtxRunProgram() will return an error "Symbol "i0" used as INPUT does not have a memory association.".

Another bug: calGetErrorString() returns strings that are prematurely truncated. For example, while debugging the above problem, printf("[%s]", calGetErrorString()) was displaying

[Symbol "]

After dumping the memory around that string, I noticed that it was actually

[Symbol "\x00i0\x00" used as \x00INPUT\x00 does not have a memory association.]

with 4 NUL bytes around "i0" and "INPUT". My platform is 64-bit linux if that matters...
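The truncation mechanism is easy to reproduce in isolation: C string handling stops at the first NUL byte, so any %s/strlen-based path shows only the prefix before the first embedded NUL:

```cpp
#include <cstddef>
#include <cstring>

// strlen (like printf's %s) reads only up to the first '\0', so a buffer
// containing embedded NULs appears truncated at that point.
inline std::size_t visible_length(const char* s) {
    return std::strlen(s);   // stops at the first NUL byte
}
```

A buffer laid out like the dumped error message above, with NUL bytes surrounding the substituted words, therefore prints as just `Symbol "` even though the full text is present in memory.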

 

0 Likes

The CAL compiler interprets dcl_resource implicitly if any sampling instruction is used. Comment out the sampling instruction as well and it should not give an error for symbol "i0".

0 Likes

There was no sampling instruction in my test case. I still got an error for symbol "i0".
0 Likes
Jetto
Journeyman III

 /usr/local/amdbrook/samples/bin/legacy/lnx_x86_64/haar_wavelet  -e -i 2 -t -p -q 2>/dev/null
Width   Height  Iterations      GPU Total Time 
64      64      2               0.031000       

-e Verify correct output.
Computing Haar Wavelet Transform on CPU ... Done
/usr/local/amdbrook/samples/bin/legacy/lnx_x86_64/haar_wavelet: Failed!

-p Compare performance with CPU.
Width   Height  Iterations      CPU Total Time  GPU Total Time  Speedup        
64      64      2               0.000000        0.031000        0.000000

It's ok with -i 1

0 Likes