Now that 1.3 has been released to the public, we would like feedback on it in order to further improve future releases of the SDK. We would appreciate your help in providing feedback in this thread so that the information does not get buried in other threads. Please make sure you label each item as a 'Feature Request', 'Bug Report', 'Documentation' or 'Other'. As always, you can send an email to 'streamcomputing@amd.com' for general requests or 'streamdeveloper@amd.com' for development related requests.
If you wish to file a Feature Request, please include a description of the feature request and the part of the SDK that this request applies to.
If you wish to file a Bug Report, please include the hardware you are running on, operating system, SDK version, driver/catalyst version, and if possible either a detailed description on how to reproduce the problem or a test case. A test case is preferable as it can help reduce the time it takes to determine the cause of the issue.
If you wish to file a Documentation request, please specify the document, what you believe is in error or what you believe should be added and which SDK the document is from.
Thank you for your feedback.
AMD Stream Computing Team
Bug Report
Scientific Linux 5.1 (RHEL 5.1 clone) x86_64
Problems with int4 and the ?: operator (questioncolon) - there are three related issues.
Consider the simple kernel below:
kernel void test_int4_gpu_kern( int n, int4 s_src<>, out int4 s_dst<> )
{
const int4 zero4 = int4(0,0,0,0);
int4 imask = int4(n+2,n-2,n,n);
int4 tmp = s_src;
/* fails */
// tmp = (imask == tmp)? zero4 : tmp;
/* works with brtvector.hpp patch */
tmp = ((int4)(imask == tmp))? zero4 : tmp;
s_dst = tmp;
}
First, Brook+ produces the following error for this kernel:
g++ -O3 -I/usr/local/amdbrook/sdk/include -c test_int4.cpp
/usr/local/amdbrook/sdk/include/brook/CPU/brtvector.hpp: In function ‘T singlequestioncolon(const B&, const T&, const T&) [with T = int, B = int]’:
/usr/local/amdbrook/sdk/include/brook/CPU/brtvector.hpp:479: instantiated from ‘vec<typename BRT_TYPE::TYPE, LUB<BRT_TYPE::size, tsize>::size> vec<VALUE, tsize>::questioncolon(const BRT_TYPE&, const BRT_TYPE&) const [with BRT_TYPE = __BrtInt4, VALUE = int, unsigned int tsize = 4u]’
test_int4.cpp:18: instantiated from here
/usr/local/amdbrook/sdk/include/brook/CPU/brtvector.hpp:59: error: request for member ‘questioncolon’ in ‘a’, which is of non-class type ‘const int’
make: *** [test_int4.o] Error 1
rm test_int4.cpp
However, this can be fixed with the following patch:
--- brtvector.hpp~ 2008-12-16 12:00:52.000000000 -0500
+++ brtvector.hpp 2008-12-16 12:04:54.000000000 -0500
@@ -58,6 +58,17 @@
const T&c){
return a.questioncolon(b,c);
};
+
+
+
+/* XXX added by DAR */
+template <> inline int singlequestioncolon (const int &a,
+ const int &b,
+ const int &c) {
+ return a?b:c;
+}
+
+
template <> inline float singlequestioncolon (const char &a,
const float &b,
const float &c) {
The second issue is that it should not be necessary to perform a cast in:
tmp = ((int4)(imask == tmp))? zero4 : tmp;
When the (int4) cast is removed, the following error is generated:
/usr/local/amdbrook/sdk/include/brook/CPU/brtvector.hpp:59: error: request for member ‘questioncolon’ in ‘a’, which is of non-class type ‘const char’
The equivalent expression for float4 does not require the equivalent cast.
The third issue is minor. The compiler produces the warning:
test_int4.br(37) : WARN--1: conditional expression must have scalar type. On short vectors, assumes x components as condition
Statement: (int4 ) (imask == tmp) in tmp = ((int4 ) (imask == tmp)) ? (zero4) : (tmp)
However, this does not appear correct. For float4 the conditional expression is correctly applied component-wise, and with the patch above the same is true for int4. In the simple kernel, values are masked out to zero component-wise.
Thanks, dar, for the bug report. I will send this patch to the responsible engineer to look at.
Hi Ceq,
It looks like a bug in code-generation where a header file required for compilation is missing.
As a workaround you can include "brook\CPU\brtvector.hpp" in simple_kernel.cpp before including "brookgenfiles/copy.h".
Hope it helps.
Hi Ceq,
Thanks for pointing out the memory leak issues. It looks like there are memory leaks for reduction kernels. I did try the regular kernels and everything looks OK, but the reduction kernel shows the behavior you mentioned.
Hi nberger,
Are you using a reduction kernel? It seems that there are some memory leaks in the reduction path.
These are exactly my problems too.
Slowdown and memory leaks even without reduction kernels.
Using domains changes the behavior a bit and the slowdown is not so bad now, but the leaks are still there.
Bug Report:
Swizzling parts of arrays doesn't seem to work in kernels.
E.g.
kernel void foo(float a<>, out float4 b<>, out float4 c<>)
{
float4 r[2];
float4 tmp1, tmp2;
float val = a*a;
r[0].x = val;
r[0].y = val;
r[0].z = val;
r[0].w = val;
r[1].x = val;
r[1].y = val;
r[1].z = val;
r[1].w = val;
b = r[0];
c = r[1];
}
This kernel does not work. However, if I assign tmp1 and tmp2 to val in a similar fashion, the kernel does work.
Hi Rick,
Local arrays are not supported in Brook+. I am surprised brcc didn't complain about it.
Then I guess the real bug is that brcc didn't complain. In actuality, I was using arrays to make for a more terse expression. So as a feature request, macro expansion (i.e. you can use arrays, but they are really just n different variables rather than n contiguous elements in memory) would be useful.
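For the record, a sketch of the kernel above hand-expanded into separate variables, exactly as the requested macro expansion would do (based on the tmp1/tmp2 variant reported to work; not tested here):

```
kernel void foo(float a<>, out float4 b<>, out float4 c<>)
{
    /* r[0], r[1] expanded by hand into r_0, r_1 */
    float4 r_0;
    float4 r_1;
    float val = a * a;
    r_0.x = val; r_0.y = val; r_0.z = val; r_0.w = val;
    r_1.x = val; r_1.y = val; r_1.z = val; r_1.w = val;
    b = r_0;
    c = r_1;
}
```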
This is the code that I use for some testing. The measured times get longer and longer each frame.
kernel void krnShitIntersectTriangle( float3 rayOrigs<>,
float3 rayDirs<>,
out float4 outHits<> )
{
float3 v0 = float3(0.f,0.f,0.f);
float3 v1 = float3(100.f,0.f,0.f);
float3 v2 = float3(0.f,100.f,100.f);
float3 rayOrigin = rayOrigs;
float3 rayDir = rayDirs;
float4 currentHit = float4(9999999.0f, -1.f, -1.f, -1.f );
float3 edge1 = v1 - v0;
float3 edge2 = v2 - v0;
float3 tvec = rayOrigin - v0;
float3 qvec = cross( tvec, edge1 );
float3 pvec = cross(rayDir, edge2);
float det = dot(edge1, pvec);
float inv_det = 1.0f / det;
float value1, value2;
float4 triangHit;
triangHit.x = dot( edge2, qvec ) * inv_det;
triangHit.z = dot( rayDir, qvec ) * inv_det;
triangHit.y = dot( tvec, pvec ) * inv_det;
triangHit.w = 0.0f;
outHits = currentHit;
value2 = (triangHit.x <= currentHit.x) && (triangHit.z >= 0.0f) && (triangHit.y >= 0.0f) && (triangHit.x >= 0.0f) && ((triangHit.y + triangHit.z) <= 1.0f);
if( value2 )
{
outHits = triangHit;
}
}
I keep the streams as members of a class:
"Stream<float3> _origins;"
"Stream<float3>_dirs;"
"Stream<float4>_hits;"
Since there is no default constructor, unlike 1.2.1, I construct them with a small size, and later on assign them like this: _dirs = Stream<float3>( rank2, dimsWH );
Each time I measure the kernel execution times they get bigger and bigger:
const int MAX_ITERS = 645;
DWORD timeATStart;
static float timeDurationF[MAX_ITERS];
for( int i=0; i < MAX_ITERS; i++ )
{
PerfCounter0.Start();
krnShitIntersectTriangle( traceContextGPU._origins, traceContextGPU._dirs, traceContextGPU._Hits );
krnShadeNdotL_x3( traceContextGPU._origins, traceContextGPU._dirs, traceContextGPU._Hits,
_treeFaces_x3,
traceContextGPU._colors
);
PerfCounter0.Stop();
timeDurationF[i] = ...;
PerfCounter0.Reset();
}
timeDurationF contains:
0.000175, 0.000213, 0.000222, 0.000230, 0.000238, 0.000249, 0.000257 ...
This is for 128x128 rays. If I set the resolution to 1024x1024, the first 100 times the execution time is less than 0.0009xx, then it jumps to 0.0xxxx and then stays around 0.0xxxxx, which is bad IMHO. I do not know whether this is due to the address virtualization, or whether I need to think of some load balancing strategy.
This is new to 1.3, previous version was fine.
I am running this test on WindowsXP x64, using x64 build target with VS2005.
Any idea why this is so, and how to cure it, would be very warmly greeted.
Hi All,
Thanks for pointing out all the slow-down issues. The issue is that Brook+ 1.3 uses some kind of caching for different execution events.
Calling a kernel in a big for loop shows these issues. As a workaround you should call error() on the output stream after a kernel call. I have tested the bug report sent by nberger:
for(int j=0; j < 10; j++){
clock_t before = clock();
for(int i=0; i < 1000; i++){
copy(inputStream, outputStream);
}
outputStream.error();
clock_t after = clock();
cout << "1000 Calls: " << (after-before) << " ticks = " << (float)(after-before)/(float)CLOCKS_PER_SEC << " s" << endl;
}
To fix the slowdown, change it to:
for(int j=0; j < 10; j++){
clock_t before = clock();
for(int i=0; i < 1000; i++){
copy(inputStream, outputStream);
outputStream.error(); // Change here
}
clock_t after = clock();
cout << "1000 Calls: " << (after-before) << " ticks = " << (float)(after-before)/(float)CLOCKS_PER_SEC << " s" << endl;
}
Let me know if you still face any issues. I have filed a bug for this and it should be fixed in the next release.
It looks like a bug on Brook+ side.
As a side note, it's better not to use structs in Brook+. brcc expands these structs into multiple base-type streams, so you don't save any memory fetches. In addition, you incur some performance overhead during data transfer when using structs, as the runtime has to transfer data to the different base streams and copy the data element by element.
1. Can you post your command line options and the results?
2. error() call synchronizes all the pending events associated to the stream. It doesn't have any data transfer overhead.
3. inputstream.error() will probably synchronize streamRead; there are no issues with the data transfer synchronization implementation. So you do not need to call error() on the input stream.
As a side note, error() is a very useful API for finding any issues with your stream. And in case error() reports a problem, you can check errorLog() on the stream.
Bug Report (perhaps) in brook+ Samples
I am using a 4830 on a Core2 Duo machine with 4GB RAM on Debian 64bit, with the latest driver and SDK. I know neither the 4830 nor Debian is officially supported, but...
So I am able to run all the Brook+ examples for a few iterations, but when I try to run them for more iterations, say 100 or even 20, I end up getting the following error:
"Error occured
Kernel Execution : Uninitialized or Allocation failed Input streams.
Stream Write : Uninitialized stream"
It happens with all the matmult samples for larger sizes like 1024 or so. For sizes like 512 I can go up to 50 iterations.
Other optimization feature in CAL samples
So I am trying to find the best mat x mat mult code (including by using sgemm or dgemm). The CAL simple_matmult is really fast (320 gflops vs 200 gflops via sgemm), but the bottleneck in that CAL sample seems to be the way the data is copied between CPU and GPU: copyTo called via copyToGPU and copyFrom called via copyFromGPU (all in amdcal/samples/common/Samples.cpp).
Right now the data seems to be copied iteratively, to and fro, so that padding is preserved. Perhaps restructuring the data in memory before copying might speed things up quite a bit.
Documentation feature inclusion
Is it possible to include an explanation of the swizzle syntax in the computing guide? It can be found elsewhere on the web ( http://www.nada.kth.se/~tomaso/Stream2008/M3.pdf ), but it appears as an abrupt jump in the guide as there is no explanation of what swizzle does.
thanks.
Calling it on any one output stream is fine.
Hello,
I am trying the SDK on Ubuntu 8.10 amd64 on a Q6600 with an HD 4850 512MB.
I use the standard libxcb-xlib, so I get the annoying "locking assertion failure" backtrace.
I have some timing results from the Brook+ sample code that do not look consistent:
$ ./mandelbrot -p -q
Width Height Iterations CPU Total Time GPU Total Time Speedup
64 64 1 0.000000 0.010000 0.000000
Oops, the CPU is faster.
$ ./mandelbrot -p -q -i 1000 2>/dev/null
Width Height Iterations CPU Total Time GPU Total Time Speedup
64 64 1000 0.105000 0.313000 0.335463
Hmm, the CPU is still faster.
./mandelbrot -p -q -i 10000 2>/dev/null
Width Height Iterations CPU Total Time GPU Total Time Speedup
64 64 10000 1.045000 23.088000 0.045262
OMG, how can we explain that?
If I use a larger matrix it's better:
$ ./mandelbrot -p -x 1024 -y 1024 -q
Width Height Iterations CPU Total Time GPU Total Time Speedup
1024 1024 1 0.023000 0.010000 2.300000
This is OK, but:
$ ./mandelbrot -p -x 8192 -y 8192 -i 10 -q
Width Height Iterations CPU Total Time GPU Total Time Speedup
8192 8192 10 14.849000 0.001000 14849.000000
The GPU became 100 times faster!
./mandelbrot -e -p -x 8192 -y 8192 -i 1 -q
-e Verify correct output.
Computing Mandelbrot set on CPU ... Done
./mandelbrot: Failed!
Hmm, maybe the matrix is too big.
I used the binaries from the SDK and did not try to compile them myself.
BR.
$ ./mandelbrot -p -x 8192 -y 8192 -i 10 -q
Width Height Iterations CPU Total Time GPU Total Time Speedup
8192 8192 10 14.849000 0.001000 14849.000000
GPU became 100 times faster!
./mandelbrot -e -p -x 8192 -y 8192 -i 1 -q
-e Verify correct output.
Computing Mandelbrot set on CPU ... Done
./mandelbrot: Failed!
Humm maybe the matrix is too big
I think you are running the examples from the legacy folder. Try running the CPP samples. They have error checking on streams, and in case Brook+ is not able to allocate a stream on the GPU, they will show an error rather than these false numbers.
I noticed that dcl_resource_id(...) statements that are commented out in IL kernels are actually *interpreted* by the CAL compiler. How to reproduce: write a kernel with a commented out dcl_resource statement, and run calCtxRunProgram() without defining i0:
; dcl_resource_id(0)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)
Notice how calCtxRunProgram() will return an error "Symbol "i0" used as INPUT does not have a memory association.".
Another bug: calGetErrorString() returns strings that are prematurely truncated. For example, while debugging the above problem, printf("[%s]", calGetErrorString()) was displaying:
[Symbol "]
After dumping the memory around that string, I noticed that it was actually
[Symbol "\x00i0\x00" used as \x00INPUT\x00 does not have a memory association.]
with 4 NUL bytes around "i0" and "INPUT". My platform is 64-bit Linux, if that matters...
The CAL compiler interprets dcl_resource implicitly if any sampling instruction is used. Comment out the sampling instruction as well, and it should not give an error for symbol "i0".
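In other words (a hedged sketch; the sample instruction and its operands here are illustrative, not taken from the reporter's kernel), the declaration and any instruction that samples from that resource must be commented out together:

```
; dcl_resource_id(0)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)
; sample_resource(0)_sampler(0) r0, v0.xy    ; comment this out as well,
                                             ; or i0 is still implied
```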
/usr/local/amdbrook/samples/bin/legacy/lnx_x86_64/haar_wavelet -e -i 2 -t -p -q 2>/dev/null
Width Height Iterations GPU Total Time
64 64 2 0.031000
-e Verify correct output.
Computing Haar Wavelet Transform on CPU ... Done
/usr/local/amdbrook/samples/bin/legacy/lnx_x86_64/haar_wavelet: Failed!
-p Compare performance with CPU.
Width Height Iterations CPU Total Time GPU Total Time Speedup
64 64 2 0.000000 0.031000 0.000000
It's OK with -i 1.