I tested it on Vista-64 and got the same output as you (failed for -y 128 and passed for -y 127).
Before fixing it, I tried an obvious improvement: moving the stream initialization and the result copy out of the iteration loop.
Surprisingly, that also fixes the failure.
I don't understand at all why, but it does.
I also got some performance improvement.
diff -u /usr/local/amdbrook/samples/legacy/apps/haar_wavelet/haar_wavelet.br haar_wavelet.br
--- /usr/local/amdbrook/samples/legacy/apps/haar_wavelet/haar_wavelet.br 2008-12-03 01:12:53.000000000 +0100
+++ haar_wavelet.br 2009-01-10 17:13:18.000000000 +0100
@@ -171,10 +171,10 @@
// Record GPU Total time
+ // Write to stream
+ streamRead(stream0, io);
for (i = 0; i < cmd.Iterations; ++i)
- // Write to stream
- streamRead(stream0, io);
// Run the brook program
while (w > 1)
@@ -199,16 +199,16 @@
inp = 1 - inp;
- // Write data back from stream
- streamWrite(stream0, io);
- streamWrite(stream1, io);
+ // Write data back from stream
+ streamWrite(stream0, io);
+ streamWrite(stream1, io);
My patch is completely buggy. I really don't understand how it can give the right result. With this fix, instead of running the wavelet transform i times on the same data, each new iteration runs on the result of the previous iteration...
I'm very confused by this.
When the test fails, the GPU output equals the input.
Try setting environment variable BRT_RUNTIME=cpu and see if it works.
Maybe I'm missing something, but I think the fix is just adding two lines to reinitialize the variables:
Add them at line 272 of the new CPP code, or at line 215 of the old legacy code (using w = Length; there instead).
for (i = 0; i < info->Iterations; ++i )
// Write to stream
inp = 0; // <------
w = _width * _height; // <------
Cep, you are right. Thank you.
I had thought of the inp variable but not of w.
I'm afraid that using the GPU for the Haar wavelet is useless, because the performance isn't very good:
Width  Height  Iterations  CPU Total Time  GPU Total Time  Speedup
4096   4096    100         44.084000       69.486000       0.634430
That's annoying, because I would like to do Dirac video encoding.
The haar_wavelet sample uses the domain operator multiple times in a loop. The domain operator has poor performance, and it is suggested to avoid using it.
You can try emulating domain by passing constant parameters that describe the domain to the kernel, and by specifying the kernel's domain of execution.
For example, rather than calling a kernel like this:
copy(avgStream.domain(domainStart1, domainEnd1) , stream1.domain(domainStart1, domainEnd1));
it would be better to call it something like this:
copy.domainOffset(uint4(*domainStart1, 0, 0, 0));
copy.domainSize(uint4(*domainEnd1 - *domainStart1, 1, 1, 1));
Similarly, the call to the haar_wavelet kernel can be changed. Keep in mind that the calculation of idx1 and idx2 inside the kernel will change, because the instance() value will now vary from *domainStart1 to *domainEnd1 (not from 0 to the stream width).