I think I found an interesting exercise in the samples directory.
Can somebody tell me whether
/usr/local/amdbrook/samples/bin/CPP/lnx_x86_64/haar_wavelet -i 2 -e -y 128 -x 128 -p
gives
-e Verify correct output.
Computing Haar Wavelet Transform on CPU ... Done
./haar_wavelet: Failed!
-p Compare performance with CPU.
Width  Height  Iterations  CPU Total Time  GPU Total Time  Speedup
128    128     2           0               0.057           0
but succeeds with -x 128 -y 127?
Before fixing it, I tried an obvious improvement: moving the stream initialization and the result copy out of the iteration loop.
Surprisingly, that also fixes the failure.
I don't understand at all why, but it does.
I also got some performance improvement.
diff -u /usr/local/amdbrook/samples/legacy/apps/haar_wavelet/haar_wavelet.br haar_wavelet.br
--- /usr/local/amdbrook/samples/legacy/apps/haar_wavelet/haar_wavelet.br 2008-12-03 01:12:53.000000000 +0100
+++ haar_wavelet.br 2009-01-10 17:13:18.000000000 +0100
@@ -171,10 +171,10 @@
// Record GPU Total time
Start(0);
+ // Write to stream
+ streamRead(stream0, io[0]);
for (i = 0; i < cmd.Iterations; ++i)
{
- // Write to stream
- streamRead(stream0, io[0]);
// Run the brook program
while (w > 1)
@@ -199,16 +199,16 @@
inp = 1 - inp;
}
+ }
- // Write data back from stream
- if(!inp)
- {
- streamWrite(stream0, io[1]);
- }
- else
- {
- streamWrite(stream1, io[1]);
- }
+ // Write data back from stream
+ if(!inp)
+ {
+ streamWrite(stream0, io[1]);
+ }
+ else
+ {
+ streamWrite(stream1, io[1]);
}
Stop(0);
}
My patch is completely buggy. I really don't understand how it can give the right result. With this change, instead of performing the wavelet transform i times on the same data, each new iteration starts from the result of the previous one...
I'm very confused by this.
When the test fails, the GPU output equals the input.
Try setting environment variable BRT_RUNTIME=cpu and see if it works.
Cep, you are right. Thank you.
I had thought about the inp variable but not about w.
I'm afraid that using the GPU for the Haar wavelet is useless because the performance isn't very good:
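A small CPU-only sketch of what seems to be going on, assuming (this is not confirmed by the sample source) that w and inp are set once before the iteration loop and never reset. The names haarPass and run are made up for illustration, the transform is 1D, and two plain vectors stand in for stream0 and stream1.

```cpp
#include <cassert>
#include <vector>

// One 1D Haar pass over the first w elements of src: averages go to the
// first half of dst, differences to the second half. Elements at index >= w
// are copied through unchanged.
static void haarPass(const std::vector<float>& src, std::vector<float>& dst, int w) {
    dst = src;
    const int half = w / 2;
    for (int k = 0; k < half; ++k) {
        dst[k]        = (src[2 * k] + src[2 * k + 1]) / 2;
        dst[half + k] = (src[2 * k] - src[2 * k + 1]) / 2;
    }
}

// Repeats the transform `iterations` times with ping-pong buffers.
// resetState == false mimics w and inp being initialised only once,
// outside the iteration loop (an assumption about the sample).
static std::vector<float> run(const std::vector<float>& input, int iterations, bool resetState) {
    assert(!input.empty());
    std::vector<float> stream[2] = {input, input};
    int w = static_cast<int>(input.size());
    int inp = 0;
    for (int i = 0; i < iterations; ++i) {
        if (resetState) {
            w = static_cast<int>(input.size());
            inp = 0;
        }
        stream[0] = input;              // stands in for streamRead(stream0, io[0])
        while (w > 1) {                 // skipped entirely once w has reached 1
            haarPass(stream[inp], stream[1 - inp], w);
            w /= 2;
            inp = 1 - inp;
        }
    }
    return stream[inp];                 // stands in for the streamWrite branch on inp
}
```

With a 4-element input (two passes, so inp ends at 0), run(input, 2, false) hands back the untouched input, which matches "the GPU output equals the input". With 8 elements (three passes, so inp ends at 1), the stale result left in the other buffer is returned and the check passes by accident, which may be why a different size succeeds.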
Width  Height  Iterations  CPU Total Time  GPU Total Time  Speedup
4096   4096    100         44.084000       69.486000       0.634430
That's annoying because I would like to do Dirac video encoding.
The haar_wavelet sample uses the domain operator multiple times in a loop. The domain operator performs poorly, and it is recommended to avoid it.
You can try emulating domain by passing different constant parameters (which specify the domain) to the kernel and by specifying the kernel's domain of execution.
For example, rather than calling a kernel like this:
copy(avgStream.domain(domainStart1, domainEnd1) , stream1.domain(domainStart1, domainEnd1));
it would be better to call it something like this:
copy.domainOffset(uint4(*domainStart1, 0, 0, 0));
copy.domainSize(uint4(*domainEnd1 - *domainStart1, 1, 1, 1));
copy(avgStream, stream1);
Similarly, the call to the haar_wavelet kernel can be changed. Keep in mind that the calculation of idx1 and idx2 inside the kernel will change, because the instance() value will now vary from *domainStart1 to *domainEnd1 (not from 0 to the stream width).
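The index adjustment can be sketched on the CPU as follows. This is only an illustration under assumptions: the real idx1/idx2 formulas in the sample are not shown in this thread, so a hypothetical pairwise-average kernel is used in which output element k of the domain reads input elements 2k and 2k+1. The function names pairAverageSubview and pairAverageOffset are invented for the example.

```cpp
#include <cassert>
#include <vector>

// Sub-view style (what the domain operator gives you): the kernel index
// runs from 0 to (end - start) and is already relative to the sub-view.
static std::vector<float> pairAverageSubview(const std::vector<float>& in,
                                             int outWidth, int start, int end) {
    assert(0 <= start && start <= end && end <= outWidth);
    assert(2 * (end - start) <= static_cast<int>(in.size()));
    std::vector<float> out(outWidth, 0.0f);
    for (int index = 0; index < end - start; ++index) {
        const int idx1 = 2 * index;
        const int idx2 = idx1 + 1;
        out[start + index] = (in[idx1] + in[idx2]) / 2;
    }
    return out;
}

// Offset style (execution domain set on the kernel): the kernel index runs
// from start to end, so the offset has to be subtracted before computing
// idx1 and idx2. `start` would be passed to the kernel as a constant.
static std::vector<float> pairAverageOffset(const std::vector<float>& in,
                                            int outWidth, int start, int end) {
    assert(0 <= start && start <= end && end <= outWidth);
    assert(2 * (end - start) <= static_cast<int>(in.size()));
    std::vector<float> out(outWidth, 0.0f);
    for (int instance = start; instance < end; ++instance) {
        const int local = instance - start;   // position inside the domain
        const int idx1 = 2 * local;
        const int idx2 = idx1 + 1;
        out[instance] = (in[idx1] + in[idx2]) / 2;
    }
    return out;
}
```

Both functions write the same values into positions start to end - 1 of the output; the only difference is where the offset is applied, which is the part that has to move into the kernel when the domain operator is dropped.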