How come I can't increase the height for bitonic_sort_d.exe from 64 to 128?
C:\Program Files\ATI\ATI Brook+ 1.4.0_beta\samples\bin\CPP\xp_x86_32>bitonic_sort_d.exe -v -e -p -t -x 128 -y 64 -i 1
Width Height Iterations GPU Total Time
128 64 1 0.0288799
-e Verify correct output.
Performing Bitonic Sort on CPU ... Done
bitonic_sort_d.exe: Passed!
-p Compare performance with CPU.
Width Height Iterations CPU Total Time GPU Total Time Speedup
128 64 1 0.117355 0.0288799 4.06354
C:\Program Files\ATI\ATI Brook+ 1.4.0_beta\samples\bin\CPP\xp_x86_32>bitonic_sort_d.exe -v -e -p -t -x 128 -y 128 -i 1
Error occured
Kernel Execution : Error with input streams
Kernel Execution : Error with input streams
Kernel Execution : Error with input streams
Kernel Execution : Error with input streams
Kernel Execution : Error with input streams
Width Height Iterations GPU Total Time
128 128 1 0.0264846
-e Verify correct output.
Performing Bitonic Sort on CPU ... Done
bitonic_sort_d.exe: Failed!
-p Compare performance with CPU.
Width Height Iterations CPU Total Time GPU Total Time Speedup
128 128 1 0.466925 0.0264846 17.6301
C:\Program Files\ATI\ATI Brook+ 1.4.0_beta\samples\bin\CPP\xp_x86_32>
Same issue with the binary search example:
C:\Program Files\ATI\ATI Brook+ 1.4.0_beta\samples\bin\CPP\xp_x86_32>binary_search_d.exe -v -e -p -p -q -t -x 128 -y 128 -i 1
Verbose and Quiet cancel each other out.
Error occured
Kernel Execution : Error with input streams
Width Height Iterations GPU Total Time
128 128 1 0.0251148
-e Verify correct output.
Performing Binary Searches on CPU ... Done
binary_search_d.exe: Failed!
-p Compare performance with CPU.
Width Height Iterations CPU Total Time GPU Total Time Speedup
128 128 1 0.0981394 0.0251148 3.90763
C:\Program Files\ATI\ATI Brook+ 1.4.0_beta\samples\bin\CPP\xp_x86_32>
I am running this with:
Catalyst 9.9
Windows Vista 32-bit
Stream Computing SDK 1.4 (beta)
The GPU I am using is: FireStream 9250
The errorLog on the stream says "Kernel Execution : Error with input streams". That means at least one of the kernel's input streams is in an error state. Check error and errorLog on each input stream of the kernel and see what they return.
As for binary search, the sample only allows a maximum of 8192 elements.
128 x 128 = 16384 > 8192
Try -x 8192 -y 1 and you should see a huge speedup.
Any size <= 8192, such as -x 1 -y 8192, should also run without problems, but it will be slower than the CPU.
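The size arithmetic can be sketched as below. The 8192 limit is the figure quoted in this reply, not a documented SDK constant, and the helper names are hypothetical.

```cpp
#include <cstddef>

// Limit quoted in this thread for the binary search sample (assumption).
constexpr std::size_t kMaxSearchElements = 8192;

// Hypothetical helper: total element count for a width x height domain.
constexpr std::size_t elementCount(std::size_t width, std::size_t height) {
    return width * height;
}

// Hypothetical helper: true when the domain fits within the quoted limit.
constexpr bool fitsSearchLimit(std::size_t width, std::size_t height) {
    return elementCount(width, height) <= kMaxSearchElements;
}

// Compile-time checks for the sizes discussed in the thread.
static_assert(elementCount(128, 128) == 16384, "128 x 128 elements");
static_assert(!fitsSearchLimit(128, 128), "16384 exceeds 8192");
static_assert(fitsSearchLimit(8192, 1), "-x 8192 -y 1 fits");
static_assert(fitsSearchLimit(1, 8192), "-x 1 -y 8192 fits");
```

Both suggested shapes hold exactly 8192 elements, so they sit right at the limit; the failing 128 x 128 run holds twice that.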
All the code in the examples emphasizes data reuse, which is why we see a huge performance improvement for matrix multiplication and for the binary search in the example. That's what I conclude.
For applications with a high rate of data reuse, ATI hardware is the best.