I thought I'd do a quick and dirty comparison of the speed of the CPU vs CAL back-ends by using a large value for the Length variable in the hello_brook sample. I ran into some odd behaviour that I can't immediately figure out.
For values of Length up to 100,000 everything works fine. For values of Length > 100,000 up to some (undetermined) limit, the program returns "failed to get usable kernel fragment to implement requested reduction".
For very large values (Length=1,000,000) the CPU route returns the correct result (eventually) but the CAL route returns "There are 0.000000 elements...".
Have I done something stupid, or have I missed something fundamental? This is running on Win XP 32-bit, Radeon 4870 with 8.10 driver and the debug build of hello_brook.
I'm at work right now so I can't test this (no access to a 4870!) but reading through hello_brook.br I noticed that there's a difference from the documented way of specifying the reduction variable. In hello_brook.br we have:
reduce void hello_brook_sum(float input<>, reduce float val<>)
Here val is a stream with a single element. The documentation for reduction kernels shows the reduction variable being given as a simple data type, i.e. a plain float. I'll test this when I get home, but if anyone can confirm that this is the source of the issue I've noticed, that would be great.
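For comparison, the documented scalar-reduction form would look something like this (just a sketch based on the docs' style, reusing the hello_brook names; I haven't been able to compile it yet):

// Reduction variable declared as a plain scalar rather than a
// single-element stream (untested sketch):
reduce void hello_brook_sum(float input<>, reduce float val)
{
    val += input;
}

If the stream-style declaration in the sample is indeed the problem, swapping it for the scalar form above should be a one-line change to hello_brook.br.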
If this is the cause, are there many of these gotchas in Brook+, and are they being addressed in 1.3?