I've modified "BROOK\samples\legacy\tests\sum" to test how C = A + 1 performs

Each execution does 10000 iterations without performing streamRead/streamWrite

My setup is Athlon X2 4850E, Radeon 4850, WinXP 64, VS2005

Streams

--------------------------------------------------------------------------------

s1<1024, 1024> = 1, 2, 3...

s2<1024, 1024> = 1, 1, 1...

s3<1024, 1024>

s4<1,1> = 1

Kernels

--------------------------------------------------------------------------------

kernel void inc1(float a< >, float b< >, out float c< > ) { c = a + b; }

kernel void inc2(float a< >, float b<1, 1>, out float c< > ) { c = a + b; }

kernel void inc3(float a< >, float b, out float c< > ) { c = a + b; }

kernel void inc4(float a< >, out float c< > ) { c = a + 1.0f; }

Time

--------------------------------------------------------------------------------

inc1(s1, s2, s3); > 2.54s

inc2(s1, s4, s3); > 8.85s

inc3(s1, 1.0f, s3); > 5.49s

inc4(s1, s3); > 2.46s

1. Why is inc2 so slow? I think implicit resize was much faster in the previous SDK (nearly as fast as inc4).

2. Why does inc3 take twice as long? Is using a constant parameter really that slow?

3. Shouldn't inc4 be a little faster? It reads half as much input data as inc1.
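
To put question 3 in numbers, here is a rough back-of-the-envelope sketch in plain C++ (not Brook+ code). The helper names `bytes_per_pass` and `effective_gb_per_s` are made up for illustration. It assumes 4-byte floats, that every element of each 1024x1024 stream is read or written exactly once per pass, that nothing is served from cache, and that the measured time covers only kernel execution; none of these is verified.

```cpp
#include <cstddef>

// Bytes moved in one kernel pass, ASSUMING each full-size stream is read or
// written exactly once and nothing is cached (a simplification).
constexpr double bytes_per_pass(std::size_t width, std::size_t height,
                                std::size_t full_size_streams) {
    return static_cast<double>(width) * height * sizeof(float) * full_size_streams;
}

// Effective bandwidth in GB/s (1 GB = 1e9 bytes) across all iterations.
constexpr double effective_gb_per_s(double bytes_pass, int iterations,
                                    double seconds) {
    return bytes_pass * iterations / seconds / 1e9;
}
```

Calling `effective_gb_per_s(bytes_per_pass(1024, 1024, 3), 10000, 2.54)` for inc1 (two inputs plus one output) gives roughly 49.5 GB/s, while the same call with 2 streams and 2.46 s for inc4 gives roughly 34.1 GB/s. Under these assumptions inc4 moves two thirds of the bytes of inc1 (half the input, the same output) but takes almost the same time, which is why I expected it to be faster.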

I would appreciate any hints on these. Thanks.

EDIT

--------------------------------------------------------------------------------

Using the stream.error() workaround on the output stream:

inc1(s1, s2, s3); > 2.93s

inc2(s1, s4, s3); > 8.81s

inc3(s1, 1.0f, s3); > 2.86s

inc4(s1, s3); > 2.78s

Now inc1 and inc4 are a bit slower, but inc3 performs as expected, so it looks like the stream.error() bug also affects constant parameters. If so, please don't forget to fix it. On the other hand, inc2 is still too slow.
