ryta1203
Journeyman III

Odd Performance Results

Given three simple kernels:

kernel void foo1(float4 a<>, out float4 f1<>) { f1 = a; }

kernel void foo2(float4 a<>, out float4 f1<>, out float4 f2<>) { f1 = a; f2 = a; }

kernel void foo3(float4 a<>, out float4 f1<>, out float4 f2<>, out float4 f3<>) { f1 = a; f2 = a; f3 = a; }

Why would foo3 run faster than foo1 and foo2 for small stream sizes such as <8,8>, <16,16>, etc.? This confuses me. It doesn't seem to happen at larger stream sizes, say <1024,1024>.

I looked at the ISA and it's the same, except that foo2 has one more bundle than foo1 (all MOV instructions) and has burstcount(1), and foo3 has one more bundle than foo2 (all MOV instructions) and has burstcount(2).

0 Likes
9 Replies
ryta1203
Journeyman III

In case this wasn't clear, foo3 is running FASTER than foo2, and foo3 is running FASTER than foo1, but only for very small stream sizes, for example 1 or 2 wavefronts (this is all I have tested so far).

0 Likes

I'm going to assume that this is some kind of bug and AMD has no idea why this happens.

0 Likes

Ryta,

There are a lot of reasons why the performance can be different, and there just is not enough information right now to make a valid judgment on it. When you are dealing with sizes that small, you are no longer hitting the normal bottlenecks on the chip, and it takes detailed analysis to figure out exactly what is causing the perceived performance differences.
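One rough way to picture this is a cost model in which every kernel call pays a fixed setup cost plus a cost proportional to the bytes moved. The sketch below is only illustrative. The 64-thread wavefront size matches AMD GPUs of this generation, but `overhead_us` and `bandwidth_gb_s` are made-up placeholder values, not measurements, and the model ignores everything else the hardware and driver do.

```python
import math

WAVEFRONT_SIZE = 64  # threads per wavefront on AMD GPUs of this generation


def wavefronts(width, height):
    """Number of wavefronts needed to cover a width x height stream."""
    return math.ceil(width * height / WAVEFRONT_SIZE)


def modeled_time_us(width, height, n_outputs,
                    overhead_us=20.0, bandwidth_gb_s=50.0):
    """Hypothetical cost model: fixed per-call overhead + bytes / bandwidth.

    overhead_us and bandwidth_gb_s are placeholder assumptions.
    """
    bytes_per_element = 16 * (1 + n_outputs)   # one float4 read, n float4 writes
    total_bytes = width * height * bytes_per_element
    transfer_us = total_bytes / (bandwidth_gb_s * 1e3)  # 1 GB/s = 1e3 bytes/us
    return overhead_us + transfer_us


def overhead_fraction(width, height, n_outputs, overhead_us=20.0, **kwargs):
    """Share of the modeled time that is fixed overhead."""
    total = modeled_time_us(width, height, n_outputs,
                            overhead_us=overhead_us, **kwargs)
    return overhead_us / total


if __name__ == "__main__":
    for size in (8, 16, 1024):
        for n_outputs in (1, 2, 3):
            print(size, n_outputs, wavefronts(size, size),
                  round(overhead_fraction(size, size, n_outputs), 4))
```

Under these assumed numbers, the fixed term accounts for nearly all of the modeled time at <8,8>, so the number of outputs barely changes the total and small differences in setup cost or measurement noise can easily reorder foo1, foo2, and foo3.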

 

0 Likes

Originally posted by: MicahVillmow Ryta,

There are a lot of reasons why the performance can be different, and there just is not enough information right now to make a valid judgment on it. When you are dealing with sizes that small, you are no longer hitting the normal bottlenecks on the chip, and it takes detailed analysis to figure out exactly what is causing the perceived performance differences.

 

 

OK. Yeah, it just seems odd that outputting 2 floats is slower than outputting 3, everything else being equal.

0 Likes

Well, the problem is mainly that you are outputting such a small amount of data that the performance characteristics for that case are not easily determined. Maybe the three-output case is causing the board clocks to be increased and the two-output case is not, or it could be some software/hardware setup overhead, or many other things.

0 Likes

Micah,

Thanks, I wasn't aware that the board clock would change like that. This has probably been discussed, but there is a way to stop that, right, and keep a constant clock regardless?

0 Likes

The board clock stays in a low-power state until a large enough workload, as determined by the device driver, kicks it into full power. I think there is a way to change this in Catalyst Control Center, but I am not 100% sure.
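If clock ramp-up is a suspect, a generic precaution when timing is to run some untimed warm-up iterations first and then report the median of many timed runs rather than a single sample. A minimal sketch, assuming `run_kernel` is a hypothetical callable that launches the kernel and blocks until it has finished (it is not part of Brook+ or Catalyst, and the iteration counts are arbitrary):

```python
import statistics
import time


def time_kernel(run_kernel, warmup=50, repeats=200):
    """Return (median_seconds, min_seconds) over `repeats` timed calls.

    run_kernel is a hypothetical blocking callable supplied by the caller.
    """
    # Untimed warm-up calls give the driver a chance to leave any
    # low-power state and absorb one-time setup costs.
    for _ in range(warmup):
        run_kernel()

    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_kernel()
        samples.append(time.perf_counter() - start)

    return statistics.median(samples), min(samples)


if __name__ == "__main__":
    # Stand-in workload so the sketch runs without a GPU.
    median_s, min_s = time_kernel(lambda: sum(range(1000)))
    print("median:", median_s, "min:", min_s)
```

Comparing the median against the minimum gives a quick sense of how noisy the measurement is. If the ordering of foo1, foo2, and foo3 changes between runs, the difference is probably within the noise.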

0 Likes

Ok, thanks Micah. I might be able to create a profile and manually change the profile using a text editor. I will look into that, thanks, that makes sense.

0 Likes

I checked the clocks and set them manually in a profile using CCC, and I get the same results, but thanks anyway.

0 Likes