Archives Discussions

darkmen · ‎01-20-2013

Hi everyone.

Today I updated the AMD Catalyst drivers to 13.1 and got a 20% performance loss on my HD7970.

Has anyone had the same experience?

Also, what is the easiest way to roll back to 12.10? Uninstalling 13.1 and reinstalling 12.10 gives the same lower speed (OpenCL still reports the NEW runtime version).

Claggy · ‎01-20-2013

I reported that last week too:

http://devgurus.amd.com/message/1286437#1286437

I had to delete a whole lot of files to be able to reinstall Cat 12.8,

since then an AMD Catalyst Un-install Utility has appeared on the AMD Game Driver download site:

http://sites.amd.com/us/game/downloads/Pages/catalyst-uninstall-utility.aspx

I haven't tried it properly yet, except that it didn't work on Vista, and it says it is for Windows 7 only.

Claggy

darkmen · ‎01-20-2013

I have uninstalled 13.1, deleted syswow64\amdocl.dll and reinstalled 12.10.

OCL Runtime version is now 1016 and speed is back.

BR

darkhmz · ‎01-20-2013

Hi!

I have experienced the same issue with Catalyst 13.1. In my case the performance drop was around 39% on my HD5830. I've tested kernel performance with different versions of amdocl.dll and the OpenCL version shipped with Catalyst 13.1 was the worst. According to APP profiler, kernel execution times were ~17.51ms and ~24.38ms (12.10 vs 13.1).
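For reference, the two profiler timings quoted above work out to roughly 39% more kernel time, which is the same thing as roughly 28% less throughput. The two percentages are easy to mix up when comparing reports in this thread. A minimal sketch in C of that arithmetic follows; the function names are hypothetical and only the two timings come from the post.

```c
/* Percentage increase in kernel time when going from old_ms to new_ms. */
double time_increase_percent(double old_ms, double new_ms) {
    return (new_ms - old_ms) / old_ms * 100.0;
}

/* Percentage drop in throughput (work per unit time) for the same change. */
double throughput_drop_percent(double old_ms, double new_ms) {
    return (1.0 - old_ms / new_ms) * 100.0;
}
```

Calling `time_increase_percent(17.51, 24.38)` gives the slowdown in kernel time, and `throughput_drop_percent(17.51, 24.38)` gives the matching loss in work done per second, which is the figure to compare against an fps drop.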

himanshu_gautam · ‎01-20-2013

Hi,

I am sorry to hear this.

If it is not too much to ask, can you please post some simple code that shows the performance degradation?

Thanks,

darkmen · ‎01-21-2013

Hi, I have just tried the 13.2 version with OCL runtime 1124.2.

Performance drops even further than with 13.1.

And this all comes down to the compiler. I am now comparing the ISA produced by 12.10 and 13.1 (btw, AMD APP KernelAnalyzer crashes on 13.2).

It seems there are some changes around branches and/or loops.

The source pseudo code:

    for (uint i = 0; i < STEP; i++) {
        if (check_data(...))
            output[0] = i;
    }

12.10 ISA:

    s_mov_b64 exec, s[10:11]
    s_addk_i32 s3, 0x001f
    s_addk_i32 s2, 0x0001
    s_cmp_ge_u32 s2, 0x00002100
    s_cbranch_scc1 label_3CC4
    s_branch label_0707
    s_getpc_b64 s[10:11]
    s_sub_u32 s10, s10, 0x0000d6e4
    s_subb_u32 s11, s11, 0
    s_setpc_b64 s[10:11]
    label_3CC4:

13.1 ISA:

    s_mov_b64 exec, s[10:11]
    s_addk_i32 s3, 0x001f
    s_addk_i32 s2, 0x0001
    s_cmp_ge_u32 s2, 0x00002100
    s_cbranch_scc0 label_3F7E
    s_getpc_b64 s[10:11]
    s_add_u32 s10, s10, 0x00000038
    s_addc_u32 s11, s11, 0
    s_setpc_b64 s[10:11]
    label_3F7E:
    s_getpc_b64 s[10:11]
    s_sub_u32 s10, s10, 0x0000d19c
    s_subb_u32 s11, s11, 0
    s_setpc_b64 s[10:11]
    s_getpc_b64 s[10:11]
    s_sub_u32 s10, s10, 0x0000d1b0
    s_subb_u32 s11, s11, 0
    s_setpc_b64 s[10:11]

As you can see, the new compiler seems to generate more instructions for the same code.

realhet · ‎01-21-2013

Wow, that's funny code...

    s_getpc_b64 s[10:11]
    s_add_u32 s10, s10, 0x00000038
    s_addc_u32 s11, s11, 0
    s_setpc_b64 s[10:11]

It can be realized with an "s_branch 0x000E" (0x000E comes from 0x0038/4, /4 because of dword alignment).

I guess they prepared the compiler to handle loops bigger than 128KB (which can't be encoded in s_branch), so they replaced almost every jump with these 4-cycle far jumps, even when the jump targets are well-known absolute locations within s_branch's reach.

(Btw: 64KByte overflows GCN's 32KByte code cache! You should keep that loop below 32K.)

Though I think the performance issue is more likely inside the check_data(...) region, not in this rarely executed loop management code.
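The offset arithmetic in this post can be sketched in C. The sketch assumes the usual GCN encoding, in which s_branch carries a signed 16-bit dword offset (SIMM16) relative to the instruction after the branch, so its reach is about ±128KB. The function names are hypothetical, the 32KByte cache size is taken from the post rather than verified, and the backward jump distance 0xd6e4 is only used as a rough stand-in for the loop size.

```c
#include <stdint.h>

/* Signed 16-bit dword range of the s_branch immediate (SIMM16). */
#define SIMM16_MIN (-32768)
#define SIMM16_MAX 32767

/* Assumed instruction cache size, taken from the post above (32 KByte). */
#define CODE_CACHE_BYTES (32 * 1024)

/*
 * Convert a byte offset, measured from the instruction following the branch,
 * into an s_branch SIMM16 value. Returns 1 and stores the immediate in
 * *simm16 if the offset is dword aligned and within reach, otherwise 0.
 */
int encode_s_branch(int32_t byte_offset, int16_t *simm16) {
    if (byte_offset % 4 != 0)
        return 0;                       /* not dword aligned */
    int32_t dwords = byte_offset / 4;
    if (dwords < SIMM16_MIN || dwords > SIMM16_MAX)
        return 0;                       /* beyond the +/-128KB reach */
    *simm16 = (int16_t)dwords;
    return 1;
}

/* Returns 1 if a loop body of loop_bytes does not fit in the assumed cache. */
int loop_exceeds_code_cache(uint32_t loop_bytes) {
    return loop_bytes > CODE_CACHE_BYTES;
}
```

With these helpers, the forward offset 0x38 encodes as 0x000E, and the backward offset 0xd6e4 from the 12.10 listing (55012 bytes) is also within s_branch's reach, while a loop of that size would not fit in a 32KByte cache.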

darkmen · ‎01-21-2013

Well, I agree: of course this will not give a 20% perf loss.

I can also see some positive changes (at least in theory):

Loops are even more unrolled now
exec mask instructions are more efficient (I can see even fewer branches in the code):

12.10 ISA:

    s_mov_b64 s[48:49], exec
    s_andn2_b64 exec, s[48:49], s[46:47]
    s_andn2_b64 s[44:45], s[44:45], exec
    s_cbranch_scc0 label_086E
    s_andn2_b64 exec, s[48:49], exec
    s_mov_b64 exec, s[48:49]
    s_mov_b64 exec, s[44:45]
    s_branch label_0838
    label_086E:

13.1 ISA:

    s_mov_b64 vcc, exec
    s_andn2_b64 exec, vcc, s[46:47]
    s_andn2_b64 s[44:45], s[44:45], exec
    s_cbranch_scc0 label_0C76
    s_mov_b64 exec, s[44:45]
    s_branch label_0C42
    label_0C76:

So, the question is still open, what makes it slower?
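To make the 13.1 exec mask sequence easier to follow, here is a small C model of it on 64-bit lane masks. It relies on s_andn2_b64 computing D = S0 & ~S1 and setting SCC when the result is non-zero. The roles given to the registers (s[44:45] as the set of lanes still in the loop, s[46:47] as the lanes that stay) are guesses from the listing, and the struct and function names are hypothetical.

```c
#include <stdint.h>

/* State after one pass through the 13.1 loop-tail sequence. */
typedef struct {
    uint64_t exec;       /* EXEC mask after the sequence */
    uint64_t active;     /* s[44:45]: assumed "lanes still looping" mask */
    int      loop_again; /* 1 if the s_branch back to the loop head is taken */
} loop_tail_result;

/* s_andn2_b64 D, S0, S1: D = S0 & ~S1, SCC = (D != 0). */
static uint64_t s_andn2_b64(uint64_t s0, uint64_t s1, int *scc) {
    uint64_t d = s0 & ~s1;
    *scc = (d != 0);
    return d;
}

loop_tail_result loop_tail_13_1(uint64_t exec, uint64_t active, uint64_t s46) {
    loop_tail_result r;
    int scc = 0;
    uint64_t vcc = exec;                       /* s_mov_b64 vcc, exec */
    exec = s_andn2_b64(vcc, s46, &scc);        /* s_andn2_b64 exec, vcc, s[46:47] */
    active = s_andn2_b64(active, exec, &scc);  /* s_andn2_b64 s[44:45], s[44:45], exec */
    if (scc) {                                 /* s_cbranch_scc0 not taken */
        exec = active;                         /* s_mov_b64 exec, s[44:45] */
        r.loop_again = 1;                      /* s_branch label_0C42 */
    } else {
        r.loop_again = 0;                      /* fall out at label_0C76 */
    }
    r.exec = exec;
    r.active = active;
    return r;
}
```

For example, with four lanes running (`exec = active = 0xF`) and `s46 = 0xC`, lanes 0 and 1 leave the loop and the next iteration runs with EXEC = 0xC. With `s46 = 0` every lane leaves and the loop exits.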

himanshu_gautam · ‎01-21-2013

Hi everyone,

From the last few posts, it looks like there have been some optimizations in driver 13.1 which have affected a few applications adversely. It would be helpful if someone could help pinpoint this issue. You can point to any SDK sample, or a small testcase, that shows the performance drop just by using a different driver.

I tried a few SDK samples: MatrixMulImage, BlackScholes & LDSMemoryBandwidth, but I did not see any change in performance.

darkhmz · ‎01-24-2013

Hi!

Here is a small testcase that shows quite a big (~33% difference in fps) performance drop on my HD5830 just by using different amdocl.dll versions. I've included the two dlls from 12.10 and 13.1 to make the testing easier, and two pictures to show the obvious performance difference on my card. Hope it helps.

http://www.mediafire.com/?nip722foiqoc4v8

himanshu_gautam · ‎01-24-2013

Thanks darkhmz,

Will look into the test case and let you know.

Is this a Windows issue or Linux? It would be helpful if you could give more details about your setup.

darkhmz · ‎01-24-2013

Hi!

Win7 x64 + Catalyst 12.10 here...

himanshu_gautam · ‎01-29-2013

Hi darkhmz,

I have been working on it. I was able to see the slowdown in kernel execution (from the output of CodeXL) using the DLLs you provided.

But I also tried setting up a fresh system with just the AMD driver installed. When I installed Catalyst 12.10 and ran the executable using your DLLs (12.10 & 13.1), I did not see the performance degradation. When using Catalyst's own amdocl.dll, the fps was also consistent. I am still digging into it.

Did you make any progress in narrowing down the issue?

Surprisingly, CodeXL still shows the difference in kernel timings (~33%) when run on the fresh machine that has just the driver. Would it be possible for you to share some code which I can compile? It is a 32-bit exe on a 64-bit Win7 platform. Do you see a similar performance drop with a 64-bit executable too?

Message was edited by: Himanshu Gautam

darkhmz · ‎01-31-2013

Hi Himanshu,

I've compiled a 64-bit version and tested again, this time with the amdocl64.dlls, and the performance difference is still there. Though if I change the scene, the difference is gone in some cases. For example, with the following simple plane + bumpy torus scene I didn't see an fps difference.

    float4 de(float4 p, float4 q)
    {
        float dst1 = dfPlane(p, (float4)(0.0f, 1.0f, 0.0f, -1.0f));
        float dst2 = dfTorus(p, (float2)(2.5f, 0.8f)) - max(perlin(p * 3.0f) * 0.1f, 0.0f);
        return (float4)(U(dst1, dst2), 0.0f, 0.2f, 0.0f);
    }

I'm going to try a fresh test system and share my code sometime this week, then test again.

himanshu_gautam · ‎01-31-2013

Thanks for the update.

As I am able to see the difference in kernel execution times, I am planning to forward it to the AMD engineering team.

I will send them the source code, once you attach it here.

Thanks for reporting the problem.

BTW I accidentally marked this post as assumed answered. Not sure how to revert it though.

darkhmz · ‎02-01-2013

Hi, here is my code.

np, glad i could help.

edit: link removed, file attached.

yurtesen · ‎02-19-2013

I have a small program which is getting about a ~25% performance drop with the 13.1 drivers. Do you have an email address that I can send it to? (It is small, but I would rather not upload it to a public forum unless absolutely necessary.)

himanshu_gautam · ‎02-20-2013

Hi yurtesen,

I guess the testcase has to be sent via a public medium only. I would recommend that you start a new thread, so it is easy to track. My apologies for the inconvenience.

yurtesen · ‎02-20-2013

himanshu.gautam wrote:
Hi yurtesen,
I guess the testcase has to be sent via a public medium only. I would recommend that you start a new thread, so it is easy to track. My apologies for the inconvenience.

I would understand this if I were looking for a problem in my program. But it doesn't make much sense, since the problem appears to be in the driver and nobody else (other than AMD) has to see the code. However, I will try to find out if I am allowed to share the code with the public, and get back to you in a new thread if I can.

himanshu_gautam · ‎02-20-2013

Thanks for your support.

I had asked for a private message channel, but there appear to be some legal problems with that. I hope you will be able to reproduce your problem with a small testcase that is easy for you to share in public.

yurtesen · ‎02-21-2013

himanshu.gautam wrote:
Thanks for your support.
I had asked for a private message channel, but there appear to be some legal problems with that. I hope you will be able to reproduce your problem with a small testcase that is easy for you to share in public.

The code itself does not have any copyright/license and is our own experimental research code. The code is already a small testcase; we simply do not want it out in public yet. But I will see how we can flex that....


OpenCL performance dropped down 12.10 >> 13.1