darkmen
Journeyman III

OpenCL performance dropped down 12.10 >> 13.1

Hi everyone.

I updated the AMD Catalyst drivers to 13.1 today and got a 20% performance loss on my HD7970.

Does anyone have the same experience?

Also, which is the easiest way to roll back to 12.10? Uninstalling 13.1 and reinstalling 12.10 gives the same lower speed (OpenCL still reports the NEW runtime version).

0 Likes
20 Replies
Claggy
Adept II

I reported that last week too:

http://devgurus.amd.com/message/1286437#1286437

I had to delete a whole lot of files to be able to reinstall Cat 12.8,

since then an AMD Catalyst Un-install Utility has appeared on the AMD Game Driver download site:

http://sites.amd.com/us/game/downloads/Pages/catalyst-uninstall-utility.aspx

Not tried it properly yet, except that it didn't work on Vista, and it says it is for Windows 7 only,

Claggy

0 Likes

I have uninstalled 13.1, deleted syswow64\amdocl.dll and reinstalled 12.10.

The OCL runtime version is now 1016 and the speed is back.

BR

0 Likes
darkhmz
Adept I

Hi!

I have experienced the same issue with Catalyst 13.1. In my case the performance drop was around 39% on my HD5830. I've tested kernel performance with different versions of amdocl.dll and the OpenCL version shipped with Catalyst 13.1 was the worst. According to APP profiler, kernel execution times were ~17.51ms and ~24.38ms (12.10 vs 13.1).
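For reference, the ~39% figure corresponds to the increase in kernel execution time; the loss in throughput for the same timings is smaller. A minimal sketch of the arithmetic in C (the function names are illustrative and not taken from any AMD tool):

```c
#include <assert.h>

/* Relative increase in kernel time: (t_new - t_old) / t_old. */
static double time_increase(double t_old_ms, double t_new_ms)
{
    assert(t_old_ms > 0.0);
    return (t_new_ms - t_old_ms) / t_old_ms;
}

/* Relative loss in throughput (work per unit time): 1 - t_old / t_new. */
static double throughput_loss(double t_old_ms, double t_new_ms)
{
    assert(t_new_ms > 0.0);
    return 1.0 - t_old_ms / t_new_ms;
}
```

With the timings above, time_increase(17.51, 24.38) is about 0.39 and throughput_loss(17.51, 24.38) is about 0.28, so it helps to say which of the two a reported percentage refers to.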

0 Likes

Hi,

I am sorry to hear this.

If it is not too much to ask, can you please post a simple code sample that shows the performance degradation?

Thanks,

0 Likes
darkmen
Journeyman III

Hi, I have just tried the 13.2 version with OCL runtime 1124.2.

Performance drops even further than with 13.1.

It all comes down to the compiler. I am now comparing the ISA produced by 12.10 and 13.1 (btw, AMD APP KernelAnalyzer crashes on 13.2).

There seem to be some changes around branches and/or loops.

The source pseudocode:

for (uint i = 0; i < STEP; i++) {
    if (check_data(...))
        output[0] = i;
}

12.10 ISA:

  s_mov_b64     exec, s[10:11]
  s_addk_i32    s3, 0x001f
  s_addk_i32    s2, 0x0001
  s_cmp_ge_u32  s2, 0x00002100
  s_cbranch_scc1  label_3CC4
  s_branch      label_0707
  s_getpc_b64   s[10:11]
  s_sub_u32     s10, s10, 0x0000d6e4
  s_subb_u32    s11, s11, 0
  s_setpc_b64   s[10:11]
label_3CC4:

13.1 ISA:

  s_mov_b64     exec, s[10:11]
  s_addk_i32    s3, 0x001f
  s_addk_i32    s2, 0x0001
  s_cmp_ge_u32  s2, 0x00002100
  s_cbranch_scc0  label_3F7E
  s_getpc_b64   s[10:11]
  s_add_u32     s10, s10, 0x00000038
  s_addc_u32    s11, s11, 0
  s_setpc_b64   s[10:11]
label_3F7E:
  s_getpc_b64   s[10:11]
  s_sub_u32     s10, s10, 0x0000d19c
  s_subb_u32    s11, s11, 0
  s_setpc_b64   s[10:11]
  s_getpc_b64   s[10:11]
  s_sub_u32     s10, s10, 0x0000d1b0
  s_subb_u32    s11, s11, 0
  s_setpc_b64   s[10:11]

As you can see, the new compiler seems to generate more instructions for the same code.

0 Likes

Wow, that's funny code...

  s_getpc_b64   s[10:11]
  s_add_u32     s10, s10, 0x00000038
  s_addc_u32    s11, s11, 0
  s_setpc_b64   s[10:11]

It could be done with a single "s_branch 0x000E" (0x000E comes from 0x0038/4; /4 because of dword alignment).

I guess they prepared the compiler to handle loops bigger than 128KB (which can't be encoded in s_branch), so they replaced almost every jump with these 4-cycle far jumps, even when the jump target is a well-known location within s_branch's reach.

(Btw: 64KByte does not fit in GCN's 32KByte code cache! You should keep that loop below 32K.)

Though I think the performance issue is more likely inside the check_data(...) region, not in this rarely executed loop management code.
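The /4 arithmetic and the 128KB limit can be sketched in C. This assumes s_branch carries a signed 16-bit offset counted in dwords, as in AMD's published Southern Islands ISA documentation; the helper name is hypothetical, and the sketch ignores which base PC each instruction uses, so treat it as an illustration rather than an exact encoder.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: convert a byte displacement into the signed 16-bit
 * dword offset that s_branch carries. Returns false when the displacement
 * is not dword-aligned or lies outside s_branch's reach (about +/-128 KB). */
static bool branch_simm16(int32_t byte_offset, int16_t *simm16)
{
    if (byte_offset % 4 != 0)
        return false;                 /* instructions are dword-aligned */
    int32_t dwords = byte_offset / 4;
    if (dwords < INT16_MIN || dwords > INT16_MAX)
        return false;                 /* needs a far jump instead */
    *simm16 = (int16_t)dwords;
    return true;
}
```

For the forward jump above, branch_simm16(0x38, &v) succeeds with v == 0x000E, while a displacement of 128 * 1024 bytes is rejected, which is the case the getpc/setpc sequence would be needed for.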

0 Likes

Well, I agree: of course this alone will not give a 20% perf loss.

I can see some improvements too (at least in theory):

  • Loops are unrolled even more now
  • exec mask instructions are more efficient (I can even see fewer branches in the code):

12.10 ISA:

  s_mov_b64     s[48:49], exec
  s_andn2_b64   exec, s[48:49], s[46:47]
  s_andn2_b64   s[44:45], s[44:45], exec
  s_cbranch_scc0  label_086E
  s_andn2_b64   exec, s[48:49], exec
  s_mov_b64     exec, s[48:49]
  s_mov_b64     exec, s[44:45]
  s_branch      label_0838
label_086E:

13.1 ISA:

  s_mov_b64     vcc, exec
  s_andn2_b64   exec, vcc, s[46:47]
  s_andn2_b64   s[44:45], s[44:45], exec
  s_cbranch_scc0  label_0C76
  s_mov_b64     exec, s[44:45]
  s_branch      label_0C42
label_0C76:
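To make the mask handling in the 13.1 listing easier to follow, here is a rough C emulation of those six instructions. It assumes s_andn2_b64 computes d = a & ~b and sets SCC when the result is non-zero, as in AMD's Southern Islands ISA documentation. The struct, the function names and the roles given to s[44:45] and s[46:47] are my own reading of the listing, not something the compiler output states.

```c
#include <stdbool.h>
#include <stdint.h>

/* One bit per work-item in a 64-wide wavefront. */
typedef struct {
    uint64_t exec;    /* lanes currently executing */
    uint64_t vcc;     /* scratch copy of exec, as in the 13.1 listing */
    uint64_t active;  /* s[44:45]: lanes still in the loop (assumed role) */
    bool     scc;     /* scalar condition code */
} wave_state;

/* s_andn2_b64 d, a, b : d = a & ~b, SCC = (d != 0) */
static uint64_t s_andn2_b64(uint64_t a, uint64_t b, bool *scc)
{
    uint64_t d = a & ~b;
    *scc = (d != 0);
    return d;
}

/* Emulates the 13.1 sequence; cond stands in for s[46:47].
 * Returns true when the loop is taken again (s_branch), false on exit. */
static bool loop_step(wave_state *w, uint64_t cond)
{
    w->vcc    = w->exec;                                  /* s_mov_b64 vcc, exec */
    w->exec   = s_andn2_b64(w->vcc, cond, &w->scc);       /* lanes with cond clear */
    w->active = s_andn2_b64(w->active, w->exec, &w->scc); /* drop those lanes */
    if (!w->scc)                                          /* s_cbranch_scc0 */
        return false;
    w->exec = w->active;                                  /* s_mov_b64 exec, s[44:45] */
    return true;
}
```

Starting from exec = active = 0xF, a step with cond = 0xC leaves active = 0xC and continues; a following step with cond = 0 clears active and exits. The 12.10 listing reaches the same state with two extra exec writes.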

So the question is still open: what makes it slower?

0 Likes

Hi everyone,

From the last few posts, it looks like there have been some optimizations in driver 13.1 that have adversely affected a few applications. It would be helpful if someone could help pinpoint this issue. You can point to any SDK sample, or a small test case, that shows the performance drop just by switching drivers.

I tried a few SDK samples: MatrixMulImage, BlackScholes & LDSMemoryBandwidth, but did not see any change in performance.

0 Likes

Hi!

Here is a small testcase that shows quite a big (~33% difference in fps) performance drop on my HD5830 just by using different amdocl.dll versions. I've included the two dlls from 12.10 and 13.1 to make the testing easier, and two pictures to show the obvious performance difference on my card. Hope it helps.

http://www.mediafire.com/?nip722foiqoc4v8

Thanks darkhmz,

Will look into the test case and let you know.

Is this a Windows issue or a Linux one? It would be helpful if you could give more details about your setup.

0 Likes

Hi!

Win7 x64 + Catalyst 12.10 here...

0 Likes

Hi darkhmz,

I have been working on it. I was able to see the slowdown in kernel execution (from the CodeXL output) using the DLLs you provided.

But I also tried setting up a fresh system with just the AMD driver installed. When I installed Catalyst 12.10 and ran the executable using your DLLs (12.10 & 13.1), I did not see the performance degradation. With Catalyst's own amdocl.dll the fps was consistent as well. Still digging into it.

Did you make any progress in narrowing down the issue?

Surprisingly, CodeXL still shows the difference in kernel timings (~33%) when run on the fresh machine with just the driver installed. Would it be possible for you to share some code that I can compile? It is a 32-bit exe on a 64-bit Win7 platform. Do you see a similar performance drop with a 64-bit executable too?

Message was edited by: Himanshu Gautam

0 Likes

Hi Himanshu,

I've compiled a 64-bit version and tested again, this time with the amdocl64.dll files, and the performance difference is still there. However, if I change the scene, the difference disappears in some cases. For example, with the following simple plane + bumpy torus scene I didn't see an fps difference.

float4 de(float4 p, float4 q)
{
    float dst1 = dfPlane(p, (float4)(0.0f, 1.0f, 0.0f, -1.0f));
    float dst2 = dfTorus(p, (float2)(2.5f, 0.8f)) - max(perlin(p * 3.0f) * 0.1f, 0.0f);
    return (float4)(U(dst1, dst2), 0.0f, 0.2f, 0.0f);
}

I'm going to try a fresh test system and share my code sometime this week, then test again.

0 Likes

Thanks for the update.

As I am able to see the difference in kernel execution times, I am planning to forward it to the AMD engineering team.

I will send them the source code, once you attach it here.

Thanks for reporting the problem.

BTW I accidentally marked this post as assumed answered. Not sure how to revert it though.

0 Likes

Hi, here is my code.

np, glad i could help.

edit: link removed, file attached.

0 Likes

I have a small program which shows about a 25% performance drop with the 13.1 drivers. Do you have an email address I can send it to? (It is small, but I would rather not upload it to a public forum unless absolutely necessary.)

0 Likes

Hi yurtsen,

I am afraid test cases can only be submitted through a public medium. I would recommend starting a new thread, so it is easy to track. My apologies for the inconvenience.

0 Likes

himanshu.gautam wrote:

Hi yurtsen,

I am afraid test cases can only be submitted through a public medium. I would recommend starting a new thread, so it is easy to track. My apologies for the inconvenience.

I would understand this if I were looking for a problem in my program. But it doesn't make much sense, since the problem appears to be in the driver and nobody other than AMD has to see the code. However, I will try to find out whether I am allowed to share the code publicly and get back to you in a new thread if I can.

0 Likes

Thanks for your support.

I had asked for a private message channel, but there appear to be some legal problems with that. I hope you will be able to reproduce your problem with a small test case that is easy for you to share publicly.

0 Likes

himanshu.gautam wrote:

Thanks for your support.

I had asked for a private message channel, but there appear to be some legal problems with that. I hope you will be able to reproduce your problem with a small test case that is easy for you to share publicly.

The code itself does not have any copyright/license; it is our own experimental research code. The code is already a small test case; we simply do not want it out in public yet. But I will see how flexible we can be on that....

0 Likes