16 Replies Latest reply on Sep 14, 2011 8:38 PM by corry

    HD6970/6990 question

    twisted_pi

      Hello

      At first sorry for my english.

      I have some problem now with my clients who uses my software and GPU. My code works on 5870/5970 cards in CAL(code created in OpenCL, compiled to IL and a bit modified) with some modifications in ISA ASM(pressured use some patching coz of lack of all ASM commands in compiler).

      Because is almost not possible to buy 5970, peoples buy 6970/6990 cards. To be still on market we must add new kernel for users, but here are problem. To be at TOP on market, out code must be fastest, it can be achieved by patching kernel for some ASM commands which are not supported by AMD compilet for unknown reason. Now kernels for 6970/6990 do not works correctly, so I assume that assmebler opcodes are changed(maybe all or maybe some), is this true? If yes, then is possible to get new Instruction Set info? If not, then how to fix that big problem if I cannot use 5970/6990 in OpenCL coz bug with 1 GPU working?

      And last question, why latest update still do not support full potential of AMD cards? I was really 99% sure that it will be available in APP SDK 2.3, but not. Do AMD programming crew is anyhow informed about really important problems?? 1 GPU bug already exist for 1 year and 4 months, so same problem is with compiler which do not use full GPU potential.

      We and our clients who mostly buy 2x5970 and they are thousands are really presured to change brand to NVIDIA, which show that their latest products are really close in integer numbers calculations(in real numbers they are faster always). We really love AMD , we can understand many things, we can understand that no one is perfect and make bugs, we can wait... but how long?

      In final words, our 6970/6990 kernels must works slower than 5970(1 core), even if that cards have higher clocks, but 5-way VLIW changed to 4-way make no difference in speed - because lack of info, again not used full potential by compiler. Maybe they are faster in games(better tesselation etc), but not in OpenCL what is future for GPGPU.Now when I saw info that CAL support will be discontinued(is already in 11.4) in next releases, we have no choice to inform our clients to prepare to swap GPUs from more serious company, who listen community.

      Sorry for sad words, but this is like we feel. We will not lose our customers, they will understand that is not our fault, AMD will lose.

      I do not blame you supporters, coz you own job and I know that will get response similar to "we are working on it", just please inform correct persons who are responsible for it.

       

        • HD6970/6990 question
          himanshu.gautam

          twisted_pl,

          can you provide a testcase where 5970 works but 6970 doesn't.

          Does it give compilation error or is it a run time problem?

          Please specify the system details also.

           

            • HD6970/6990 question
              twisted_pi

              Pure OpenCL code works on both cards(only 1 core, what is obvious for ~1,5 years), problem is with CAL. I compile OpenCL code to IL and I patch ELF kernel in RAM to use not supported BFI_INT instruction(missed for more that 1 year). I do test on remote client PC who have 6990, but I'll get 6990 in 2-3 days to test it at my PC. As I said in previous post, same code works at same speed on 6990 and 5970, just 6990 produce bad result after patching, probably of changed ISA opcodes for 69x0. Without patching code, 6990 will be slower than 5970(that's why I must patch it to get at least at same speed).

              simple code case:

              OpenCL

              __kernel void test(__global uint *out)
              {
              //out[get_global_id(0)] = bitselect(get_global_id(0),get_global_id(0),get_global_id(0));
              out[get_global_id(0)] = 1;
              }

              compiled to ISA

              ; --------  Disassembly --------------------
              00 ALU: ADDR(32) CNT(9) KCACHE0(CB0:0-15) KCACHE1(CB1:0-15)
                    0  x: MOV         R1.x,  1     
                       t: MULLO_INT   ____,  R1.x,  KC0[1].x     
                    1  x: ADD_INT     ____,  R0.x,  PS0     
                    2  w: ADD_INT     ____,  PV1.x,  KC0[6].x     
                    3  z: LSHL        ____,  PV2.w,  (0x00000002, 2.802596929e-45f).x     
                    4  y: ADD_INT     ____,  KC1[0].x,  PV3.z     
                    5  x: LSHR        R0.x,  PV4.y,  (0x00000002, 2.802596929e-45f).x     
              01 MEM_RAT_CACHELESS_STORE_RAW: RAT(11)[R0].x___, R1, ARRAY_SIZE(4)  VPM
              END_OF_PROGRAM

              bitselect should be compiled to native ISA command(probably this should be as AMD extension).

              I'm not sure if You understand my problem, so will try explain again?

              first problem:

              I want to use 100% potential of expensive graphic card(s), but is not possible, so we(AMD users) must make dirty tricks, ELF file(in RAM) patching to have BFI_INT instruction, what is not compiled by compiler(bitselect in OpenCL). Without patching I lose speed. Patching works for 5x70 kernels, but not works on 69x0 but it should, so I assume ISA opcodes are changed

              second problem:

              I know that is not good coding style, but is only way to use potential of card. We are presured to use CAL, because in OpenCL do not works 2 GPU cores, is like faulty product(5970) for 1,5 years already. If I and my clients cannot use OpenCL and CAL(will be not supported soon), then what we can do? Simply give back on warranty as fault product? I buy graphic card not for games but for OpenCL/CAL.

              Both cases was already posted by some users, with no response or standard response "we are working on it". So I'm prepared to get similar answer. I need just serious response about what we(me and my clients) should do (many buy 6990) with their graphic cards(more than 1000 pcs)? Wait another year or give back and request money back?

               

            • HD6970/6990 question
              MicahVillmow
              twisted_pi,
              Thank you for bringing this about bitselect to our attention. I will see what we can do to produce better code. We have some optimizations that produce BFI_INT, but it requires you to code similiar to this pattern:
              (A & MASK) | (B & MASK2) where both MASK and MASK2 are compile time constants, or where MASK2 is equal to ~MASK.
                • HD6970/6990 question
                  afo

                  Hi,

                  "We have some optimizations that produce BFI_INT, but it requires you to code similiar to this pattern:
                  (A & MASK) | (B & MASK2) where both MASK and MASK2 are compile time constants, or where MASK2 is equal to ~MASK. "

                  Those instrucction patterns are documented somewhere? That would be great!. For example, is there an instrucction pattern that generates icbits?

                  By the way, Are there specific cases where multiGPU works? (i.e. one context one thread for each GPU). Looking at AMD posts, it seems that SDK is close to it, so if there are corner cases where multiGPU should work, I would like to know.

                  best regards,

                  Alfonso

                • HD6970/6990 question
                  MicahVillmow
                  twisted_pi,
                  I forgot to include a link to the to the 6990 ISA document found here:
                  http://developer.amd.com/gpu/A..._Set_Architecture.pdf

                  If you find the document in error, please let us know.
                    • HD6970/6990 question
                      twisted_pi

                      Hello

                       

                      Micah Thanks for reply and link. I just got to hands and testing HD6990 at my PC. I found out that problem is probably not with changed ISA opcodes, but code compiled from OpenCL to IL(modified to not use functions, same as for HD5x70, but without any patching), do not works. It do not calculate correct result. I see that new driver compile to IL and use UAV(11), before it was UAV(1), do it is possible that 66x0 need to use specific UAV, in this case number 11? I just trying to find out why my code do not produce correct results. Will try from simple code. About BFI_INT, are You mean that is already supported in this release or programming crew have solution and testing it?

                      EDIT: I tested simple code and works, UAV number not matter,interesting is that IL code compiled for Cypress and Cayman are identical, difference is only comment there ";device:cayman" and ";device:cypress", not more differences. Loaded and compiled IL code in CAL works for 5970 but do not works for 6990.  g_attribs.target return value 15 and I use it in calclCompile as target parameter. I have no more idea what can be wrong. I use vAbsTidFlat.x as global_id. I cannot post code becuase is product for sell. Will dig more, but maybe You have any ideas about if something is broken in IL > ISA compilation or ELF production for 6990?

                       

                      Best Regards

                    • HD6970/6990 question
                      MicahVillmow
                      twisted_pi,
                      We have some optimizations that detect certain instruction patterns and replace them with ubit_insert, which is slightly different than the BFI_INT instruction but compiles down to it.

                      As for UAV 11, all Evergreen and Northern Island devices will use this UAV as the default over UAV 1. UAV 11 is setup for caching, the other UAV's are not.
                        • HD6970/6990 question
                          twisted_pi

                          Hello

                           

                          Thanks for info about UAV 11. Do exists any document where that news/changes are wroted? Okay about BFI_INT is clear now. I have discovered that OpenCL ELF patching works, so is not a problem for now. Problem is IL to ISA compilation. As I mentioned before both codes compiled for 5x70 and 69x0 are 100% except one comment line. Of course they must be same because is IL language. That code works on 5x70 perfectly, but on 6x70 produce wrong results. I'll cut code to pieces and compare results step by step adding more and more code. Is really strange because it should works. Only difference what I see is that now IL code contain macro at start, something like checking for divide by zero?I think that it must be something simple, because simple code works without problem.

                           

                           

                          • HD6970/6990 question
                            corry

                            Micah,

                            I know this is an old thread, and I hate reviving old threads, in this case it seemed appropriate.

                            I just wanted to make sure information about the UAV caching hasn't fallen iff your radar.  I have the IL 2.0 spec, not sure if that is current, or was current as of this topics posting, but there is nothing in it about uav 11 being cached, nor all others not being cached.

                            So for my own information, let me get this 100% straight. 

                            If I use uav's in my code, anywhere at all, and I use dcl_uav(number) and number is not 11, I will not see the benefit of caching, however, the single uav that uses 11 *will* see the benefit of caching.  Do I have it right?

                            Thanks

                            Corry

                          • HD6970/6990 question
                            MicahVillmow
                            afo,
                            I'll see if we can get them added somewhere.
                            The patterns that I currently handle are as follows:
                            (A >> B) & C gets optimized to [u|i]bitfield_extract depending on the signedness
                            ((A & B) << C) | ((D & E) << F) and (A & B) | (C & ~B) gets optimized to ubitfield_insert with certain constraints.
                            These are:
                            1) B^E == 0
                            2) B, C, E and F are compile time constants
                            3) B and E are masks
                            4) ffb_lo(B) >= countbits(E) + ffb_lo(E) or ffb_lo(E) >= countbits(B) + ffb_lo(B)

                              • HD6970/6990 question
                                afo

                                Thanks for the information!

                                best regards,

                                Alfonso

                                • HD6970/6990 question
                                  twisted_pi

                                  Micach

                                   

                                  I fixed my problem, I found what is different in 5970/6990. I used 1 UAV to communicate on same range of data, for better explanation:

                                  offset 0-1000 was data, I read it and store too at this range. This is logical and works, but somehow not works on 6990, if U use write out of that range then works.

                                  I had today strange driver crash, my PC was suspended, after wake-up(windows 7 32bit) mouse pointer was destroyed and system freeze and do reboot. From this point I cant run win7 in normal mode. After remove drivers(must use any wipe sw, because uninstalling not works at emergency more) I cant install 10.3 drivers(I had 10.4b), installer say that I must have compatible hardware(I thought that my card is damaged).  After remove some files manually, I succesfully installed 10.3. Now I'm not 100% sure if is some differences in 5970/6990 or it was something with 10.4b driver. Anyway after finish important job I'll repeat test to confirm it.

                                   

                                • HD6970/6990 question
                                  MicahVillmow
                                  Caching for UAV11 has been in the SDK since 2.4 for very basic scenarios and has been improved in SDK 2.5.

                                  Basically if you mark a pointer as restrict, or const restrict, you will get caching on loads.
                                  As for IL, CAL only supports caching to UAV 11.
                                    • HD6970/6990 question
                                      corry

                                      Ok, at least I have it correct now, but I just did a search through my il 2.0e document for 11.  I didn't try b for hex 11, but I found nothing in the IL docs about UAV 11.  I also checked my cal Dec 2010 rev 2.03 for mention, of that, and found nothing.  Thus, the comment, I wanted to see if this was still on your radar for documentation.  (amounst other concerns of course)

                                      Thanks

                                    • HD6970/6990 question
                                      MicahVillmow
                                      corry,
                                      Because the CAL/IL interface has been deprecated, I do not know if the documentation will get updated.
                                        • HD6970/6990 question
                                          corry

                                          Oh!  I know CAL was, but I didn't realize that IL wasn't going to be published anymore as well.  :-/  I figured they were still going to give us code "hackers" willing to patch OpenCL binaries half-hearted support :-)