2 Replies Latest reply on Jun 19, 2012 4:28 PM by aisesal@gmail.com

    Sequential work group sync.




      I use a hex editor to patch instructions in compiled OpenCL kernel binaries. This has let me take advantage of instructions such as MUL_PREV and FLT_TO_INT_FLOOR (available in the xyzw slots). There are a few more useful instructions like these that the compiler doesn't use.


      I'm looking for information on how to use the GROUP_SEQ_BEGIN and GROUP_SEQ_END instructions available on the Evergreen architecture:

      1. Are they available on the HD 6850?
      2. Do they require all threads within a workgroup to reach them, as with the GROUP_BARRIER instruction?
      3. Do they take any arguments?
      4. Do they require any additional information in the binary file to work? Maybe they require some setup by the OpenCL runtime that is not accessible through the current API?
      5. Can the sequenced region span multiple ALU clauses?
      6. Where should they be placed: as the first/last instruction of an ALU clause, or anywhere within a clause? Which VLIW slot should they use?


      So far I have been unable to make these instructions work. They don't cause any crashes and act as NOPs on my video card. I found no information on the web, so I thought I'd write a post myself. I understand I'm asking for some low-level stuff, but since there's no official way of using the full potential of the video card, people have to resort to hacks such as manually editing binaries.

        • Re: Sequential work group sync.

          These instructions require hardware setup outside of the binary that is not accessible. So while your modifications will work for certain instructions, instructions that have external dependencies will not.


          For the instructions you currently have to patch into the binary, in which cases do you need them? If you can provide examples, we can fix the compiler so that it generates them correctly.

            • Re: Sequential work group sync.

              Hi again.


              I've just tried a few more cases with GROUP_SEQ_BEGIN/END and it seems to work, but not in the way I was hoping. The documentation has conflicting statements: one says that each work item will run in sequence, the other that each wavefront in a workgroup will run sequentially. The second seems more likely to be correct, though I can't be sure because, as you said, some hardware setup is needed.


              As for the instructions I modify: the compiler never seems to generate ADD_PREV, MUL_PREV, or MULADD_PREV. There's nothing magic about these instructions, but they help with ALU packing. There are also some float->int and int->float conversion instructions that can go into the xyzw slots, allowing 4 conversions per VLIW instruction instead of 1.
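
              To illustrate the conversion case, here is a minimal sketch in plain C of the source-level pattern: four independent float->int conversions that round toward negative infinity. The type and function names are hypothetical stand-ins (for OpenCL's float4/int4 and a floor-then-convert), and the sketch only models the arithmetic. It says nothing about which instructions a given compiler version emits or how they are packed.

```c
/* Hypothetical 4-component types, standing in for OpenCL's float4/int4. */
typedef struct { float x, y, z, w; } float4_t;
typedef struct { int x, y, z, w; } int4_t;

/* Scalar model of a float -> int conversion that rounds toward
 * negative infinity. Assumption: the input fits in the range of int. */
int flt_to_int_floor(float f)
{
    int i = (int)f;        /* C truncates toward zero */
    if ((float)i > f) {    /* true only for negative non-integers */
        i -= 1;            /* step down to the floor */
    }
    return i;
}

/* Four independent conversions, one per component. Because no
 * component depends on another, this is the kind of work that could
 * in principle share a single VLIW bundle. */
int4_t convert_floor4(float4_t v)
{
    int4_t r;
    r.x = flt_to_int_floor(v.x);
    r.y = flt_to_int_floor(v.y);
    r.z = flt_to_int_floor(v.z);
    r.w = flt_to_int_floor(v.w);
    return r;
}
```

              The point of the sketch is only that the four results are independent of each other, which is what makes the one-conversion-per-slot packing possible at all.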


              Also, the compiler doesn't seem to generate code that uses destination register modifiers such as ADD_SAT, MUL_SAT, MULADD/2, MULADD*2, etc.
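
              As a rough model of what these modifiers compute, here is a minimal sketch in plain C. It assumes that the _SAT suffix clamps the result to [0.0, 1.0] and that /2 and *2 scale the result after the operation. The helper names are hypothetical, and the sketch ignores rounding, denormal, and NaN details of the real hardware.

```c
/* Hypothetical helpers that model the modifier arithmetic only. */

/* Clamp modifier (the _SAT suffix): limit the result to [0.0, 1.0]. */
float saturate(float x)
{
    return x < 0.0f ? 0.0f : (x > 1.0f ? 1.0f : x);
}

/* ADD_SAT: add, then clamp. */
float add_sat(float a, float b)
{
    return saturate(a + b);
}

/* MUL_SAT: multiply, then clamp. */
float mul_sat(float a, float b)
{
    return saturate(a * b);
}

/* MULADD/2: multiply-add, then halve the result. */
float muladd_div2(float a, float b, float c)
{
    return (a * b + c) * 0.5f;
}

/* MULADD*2: multiply-add, then double the result. */
float muladd_mul2(float a, float b, float c)
{
    return (a * b + c) * 2.0f;
}
```

              In each case the clamp or scale rides along with the arithmetic instruction, so source code that writes the clamp or the extra multiply separately is a candidate for folding into one instruction.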


              It also seems that the compiler always generates a SET_?? instruction followed by PREDE_INT/PREDNE_INT, instead of using a PRED_SET?? instruction directly.


              Some time ago I wrote about a problem with the read_image function: it uses the SAMPLE instruction and does some math on the coordinates, instead of using LD when that's possible.