6 Replies Latest reply on Jul 18, 2008 8:38 PM by Nexis

    Loop on the GPU

    Nexis
      Is it possible to run my kernel in an infinite loop?

      Would it be possible to run a kernel inside an infinite loop on the GPU, so that I only need to set a flag to 1 to start it and have the kernel set it back to 0 when it finishes its calculations? That way I could start the next iteration without having to call the kernel again.

      I guess there would be some limitations: the domain of the kernel would have to be small enough that all the threads can run simultaneously on the GPU...

      Thanks for the help

        • Loop on the GPU
          Nexis

          I'm trying to implement a kernel running in an infinite loop on the GPU, but I can't seem to perform uncached accesses to remote memory...

          I have the following kernel where g[] is mapped to some remote memory:

          il_ps_2_0
          dcl_output_generic o0
          dcl_cb cb0[1]
          mov r0, r0.0000
          mov r0.x, cb0[0]
          whileloop
           sub r0, r0, r1.1000
           break_logicalz r0
          endloop
          mov o0, g[0]
          ret_dyn
          end

          I set the constant cb0[0] to 10000000 so that the kernel loops for about a second, and during that time, on the PC, I modify the values in remote memory. As expected, the outputs of the kernel are the values I set while it was running, not the initial ones...

          My problem is that with the following kernel, the output values are no longer the ones I set while the kernel was running, but instead the initial values that were there when the kernel was launched...

          il_ps_2_0
          dcl_output_generic o0
          dcl_cb cb0[1]
          mov o0, g[0]
          mov r0, r0.0000
          mov r0.x, cb0[0]
          whileloop
           sub r0, r0, r1.1000
           break_logicalz r0
          endloop
          mov o0, g[0]
          ret_dyn
          end

           After investigation, I found out that the compiler removed the last global memory read:

          00 MEM_GLOBAL_READ: R0, DWORD_PTR[0], ELEM_SIZE(3)
          01 ALU: ADDR(32) CNT(1) KCACHE0(CB0:0-15)
                0  x: MOV         R1.x,  KC0[0].x
          02 LOOP_DX10 i0 FAIL_JUMP_ADDR(5) VALID_PIX
              03 ALU_BREAK: ADDR(33) CNT(2)
                    1  x: ADD         R1.x,  R1.x, -1.0f
                    2  x: PREDNE_INT  ____,  R1.x,  0.0f      UPDATE_EXEC_MASK UPDATE_PRED
          04 ENDLOOP i0 PASS_JUMP_ADDR(3)
          05 EXP_DONE: PIX0, R0
          END_OF_PROGRAM

          So instead, I compiled my kernel directly from the disassembly and added the last global memory read myself:

          00 MEM_GLOBAL_READ: R0, DWORD_PTR[0], ELEM_SIZE(3)
          01 ALU: ADDR(32) CNT(1) KCACHE0(CB0:0-15)
                0  x: MOV         R1.x,  KC0[0].x
          02 LOOP_DX10 i0 FAIL_JUMP_ADDR(5) VALID_PIX
              03 ALU_BREAK: ADDR(33) CNT(2)
                    1  x: ADD         R1.x,  R1.x, -1.0f
                    2  x: PREDNE_INT  ____,  R1.x,  0.0f      UPDATE_EXEC_MASK UPDATE_PRED
          04 ENDLOOP i0 PASS_JUMP_ADDR(3)
          05 MEM_GLOBAL_READ: R0, DWORD_PTR[0], ELEM_SIZE(3)
          06 EXP_DONE: PIX0, R0
          END_OF_PROGRAM

          I was very surprised to see that this last kernel still behaves the same way: the output values are the initial ones, not the ones I set while the kernel is running! Also, if I comment out the first global memory read, the output values are the ones I set while the kernel is running... The only explanation I can find for this behavior is that the memory reads I'm doing are cached, so the second read doesn't actually go to remote memory but hits the cache. From what I could understand of the documentation, scattered reads are not supposed to be cached...

          Are the reads I'm doing scattered reads (using g[])? And how do I perform truly uncached reads?

          Thanks a lot for any help

          • Loop on the GPU
            MicahVillmow
            Nexis,
            We are looking into this, but since many people have already left for break, the answer will probably have to wait until next week.
              • Loop on the GPU
                Nexis

                Thank you Micah,

                To help you reproduce my situation, here is what I do on the CPU:

                // Get the pointer to the remote memory
                calResMap((void**)&dataPtr, &pitch, remoteRes, 0);
                // Set initial values in remote memory
                dataPtr[0] = 1;
                dataPtr[1] = 2;
                dataPtr[2] = 3;
                dataPtr[3] = 4;
                calResUnmap(remoteRes);
                // Launch the kernel
                calCtxRunProgram(&event, ctx, entry, &domain);
                // Query the event once so the command queue is flushed
                calCtxIsEventDone(ctx, event);
                // Busy-wait until the kernel has actually started on the GPU
                for (int i = 0; i < 10000000; ++i);
                // Set new values in remote memory
                dataPtr[0] = 100;
                dataPtr[1] = 200;
                dataPtr[2] = 300;
                dataPtr[3] = 400;
                // Wait for the kernel to finish
                while(calCtxIsEventDone(ctx, event) == CAL_RESULT_PENDING);
                calResMap((void**)&dataPtr, &pitch, outputRes, 0);
                printf("%f\n", dataPtr[0]);
                printf("%f\n", dataPtr[1]);
                printf("%f\n", dataPtr[2]);
                printf("%f\n", dataPtr[3]);
                calResUnmap(outputRes);

                You may have to adjust the for loop that waits for the kernel to start, depending on your CPU speed. If the loop is too short, the new values will be written to memory before the kernel begins, and you would get the new values as output even with just a single global memory read at the beginning of the kernel... If it's too long, the values will be set after the kernel finishes...

                Thanks a lot for your help

                  • Loop on the GPU
                    Nexis

                    Any news on how I could make the GPU re-read a flag set in remote global memory?

                    Isn't there any way I could trick the GPU into reading the flag as if it were declared volatile?

                • Loop on the GPU
                  MicahVillmow
                  Nexis,
                  From talking to other engineers, your current implementation is invalid: you are using a pointer after you unmap the memory. At that point there is no guarantee on the behavior, and it can be considered an illegal operation. Is there a specific reason why you want to keep the GPU busy 100% of the time instead of only using it when required? By putting the GPU in an infinite loop you pretty much keep Windows from updating the screen, and you run the risk that Windows kills the driver because it considers it hung. If Windows kills the driver, that causes a blue screen.
                    • Loop on the GPU
                      Nexis

                      Well, the goal of keeping the GPU busy 100% of the time is to avoid the overhead of launching the kernel every 50 µs. By looping on the GPU, the kernel would just poll a variable on the host to know when to start a new iteration. And by running this on a card with no display connected, the OS wouldn't kill the GPU's driver...

                      As for using the pointer after the memory has been unmapped, I didn't think it could pose a problem since it's a pointer to host memory... Unmapping the memory doesn't change the value of the pointer, and if you use it to change values in memory, the new values will be seen by any kernel launched after the modification. As my test showed, they are even seen by a kernel that is running while the values are modified, as long as it hasn't already read them... But if you're telling me it's not possible to do so, then I guess I'll just abandon this idea...

                      Thanks for your help and your time...