6 Replies Latest reply on Jul 18, 2008 8:38 PM by Nexis

    Loop on the GPU

    Nexis
      Is it possible to run my kernel in an infinite loop?

      Would it be possible to run a kernel inside an infinite loop on the GPU, so that I only need to set a flag to 1 to start it and have the kernel set it back to 0 when it finishes its calculations? That way I could start the next iteration without having to call the kernel again.

      I guess there would be some limitations: the domain of the kernel would have to be small enough that all the threads can run simultaneously on the GPU...

      Thanks for the help

        • Loop on the GPU
          Nexis

          I'm trying to implement a kernel running in an infinite loop on the GPU, but I can't seem to perform uncached accesses to remote memory...

          I have the following kernel where g[] is mapped to some remote memory:

          il_ps_2_0
          dcl_output_generic o0
          dcl_cb cb0[1]
          mov r0, r0.0000
          mov r0.x, cb0[0]
          whileloop
           sub r0, r0, r1.1000
           break_logicalz r0
          endloop
          mov o0, g[0]
          ret_dyn
          end

          I set the constant cb0[0] to 10000000 so that the kernel loops for about a second, and during that time, on the PC, I modify the values in remote memory. As expected, the outputs of the kernel are the values I set while it was running, not the initial ones...

          My problem is that with the following kernel, the output values are no longer the ones I set while the kernel was running, but instead the initial values that were there when the kernel was launched...

          il_ps_2_0
          dcl_output_generic o0
          dcl_cb cb0[1]
          mov o0, g[0]
          mov r0, r0.0000
          mov r0.x, cb0[0]
          whileloop
           sub r0, r0, r1.1000
           break_logicalz r0
          endloop
          mov o0, g[0]
          ret_dyn
          end

           After investigation, I found out that the compiler removed the last global memory read:

          00 MEM_GLOBAL_READ: R0, DWORD_PTR[0], ELEM_SIZE(3)
          01 ALU: ADDR(32) CNT(1) KCACHE0(CB0:0-15)
                0  x: MOV         R1.x,  KC0[0].x
          02 LOOP_DX10 i0 FAIL_JUMP_ADDR(5) VALID_PIX
              03 ALU_BREAK: ADDR(33) CNT(2)
                    1  x: ADD         R1.x,  R1.x, -1.0f
                    2  x: PREDNE_INT  ____,  R1.x,  0.0f      UPDATE_EXEC_MASK UPDATE_PRED
          04 ENDLOOP i0 PASS_JUMP_ADDR(3)
          05 EXP_DONE: PIX0, R0
          END_OF_PROGRAM

          So instead, I compiled my kernel directly from the disassembly and added the last global memory read myself:

          00 MEM_GLOBAL_READ: R0, DWORD_PTR[0], ELEM_SIZE(3)
          01 ALU: ADDR(32) CNT(1) KCACHE0(CB0:0-15)
                0  x: MOV         R1.x,  KC0[0].x
          02 LOOP_DX10 i0 FAIL_JUMP_ADDR(5) VALID_PIX
              03 ALU_BREAK: ADDR(33) CNT(2)
                    1  x: ADD         R1.x,  R1.x, -1.0f
                    2  x: PREDNE_INT  ____,  R1.x,  0.0f      UPDATE_EXEC_MASK UPDATE_PRED
          04 ENDLOOP i0 PASS_JUMP_ADDR(3)
          05 MEM_GLOBAL_READ: R0, DWORD_PTR[0], ELEM_SIZE(3)
          06 EXP_DONE: PIX0, R0
          END_OF_PROGRAM

          I was very surprised to see that this last kernel still behaves the same way: the output values are the initial ones, not the ones I set while the kernel is running! Also, if I comment out the first global memory read, the output values are the ones I set while the kernel is running... The only explanation I can find for this behavior is that the memory reads I'm doing are cached, so the second read doesn't actually go to remote memory but hits the cache. From what I could understand of the documentation, scattered reads are not supposed to be cached...

          Are the reads I'm doing scattered reads (using g[])? And how do I perform truly uncached reads?

          Thanks a lot for any help

          • Loop on the GPU
            MicahVillmow
            Nexis,
            We are looking into this, but since many people have already left for break, the answer will probably have to wait until next week.
              • Loop on the GPU
                Nexis

                Thank you Micah,

                To help you reproduce my situation, here is what I do on the CPU:

                // Get the pointer to the remote memory
                calResMap((void**)&dataPtr, &pitch, remoteRes, 0);
                // Set initial values in remote memory
                dataPtr[0] = 1;
                dataPtr[1] = 2;
                dataPtr[2] = 3;
                dataPtr[3] = 4;
                calResUnmap(remoteRes);
                // Launch the kernel
                calCtxRunProgram(&event, ctx, entry, &domain);
                // Query the event once so the command queue is flushed
                calCtxIsEventDone(ctx, event);
                // Busy-wait until the kernel has actually started on the GPU
                for (int i = 0; i < 10000000; ++i);
                // Set new values in remote memory
                dataPtr[0] = 100;
                dataPtr[1] = 200;
                dataPtr[2] = 300;
                dataPtr[3] = 400;
                // Wait for the kernel to finish
                while(calCtxIsEventDone(ctx, event) == CAL_RESULT_PENDING);
                calResMap((void**)&dataPtr, &pitch, outputRes, 0);
                printf("%f\n", dataPtr[0]);
                printf("%f\n", dataPtr[1]);
                printf("%f\n", dataPtr[2]);
                printf("%f\n", dataPtr[3]);
                calResUnmap(outputRes);

                You may have to adjust the for loop that waits for the kernel to start, depending on your CPU speed. If the loop is too short, the new values will be written to memory before the kernel begins, and you would get the new values as output even with just a single global memory read at the beginning of the kernel... If it's too long, the values will be set after the kernel finishes...

                Thanks a lot for your help

                  • Loop on the GPU
                    Nexis

                    Any news on how I could make the GPU re-read a flag set in remote global memory?

                    Isn't there any way I could trick the GPU into reading the flag as if it were declared volatile?

                • Loop on the GPU
                  MicahVillmow
                  Nexis,
                  From talking to other engineers, your current implementation is invalid: you are using a pointer after you unmap the memory. At that point there is no guarantee on the behavior, and it can be considered an illegal operation. Is there a specific reason why you want to keep the GPU busy 100% of the time instead of only using it when required? By putting the GPU in an infinite loop you pretty much keep Windows from updating the screen, and you run the risk that Windows kills the driver because it considers it hung. If Windows kills the driver, that causes a blue screen.
                    • Loop on the GPU
                      Nexis

                      Well, the goal of keeping the GPU busy 100% of the time is to avoid the overhead of launching the kernel every 50 µs. By looping on the GPU, the kernel would just poll a variable on the host to know when to start a new iteration. And by running this on a card with no display connected, the OS wouldn't kill the GPU's driver...

                      As for using the pointer after the memory has been unmapped, I didn't think it could pose a problem since it's a pointer to host memory... Unmapping the memory doesn't change the value of the pointer, and if you use it to change values in memory, the new values will be seen by any kernel launched after the modification. As my test showed, they are even seen by a kernel that is running while the values are modified, as long as it hasn't already read them... But if you're telling me it's not possible to do so, then I guess I'll just abandon this idea...

                      Thanks for your help and your time...