4 Replies Latest reply on Sep 25, 2011 3:18 AM by notzed

    Persistent thread on the CPU

    spectral

      Hi,

       

       

      I'm trying to implement some "persistent thread" on the CPU to batch a set of tasks, but I got some strange results.
      I have put some "printf" in the following code. What is strange is that I see the "BEFORE" before the "START" !
      I have a local barrier and so I should see "START" before !!


      { const size_t lid = get_local_id(0); __local volatile int localPoolNextRay[1]; __local volatile int localPoolRayCount[1]; if (lid < 1){ localPoolNextRay[0] = localPoolRayCount[0] = 0; printf("START %d %d : %d\n", get_global_id(0), lid, localPoolNextRay[0]); } barrier(CLK_LOCAL_MEM_FENCE); while(true) { // Local pool is empty if (localPoolRayCount[0] < 1 && lid < 1) { localPoolNextRay[0] = atom_add(globalPoolNextRay, LOAD_BALANCER_BATCH_SIZE); localPoolRayCount[0] = LOAD_BALANCER_BATCH_SIZE; } mem_fence(CLK_LOCAL_MEM_FENCE); //barrier(CLK_LOCAL_MEM_FENCE); printf("BEFORE %d %d : %d\n", get_global_id(0), lid, localPoolNextRay[0]); // Get rays from local pool int myRayIndex = localPoolNextRay[0] + lid; if (myRayIndex >= globalPoolRayCount) return; printf("AFTER %d\n", myRayIndex); mem_fence(CLK_LOCAL_MEM_FENCE); if (lid < 1) { localPoolNextRay[0] += 32; localPoolRayCount[0] -= 32; //mem_fence(CLK_LOCAL_MEM_FENCE); } mem_fence(CLK_LOCAL_MEM_FENCE); // Execute trace(myRayIndex, tasks); } }

        • Persistent thread on the CPU
          genaganna

           

          Originally posted by: viewon01 Hi,

          I'm trying to implement some "persistent thread" on the CPU to batch a set of tasks, but I got some strange results.I have put some "printf" in the following code. What is strange is that I see the "BEFORE" before the "START" !I have a local barrier and so I should see "START" before !!

           

          Could you please give us SDK Version, driver version, CPU, OS?

           

           

            • Persistent thread on the CPU
              spectral

              Hi,

               

              I'm able to run it correctly, but for this I have to put barrier everywhere ! In fact, because there is no "natural" SIMT behavior on the CPU I'm not sure that using persitent thread will help on the CPU ! (Lot of barriers or atomic functions !!).

              BTW, 
              Global work = 704
              Local work = 32


                • Persistent thread on the CPU
                  notzed

                   

                  Originally posted by: viewon01 Hi,
                  I'm able to run it correctly, but for this I have to put barrier everywhere ! In fact, because there is no "natural" SIMT behavior on the CPU I'm not sure that using persitent thread will help on the CPU ! (Lot of barriers or atomic functions !!).
                  BTW, 
                  Global work = 704
                  Local work = 32

                   



                  Yes, you will have to put a barrier everywhere it is needed.  On CPU implementations (afaik) the workgroup is implemented as a single thread.

                  The barrier instructions basically break the code up into sections, and conceptually the first bit of your code will be implemented something like this:  (again, afaik ...)

                  // first barrier block

                  for (int lid = 0;lid

                    if (lid < 1) {

                    localPoolNextRay[0] = ...

                   }

                  }

                  // second barrier block

                  for (int lid=0;lid

                    if (localpoolraycount[0] < 1 ...) {

                   ...

                  }

                  }

                  This is more efficient than it looks since the compiler can (possibly) optimise out most of the code.  It also requires no critical sections or thread communication overheads.

                  But you can see that if the barrier wasn't there, lid ==0 would just run to completion before lid==1 even got a shot.  (i.e. think of there being a single implicit barrier at the end, and with no other barriers you have the single loop over lid).

                   

                  But, overall your code looks more complex than it might need to be.

                  e.g. off the top of my head the following looks simpler and does the same thing:

                  local int localPoolBase; (you can just use simple types as locals, they don't need to be arrays)

                  // allocate first batch to our pool

                  if (lid == 0) {

                     localPoolBase = atom_add(globalpoolnextray, 32);

                  }

                  barrier(LOCAL);

                  do {

                    if (lid + localPoolBase < globalRayCount) {

                    // process ray id = (lid + localPoolBase)

                    }

                   

                  barrier(LOCAL);

                  // get next batch

                  if (lid == 0) {

                     localPoolBase = atom_add(globalpoolnextray,32 );

                  }

                  barrier(LOCAL);

                  // while we have at least 1 more lot of work to do

                  } while (localPoolBase < globalRayCount);

                  I also removed the implied inner loop: it doesn't seem useful to have batches of batches, 1 batch is enough.  If you want to do more than 32 lots of work per batch, just increase the workgroup size.  Actually having batches of batches isn't a good idea here: you're striding across 32 items at a time which will be particularly cache unfriendly.  i.e. lid==0 will do rays 0, 32, ... LOAD_BALANCER_BATCH_SIZE-31, before lid==1 even starts on rays 1, 33, ...

                  NB: barriers and CPU are kind of touchy: every SIMT `thread' in the workgroup needs to execute the same number of barriers.  i.e. you can't early terminate some of them.  Which is why the code above has all work items run the same number of loops.

                   

                   

              • Persistent thread on the CPU
                MicahVillmow
                viewon01,
                Printf is synchronous on the GPU because the GPU does not have direct access to stdout. Also, unless you are tagging your printf instructions with the global ID, you will not know what thread the printf is coming from. Printf on a GPU is basically a function that has race conditions.