Hi,
{
    const size_t lid = get_local_id(0);
    __local volatile int localPoolNextRay[1];
    __local volatile int localPoolRayCount[1];
    if (lid < 1) {
        localPoolNextRay[0] = localPoolRayCount[0] = 0;
        printf("START %d %d : %d\n", get_global_id(0), lid, localPoolNextRay[0]);
    }
    barrier(CLK_LOCAL_MEM_FENCE);
    while (true)
    {
        // Local pool is empty
        if (localPoolRayCount[0] < 1 && lid < 1) {
            localPoolNextRay[0] = atom_add(globalPoolNextRay, LOAD_BALANCER_BATCH_SIZE);
            localPoolRayCount[0] = LOAD_BALANCER_BATCH_SIZE;
        }
        mem_fence(CLK_LOCAL_MEM_FENCE);
        //barrier(CLK_LOCAL_MEM_FENCE);
        printf("BEFORE %d %d : %d\n", get_global_id(0), lid, localPoolNextRay[0]);

        // Get rays from local pool
        int myRayIndex = localPoolNextRay[0] + lid;
        if (myRayIndex >= globalPoolRayCount)
            return;
        printf("AFTER %d\n", myRayIndex);
        mem_fence(CLK_LOCAL_MEM_FENCE);

        if (lid < 1) {
            localPoolNextRay[0] += 32;
            localPoolRayCount[0] -= 32;
            //mem_fence(CLK_LOCAL_MEM_FENCE);
        }
        mem_fence(CLK_LOCAL_MEM_FENCE);

        // Execute
        trace(myRayIndex, tasks);
    }
}
Originally posted by: viewon01 Hi,
I'm trying to implement some "persistent threads" on the CPU to batch a set of tasks, but I got some strange results. I have put some "printf" calls in the code above. What is strange is that I see the "BEFORE" before the "START"! I have a local barrier, so I should see "START" first!!
Could you please give us SDK Version, driver version, CPU, OS?
Hi,
Originally posted by: viewon01 Hi, I'm able to run it correctly, but for this I have to put barriers everywhere! In fact, because there is no "natural" SIMT behavior on the CPU, I'm not sure that using persistent threads will help there (lots of barriers and atomic functions!). BTW, Global work = 704, Local work = 32.
Yes, you will have to put a barrier everywhere it is needed. On CPU implementations (afaik) the workgroup is implemented as a single thread.
The barrier instructions basically break the code up into sections, and conceptually the first bit of your code will be implemented something like this (again, AFAIK):
// first barrier block
for (int lid = 0; lid < local_size; lid++) {
if (lid < 1) {
localPoolNextRay[0] = ...
}
}
// second barrier block
for (int lid = 0; lid < local_size; lid++) {
if (localPoolRayCount[0] < 1 ...) {
...
}
}
This is more efficient than it looks since the compiler can (possibly) optimise out most of the code. It also requires no critical sections or thread communication overheads.
But you can see that if the barrier wasn't there, lid==0 would just run to completion before lid==1 even got a shot (i.e. think of there being a single implicit barrier at the end; with no other barriers you get the single loop over lid).
But, overall your code looks more complex than it might need to be.
e.g. off the top of my head the following looks simpler and does the same thing:
__local int localPoolBase; // you can just use simple types as locals, they don't need to be arrays

// allocate first batch to our pool
if (lid == 0) {
    localPoolBase = atom_add(globalPoolNextRay, 32);
}
barrier(CLK_LOCAL_MEM_FENCE);
do {
    if (lid + localPoolBase < globalPoolRayCount) {
        // process ray id = (lid + localPoolBase)
    }
    barrier(CLK_LOCAL_MEM_FENCE);
    // get next batch
    if (lid == 0) {
        localPoolBase = atom_add(globalPoolNextRay, 32);
    }
    barrier(CLK_LOCAL_MEM_FENCE);
    // loop while we have at least one more batch of work to do
} while (localPoolBase < globalPoolRayCount);
I also removed the implied inner loop: it doesn't seem useful to have batches of batches; one batch is enough. If you want to do more than 32 items of work per batch, just increase the workgroup size. Actually, having batches of batches isn't a good idea here anyway: each work item strides across the batch 32 items at a time, which is particularly cache-unfriendly. i.e. lid==0 will do rays 0, 32, ..., LOAD_BALANCER_BATCH_SIZE-32 before lid==1 even starts on rays 1, 33, ...
NB: barriers on the CPU are kind of touchy: every SIMT `thread' in the workgroup needs to execute the same number of barriers, i.e. you can't early-terminate some of them. That is why the code above has all work items run the same number of loop iterations.