
OpenCL

mdclemen
Adept I

Different OpenCL kernel result with CPU and GPU

I have been trying to figure out why I get different results when running an OpenCL kernel on my CPU versus my GPU. The basic idea is that I launch the kernel (teamwork) on the device and then interact with it from the host, periodically writing to device memory so that the kernel advances as the memory is updated.
Here is the kernel code:

__kernel void teamwork(__global volatile unsigned int* hb,
					   __global volatile unsigned int* tl_buf,
					   const int nsamples) {
	volatile unsigned int tl0 = 0xdeadbeef;	/* always run one dummy loop */
	volatile unsigned int tl1;
	unsigned int lhb = hb[0];
	const unsigned int tid = get_global_id(0);
	for (int i = 0; i < nsamples; i++) {
		while((tl1 = hb[0]) == tl0) {
			if (tl1 == 0xff)
				break;
		}
		if (hb[0] != 0xff)
			lhb = hb[0];
		tl_buf[i] = lhb;
		tl0 = tl1;
	}
}

I am using a Python script with pyOpenCL to launch the kernel and communicate with it while it is running:

# teamwork.py
import pyopencl as cl, numpy as np, os, time
print("PID: %d" % os.getpid())

ctx = cl.create_some_context()
cq0 = cl.CommandQueue(ctx)
cq1 = cl.CommandQueue(ctx)

# compile kernel
krnl_src=open("teamwork.cl", "r")
prg = cl.Program(ctx, krnl_src.read()).build("-I./")
krnl_src.close()

hb = np.array([0xdeadbeef], dtype=np.uint32)
nsamples=np.int32(os.getenv("NSAMPLES"))
dt=np.uint32(os.getenv("DT"))
print("DT = %dus | nsamples = %d" % (dt, nsamples))

tl_buf=np.zeros([nsamples], dtype=np.uint32)

d_hb = cl.Buffer(ctx, cl.mem_flags.READ_WRITE, hb.nbytes)
d_tl_buf = cl.Buffer(ctx, cl.mem_flags.WRITE_ONLY, tl_buf.nbytes)

cl.enqueue_copy(cq0, d_hb, hb)
print("d_hb={} (before kernel launch)".format(hb))

# run kernel
event = prg.teamwork(cq0, (1,), None, d_hb, d_tl_buf, nsamples)
time.sleep(2)

# tlatch sim
for i in range(nsamples):
    t_actul = np.uint32((i+1)*dt)
    # copy to GPU memory
    event = cl.enqueue_copy(cq1, d_hb, t_actul)
    event.wait()
t_actul = np.uint32(0xff) # write 0xff so the kernel doesn't hang
event = cl.enqueue_copy(cq1, d_hb, t_actul)
event.wait()

print("tlatch_sim() complete...")
event = cl.enqueue_copy(cq0, hb, d_hb)
event.wait()
event = cl.enqueue_copy(cq0, tl_buf, d_tl_buf)
event.wait()

print("d_hb={} (after kernel launch)".format(hb))
print("d_tl_buf={}".format(tl_buf))

Some further detail on how this works:

1) On the host I set up two command queues, cq0 and cq1.
2) cq0 initializes device memory and launches the kernel, which takes 3 arguments: d_hb, d_tl_buf and nsamples.
3) cq1 is used in a for loop of nsamples iterations to periodically write new values to d_hb[0], which are then saved in the device array d_tl_buf.

The expected behavior is that after launch the kernel enters its own for loop of nsamples iterations; in each iteration it sits in a while loop waiting for a new value to be written to d_hb[0]. Once the host script writes a new value to d_hb[0], the kernel breaks out of the while loop, records the new value in d_tl_buf, completes that for loop iteration, then re-enters the while loop and waits for new data. Using my CPU, I get the expected result:

$ PYOPENCL_COMPILER_OUTPUT=0 NSAMPLES=10 DT=10 python3 teamwork.py
PID: 6997
/home/mitchell/.local/lib/python3.11/site-packages/pyopencl/cache.py:495: CompilerWarning: Non-empty compiler output encountered. Set the environment variable PYOPENCL_COMPILER_OUTPUT=1 to see more.
  _create_built_program_from_source_cached(
DT = 10us | nsamples = 10
d_hb=[3735928559] (before kernel launch)
tlatch_sim() complete...
d_hb=[255] (after kernel launch)
d_tl_buf=[ 10  20  30  40  50  60  70  80  90 100]
$ clinfo | grep DEVICE_TYPE
  Device Type:					 CL_DEVICE_TYPE_CPU
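
(For what it's worth, those d_tl_buf values are just (i+1)*DT for each loop iteration; a quick host-side check, separate from teamwork.py:)

import numpy as np
nsamples, dt = 10, 10                                     # same values as NSAMPLES and DT above
print(np.arange(1, nsamples + 1, dtype=np.uint32) * dt)   # [ 10  20  30  40  50  60  70  80  90 100]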

 But when I run this on my AMD Firepro W5100 (amdgpu-pro driver 21.10-1263777), I get this unexpected result:

$ PYOPENCL_COMPILER_OUTPUT=0 NSAMPLES=10 DT=10 python3 teamwork.py
PID: 2432
DT = 10us | nsamples = 10
d_hb=[3735928559] (before kernel launch)
tlatch_sim() complete...
d_hb=[255] (after kernel launch)
d_tl_buf=[255 255 255 255 255 255 255 255 255 255]
$ clinfo | grep DEVICE_TYPE
  Device Type:					 CL_DEVICE_TYPE_GPU

I have been racking my brain trying to figure out why this kernel doesn't run as expected on the GPU. I have gotten kernels like this to work on NVIDIA GPUs with CUDA/pyCUDA, using two streams in the same way I use the two command queues here. The fact that d_tl_buf is all 0xff (255) on the GPU implies that the entire host for loop runs before the kernel even launches; the time.sleep() is there to give the kernel time to launch on the device, but it seems to have no effect. It is also worth noting that if I remove the 'if (tl1 == 0xff) break' clause from the kernel code, the kernel hangs on the GPU.
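
(For reference, which device create_some_context() ends up picking can also be printed from inside the script; a small standalone snippet, not part of teamwork.py:)

import pyopencl as cl

ctx = cl.create_some_context()
for dev in ctx.devices:
    # same information as the clinfo output above
    print(dev.name, cl.device_type.to_string(dev.type))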

Does anyone know why I'm getting different results?

2 Replies
mdclemen
Adept I

Apparently what I am trying to do here is not defined by the OpenCL 1.2 standard: writing to a buffer while a kernel that uses it is still running is "undefined". From the OpenCL 1.2 spec:

"Concurrent reading from, writing to and copying between both a buffer object and its sub-buffer
object(s) is undefined."
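
One way to stay within OpenCL 1.2 seems to be ordering the host writes against the kernels instead of racing with them. A minimal sketch (hypothetical rework, not the teamwork.py above): launch one tiny kernel per sample on a single in-order queue, so d_hb is only ever written between kernel launches. It gives up the persistent-kernel behaviour, but the reads and writes are defined:

import numpy as np
import pyopencl as cl

ctx = cl.create_some_context()
cq = cl.CommandQueue(ctx)                 # single in-order queue

src = """
__kernel void sample_one(__global const unsigned int* hb,
                         __global unsigned int* tl_buf,
                         const int i) {
    tl_buf[i] = hb[0];   /* latch the current value into slot i */
}
"""
prg = cl.Program(ctx, src).build()

nsamples, dt = 10, 10
hb = np.zeros(1, dtype=np.uint32)
tl_buf = np.zeros(nsamples, dtype=np.uint32)
d_hb = cl.Buffer(ctx, cl.mem_flags.READ_ONLY, hb.nbytes)
d_tl_buf = cl.Buffer(ctx, cl.mem_flags.WRITE_ONLY, tl_buf.nbytes)

for i in range(nsamples):
    hb[0] = (i + 1) * dt
    cl.enqueue_copy(cq, d_hb, hb)         # blocking by default; ordered before the kernel below
    prg.sample_one(cq, (1,), None, d_hb, d_tl_buf, np.int32(i))

cl.enqueue_copy(cq, tl_buf, d_tl_buf).wait()
print(tl_buf)                             # expected: [ 10  20  30  40  50  60  70  80  90 100]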

 

mdclemen
Adept I

If someone with an OpenCL 2.0+ capable GPU would be so kind as to run this code and tell me the results, that would be greatly appreciated.
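
(In case it helps when testing: create_some_context() can be pointed at a specific platform/device non-interactively via the PYOPENCL_CTX environment variable; the index below is just an example, adjust it for your machine.)

$ PYOPENCL_CTX=0 NSAMPLES=10 DT=10 python3 teamwork.py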
