Hi everyone,
we are writing an OpenCL program that computes the number of solutions of the n queens problem on GPUs. The input of this chess-mathematical problem is the board size n, and the required amount of computation grows enormously with every increment of n. The kernel uses an iterative backtracking algorithm. For testing we have an RX 6650 XT (gfx1032).
We use 64 as the default workgroup size, and it works fine until a specific board size is reached. This board size limit varies between AMD GPUs: on a Radeon R9 380 it was reached at a smaller board size than on the RX 6650 XT.
The behaviour when the program fails:
I don't know whether this helps with the issue, but here is our complete kernel code:
kernel void nqfaf_amd(global int *ld_arr, global int *rd_arr, global int *col_arr, global int *start_jkl_arr, global long *result, constant uint *jkl_queens_arr) {
// GPU-internal indices
int g_id = get_global_id(0); // global thread id
short l_id = get_local_id(0); // thread id within workgroup
int jkl_queens_idx = get_group_id(0)*N;
// variables
uint L = 1 << (N-1);// queen at the left border of the board (right border is represented by 1)
// start_jkl_arr contains [6 queens free][5 queens for start][5 queens for i][5 queens for j][5 queens for k][5 queens for l]
char start = start_jkl_arr[g_id] >> 20;
if(start == 69) {// if we have a pseudo constellation we do nothing
return;
}
char k = (start_jkl_arr[g_id] >> 5) & 31; // in row k queen at left border, in row l queen at right border
char l = start_jkl_arr[g_id] & 31;
// describe the occupancy of the board
uint ld = ld_arr[g_id]; // left diagonals, 1 means occupied
uint rd = rd_arr[g_id]; // right diagonals, 1 means occupied
uint col = ~(L-2) ^ col_arr[g_id]; // columns, 1 means occupied
// for memorizing board-leaving diagonals
uint ld_mem = 0;
uint rd_mem = 0;
ld &= ~((L >> k) << start);// remove queen k from ld
if(l != N-1) // only remove queen l from rd if there is no queen in the corner (N-1,N-1)
rd &= ~((1 << l) >> start); // otherwise we continue in row N-1 and find too many solutions
// initialize current row as start and solutions as 0
char row = start;
ulong solutions = 0;
// calculate the occupancy of the first row
uint free = ~(ld | rd | col | jkl_queens_arr[jkl_queens_idx + row]); // a bit in free is 1 if a queen can be placed at that position in the current row
uint queen = -free & free; // the queen that will be set in the current row
// each row of queens contains the queens of the board of one workitem
local uint queens[WORKGROUP_SIZE][N]; // for remembering the queens for all rows for all boards in the work-group
queens[l_id][start] = queen; // we already calculated the first queen in the start row
// going forward (setting a queen) or backward (removing a queen)?
bool direction = 0;
// iterative loop representing the recursive setqueen-function
// this is the actual solver (via backtracking with Jeff Somers Bit method)
// the structure is slightly complicated since we have to take into account the queens at the border that have already been placed
while(row >= start) { // while we haven't tried everything
direction = (free > 0); // forwards or backwards?
if(direction) { // if there are free slots in the current row
queen = -free & free; // this is the next free slot for a queen (searching from the right border) in the current row
queens[l_id][row] = queen; // remember the queen
row++; // increase row counter
ld_mem = ld_mem << 1 | ld >> 31; // place the queen in the diagonals and shift them and remember the diagonals leaving the board
rd_mem = rd_mem >> 1 | rd << 31;
ld = (ld | queen) << 1;
rd = (rd | queen) >> 1;
}
else { // if the row is completely occupied
row--;// decrease row counter
queen = queens[l_id][row]; // recover the queen in order to remove it
ld = ((ld >> 1) | (ld_mem << 31)) & ~queen; // shift diagonals one back, remove the queen and insert the diagonals that had left the board
rd = ((rd << 1) | (rd_mem >> 31)) & ~queen;
ld_mem >>= 1;
rd_mem <<= 1;
}
barrier(CLK_LOCAL_MEM_FENCE);
free = ~(jkl_queens_arr[jkl_queens_idx + row] | ld | rd | col); // calculate the occupancy of the next row
free &= ~(queen + direction-1); // occupy all bits right from the last queen in order to not place the same queen again
col ^= queen; // free up the column AFTER calculating free in order to not place the same queen again
solutions += (row == N-1); // increase the solutions, if we are in the last row
}
result[g_id] = solutions; // number of solutions of the work item
barrier(CLK_GLOBAL_MEM_FENCE);
}
What could be reasons for this behaviour?
What can we try and change in our algorithm / kernel to make it work, even with a smaller workgroup size? Our program has the best performance when the workgroup size is 64.
It works on Nvidia GPUs and we want it to run on AMD GPUs too.
Have a great day and thanks in advance for your help.
Ole
Hi @ollec ,
Thanks for reporting it. I have whitelisted you and moved the post to the OpenCL forum.
Thanks.
Hi @ollec ,
To reproduce the issue at our end, it would be helpful if you please provide the following information:
1) the host-side code for the above kernel, 2) setup details like OS, driver version etc. 3) clinfo output
Thanks.
Hi @dipak ,
Thanks for your reply.
Information on how to reproduce the issue:
The application is written in Java. You can download a zip file containing the Java runtime and an executable for your operating system from here: for windows / for mac / for linux. It's a command-line application. Use it like this:
[executable] -g -c <path_to_config_file> -N <board_size>
Copy&paste the following content into a file and pass the file path as the parameter to "-c":
{
"deviceConfigs": [
{"index":0, "workgroupSize":64, "weight":1, "maxGlobalWorkSize":1000000000}
]
}
index: the index of the GPU you want to use, run "[executable] -d" to see available GPUs and their indexes
workgroupSize: the workgroup size, you can play with that
weight: not important, just leave it like that
maxGlobalWorkSize: the maximum global work size for enqueueing the kernel; used for partitioning the workload into multiple smaller workloads. Currently a really high number, so the workload is not partitioned. You can play with that too; a rough sketch of the partitioning follows below.
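Roughly, the partitioning that maxGlobalWorkSize controls works like this (a simplified sketch, not our real code):
// Rough illustration of what maxGlobalWorkSize does: the padded task list is split into
// chunks of at most maxGlobalWorkSize work items and each chunk is enqueued separately.
// Assumes maxGlobalWorkSize >= workgroupSize and totalWorkItems already padded to a multiple of workgroupSize.
static void enqueueInChunks(int totalWorkItems, int maxGlobalWorkSize, int workgroupSize) {
    int offset = 0;
    while (offset < totalWorkItems) {
        int chunk = Math.min(maxGlobalWorkSize, totalWorkItems - offset);
        chunk -= chunk % workgroupSize; // keep every enqueue a multiple of the workgroup size
        // ... enqueue the kernel for the work items [offset, offset + chunk) ...
        offset += chunk;
    }
}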
For "-N" (the board size), useful values for testing are probably values between 17 and 23. Here is some information on how the RX 6650 XT works with this program, to help you reproduce the issue:
1) host-side code
Information on our algorithm:
We generate a list containing all tasks for the GPU (one task per work item). Some data in these tasks must be exactly the same for a whole workgroup, so we group tasks that share the same value for this property so that they are contiguous in the task list. To ensure that the size of each group is always a multiple of the workgroup size, we pad the groups with "pseudo tasks". When we enqueue the kernel, every workgroup therefore contains exclusively tasks of the same group. In the kernel, a work item first checks whether it got a pseudo task and, if so, does nothing and returns (see the start == 69 check near the top of the kernel code above). (In the code, pseudo tasks are called pseudo constellations.)
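To make the grouping and padding more concrete, here is a rough sketch (not our real code; the names Task, groupKey and pseudo are made up for illustration):
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: a task carries the precomputed board state for one work item.
record Task(int ld, int rd, int col, int startijkl) {
    static Task pseudo() { return new Task(0, 0, 0, 69 << 20); } // start = 69 marks a pseudo task
    int groupKey() { return startijkl & 0x7FFF; }                // the property that must be identical within a workgroup (here: the j, k, l bits)
}

class TaskPadding {
    // Pad every group of tasks (same groupKey) to a multiple of the workgroup size,
    // so that each enqueued workgroup contains only tasks of one group.
    static List<Task> padToWorkgroupSize(List<Task> tasks, int workgroupSize) {
        Map<Integer, List<Task>> groups = new LinkedHashMap<>();
        for (Task t : tasks)
            groups.computeIfAbsent(t.groupKey(), k -> new ArrayList<>()).add(t);
        List<Task> padded = new ArrayList<>();
        for (List<Task> group : groups.values()) {
            padded.addAll(group);
            while (padded.size() % workgroupSize != 0)
                padded.add(Task.pseudo()); // the kernel sees start == 69 and returns immediately
        }
        return padded;
    }
}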
In the snippet below I included everything I thought is important, but if you need more information I will be glad to provide it. Click here if you want to have a look at the whole Java host-side code yourself.
We use lwjgl to access OpenCL through Java.
Simplified host-side code: (without workload partitioning and without multi device support)
long context = clCreateContext(ctxPlatform, ctxDevices, null, NULL, errBuf);
checkCLError(errBuf);
long program = clCreateProgramWithSource(context, getKernelSourceAsString("kernels.c"), errBuf);
String options = "-D N=" + N + " -D WORKGROUP_SIZE=" + device.config.workgroupSize;
int error = clBuildProgram(program, device.id, options, null, 0);
checkCLError(error);
kernel = clCreateKernel(program, "nqfaf_amd", errBuf);
checkCLError(errBuf);
long xqueue = clCreateCommandQueue(context, device.id, CL_QUEUE_PROFILING_ENABLE, errBuf); // queue for executing the kernel
checkCLError(errBuf);
long memqueue = clCreateCommandQueue(context, device.id, CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, errBuf); // queue for memory operations
checkCLError(errBuf);
// create buffers
long ldMem = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR, globalWorkSize * 4, errBuf);
checkCLError(errBuf);
long rdMem = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR, globalWorkSize * 4, errBuf);
checkCLError(errBuf);
long colMem = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR, globalWorkSize * 4, errBuf);
checkCLError(errBuf);
long startijklMem = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR,
globalWorkSize * 4, errBuf);
checkCLError(errBuf);
long resMem = clCreateBuffer(context, CL_MEM_WRITE_ONLY | CL_MEM_ALLOC_HOST_PTR, globalWorkSize * 8, errBuf);
checkCLError(errBuf);
// fill buffers
// ld
ByteBuffer ldPtr = clEnqueueMapBuffer(memqueue, ldMem, true, CL_MAP_WRITE, 0, globalWorkSize * 4, null, null, errBuf, null);
checkCLError(errBuf);
for (int i = 0; i < globalWorkSize; i++) {
ldPtr.putInt(i * 4, workloadConstellations.get(i).getLd());
}
error = clEnqueueUnmapMemObject(memqueue, ldMem, ldPtr, null, null);
checkCLError(error);
// rd
ByteBuffer rdPtr = clEnqueueMapBuffer(memqueue, rdMem, true, CL_MAP_WRITE, 0, globalWorkSize * 4, null, null, errBuf, null);
checkCLError(errBuf);
for (int i = 0; i < globalWorkSize; i++) {
rdPtr.putInt(i * 4, workloadConstellations.get(i).getRd());
}
error = clEnqueueUnmapMemObject(memqueue, rdMem, rdPtr, null, null);
checkCLError(error);
// col
ByteBuffer colPtr = clEnqueueMapBuffer(memqueue, colMem, true, CL_MAP_WRITE, 0, globalWorkSize * 4, null, null, errBuf, null);
checkCLError(errBuf);
for (int i = 0; i < globalWorkSize; i++) {
colPtr.putInt(i * 4, workloadConstellations.get(i).getCol());
}
error = clEnqueueUnmapMemObject(memqueue, colMem, colPtr, null, null);
checkCLError(error);
// startijkl
ByteBuffer startijklPtr = clEnqueueMapBuffer(memqueue, startijklMem, true, CL_MAP_WRITE, 0, globalWorkSize * 4, null, null, errBuf, null);
checkCLError(errBuf);
for (int i = 0; i < globalWorkSize; i++) {
startijklPtr.putInt(i * 4, workloadConstellations.get(i).getStartijkl());
}
error = clEnqueueUnmapMemObject(memqueue, startijklMem, startijklPtr, null, null);
checkCLError(error);
// result memory
ByteBuffer resPtr = clEnqueueMapBuffer(memqueue, resMem, true, CL_MAP_WRITE, 0, globalWorkSize * 8, null, null, errBuf, null);
checkCLError(errBuf);
for (int i = 0; i < globalWorkSize; i++) {
resPtr.putLong(i * 8, workloadConstellations.get(i).getSolutions());
}
error = clEnqueueUnmapMemObject(memqueue, resMem, resPtr, null, null);
checkCLError(error);
error = clFlush(memqueue);
checkCLError(error);
error = clFinish(memqueue);
checkCLError(error);
// set kernel arguments
// ld
LongBuffer ldArg = stack.mallocLong(1);
ldArg.put(0, ldMem);
error = clSetKernelArg(kernel, 0, ldArg);
checkCLError(error);
// rd
LongBuffer rdArg = stack.mallocLong(1);
rdArg.put(0, rdMem);
error = clSetKernelArg(kernel, 1, rdArg);
checkCLError(error);
// col
LongBuffer colArg = stack.mallocLong(1);
colArg.put(0, colMem);
error = clSetKernelArg(kernel, 2, colArg);
checkCLError(error);
// startijkl
LongBuffer startijklArg = stack.mallocLong(1);
startijklArg.put(0, startijklMem);
error = clSetKernelArg(kernel, 3, startijklArg);
checkCLError(error);
// res
LongBuffer resArg = stack.mallocLong(1);
resArg.put(0, resMem);
error = clSetKernelArg(kernel, 4, resArg);
checkCLError(error);
// enqueue kernel
int localWorkSize = config.workgroupSize;
final PointerBuffer xEventBuf = BufferUtils.createPointerBuffer(1);
error = clEnqueueNDRangeKernel(xqueue, kernel, dimensions, null, globalWorkSize, localWorkSize, null, xEventBuf);
checkCLError(error);
// accurate time profiling using clEvent
long xEvent = xEventBuf.get(0);
error = clSetEventCallback(xEvent, CL_COMPLETE, CLEventCallback.create((event, event_command_exec_status, user_data) -> {
LongBuffer startBuf = BufferUtils.createLongBuffer(1), endBuf = BufferUtils.createLongBuffer(1);
int err = clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_START, startBuf, null);
checkCLError(err);
err = clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_END, endBuf, null);
checkCLError(err);
device.duration = (endBuf.get(0) - startBuf.get(0)) / 1000000; // convert nanoseconds to milliseconds
}), NULL);
checkCLError(error);
error = clFlush(xqueue);
checkCLError(error);
error = clFinish(xqueue);
checkCLError(error);
error = clWaitForEvents(xEvent);
checkCLError(error);
// read results
long solutions = 0;
error = clEnqueueReadBuffer(memqueue, resMem, true, 0, resPtr, null, null);
checkCLError(error);
for (int i = 0; i < globalWorkSize; i++) {
if (workloadConstellations.get(i).getStartijkl() >> 20 == 69) // start=69 means it's a pseudo constellation
continue;
long solutionsForConstellation = resPtr.getLong(i * 8) * utils.symmetry(workloadConstellations.get(i).getStartijkl());
if (solutionsForConstellation >= 0)
solutions += solutionsForConstellation;
}
// release cl objects
checkCLError(clReleaseMemObject(ldMem));
checkCLError(clReleaseMemObject(rdMem));
checkCLError(clReleaseMemObject(colMem));
checkCLError(clReleaseMemObject(startijklMem));
checkCLError(clReleaseMemObject(resMem));
checkCLError(clReleaseEvent(xEvent));
checkCLError(clReleaseCommandQueue(xqueue));
checkCLError(clReleaseCommandQueue(memqueue));
checkCLError(clReleaseKernel(kernel));
2) setup details:
OS: Windows 10 Pro x64, version 22H2, build 19045.3086
RX 6650 XT driver version: 31.0.12044.3 (date: 25.10.2022), only driver installed
3) clinfo output:
Platform Name: AMD Accelerated Parallel Processing
Number of devices: 1
Device Type: CL_DEVICE_TYPE_GPU
Vendor ID: 1002h
Board name: AMD Radeon RX 6650 XT
Device Topology: PCI[ B#5, D#0, F#0 ]
Max compute units: 16
Max work items dimensions: 3
Max work items[0]: 1024
Max work items[1]: 1024
Max work items[2]: 1024
Max work group size: 256
Preferred vector width char: 4
Preferred vector width short: 2
Preferred vector width int: 1
Preferred vector width long: 1
Preferred vector width float: 1
Preferred vector width double: 1
Native vector width char: 4
Native vector width short: 2
Native vector width int: 1
Native vector width long: 1
Native vector width float: 1
Native vector width double: 1
Max clock frequency: 2447Mhz
Address bits: 64
Max memory allocation: 7059013632
Image support: Yes
Max number of images read arguments: 128
Max number of images write arguments: 64
Max image 2D width: 16384
Max image 2D height: 16384
Max image 3D width: 2048
Max image 3D height: 2048
Max image 3D depth: 2048
Max samplers within kernel: 16
Max size of kernel argument: 1024
Alignment (bits) of base address: 2048
Minimum alignment (bytes) for any datatype: 128
Single precision floating point capability
Denorms: Yes
Quiet NaNs: Yes
Round to nearest even: Yes
Round to zero: Yes
Round to +ve and infinity: Yes
IEEE754-2008 fused multiply-add: Yes
Cache type: Read/Write
Cache line size: 64
Cache size: 16384
Global memory size: 8573157376
Constant buffer size: 7059013632
Max number of constant args: 8
Local memory type: Scratchpad
Local memory size: 65536
Max pipe arguments: 16
Max pipe active reservations: 16
Max pipe packet size: 2764046336
Max global variable size: 6353112064
Max global variable preferred total size: 8573157376
Max read/write image args: 64
Max on device events: 1024
Queue on device max size: 8388608
Max on device queues: 1
Queue on device preferred size: 262144
SVM capabilities:
Coarse grain buffer: Yes
Fine grain buffer: Yes
Fine grain system: No
Atomics: No
Preferred platform atomic alignment: 0
Preferred global atomic alignment: 0
Preferred local atomic alignment: 0
Kernel Preferred work group size multiple: 32
Error correction support: 0
Unified memory for Host and Device: 0
Profiling timer resolution: 1
Device endianess: Little
Available: Yes
Compiler available: Yes
Execution capabilities:
Execute OpenCL kernels: Yes
Execute native function: No
Queue on Host properties:
Out-of-Order: No
Profiling : Yes
Queue on Device properties:
Out-of-Order: Yes
Profiling : Yes
Platform ID: 00007FF895428000
Name: gfx1032
Vendor: Advanced Micro Devices, Inc.
Device OpenCL C version: OpenCL C 2.0
Driver version: 3444.0 (PAL,LC)
Profile: FULL_PROFILE
Version: OpenCL 2.0 AMD-APP (3444.0)
Extensions: cl_khr_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_fp16 cl_khr_gl_sharing cl_amd_device_attribute_query cl_amd_media_ops cl_amd_media_ops2 cl_khr_d3d10_sharing cl_khr_d3d11_sharing cl_khr_dx9_media_sharing cl_khr_image2d_from_buffer cl_khr_subgroups cl_khr_gl_event cl_khr_depth_images cl_khr_mipmap_image cl_khr_mipmap_image_writes cl_amd_liquid_flash cl_amd_copy_buffer_p2p cl_amd_planar_yuv
If you need more information I will be glad to provide it.
Thanks!
Edit: corrected a mistake in the paragraph on how to reproduce: "[...] and it happens more often the lower the workgroup size is [...]"
Hi @ollec ,
Thanks for the above information.
2) setup details:
OS: Windows 10 Pro x64, version 22H2, build 19045.3086
RX 6650 XT driver version: 31.0.12044.3 (date: 25.10.2022), only driver installed
From the above driver information, it looks like you are using an old driver. Could you please try the latest Adrenalin driver available here to check if the issue is still reproducible?
Thanks.
Thanks @dipak .
Now using driver version 31.0.14057.5006 (date: 23.05.2023) and the issue still exists.
I installed the drivers now with "minimal install" and each time the program fails, the Adrenalin Software detects a driver timeout.
Workgroup sizes <32 also work for board sizes <= 20, but not reliably: sometimes the result is off by only a little bit, and this happens more often the lower the board size is (probably the same issue: it just quits too early).
This now happens more often than before, sometimes even with workgroup size 64 and board size 20, which worked fine before the driver update. But in this case the AMD Software doesn't detect a driver timeout. Maybe that's not related to the main issue here?
Thanks for trying the latest driver and sharing the observation.
I installed the drivers now with "minimal install" and each time the program fails, the Adrenalin Software detects a driver timeout.
Yes, driver timeout can be observed for a long running kernel due to the TDR timeout limit as set by Windows.
For TDR related information: https://learn.microsoft.com/en-us/windows-hardware/drivers/display/timeout-detection-and-recovery
Thanks.
Thanks for the suggestion.
Unfortunately, this was not the solution to this issue.
On the RX 6650 XT, the driver timeout for board sizes <=21 with workgroup sizes <64 often occurs after roughly the same time: 2 minutes and 10 seconds. On the other hand, with a bigger board size like 24 and workgroup size 64, the driver timeout only happens after several more minutes. So the driver timeout does not always occur after the same amount of time.
Increasing the TdrDelay did not affect this.
I'm sorry I can't explain it more clearly; I don't know the exact relationship between workgroup size, global work size and how workgroups are scheduled for execution. I suspect it has something to do with that, because changing the workgroup size or the global work size affects the behaviour.
We already wondered whether our kernel at some point uses too much of some resource like private or local memory, but this is probably not the case. And even if it did, the program would exit immediately with a CL_OUT_OF_RESOURCES error, wouldn't it?
Anyways, here are some numbers regarding memory usage for workgroup size 64 and board size 24:
CL_DEVICE_LOCAL_MEM_SIZE is 65536 and we read that this space is used for both the private and the local variables. So, theoretically, one compute unit could execute 6 workgroups in parallel without problems. Is that correct?
Thanks in advance.
EDIT: The memory usage was calculated for an updated version of the kernel that uses an array in local memory instead of the last kernel argument, which was in constant memory. Apart from that, the kernel didn't change. The updated kernel can be found here.
CL_DEVICE_LOCAL_MEM_SIZE is 65536 and we read that this space is used for both the private and the local variables. So, theoretically, one compute unit could execute 6 workgroups in parallel without problems. Is that correct?
On AMD platforms, the private variables are usually allocated to the vector registers (vGPRs), whereas the Local Data Share (LDS) is used to allocate the local variables. Both of them can limit the number of active waves or workgroups (WG) per compute unit (CU).
For example, RDNA-based GPUs have a total of 128kB of LDS memory per work-group processor (WGP). In WGP mode (the default mode), a single WG may allocate up to 64kB of LDS [as shown in the clinfo output, CL_DEVICE_LOCAL_MEM_SIZE is 65536]. So, if a kernel uses 16kB of local memory per WG, it can run with 8 active WGs on each WGP.
Similarly, on RDNA1, each SIMD contains a total of 1,024 vGPRs. So, a kernel that uses 256 vGPRs can run with 4 active waves on each SIMD or 16 active waves on each WGP (4 SIMDs per WGP).
For more information on RDNA, please refer: https://www.amd.com/system/files/documents/rdna-whitepaper.pdf
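If it helps, you can also query the per-kernel resource usage that the compiler reports via clGetKernelWorkGroupInfo. A minimal, untested sketch with LWJGL (reusing the kernel and device.id handles from your snippet) could look like this:
// Query how much local memory and private memory the compiled kernel uses,
// and the maximum workgroup size the runtime allows for it.
import static org.lwjgl.opencl.CL10.*;
import static org.lwjgl.opencl.CL11.CL_KERNEL_PRIVATE_MEM_SIZE;
import org.lwjgl.BufferUtils;
import java.nio.ByteBuffer;

ByteBuffer val = BufferUtils.createByteBuffer(8); // large enough for a size_t / cl_ulong on a 64-bit host

checkCLError(clGetKernelWorkGroupInfo(kernel, device.id, CL_KERNEL_WORK_GROUP_SIZE, val, null));
System.out.println("max workgroup size for this kernel: " + val.getLong(0));

checkCLError(clGetKernelWorkGroupInfo(kernel, device.id, CL_KERNEL_LOCAL_MEM_SIZE, val, null));
System.out.println("local memory per workgroup (bytes): " + val.getLong(0));

checkCLError(clGetKernelWorkGroupInfo(kernel, device.id, CL_KERNEL_PRIVATE_MEM_SIZE, val, null));
System.out.println("private memory per work item (bytes): " + val.getLong(0));
Note that CL_KERNEL_PRIVATE_MEM_SIZE typically reflects spilled/scratch private memory rather than the vGPR usage itself.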
Thanks.
Thank you very much for the in-depth knowledge and the linked document! Will definitely read it.
For now, do you have any other idea on why this behaviour could happen?
I have filed an internal bug ticket for this issue. I will let you know once I get any update on this.
Thanks.
Hi @ollec ,
Could you please provide the log files and temporary files?
To get the logs: set the environment variables AMD_LOG_LEVEL=7 and AMD_LOG_LEVEL_FILE="mylog", run the executable, then share "mylog".
To get the temporary files: set the environment variable AMD_OCL_BUILD_OPTIONS_APPEND="-save-temps-all", run the executable, then provide all generated temp files (their names start with _temp_).
Thanks.
Thanks for providing the log and temporary files.
Could you please share the driver version you used to generate the files and the clinfo output?
Thanks.
I updated my driver after generating the log and the temporary files, so unfortunately, I don't remember the driver version.
I generated the files again with the latest drivers (version: 23.11.1).
Just an extra information regarding the behaviour:
When the crash happens, clGetEventProfilingInfo incorrectly returns a higher value for CL_PROFILING_COMMAND_START than for CL_PROFILING_COMMAND_END, so the calculated kernel duration would be negative.
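For reference, this is roughly where it shows up in the profiling callback from the host code above (simplified):
// Simplified from the event callback in the host code above:
// when the crash happens, startNs > endNs, so durationMs would be negative.
long startNs = startBuf.get(0); // CL_PROFILING_COMMAND_START
long endNs = endBuf.get(0);     // CL_PROFILING_COMMAND_END
long durationMs = (endNs - startNs) / 1_000_000;
if (durationMs < 0) {
    System.err.println("profiling anomaly: COMMAND_START > COMMAND_END (observed when the crash happens)");
}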
Thanks for providing the files. I will inform the relevant team and share the details with them.
Thanks.