I dont understand what your problem with the above print is....
This is just group of "printf"s and they can appear in random order - you never know.
You should only look whether the global and local IDs match well with the "group id" that is printed.
Within each workgroup there are 8 threads (0,0,0) to (0,0,7).
And group id as per the print above - first starts from 0,0,0 and then moves to 1,0,0. The global ID has also changed appropriately. THe local ID again increments from (0,0,0) to (0,0,7)
I dont see anything wrong in this.
If there is a problem in things - as basic as this -- this forum will be flooded with errors.
You may want to read the specification on global and local IDs.
Also, first test your understanding on 1D and 2D workgroups...
Later, you can look into 3D workgroups.
That might help you...
Many thanks for your answer and rapid response! Please afford me a few more minutes to confirm the following.
Each "get" local, global ,and group method takes a singe argument pertaining to a "dimension" (1st, 2nd, or 3rd). Reiterating the aforementioned example, there are 8 threads (0,0,0) to (0,0,7). More specifically, the 8 threads are (1,1,8) as reported by get_local_size(0|1|2), respectively. Accordingly, get_local_size(2) equals 8, thereby, indicating that the 3rd-dimension furnishes 8 threads. OK, so here's a restatement of the problem: When the 3rd-dimension of the Local ID recycles from (x,x,0) to (x,x,7) why would the "1st-dimension" of the get_group_id(...) increment from (0,x,x) to (*1*,x,x)? Especially when the get_num_groups(0|1|2) reports (15,16,*2*), derived from the clEnqueueNDRangeKernel arguments 'global_work_size' (15,16,16) and 'local_work_size' (1,1,8) and indicating:
> 15 groups for the 1st-dimension (of 1 local thread),
> 16 groups for the 2nd-dimension (of 1 local thread), and
> 2 groups for the 3rd-dimension (16 global threads divided amongst 8 local threads)
It just doesn't make sense that the 1st-dimension of get_group_id(0) would increment if the 3rd-dimension of get_local_id(2) indicates a recycle from [0..7]. Further, get_num_groups(2) for the 3rd-dimension indicates a size of "2" for a global_work_size of (15,16,16) and a local_work_size of (1,1,8). Yet, the 1st-dimension of get_group_id(0) increments? In other words, I would have expected get_group_id(2) for the 3rd-dimension to increment (not get_group_id(0) for the 1st-dimension).
The specific description of the relationship between global, local, and group IDs is on page #24 of the OpenCL 1.2 specification. Based on the equation at the bottom of page #24:
gx = wx * Sx +sx, and restating in OpenCL API speak:
get_global_id(0|1|2) = ( get_group_id(0|1|2) * get_local_size(0|1|2) ) + get_local_id(0|1|2)
Where the get_global_id(0|1|2) uniquely identifies a work-item from the *total number of global work-items* (as defined in the 'global_work_size' argument to clEnqueueNDRangeKernel) and get_local_id(0|1|2) specifies a *unique work-item within a corresponding work-group*. This means that the dimension arguments must correspond and the global_id should NOT recycle with the local_id.
Finally, the printf's appear to correctly report the sequences. This is substantiated by the fact that, with a local_work_size of [1,1,1], the sequential reporting across all three dimensions is perfect. Further, if the printf's were encountering difficulties, I would have expected to see a more indicative manifestation of the problem (exacerbated by Linux not-so-realtime round-robin pre-emptive scheduling), yet the printf reporting is perfectly cyclical through the long sequences of my tasks. Further, the actual task I'm trying to perform operates across long sequences of digital samples (4K x 4K x 4K) and the indication of the problem germinated from an observation that sequences were *aliasing* for a local_work_size of greater than '1' in any dimension. A local_work_size of [1,1,1] is perfect, but when the local_work_size is higher in any dimension, I noticed that the get_global_id would prematurely recycle thereby reducing the *perceived sample-rate* to causing the *aliasing*.
Again, many thanks for your time and expertise !!!!
1 of 1 people found this helpful
So, I had written a rather long reply explaining the things in detail, but the forum platform ate it, so I'll try it again, but shorter this time.
With a global work size of 15x16x16 and local work size of 1x1x8 you are launching 15x16x2 workgroups, each with 8 workitems.
Since your launch grid is three-dimensional, ids (global, group, local) are triplets, the component of which are retrieving by using the 0, 1, 2 arguments to the get_*_id() functions.
Group ids will range from (0, 0, 0) to (14, 15, 1) inclusive; within each group, local ids will range from (0, 0, 0) to (0, 0, 7) inclusive. The global id is computed from the group id, local size and local id as you mention (plus the offset, if you have any). Note that the _entire_ global id doesn't recycle (i.e. the triplet), but multiple workitems across different workgroups <i>can</i> and <i>will</i> have equal <i>components</i>. For example, workitems in workgroup (0, 0, 0) will have global ids (0, 0, 0) to (0, 0, 7), and workitems in workgroup (1, 0, 0) will have global ids (1, 0, 0) to (1, 0, 7). The global workitems are different, but you will have workitems in the two groups sharing a common component 1 or 2; e.g. get_global_id(2) will return 3 both for global work item (0, 0, 3) and for global work item (1, 0, 3) (and for global work item (13, 10, 3) and in fact for any work item whose global id is in the form (g0, g1, 3), i.e. for any workitem in a workgroup for which the 3rd component of the group id is 0).
Note also that you should not make <i>any</i> assumption on the order in which workgroups are issued on the GPU. You <i>cannot</i> expect workgroup (0, 0, 1) to be issued <i>before</i> workgroup (1, 0, 0) (and in fact, most if not all GPUs will dispatch (1, 0, 0) before (0, 0, 1), which is why you see the printf in the order in which you see them). A <i>correct</i> OpenCL kernel should assume that workgroups will be issued in an arbitrary order, potentially even all at the same time.
The sequence in which you see the output of the printf is not an indication of a seriality between workitems and workgroups. In fact, GPUs will usually run all workitems in a workgroup “together”, and multiple workgroups concurrently (depending on the number of compute units on the device and on the number of the workgroups a compute unit can handle at the same time).
All the workgroups will work parallelly. So there is no gaurantee that which thread from which group executes the kernel. As you mentioned the global work size is (1516,16) there are 15 groups in the 1st dimention. So there is no surprise in incrementing the 1st dimension value from 0 to 1 when local id varies from (x,x,0) to (x,x,7). Why do you except it to print sequentially.
Many thanks to *gbilotta* and *Bruhaspati* for affording your time and expertise (and "tenacity", gbilotta) to help me understand the relationship between the global, group, and local IDs and sizes. I understand and confirm your descriptions. Following some experimentation I've observed some rather perplexing results. In particular, if I reverse the order of the dimensions (eg. from [512, 4096, 2048] to [2048, 4096, 512]) the kernel's results are almost perfect (and are, needless to say, much better). What seems somewhat perplexing is the fact that the kernel thtat's being invoked (depicted below) uses *absolute addressing* for both source and destination containers, so the GPU's out-of-order completion should be irrelevant. Agree? In other words, given that the kernel is invoked for each component across all dimensions (4-GFLOP = 512 x 4096 x 2048) and using *absolute addressing*, the complete solution should be completely and accurately resolved despite being completed out-of-order.
const uint n = get_global_id(0);
const uint p = n + get_global_id(1);
const uint m = get_global_id(2);
__global float c = ( dest + m + (n*M) );
__global float b = ( imag2 + m + (p*M) );
__global float a = ( imag1 + m + (p*M) );
*c = ( *a * *b ) + *c;
Your comments and suggestions are greatly appreciated! Again, many thanks for your time and expertise!
1 of 1 people found this helpful
I suspect you are having troubles with the linearization of the addresses. It's extremely important, when dealing with linearized multidimensional arrays, that indices are mapped to the correct dimension, and that the linearized indices are computed consistently between device and host. Without knowing what kind of linearization you are using on the host, it's hard to tell if your code is correct or not.
If the host has C-like multidimensional arrays, then you should keep in mind that they are stored in row-major order (in 2D). So if you have an array with two rows and three columns, you will first have all the elements of row 0 (column 0, 1, 2), then those of row 1 (column 0, 1, 2). In three dimensions, the slowest-advancing index is that of a slice, so if you have a 3D array with S slices of R rows by C columns, you will first see slice 0 (row 0 (column 0, 1, 2, ... C-1), row 1 (column 0, 1, 2, ... C-1), ..., row R-1 (column 0, 1, 2, ... C-1)), then slice 1 (again, row 0 ... row R-1) and so on until slice S-1.
When computing linearized indices, therefore, the formula is the following: to access element in row r, column c in a 2D array with R rows and C columns, you access element r*C + c. In a 3D array with S slices, R rows and C columns, if you want to access element in slice s, row r, column c you will compute the linearized index as (s*R + r)*C + c. Note that the number of rows in a 2D array and the number of slices in a 3D array (i.e. the slowest-advancing indices) do not appear in the formula for the linearized index of an element.
However, if your arrays come from FORTRAN or a FORTRAN-like language (e.g. MATLAB) that stores arrays column-major, your indices will be transposed. It is extremely important that the host and the device use consistent linearized indexing: otherwise, you will see workitems accessing the wrong elements.
So it is not important that e.g. get_global_id(0) is the column, (1) is the row and (2) is the column, but rather it is important that each maps to correct dimension depending on where you derived your worksize from. (Of course some orderings will be more efficient, but that's a separate matter.)
When I read the dialogue and code snippet above I was inclined to ask. In your kernel, do you have a write(s) to global memory before the barrier(CLK_GLOBAL_MEM_FENCE) and then possibly attempt to re-use the written values within the same kernel by performing some read(s) after the barrier?
Many thanks for your inquiry and suggestion. Three (3) __kernels are repeatedly invoked as a "trio". A read-after-write does not occur within any of the three kernels. Kernel #1 receives Host-CPU data via a clEnqueueWriteBuffer. Kernel-1 delivers [intermediate] results to Kernel-2 by way of a pass-by-reference pointer to a distinct Global memory buffer. Likewise, Kernel-2 delivers its intermediate results to Kernel-3 in the same way via its own distinct Global memory buffer. Both memory buffers are referred to as "bounce buffers" and are configured as CL_MEM_READ_WRITE and CL_MEM_HOST_NO_ACCESS. Each kernel creates a cl_event object and that object governs the invocation of the subsequent kernel. Completion of the "trio" is governed via a clEnqueueBarrierWithWaitList (to force the GPU to complete all queued tasks) followed by a clWaitForEvents (to block the Host-CPU until the trio is completed).
At this point I'm wondering if there's an unaccounted delay after the completion of the aforementioned process governors while the GPU writes results to its Global memory (i.e. a delay following the issuance of CL_SUCCESS at the completion of the clEnqueueBarrierWithWaitList and clWaitForEvents). In other words, it's assumed that the aforementioned process governors' CL_SUCCESS status (and their corresponding cl_event objects CL_COMPLETE status) account for the completion of all __kernel memory-writes to Global memory (i.e. not delayed as a result of a Global cache's eventual "write-back" with [or without] possible write-combining). Further, it's assumed that any memory-write would invalidate the corresponding Global memory location (or cache-line) to enforce an update for a subsequent memory-read from that same location by a subsequent __kernel (a process that's akin to the MESI [Modify, Exclusive, Shared, or Invalidate] memory consistency protocol). As an aside, I remember (years ago) about VxWorks' "virtual = physical" memory that required a memory-read immediately following any memory-write to force Cache-to-Host memory consistency between the Host processor's write-combining write-back cache and Host's SDRAM (Global) memory. Maybe that's going on here?
Many thanks for your reply *gbilotta* (and the time and effort you've afforded is definitely worth a reward )! I fully agree with and understand your assertions regarding multi-dimensional indices. Fortunately, my kernel only deals with 2D source and destination arrays with column-major indices incrementing as "r+(c*R)". My application utilizes the three (3) dimensions as follows:
- A "counting mechanism" to invoke the required quantity of computations [2048, 4096, 512] = 4-GFLOP across three nested loops, and
- An "index" into the source and destination arrays that directly map to respective memory address locations.
An alternative to using a 3D index would be to use a 2D index or a 1D index and embed the corresponding clEnqueueNDRangeKernel(...) dispatches in either a single loop or two nested loops, respectively, on the Host. Then, the Host must dispatch the required quantity of kernel instances (2D would require 2K kernel dispatches, and 1D would require 8M kernel dispatches), where each kernel instance would be governed by a clFinish(...) command. This is very slow and I have found that both of the reduced dimension approaches yield vastly sub-optimal performance as compared to the 3D approach. When dispatched for completion on the Host's CPU, the complete 4-GFLOP task takes ~20-seconds to complete. When dispatched as a 3D task on the GPU, the complete 4-GFLOP task takes about ~1.7-seconds to complete (or 2.5 GFLOPS).
On a separate note, although this 12x speed-up is good, it's far sub-standard as compared to the AMD HD7770 device's specified throughput of 1.28 TFLOPS. I'm thinking that much of the sub-standard performance might be attributed to my kernel's sole use of the GPU's Global GDDR5 memory. The sheer size of the kernel's 2D source and destination arrays prohibits the use of limited-capacity Local or Constant memory (AMD HD7770 furnishes 32-KB and 64-KB, respectively). Hence, the memory accesses are Global (Sapphire video card furnishes 992-MB of GDDR5 memory). But, the apparent 500x reduction in performance cannot possibly be attributed to a difference in Local to Global memory access times (I would have expected maybe a 2x to 3x reduction in performance). For example, the AMD HD7970 furnishes a (ref. "Heterogeneous Computing with OpenCL", ISBN #978-0-12-405894-1, p.129):
- Constant Memory: *110* GBps (L1-cache)
- Local Memory: 55 GBps (LDS)
- 768KB L2-cache: 700 GBps
- Global Memory: *260* GBps (GDDR5)
So, I'm somewhat perplexed as to what's contributing to the sub-standard performance. Any comments and suggestions are greatly appreciated. Many thanks for you time and expertise!
If your kernel is memory-bound, you should not look at the GFLOPS but at the effective bandwidth you achieve. This can be easily computed as (total amount of bytes read + total amount bytes written)/(kernel runtime). If you are significantly below 260GB/s, you are not exploiting the memory efficiently. This is usually caused by inefficient memory access patterns, such as strided access instead of consecutive address, which is what happens for example when the work-item indices are not mapped to the data topology efficiently.
The ‘natural’ mapping between a multi-dimensional global id and 2D indices is to use the first component (global_id(0)) as the faster-moving index: if your data is stored row-major (C-like), you want to use global_id(0) as the column index and global_id(1) as the row index. However, if your data is stored column-major (Fortran-like or Matlab-like) you want to use global_id(0) for the row and global_id(1) for the column.
Another important thing to keep in mind is that the hardware wavefront in AMD devices is typically 64 (which is the case also for the HD7770 you mention). If your workgroup has less than a wavefront of workitems (i.e. less than 64 in this case) you will never be able to fully exploit the computational power of your device. Specifically, if your workgroups have X workitems (with X < 64) then your top performance cannot be higher than (X/64) the peak theoretical performance. E.g., with 8 workitems per workgroup, you cannot get more than about 12.5% of the maximum theoretical throughput.
I apologize for the delayed response. I wanted to furnish a follow-up to our dialogue. Since then, I've characterized my observations for both throughput and memory consistency and synchronization. The following was performed using the latest driver and SDK (AMD Catalyst 13.11-beta6 for Linux / x86_64; and AMD APP-SDK v2.9 for Linux / x86_64, respectively).
To more precisely determine throughput, I've installed and utilized AMD's CodeXL utility. Altogether, with 100% GPU utilization (as reported by CodeXL), it appears that the process bottlenecks occur as a result of the process's reliance on the device's 992-MB of Global memory. An attempt to cache commonly accessed items to the device's 32-KB Local memory did improve throughput, but the large data-set sizes ( 1K x 4K x 4K ) and a subsequent restructuring of the process (described below) rendered caching to Local memory as superfluous. So, I'm kinda stuck with the the exclusive use of Global memory (for now).
More importantly (and central to the original dialogue), an attempt to improve the memory consistency and synchronization (between dependent processes) compelled a restructuring of the program, from one (x1) 3-D domain (1K x 4K x 4K) to two (x2) consecutive 2-D domains (1K x 4K) performed 4K times. This new implementation utilizes the Host-CPU to dispatch a pair of consecutive (1Kx4K) clEnqueueNDRangeKernel processes (via a "for" loop). The Host-CPU performs this dispatch 4K times. Each dispatch and process-pair are governed by a cl_event via clEnqueueBarrierWithWaitList (to force GPU task completion) and a clWaitForEvents (to force the Host-CPU to block until the GPU finishes the process-pair). The processes are performed out-of-place and a dedicated "CL_MEM_HOST_NO_ACCESS" *bounce* buffer in the GPU's Global memory holds the intermediate results between the process-pairs. Altogether, despite absolute memory addressing,and the aforementioned process coordination, the GPU implementation appears to mismanage memory accesses (incorrect and incomplete addresses) for clEnqueueNDRangeKernel *local_work_size* dimensions greater than (1,1). In fact, the problem gets exponentially worst with higher local_work_sizes (to a device maximum of (16, 16) ).
In contrast to the GPU implementation, I dispatched the identical Host and GPU OpenCL source code (of 4K dispatches of [1K x 4K] processes) on the Host-CPU (affording a maximum local_work_size of (32, 32)). The results are perfect on a quad-core (8-thread) Intel processor, albeit about 10x slower than the GPU). Strangely, my "hand-coded" non-OpenCL version (optimized to use the AVX constructs) is only 2x slower than the GPU's performance.
Altogether, I can't explain the reason for the disparity between the GPU and the CPU despite using the identical OpenCL host and __kernel source code. I'm going to try an NVIDIA device to determine if this disparity is endemic to the AMD device.
Your thoughts and suggestions are GREATLY APPRECIATED. As always, many thanks for your time and expertise.