I only know that barrier(CLK_LOCAL_MEM_FENCE) makes sure work-items in the same work-group run synchronously. But how can I make all work-items run synchronously, even those that are not in the same work-group? Should I use barrier(CLK_GLOBAL_MEM_FENCE)?

*Originally posted by genaganna:* We don't have global synchronization so far in GPGPU programming.

barrier(CLK_GLOBAL_MEM_FENCE) is also applicable only within a work-group.

Thanks for the reply! But what's the difference between barrier(CLK_GLOBAL_MEM_FENCE) and barrier(CLK_LOCAL_MEM_FENCE)?

*Originally posted by Fuxianjun:* Thanks for the reply! But what's the difference between barrier(CLK_GLOBAL_MEM_FENCE) and barrier(CLK_LOCAL_MEM_FENCE)?

Both block work-items within a work-group, as follows:

CLK_LOCAL_MEM_FENCE - The barrier function will either flush any variables stored in local memory or queue a memory fence to ensure correct ordering of memory operations to local memory.

CLK_GLOBAL_MEM_FENCE - The barrier function will queue a memory fence to ensure correct ordering of memory operations to global memory. This can be useful when work-items, for example, write to buffer or image objects and then want to read the updated data.

*Originally posted by genaganna:* Both block work-items within a work-group, as follows:

CLK_LOCAL_MEM_FENCE - The barrier function will either flush any variables stored in local memory or queue a memory fence to ensure correct ordering of memory operations to local memory.

CLK_GLOBAL_MEM_FENCE – The barrier function will queue a memory fence to ensure correct ordering of memory operations to global memory. This can be useful when work-items, for example, write to buffer or image objects and then want to read the updated data.
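To see why the barrier matters even inside one work-group, here is a hedged sketch in plain C (the thread itself contains no code): a work-group tree reduction, where each "phase" between barriers is emulated by letting every work-item finish the phase before the next one starts, which is exactly the guarantee barrier() provides. The function name and work-group size are hypothetical.

```c
#include <assert.h>

#define WG_SIZE 8 /* hypothetical work-group size */

/* Emulates a work-group tree reduction in local memory.
 * In OpenCL each work-item would run this concurrently and
 * barrier(CLK_LOCAL_MEM_FENCE) would separate the phases;
 * here we emulate a phase by looping over all "work-items"
 * before moving on. */
int workgroup_reduce_sum(const int *input)
{
    int local_mem[WG_SIZE];

    /* Phase 1: every work-item copies its element into local memory. */
    for (int lid = 0; lid < WG_SIZE; ++lid)
        local_mem[lid] = input[lid];
    /* barrier(CLK_LOCAL_MEM_FENCE) would go here. */

    /* Phase 2: tree reduction, one barrier per halving step. */
    for (int stride = WG_SIZE / 2; stride > 0; stride /= 2) {
        for (int lid = 0; lid < stride; ++lid)
            local_mem[lid] += local_mem[lid + stride];
        /* barrier(CLK_LOCAL_MEM_FENCE) would go here. */
    }
    return local_mem[0];
}
```

Without the barrier between steps, a work-item could read a partial sum that its neighbour has not yet written; the barrier (local or global flavour) rules that out only within one work-group.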

I got your email address from the AMD forum. I'm sure you can help me because you have posted so many messages.

Now I am programming a neural-network algorithm with OpenCL and have run into some problems. Please let me present them to you!

Part of my algorithm is like this:

For example, the neural network is x-y-z: there is an input vector of length x, a middle vector of length y, and an output vector of length z, plus two matrices whose element values are given.

The first matrix's dimensions are y*x and the second's are z*y. The algorithm is then: step 1, middle-vector = first-matrix * input-vector; step 2, output-vector = second-matrix * middle-vector. Of course a neural network also has biases and activation functions, but for simplicity they can be ignored.
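The two steps above can be sketched in plain C (hypothetical names; each outer-loop iteration corresponds to what one work-item with global id i would compute in a kernel):

```c
/* out[i] = dot(row i of matrix, vec). One iteration of the outer
 * loop corresponds to the work of a single work-item with gid == i. */
void mat_vec(const float *matrix, const float *vec,
             float *out, int rows, int cols)
{
    for (int i = 0; i < rows; ++i) {
        float acc = 0.0f;
        for (int j = 0; j < cols; ++j)
            acc += matrix[i * cols + j] * vec[j];
        out[i] = acc;
    }
}

/* Step 1: middle = M1 (y*x) * input.  Step 2: out = M2 (z*y) * middle.
 * Step 1 must fully finish before step 2 reads middle[] -- this is
 * the ordering that needs global synchronization on the GPU. */
void forward(const float *m1, const float *m2, const float *input,
             float *middle, float *out, int x, int y, int z)
{
    mat_vec(m1, input, middle, y, x);
    mat_vec(m2, middle, out, z, y);
}
```

On the CPU the ordering is free; the whole thread below is about how to get the same ordering between the two steps on a GPU.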

In OpenCL programming, I can separate these two steps into two kernels with global_work_sizes of y and z respectively. However, what I use is OpenCL.NET, and if the algorithm is split into two kernels the consumed time gets longer.

So I can only implement the algorithm in one kernel.

My problems are:

1. How do I specify global_work_size? If global_work_size = max(y, z), then |y - z| work-items will be wasted in one of the steps. Will this work well?

2. Before calculating step 2, all elements of the middle vector must be computed, so a synchronization is needed here. However, you told me there is no global synchronization so far in GPGPU programming, so I can only use one work-group. But CL_DEVICE_MAX_WORK_GROUP_SIZE of my GPU is 256, so the number of work-items my algorithm can use is limited. Is my analysis correct?

3. For multiplying a matrix by a vector, I found that the GPU's speed is only a few times the CPU's, no matter how big the matrix is. Am I right?

*Originally posted by Fuxianjun:* For example, the neural network is x-y-z: there is an input vector of length x, a middle vector of length y, and an output vector of length z, plus two matrices whose element values are given. The first matrix's dimensions are y*x and the second's are z*y. The algorithm is then: step 1, middle-vector = first-matrix * input-vector; step 2, output-vector = second-matrix * middle-vector. Of course a neural network also has biases and activation functions, but for simplicity they can be ignored. In OpenCL programming, I can separate these two steps into two kernels with global_work_sizes of y and z respectively. However, what I use is OpenCL.NET, and if the algorithm is split into two kernels the consumed time gets longer. So I can only implement the algorithm in one kernel.

As per my understanding, it is just matrix multiplication. I do not think it is possible to implement the two steps in a single kernel. Please first implement it with two kernels; you can optimize later. Please look at the SDK samples MatrixMultiplication and MatrixMulImage to understand how to write matrix multiplication.

*Originally posted by Fuxianjun:* 1. How do I specify global_work_size? If global_work_size = max(y, z), then |y - z| work-items will be wasted in one of the steps. Will this work well?

First step: global_work_size = number of elements in the middle vector.

Second step: global_work_size = number of elements in the output vector.

2. Before calculating step 2, all elements of the middle vector must be computed, so a synchronization is needed here. However, you told me there is no global synchronization so far in GPGPU programming, so I can only use one work-group. But CL_DEVICE_MAX_WORK_GROUP_SIZE of my GPU is 256, so the number of work-items my algorithm can use is limited. Is my analysis correct?

Yes, you can solve this problem with a single kernel and a single work-group, but performance will be very poor because most of the GPU sits idle. You should use two kernels, where you can use a larger number of work-groups, which gives good performance.
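For reference, the single-work-group variant genaganna describes can be sketched in plain C (hypothetical names; the loop over lid emulates the work-items of the one work-group, and each work-item covers elements lid, lid + GROUP_SIZE, lid + 2*GROUP_SIZE, … when y or z exceeds the group size):

```c
#define GROUP_SIZE 4 /* stand-in for CL_DEVICE_MAX_WORK_GROUP_SIZE */

/* Both steps in "one kernel" with a single work-group.
 * In OpenCL a barrier(CLK_GLOBAL_MEM_FENCE) between the two steps
 * would make every middle[] element visible before step 2 reads it;
 * here the phase split in the emulation plays that role. */
void forward_single_group(const float *m1, const float *m2,
                          const float *input, float *middle,
                          float *out, int x, int y, int z)
{
    /* Step 1: middle = M1 * input, strided over the work-items. */
    for (int lid = 0; lid < GROUP_SIZE; ++lid)
        for (int i = lid; i < y; i += GROUP_SIZE) {
            float acc = 0.0f;
            for (int j = 0; j < x; ++j)
                acc += m1[i * x + j] * input[j];
            middle[i] = acc;
        }
    /* barrier(CLK_GLOBAL_MEM_FENCE) would go here. */

    /* Step 2: out = M2 * middle, same strided mapping. */
    for (int lid = 0; lid < GROUP_SIZE; ++lid)
        for (int i = lid; i < z; i += GROUP_SIZE) {
            float acc = 0.0f;
            for (int j = 0; j < y; ++j)
                acc += m2[i * y + j] * middle[j];
            out[i] = acc;
        }
}
```

This is why the approach is capped at CL_DEVICE_MAX_WORK_GROUP_SIZE work-items: only one group can use the barrier, so the rest of the GPU contributes nothing.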

3. For multiplying a matrix by a vector, I found that the GPU's speed is only a few times the CPU's, no matter how big the matrix is. Am I right?

That depends purely on the size of the matrix and vector.

*Originally posted by genaganna:*

As per my understanding, it is just matrix multiplication. I do not think it is possible to implement the two steps in a single kernel. Please first implement it with two kernels; you can optimize later. Please look at the SDK samples MatrixMultiplication and MatrixMulImage to understand how to write matrix multiplication.

Thanks again, you have done me a big favor! I have tested that separating these two steps into two kernels is much slower than combining them into one. But what I use is OpenCL.NET, which encapsulates the whole OpenCL API, and I have no idea whether that is the reason. I just think that even without the third-party wrapper, executing one more kernel makes the CPU-to-GPU data-transfer time longer than with just one kernel. I'm not sure about this; can you share your opinion?

*Originally posted by Fuxianjun:*

Thanks again, you have done me a big favor! I have tested that separating these two steps into two kernels is much slower than combining them into one. But what I use is OpenCL.NET, which encapsulates the whole OpenCL API, and I have no idea whether that is the reason. I just think that even without the third-party wrapper, executing one more kernel makes the CPU-to-GPU data-transfer time longer than with just one kernel. I'm not sure about this; can you share your opinion?

Yes, you are right: if you execute one more kernel, it will take more time than running a single kernel. But I don't think you will be able to implement this in a single kernel.

*Originally posted by genaganna:*

Yes, you are right: if you execute one more kernel, it will take more time than running a single kernel. But I don't think you will be able to implement this in a single kernel.

I'm so grateful to you !

I have tested this algorithm in one kernel, and it is indeed much faster than two kernels, even though the kernel has some limitations. Comparatively, I would rather choose one kernel with some limitations for the sake of performance.

The limitations are:

1. I can only use one work-group, because there is no global synchronization so far in GPGPU programming.

2. Because the neural network's size is arbitrary, some work-items will be wasted.

Now, my OpenCL code implementing the neural-network algorithm is about 3 times faster than the CPU. That is not too bad. I'm trying to

*Originally posted by genaganna:* If you use a single work-group, you are not using 90% of the GPU's resources.

What are your matrix sizes?

Any size, but not too big. I found that it is indeed much slower with two kernels than with one. I have to implement the whole algorithm in one kernel even though it wastes a lot of resources, because the CPU-to-GPU data-transfer time is too long.

*Originally posted by genaganna:* If you use a single work-group, you are not using 90% of the GPU's resources.

I ran into the same problem as the OP. I can either create multiple work-groups and synchronize globally using something like clFinish(), or I can use a single work-group and have each thread perform multiple iterations (which for large sizes I don't think will give as good ALU utilization). So, are there any plans to support a feature like this in the future (even if it's AMD-specific)?

Hi,

I am trying to implement a 1D FFT (out-of-place, radix-2 decimation-in-time) using a single kernel, but I am running into the global synchronization problem between work-groups. It appears once the input size becomes larger than what can be handled by the maximum number of work-groups that can execute concurrently on the underlying hardware, given our kernel's resource usage (16 registers and a work-group size of 256).

Is a solution for global synchronization available in OpenCL 1.1?

It is possible to do global synchronization if you're very careful with the use of atomics, make the global pointers volatile and use fences in the appropriate places. However, because there is no clean way to simply fill the machine with wavefronts in the current version of OpenCL you have to be very careful doing it.

If you can compute the total number of wavefronts that will fit on the device you could do it. If you cannot do this you will need to create global counters to see how many waves are concurrent on the device. Remember that you do not under any circumstances want to try to synchronize the entire dispatch because much of the dispatch will not be executing until the early waves have completed.

For most algorithms such as a matrix multiply or FFT, the overhead of a second dispatch is minimal because the work-groups will read data and then write data to global memory; the loop you would put inside the kernel to do global synchronization would be as much overhead as letting the hardware perform that loop for you. The dispatches themselves will amortize within the total execution time and not add much. The only time there is an obvious advantage to doing global synchronization instead of a second dispatch is when it lets you maintain internal state so that you do not have to write intermediate data out, saving you global memory traffic overall. Think very carefully about whether it makes sense to use this approach over using large dispatches.
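The atomic-counter idea LeeHowes alludes to can be sketched on the CPU with C11 atomics and pthreads (all names hypothetical). The sketch assumes the precondition he stresses: exactly NUM_GROUPS "work-groups" are all resident and executing at once, otherwise the barrier deadlocks because waiting groups block the ones that never launched.

```c
#include <pthread.h>
#include <stdatomic.h>

#define NUM_GROUPS 4 /* assume all "work-groups" are resident at once */

static atomic_int arrived = 0;
static atomic_int generation = 0;
static int data[NUM_GROUPS];
static int results[NUM_GROUPS];

/* A counter-based global barrier: the last arriver resets the count
 * and bumps the generation; everyone else spins on the generation.
 * On a GPU the spinning would be on volatile global memory with
 * atomics and fences, and it only works if every group is running. */
static void global_barrier(void)
{
    int gen = atomic_load(&generation);
    if (atomic_fetch_add(&arrived, 1) == NUM_GROUPS - 1) {
        atomic_store(&arrived, 0);
        atomic_fetch_add(&generation, 1); /* release the others */
    } else {
        while (atomic_load(&generation) == gen)
            ; /* spin until released */
    }
}

static void *group(void *arg)
{
    int gid = (int)(long)arg;
    data[gid] = gid + 1;                          /* phase 1: write */
    global_barrier();                             /* all writes visible */
    results[gid] = data[(gid + 1) % NUM_GROUPS];  /* phase 2: read neighbour */
    return 0;
}

/* Launch the "groups", wait, and sum what each read after the barrier. */
int run_groups(void)
{
    pthread_t t[NUM_GROUPS];
    int sum = 0;
    for (long i = 0; i < NUM_GROUPS; ++i)
        pthread_create(&t[i], 0, group, (void *)i);
    for (int i = 0; i < NUM_GROUPS; ++i)
        pthread_join(t[i], 0);
    for (int i = 0; i < NUM_GROUPS; ++i)
        sum += results[i];
    return sum;
}
```

Each group reads a neighbour's value written before the barrier, so a correct barrier yields the sum 1 + 2 + 3 + 4; without it, a group could read a stale zero. This illustrates the mechanism only; on real hardware the residency caveat above is the hard part.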

If you are seeing time savings from a single kernel over two kernels, is this because each kernel is only a single wavefront (given that you haven't implemented global synchronization I can only guess that this is what you're testing) in which case the overhead of dispatches would be very high for a small amount of work.

*Originally posted by LeeHowes:*

It is possible to do global synchronization if you're very careful with the use of atomics, make the global pointers volatile and use fences in the appropriate places.

Please guide me to where I can get more information about "volatile global pointers". I just want to investigate the single-kernel approach for knowledge's sake.

If you are seeing time savings from a single kernel over two kernels, is this because each kernel is only a single wavefront (given that you haven't implemented global synchronization I can only guess that this is what you're testing) in which case the overhead of dispatches would be very high for a small amount of work.

Yes, this is the case. For small sizes that fit into a single work-group, the single kernel performs better than launching two kernels.

I just mean that you need to make sure that pointers are declared volatile (__global volatile int *blah) to ensure that the compiler expects that other work items may change that pointer rather than thinking it can keep the data in registers.

For small sizes that fit in a single workgroup most of your execution time is going to be in the runtime anyway, it's bound to go faster using one kernel than two. Don't assume that that has a meaningful relationship to running a much larger dataset with hundreds of workgroups. It would depend on how long your kernel is running for and what the overhead of a dispatch is once a kernel is already in the queue.
