I am implementing a neural network algorithm with OpenCL and have run into some problems.

Part of my algorithm works like this:

Suppose the neural network is x-y-z. That means there is an input vector of length x, a middle (hidden) vector of length y, and an output vector of length z. There are also two matrices whose element values are given.

The first matrix has dimensions y*x and the second has dimensions z*y. The algorithm is then: step 1, middle-vector = first-matrix * input-vector; step 2, output-vector = second-matrix * middle-vector. Of course a real neural network also has biases and activation functions, but for simplicity they can be ignored here.
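In plain C on the CPU, the two steps would look roughly like this. It is only a reference sketch: the function names `mat_vec` and `forward` and the row-major matrix storage are assumptions made for illustration, not part of my OpenCL code.

```c
#include <stddef.h>

/* out = m * v, where m is a rows x cols matrix stored row-major
 * (the storage order is an assumption for this sketch). */
void mat_vec(const float *m, const float *v, float *out,
             size_t rows, size_t cols)
{
    for (size_t r = 0; r < rows; ++r) {
        float sum = 0.0f;
        for (size_t c = 0; c < cols; ++c)
            sum += m[r * cols + c] * v[c];
        out[r] = sum;
    }
}

/* Forward pass of an x-y-z network, without bias or activation.
 * w1 is y*x, w2 is z*y. */
void forward(const float *w1, const float *w2, const float *input,
             float *middle, float *output,
             size_t x, size_t y, size_t z)
{
    mat_vec(w1, input, middle, y, x);   /* step 1: needs all of input    */
    mat_vec(w2, middle, output, z, y);  /* step 2: needs ALL of middle   */
}
```

Each row of a product is independent of the others, which is why step 1 maps onto y work-items and step 2 onto z work-items. Step 2 cannot start until every element of the middle vector has been written.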

In OpenCL, I could separate these two steps into two kernels, with global_work_size set to y and z respectively. However, I am using OpenCL.NET, and when the algorithm is split into two kernels the total running time gets longer.

So it seems I have to implement the algorithm in a single kernel.

My questions are:

1. How should I specify global_work_size? If global_work_size = max(y,z), then |y-z| work-items will sit idle in one of the two steps. Will this work well?

2. Before step 2 is calculated, all elements of the middle vector must have been computed, so a synchronization point is needed there. However, you told me that there is no global synchronization in GPGPU programming so far, so I can only use one work-group. But CL_DEVICE_MAX_WORK_GROUP_SIZE of my GPU is 256, so the number of work-items my algorithm can use is limited. Is my analysis correct?

3. For multiplying a matrix by a vector, I found that the GPU is only several times faster than the CPU, no matter how large the matrix is. Am I right?
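To make questions 1 and 2 concrete, here is a plain C emulation of the single-kernel layout I have in mind. It is only a sketch under stated assumptions: the name `fused_forward_emulated` and the row-major storage are made up for illustration, each loop over `gid` stands in for the work-items of one work-group, and the comment marks the point where a real OpenCL kernel would need `barrier(CLK_GLOBAL_MEM_FENCE)`, which synchronizes only the work-items inside one work-group.

```c
#include <stddef.h>

/* Emulates one kernel launched with global_work_size = max(y, z).
 * Returns how many work-item slots did nothing across both phases. */
size_t fused_forward_emulated(const float *w1, const float *w2,
                              const float *input,
                              float *middle, float *output,
                              size_t x, size_t y, size_t z)
{
    size_t n = (y > z) ? y : z;   /* hypothetical global_work_size */
    size_t idle = 0;

    /* Phase 1: work-item gid computes middle[gid], if it exists. */
    for (size_t gid = 0; gid < n; ++gid) {
        if (gid < y) {
            float sum = 0.0f;
            for (size_t c = 0; c < x; ++c)
                sum += w1[gid * x + c] * input[c];
            middle[gid] = sum;
        } else {
            ++idle;               /* guarded out: nothing to do */
        }
    }

    /* A real kernel would need barrier(CLK_GLOBAL_MEM_FENCE) here,
     * so every middle[] element is visible before phase 2 reads it.
     * That barrier only covers a single work-group. */

    /* Phase 2: work-item gid computes output[gid], if it exists. */
    for (size_t gid = 0; gid < n; ++gid) {
        if (gid < z) {
            float sum = 0.0f;
            for (size_t c = 0; c < y; ++c)
                sum += w2[gid * y + c] * middle[c];
            output[gid] = sum;
        } else {
            ++idle;
        }
    }
    return idle;                  /* equals |y - z| */
}
```

With y = 3 and z = 2, for example, three work-items run and one of them is idle during phase 2. The guards `gid < y` and `gid < z` are what keep the extra work-items from writing out of bounds.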