OpenCL in a neural network

Discussion created by Fuxianjun on Aug 18, 2010

I am programming a neural network algorithm with OpenCL and have run into some problems.

Part of my algorithm is like this:
For example, take an x-y-z neural network: there is an input vector of length x, a middle vector of length y and an output vector of length z, plus two matrices whose element values are given.
The first matrix has dimension y*x and the second z*y. The algorithm is then: step 1, middle-vector = first-matrix * input-vector; step 2, output-vector = second-matrix * middle-vector. Of course a neural network also has biases and activation functions, but for the sake of simplicity they can be ignored here.
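To make the two steps concrete, here is a minimal reference sketch in plain Python (not OpenCL). The sizes x = 3, y = 2, z = 4 and all matrix values are made up for illustration, and the function names mat_vec and forward are hypothetical.

```python
# Plain-Python reference for the two steps (not OpenCL).
# All sizes and values below are made up for illustration.

def mat_vec(matrix, vector):
    """Multiply a row-major matrix (a list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def forward(first_matrix, second_matrix, input_vector):
    """Run step 1 and step 2, ignoring bias and activation."""
    middle_vector = mat_vec(first_matrix, input_vector)    # step 1, length y
    output_vector = mat_vec(second_matrix, middle_vector)  # step 2, length z
    return middle_vector, output_vector

# Example network with x = 3, y = 2, z = 4.
first_matrix = [[1, 0, 2],
                [0, 1, 1]]            # y*x = 2*3
second_matrix = [[1, 1],
                 [2, 0],
                 [0, 3],
                 [1, -1]]             # z*y = 4*2
input_vector = [1, 2, 3]

middle_vector, output_vector = forward(first_matrix, second_matrix, input_vector)
print(middle_vector)   # [7, 5]
print(output_vector)   # [12, 14, 15, 2]
```

Step 2 reads every element of middle_vector, which is why step 1 has to be completely finished before step 2 starts.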
In OpenCL I could separate these two steps into two kernels, with global_work_size set to y and z respectively. However, I am using OpenCL.NET, and when the algorithm is split into two kernels the total time gets longer.
So I can only implement the algorithm in one kernel.
My questions are:
1. How should I specify global_work_size? If global_work_size = max(y,z), then |y-z| work-items will be wasted in one of the steps. Will this work well?
2. Before step 2 can be calculated, all elements of the middle vector must have been computed, so a synchronization is needed at that point. However, you told me there is no global synchronization so far in GPGPU programming, so I can only use a single work-group. But CL_DEVICE_MAX_WORK_GROUP_SIZE of my GPU is 256, so the number of work-items I can use in my algorithm is limited. Is my analysis correct?
3. For multiplying a matrix by a vector, I found that the GPU is only several times faster than the CPU, no matter how big the matrix is. Am I right?
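The single-kernel plan from questions 1 and 2 can be sketched as a sequential simulation in plain Python (not OpenCL and not OpenCL.NET). This is only an illustration under stated assumptions: one work-group of local_size work-items, a work-group barrier between the two steps, and a strided loop so that each work-item may handle more than one row. The strided loop is an assumption added here, not something from the plan above, and the function name simulate_single_kernel and all values are hypothetical.

```python
# Sequential simulation of one work-group running both steps in one kernel.
# This is plain Python, not OpenCL; all names and values are illustrative.

def dot(row, vector):
    return sum(a * b for a, b in zip(row, vector))

def simulate_single_kernel(first_matrix, second_matrix, input_vector, local_size):
    y = len(first_matrix)
    z = len(second_matrix)
    middle_vector = [0] * y
    output_vector = [0] * z

    # Step 1: work-item lid computes rows lid, lid + local_size, ...
    # (assumed strided loop). Work-items with lid >= y have nothing to do.
    for lid in range(local_size):
        for row in range(lid, y, local_size):
            middle_vector[row] = dot(first_matrix[row], input_vector)
    idle_step1 = max(0, local_size - y)

    # In a real kernel a work-group barrier would go here, e.g.
    # barrier(CLK_GLOBAL_MEM_FENCE). It only synchronizes work-items
    # inside one work-group, which is why a single group is assumed.

    # Step 2: same pattern over the z output rows.
    for lid in range(local_size):
        for row in range(lid, z, local_size):
            output_vector[row] = dot(second_matrix[row], middle_vector)
    idle_step2 = max(0, local_size - z)

    return output_vector, idle_step1, idle_step2

# Example network with x = 3, y = 2, z = 4.
first_matrix = [[1, 0, 2], [0, 1, 1]]
second_matrix = [[1, 1], [2, 0], [0, 3], [1, -1]]
input_vector = [1, 2, 3]

# local_size = max(y, z) = 4: |y - z| = 2 work-items idle in step 1.
result_max = simulate_single_kernel(first_matrix, second_matrix, input_vector, 4)
print(result_max)    # ([12, 14, 15, 2], 2, 0)

# local_size = 1: a single work-item loops over every row, none idle.
result_one = simulate_single_kernel(first_matrix, second_matrix, input_vector, 1)
print(result_one)    # ([12, 14, 15, 2], 0, 0)
```

Under these assumptions the result does not depend on local_size, so y and z could exceed the 256 work-item limit of one group; whether that is fast enough on a real GPU is a separate question that this sketch does not measure.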