Hi
I have split my big algorithm into two kernels, kernel[0] and kernel[1].
kernel[0] updates outputbuffer (1280x720), and I read outputbuffer back into "hostoutput" (1280x720) with clEnqueueReadBuffer.
Next I modify "hostoutput" on the application side, and then I want to send the modified "hostoutput" to kernel[1]. Here I just set the kernel[1] argument to outputbuffer.
My doubts:
1. Should I create a new buffer object with clCreateBuffer again for kernel[1] from hostoutput, with the flags set to CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR?
I thought "hostoutput" and the OpenCL device outputbuffer would always stay coherent until the buffer is released.
http://forums.amd.com/devforum/messageview.cfm?catid=390&threadid=129436&enterthread=y
I tried using clEnqueueMapBuffer, but I couldn't get the output; in fact, I get NULL output.
Could you point me to any examples of this scenario (multiple kernels, reusing the output buffer of kernel[0] as the input buffer of kernel[1], and of course some modifications between the kernel[0] and kernel[1] executions)?
I hope you understand what I mean.
Hi Nou and all,
Please help me find an example of the above scenario.
I tried clEnqueueMapBuffer, and I also tried creating new buffers for each kernel without clEnqueueMapBuffer. It seems I am not getting the hang of buffer management.
Thanks in advance
Pavan
Are you doing something like this?
1. clCreateBuffer with CL_MEM_READ_WRITE
2. Call kernel[0]
3. clEnqueueMapBuffer
4. Modify data
5. clEnqueueUnmapMemObject
6. Call kernel[1]
This should work without any problem.
Scenario: I want the output buffer of kernel[0] to be modified in the host application, and then send the modified buffer as input to kernel[1].
kernel[1] modifies the input buffer and stores the results in the same buffer.
I am doing something like this.
1. outputBuffer = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR, sizeof(cl_uint) * width * 3, output, &status);
2. output = (cl_uint *)clEnqueueMapBuffer(commandQueue, outputBuffer, CL_TRUE, CL_MAP_READ | CL_MAP_WRITE, 0, sizeof(cl_uint) * width * 3, 0, NULL, &events[0], &status);
   // Map the host memory "output" to the device outputBuffer so that I can use outputBuffer as the input to kernel[1].
3. clSetKernelArg(kernel[0], 4, sizeof(cl_mem), (void *)&outputBuffer);
4. Call kernel[0] with clEnqueueNDRangeKernel.
5. clEnqueueReadBuffer(commandQueue, outputBuffer, CL_TRUE, 0, width * 3 * sizeof(cl_uint), output, 0, NULL, &events[1]);
6. Modify "output" in the host application.
7. clSetKernelArg(kernel[1], 0, sizeof(cl_mem), (void *)&outputBuffer);
   // Here I set the same outputBuffer as the input to kernel[1], without creating a new buffer object, assuming the host buffer "output" and the device buffer "outputBuffer" are in sync.
8. Call kernel[1] with clEnqueueNDRangeKernel.
9. Then clEnqueueReadBuffer(commandQueue, outputBuffer, CL_TRUE, 0, width * 3 * sizeof(cl_uint), output, 0, NULL, &events[1]);
Please correct anything that is wrong in this flow.
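In clEnqueueReadBuffer, you are trying to copy data from the pointer output to the same pointer location: outputBuffer was created with CL_MEM_USE_HOST_PTR over output, so output is both the storage behind the buffer and the destination of the read.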
I would suggest using map/unmap instead of clEnqueueReadBuffer. Since your buffer is on the host itself, map/unmap should be faster.
Also, you have not unmapped the pointer after the second step; you must unmap it before launching the kernel.