I create three command queues: one to write buffers, one to read buffers, and one to execute kernels. There is a set of kernels that execute in sequence. The program has three "stages": the first provides the inputs (writing buffers), the second executes the kernels, and the last reads back the results of the kernels. Buffer reads, buffer writes, and kernel execution run in parallel, but while a buffer is being read or written, the kernel execution isn't continuous. The gap usually occurs between the first stage and the last stage. According to CodeXL, the gap is about 5 ms. If I remove the buffer reads and writes (ignoring whether the results are still correct), the kernels execute with no gaps. I have looked through the optimization guide on AMD's website but could not find an explanation. Is there any way to reduce the impact of clEnqueueReadBuffer/clEnqueueWriteBuffer on kernel execution without reducing performance?
About the environment: I am using a FirePro W9100 on 64-bit Windows 7. The AMD CCC version for the FirePro W9100 is 2015.0113.1141.20974.