
firemars
Journeyman III

Coherence between CPU and GPU in Fusion APU

Suppose I allocate a memory buffer "D" in system memory and launch two kernels at the same time, one on the CPU and one on the GPU.

If the GPU uses atomic operations on "D", and the CPU also uses atomic operations on "D", can I get a coherent result?

Thanks


6 Replies

We have a test case that does just this internally, so it does work.

Thanks for your reply.

Actually, I also ran some experiments on this.

It seems that sometimes I get a coherent result and sometimes I don't, whether or not I use USWC memory.

When I run just one kernel, on either the CPU or the GPU, the atomic operations between threads give the correct result. But it does not work reliably when I run the two kernels at the same time.

I am interested in how atomic operations behave across two different kernels. Also, if I use an atomic operation on the CPU (in plain C, not in a kernel), can it still be coherent with the atomic operations inside a kernel?

Thanks


firemars,
I believe this will only work with an APU, not with a discrete GPU. There are no atomic operations across the PCIe bus, so atomicity is broken at that point.

Hi Micah,

Thanks for your reply.

I ran the experiment on an APU, an A8-3850, under Windows 7.

I launched two kernels, one on the CPU and the other on the GPU.

Both kernels do an atomic_add on one memory object, which is created with CL_MEM_ALLOC_HOST_PTR.

From what I have seen, atomicity between these two kernels cannot be guaranteed.

I think the reason might be that the CPU uses write-combining buffers, so the GPU cannot read the latest data.

If that is true, is there any way to make sure the CPU flushes its writes to memory?

Thanks


Originally posted by: firemars [full post quoted above]

firemars,

It is not a write-combining buffer if the buffer is created with CL_MEM_ALLOC_HOST_PTR.

It looks like something else is going wrong. Could you please paste your experimental code here?


Because the code of my experiment is too long, I have copied the relevant parts here. The program works well when using only the CPU or only the GPU.

However, with CPU+GPU there is a coherency problem, as in the following trace of the values returned by atomic_add:

CPU: data: 1;

CPU: data: 2;

CPU: data: 3;

GPU: data: 2;

CPU: data: 3;

CPU: data: 4;

The value 3 is returned to the CPU twice, which should never happen with a single atomic counter.

Another observation: if I increase the workload per data item, this happens less often.

 

///////////////////////////////main.cpp///////////////////////////////////////////
struct Shared_data
{
    int data;
} shared_data;

cl_mem d_shared_data;

int main()
{
    cl_int err;
    shared_data.data = 0;
    d_shared_data = clCreateBuffer(context, CL_MEM_ALLOC_HOST_PTR,
                                   sizeof(struct Shared_data), NULL, &err);

    /* Map the buffer, initialize it from the host, then unmap. */
    struct Shared_data *p_shared_data = (struct Shared_data *)clEnqueueMapBuffer(
        command_queue, d_shared_data, CL_TRUE, CL_MAP_WRITE, 0,
        sizeof(struct Shared_data), 0, NULL, NULL, &err);
    memcpy(p_shared_data, &shared_data, sizeof(struct Shared_data));
    clEnqueueUnmapMemObject(command_queue, d_shared_data, p_shared_data,
                            0, NULL, NULL);

    create_thread(work); // for the GPU queue
    create_thread(work); // for the CPU queue
}

void work()
{
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &d_shared_data);

    clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL, global, local,
                           0, NULL, NULL);
}

////////////////////////////Kernel.cl////////////////////////////////////////////
#pragma OPENCL EXTENSION cl_khr_global_int32_base_atomics : enable

__kernel void fun(volatile __global struct Shared_data *shared_data)
{
    int local_id = get_local_id(0);
    __local int s_data[1];   /* local arrays need a compile-time size */

    while (1)
    {
        if (local_id == 0)
        {
            /* One work-item per group claims the next index atomically. */
            s_data[0] = atomic_add(&(shared_data->data), 1);
        }
        barrier(CLK_LOCAL_MEM_FENCE);

        if (s_data[0] > threshold)
            break;

        // do work on s_data[0]
        ...........
    }
}

 
