7 Replies Latest reply on Jul 28, 2011 12:19 PM by himanshu.gautam

    CPU and APU optimizations

    pratapk
      CPU and APU memory optimizations

      For OpenCL targeted at an APU:

      1) In an APU, the graphics core shares main memory (instead of having dedicated VRAM).

      Is it really required to do buffer copies and to use local and global memory (except for synchronization)?

      Can't we just use host_ptr and mapped memory, since it all resides in main memory?

      On an APU, which memory backs the 32 KB local memory (is there something like a cache)?

       

      2) Is there a sample for the APU?

       

      For OpenCL targeted at the CPU:

      1) I thought the CPU workgroup size should be on the order of 1, since CPU cores are not stream processors (I am talking about the warp size).

      But when I query the maximum workgroup size with clGetDeviceInfo, it returns 1024.

      What is the best-practice workgroup size for the CPU (similar to 64 for AMD GPUs)?

       

        • CPU and APU optimizations
          himanshu.gautam

          1) In an APU, the graphics core shares main memory (instead of having dedicated VRAM).

          In an APU, the CPU and GPU share the same RAM, although I am not sure about the sharing policy they use.

           

          Is it really required to do buffer copies and to use local and global memory (except for synchronization)?

          Can't we just use host_ptr and mapped memory, since it all resides in main memory?

          That depends on the policy used for sharing the RAM. If the RAM regions for the GPU and CPU are defined exclusively, you will have to do buffer copies from RAM to RAM.


          On an APU, which memory backs the 32 KB local memory (is there something like a cache)?

          The CPU and GPU each have their own dedicated caches.

           

          2) Is there a sample for the APU?

          Do you have an APU with you? AFAIK, the current SDK samples should run on an APU. Are you facing any issues?


          For OpenCL targeted at the CPU:

          1) I thought the CPU workgroup size should be on the order of 1, since CPU cores are not stream processors (I am talking about the warp size).

          But when I query the maximum workgroup size with clGetDeviceInfo, it returns 1024.

          What is the best-practice workgroup size for the CPU (similar to 64 for AMD GPUs)?

          As I see it, 64 is not a hard rule. It normally works better when kernels are compute-heavy. If kernels are fetch- or write-bound, it is better to assign more work-items to each workgroup. 1024 is the maximum number of work-items supported in a single workgroup on the CPU; this value is 256 for most GPUs.


            • CPU and APU optimizations
              pratapk

              Quoted:"That depends on the policy used for sharing the RAM. If the RAM regions for the GPU and CPU are defined exclusively, you will have to do buffer copies from RAM to RAM."

              Do you know the policy for the APU? Instead of copying buffers from RAM to RAM, can't we operate on the same set of data?

               

              Quoted:"Do you have an APU with you? AFAIK, the current SDK samples should run on an APU. Are you facing any issues?"

              The current examples are optimized (maybe targeted) for CPU and discrete GPU combinations; I didn't really find an example optimized for an APU.

               

              Quoted:"What is the best-practice workgroup size for the CPU (similar to 64 for AMD GPUs)?

              As I see it, 64 is not a hard rule. It normally works better when kernels are compute-heavy. If kernels are fetch- or write-bound, it is better to assign more work-items to each workgroup. 1024 is the maximum number of work-items supported in a single workgroup on the CPU; this value is 256 for most GPUs."

              Can you be specific about the CPU?

            • CPU and APU optimizations
              maximmoroz

              > Is it really required to do buffer copies and to use local and global memory (except for synchronization)?

              1. You are better off using clEnqueueMapBuffer instead of clEnqueueReadBuffer/clEnqueueWriteBuffer. It is fast on APUs, as no actual copying occurs.

              2. The integrated GPU has dedicated local memory, so if the algorithm benefits from using local memory, it is better to use it.
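              A minimal host-side sketch of that map-instead-of-copy pattern (illustrative only: it assumes `context`, `queue`, and `err` are already set up, the buffer size and flags are arbitrary, and running it requires an OpenCL runtime with an APU device):

```c
/* Sketch: zero-copy style access via clEnqueueMapBuffer.
 * Assumes cl_context context, cl_command_queue queue, cl_int err exist. */
size_t size = 1024 * sizeof(float);

/* CL_MEM_ALLOC_HOST_PTR asks the runtime for host-visible memory,
 * which the APU's GPU can reach without a separate staging copy. */
cl_mem buf = clCreateBuffer(context, CL_MEM_ALLOC_HOST_PTR, size, NULL, &err);

/* Map instead of clEnqueueWriteBuffer: on an APU this typically
 * returns a pointer into the same physical RAM, with no copy. */
float *ptr = (float *)clEnqueueMapBuffer(queue, buf, CL_TRUE, CL_MAP_WRITE,
                                         0, size, 0, NULL, NULL, &err);
for (size_t i = 0; i < 1024; ++i)
    ptr[i] = (float)i;                  /* fill the data in place */

/* Unmap before any kernel uses the buffer. */
clEnqueueUnmapMemObject(queue, buf, ptr, 0, NULL, NULL);
```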

              • CPU and APU optimizations
                maximmoroz

                > I thought the CPU workgroup size should be on the order of 1, since CPU cores are not stream processors (I am talking about the warp size)

                Actually, the fact that the AMD APP SDK is not able to auto-vectorize kernels when compiling for the CPU causes the CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE parameter to be 1. Generally, the workgroup size should be greater than the warp/wavefront size, and it usually is.

                > What is the best-practice workgroup size for the CPU?

                Set it to 64. In most cases such a local work size will be close to the most efficient one.