I'm trying to accelerate vector arithmetic on vectors with more than 100,000 elements.
First I wrote the following .cl code:
//! Add each element
/*!
\param result vector the result is stored to
\param left first input vector
\param right second input vector, scaled by C
\param C coefficient applied to the right vector
*/
__kernel void AddVectorVector(
    __global Real* result,
    __global const Real* left,
    __global const Real* right,
    const Real C)
{
    // get element index
    int i = get_global_id(0);
    // add each element
    result[i] = left[i] + C * right[i];
}
But this is very slow, slower than a single CPU core.
How can I optimize this code? Or is this the limit of the GPU?
OS: Windows 7
CPU: Xeon E3-1245
GPU: Radeon HD 6950
Given the simplicity of your kernel there isn't much you can do to improve performance. If you have to transfer the data to the GPU and back just to apply this kernel, you will not see good performance. A kernel like this only benefits from a discrete GPU if the data already resides on the GPU.
You might want to try running this kernel on the CPU (with either the AMD or the Intel OpenCL SDK). In that case you would not incur the PCIe transfers, and performance should be better.
What type is 'Real'? It is not a built-in OpenCL type, so the issue may lie with that type. Also, you need to vectorize your kernel to reach peak bandwidth, not just run a scalar copy kernel.
>What type is 'Real'?
Sorry, I forgot to copy "typedef double Real;" at line 1.
I now understand "vectorizing" as follows:
#pragma OPENCL EXTENSION cl_khr_fp64 : enable
typedef double Real;
typedef double2 RealN;
#define VLOADN vload2
#define VSTOREN vstore2
//! Add one element as a vector per work-item
/*!
\param result vector the result is stored to
\param left first input vector
\param right second input vector
*/
__kernel void AddOneVector(
    __global Real* result,
    __global const Real* left,
    __global const Real* right)
{
    // get element index
    int i = get_global_id(0);
    // load as vectors
    RealN leftVector = VLOADN(i, left);
    RealN rightVector = VLOADN(i, right);
    // add as vectors
    RealN resultVector = leftVector + rightVector;
    // store the result
    VSTOREN(resultVector, i, result);
}
Is this what you meant? It is not much faster; the processing time is almost the same.
No. vload/vstore perform sequential scalar loads; what you want are true vector loads.
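For reference, one way to get true vector loads is to declare the pointers with the vector type instead of calling vload/vstore. This is a sketch of that idea, not necessarily the exact kernel Micah had in mind; it assumes double2 and the cl_khr_fp64 extension, and the global work size must be halved since each work-item now handles two elements:

```
#pragma OPENCL EXTENSION cl_khr_fp64 : enable

typedef double2 RealN;

// Each work-item handles one double2; indexing a __global double2*
// lets the compiler emit a single 128-bit vector load/store instead
// of two sequential scalar loads.
__kernel void AddOneVector(
    __global RealN* result,
    __global const RealN* left,
    __global const RealN* right)
{
    int i = get_global_id(0);
    result[i] = left[i] + right[i];
}
```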
Hi Micah,
I have a follow-up question to your answer. I ran the kernel you posted above through the APP KernelAnalyzer, varying RealN from float through float16. I reasoned that the performance gain (threads/sec * vec_step(RealN)) from float to float2 to float4 may be attributed to the hardware being VLIW4/5, since the increase is nearly linear. However, once I move on to float8 and float16, performance becomes roughly constant, though on some GPUs slightly better than float4.
What can explain this? Since each vector type can be viewed as contiguous scalars (although optimized for vector loads/stores), increasing the vector length is somewhat analogous to increasing the stride between work-items: float16 is unrolled into 4x float4 loads in the IL/ASM, and similarly for float8. However, the APP Programming Guide shows an example of coalesced memory access in which increasing the stride between work-items' elements decreases performance. Could you help me reconcile these results?
Thanks!
This can be attributed to bank conflicts on load/store. Our hardware is optimized for sequential work-items loading either 64 or 128 bytes of data, depending on the location. Once you get to the larger vector types, you are doing strided loads, which cause bank conflicts within a wavefront, and performance degrades.
Hi Micah,
Your explanation matches my understanding, but the APP KernelAnalyzer's results seem to contradict it, since float8 and float16 outperform float4 and float2. I believe float4 corresponds to sequential work-items loading 128 bytes, and float2 to 64 bytes. So I am left wondering whether something is wrong with the KernelAnalyzer, or whether there are other unaccounted-for factors in play.