Hello everybody,
I'm new to OpenCL. I tried to illustrate the power of float16, but I failed.
I built a program which adds two arrays of 1024*1024*16 floats. On the GPU, when I run with float16, the computation time is 0.03 seconds. On the GPU, when I run with 16 * float, the computation time is 0.006 seconds. And on the CPU, the computation time is 2 seconds. But why is it slower with float16 than with 16 * float?
Thanks for your help.
Part of my code:
File Main.cpp:
// Define an index space (global work size) of threads for execution.
// A workgroup size (local work size) is not required, but can be used.
size_t globalWorkSize[1];
size_t localWorkSize[1];
// Each work-item processes one float16, so nbKernel/16 work-items are launched
globalWorkSize[0] = nbKernel/16;
localWorkSize[0] = 512;
// Execute the kernel.
// 'globalWorkSize' is the 1D dimension of the work-items
status = clEnqueueNDRangeKernel(cmdQueue, kernel, 1, NULL, globalWorkSize,
localWorkSize, 0, NULL, NULL);
clFinish(cmdQueue);
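One thing worth checking is how the two times are measured: host-side wall-clock timing around the enqueue call can be dominated by queueing and launch overhead. A minimal sketch of timing the kernel with OpenCL profiling events instead (this assumes cmdQueue was created with the CL_QUEUE_PROFILING_ENABLE property; variable names follow the snippet above):

```c
// Sketch: measure kernel execution time with OpenCL profiling events.
// Assumes cmdQueue was created with CL_QUEUE_PROFILING_ENABLE.
cl_event evt;
cl_ulong tStart, tEnd;

status = clEnqueueNDRangeKernel(cmdQueue, kernel, 1, NULL, globalWorkSize,
                                localWorkSize, 0, NULL, &evt);
clWaitForEvents(1, &evt);

clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START,
                        sizeof(cl_ulong), &tStart, NULL);
clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END,
                        sizeof(cl_ulong), &tEnd, NULL);

// Profiling timestamps are reported in nanoseconds.
printf("Kernel time: %f s\n", (tEnd - tStart) * 1e-9);
clReleaseEvent(evt);
```

Comparing the float16 and scalar versions with event timing removes most of the host-side noise from the comparison.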
The .cl file:
__kernel void vecadd(__global float16 const * const A,
                     __global float16 const * const B,
                     __global float16 * const C)
{
    unsigned int const i = get_global_id(0);
    C[i] = A[i] + B[i];
}
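For reference, here is the scalar variant I assume is meant by "16 * float" (it is not shown in the post, so the kernel name is my assumption); it would be launched with globalWorkSize[0] = nbKernel so both versions process the same total data:

```c
// Scalar counterpart: one float per work-item (assumed, not from the post).
__kernel void vecadd_scalar(__global float const * const A,
                            __global float const * const B,
                            __global float * const C)
{
    unsigned int const i = get_global_id(0);
    C[i] = A[i] + B[i];
}
```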
I gave you a long list of possible reasons when you asked this same question on the Khronos forums.
I have already posted this question on the Khronos forums:
http://www.khronos.org/message_boards/viewtopic.php?f=28&p=12518#p12518
But I posted the question on several forums to get many points of view.
Actually, I'm not convinced by all the responses.
Thanks for your help
Thanks for your answer.
So, the same problem arises with float2 or float4.
So, should I understand that we should not use floatn if we want the fastest application?
If that is not the case, can you give me a simple program which illustrates that float2 or float4 is quicker than float?
Thanks for your help.
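A pair of kernels that often does show a benefit from vector types is one where each work-item issues a single wide load instead of several scalar ones. A minimal sketch (the kernel names and the nbKernel/4 launch size are my assumptions, not from any SDK sample):

```c
// Scalar version: one float per work-item.
// Launch with a global size of nbKernel.
__kernel void add_scalar(__global const float *A,
                         __global const float *B,
                         __global float *C)
{
    size_t i = get_global_id(0);
    C[i] = A[i] + B[i];
}

// Vector version: one float4 per work-item.
// Launch with a global size of nbKernel/4.
__kernel void add_float4(__global const float4 *A,
                         __global const float4 *B,
                         __global float4 *C)
{
    size_t i = get_global_id(0);
    C[i] = A[i] + B[i];
}
```

Whether the float4 version actually wins depends on the hardware: NVIDIA GPUs have a scalar architecture, so the arithmetic is scalarized either way, and any gain mainly comes from wider memory transactions.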
Xavier
I'm sorry, but I failed to find a vector type in the SDK samples:
http://developer.nvidia.com/opencl-sdk-code-samples
I tried to look at :
OpenCL Vector Addition
Element-by-element addition of two 1-dimensional arrays. Implemented in OpenCL for CUDA GPUs, with functional comparison against a simple C++ host CPU implementation.
But they have not posted anything about vector types like float, int, or others.
Can you give me the URL of your SDK sample?
Thanks for your help.
Xavier