
xfaure
Journeyman III

[XF] float16 vs 16 floats

The benefit of float16

Hello everybody,

I'm new to OpenCL. I'm trying to illustrate the power of float16, but so far I have failed.
I built a program which adds two arrays of 1024*1024*16 floats. On the GPU, when I run it with float16, the computation time is 0.03 seconds. On the GPU, when I run it with 16 * float, the computation time is 0.006 seconds. And on the CPU, the computation time is 2 seconds. Why is it slower with float16 than with 16 * float?

Thanks for your help.

Part of my code:


File Main.cpp:

// Define an index space (global work size) of threads for execution.
// A workgroup size (local work size) is not required, but can be used.
size_t globalWorkSize[1];
size_t localWorkSize[1];
// There are nbKernel elements; each work-item processes one float16 (16 floats)
globalWorkSize[0] = nbKernel/16;
localWorkSize[0] = 512;

// Execute the kernel.
// 'globalWorkSize' is the 1D dimension of the work-items
status = clEnqueueNDRangeKernel(cmdQueue, kernel, 1, NULL, globalWorkSize,
localWorkSize, 0, NULL, NULL);

clFinish(cmdQueue);
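For timing, here is a minimal sketch using OpenCL event profiling instead of a host timer (this assumes cmdQueue was created with the CL_QUEUE_PROFILING_ENABLE property; it continues the snippet above and needs <cstdio> for printf):

// Sketch: time the kernel with an OpenCL profiling event instead of a host timer.
cl_event evt;
status = clEnqueueNDRangeKernel(cmdQueue, kernel, 1, NULL, globalWorkSize,
                                localWorkSize, 0, NULL, &evt);
clWaitForEvents(1, &evt);

cl_ulong t0 = 0, t1 = 0;
clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START, sizeof(t0), &t0, NULL);
clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END, sizeof(t1), &t1, NULL);
printf("Kernel time: %f ms\n", (t1 - t0) * 1.0e-6);
clReleaseEvent(evt);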

The .cl file:

__kernel void vecadd(__global float16 const * const A, __global float16 const * const B, __global float16 * const C)
{
    unsigned int const i = get_global_id(0);

    C[i] = A[i] + B[i];
}
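For comparison, a minimal sketch of the 16 * float case (assuming a plain scalar kernel, launched with globalWorkSize[0] = nbKernel instead of nbKernel/16):

// Sketch (assumption): scalar counterpart used for the 16 * float timing,
// one float added per work-item.
__kernel void vecadd_scalar(__global float const * const A,
                            __global float const * const B,
                            __global float * const C)
{
    unsigned int const i = get_global_id(0);
    C[i] = A[i] + B[i];
}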


Thanks

Xavier Faure

0 Likes
7 Replies
notzed
Challenger

I gave you a long list of possible reasons when you asked this same question on the Khronos forums.

 

0 Likes

I have already posted this question on the Khronos forums:

http://www.khronos.org/message_boards/viewtopic.php?f=28&p=12518#p12518

But I posted it on several forums to gather several points of view.

Actually, I'm not convinced by all the responses.

Thanks for your help

0 Likes

One reason is that float16 is not a very efficient data type, because it causes memory conflicts on writes. Each thread's stores are strided by 64 bytes, which means 16-byte stores from two threads cannot be merged together.
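As an illustration of that stride argument (a sketch, not code from this thread): with float4, each work-item still does vector arithmetic, but its 16-byte store lands right next to its neighbour's, so the writes can still be merged.

// Sketch: float4 variant of the addition kernel.
// Work-item i writes bytes [16*i, 16*i + 15], so consecutive work-items
// write adjacent memory and their stores can coalesce.
// Launch with globalWorkSize[0] = nbKernel / 4.
__kernel void vecadd4(__global float4 const * const A,
                      __global float4 const * const B,
                      __global float4 * const C)
{
    unsigned int const i = get_global_id(0);
    C[i] = A[i] + B[i];
}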
0 Likes

Thanks for your answer.

So, does the same problem arise with float2 or float4?

So, do I understand correctly that we should not use floatn if we want the fastest application?

If that is not the case, can you give me a simple program which illustrates that float2 or float4 is quicker than float?

Thanks for your help.

Xavier

0 Likes

Look at the samples in the SDK; in many cases they are written with the optimal vector length in mind.
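As a side note, a minimal sketch (assuming device is a cl_device_id obtained earlier with clGetDeviceIDs): the vector width a device prefers can be queried directly, which gives a hint of what vector length to aim for.

// Sketch: query the device's preferred vector width for float.
cl_uint widthFloat = 0;
clGetDeviceInfo(device, CL_DEVICE_PREFERRED_VECTOR_WIDTH_FLOAT,
                sizeof(widthFloat), &widthFloat, NULL);
printf("Preferred float vector width: %u\n", widthFloat);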
0 Likes

I'm sorry, but I failed to find a vector type in the SDK samples:

http://developer.nvidia.com/opencl-sdk-code-samples

I tried to look at:

OpenCL Vector Addition

Element-by-element addition of two 1-dimensional arrays. Implemented in OpenCL for CUDA GPUs, with a functional comparison against a simple C++ host CPU implementation.

 

But they have not posted anything about vector types (float, int, or others).

Can you give me the URL of your SDK sample?

Thanks for your help.

 

Xavier

0 Likes

xfaure,
Those are NVIDIA's samples. This is the AMD developer forum, so you should look at the AMD samples that come when you install the SDK.
0 Likes