Archives Discussions

ascho · ‎06-06-2011

I'm using OpenMP to parallelize my program. Now I want to exploit the vector features of my CPU (SSE instructions) in certain spots within the OpenMP parallel region. I know I can do this using intrinsics, but I want to keep my code as portable as possible. So my idea is to use OpenCL to vectorize the code. Of course, OpenCL should not create additional threads within the OpenMP parallel region. Is that possible?
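For reference, the non-portable baseline described above might look roughly like the sketch below. The names `axpy_chunk` and `axpy_rows` are made up for illustration and are not from the actual program. OpenMP distributes rows across threads, and each thread uses SSE intrinsics on its own data where they are available. Vector instructions execute inside the calling thread, so this part creates no additional threads.

```cpp
#include <cstddef>
#include <vector>
#if defined(__SSE__)
#include <xmmintrin.h>  // SSE intrinsics, x86 only
#endif

// Hypothetical helper: y[i] += a * x[i], four floats per step where SSE exists.
// Runs entirely in the calling thread.
inline void axpy_chunk(float a, const float* x, float* y, std::size_t n) {
    std::size_t i = 0;
#if defined(__SSE__)
    const __m128 va = _mm_set1_ps(a);           // broadcast a to all four lanes
    for (; i + 4 <= n; i += 4) {
        __m128 vx = _mm_loadu_ps(x + i);        // unaligned loads
        __m128 vy = _mm_loadu_ps(y + i);
        _mm_storeu_ps(y + i, _mm_add_ps(vy, _mm_mul_ps(va, vx)));
    }
#endif
    for (; i < n; ++i) y[i] += a * x[i];        // scalar tail and non-SSE fallback
}

// Hypothetical driver: OpenMP hands rows to threads, each row is vectorized.
// Assumes xs and ys have the same shape.
inline void axpy_rows(float a,
                      const std::vector<std::vector<float>>& xs,
                      std::vector<std::vector<float>>& ys) {
    const long rows = static_cast<long>(xs.size());
    #pragma omp parallel for
    for (long r = 0; r < rows; ++r)
        axpy_chunk(a, xs[r].data(), ys[r].data(), xs[r].size());
}
```

The `#if defined(__SSE__)` block is exactly the part that would have to be rewritten for AltiVec or any other instruction set, which is the portability problem in question.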

nou · ‎06-06-2011

IMHO you should utilize OpenCL to parallelize across multiple cores.

laobrasuca · ‎06-06-2011

I believe the compiler uses SSE instructions when building code with the OpenCL built-in vector types (int2, int4, ...), at least for cases like

int4 a, b, c;

a = (int4)(1);

b = (int4)(2);

c = a + b;

Can anyone confirm this?
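This does not confirm what any particular OpenCL compiler does, but the semantics of the snippet can be mimicked in host C++ with the GCC/Clang vector extension. That extension is compiler-specific, not standard C++ and not OpenCL, and the helper names `splat` and `add4` are made up here. On x86-64 these compilers typically lower the addition to a single packed SSE2 add, though that is worth checking in the generated assembly.

```cpp
// Compiler-specific GCC/Clang vector extension: four 32-bit ints in 16 bytes.
// Illustration only; this is not the OpenCL int4 type.
typedef int int4 __attribute__((vector_size(16)));

// Hypothetical helper: same effect as OpenCL's (int4)(v), which copies v to all lanes.
inline int4 splat(int v) {
    int4 r = {v, v, v, v};
    return r;
}

// Element-wise add, one lane per component, like c = a + b in the kernel snippet.
inline int4 add4(int4 a, int4 b) {
    return a + b;
}
```

Calling `add4(splat(1), splat(2))` yields a vector whose four lanes all hold 3.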

And the Bulldozer architecture seems to fit the OpenCL programming model quite nicely.

LeeHowes · ‎06-06-2011

I'm not sure this makes much sense. All you really want is the vector types, yes? If that's the case, just create a set of C++ classes and wrap the intrinsics in them.
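A minimal sketch of such a wrapper, assuming a made-up class name `Float4` and only two backends (SSE and plain scalar), could look like this. It is an illustration of the idea rather than a complete or tuned library. An AltiVec or other backend would be added as a further `#elif` branch behind the same interface.

```cpp
#if defined(__SSE__)
#include <xmmintrin.h>  // SSE intrinsics, x86 only
#endif

// Hypothetical wrapper: four packed floats behind one interface.
class Float4 {
public:
#if defined(__SSE__)
    explicit Float4(float s) : v_(_mm_set1_ps(s)) {}   // broadcast s to all lanes
    static Float4 load(const float* p) { return Float4(_mm_loadu_ps(p)); }
    void store(float* p) const { _mm_storeu_ps(p, v_); }
    friend Float4 operator+(Float4 a, Float4 b) { return Float4(_mm_add_ps(a.v_, b.v_)); }
    friend Float4 operator*(Float4 a, Float4 b) { return Float4(_mm_mul_ps(a.v_, b.v_)); }
private:
    explicit Float4(__m128 v) : v_(v) {}
    __m128 v_;
#else
    // Scalar fallback with the same interface for platforms without SSE.
    explicit Float4(float s) { for (int i = 0; i < 4; ++i) v_[i] = s; }
    static Float4 load(const float* p) {
        Float4 r(0.0f);
        for (int i = 0; i < 4; ++i) r.v_[i] = p[i];
        return r;
    }
    void store(float* p) const { for (int i = 0; i < 4; ++i) p[i] = v_[i]; }
    friend Float4 operator+(Float4 a, Float4 b) {
        for (int i = 0; i < 4; ++i) a.v_[i] += b.v_[i];
        return a;
    }
    friend Float4 operator*(Float4 a, Float4 b) {
        for (int i = 0; i < 4; ++i) a.v_[i] *= b.v_[i];
        return a;
    }
private:
    float v_[4];
#endif
};
```

Calling code then writes something like `(Float4::load(in) * Float4(2.0f) + Float4(1.0f)).store(out);` and never touches an intrinsic directly, so only the class has to be ported.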

If you want to parallelise using OpenCL then do that by all means, but I wouldn't try to mix it within OpenMP - they both target the same problem, so you're only likely to cause confusion and a loss of efficiency.

ascho · ‎06-06-2011

Originally posted by: LeeHowes I'm not sure this makes much sense. All you really want is the vector types, yes? If that's the case, just create a set of C++ classes and wrap the intrinsics in them.

This is not cross platform. What about PowerPC "Altivec" instructions or other processors like Cell? OpenCL takes care of that which is really fine.

If you want to parallelise using OpenCL then do that by all means, but I wouldn't try to mix it within OpenMP - they both target the same problem, so you're only likely to cause confusion and a loss of efficiency.

My idea was to use OpenMP for the CPU - as it is now - and within each of the CPU threads execute OpenCL kernels in order to utilize the CPU's vectorization features or, if a GPU is available, execute the kernels there. Is my thinking too naive? That would be bad news.

LeeHowes · ‎06-06-2011

Yes, I know what you want to do. I just think that it's risky. You're interfacing multiple OpenMP threads with a single OpenCL runtime, when letting OpenCL dispatch a thread per core itself would involve less overhead. You may be able to split the OpenCL device into sub-devices, create a queue for each, and then use one queue in each OpenMP thread - I don't know how the synchronization would work within the runtime, but it might be efficient enough.

In the end, though, I don't know how much you really gain in terms of cross-platform support. Do you expect that all devices you care about in cross-platform terms will have an OpenCL runtime?

Enabling GPU support makes more sense, but given your description of not being able to move the entire algorithm into OpenCL, it sounds like the overhead of getting the smaller units of work to the GPU would likely ruin performance. The GPU isn't always faster than the CPU, after all, and is rarely more than 5x faster.

It might be worth a try. I bet someone else has written a cross-platform vector class library, though, that would be better suited to what you want to do, barring the GPU offload. There's one in the Bullet physics SDK that might be a good fit (though I forget the terms of the Bullet licence).

ascho · ‎06-07-2011

Thank you for the tip about the Bullet physics SDK! I will try the vectormath library and see how well it works. Definitely easier than struggling with OpenCL.

From your words I gather that OpenCL allows work to be sent to a GPU from within an OpenMP parallel region - although it is possibly inefficient. So I can try that later on.

That's good news.

Thanks to all for the valuable comments.

Marix · ‎06-07-2011

Originally posted by: LeeHowes I'm not sure this makes much sense. All you really want is the vector types, yes? If that's the case, just create a set of C++ classes and wrap the intrinsics in them.

There is not even a need to create such a wrapper yourself. There is a library called Vc providing this functionality, even with compatibility across different sets of vector instructions.

ascho · ‎06-06-2011

Originally posted by: nou IMHO you should utilize OpenCL to parallelize across multiple cores.

I fear that is not possible. The OpenMP parallel region extends over a chain of algorithms, all written in C++ using the STL, templates and so on. Rewriting this in OpenCL is not an option.

Utilize OpenCL for Vectorization