
spectral
Adept II

OpenCL & SIMT

Hi,

I'm currently working on an OpenCL parallel primitives library, so I'm currently reading a lot of literature that comes from NVidia.

A lot of their optimizations come from the fact that they use the warp concept (32 threads executing as SIMT).

So, I would like to know if there is some equivalent in ATI hardware and whether I can benefit from it.

Also, someone told me that there is no "shared" memory on ATI hardware, and that "__local" memory is emulated with global memory. Is that right?

So, if that is true... is there some way to optimize my code to avoid "global memory" accesses?

 

Thanks for your help

0 Likes
13 Replies
maximmoroz
Journeyman III

Please, read AMD Accelerated Parallel Processing OpenCL Programming Guide. It answers all your questions.

0 Likes
davibu
Journeyman III

Originally posted by: viewon01 Hi,

Also, someone told me that there is no "shared" memory on ATI hardware, and that "__local" memory is emulated with global memory. Is that right?

This was true only for the HD 4xxx GPU family; all newer AMD GPUs have "real" local memory.


0 Likes

Thanks David, thanks Maxim,

About wavefronts: I know that a wavefront contains 16, 32, or 64 work-items. But I don't want to hard-code this, so I'm searching for a way to retrieve the SIMT width of a device, if that is possible! (I'm not sure it is possible, and of course it is different from the work-group size, etc...)

0 Likes

Hi viewon01,

AMD has the concept of a wavefront, whose size is fixed at 64 (it is 32 for one device, I guess Cedar, and 16 for none AFAIK). Refer to Chapter 1 of the AMD OpenCL Programming Guide.

 

0 Likes

Originally posted by: viewon01 Thanks David, thanks Maxim,

About wavefronts: I know that a wavefront contains 16, 32, or 64 work-items. But I don't want to hard-code this, so I'm searching for a way to retrieve the SIMT width of a device, if that is possible! (I'm not sure it is possible, and of course it is different from the work-group size, etc...)

It is rather easy to do in OpenCL 1.1 (the AMD and Intel implementations support 1.1, but NVidia's release drivers still only support 1.0, so I hard-code 32 whenever the platform only supports 1.0):

Use the clGetKernelWorkGroupInfo function with the CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE parameter.

I determine this value even before building the actual kernels, by building a simple one looking like:

__kernel void testWavefrontSize(const __global int * a, __global int * b) { *b = *a; }
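
A minimal host-side sketch of that query, assuming a cl_program built from the trivial kernel above and an existing cl_device_id (the helper name is hypothetical, error checking omitted):

#include <CL/cl.h>

/* Query the preferred work-group size multiple, which on current GPUs
   corresponds to the wavefront/warp size (64 on most AMD GPUs, 32 on NVidia). */
static size_t queryWavefrontSize(cl_program program, cl_device_id device)
{
    cl_int err;
    cl_kernel kernel = clCreateKernel(program, "testWavefrontSize", &err);

    size_t multiple = 32; /* fallback if the platform only supports OpenCL 1.0 */
    if (err == CL_SUCCESS) {
        clGetKernelWorkGroupInfo(kernel, device,
                                 CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE,
                                 sizeof(multiple), &multiple, NULL);
        clReleaseKernel(kernel);
    }
    return multiple;
}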

0 Likes

Thanks Maxim,

But CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE returns 512 for my NVidia card! I need 32! I don't need a multiple but the exact SIMT width, because I use it to avoid barrier synchronization in some cases!

0 Likes

What OpenCL version does the NVidia platform support? And did you get this parameter value for the kernel I specified (the simple one)?

0 Likes

My suggestion would be to use that test as people suggest, but rather than dropping barriers yourself, set your work-group size to that value. Unless you have a very good reason for using a bigger work-group, that's the best thing to do anyway, and it means the compiler will remove the barriers for you.
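
A small sketch of that suggestion (hypothetical names), assuming "queue", "kernel", "numElements" and the "wavefront" value queried above, with numElements assumed to be a multiple of the wavefront size:

size_t localSize  = wavefront;      /* e.g. 64 on AMD, 32 on NVidia */
size_t globalSize = numElements;    /* assumed to be a multiple of localSize */
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &globalSize, &localSize,
                       0, NULL, NULL);

With the work-group no larger than one wavefront, the compiler can turn intra-group barriers into no-ops while the source stays portable.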

You'll have to write custom versions of those kernels for AMD, NVIDIA and so on anyway, because all of your memory accesses will change given the vector size. You can't write an efficient vector reduction without knowing the vector size, for example.

If you drop the barriers completely you're no longer portable (which might be OK, given that you'll presumably have a standard fallback version) and you're at risk of compiler changes hurting you. At the very least, make sure you put fences everywhere you might have had a barrier; otherwise the compiler will optimise the code by moving reads and writes around, not realising that they ever have to touch memory.
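
A hedged OpenCL C sketch of that fence advice, applied to a wavefront-synchronous reduction (assumptions: WAVEFRONT_SIZE is passed as a -D build option, the work-group size equals WAVEFRONT_SIZE, and the names are hypothetical):

__kernel void wavefrontReduceSum(__global const int *in,
                                 __global int *out,
                                 __local volatile int *scratch)
{
    const uint lid = get_local_id(0);

    scratch[lid] = in[get_global_id(0)];
    mem_fence(CLK_LOCAL_MEM_FENCE); /* keep the compiler from reordering local accesses */

    /* No barrier: this relies on the whole work-group being a single wavefront
       executing in lockstep, which is exactly the non-portable assumption above. */
    for (uint s = WAVEFRONT_SIZE / 2; s > 0; s >>= 1) {
        if (lid < s)
            scratch[lid] += scratch[lid + s];
        mem_fence(CLK_LOCAL_MEM_FENCE);
    }

    if (lid == 0)
        out[get_group_id(0)] = scratch[0];
}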

0 Likes

Thanks Lee,

That is the way I will go.

So, do you think that AMD could support us in developing the clPP project? I have seen that you already provide some libraries, but no parallel primitives!

Parallel primitives are a building block of a lot of applications, and getting them right takes time (and skill).

For example, I have read Duane Merrill's paper and code about radix sort and it gave me a headache... for sure it will not be an easy task to port it to OpenCL correctly, but I think it is important if we want more people to use OpenCL (and not CUDA).

What do you think? Are you aware of this problem at AMD?

Regards, and again, thanks for your help

0 Likes

My opinion of a bunch of the parallel primitives is that they're fairly useless for adapting to a specific application. For example, I believe radix sort requires you to pass it a single large array. I can't use this (efficiently) to, say, sort the rows or columns of a matrix, as I would have to make many kernel calls. What we need are primitives that are effectively "make the current thread block do x."

Under this model, each thread block can sort a row or column of a matrix, so you can do them all in a single kernel call.
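
A small sketch of such a group-level primitive (hypothetical names, not the clUtil API): each work-group reduces one row of a row-major matrix, so every row is handled in a single kernel launch. It assumes the work-group size is a power of two:

__kernel void sumRows(__global const float *matrix,
                      __global float *rowSums,
                      const uint cols,
                      __local float *scratch)
{
    const uint row = get_group_id(0);   /* one work-group per row */
    const uint lid = get_local_id(0);

    /* Each work-item accumulates a strided slice of its row. */
    float acc = 0.0f;
    for (uint c = lid; c < cols; c += get_local_size(0))
        acc += matrix[row * cols + c];
    scratch[lid] = acc;
    barrier(CLK_LOCAL_MEM_FENCE);

    /* Standard tree reduction within the work-group. */
    for (uint s = get_local_size(0) / 2; s > 0; s >>= 1) {
        if (lid < s)
            scratch[lid] += scratch[lid + s];
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    if (lid == 0)
        rowSums[row] = scratch[0];
}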

I've already added a few of these primitives to my clUtil library (code.google.com/p/clutil). Currently, we have functions that make the current thread block compute the max of an array and the sum, and I have two terrible implementations of radix sort (greatest to least and least to greatest). Instead of creating a separate project, I think it would be awesome to collaborate and add more of these primitives to clUtil.

0 Likes

I wrote a radix sort, and a colleague has been working to improve it based on that paper. It's pretty good now though we still want to fix a couple of things.

My aim is to make it public; I don't know quite how much procedure we'll have to go through to achieve that, though. If it goes out on the web site, I'm sure the licence would allow you to include it in your primitives library.

I do think large-scale primitives like that make sense, but having smaller group-level primitives would be helpful too. The real problem is that it's not feasible to make such primitives cross-platform so you're going to end up with a pile of platform-specific optimisations.

0 Likes

I think it's useful just to get them working in the immediate future. If you really need an optimized implementation, you can write your own. For the applications I've written, these primitive routines aren't really the bulk of the computation, so it's okay if they run a little slower.

Also, I think library developers should focus on making them clearly documented, working in all edge cases, and with few assumptions about work-group dimensionality. While getting things working for all arrays that are multiples of 64 is nice for publication, when you want to do meaningful work you need them to work all the time.

0 Likes

I agree with Lee... that is why this project may be a little bit more complex than the Thrust library (where they already tune some things for specific hardware).

Anyway, I agree with Rick too... everything must be well documented and work in all cases.

For now we are mainly in a "start" phase, so we have several algorithms, but once we have finished our "test" phase we will be able to provide a "generic" version of each algorithm. Once that is done, we will work on optimized versions for each case.

Regards

0 Likes