Archives Discussions

omgi · ‎06-28-2011

I am currently in a situation where I practically have to choose between parallelizing more, or using local memory. I am currently passing in a couple of vectors that are several orders larger than the optimal work group size. I want to perform only one set of calculations per element in each vector, move on to the next and then repeat many times (so many times that the kernel overhead would be a problem if re-running between each iteration). The problem with using several work groups is that I have much overlap, and would need to synchronize between them. The optimal would be to have everything in the same work group, but it is more than fits in local.

1. Is it even worth it to store to local before using, if I only use each element a few times?

2. Is it worth it to have "full" parallelization if I have to work with global memory all the time, VS very small parallelization but being able to use local memory?

3. Is it possible to have data stored in local memory, that will "live on" there between kernel re-runs, or is the local memory tied to each run of the kernel?

maximmoroz · ‎06-28-2011

3. No, it is not possible. Local memory is "tied" to the single group within single kernel run.

notzed · ‎06-28-2011

"I am currently in a situation where I practically have to choose between parallelizing more, or using local memory."

Or both?

1. Depends entirely on the access pattern from memory and the algorithm. So, maybe.

2. Same.

3. If that's all you need, there isn't that much overhead to just load a work-group sized chunk of data to work on at each invocation, assuming you can coalesce the read or at least do it in one unconditional step.

Usual solution until you have lots of experience: try them all and see which works faster. Then keep trying alternatives until you get something fast enough or run out of time ...

You can't synchronise between workgroups apart from using multiple kernel calls. See opencl 1.1 spec, section 3.3.1 Memory Consistency, and 3.4.3 Synchronisation - it looks like you could do with reading more up on the programming model, the spec is very good as is the amd doco. The nvidia stuff not so much - apart from using cuda terminology throughout it just isn't that great overall.

From your other post "Help with effective parallelization/synchronization" it looks like each 'vector N' result is dependent on the result of the previous one. You can't get any parallelisation out of that inner loop without changing the algorithm - assuming the pseudocode there is correct and complete in it's list of memory accesses and ordering. Since you are only using associative operations however, this should be possible by re-arranging terms (e.g. search parallel prefix sum).

But that is only one possible solution: sometimes the overheads of complex addressing schemes or synchronisation is outweighed by simpler solutions that run faster or can utilise more parallelism (e.g. the more regs/LS you use, the fewer workgroups that can run concurrently).

omgi · ‎06-28-2011

Thank you notzed, and thank you maximmoroz! I am aware that my questions might be a bit vague and naïve, given that I provide little information about my situation. This is because I have a hard time formulating the situation without over-complicating it.

Originally posted by: notzed "I am currently in a situation where I practically have to choose between parallelizing more, or using local memory."
Or both?
1. Depends entirely on the access pattern from memory and the algorithm. So, maybe.
2. Same.

This will still lead to the problem that it might not be worth it to save to local memory. When analyzing it closer though, it might still be useful. The algorithm applied on an element in one of the vectors pretty much looks like:
1. Read the value of the element into a variable called "oldX". (+1 Read/mov)
2. Add a random float to this variable and save to a new variable called "newX". (+1 Flop)
3. Access points to the left and right and store to variables called "left" and "right". (+ 2 Reads)
4. Take the difference between the points: d1 = (left - newX), d2 = (newX - right), d3 = (left - oldX), d4 = (oldX - right), and then do 0.5*A*d^2 (+16 Flops)
5. Add/subtract the results to a variable (+4 Flops)
6. Access same index in all other vectors, there are D of them (+(D-1) Reads)
7. Another calculation between these points (+4*(D-1) Flops)
8. Add result of (5.) with (7.) (+1 Flop)
9. Replace oldX with newX in the vector if an if-statement is true (+1 Write)

So of D = 6, I have (depending on if I'm analyzing this correctly):
8 Reads, whereas
32 Flops
1 Write

Maybe it is easier to answer now if it is worth the time to save these values to local? Depending on the situation, I might only be able to save some of them to local too.

Originally posted by: notzed
3. If that's all you need, there isn't that much overhead to just load a work-group sized chunk of data to work on at each invocation, assuming you can coalesce the read or at least do it in one unconditional step.

The overhead is typically microseconds. But I want to do iterations more than 10^6 times, maybe up to 10^9. That would lead to huge extra time.

Originally posted by: notzed
Usual solution until you have lots of experience: try them all and see which works faster. Then keep trying alternatives until you get something fast enough or run out of time ...

Yes that is a very good idea. I have derived 4 models that parallelizes my problem differently. I dont know if there is time to try all of them out though as this project is soon over. This is why I'm writing this vague topic, in order to "fish" for ideas.

himanshu_gautam · ‎06-28-2011

Nice to hear your thoughts.

Although I think it should generally be better to use Global memory and run the kernels throughout the device, or other devices which might be available.

You would be able to scale your performance across multigpu systems this way.And if the data input is read only, you can look for caching also.

Also try to find some data which can be put in LDS.

notzed · ‎06-28-2011

Probably not much point adding up flops and whatnot since they are fixed by the maths. And what on earth is himanshu talking about, just advertising spam?

"The overhead is typically microseconds. But I want to do iterations more than 10^6 times, maybe up to 10^9. That would lead to huge extra time."

No, it's faster, you're talking about a single (per thread) or small memory read from the device, not across pci etc. And if it's unavoidable it's unavoidable anyway.

I mentioned this before but i'll ask again more directly since it isn't entirely clear: for a given iteration of the loop 1-9, is 'oldX' equal to the result generated in step 9 for the previous generation?

If yes then you have dependencies that make it much more difficult to achieve high levels of parallelism which starts to get beyond describing in a forum like this.

If there are such dependencies I would probably parallelise on DxE instead, and loop on N. Then use a small amount of local memory (the size of the workgroup) to cache the 'current index' value (or maybe the next), and just do everything else in registers. i.e. 'left' is just the result of the last iteration - already in a register.

Then all you have to decide is how big the local workgroup size is - it just has to be a multiple of D at least (so each workgroup can communicate the common D row inputs). The code wont need any real changes apart from some barriers for the local array use.

This should then scale fairly well: the total time of any thread is bounded by N (which although high isn't that high), very little local memory so many threads can run on a given device, and as D*E grows more devices can be utilitised concurrently (device == processing device, not card).

omgi · ‎07-07-2011

This reply is kind of late, but I have been trying some different models and came to the conclusion that the overhead is a very big concern, as I feared. Re-running the kernel is really not an option as it takes a lot of more time.

Originally posted by: notzed
"The overhead is typically microseconds. But I want to do iterations more than 10^6 times, maybe up to 10^9. That would lead to huge extra time."
No, it's faster, you're talking about a single (per thread) or small memory read from the device, not across pci etc. And if it's unavoidable it's unavoidable anyway.

Hm everyone I've talked to says that it is microseconds. Have I in some way misunderstood what you are talking about?

Originally posted by: notzed
I mentioned this before but i'll ask again more directly since it isn't entirely clear: for a given iteration of the loop 1-9, is 'oldX' equal to the result generated in step 9 for the previous generation?
If yes then you have dependencies that make it much more difficult to achieve high levels of parallelism which starts to get beyond describing in a forum like this.

If there are such dependencies I would probably parallelise on DxE instead, and loop on N. Then use a small amount of local memory (the size of the workgroup) to cache the 'current index' value (or maybe the next), and just do everything else in registers. i.e. 'left' is just the result of the last iteration - already in a register.

Yes there is such a dependency. I can not however parallelize over D, as at the highest dependency lies over D. Instead I have parallelized over N which also gives higher parallelization. The thing is that I would really like to be able to synchronize between workgroups withouth having to re-run the kernel. That would give me the highest possible performance and parallelization, as well as making the performance scale very well with N and D.

Jawed · ‎07-09-2011

One option is to make a single work item handle multiple elements. To do this break up a vector into "tiles".

The GPU has 256KB of memory per SIMD available as registers, but only 32KB per SIMD for local memory. Make sure you are maximising your use of registers by reducing the count of wavefronts per SIMD (and the workgroup size) and putting lots of elements into registers.

This way you will avoid memory access overheads. You will increase the number of FLOPs per read and write. This is crucial. If you get the balance right you can make it so that memory accesses are free even if they are accesses to global memory.

The tiles need to overlap because of the "neighbour-fetching" that you do.

You really should post your kernel to get concrete help.

omgi · ‎07-11-2011

Jawed: So if I use vectors instead, I could basically fetch various elements at once with only one "global memory read overhead"?

Jawed · ‎07-11-2011

Yes. Though the type of memory you use (buffer or image) might still have an effect on performance due to caching (or lack of caching).

If you were to use a vector (it could be a vector of 8 or 16, since OpenCL supports those sizes too), you must be careful about the overlap between tiles.

So if the width of a tile corresponds with a vector (let's say vector width 😎 then the tile can compute 7 results. The tile to the right of this tile then computes the next 7 results, and overlaps the first tile by 1 column.

The choice of memory access models is quite varied and complex. It will take a while for you to understand the choices. Then you will have to spend more time working out what gives the best performance...

You might find that you can compute four vector-16s per work item. You need to experiment.

Archives Discussions

More parallelization VS using local memory