
OpenCL

mannerov
Adept I

OpenCL amdgpu-pro generated code performance - please convert 'select' to cndmask

Hi,

I don't know if this is the best place to report OpenCL compiler performance issues, but I didn't find a better one.

SUMMARY: Please, AMD devs: when an OpenCL dev takes the time to explicitly use 'select', it should translate to cndmask and not to anything using an if.

LONG VERSION:

I have code that computes a large number of results (typically 200 to 2000) for every work item, and I need to keep the N best results.

N is at most 32. I need to store both the score (to keep track of the best results) and a unique identifier for the result associated with each score.

I don't need the scores to be sorted.

Since N is small enough, I put the scores and identifiers into a __private table, thus storing them in registers. Depending on N, this blocks up to 64 registers, but it is worth it because we access them extremely often. This also implies that any loop accessing the table elements should be unrolled and that indirect indexing must be avoided.

I have two equivalent codes for the table insertion management.

In the first version, I retrieve the table maximum, then insert the new data at the location of the maximum (this maintains the property that the table contains the entries with the smallest scores).

In the second version, which in theory should be faster, I keep the table sorted (so the maximum is already known), and I handle the insertion by shifting elements and placing the new element at its position.

The code for both paths is attached to this post.
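To give an idea of the first version without opening the attachment, here is a minimal, simplified sketch. The function name, the NUM_TO_TRACK constant, the #pragma unroll hint and the float/uint types are placeholders of mine, not the exact attached code, which uses a macro and its own unroll hint so that everything stays in registers:

/* Simplified sketch of the unordered variant: find the current maximum,
 * then overwrite the slot holding it with the new (score, identifier) pair
 * using select(), so each slot update can map to v_cndmask. */
#define NUM_TO_TRACK 16   /* placeholder; N <= 32 in practice */

void maintain_best_scores_unordered(__private float *best_scores,
                                    __private uint  *best_identifiers,
                                    float score, uint identifier)
{
    /* First unrolled loop: compute the current maximum (the worst kept score). */
    float cur_max = best_scores[0];
    #pragma unroll
    for (int i = 1; i < NUM_TO_TRACK; i++)
        cur_max = fmax(cur_max, best_scores[i]);

    /* Skip the update entirely when the new score does not make the cut. */
    if (score < cur_max) {
        /* Second unrolled loop: replace the slot that holds the maximum.
         * Each iteration is one comparison plus a few selects; duplicates of
         * the maximum would all be overwritten in this simplified sketch. */
        #pragma unroll
        for (int i = 0; i < NUM_TO_TRACK; i++) {
            int is_max = (best_scores[i] == cur_max);
            best_scores[i]      = select(best_scores[i], score, is_max);
            best_identifiers[i] = select(best_identifiers[i], identifier, (uint)is_max);
        }
    }
}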

As can be seen in the first version of the code, the ideal generated code should have:
. Computation of the maximum.
. Set the exec mask according to whether the new score should replace the current maximum.
. A jump instruction if the exec mask is empty.
. Otherwise, for each register of best_scores, one comparison followed by 3 v_cndmask instructions.

This is indeed what the amdgpu-pro OpenCL compiler generates (checked using rcprof from CodeXL with drivers 17.50 and 18.20).

As can be seen in the second version of the code, the ideal generated code should have:
. Set the exec mask according to whether the new score should replace the current maximum (the first element).
. A jump instruction if the exec mask is empty.
. Otherwise, for each register of best_scores, one comparison followed by 2 v_cndmask instructions, plus an 'and' on the exec mask and a jump instruction if the exec mask becomes empty.

This is not what the amdgpu-pro OpenCL compiler generates (checked using rcprof from CodeXL with drivers 17.50 and 18.20).

Instead of the 2 v_cndmask instructions, the generated code is full of mov instructions. The movs correspond to the register movements that the code is equivalent to for a given insertion index.

To be clearer, my understanding of what the compiler does is the following:

#define maintain_best_scores(best_scores, \
                             best_identifiers, \
                             num_to_track, \
                             score, \
                             identifier) \
    /* Maintain ordered best scores. Note they are positive. */ \
    if (score < best_scores[0]) { \
        unroll_loop() \
        for (int mbs_i = 0; mbs_i < num_to_track; mbs_i++) { \
            if (mbs_i < (num_to_track - 1)) { \
                bool cond = (best_scores[mbs_i+1] <= score); \
                if (cond) { \
                    unroll_loop() \
                    for (int mbs_j = 0; mbs_j < mbs_i; mbs_j++) { \
                        best_scores[mbs_j] = best_scores[mbs_j+1]; \
                        best_identifiers[mbs_j] = best_identifiers[mbs_j+1]; \
                    } \
                    best_scores[mbs_i] = score; \
                    best_identifiers[mbs_i] = identifier; \
                    break; \
                } \
            } else { \
                unroll_loop() \
                for (int mbs_j = 0; mbs_j < mbs_i; mbs_j++) { \
                    best_scores[mbs_j] = best_scores[mbs_j+1]; \
                    best_identifiers[mbs_j] = best_identifiers[mbs_j+1]; \
                } \
                best_scores[mbs_i] = score; \
                best_identifiers[mbs_i] = identifier; \
            } \
        } \
    }

This is obviously bad (2 instructions replaced by up to 2N instructions).

In practice, though, the generated code is even worse (the compiler moves the mbs_i and (mbs_i+1) indices into registers and does further work with them).

My guess is that the compiler has converted the select instructions into an if, and then applies some clever trick to produce equivalent ifs that seem less costly.
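To make it clear which select instructions I mean, the ordered insertion is essentially written as below (simplified from the attached macro, so details may differ; unroll_loop() is the same unroll hint macro as above):

#define maintain_best_scores_ordered(best_scores, best_identifiers, \
                                     num_to_track, score, identifier) \
    /* Keep the table sorted; only touch it if the new score makes the cut. */ \
    if (score < best_scores[0]) { \
        unroll_loop() \
        for (int mbs_i = 0; mbs_i < num_to_track; mbs_i++) { \
            if (mbs_i < (num_to_track - 1)) { \
                /* Either shift the next slot down or insert here. */ \
                bool cond = (best_scores[mbs_i+1] <= score); \
                best_scores[mbs_i] = select(best_scores[mbs_i+1], score, cond); \
                best_identifiers[mbs_i] = select(best_identifiers[mbs_i+1], identifier, cond); \
                if (cond) break; \
            } else { \
                /* Last slot: the new element necessarily lands here. */ \
                best_scores[mbs_i] = score; \
                best_identifiers[mbs_i] = identifier; \
            } \
        } \
    }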

Maybe it would indeed be less costly if there were no thread divergence (though I doubt it because of all the additional instructions), but in my case there definitely is.

I guess the driver fix is not easy, but it should come down to 'do NOT convert select to an if'. If I wanted to allow the driver to generate anything other than a cndmask, I wouldn't use select in my OpenCL code.

Please let me know if you have taken notice of this issue report and if it is going to be investigated.

If requested, I can send by mail a complete program that runs the attached code and exhibits the issue. I have done so with Intel devs to help fix issues on their side.

Yours.

0 Likes
8 Replies
dipak
Big Boss

Thank you for reporting this and describing the problem in detail. I've reported it to our OpenCL compiler team. Once I get any feedback, I'll get back to you.

0 Likes
dipak
Big Boss

Hi,

The compiler team has confirmed that, in general, cndmask is produced from select. However, in a complex case like this, with loops etc., the resulting code is not governed simply by a couple of selects but by a lot of different considerations.

Please provide a complete test case so that it can be investigated in detail.

Thanks.

0 Likes

Hi,

Attached here is a demo that runs a simple piece of code exhibiting the issue.

The same kernel is compiled with two versions of the code that maintains the best N values: one version keeps an ordered table, while the other keeps an unordered table in memory.

The former exhibits the issue raised in this thread, while the latter doesn't. I have also attached the ISA generated for both variants (generated with AMD's RCP). The demo displays the time taken, and we can see that the version with the ordered table is much slower.

On Nvidia or Intel OpenCL, both versions take about the same time.

0 Likes

Thanks for providing the repro.

While running the demo on Windows, I observed findings similar to what you described: the ordered version was much slower than the unordered version.

After doing some experiments with the kernel code, I found an interesting point. When I modified the ordered version of the maintain_best_scores function as shown below, I got similar timings for both versions. I also now see a number of cndmask instructions in the resulting ISA code.

Modification:

Instead of the "cond" variable, use the condition itself in the "select" function. For example:

Replace this code:

bool cond = (best_scores[mbs_i+1] <= score);

best_scores[mbs_i] = select(best_scores[mbs_i+1], score, cond);

with:

best_scores[mbs_i] = select(best_scores[mbs_i+1], score, (best_scores[mbs_i+1] <= score));

I attached the modified kernel code. Can you please run it and share your findings?

Another point: in the unordered version of maintain_best_scores, I see two loops, the first to find the maximum and the second to match that element and replace it with the new value. I guess the second one can be avoided if the index of the max element is stored along with the value; see the sketch below. What do you think?
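Something like the following rough, untested sketch is what I have in mind (it reuses the best_scores / best_identifiers / NUM_TO_TRACK names from the earlier sketches):

/* Single-loop idea: track the index of the maximum while scanning,
 * then overwrite that slot directly. Note the final writes use a
 * computed index into the __private arrays. */
int   max_idx = 0;
float cur_max = best_scores[0];
#pragma unroll
for (int i = 1; i < NUM_TO_TRACK; i++) {
    bool bigger = (best_scores[i] > cur_max);
    cur_max = select(cur_max, best_scores[i], bigger);
    max_idx = select(max_idx, i, (int)bigger);
}
if (score < cur_max) {
    best_scores[max_idx]      = score;
    best_identifiers[max_idx] = identifier;
}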

Thanks.

0 Likes

Your code indeed executes faster. I found that the core change was replacing the line:

if (cond) break;

with

if ((best_scores[mbs_i+1] <= score)) break;

I also noticed that replacing cond with its definition inside the selects affects performance very slightly (maybe the compiler recomputes the test?).

Something weird is going on with this code and the shader compiler.

For the unordered version, your suggestion would cause the __private table to be stored in global memory instead of registers, because of the indirect addressing.

Since the table is accessed a lot, it is better for performance to have it stored in registers.

Yours

0 Likes

Yes, that "if" statement seems the problematic one. Actually, I just wanted see the result without the "cond"  variable, so I simply replaced all the dependent statements. I didn't check each line individually.

In general, register usage is very high for this kernel. As per CodeXL, it uses ~100 VGPRs, so occupancy is very low: only 2 wavefronts per SIMD, or 20%. Any improvement in this area will greatly boost performance.
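(For reference, the arithmetic behind those numbers, assuming a GCN SIMD with a 256-entry VGPR file and a hardware maximum of 10 wavefronts per SIMD: floor(256 / 100) = 2 resident wavefronts, and 2 / 10 = 20% occupancy.)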

Another point I observed during testing is that setting an appropriate work group size significantly improves performance compared to the default. For example, I observed the best performance (for both versions of the logic) with a work group size of 8x8, i.e. 1 wavefront per work group. I would suggest experimenting with the work group size.
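On the host side this just means passing an explicit local size instead of NULL; here is a small illustrative helper (queue, kernel, width and height are placeholders for the demo's own objects):

#include <CL/cl.h>

/* Enqueue a 2D kernel with an explicit 8x8 work group (one 64-lane wavefront)
 * instead of letting the runtime pick the local size (local_work_size = NULL).
 * width and height must be multiples of 8 here. */
static cl_int enqueue_8x8(cl_command_queue queue, cl_kernel kernel,
                          size_t width, size_t height)
{
    size_t global[2] = { width, height };
    size_t local[2]  = { 8, 8 };
    return clEnqueueNDRangeKernel(queue, kernel, 2, NULL,
                                  global, local, 0, NULL, NULL);
}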

Thanks.

0 Likes

The demo code I included in this thread is purely a demo to run the table code.

My table code is used inside more complex kernels, and for those I have already selected the best work group size and memory access patterns in order to maximize performance. Using ~100 VGPRs is fine when memory accesses are few and their latency is well hidden; according to the perf counters, my kernels are not limited by memory accesses.

I have found the AMD OpenCL compiler to be rather reliable, except for a few things (sub_group_reduce_add and work_group_reduce_add seem to generate poor ISA compared to the examples described on GPUOpen, but that's for another bug report). This bug report is, for me, the biggest 'bad compilation' case I have hit.


My code is intended to be released as open source when ready, and I would like to avoid special paths to work around compiler issues.

0 Likes

I just shared my observations based on the demo. It's good to know that you have already considered those factors in your application.

0 Likes