AnsweredAssumed Answered

OpenCL amdgpu-pro generated code performance - please convert 'select' to cndmask

Question asked by mannerov on Aug 11, 2018
Latest reply on Sep 7, 2018 by dipak

Hi,

 

I don't know if this place is the best place to report opencl compiler performance issues, but well I didn't find a better place.

 

SUMMARY: Please AMD devs, when an OpenCL dev takes the time to explicitly use 'select', it should translate to cndmask and not to anything using an if.

 

LONG VERSION:

I have a code that computes a lot of things (typically from 200 to 2000) for every work item, and I need to keep the N best results.

N is lower or equal to 32. I need to store both the score (to keep track of the best results) and an unique identifier to identify the result associated to the score.

 

I don't need the scores to be sorted.

 

As N is small enough, I put the weights and identifiers into a __private table, thus storing them in registers, which means a maximum of 64 registers blocked for it, depending on N, but it is worth it as we access them extremely often. This also implies any loop accessing the table elements should be unrolled and indirect indexing is to be avoided.

 

I have two equivalent codes for the table insertion management.

 

In the first version, I retrieve the table maximum. Then I insert at the location of the maximum the new data (this maintains the property that the table containes the data of minimum score).

 

In the second version, which in theory should be faster, I keep a sorted table (thus the maximum is already known), and I handle the insertion by either shifting the elements or inserting the new element.

 

The code for both paths is attached to this post.

 

As can be seen on the first version of the code, the ideal generated code should have:
. Computation of the maximum
. set exec mask to whether the new score is above maximum or not.
. jump instruction if exec mask is empty
. Else for each register in best_score, one comparison followed by 3 v_cndmask instructions

 

This is indeed what the amdgpu-pro OpenCL compiler generates (checked using rcprof of CodeXL with drivers 17.50 and 18.20)

 

As can be seen on the second version of the code, the ideal generated code should have:

 

. set exec mask to whether the new score is above maximum (first element) or not.

. jump instruction if exec mask is empty
. Else for each register in best_score, one comparison followed by 2 v_cndmask instructions, plus an 'and' on the exec mask, and a jump instruction if exec mask is empty.

 

This is not what the amdgpu-pro OpenCL compiler generates (checked using rcprof of CodeXL with drivers 17.50 and 18.20)

 

Instead of the 2 v_cndmask instructions, the code generated is full of mov instructions.The movs correspond to what the code is equivalent to in terms of register movement for a given insertion index.

 

To be more clear, my understand of what the compiler does is the following:

 

 

#define maintain_best_scores(best_scores, \
                             best_identifiers, \
                             num_to_track, \
                             score, \
                             identifier) \
    /* Maintain ordered best scores. Note they are positive. */ \
    if (score < best_scores[0]) { \
        unroll_loop() \
        for (int mbs_i = 0; mbs_i < num_to_track; mbs_i++) { \
            if (mbs_i < (num_to_track - 1)) { \
                bool cond = (best_scores[mbs_i+1] <= score); \
                if (cond) { \
                    unroll_loop() \
                    for (int mbs_j = 0; mbs_j < mbs_i; mbs_j++) { \
                        best_scores[mbs_j] = best_scores[mbs_j+1]; \
                        best_identifiers[mbs_j] = best_identifiers[mbs_j+1]; \
                    } \
                    best_scores[mbs_i] = score;
                    best_identifiers[mbs_i] = identifier;
                    break;\
                } \
            } else { \
                unroll_loop() \
                for (int mbs_j = 0; mbs_j < mbs_i; mbs_j++) { \
                    best_scores[mbs_j] = best_scores[mbs_j+1]; \
                    best_identifiers[mbs_j] = best_identifiers[mbs_j+1]; \
                } \
                best_scores[mbs_i] = score; \
                best_identifiers[mbs_i] = identifier; \
            } \
        } \
    }


#endif

 

 

 

 

Which is obviously bad (2 instructions replaced by a maximum of 2N instructions).

In practice though, the code generated is even worse (the compiler movs the indexes of mbs_i and (mbs_i+1) into registers and does stuff with).

 

My guess is that the compiler has converted the select instructions into an if, and does some clever trick to get equivalent ifs that seemed less costly.

Maybe it would indeed be less costly if there was no thread divergence (but I doubt it because of all the additional instructions), but in my case there definitely is.

 

I guess the driver fix is not easy, but I should come down to 'do NOT convert select to an if'. If I wanted to allow the driver to do anything else than a cndmask, I wouldn't use select in my OpenCL code.

 

Please let me know if you have taken notice of this issue report and if it is going to be investigated.

 

If requested, I can send by mail a complete code running the attached code and having the issue. I have done so with intel devs to help fix issues on their side.

 

 

Yours.

Attachments

Outcomes