[Bug tree-optimization/88398] New: vectorization failure for a small loop to do byte comparison

classic Classic list List threaded Threaded
9 messages Options
Reply | Threaded
Open this post in threaded view
|

[Bug tree-optimization/88398] New: vectorization failure for a small loop to do byte comparison

kargl at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88398

            Bug ID: 88398
           Summary: vectorization failure for a small loop to do byte
                    comparison
           Product: gcc
           Version: 9.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: jiangning.liu at amperecomputing dot com
  Target Milestone: ---

For the small case below, GCC -O3 can't vectorize the small loop to do byte
comparison in func2.

void *malloc(long unsigned int);
typedef struct {
        unsigned char *buffer;
} data;

static unsigned char *func1(data *d)
{
        return d->buffer;
}

static int func2(int max, int pos, unsigned char *cur)
{
        unsigned char *p = cur + pos;
        int len = 0;
        while (++len != max)
                if (p[len] != cur[len])
                        break;
        return cur[len];
}

int main (int argc) {
        data d;
        d.buffer = malloc(2*argc);
        return func2(argc, argc, func1(&d));
}

At the moment, the following code is generated for this loop,

  4004d4:       38616862        ldrb    w2, [x3,x1]
  4004d8:       6b00005f        cmp     w2, w0
  4004dc:       540000a1        b.ne    4004f0 <main+0x50>
  4004e0:       38616880        ldrb    w0, [x4,x1]
  4004e4:       6b01027f        cmp     w19, w1
  4004e8:       91000421        add     x1, x1, #0x1
  4004ec:       54ffff41        b.ne    4004d4 <main+0x34>

In fact, this loop can be vectorized by checking if the comparison size is
aligned to SIMD register length. It may introduce run time overhead, but cost
model could make decision on doing it or not.
Reply | Threaded
Open this post in threaded view
|

[Bug tree-optimization/88398] vectorization failure for a small loop to do byte comparison

kargl at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88398

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |missed-optimization
                 CC|                            |pinskia at gcc dot gnu.org
           Severity|normal                      |enhancement

--- Comment #1 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
Maybe it should be converting it to memcmp instead ....
Reply | Threaded
Open this post in threaded view
|

[Bug tree-optimization/88398] vectorization failure for a small loop to do byte comparison

kargl at gcc dot gnu.org
In reply to this post by kargl at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88398

--- Comment #2 from Jiangning Liu <jiangning.liu at amperecomputing dot com> ---
memcmp doesn't return the position where they differ.
Reply | Threaded
Open this post in threaded view
|

[Bug tree-optimization/88398] vectorization failure for a small loop to do byte comparison

kargl at gcc dot gnu.org
In reply to this post by kargl at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88398

Jakub Jelinek <jakub at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jakub at gcc dot gnu.org

--- Comment #3 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
The loop in itself, without knowing the buffer comes from malloc and has
certain size (and at that point it would figure probably out the testsuite is
UB) can't be vectorized.  You have two different pointers, both potentially
with different alignment and no assurance that bytes from 0 to max - 1 are
actually accessible after the one that differs.
Reply | Threaded
Open this post in threaded view
|

[Bug tree-optimization/88398] vectorization failure for a small loop to do byte comparison

kargl at gcc dot gnu.org
In reply to this post by kargl at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88398

--- Comment #4 from Jiangning Liu <jiangning.liu at amperecomputing dot com> ---
I expect "gcc -O3 -flto" could work.
Reply | Threaded
Open this post in threaded view
|

[Bug tree-optimization/88398] vectorization failure for a small loop to do byte comparison

kargl at gcc dot gnu.org
In reply to this post by kargl at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88398

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |rguenth at gcc dot gnu.org
             Blocks|                            |53947

--- Comment #5 from Richard Biener <rguenth at gcc dot gnu.org> ---
Well, the main issue is that the loop has two exits which is something the
vectorizer doesn't support (with a hard check).  Jakub explains why that is so.

You could change the loop to sth like

  unsigned char *p = cur + pos;
  int len = 0;
  int diff_pos = __INT_MAX__;
  while (++len != max)
    if (p[len] != cur[len])
      diff_pos = len < diff_pos ? len : diff_pos;
  return cur[diff_pos];

which then runs into a limitation recently reported (two dependent
reductions not recognized).


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
[Bug 53947] [meta-bug] vectorizer missed-optimizations
Reply | Threaded
Open this post in threaded view
|

[Bug tree-optimization/88398] vectorization failure for a small loop to do byte comparison

kargl at gcc dot gnu.org
In reply to this post by kargl at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88398

ktkachov at gcc dot gnu.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2018-12-07
                 CC|                            |ktkachov at gcc dot gnu.org
     Ever confirmed|0                           |1

--- Comment #6 from ktkachov at gcc dot gnu.org ---
(In reply to Richard Biener from comment #5)

> Well, the main issue is that the loop has two exits which is something the
> vectorizer doesn't support (with a hard check).  Jakub explains why that is
> so.
>
> You could change the loop to sth like
>
>   unsigned char *p = cur + pos;
>   int len = 0;
>   int diff_pos = __INT_MAX__;
>   while (++len != max)
>     if (p[len] != cur[len])
>       diff_pos = len < diff_pos ? len : diff_pos;
>   return cur[diff_pos];
>
> which then runs into a limitation recently reported (two dependent
> reductions not recognized).

Could GCC generate this version and select it at runtime if max is sufficiently
large?
Reply | Threaded
Open this post in threaded view
|

[Bug tree-optimization/88398] vectorization failure for a small loop to do byte comparison

kargl at gcc dot gnu.org
In reply to this post by kargl at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88398

--- Comment #7 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
No, because that has different behavior.  In particular, it reads all bytes
from 0 to max - 1 from both arrays, rather than stopping at the first one.
If both pointers are aligned the same modulo simd size, then it is in theory
vectorizable just by loading both simd words, comparing all the individual
bytes,
if all of them compare equal, continue looping, otherwise e.g. in a scalar loop
determine the exact one.  Though, if they are different, it is harder, one simd
word would need to be read in the previous iteration of the loop and not read
in the current one until you prove that there isn't any difference.

Essentially, you are looking for reimplementation of memcmp e.g. from glibc
(like the C version in
https://sourceware.org/git/?p=glibc.git;a=blob_plain;f=string/memcmp.c;hb=HEAD
) just with the minor tweak that rather than returning the difference of the
bytes you return the length of the common prefix of both strings and you don't
have the guarantee memcmp has (that all bytes between 0 and max-1 are
accessible).

E.g. glibc x86_64 memcmp unrolled loop looks like:
        movdqu    (%rdi,%rsi), %xmm0
        pcmpeqb   (%rdi), %xmm0
        pmovmskb  %xmm0, %edx
        subl      $0xffff, %edx
        jnz       L(neq)
        addq       $16, %rdi
repeated a couple of times, but that can be done this way only if both pointers
% simd_size are the same (you can even use movdqa after scalar loop for
alignment).  If they aren't, you need to compare separately one set of bytes
first and only if they are all equal you can read the next (aligned) word and
in both cases you need to do the whole vector shifts.

All in all, I'd say you want to have a function for this and implement it
efficiently, rather than hoping the compiler will pattern match inefficient
code and turn it into the smart thing for you, it is way too specialized thing.
Reply | Threaded
Open this post in threaded view
|

[Bug tree-optimization/88398] vectorization failure for a small loop to do byte comparison

kargl at gcc dot gnu.org
In reply to this post by kargl at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88398

Wilco <wilco at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |wilco at gcc dot gnu.org

--- Comment #8 from Wilco <wilco at gcc dot gnu.org> ---
(In reply to Jiangning Liu from comment #0)

> For the small case below, GCC -O3 can't vectorize the small loop to do byte
> comparison in func2.
>
> void *malloc(long unsigned int);
> typedef struct {
>         unsigned char *buffer;
> } data;
>
> static unsigned char *func1(data *d)
> {
>         return d->buffer;
> }
>
> static int func2(int max, int pos, unsigned char *cur)
> {
>         unsigned char *p = cur + pos;
>         int len = 0;
>         while (++len != max)
>                 if (p[len] != cur[len])
>                         break;
>         return cur[len];
> }
>
> int main (int argc) {
>         data d;
>         d.buffer = malloc(2*argc);
>         return func2(argc, argc, func1(&d));
> }
>
> At the moment, the following code is generated for this loop,
>
>   4004d4:       38616862        ldrb    w2, [x3,x1]
>   4004d8:       6b00005f        cmp     w2, w0
>   4004dc:       540000a1        b.ne    4004f0 <main+0x50>
>   4004e0:       38616880        ldrb    w0, [x4,x1]
>   4004e4:       6b01027f        cmp     w19, w1
>   4004e8:       91000421        add     x1, x1, #0x1
>   4004ec:       54ffff41        b.ne    4004d4 <main+0x34>
>
> In fact, this loop can be vectorized by checking if the comparison size is
> aligned to SIMD register length. It may introduce run time overhead, but
> cost model could make decision on doing it or not.

The only optimization that can be done here is unrolling. For this kind of
string matching the number of bytes that match is will be small on average, so
even if vectorization was feasible, the startup overhead alone would kill
performance. With unrolling you can remove the comparison with max each
iteration and do 4 bytes per iteration like this:

loop4:
ldrb    w2, [x3,4]!
ldrb    w0, [x4,4]!
cmp     w0, w2
bne     exitloop1
ldrb    w2, [x3,1]
ldrb    w0, [x4,1]
cmp     w0, w2
bne     exitloop2
ldrb    w2, [x3,2]
ldrb    w0, [x4,2]
cmp     w0, w2
bne     exitloop3
ldrb    w2, [x3,3]
ldrb    w0, [x4,3]
cmp     w0, w2
bne     exitloop4
add     x1, x1, 4
cmp     x1, w19
blt     loop4