[Bug tree-optimization/88398] New: vectorization failure for a small loop to do byte comparison

classic Classic list List threaded Threaded
23 messages Options
12
Reply | Threaded
Open this post in threaded view
|

[Bug tree-optimization/88398] New: vectorization failure for a small loop to do byte comparison

mpolacek at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88398

            Bug ID: 88398
           Summary: vectorization failure for a small loop to do byte
                    comparison
           Product: gcc
           Version: 9.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: jiangning.liu at amperecomputing dot com
  Target Milestone: ---

For the small case below, GCC -O3 can't vectorize the small loop to do byte
comparison in func2.

void *malloc(long unsigned int);
typedef struct {
        unsigned char *buffer;
} data;

static unsigned char *func1(data *d)
{
        return d->buffer;
}

static int func2(int max, int pos, unsigned char *cur)
{
        unsigned char *p = cur + pos;
        int len = 0;
        while (++len != max)
                if (p[len] != cur[len])
                        break;
        return cur[len];
}

int main (int argc) {
        data d;
        d.buffer = malloc(2*argc);
        return func2(argc, argc, func1(&d));
}

At the moment, the following code is generated for this loop,

  4004d4:       38616862        ldrb    w2, [x3,x1]
  4004d8:       6b00005f        cmp     w2, w0
  4004dc:       540000a1        b.ne    4004f0 <main+0x50>
  4004e0:       38616880        ldrb    w0, [x4,x1]
  4004e4:       6b01027f        cmp     w19, w1
  4004e8:       91000421        add     x1, x1, #0x1
  4004ec:       54ffff41        b.ne    4004d4 <main+0x34>

In fact, this loop can be vectorized by checking if the comparison size is
aligned to SIMD register length. It may introduce run time overhead, but cost
model could make decision on doing it or not.
Reply | Threaded
Open this post in threaded view
|

[Bug tree-optimization/88398] vectorization failure for a small loop to do byte comparison

mpolacek at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88398

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |missed-optimization
                 CC|                            |pinskia at gcc dot gnu.org
           Severity|normal                      |enhancement

--- Comment #1 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
Maybe it should be converting it to memcmp instead ....
Reply | Threaded
Open this post in threaded view
|

[Bug tree-optimization/88398] vectorization failure for a small loop to do byte comparison

mpolacek at gcc dot gnu.org
In reply to this post by mpolacek at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88398

--- Comment #2 from Jiangning Liu <jiangning.liu at amperecomputing dot com> ---
memcmp doesn't return the position where they differ.
Reply | Threaded
Open this post in threaded view
|

[Bug tree-optimization/88398] vectorization failure for a small loop to do byte comparison

mpolacek at gcc dot gnu.org
In reply to this post by mpolacek at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88398

Jakub Jelinek <jakub at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jakub at gcc dot gnu.org

--- Comment #3 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
The loop in itself, without knowing the buffer comes from malloc and has
certain size (and at that point it would figure probably out the testsuite is
UB) can't be vectorized.  You have two different pointers, both potentially
with different alignment and no assurance that bytes from 0 to max - 1 are
actually accessible after the one that differs.
Reply | Threaded
Open this post in threaded view
|

[Bug tree-optimization/88398] vectorization failure for a small loop to do byte comparison

mpolacek at gcc dot gnu.org
In reply to this post by mpolacek at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88398

--- Comment #4 from Jiangning Liu <jiangning.liu at amperecomputing dot com> ---
I expect "gcc -O3 -flto" could work.
Reply | Threaded
Open this post in threaded view
|

[Bug tree-optimization/88398] vectorization failure for a small loop to do byte comparison

mpolacek at gcc dot gnu.org
In reply to this post by mpolacek at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88398

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |rguenth at gcc dot gnu.org
             Blocks|                            |53947

--- Comment #5 from Richard Biener <rguenth at gcc dot gnu.org> ---
Well, the main issue is that the loop has two exits which is something the
vectorizer doesn't support (with a hard check).  Jakub explains why that is so.

You could change the loop to sth like

  unsigned char *p = cur + pos;
  int len = 0;
  int diff_pos = __INT_MAX__;
  while (++len != max)
    if (p[len] != cur[len])
      diff_pos = len < diff_pos ? len : diff_pos;
  return cur[diff_pos];

which then runs into a limitation recently reported (two dependent
reductions not recognized).


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
[Bug 53947] [meta-bug] vectorizer missed-optimizations
Reply | Threaded
Open this post in threaded view
|

[Bug tree-optimization/88398] vectorization failure for a small loop to do byte comparison

mpolacek at gcc dot gnu.org
In reply to this post by mpolacek at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88398

ktkachov at gcc dot gnu.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2018-12-07
                 CC|                            |ktkachov at gcc dot gnu.org
     Ever confirmed|0                           |1

--- Comment #6 from ktkachov at gcc dot gnu.org ---
(In reply to Richard Biener from comment #5)

> Well, the main issue is that the loop has two exits which is something the
> vectorizer doesn't support (with a hard check).  Jakub explains why that is
> so.
>
> You could change the loop to sth like
>
>   unsigned char *p = cur + pos;
>   int len = 0;
>   int diff_pos = __INT_MAX__;
>   while (++len != max)
>     if (p[len] != cur[len])
>       diff_pos = len < diff_pos ? len : diff_pos;
>   return cur[diff_pos];
>
> which then runs into a limitation recently reported (two dependent
> reductions not recognized).

Could GCC generate this version and select it at runtime if max is sufficiently
large?
Reply | Threaded
Open this post in threaded view
|

[Bug tree-optimization/88398] vectorization failure for a small loop to do byte comparison

mpolacek at gcc dot gnu.org
In reply to this post by mpolacek at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88398

--- Comment #7 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
No, because that has different behavior.  In particular, it reads all bytes
from 0 to max - 1 from both arrays, rather than stopping at the first one.
If both pointers are aligned the same modulo simd size, then it is in theory
vectorizable just by loading both simd words, comparing all the individual
bytes,
if all of them compare equal, continue looping, otherwise e.g. in a scalar loop
determine the exact one.  Though, if they are different, it is harder, one simd
word would need to be read in the previous iteration of the loop and not read
in the current one until you prove that there isn't any difference.

Essentially, you are looking for reimplementation of memcmp e.g. from glibc
(like the C version in
https://sourceware.org/git/?p=glibc.git;a=blob_plain;f=string/memcmp.c;hb=HEAD
) just with the minor tweak that rather than returning the difference of the
bytes you return the length of the common prefix of both strings and you don't
have the guarantee memcmp has (that all bytes between 0 and max-1 are
accessible).

E.g. glibc x86_64 memcmp unrolled loop looks like:
        movdqu    (%rdi,%rsi), %xmm0
        pcmpeqb   (%rdi), %xmm0
        pmovmskb  %xmm0, %edx
        subl      $0xffff, %edx
        jnz       L(neq)
        addq       $16, %rdi
repeated a couple of times, but that can be done this way only if both pointers
% simd_size are the same (you can even use movdqa after scalar loop for
alignment).  If they aren't, you need to compare separately one set of bytes
first and only if they are all equal you can read the next (aligned) word and
in both cases you need to do the whole vector shifts.

All in all, I'd say you want to have a function for this and implement it
efficiently, rather than hoping the compiler will pattern match inefficient
code and turn it into the smart thing for you, it is way too specialized thing.
Reply | Threaded
Open this post in threaded view
|

[Bug tree-optimization/88398] vectorization failure for a small loop to do byte comparison

mpolacek at gcc dot gnu.org
In reply to this post by mpolacek at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88398

Wilco <wilco at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |wilco at gcc dot gnu.org

--- Comment #8 from Wilco <wilco at gcc dot gnu.org> ---
(In reply to Jiangning Liu from comment #0)

> For the small case below, GCC -O3 can't vectorize the small loop to do byte
> comparison in func2.
>
> void *malloc(long unsigned int);
> typedef struct {
>         unsigned char *buffer;
> } data;
>
> static unsigned char *func1(data *d)
> {
>         return d->buffer;
> }
>
> static int func2(int max, int pos, unsigned char *cur)
> {
>         unsigned char *p = cur + pos;
>         int len = 0;
>         while (++len != max)
>                 if (p[len] != cur[len])
>                         break;
>         return cur[len];
> }
>
> int main (int argc) {
>         data d;
>         d.buffer = malloc(2*argc);
>         return func2(argc, argc, func1(&d));
> }
>
> At the moment, the following code is generated for this loop,
>
>   4004d4:       38616862        ldrb    w2, [x3,x1]
>   4004d8:       6b00005f        cmp     w2, w0
>   4004dc:       540000a1        b.ne    4004f0 <main+0x50>
>   4004e0:       38616880        ldrb    w0, [x4,x1]
>   4004e4:       6b01027f        cmp     w19, w1
>   4004e8:       91000421        add     x1, x1, #0x1
>   4004ec:       54ffff41        b.ne    4004d4 <main+0x34>
>
> In fact, this loop can be vectorized by checking if the comparison size is
> aligned to SIMD register length. It may introduce run time overhead, but
> cost model could make decision on doing it or not.

The only optimization that can be done here is unrolling. For this kind of
string matching the number of bytes that match is will be small on average, so
even if vectorization was feasible, the startup overhead alone would kill
performance. With unrolling you can remove the comparison with max each
iteration and do 4 bytes per iteration like this:

loop4:
ldrb    w2, [x3,4]!
ldrb    w0, [x4,4]!
cmp     w0, w2
bne     exitloop1
ldrb    w2, [x3,1]
ldrb    w0, [x4,1]
cmp     w0, w2
bne     exitloop2
ldrb    w2, [x3,2]
ldrb    w0, [x4,2]
cmp     w0, w2
bne     exitloop3
ldrb    w2, [x3,3]
ldrb    w0, [x4,3]
cmp     w0, w2
bne     exitloop4
add     x1, x1, 4
cmp     x1, w19
blt     loop4
Reply | Threaded
Open this post in threaded view
|

[Bug tree-optimization/88398] vectorization failure for a small loop to do byte comparison

mpolacek at gcc dot gnu.org
In reply to this post by mpolacek at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88398

--- Comment #9 from ktkachov at gcc dot gnu.org ---
For context, this is the hot loop in 557.xz_r from SPEC2017
Reply | Threaded
Open this post in threaded view
|

[Bug tree-optimization/88398] vectorization failure for a small loop to do byte comparison

mpolacek at gcc dot gnu.org
In reply to this post by mpolacek at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88398

--- Comment #10 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
If the compiler knew say from PGO that pos is usually a multiple of certain
power of two and that the loop usually iterates many times (I guess the latter
can be determined from comparing the bb count of the loop itself and its
header), it could emit something like:
static int func2(int max, int pos, unsigned char *cur)
{
  unsigned char *p = cur + pos;
  int len = 0;
  if (max > 32 && (pos & 7) == 0)
    {
      int l = ((1 - ((uintptr_t) cur)) & 7) + 1;
      while (++len != l)
        if (p[len] != cur[len])
          goto end;
      unsigned long long __attribute__((may_alias)) *p2 = (unsigned long long
*) &p[len];
      unsigned long long __attribute__((may_alias)) *cur2 = (unsigned long long
*) &cur[len];
      while (len + 8 < max)
        {
          if (*p2++ != *cur2++)
            break;
          len += 8;
        }
      --len;
    }
  while (++len != max)
    if (p[len] != cur[len])
      break;
end:
  return cur[len];
}

or so (untested).  Of course, it could be done using SIMD too if there is a way
to terminate the loop if any of the elts is different and could be done in that
case at 16 or 32 or 64 characters at a time etc.
But, without knowing that pos is typically some power of two this would just
waste code size, dealing with the unaligned cases would be more complicated
(one can't read the next elt until proving that the current one is all equal),
so it would need to involve some rotations (or permutes for SIMD).
Reply | Threaded
Open this post in threaded view
|

[Bug tree-optimization/88398] vectorization failure for a small loop to do byte comparison

mpolacek at gcc dot gnu.org
In reply to this post by mpolacek at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88398

--- Comment #11 from Wilco <wilco at gcc dot gnu.org> ---
(In reply to Jakub Jelinek from comment #10)

> If the compiler knew say from PGO that pos is usually a multiple of certain
> power of two and that the loop usually iterates many times (I guess the
> latter can be determined from comparing the bb count of the loop itself and
> its header), it could emit something like:
> static int func2(int max, int pos, unsigned char *cur)
> {
>   unsigned char *p = cur + pos;
>   int len = 0;
>   if (max > 32 && (pos & 7) == 0)
>     {
>       int l = ((1 - ((uintptr_t) cur)) & 7) + 1;
>       while (++len != l)
>         if (p[len] != cur[len])
>           goto end;
>       unsigned long long __attribute__((may_alias)) *p2 = (unsigned long
> long *) &p[len];
>       unsigned long long __attribute__((may_alias)) *cur2 = (unsigned long
> long *) &cur[len];
>       while (len + 8 < max)
>         {
>           if (*p2++ != *cur2++)
>             break;
>           len += 8;
>         }
>       --len;
>     }
>   while (++len != max)
>     if (p[len] != cur[len])
>       break;
> end:
>   return cur[len];
> }
>
> or so (untested).  Of course, it could be done using SIMD too if there is a
> way to terminate the loop if any of the elts is different and could be done
> in that case at 16 or 32 or 64 characters at a time etc.
> But, without knowing that pos is typically some power of two this would just
> waste code size, dealing with the unaligned cases would be more complicated
> (one can't read the next elt until proving that the current one is all
> equal), so it would need to involve some rotations (or permutes for SIMD).

Given it is compressing data both pointers will typically be misaligned. Pos
would be fairly random and len does not usually start from zero either. And
it's highly unlikely it will iterate more than a few bytes. Compression matches
are typically just a few bytes long. So unrolling is the only useful
optimization for this kind of code.
Reply | Threaded
Open this post in threaded view
|

[Bug tree-optimization/88398] vectorization failure for a small loop to do byte comparison

mpolacek at gcc dot gnu.org
In reply to this post by mpolacek at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88398

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |hubicka at gcc dot gnu.org

--- Comment #12 from Richard Biener <rguenth at gcc dot gnu.org> ---
I think the only chance would be for FDO to flag the loop as hot plus assign
it a reasonably low expected niter so GCC could consider "fully" peeling it.
But then whether peeling it that situation is profitable depends highly on
the target (that is, the number of peeled copies that might be profitable).
Considering the various forms of pre-decoded yop caches we have in modern
CPUs peeling might do more harm than good (you only get away with the
mispredicted exit jump, plus you add mispredictions of the peeled exits
in case niter fluctuates a lot).
Reply | Threaded
Open this post in threaded view
|

[Bug tree-optimization/88398] vectorization failure for a small loop to do byte comparison

mpolacek at gcc dot gnu.org
In reply to this post by mpolacek at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88398

--- Comment #13 from Wilco <wilco at gcc dot gnu.org> ---
So to add some real numbers to the discussion, the average number of iterations
is 4.31. Frequency stats (16 includes all iterations > 16 too):

1: 29.0
2: 4.2
3: 1.0
4: 36.7
5: 8.7
6: 3.4
7: 3.0
8: 2.6
9: 2.1
10: 1.9
11: 1.6
12: 1.2
13: 0.9
14: 0.8
15: 0.7
16: 2.1

So unrolling 4x is perfect for this loop. Note the official xz version has
optimized this loop since 2014(!) using unaligned accesses:
https://git.tukaani.org/?p=xz.git;a=blob;f=src/liblzma/common/memcmplen.h
Reply | Threaded
Open this post in threaded view
|

[Bug tree-optimization/88398] vectorization failure for a small loop to do byte comparison

mpolacek at gcc dot gnu.org
In reply to this post by mpolacek at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88398

--- Comment #14 from rguenther at suse dot de <rguenther at suse dot de> ---
On Mon, 7 Jan 2019, wilco at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88398
>
> --- Comment #13 from Wilco <wilco at gcc dot gnu.org> ---
> So to add some real numbers to the discussion, the average number of iterations
> is 4.31. Frequency stats (16 includes all iterations > 16 too):
>
> 1: 29.0
> 2: 4.2
> 3: 1.0
> 4: 36.7
> 5: 8.7
> 6: 3.4
> 7: 3.0
> 8: 2.6
> 9: 2.1
> 10: 1.9
> 11: 1.6
> 12: 1.2
> 13: 0.9
> 14: 0.8
> 15: 0.7
> 16: 2.1
>
> So unrolling 4x is perfect for this loop. Note the official xz version has
> optimized this loop since 2014(!) using unaligned accesses:
> https://git.tukaani.org/?p=xz.git;a=blob;f=src/liblzma/common/memcmplen.h

I guess if we'd have some data to guide then classical unrolling using
duffs device would be best here?  Because peeling will increase the
number of dynamic branches and likely the actual distribution of
#iterations isn't so that they will be well predicted?
Reply | Threaded
Open this post in threaded view
|

[Bug tree-optimization/88398] vectorization failure for a small loop to do byte comparison

mpolacek at gcc dot gnu.org
In reply to this post by mpolacek at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88398

--- Comment #15 from Wilco <wilco at gcc dot gnu.org> ---
(In reply to [hidden email] from comment #14)

> On Mon, 7 Jan 2019, wilco at gcc dot gnu.org wrote:
>
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88398
> >
> > --- Comment #13 from Wilco <wilco at gcc dot gnu.org> ---
> > So to add some real numbers to the discussion, the average number of iterations
> > is 4.31. Frequency stats (16 includes all iterations > 16 too):
> >
> > 1: 29.0
> > 2: 4.2
> > 3: 1.0
> > 4: 36.7
> > 5: 8.7
> > 6: 3.4
> > 7: 3.0
> > 8: 2.6
> > 9: 2.1
> > 10: 1.9
> > 11: 1.6
> > 12: 1.2
> > 13: 0.9
> > 14: 0.8
> > 15: 0.7
> > 16: 2.1
> >
> > So unrolling 4x is perfect for this loop. Note the official xz version has
> > optimized this loop since 2014(!) using unaligned accesses:
> > https://git.tukaani.org/?p=xz.git;a=blob;f=src/liblzma/common/memcmplen.h
>
> I guess if we'd have some data to guide then classical unrolling using
> duffs device would be best here?  Because peeling will increase the
> number of dynamic branches and likely the actual distribution of
> #iterations isn't so that they will be well predicted?

Duff's device is very slow on any CPU with branch prediction. It can't be used
here given you don't know the number of iterations in advance (only the
maximum).

Unrolling reduces the number of branches given the loop branch is only taken
once every N iterations. The loop overhead is significant here: 3 out of 7
instructions, with 4x unrolling that reduces to 3 in 19.
Reply | Threaded
Open this post in threaded view
|

[Bug tree-optimization/88398] vectorization failure for a small loop to do byte comparison

mpolacek at gcc dot gnu.org
In reply to this post by mpolacek at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88398

--- Comment #16 from rguenther at suse dot de <rguenther at suse dot de> ---
On Tue, 8 Jan 2019, wilco at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88398
>
> --- Comment #15 from Wilco <wilco at gcc dot gnu.org> ---
> (In reply to [hidden email] from comment #14)
> > On Mon, 7 Jan 2019, wilco at gcc dot gnu.org wrote:
> >
> > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88398
> > >
> > > --- Comment #13 from Wilco <wilco at gcc dot gnu.org> ---
> > > So to add some real numbers to the discussion, the average number of iterations
> > > is 4.31. Frequency stats (16 includes all iterations > 16 too):
> > >
> > > 1: 29.0
> > > 2: 4.2
> > > 3: 1.0
> > > 4: 36.7
> > > 5: 8.7
> > > 6: 3.4
> > > 7: 3.0
> > > 8: 2.6
> > > 9: 2.1
> > > 10: 1.9
> > > 11: 1.6
> > > 12: 1.2
> > > 13: 0.9
> > > 14: 0.8
> > > 15: 0.7
> > > 16: 2.1
> > >
> > > So unrolling 4x is perfect for this loop. Note the official xz version has
> > > optimized this loop since 2014(!) using unaligned accesses:
> > > https://git.tukaani.org/?p=xz.git;a=blob;f=src/liblzma/common/memcmplen.h
> >
> > I guess if we'd have some data to guide then classical unrolling using
> > duffs device would be best here?  Because peeling will increase the
> > number of dynamic branches and likely the actual distribution of
> > #iterations isn't so that they will be well predicted?
>
> Duff's device is very slow on any CPU with branch prediction. It can't be used
> here given you don't know the number of iterations in advance (only the
> maximum).
>
> Unrolling reduces the number of branches given the loop branch is only taken
> once every N iterations. The loop overhead is significant here: 3 out of 7
> instructions, with 4x unrolling that reduces to 3 in 19.

But you can't elide the checks in the peeled copies and for 4-times
unrolling you have most cases exiting on the first or fourth check.

Duffs device simply merges the prologue iterations for unrolling
with the loop body so I don't see why it can't be used.  It's

  switch (n % 4)
   {
    loop:
       iter
       n--;
    case 3:
       iter
       n--;
    case 2:
       iter
       n--
    case 1:
       iter
       n--;
    case 0:
       if (n != 0)
         goto loop;
   }

it's cost is mainly the computed jump into the loop body.  Then
you have a four-fold reduction in branches without the overhead
of having another three loop body copies in the prologue with
retained early exit checks.
Reply | Threaded
Open this post in threaded view
|

[Bug tree-optimization/88398] vectorization failure for a small loop to do byte comparison

mpolacek at gcc dot gnu.org
In reply to this post by mpolacek at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88398

--- Comment #17 from Wilco <wilco at gcc dot gnu.org> ---
(In reply to [hidden email] from comment #16)

> But you can't elide the checks in the peeled copies and for 4-times
> unrolling you have most cases exiting on the first or fourth check.

See comment #8 for an example how it should be unrolled (it needs a simple
check at entry and a trailing loop as well of course).

> Duffs device simply merges the prologue iterations for unrolling
> with the loop body so I don't see why it can't be used.  It's
>
>   switch (n % 4)
>    {
>     loop:
>        iter
>        n--;
>     case 3:
>        iter
>        n--;
>     case 2:
>        iter
>        n--
>     case 1:
>        iter
>        n--;
>     case 0:
>        if (n != 0)
>          goto loop;
>    }
>
> it's cost is mainly the computed jump into the loop body.  Then
> you have a four-fold reduction in branches without the overhead
> of having another three loop body copies in the prologue with
> retained early exit checks.

Duff's device is a bad idea given it adds extra checks and dependencies that
aren't necessary if you unroll properly. There is never a need to merge the
trailing loop into the unrolled copy, and neither should we peel off 3
iterations for no gain.
Reply | Threaded
Open this post in threaded view
|

[Bug tree-optimization/88398] vectorization failure for a small loop to do byte comparison

mpolacek at gcc dot gnu.org
In reply to this post by mpolacek at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88398

--- Comment #18 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
The duffs device doesn't need to be done with computed jump, it can be done
with 3 conditional branches + 3 comparisons too.  The advantage of doing that
is especially if the iter isn't really very small, by doing it that way you
don't need those 4 unrolled iterations + one scalar loop.
if (n & 2)
  {
    if (n & 1) { n++; goto loop3; }
    { n += 2; goto loop2; }
  }
else if (n & 1)
  { n += 3; goto loop1; }
else if (n == 0)
  goto end;
do
  {
    iter;
  loop3:
    iter;
  loop2:
    iter;
  loop1:
    iter;
    n -= 4;
  }
while (n != 0);
end:;

Of course, if iter is very short, it might be easier/more efficient to
duplicate iter more times than 4 and do something else.
Reply | Threaded
Open this post in threaded view
|

[Bug tree-optimization/88398] vectorization failure for a small loop to do byte comparison

mpolacek at gcc dot gnu.org
In reply to this post by mpolacek at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88398

--- Comment #19 from Wilco <wilco at gcc dot gnu.org> ---
(In reply to Jakub Jelinek from comment #18)
> The duffs device doesn't need to be done with computed jump, it can be done
> with 3 conditional branches + 3 comparisons too.  The advantage of doing
> that is especially if the iter isn't really very small, by doing it that way
> you don't need those 4 unrolled iterations + one scalar loop.

While that is better than an indirect branch, it still branches into the loop,
so you don't benefit from optimization across the loop. So using a trailing
loop is better in most cases.

> Of course, if iter is very short, it might be easier/more efficient to
> duplicate iter more times than 4 and do something else.

If iter is a small constant then peeling makes sense as you can completely
remove the loop counter and branches. Eg. for N=3 you end up with something
like this:

if (n >= 2)
  iter1
  iter2
  n -= 2
if (n != 0)
  iter3
12