Can support TRUNC_DIV_EXPR, TRUNC_MOD_EXPR in GCC vectorization/scalar evolution -- and/or linearization?

12 messages

Can support TRUNC_DIV_EXPR, TRUNC_MOD_EXPR in GCC vectorization/scalar evolution -- and/or linearization?

Thomas Schwinge
Hi!

I'm for the first time looking into the existing vectorization
functionality in GCC (yay!), and with that I'm also for the first time
encountering GCC's scalar evolution (scev) machinery (yay!), and the
chains of recurrences (chrec) used by that (yay!).

Obviously, I'm right now doing my own reading and experimenting, but
maybe somebody can cut that short, in case my current question doesn't
make much sense and is thus easily answered:

    int a[NJ][NI];

    #pragma acc loop collapse(2)
    for (int j = 0; j < N_J; ++j)
      for (int i = 0; i < N_I; ++i)
        a[j][i] = 0;

Without "-fopenacc" (thus with the pragma ignored), this does vectorize
(for the x86_64 target, for example, without OpenACC code offloading),
and it also vectorizes with "-fopenacc" enabled but the "collapse(2)"
clause removed and another "#pragma acc loop" added in front of the
inner "i" loop instead.  But with the "collapse(2)" clause in effect,
these two nested loops get, well, "collapse"d by omp-expand into one:

    for (int tmp = 0; tmp < N_J * N_I; ++tmp)
      {
        int j = tmp / N_I;
        int i = tmp % N_I;
        a[j][i] = 0;
      }

This does not vectorize, because scalar evolution runs into the
unhandled (chrec_dont_know) TRUNC_DIV_EXPR and TRUNC_MOD_EXPR in
gcc/tree-scalar-evolution.c:interpret_rhs_expression.  Do I have a chance
of teaching it to handle these without big effort?


If that's not reasonable, I shall look for other options to address the
problem that currently vectorization gets pessimized by "-fopenacc" and
in particular the code rewriting for the "collapse" clause.


By the way, the problem can similarly be demonstrated with an OpenMP
example, where the inner loop's code likewise no longer vectorizes when
such a "collapse" clause is present.  (But I've not considered that case
in any more detail; Jakub CCed in case that's something to look into?  I
don't know how OpenMP threads' loop iterations are meant to interact with
OpenMP SIMD, basically.)


Hmm, and without any OpenACC/OpenMP etc., actually the same problem is
also present when running the following code through the vectorizer:

    for (int tmp = 0; tmp < N_J * N_I; ++tmp)
      {
        int j = tmp / N_I;
        int i = tmp % N_I;
        a[j][i] = 0;
      }

... whereas the following variant (obviously) does vectorize:

    int a[NJ * NI];

    for (int tmp = 0; tmp < N_J * N_I; ++tmp)
      a[tmp] = 0;

Hmm.  Linearization.  From a quick search, I found some 2010 work by
Sebastian Pop on that topic, in the Graphite context
(gcc/graphite-flattening.c), but that got pulled out again in 2012.
(I have not yet looked up the history, and have not yet looked whether
that'd be relevant here at all -- and we're not using Graphite here.)

Regarding that, am I missing something obvious?


Grüße
 Thomas

Re: Can support TRUNC_DIV_EXPR, TRUNC_MOD_EXPR in GCC vectorization/scalar evolution -- and/or linearization?

Marc Glisse
On Fri, 12 Oct 2018, Thomas Schwinge wrote:

> Hmm, and without any OpenACC/OpenMP etc., actually the same problem is
> also present when running the following code through the vectorizer:
>
>    for (int tmp = 0; tmp < N_J * N_I; ++tmp)
>      {
>        int j = tmp / N_I;
>        int i = tmp % N_I;
>        a[j][i] = 0;
>      }
>
> ... whereas the following variant (obviously) does vectorize:
>
>    int a[NJ * NI];
>
>    for (int tmp = 0; tmp < N_J * N_I; ++tmp)
>      a[tmp] = 0;

I had a quick look at the difference, and a[j][i] remains in this form
throughout optimization. If I write instead *((*(a+j))+i) = 0; I get

   j_10 = tmp_17 / 1025;
   i_11 = tmp_17 % 1025;
   _1 = (long unsigned int) j_10;
   _2 = _1 * 1025;
   _3 = (sizetype) i_11;
   _4 = _2 + _3;

or for a power of 2

   j_10 = tmp_17 >> 10;
   i_11 = tmp_17 & 1023;
   _1 = (long unsigned int) j_10;
   _2 = _1 * 1024;
   _3 = (sizetype) i_11;
   _4 = _2 + _3;

and in both cases we fail to notice that _4 = (sizetype) tmp_17; (at least
I think that's true).

So there are missing match.pd transformations in addition to whatever
scev/ivdep/other work is needed.

--
Marc Glisse

Re: Can support TRUNC_DIV_EXPR, TRUNC_MOD_EXPR in GCC vectorization/scalar evolution -- and/or linearization?

Jakub Jelinek
In reply to this post by Thomas Schwinge
On Fri, Oct 12, 2018 at 07:35:09PM +0200, Thomas Schwinge wrote:
>     int a[NJ][NI];
>
>     #pragma acc loop collapse(2)
>     for (int j = 0; j < N_J; ++j)
>       for (int i = 0; i < N_I; ++i)
>         a[j][i] = 0;

For e.g.
int a[128][128];

void
foo (int m, int n)
{
  #pragma omp for simd collapse(2)
  for (int i = 0; i < m; i++)
    for (int j = 0; j < n; j++)
      a[i][j]++;
}
we emit in the inner loop:
  <bb 8> :
  i = i.0;
  j = j.1;
  _1 = a[i][j];
  _2 = _1 + 1;
  a[i][j] = _2;
  .iter.4 = .iter.4 + 1;
  j.1 = j.1 + 1;
  D.2912 = j.1 < n.7 ? 0 : 1;
  i.0 = D.2912 + i.0;
  j.1 = j.1 < n.7 ? j.1 : 0;
 
  <bb 9> :
  if (.iter.4 < D.2902)
    goto <bb 8>; [87.50%]
  else
    goto <bb 10>; [12.50%]
to make it more vectorization friendly (though, in this particular case it
isn't vectorized either) and not do the expensive % and / operations inside
of the inner loop.  Without -fopenmp it does vectorize only the inner loop,
there is no collapse.
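Rendered back into plain C (a rough sketch, not the actual omp-expand output; variable names invented), the lowering keeps i and j as running counters and replaces the per-iteration / and % with a compare, a conditional increment, and a wrap:

```c
enum { M = 128, N = 128 };
int a[M][N];

/* Rough C rendering of the collapsed inner loop above: one linear
   counter iter, j wrapped at n, i bumped when j wraps.  */
void foo_collapsed (int m, int n)
{
  int i = 0, j = 0;
  long iter_count = (long) m * n;           /* D.2902 */
  for (long iter = 0; iter < iter_count; ++iter)
    {
      a[i][j]++;
      j = j + 1;
      i += (j < n) ? 0 : 1;                 /* D.2912; i.0 update */
      j = (j < n) ? j : 0;                  /* wrap j */
    }
}
```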

        Jakub

Re: Can support TRUNC_DIV_EXPR, TRUNC_MOD_EXPR in GCC vectorization/scalar evolution -- and/or linearization?

Richard Biener
On Fri, Oct 12, 2018 at 9:52 PM Jakub Jelinek <[hidden email]> wrote:

>
> On Fri, Oct 12, 2018 at 07:35:09PM +0200, Thomas Schwinge wrote:
> >     int a[NJ][NI];
> >
> >     #pragma acc loop collapse(2)
> >     for (int j = 0; j < N_J; ++j)
> >       for (int i = 0; i < N_I; ++i)
> >         a[j][i] = 0;
>
> For e.g.
> int a[128][128];
>
> void
> foo (int m, int n)
> {
>   #pragma omp for simd collapse(2)
>   for (int i = 0; i < m; i++)
>     for (int j = 0; j < n; j++)
>       a[i][j]++;
> }
> we emit in the inner loop:
>   <bb 8> :
>   i = i.0;
>   j = j.1;
>   _1 = a[i][j];
>   _2 = _1 + 1;
>   a[i][j] = _2;
>   .iter.4 = .iter.4 + 1;
>   j.1 = j.1 + 1;
>   D.2912 = j.1 < n.7 ? 0 : 1;
>   i.0 = D.2912 + i.0;
>   j.1 = j.1 < n.7 ? j.1 : 0;
>
>   <bb 9> :
>   if (.iter.4 < D.2902)
>     goto <bb 8>; [87.50%]
>   else
>     goto <bb 10>; [12.50%]
> to make it more vectorization friendly (though, in this particular case it
> isn't vectorized either) and not do the expensive % and / operations inside
> of the inner loop.  Without -fopenmp it does vectorize only the inner loop,
> there is no collapse.

Yeah.  Note this still makes the IVs not analyzable since i now effectively
becomes wrapping in the inner loop.  For some special values we might
get away with a wrapping CHREC in a bit-precision type but we cannot
represent wrapping at some (possibly non-constant) value.

So - collapsing loops is a bad idea.  Why's that done anyways?

Richard.

>
>         Jakub

Re: Can support TRUNC_DIV_EXPR, TRUNC_MOD_EXPR in GCC vectorization/scalar evolution -- and/or linearization?

Jakub Jelinek
On Mon, Oct 15, 2018 at 10:55:26AM +0200, Richard Biener wrote:
> Yeah.  Note this still makes the IVs not analyzable since i now effectively
> becomes wrapping in the inner loop.  For some special values we might
> get away with a wrapping CHREC in a bit-precision type but we cannot
> represent wrapping at some (possibly non-constant) value.
>
> So - collapsing loops is a bad idea.  Why's that done anyways?

Because the standards (both OpenMP and OpenACC) mandate it if one uses
collapse(2) or more.  The semantics is that that many nested loops form
one larger iteration space, which is then handled according to the rules
of the particular construct.  Sometimes it can be very much beneficial,
sometimes less so, but e.g. with OpenMP the user has the option to say
what they want.  They can e.g. do:
  #pragma omp distribute
  for (int i = 0; i < M; i++)
    #pragma omp parallel for
    for (int j = 0; j < N; j++)
      #pragma omp simd
      for (int k = 0; k < O; k++)
        do_something (i, j, k);
and that way distribute the outermost loop, parallelize the middle one and
vectorize the innermost one, or they can do:
  #pragma omp distribute parallel for simd collapse (3)
  for (int i = 0; i < M; i++)
    for (int j = 0; j < N; j++)
      for (int k = 0; k < O; k++)
        do_something (i, j, k);
and let the implementation split the M x N x O iteration space itself (or
use clauses to say how exactly it is done).  Say if O is very large and N
small and there are many cores, it might be more beneficial to parallelize
it more, etc.
If we come up with some way to help the vectorizer with the collapsed loop,
whether in a form of some loop flags, or internal fns, whatever, I'm all for
it.

        Jakub

Re: Can support TRUNC_DIV_EXPR, TRUNC_MOD_EXPR in GCC vectorization/scalar evolution -- and/or linearization?

Richard Biener
On Mon, Oct 15, 2018 at 11:11 AM Jakub Jelinek <[hidden email]> wrote:

>
> On Mon, Oct 15, 2018 at 10:55:26AM +0200, Richard Biener wrote:
> > Yeah.  Note this still makes the IVs not analyzable since i now effectively
> > becomes wrapping in the inner loop.  For some special values we might
> > get away with a wrapping CHREC in a bit-precision type but we cannot
> > represent wrapping at some (possibly non-constant) value.
> >
> > So - collapsing loops is a bad idea.  Why's that done anyways?
>
> Because the standards (both OpenMP and OpenACC) mandate that if one uses
> collapse(2) or more.  The semantics is that that many nested loops form a
> larger iteration space then and that is then handled according to the rules
> of the particular construct.  Sometimes it can be very much beneficial,
> sometimes less so, but e.g. with OpenMP user has the option to say what they
> want.  They can e.g. do:
>   #pragma omp distribute
>   for (int i = 0; i < M; i++)
>     #pragma omp parallel for
>     for (int j = 0; j < N; j++)
>       #pragma omp simd
>       for (int k = 0; k < O; k++)
>         do_something (i, j, k);
> and that way distribute the outermost loop, parallelize the middle one and
> vectorize the innermost one, or they can do:
>   #pragma omp distribute parallel for simd collapse (3)
>   for (int i = 0; i < M; i++)
>     for (int j = 0; j < N; j++)
>       for (int k = 0; k < O; k++)
>         do_something (i, j, k);
> and let the implementation split the M x N x O iteration space itself (or
> use clauses to say how exactly it is done).  Say if O is very large and N
> small and there are many cores, it might be more beneficial to parallelize
> it more, etc.
> If we come up with some way to help the vectorizer with the collapsed loop,
> whether in a form of some loop flags, or internal fns, whatever, I'm all for
> it.

But isn't _actual_ collapsing an implementation detail?  That is, isn't it
enough to interpret clauses in terms of the collapse result?

That is, can we delay the actual collapsing until after vectorization
for example?

Richard.

>
>         Jakub

Re: Can support TRUNC_DIV_EXPR, TRUNC_MOD_EXPR in GCC vectorization/scalar evolution -- and/or linearization?

Jakub Jelinek
On Mon, Oct 15, 2018 at 11:30:56AM +0200, Richard Biener wrote:
> But isn't _actual_ collapsing an implementation detail?

No, it is required by the standard and in many cases it is very much
observable.
#pragma omp parallel for schedule(nonmonotonic: static, 23) collapse (2)
for (int i = 0; i < 64; i++)
  for (int j = 0; j < 16; j++)
    a[i][j] = omp_get_thread_num ();
The standard says that from the logical iteration space 64 x 16, the
first 23 iterations go to the first thread (i.e. i=0, j=0..15 and i=1,
j=0..6), then the next 23 iterations go to the second thread, etc.
In other constructs, e.g. the new loop construct, it is a request to
distribute, parallelize and vectorize as much as possible, with an
optional guarantee of no cross-iteration dependencies at all, but even in
that case using the source loops might not always be a win, e.g. the loop
nest could be 5 loops deep and the iteration space might be diagonal or
otherwise not exactly rectangular.

> That is, can we delay the actual collapsing until after vectorization
> for example?

No.  We can come up with some way to propagate some of the original info to
the vectorizer if it helps (or teach the vectorizer to recognize whatever we
produce), but the mandatory transformation needs to be done immediately,
before optimizations make it impossible.

        Jakub

Re: Can support TRUNC_DIV_EXPR, TRUNC_MOD_EXPR in GCC vectorization/scalar evolution -- and/or linearization?

Richard Biener-2
On Mon, Oct 15, 2018 at 11:45 AM Jakub Jelinek <[hidden email]> wrote:

>
> On Mon, Oct 15, 2018 at 11:30:56AM +0200, Richard Biener wrote:
> > But isn't _actual_ collapsing an implementation detail?
>
> No, it is required by the standard and in many cases it is very much
> observable.
> #pragma omp parallel for schedule(nonmonotonic: static, 23) collapse (2)
> for (int i = 0; i < 64; i++)
>   for (int j = 0; j < 16; j++)
>     a[i][j] = omp_get_thread_num ();
> The standard says that from the logical iteration space 64 x 16,
> first 23 iterations go to the first thread (i.e. i=0, j=0..15 and i=1,
> j=0..6), then the next 23 iterations go to the second thread, etc.
> In other constructs, e.g. the new loop construct, it is a request to
> distribute, parallelize and vectorize as much as possible with optional
> guarantee of no cross-iteration dependencies at all, but even in that case
> using the source loops might not be always a win, e.g. the loopnest could be
> 5 loops and the iteration space might be diagonal or other not exactly
> rectangular.

But then you could do

  for (int i = si1; i < n1; i++)
    for (int j = sj1; j < m1; j++)
      {
        a[i][j] = omp_get_thread_num ();
      }
   if (m_tail1)
     for (int j = 0; j < m_tail1; j++)
    ...

with appropriate start/end for the i/j loop and the "epilogue" loop?

> > That is, can we delay the actual collapsing until after vectorization
> > for example?
>
> No.  We can come up with some way to propagate some of the original info to
> the vectorizer if it helps (or teach vectorizer to recognize whatever we
> produce), but the mandatory transformation needs to be done
> immediately before optimizations make those impossible.

The issue is that with refs like

  a[i % m] = a[(i + 1) % m];

you do not know whether you have a backwards or forward dependence
so I do not see how you could perform loop vectorization.  That implies
that one option might be to have the OMP lowering unroll & interleave
loops when asked for SIMD so that the SLP vectorizer could pick up
things?

But then how is safelen() defined in the context of collapse()?

Richard.

>         Jakub

Re: Can support TRUNC_DIV_EXPR, TRUNC_MOD_EXPR in GCC vectorization/scalar evolution -- and/or linearization?

Sebastian Pop
In reply to this post by Marc Glisse
On Fri, Oct 12, 2018 at 2:14 PM Marc Glisse <[hidden email]> wrote:

> On Fri, 12 Oct 2018, Thomas Schwinge wrote:
>
> > Hmm, and without any OpenACC/OpenMP etc., actually the same problem is
> > also present when running the following code through the vectorizer:
> >
> >    for (int tmp = 0; tmp < N_J * N_I; ++tmp)
> >      {
> >        int j = tmp / N_I;
> >        int i = tmp % N_I;
> >        a[j][i] = 0;
> >      }
> >
> > ... whereas the following variant (obviously) does vectorize:
> >
> >    int a[NJ * NI];
> >
> >    for (int tmp = 0; tmp < N_J * N_I; ++tmp)
> >      a[tmp] = 0;
>
> I had a quick look at the difference, and a[j][i] remains in this form
> throughout optimization. If I write instead *((*(a+j))+i) = 0; I get
>
>    j_10 = tmp_17 / 1025;
>    i_11 = tmp_17 % 1025;
>    _1 = (long unsigned int) j_10;
>    _2 = _1 * 1025;
>    _3 = (sizetype) i_11;
>    _4 = _2 + _3;
>
> or for a power of 2
>
>    j_10 = tmp_17 >> 10;
>    i_11 = tmp_17 & 1023;
>    _1 = (long unsigned int) j_10;
>    _2 = _1 * 1024;
>    _3 = (sizetype) i_11;
>    _4 = _2 + _3;
>
> and in both cases we fail to notice that _4 = (sizetype) tmp_17; (at least
> I think that's true).
>
>
If this folding is correct, the dependence analysis would not have
to handle array accesses with div and mod, and it would be able
to classify the loop as parallel, which would enable vectorization.


> So there are missing match.pd transformations in addition to whatever
> scev/ivdep/other work is needed.
>
> --
> Marc Glisse
>

"match.pd" (was: Can support TRUNC_DIV_EXPR, TRUNC_MOD_EXPR in GCC vectorization/scalar evolution -- and/or linearization?)

Thomas Schwinge
In reply to this post by Marc Glisse
Hi!

Thanks for all your comments already!  I continued looking into this for a
bit (but then got interrupted by a higher-priority task).  Regarding this
one specifically:

On Fri, 12 Oct 2018 21:14:11 +0200, Marc Glisse <[hidden email]> wrote:

> On Fri, 12 Oct 2018, Thomas Schwinge wrote:
>
> > Hmm, and without any OpenACC/OpenMP etc., actually the same problem is
> > also present when running the following code through the vectorizer:
> >
> >    for (int tmp = 0; tmp < N_J * N_I; ++tmp)
> >      {
> >        int j = tmp / N_I;
> >        int i = tmp % N_I;
> >        a[j][i] = 0;
> >      }
> >
> > ... whereas the following variant (obviously) does vectorize:
> >
> >    int a[NJ * NI];
> >
> >    for (int tmp = 0; tmp < N_J * N_I; ++tmp)
> >      a[tmp] = 0;
>
> I had a quick look at the difference, and a[j][i] remains in this form
> throughout optimization. If I write instead *((*(a+j))+i) = 0; I get
>
>    j_10 = tmp_17 / 1025;
>    i_11 = tmp_17 % 1025;
>    _1 = (long unsigned int) j_10;
>    _2 = _1 * 1025;
>    _3 = (sizetype) i_11;
>    _4 = _2 + _3;
>
> or for a power of 2
>
>    j_10 = tmp_17 >> 10;
>    i_11 = tmp_17 & 1023;
>    _1 = (long unsigned int) j_10;
>    _2 = _1 * 1024;
>    _3 = (sizetype) i_11;
>    _4 = _2 + _3;
>
> and in both cases we fail to notice that _4 = (sizetype) tmp_17; (at least
> I think that's true).
>
> So there are missing match.pd transformations in addition to whatever
> scev/ivdep/other work is needed.

With a very simplistic "match.pd" rule (not yet any special cases
checking etc.):

diff --git gcc/match.pd gcc/match.pd
index b36d7ccb5dc3..4c23116308da 100644
--- gcc/match.pd
+++ gcc/match.pd
@@ -5126,3 +5126,28 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
     { wide_int_to_tree (sizetype, off); })
       { swap_p ? @0 : @2; }))
     { rhs_tree; })))))))))
+
+/* Given:
+
+       j = in / N_I
+       i = in % N_I
+
+   ..., fold:
+
+       out = j * N_I + i
+
+   ..., into:
+
+       out = in
+*/
+
+/* As long as only considering N_I being INTEGER_CST (which are always second
+   argument?), probably don't need ":c" variants?  */
+
+(simplify
+ (plus:c
+  (mult:c
+   (trunc_div @0 INTEGER_CST@1)
+   INTEGER_CST@1)
+  (trunc_mod @0 INTEGER_CST@1))
+ (convert @0))

..., the original code:

    int f1(int in)
    {
      int j = in / N_I;
      int i = in % N_I;
   
      int out = j * N_I + i;
   
      return out;
    }

... gets simplified from ("div-mod-0.c.027t.objsz1"):

    f1 (int in)
    {
      int out;
      int i;
      int j;
      int _1;
      int _6;
   
      <bb 2> :
      gimple_assign <trunc_div_expr, j_3, in_2(D), 100, NULL>
      gimple_assign <trunc_mod_expr, i_4, in_2(D), 100, NULL>
      gimple_assign <mult_expr, _1, j_3, 100, NULL>
      gimple_assign <plus_expr, out_5, i_4, _1, NULL>
      gimple_assign <ssa_name, _6, out_5, NULL, NULL>
      gimple_return <_6>
   
    }

... to ("div-mod-0.c.028t.ccp1"):

    f1 (int in)
    {
      int out;
      int i;
      int j;
      int _1;
   
      <bb 2> :
      gimple_assign <trunc_div_expr, j_3, in_2(D), 100, NULL>
      gimple_assign <trunc_mod_expr, i_4, in_2(D), 100, NULL>
      gimple_assign <mult_expr, _1, j_3, 100, NULL>
      gimple_return <in_2(D)>
   
    }

(The three dead "gimple_assign"s get eliminated later on.)

So, that works.

However, it doesn't work yet for the original construct that I'd run
into, which looks like this:

    [...]
    int i;
    int j;
    [...]
    signed int .offset.5_2;
    [...]
    unsigned int .offset.7_23;
    unsigned int .iter.0_24;
    unsigned int _25;
    unsigned int _26;
    [...]
    unsigned int .iter.0_32;
    [...]
   
    <bb 9> :
    # gimple_phi <.offset.5_2, .offset.5_21(8), .offset.5_30(9)>
    gimple_assign <nop_expr, .offset.7_23, .offset.5_2, NULL, NULL>
    gimple_assign <plus_expr, .iter.0_24, 0, .offset.7_23, NULL>
    gimple_assign <trunc_div_expr, _25, .iter.0_24, 100, NULL>
    gimple_assign <trunc_mod_expr, _26, .iter.0_24, 100, NULL>
    gimple_assign <nop_expr, i_27, _26, NULL, NULL>
    gimple_assign <nop_expr, j_28, _25, NULL, NULL>
    gimple_assign <integer_cst, a[j_28][i_27], 123, NULL, NULL>
    [...]

Resolving the "a[j][i] = 123" form is something we'll need to look into later.

As Marc noted above, with that changed into "*(*(a + j) + i) = 123", we
get:

    [...]
    int i;
    int j;
    long unsigned int _1;
    long unsigned int _2;
    sizetype _3;
    sizetype _4;
    sizetype _5;
    int * _6;
    [...]
    signed int .offset.5_8;
    [...]
    unsigned int .offset.7_29;
    unsigned int .iter.0_30;
    unsigned int _31;
    unsigned int _32;
    [...]
   
    <bb 9> :
    # gimple_phi <.offset.5_8, .offset.5_27(8), .offset.5_36(9)>
    gimple_assign <nop_expr, .offset.7_29, .offset.5_8, NULL, NULL>
    gimple_assign <plus_expr, .iter.0_30, 0, .offset.7_29, NULL>
    gimple_assign <trunc_div_expr, _31, .iter.0_30, 100, NULL>
    gimple_assign <trunc_mod_expr, _32, .iter.0_30, 100, NULL>
    gimple_assign <nop_expr, i_33, _32, NULL, NULL>
    gimple_assign <nop_expr, j_34, _31, NULL, NULL>
    gimple_assign <nop_expr, _1, j_34, NULL, NULL>
    gimple_assign <mult_expr, _2, _1, 100, NULL>
    gimple_assign <nop_expr, _3, i_33, NULL, NULL>
    gimple_assign <plus_expr, _4, _2, _3, NULL>
    gimple_assign <mult_expr, _5, _4, 4, NULL>
    gimple_assign <pointer_plus_expr, _6, &a, _5, NULL>
    gimple_assign <integer_cst, *_6, 123, NULL, NULL>
    [...]

Here, unless I'm confused, "_4" is supposed to be equal to ".iter.0_30",
but "match.pd" doesn't agree yet.  Note the many "nop_expr"s here, which
I suppose I have not yet figured out how to handle?  I tried some things
but couldn't get it to work.  Apparently the existing instances of
"(match (nop_convert @0)" and the "Basic strip-useless-type-conversions /
strip_nops" rule also don't handle these; should they?  Or are, in fact,
the types here mixed up too much?

I hope to get some time again soon to continue looking into this, but if
anybody got any ideas, I'm all ears.


Grüße
 Thomas

Re: "match.pd" (was: Can support TRUNC_DIV_EXPR, TRUNC_MOD_EXPR in GCC vectorization/scalar evolution -- and/or linearization?)

Richard Biener
On Mon, Oct 22, 2018 at 6:35 PM Thomas Schwinge <[hidden email]> wrote:

>
> Hi!
>
> Thanks for all your comments already!  I continued looking into this for a
> bit (but then got interrupted by a higher-priority task).  Regarding this
> one specifically:
>
> On Fri, 12 Oct 2018 21:14:11 +0200, Marc Glisse <[hidden email]> wrote:
> > On Fri, 12 Oct 2018, Thomas Schwinge wrote:
> >
> > > Hmm, and without any OpenACC/OpenMP etc., actually the same problem is
> > > also present when running the following code through the vectorizer:
> > >
> > >    for (int tmp = 0; tmp < N_J * N_I; ++tmp)
> > >      {
> > >        int j = tmp / N_I;
> > >        int i = tmp % N_I;
> > >        a[j][i] = 0;
> > >      }
> > >
> > > ... whereas the following variant (obviously) does vectorize:
> > >
> > >    int a[NJ * NI];
> > >
> > >    for (int tmp = 0; tmp < N_J * N_I; ++tmp)
> > >      a[tmp] = 0;
> >
> > I had a quick look at the difference, and a[j][i] remains in this form
> > throughout optimization. If I write instead *((*(a+j))+i) = 0; I get
> >
> >    j_10 = tmp_17 / 1025;
> >    i_11 = tmp_17 % 1025;
> >    _1 = (long unsigned int) j_10;
> >    _2 = _1 * 1025;
> >    _3 = (sizetype) i_11;
> >    _4 = _2 + _3;
> >
> > or for a power of 2
> >
> >    j_10 = tmp_17 >> 10;
> >    i_11 = tmp_17 & 1023;
> >    _1 = (long unsigned int) j_10;
> >    _2 = _1 * 1024;
> >    _3 = (sizetype) i_11;
> >    _4 = _2 + _3;
> >
> > and in both cases we fail to notice that _4 = (sizetype) tmp_17; (at least
> > I think that's true).
> >
> > So there are missing match.pd transformations in addition to whatever
> > scev/ivdep/other work is needed.
>
> With a very simplistic "match.pd" rule (not yet any special cases
> checking etc.):
>
> diff --git gcc/match.pd gcc/match.pd
> index b36d7ccb5dc3..4c23116308da 100644
> --- gcc/match.pd
> +++ gcc/match.pd
> @@ -5126,3 +5126,28 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
>                             { wide_int_to_tree (sizetype, off); })
>               { swap_p ? @0 : @2; }))
>             { rhs_tree; })))))))))
> +
> +/* Given:
> +
> +       j = in / N_I
> +       i = in % N_I
> +
> +   ..., fold:
> +
> +       out = j * N_I + i
> +
> +   ..., into:
> +
> +       out = in
> +*/
> +
> +/* As long as only considering N_I being INTEGER_CST (which are always second
> +   argument?), probably don't need ":c" variants?  */
> +
> +(simplify
> + (plus:c
> +  (mult:c
> +   (trunc_div @0 INTEGER_CST@1)
> +   INTEGER_CST@1)
> +  (trunc_mod @0 INTEGER_CST@1))
> + (convert @0))
>
> ..., the original code:
>
>     int f1(int in)
>     {
>       int j = in / N_I;
>       int i = in % N_I;
>
>       int out = j * N_I + i;
>
>       return out;
>     }
>
> ... gets simplified from ("div-mod-0.c.027t.objsz1"):
>
>     f1 (int in)
>     {
>       int out;
>       int i;
>       int j;
>       int _1;
>       int _6;
>
>       <bb 2> :
>       gimple_assign <trunc_div_expr, j_3, in_2(D), 100, NULL>
>       gimple_assign <trunc_mod_expr, i_4, in_2(D), 100, NULL>
>       gimple_assign <mult_expr, _1, j_3, 100, NULL>
>       gimple_assign <plus_expr, out_5, i_4, _1, NULL>
>       gimple_assign <ssa_name, _6, out_5, NULL, NULL>
>       gimple_return <_6>
>
>     }
>
> ... to ("div-mod-0.c.028t.ccp1"):
>
>     f1 (int in)
>     {
>       int out;
>       int i;
>       int j;
>       int _1;
>
>       <bb 2> :
>       gimple_assign <trunc_div_expr, j_3, in_2(D), 100, NULL>
>       gimple_assign <trunc_mod_expr, i_4, in_2(D), 100, NULL>
>       gimple_assign <mult_expr, _1, j_3, 100, NULL>
>       gimple_return <in_2(D)>
>
>     }
>
> (The three dead "gimple_assign"s get eliminated later on.)
>
> So, that works.
>
> However, it doesn't work yet for the original construct that I'd run
> into, which looks like this:
>
>     [...]
>     int i;
>     int j;
>     [...]
>     signed int .offset.5_2;
>     [...]
>     unsigned int .offset.7_23;
>     unsigned int .iter.0_24;
>     unsigned int _25;
>     unsigned int _26;
>     [...]
>     unsigned int .iter.0_32;
>     [...]
>
>     <bb 9> :
>     # gimple_phi <.offset.5_2, .offset.5_21(8), .offset.5_30(9)>
>     gimple_assign <nop_expr, .offset.7_23, .offset.5_2, NULL, NULL>
>     gimple_assign <plus_expr, .iter.0_24, 0, .offset.7_23, NULL>
>     gimple_assign <trunc_div_expr, _25, .iter.0_24, 100, NULL>
>     gimple_assign <trunc_mod_expr, _26, .iter.0_24, 100, NULL>
>     gimple_assign <nop_expr, i_27, _26, NULL, NULL>
>     gimple_assign <nop_expr, j_28, _25, NULL, NULL>
>     gimple_assign <integer_cst, a[j_28][i_27], 123, NULL, NULL>
>     [...]
>
> Resolving the "a[j][i] = 123" we'll need to look into later.
>
> As Marc noted above, with that changed into "*(*(a + j) + i) = 123", we
> get:
>
>     [...]
>     int i;
>     int j;
>     long unsigned int _1;
>     long unsigned int _2;
>     sizetype _3;
>     sizetype _4;
>     sizetype _5;
>     int * _6;
>     [...]
>     signed int .offset.5_8;
>     [...]
>     unsigned int .offset.7_29;
>     unsigned int .iter.0_30;
>     unsigned int _31;
>     unsigned int _32;
>     [...]
>
>     <bb 9> :
>     # gimple_phi <.offset.5_8, .offset.5_27(8), .offset.5_36(9)>
>     gimple_assign <nop_expr, .offset.7_29, .offset.5_8, NULL, NULL>
>     gimple_assign <plus_expr, .iter.0_30, 0, .offset.7_29, NULL>
>     gimple_assign <trunc_div_expr, _31, .iter.0_30, 100, NULL>
>     gimple_assign <trunc_mod_expr, _32, .iter.0_30, 100, NULL>
>     gimple_assign <nop_expr, i_33, _32, NULL, NULL>
>     gimple_assign <nop_expr, j_34, _31, NULL, NULL>
>     gimple_assign <nop_expr, _1, j_34, NULL, NULL>
>     gimple_assign <mult_expr, _2, _1, 100, NULL>
>     gimple_assign <nop_expr, _3, i_33, NULL, NULL>
>     gimple_assign <plus_expr, _4, _2, _3, NULL>
>     gimple_assign <mult_expr, _5, _4, 4, NULL>
>     gimple_assign <pointer_plus_expr, _6, &a, _5, NULL>
>     gimple_assign <integer_cst, *_6, 123, NULL, NULL>
>     [...]
>
> Here, unless I'm confused, "_4" is supposed to be equal to ".iter.0_30",
> but "match.pd" doesn't agree yet.  Note the many "nop_expr"s here, which
> I have not yet figured out how to handle, I suppose?  I tried some things
> but couldn't get it to work.  Apparently the existing instances of
> "(match (nop_convert @0)" and "Basic strip-useless-type-conversions /
> strip_nops" rule also don't handle these; should they?  Or, are in fact
> here the types mixed up too much?

There are sign-extensions to larger types involved which are not no-op
ones.

So it's a little bit more complicated than the case you are catching
with the pattern.

Richard.

> I hope to get some time again soon to continue looking into this, but if
> anybody got any ideas, I'm all ears.
>
>
> Grüße
>  Thomas

Re: "match.pd" (was: Can support TRUNC_DIV_EXPR, TRUNC_MOD_EXPR in GCC vectorization/scalar evolution -- and/or linearization?)

Marc Glisse
In reply to this post by Thomas Schwinge
(resent because of mail issues on my end)

On Mon, 22 Oct 2018, Thomas Schwinge wrote:

>> I had a quick look at the difference, and a[j][i] remains in this form
>> throughout optimization. If I write instead *((*(a+j))+i) = 0; I get
>>
>>    j_10 = tmp_17 / 1025;
>>    i_11 = tmp_17 % 1025;
>>    _1 = (long unsigned int) j_10;
>>    _2 = _1 * 1025;
>>    _3 = (sizetype) i_11;
>>    _4 = _2 + _3;
>>
>> or for a power of 2
>>
>>    j_10 = tmp_17 >> 10;
>>    i_11 = tmp_17 & 1023;
>>    _1 = (long unsigned int) j_10;
>>    _2 = _1 * 1024;
>>    _3 = (sizetype) i_11;
>>    _4 = _2 + _3;
>>
>> and in both cases we fail to notice that _4 = (sizetype) tmp_17; (at least
>> I think that's true).
>>
>> So there are missing match.pd transformations in addition to whatever
>> scev/ivdep/other work is needed.
>
> With a very simplistic "match.pd" rule (not yet any special cases
> checking etc.):
>
> diff --git gcc/match.pd gcc/match.pd
> index b36d7ccb5dc3..4c23116308da 100644
> --- gcc/match.pd
> +++ gcc/match.pd
> @@ -5126,3 +5126,28 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
>    { wide_int_to_tree (sizetype, off); })
>      { swap_p ? @0 : @2; }))
>    { rhs_tree; })))))))))
> +
> +/* Given:
> +
> +       j = in / N_I
> +       i = in % N_I
> +
> +   ..., fold:
> +
> +       out = j * N_I + i
> +
> +   ..., into:
> +
> +       out = in
> +*/
> +
> +/* As long as only considering N_I being INTEGER_CST (which are always second
> +   argument?), probably don't need ":c" variants?  */
> +
> +(simplify
> + (plus:c
> +  (mult:c
> +   (trunc_div @0 INTEGER_CST@1)
> +   INTEGER_CST@1)
> +  (trunc_mod @0 INTEGER_CST@1))
> + (convert @0))

You should only specify INTEGER_CST@1 on the first occurrence; the others
can be just @1.  (You may be interested in @@1 at some point, but that
gets tricky.)

> ..., the original code:
>
>    int f1(int in)
>    {
>      int j = in / N_I;
>      int i = in % N_I;
>
>      int out = j * N_I + i;
>
>      return out;
>    }
>
> ... gets simplified from ("div-mod-0.c.027t.objsz1"):
>
>    f1 (int in)
>    {
>      int out;
>      int i;
>      int j;
>      int _1;
>      int _6;
>
>      <bb 2> :
>      gimple_assign <trunc_div_expr, j_3, in_2(D), 100, NULL>
>      gimple_assign <trunc_mod_expr, i_4, in_2(D), 100, NULL>
>      gimple_assign <mult_expr, _1, j_3, 100, NULL>
>      gimple_assign <plus_expr, out_5, i_4, _1, NULL>
>      gimple_assign <ssa_name, _6, out_5, NULL, NULL>
>      gimple_return <_6>
>
>    }
>
> ... to ("div-mod-0.c.028t.ccp1"):
>
>    f1 (int in)
>    {
>      int out;
>      int i;
>      int j;
>      int _1;
>
>      <bb 2> :
>      gimple_assign <trunc_div_expr, j_3, in_2(D), 100, NULL>
>      gimple_assign <trunc_mod_expr, i_4, in_2(D), 100, NULL>
>      gimple_assign <mult_expr, _1, j_3, 100, NULL>
>      gimple_return <in_2(D)>
>
>    }
>
> (The three dead "gimple_assign"s get eliminated later on.)
>
> So, that works.
>
> However, it doesn't work yet for the original construct that I'd run
> into, which looks like this:
>
>    [...]
>    int i;
>    int j;
>    [...]
>    signed int .offset.5_2;
>    [...]
>    unsigned int .offset.7_23;
>    unsigned int .iter.0_24;
>    unsigned int _25;
>    unsigned int _26;
>    [...]
>    unsigned int .iter.0_32;
>    [...]
>
>    <bb 9> :
>    # gimple_phi <.offset.5_2, .offset.5_21(8), .offset.5_30(9)>
>    gimple_assign <nop_expr, .offset.7_23, .offset.5_2, NULL, NULL>
>    gimple_assign <plus_expr, .iter.0_24, 0, .offset.7_23, NULL>
>    gimple_assign <trunc_div_expr, _25, .iter.0_24, 100, NULL>
>    gimple_assign <trunc_mod_expr, _26, .iter.0_24, 100, NULL>
>    gimple_assign <nop_expr, i_27, _26, NULL, NULL>
>    gimple_assign <nop_expr, j_28, _25, NULL, NULL>
>    gimple_assign <integer_cst, a[j_28][i_27], 123, NULL, NULL>
>    [...]
>
> Resolving the "a[j][i] = 123" we'll need to look into later.
>
> As Marc noted above, with that changed into "*(*(a + j) + i) = 123", we
> get:
>
>    [...]
>    int i;
>    int j;
>    long unsigned int _1;
>    long unsigned int _2;
>    sizetype _3;
>    sizetype _4;
>    sizetype _5;
>    int * _6;
>    [...]
>    signed int .offset.5_8;
>    [...]
>    unsigned int .offset.7_29;
>    unsigned int .iter.0_30;
>    unsigned int _31;
>    unsigned int _32;
>    [...]
>
>    <bb 9> :
>    # gimple_phi <.offset.5_8, .offset.5_27(8), .offset.5_36(9)>
>    gimple_assign <nop_expr, .offset.7_29, .offset.5_8, NULL, NULL>
>    gimple_assign <plus_expr, .iter.0_30, 0, .offset.7_29, NULL>
>    gimple_assign <trunc_div_expr, _31, .iter.0_30, 100, NULL>
>    gimple_assign <trunc_mod_expr, _32, .iter.0_30, 100, NULL>
>    gimple_assign <nop_expr, i_33, _32, NULL, NULL>
>    gimple_assign <nop_expr, j_34, _31, NULL, NULL>
>    gimple_assign <nop_expr, _1, j_34, NULL, NULL>
>    gimple_assign <mult_expr, _2, _1, 100, NULL>
>    gimple_assign <nop_expr, _3, i_33, NULL, NULL>
>    gimple_assign <plus_expr, _4, _2, _3, NULL>
>    gimple_assign <mult_expr, _5, _4, 4, NULL>
>    gimple_assign <pointer_plus_expr, _6, &a, _5, NULL>
>    gimple_assign <integer_cst, *_6, 123, NULL, NULL>
>    [...]
>
> Here, unless I'm confused, "_4" is supposed to be equal to ".iter.0_30",
> but "match.pd" doesn't agree yet.  I suppose the problem is the many
> "nop_expr"s here, which I have not yet figured out how to handle?  I
> tried some things but couldn't get it to work.  Apparently the existing
> instances of "(match (nop_convert @0)" and the "Basic
> strip-useless-type-conversions / strip_nops" rule also don't handle
> these; should they?  Or are the types here mixed up too much?

"(match (nop_convert @0)" defines a shortcut so that some transformations
can use nop_convert to detect certain specific conversions, but it doesn't
do anything by itself.  "Basic strip-useless-type-conversions" strips
conversions that are *useless*, essentially from a type to the same type.
If you want to handle true conversions, you need to do that explicitly:
see the many transformations that use convert?, convert1?, and convert2?
and specify for which particular conversions the transformation is valid.
Finding the right conditions to detect these conversions is often the most
painful part of writing a match.pd transformation.

> I hope to get some time again soon to continue looking into this, but if
> anybody got any ideas, I'm all ears.

--
Marc Glisse