[SVE] PR86753


Prathamesh Kulkarni-2
Hi,
The attached patch tries to fix PR86753.

For the following test:
void
f1 (int *restrict x, int *restrict y, int *restrict z)
{
  for (int i = 0; i < 100; ++i)
    x[i] = y[i] ? z[i] : 10;
}

The vect dump shows:
  vect_cst__42 = { 0, ... };
  vect_cst__48 = { 0, ... };

  vect__4.7_41 = .MASK_LOAD (vectp_y.5_38, 4B, loop_mask_40);
  _4 = *_3;
  _5 = z_12(D) + _2;
  mask__35.8_43 = vect__4.7_41 != vect_cst__42;
  _35 = _4 != 0;
  vec_mask_and_46 = mask__35.8_43 & loop_mask_40;
  vect_iftmp.11_47 = .MASK_LOAD (vectp_z.9_44, 4B, vec_mask_and_46);
  iftmp.0_13 = 0;
  vect_iftmp.12_50 = VEC_COND_EXPR <vect__4.7_41 != vect_cst__48,
vect_iftmp.11_47, vect_cst__49>;

and the following code-gen:
L2:
        ld1w    z0.s, p2/z, [x1, x3, lsl 2]
        cmpne   p1.s, p3/z, z0.s, #0
        cmpne   p0.s, p2/z, z0.s, #0
        ld1w    z0.s, p0/z, [x2, x3, lsl 2]
        sel     z0.s, p1, z0.s, z1.s

We could reuse vec_mask_and_46 in the vec_cond_expr, since the conditions
vect__4.7_41 != vect_cst__48 and vect__4.7_41 != vect_cst__42
are equivalent, and vect_iftmp.11_47 depends on vect__4.7_41 != vect_cst__48.

I suppose that, in general, for vec_cond_expr <C, T, E>, if T comes from a
masked load that is conditional on C, then we could reuse the load's mask
in the vec_cond_expr?

The patch maintains a hash_map cond_to_vec_mask from <cond, loop_mask>
to vec_mask (with the loop predicate applied).
In prepare_load_store_mask, we record <cond, loop_mask> -> vec_mask & loop_mask,
and in vectorizable_condition, we check whether <cond, loop_mask> exists in
cond_to_vec_mask; if found, the corresponding vec_mask is used as the first
operand of the vec_cond_expr.

<cond, loop_mask> is represented by cond_vmask_key, and the patch adds
tree_cond_ops to represent the condition's operator and operands, coming
either from a cond_expr or from a gimple comparison stmt. If the stmt is
not a comparison, it returns <ne_expr, lhs, 0> and inserts that into
cond_to_vec_mask.
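
To make this concrete, here is a minimal, self-contained C++ sketch of the
mapping (purely illustrative: the real patch uses GCC's hash_map keyed by
cond_vmask_key, and every type and helper below is a simplified stand-in):

#include <map>
#include <optional>
#include <string>
#include <tuple>
#include <utility>

/* Simplified stand-ins for SSA names and comparison codes.  */
typedef std::string ssa_name;
enum class cmp_code { eq, ne };

/* Roughly what tree_cond_ops captures: <code, op0, op1>.  */
typedef std::tuple<cmp_code, ssa_name, ssa_name> cond_key;

/* Roughly cond_to_vec_mask: <cond, loop_mask> -> vec_mask & loop_mask.  */
static std::map<std::pair<cond_key, ssa_name>, ssa_name> cond_to_vec_mask;

/* Called when emitting a masked load/store (cf. prepare_load_store_mask):
   record that "cond & loop_mask" is available as VEC_MASK_AND.  */
void
record_mask (const cond_key &cond, const ssa_name &loop_mask,
             const ssa_name &vec_mask_and)
{
  cond_to_vec_mask[{ cond, loop_mask }] = vec_mask_and;
}

/* Called when vectorizing a COND_EXPR (cf. vectorizable_condition):
   if the same <cond, loop_mask> was already used for a masked load,
   reuse that mask as the VEC_COND_EXPR condition.  */
std::optional<ssa_name>
find_mask (const cond_key &cond, const ssa_name &loop_mask)
{
  auto it = cond_to_vec_mask.find ({ cond, loop_mask });
  if (it != cond_to_vec_mask.end ())
    return it->second;
  return std::nullopt;
}

In the f1 example, the second .MASK_LOAD would record
<<ne, vect__4.7_41, 0>, loop_mask_40> -> vec_mask_and_46, and the same
lookup from the VEC_COND_EXPR then finds vec_mask_and_46 for reuse.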

With the patch, the redundant p1 is eliminated and sel uses p0 for the above test.

For the following test:
void
f2 (int *restrict x, int *restrict y, int *restrict z, int fallback)
{
  for (int i = 0; i < 100; ++i)
    x[i] = y[i] ? z[i] : fallback;
}

the input to the vectorizer has the operands swapped in the cond_expr:
  _36 = _4 != 0;
  iftmp.0_14 = .MASK_LOAD (_5, 32B, _36);
  iftmp.0_8 = _4 == 0 ? fallback_12(D) : iftmp.0_14;

So we need to also check for the inverted condition in cond_to_vec_mask
and, if it is found, swap the vec_cond_expr operands.
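
As a hypothetical sketch of that extra lookup (extending the stand-ins
from the sketch above, again not the actual GCC code):

/* If <cond, loop_mask> itself is not in cond_to_vec_mask, try the
   inverted comparison; on a hit, the then/else operands of the
   VEC_COND_EXPR must be swapped as well.  */
static cmp_code
invert_cmp (cmp_code code)
{
  return code == cmp_code::eq ? cmp_code::ne : cmp_code::eq;
}

std::optional<ssa_name>
find_mask_or_inverse (cond_key cond, const ssa_name &loop_mask,
                      ssa_name &then_op, ssa_name &else_op)
{
  if (std::optional<ssa_name> mask = find_mask (cond, loop_mask))
    return mask;
  std::get<0> (cond) = invert_cmp (std::get<0> (cond));
  if (std::optional<ssa_name> mask = find_mask (cond, loop_mask))
    {
      /* e.g. "_4 == 0 ? fallback : iftmp" becomes
         "vec_mask ? iftmp : fallback".  */
      std::swap (then_op, else_op);
      return mask;
    }
  return std::nullopt;
}

In the f2 example the recorded condition is _4 != 0 while the cond_expr
tests _4 == 0, so only the inverted lookup hits and the fallback and the
loaded value trade places.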
Does the patch look OK so far?

One major issue remaining with the patch is value numbering.
Currently, it does value numbering for the entire function using sccvn
at the start of the vect pass, which is too expensive since we only need
block-based VN. I am looking into that.

Thanks,
Prathamesh

pr86753-5.diff (15K)

Re: [SVE] PR86753

Richard Biener-2
On Wed, Aug 14, 2019 at 5:06 PM Prathamesh Kulkarni
<[hidden email]> wrote:

> One major issue remaining with the patch is value  numbering.
> Currently, it does value numbering for entire function using sccvn
> during start of vect pass, which is too expensive since we only need
> block based VN. I am looking into that.

Why do you need it at all?  We run VN on the if-converted loop bodies btw.

Richard.


Re: [SVE] PR86753

Richard Biener-2
On Wed, Aug 14, 2019 at 6:49 PM Richard Biener
<[hidden email]> wrote:

>
> Why do you need it at all?  We run VN on the if-converted loop bodies btw.

Also I can't trivially see the equality of the masks and probably so can't VN.
Is it that we just don't bother to apply loop_mask to VEC_COND but there's
no harm if we do?

Richard.


Re: [SVE] PR86753

Richard Sandiford-9
Richard Biener <[hidden email]> writes:

>>
>> Why do you need it at all?  We run VN on the if-converted loop bodies btw.

This was my suggestion, but with the idea being to do the numbering
per-statement as we vectorise.  We'll then see pattern statements too.

That's important because we use pattern statements to set the right
vector boolean type (e.g. vect_recog_mask_conversion_pattern).
So some of the masks we care about don't exist after if conversion.

> Also I can't trivially see the equality of the masks and probably so
> can't VN.  Is it that we just don't bother to apply loop_mask to
> VEC_COND but there's no harm if we do?

Yeah.  The idea of the optimisation is to decide when it's more profitable
to apply the loop mask, even though doing so isn't necessary.  It would
be hard to do after vectorisation because the masks aren't equivalent.
We're relying on knowledge of how the vectoriser uses the result.
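
As a toy per-lane model of why applying the loop mask is harmless here
(illustrative C++ only, not vectoriser code): ANDing the condition with
the loop mask can only change lanes that the loop mask disables, and the
masked store discards those lanes anyway.

#include <cassert>

/* One lane of a VEC_COND_EXPR.  */
static int
select_lane (bool cond, int then_val, int else_val)
{
  return cond ? then_val : else_val;
}

int
main ()
{
  for (int cond = 0; cond <= 1; ++cond)
    for (int loop_mask = 0; loop_mask <= 1; ++loop_mask)
      {
        int plain = select_lane (cond, 1, 2);
        int masked = select_lane (cond && loop_mask, 1, 2);
        /* Results can differ only where loop_mask is false, and those
           lanes are never committed by the masked store.  */
        if (loop_mask)
          assert (plain == masked);
      }
  return 0;
}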

Thanks,
Richard

Re: [SVE] PR86753

Prathamesh Kulkarni-2
On Thu, 15 Aug 2019 at 01:50, Richard Sandiford
<[hidden email]> wrote:

> Yeah.  The idea of the optimisation is to decide when it's more profitable
> to apply the loop mask, even though doing so isn't necessary.  It would
> be hard to do after vectorisation because the masks aren't equivalent.
> We're relying on knowledge of how the vectoriser uses the result.
Hi,
Sorry for the late response. This is an updated patch that integrates
block-based VN into the vect pass.
The patch
(a) Exports visit_stmt (renamed to vn_visit_stmt), vn_bb_init to
initialize the VN state, and vn_bb_free to free it.
(b) Calls vn_visit_stmt in vect_transform_stmt to value-number stmts.
We're only interested in obtaining value numbers, not in eliminating
redundancies.
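
For intuition, the property we want from the value numbers can be shown
with a toy, stand-alone sketch (purely illustrative; the patch reuses
sccvn via vn_visit_stmt rather than anything like this):

#include <map>
#include <string>
#include <tuple>

/* Toy value numbering: equivalent expressions get the same number.  */
typedef std::tuple<std::string, int, int> expr_key;  /* <code, vn0, vn1> */
static std::map<expr_key, int> expr_to_vn;
static std::map<std::string, int> name_to_vn;
static int next_vn;

static int
vn_of (const std::string &name)
{
  auto it = name_to_vn.find (name);
  if (it != name_to_vn.end ())
    return it->second;
  return name_to_vn[name] = next_vn++;  /* unknown operand: fresh number */
}

/* "Visit" lhs = op0 <code> op1: the lhs gets the value number of the
   expression, so later equivalent conditions compare equal.  */
static int
toy_visit_stmt (const std::string &lhs, const std::string &code,
                const std::string &op0, const std::string &op1)
{
  expr_key e (code, vn_of (op0), vn_of (op1));
  auto it = expr_to_vn.find (e);
  int vn = it != expr_to_vn.end () ? it->second : (expr_to_vn[e] = next_vn++);
  return name_to_vn[lhs] = vn;
}

Since sccvn additionally numbers constants by value, the comparisons
against vect_cst__42 and vect_cst__48 in the f1 dump end up with the
same value number, which is what lets the cond_to_vec_mask lookup treat
them as the same condition.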
Does it look in the right direction?
I am not sure if the initialization in vn_bb_init is entirely correct.

PS: The patch seems to regress fmla_2.c. I am looking into it.

Thanks,
Prathamesh

pr86753-6.diff (19K)

Re: [SVE] PR86753

Richard Biener-2
On Wed, Aug 21, 2019 at 8:24 PM Prathamesh Kulkarni
<[hidden email]> wrote:

> Hi,
> Sorry for late response. This is an updated patch, that integrates
> block-based VN into vect pass.
> The patch
> (a) Exports visit_stmt (renamed to vn_visit_stmt), vn_bb_init to
> initialize VN state, and vn_bb_free to free it.
> (b) Calls vn_visit_stmt in vect_transform_stmt for value numbering
> stmts. We're only interested in obtaining
> value numbers, not eliminating redundancies.
> Does it look in the right direction ?

It looks a bit odd to me.  I'd have expected it to work by generating
the stmts as before in the vectorizer and then, on the stmts we care about,
invoking vn_visit_stmt, which does both value-numbering and elimination.
Alternatively you could ask the VN state to generate the stmt for
you via vn_nary_build_or_lookup () (certainly that needs a bit more
work).  One complication might be availability if you don't value-number
all stmts in the block, but well.  I'm not sure constraining to a single
block is necessary - I've thought of having a "CSE"ing gimple_build
for some time (add & CSE new stmts onto a sequence), so one should
keep this mode in mind when designing the one working on an existing BB.
Note that, as written, this depends on visiting the stmts in proper order -
is that guaranteed when, for example, vectorizing SLP?


Re: [SVE] PR86753

Prathamesh Kulkarni-2
On Thu, 22 Aug 2019 at 16:44, Richard Biener <[hidden email]> wrote:

> It looks a bit odd to me.  I'd have expected it to work by generating
> the stmts as before in the vectorizer and then on the stmts we care
> invoke vn_visit_stmt that does both value-numbering and elimination.
> Alternatively you could ask the VN state to generate the stmt for
> you via vn_nary_build_or_lookup () (certainly that needs a bit more
> work).  One complication might be availability if you don't value-number
> all stmts in the block, but well.  I'm not sure constraining to a single
> block is necessary - I've thought of having a "CSE"ing gimple_build
> for some time (add & CSE new stmts onto a sequence), so one
> should keep this mode in mind when designing the one working on
> an existing BB.  Note as you write it it depends on visiting the
> stmts in proper order - is that guaranteed when for example
> vectorizing SLP?
Hi,
Indeed, I wrote the function with the assumption that stmts would be
visited in proper order.
This doesn't affect SLP currently, because the call to vn_visit_stmt in
vect_transform_stmt is conditional on cond_to_vec_mask, which is only
allocated inside vect_transform_loop.
But I agree we could make it more general.
AFAIU, the idea of constraining VN to a single block was to avoid using
defs from non-dominating scalar stmts during outer-loop vectorization.

* fmla_2.c regression with the patch:
This happens because, with the patch, forwprop4 is able to convert all 3
vec_cond_exprs to .cond_fma(), which results in 3 calls to fmla,
regressing the test-case. If matching with the inverted condition is
disabled in vectorizable_condition, then the old behavior is preserved.

Thanks,
Prathamesh


Re: [SVE] PR86753

Richard Sandiford-9
Prathamesh Kulkarni <[hidden email]> writes:

> Hi,
> Indeed, I wrote the function with assumption that, stmts would be
> visited in proper order.
> This doesn't affect SLP currently, because call to vn_visit_stmt in
> vect_transform_stmt is
> conditional on cond_to_vec_mask, which is only allocated inside
> vect_transform_loop.
> But I agree we could make it more general.
> AFAIU, the idea of constraining VN to single block was to avoid using defs from
> non-dominating scalar stmts during outer-loop vectorization.

Maybe we could do the numbering in a separate walk immediately before
the transform phase instead.

> * fmla_2.c regression with patch:
> This happens because with patch, forwprop4 is able to convert all 3
> vec_cond_expr's
> to .cond_fma(), which results in 3 calls to fmla, regressing the
> test-case. If matching with
> inverted condition is disabled in patch in vectorizable_condition,
> then the old behavior gets preserved.

Ugh, yeah.  This all stems from the fact that we don't get rid of
the first redundant store (before or after the patch).  I think it's
just luck that we managed to CSE the FMLAs feeding the two stores.
(That wasn't what the test was added for; it was added to catch a
suboptimal .md choice.)

I think it'd be OK to XFAIL the current scan-assembler-times and add a
new one for 3 times instead of 2 (with a comment of course).  I've filed
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91532 for the redundant
stores.

Thanks,
Richard

Re: [SVE] PR86753

Prathamesh Kulkarni-2
On Fri, 23 Aug 2019 at 18:15, Richard Sandiford
<[hidden email]> wrote:

>
> Maybe we could do the numbering in a separate walk immediately before
> the transform phase instead.
Um, sorry, I didn't understand. Do you mean we should do dom-based VN
just before the transform phase, or run full VN?

> I think it'd be OK to XFAIL the current scan-assembler-times and add a
> new one for 3 times instead of 2 (with a comment of course).  I've filed
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91532 for the redundant
> stores.
Thanks for the suggestions, I will adjust the test-case in a follow-up patch.

Thanks,
Prathamesh

Re: [SVE] PR86753

Richard Sandiford-9
Prathamesh Kulkarni <[hidden email]> writes:

>>
>> Maybe we could do the numbering in a separate walk immediately before
>> the transform phase instead.
> Um sorry, I didn't understand. Do you mean we should do dom based VN
> just before transform phase
> or run full VN ?

No, I just meant that we could do a separate walk of the contents
of the basic block:

> @@ -8608,6 +8609,8 @@ vect_transform_loop (loop_vec_info loop_vinfo)
>      {
>        basic_block bb = bbs[i];
>        stmt_vec_info stmt_info;
> +      vn_bb_init (bb);
> +      loop_vinfo->cond_to_vec_mask = new cond_vmask_map_type (8);
>

...here, rather than doing it on the fly during vect_transform_stmt
itself.  The walk should be gated on LOOP_VINFO_FULLY_MASKED_P so that
others don't have to pay the compile-time penalty.  (Same for
cond_to_vec_mask itself really.)

Thanks,
Richard

Re: [SVE] PR86753

Prathamesh Kulkarni-2
On Fri, 23 Aug 2019 at 19:43, Richard Sandiford
<[hidden email]> wrote:

> ...here, rather than doing it on the fly during vect_transform_stmt
> itself.  The walk should be gated on LOOP_VINFO_FULLY_MASKED_P so that
> others don't have to pay the compile-time penalty.  (Same for
> cond_to_vec_mask itself really.)
Hi,
Does the attached patch look OK?
In the patch, I put the call to vn_visit_stmt in the bb loop in
vect_transform_loop, to avoid replicating the logic for processing phis
and stmts.
AFAIU, vect_transform_loop_stmt is only called from the bb loop, so the
compile-time penalty for checking cond_to_vec_mask should be pretty small?
If this is not OK, I will walk the bb immediately before the bb loop.

Thanks,
Prathamesh

pr86753-7.diff (21K)

Re: [SVE] PR86753

Richard Biener-2
On Sun, Aug 25, 2019 at 11:13 PM Prathamesh Kulkarni
<[hidden email]> wrote:

> Hi,
> Does the attached patch look OK ?
> In patch, I put call to vn_visit stmt in bb loop in
> vect_transform_loop to avoid replicating logic for processing phi and
> stmts.
> AFAIU, vect_transform_loop_stmt is only called from bb loop, so
> compile time penalty for checking cond_to_vec_mask
> should be pretty small ?
> If this is not OK, I will walk bb immediately before the bb loop.

So if I understand correctly you never have vectorizable COND_EXPRs
in SLP mode?  Because we vectorize all SLP chains before entering
the loop in vect_transform_loop where you VN existing scalar(!) stmts.

Then all this new hash-table stuff should not be needed since this
is what VN should provide you with.  You of course need to visit
generated condition stmts.  And condition support is weak
in VN due to it possibly having two operations in a single stmt.
Bad GIMPLE IL.  So I'm not sure VN is up to the task here or
why you even need it given you are doing your own hashing?

Richard.

> Thanks,
> Prathamesh
> >
> > Thanks,
> > Richard
Reply | Threaded
Open this post in threaded view
|

Re: [SVE] PR86753

Prathamesh Kulkarni-2
On Mon, 26 Aug 2019 at 14:48, Richard Biener <[hidden email]> wrote:

>
> On Sun, Aug 25, 2019 at 11:13 PM Prathamesh Kulkarni
> <[hidden email]> wrote:
> >
> > On Fri, 23 Aug 2019 at 19:43, Richard Sandiford
> > <[hidden email]> wrote:
> > >
> > > Prathamesh Kulkarni <[hidden email]> writes:
> > > > On Fri, 23 Aug 2019 at 18:15, Richard Sandiford
> > > > <[hidden email]> wrote:
> > > >>
> > > >> Prathamesh Kulkarni <[hidden email]> writes:
> > > >> > On Thu, 22 Aug 2019 at 16:44, Richard Biener <[hidden email]> wrote:
> > > >> >> It looks a bit odd to me.  I'd have expected it to work by generating
> > > >> >> the stmts as before in the vectorizer and then on the stmts we care
> > > >> >> invoke vn_visit_stmt that does both value-numbering and elimination.
> > > >> >> Alternatively you could ask the VN state to generate the stmt for
> > > >> >> you via vn_nary_build_or_lookup () (certainly that needs a bit more
> > > >> >> work).  One complication might be availability if you don't value-number
> > > >> >> all stmts in the block, but well.  I'm not sure constraining to a single
> > > >> >> block is necessary - I've thought of having a "CSE"ing gimple_build
> > > >> >> for some time (add & CSE new stmts onto a sequence), so one
> > > >> >> should keep this mode in mind when designing the one working on
> > > >> >> an existing BB.  Note as you write it it depends on visiting the
> > > >> >> stmts in proper order - is that guaranteed when for example
> > > >> >> vectorizing SLP?
> > > >> > Hi,
> > > >> > Indeed, I wrote the function with assumption that, stmts would be
> > > >> > visited in proper order.
> > > >> > This doesn't affect SLP currently, because call to vn_visit_stmt in
> > > >> > vect_transform_stmt is
> > > >> > conditional on cond_to_vec_mask, which is only allocated inside
> > > >> > vect_transform_loop.
> > > >> > But I agree we could make it more general.
> > > >> > AFAIU, the idea of constraining VN to single block was to avoid using defs from
> > > >> > non-dominating scalar stmts during outer-loop vectorization.
> > > >>
> > > >> Maybe we could do the numbering in a separate walk immediately before
> > > >> the transform phase instead.
> > > > Um sorry, I didn't understand. Do you mean we should do dom based VN
> > > > just before transform phase
> > > > or run full VN ?
> > >
> > > No, I just meant that we could do a separate walk of the contents
> > > of the basic block:
> > >
> > > > @@ -8608,6 +8609,8 @@ vect_transform_loop (loop_vec_info loop_vinfo)
> > > >      {
> > > >        basic_block bb = bbs[i];
> > > >        stmt_vec_info stmt_info;
> > > > +      vn_bb_init (bb);
> > > > +      loop_vinfo->cond_to_vec_mask = new cond_vmask_map_type (8);
> > > >
> > >
> > > ...here, rather than doing it on the fly during vect_transform_stmt
> > > itself.  The walk should be gated on LOOP_VINFO_FULLY_MASKED_P so that
> > > others don't have to pay the compile-time penalty.  (Same for
> > > cond_to_vec_mask itself really.)
> > Hi,
> > Does the attached patch look OK ?
> > In patch, I put call to vn_visit stmt in bb loop in
> > vect_transform_loop to avoid replicating logic for processing phi and
> > stmts.
> > AFAIU, vect_transform_loop_stmt is only called from bb loop, so
> > compile time penalty for checking cond_to_vec_mask
> > should be pretty small ?
> > If this is not OK, I will walk bb immediately before the bb loop.
>
> So if I understand correctly you never have vectorizable COND_EXPRs
> in SLP mode?  Because we vectorize all SLP chains before entering
> the loop in vect_transform_loop where you VN existing scalar(!) stmts.
>
> Then all this hew hash-table stuff should not be needed since this
> is what VN should provide you with.  You of course need to visit
> generated condition stmts.  And condition support is weak
> in VN due to it possibly having two operations in a single stmt.
> Bad GIMPLE IL.  So I'm not sure VN is up to the task here or
> why you even need it given you are doing your own hashing?
Well, we thought of using VN for comparing operands in cases where
operand_equal_p would not work.  Actually, VN seems not to be required
for the test-cases in the PR, because both conditions are _4 != 0
(_35 = _4 != 0 and the one embedded in the cond_expr), which
operand_equal_p matches.
Input to vectorizer is:

 <bb 3> [local count: 1063004407]:
  # i_20 = PHI <i_16(7), 0(15)>
  # ivtmp_19 = PHI <ivtmp_9(7), 100(15)>
  _1 = (long unsigned int) i_20;
  _2 = _1 * 4;
  _3 = y_11(D) + _2;
  _4 = *_3;
  _5 = z_12(D) + _2;
  _35 = _4 != 0;
  iftmp.0_13 = .MASK_LOAD (_5, 32B, _35);
  iftmp.0_8 = _4 != 0 ? iftmp.0_13 : 10;

In prepare_load_store_mask, we record (ne_expr, _4, 0) -> vec_mask in
cond_to_vec_mask, and in vectorizable_condition we look up
(ne_expr, _4, 0), which does not require VN since the operands are the
same.
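
For reference, a standalone sketch of the caching scheme outside the
vectorizer (the class and function names below are made up for
illustration and are not the actual GCC or patch types): the key is the
scalar condition (code, op0, op1) plus the loop mask, the value is the
mask ANDed with the loop mask, and the hash covers exactly the fields
that the equality compares -- the same consistency requirement as
pairing inchash::add_expr with operand_equal_p.

// Illustrative stand-in for cond_to_vec_mask, using strings in place of
// GIMPLE trees.  Not GCC code.
#include <cstddef>
#include <functional>
#include <iostream>
#include <string>
#include <unordered_map>

enum class cmp_code { NE, EQ, GT, LE };

struct cond_key
{
  cmp_code code;
  std::string op0, op1;   // scalar comparison operands
  std::string loop_mask;  // loop mask the condition was combined with

  bool operator== (const cond_key &o) const
  {
    // Structural equality over all key fields.
    return (code == o.code && op0 == o.op0 && op1 == o.op1
            && loop_mask == o.loop_mask);
  }
};

struct cond_key_hash
{
  std::size_t operator() (const cond_key &k) const
  {
    // Hash exactly the fields compared above, so equal keys hash equally.
    std::size_t h = std::hash<int> () (static_cast<int> (k.code));
    for (const std::string *s : { &k.op0, &k.op1, &k.loop_mask })
      h = h * 131 + std::hash<std::string> () (*s);
    return h;
  }
};

typedef std::unordered_map<cond_key, std::string, cond_key_hash>
  cond_to_vec_mask_map;

int main ()
{
  cond_to_vec_mask_map cond_to_vec_mask;

  // Recorded when masking the .MASK_LOAD ...
  cond_key k { cmp_code::NE, "_4", "0", "loop_mask_40" };
  cond_to_vec_mask.emplace (k, "vec_mask_and_46");

  // ... and looked up again when vectorizing the COND_EXPR.
  auto it = cond_to_vec_mask.find (k);
  if (it != cond_to_vec_mask.end ())
    std::cout << "reuse " << it->second << " in the VEC_COND_EXPR\n";
}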

Initially, I was trying to change the generated vectorized code:

  mask__35.8_43 = vect__4.7_41 != vect_cst__42;
  vect_iftmp.12_50 = VEC_COND_EXPR <vect__4.7_41 != vect_cst__48,
vect_iftmp.11_47, vect_cst__49>;

where both conditions are equivalent because vect_cst__42 and
vect_cst__48 are zero vectors but operand_equal_p failed to catch
those.

Sorry, I mixed up scalar and vector stmts later on.
I wonder if we then need VN at all?  Perhaps there might be other cases
where operands of scalar conditions are equivalent but don't match with
operand_equal_p?
In the attached patch, changing operator== to compare using
operand_equal_p works for the tests.

Thanks,
Prathamesh
>
> Richard.
>
> > Thanks,
> > Prathamesh
> > >
> > > Thanks,
> > > Richard

pr86753-8.diff (21K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: [SVE] PR86753

Richard Sandiford-9
Prathamesh Kulkarni <[hidden email]> writes:

> On Mon, 26 Aug 2019 at 14:48, Richard Biener <[hidden email]> wrote:
>>
>> On Sun, Aug 25, 2019 at 11:13 PM Prathamesh Kulkarni
>> <[hidden email]> wrote:
>> >
>> > On Fri, 23 Aug 2019 at 19:43, Richard Sandiford
>> > <[hidden email]> wrote:
>> > >
>> > > Prathamesh Kulkarni <[hidden email]> writes:
>> > > > On Fri, 23 Aug 2019 at 18:15, Richard Sandiford
>> > > > <[hidden email]> wrote:
>> > > >>
>> > > >> Prathamesh Kulkarni <[hidden email]> writes:
>> > > >> > On Thu, 22 Aug 2019 at 16:44, Richard Biener <[hidden email]> wrote:
>> > > >> >> It looks a bit odd to me.  I'd have expected it to work by generating
>> > > >> >> the stmts as before in the vectorizer and then on the stmts we care
>> > > >> >> invoke vn_visit_stmt that does both value-numbering and elimination.
>> > > >> >> Alternatively you could ask the VN state to generate the stmt for
>> > > >> >> you via vn_nary_build_or_lookup () (certainly that needs a bit more
>> > > >> >> work).  One complication might be availability if you don't value-number
>> > > >> >> all stmts in the block, but well.  I'm not sure constraining to a single
>> > > >> >> block is necessary - I've thought of having a "CSE"ing gimple_build
>> > > >> >> for some time (add & CSE new stmts onto a sequence), so one
>> > > >> >> should keep this mode in mind when designing the one working on
>> > > >> >> an existing BB.  Note as you write it it depends on visiting the
>> > > >> >> stmts in proper order - is that guaranteed when for example
>> > > >> >> vectorizing SLP?
>> > > >> > Hi,
>> > > >> > Indeed, I wrote the function with assumption that, stmts would be
>> > > >> > visited in proper order.
>> > > >> > This doesn't affect SLP currently, because call to vn_visit_stmt in
>> > > >> > vect_transform_stmt is
>> > > >> > conditional on cond_to_vec_mask, which is only allocated inside
>> > > >> > vect_transform_loop.
>> > > >> > But I agree we could make it more general.
>> > > >> > AFAIU, the idea of constraining VN to single block was to avoid using defs from
>> > > >> > non-dominating scalar stmts during outer-loop vectorization.
>> > > >>
>> > > >> Maybe we could do the numbering in a separate walk immediately before
>> > > >> the transform phase instead.
>> > > > Um sorry, I didn't understand. Do you mean we should do dom based VN
>> > > > just before transform phase
>> > > > or run full VN ?
>> > >
>> > > No, I just meant that we could do a separate walk of the contents
>> > > of the basic block:
>> > >
>> > > > @@ -8608,6 +8609,8 @@ vect_transform_loop (loop_vec_info loop_vinfo)
>> > > >      {
>> > > >        basic_block bb = bbs[i];
>> > > >        stmt_vec_info stmt_info;
>> > > > +      vn_bb_init (bb);
>> > > > +      loop_vinfo->cond_to_vec_mask = new cond_vmask_map_type (8);
>> > > >
>> > >
>> > > ...here, rather than doing it on the fly during vect_transform_stmt
>> > > itself.  The walk should be gated on LOOP_VINFO_FULLY_MASKED_P so that
>> > > others don't have to pay the compile-time penalty.  (Same for
>> > > cond_to_vec_mask itself really.)
>> > Hi,
>> > Does the attached patch look OK ?
>> > In patch, I put call to vn_visit stmt in bb loop in
>> > vect_transform_loop to avoid replicating logic for processing phi and
>> > stmts.
>> > AFAIU, vect_transform_loop_stmt is only called from bb loop, so
>> > compile time penalty for checking cond_to_vec_mask
>> > should be pretty small ?
>> > If this is not OK, I will walk bb immediately before the bb loop.
>>
>> So if I understand correctly you never have vectorizable COND_EXPRs
>> in SLP mode?  Because we vectorize all SLP chains before entering
>> the loop in vect_transform_loop where you VN existing scalar(!) stmts.

On the "!": the idea behind the patch is to find cases in which a
scalar condition is used in both a statement that needs to be masked
for correctness reasons and a statement that we can choose to mask
if we want to.  It also tries (opportunistically) to match the ?: order
with other conditions.

That's why it's operating on scalar values rather than vector values.
In principle it could be done as a subpass before vectorisation rather
than on the fly, when there aren't any vector stmts around.

>> Then all this hew hash-table stuff should not be needed since this
>> is what VN should provide you with.  You of course need to visit
>> generated condition stmts.  And condition support is weak
>> in VN due to it possibly having two operations in a single stmt.
>> Bad GIMPLE IL.  So I'm not sure VN is up to the task here or
>> why you even need it given you are doing your own hashing?
> Well, we thought of using VN for comparing operands for cases
> operand_equal_p would not
> work. Actually, VN seems not to be required for test-cases in PR
> because both conditions
> are _4 != 0 (_35 = _4 != 0 and in cond_expr), which works to match
> with operand_equal_p.

Right, that's why I was suggesting in the earlier thread that we
treat value numbering as a follow-on.  But...

> Input to vectorizer is:
>
>  <bb 3> [local count: 1063004407]:
>   # i_20 = PHI <i_16(7), 0(15)>
>   # ivtmp_19 = PHI <ivtmp_9(7), 100(15)>
>   _1 = (long unsigned int) i_20;
>   _2 = _1 * 4;
>   _3 = y_11(D) + _2;
>   _4 = *_3;
>   _5 = z_12(D) + _2;
>   _35 = _4 != 0;
>   iftmp.0_13 = .MASK_LOAD (_5, 32B, _35);
>   iftmp.0_8 = _4 != 0 ? iftmp.0_13 : 10;
>
> In prepare_load_store_mask, we record (ne_expr, _4, 0) -> vec_mask in
> cond_to_vec_mask,
> and in vectorizable_condition, we look up (ne_expr, _4, 0) which does
> not require VN
> since operands are same.
>
> Initially, I was trying to change the generated vectorized code:
>
>   mask__35.8_43 = vect__4.7_41 != vect_cst__42;
>   vect_iftmp.12_50 = VEC_COND_EXPR <vect__4.7_41 != vect_cst__48,
> vect_iftmp.11_47, vect_cst__49>;
>
> where both conditions are equivalent because vect_cst__42 and
> vect_cst__48 are zero vectors but operand_equal_p failed to catch
> those.
>
> Sorry, I mixed up later between scalar and vector stmts.
> I wonder if we then need VN ? Perhaps there might be other cases where
> operands of scalar
> conditions may be equivalent but not match with operand_equal_p ?
> In the attached patch, changing operator==, to compare using
> operand_equal_p works for the tests.

...those are only very simple cases :-) The problem is that ifcvt
doesn't leave the code free of redundancies, and pattern stmt generation
can create redundancies too.  Of course, we could fix those instead.

E.g. for:

void
f (short *restrict x, short *restrict a, short *restrict b, short *restrict c)
{
  for (int i = 0; i < 100; ++i)
    if (a[i] >= 1)
      x[i] = b[i] >= 1 ? a[i] : c[i];
}

ifcvt produces:

  <bb 3> [local count: 1063004407]:
  # i_34 = PHI <i_30(10), 0(21)>
  # ivtmp_5 = PHI <ivtmp_6(10), 100(21)>
  _1 = (long unsigned int) i_34;
  _2 = _1 * 2;
  _3 = a_23(D) + _2;
  _4 = *_3;
  _7 = b_24(D) + _2;
  _49 = _4 > 0;
  _8 = .MASK_LOAD (_7, 16B, _49);
  _12 = _4 > 0;
  _13 = _8 > 0;
  _9 = _12 & _13;
  _10 = _4 > 0;
  _11 = _8 > 0;
  _27 = ~_11;
  _15 = _10 & _27;
  _14 = c_25(D) + _2;
  iftmp.0_26 = .MASK_LOAD (_14, 16B, _15);
  iftmp.0_19 = _9 ? _4 : iftmp.0_26;
  _17 = x_28(D) + _2;
  _50 = _4 > 0;
  .MASK_STORE (_17, 16B, _50, iftmp.0_19);
  i_30 = i_34 + 1;
  ivtmp_6 = ivtmp_5 - 1;
  if (ivtmp_6 != 0)
    goto <bb 10>; [98.99%]
  else
    goto <bb 9>; [1.01%]

  <bb 10> [local count: 1052266994]:
  goto <bb 3>; [100.00%]

which has 4 copies of _4 > 0 (a[i] > 0) and 2 copies of _8 > 0 (b[i] > 0).
Looking through the definition of an SSA name means that we can cope
with these redundancies for single comparisons, but not for comparisons
combined through & and |.  This hurts most when trying to decide whether
to invert comparisons.

But maybe value numbering won't cope with that either :-)

Thanks,
Richard
Reply | Threaded
Open this post in threaded view
|

Re: [SVE] PR86753

Richard Biener-2
On Tue, Aug 27, 2019 at 11:58 AM Richard Sandiford
<[hidden email]> wrote:

>
> Prathamesh Kulkarni <[hidden email]> writes:
> > On Mon, 26 Aug 2019 at 14:48, Richard Biener <[hidden email]> wrote:
> >>
> >> On Sun, Aug 25, 2019 at 11:13 PM Prathamesh Kulkarni
> >> <[hidden email]> wrote:
> >> >
> >> > On Fri, 23 Aug 2019 at 19:43, Richard Sandiford
> >> > <[hidden email]> wrote:
> >> > >
> >> > > Prathamesh Kulkarni <[hidden email]> writes:
> >> > > > On Fri, 23 Aug 2019 at 18:15, Richard Sandiford
> >> > > > <[hidden email]> wrote:
> >> > > >>
> >> > > >> Prathamesh Kulkarni <[hidden email]> writes:
> >> > > >> > On Thu, 22 Aug 2019 at 16:44, Richard Biener <[hidden email]> wrote:
> >> > > >> >> It looks a bit odd to me.  I'd have expected it to work by generating
> >> > > >> >> the stmts as before in the vectorizer and then on the stmts we care
> >> > > >> >> invoke vn_visit_stmt that does both value-numbering and elimination.
> >> > > >> >> Alternatively you could ask the VN state to generate the stmt for
> >> > > >> >> you via vn_nary_build_or_lookup () (certainly that needs a bit more
> >> > > >> >> work).  One complication might be availability if you don't value-number
> >> > > >> >> all stmts in the block, but well.  I'm not sure constraining to a single
> >> > > >> >> block is necessary - I've thought of having a "CSE"ing gimple_build
> >> > > >> >> for some time (add & CSE new stmts onto a sequence), so one
> >> > > >> >> should keep this mode in mind when designing the one working on
> >> > > >> >> an existing BB.  Note as you write it it depends on visiting the
> >> > > >> >> stmts in proper order - is that guaranteed when for example
> >> > > >> >> vectorizing SLP?
> >> > > >> > Hi,
> >> > > >> > Indeed, I wrote the function with assumption that, stmts would be
> >> > > >> > visited in proper order.
> >> > > >> > This doesn't affect SLP currently, because call to vn_visit_stmt in
> >> > > >> > vect_transform_stmt is
> >> > > >> > conditional on cond_to_vec_mask, which is only allocated inside
> >> > > >> > vect_transform_loop.
> >> > > >> > But I agree we could make it more general.
> >> > > >> > AFAIU, the idea of constraining VN to single block was to avoid using defs from
> >> > > >> > non-dominating scalar stmts during outer-loop vectorization.
> >> > > >>
> >> > > >> Maybe we could do the numbering in a separate walk immediately before
> >> > > >> the transform phase instead.
> >> > > > Um sorry, I didn't understand. Do you mean we should do dom based VN
> >> > > > just before transform phase
> >> > > > or run full VN ?
> >> > >
> >> > > No, I just meant that we could do a separate walk of the contents
> >> > > of the basic block:
> >> > >
> >> > > > @@ -8608,6 +8609,8 @@ vect_transform_loop (loop_vec_info loop_vinfo)
> >> > > >      {
> >> > > >        basic_block bb = bbs[i];
> >> > > >        stmt_vec_info stmt_info;
> >> > > > +      vn_bb_init (bb);
> >> > > > +      loop_vinfo->cond_to_vec_mask = new cond_vmask_map_type (8);
> >> > > >
> >> > >
> >> > > ...here, rather than doing it on the fly during vect_transform_stmt
> >> > > itself.  The walk should be gated on LOOP_VINFO_FULLY_MASKED_P so that
> >> > > others don't have to pay the compile-time penalty.  (Same for
> >> > > cond_to_vec_mask itself really.)
> >> > Hi,
> >> > Does the attached patch look OK ?
> >> > In patch, I put call to vn_visit stmt in bb loop in
> >> > vect_transform_loop to avoid replicating logic for processing phi and
> >> > stmts.
> >> > AFAIU, vect_transform_loop_stmt is only called from bb loop, so
> >> > compile time penalty for checking cond_to_vec_mask
> >> > should be pretty small ?
> >> > If this is not OK, I will walk bb immediately before the bb loop.
> >>
> >> So if I understand correctly you never have vectorizable COND_EXPRs
> >> in SLP mode?  Because we vectorize all SLP chains before entering
> >> the loop in vect_transform_loop where you VN existing scalar(!) stmts.
>
> On the "!": the idea behind the patch is to find cases in which a
> scalar condition is used in both a statement that needs to be masked
> for correctness reasons and a statement that we can choose to mask
> if we want to.  It also tries (opportunisticly) to match the ?: order
> with other conditions.
>
> That's why it's operating on scalar values rather than vector values.
> In principle it could be done as a subpass before vectorisation rather
> than on the fly, when there aren't any vector stmts around.
>
> >> Then all this hew hash-table stuff should not be needed since this
> >> is what VN should provide you with.  You of course need to visit
> >> generated condition stmts.  And condition support is weak
> >> in VN due to it possibly having two operations in a single stmt.
> >> Bad GIMPLE IL.  So I'm not sure VN is up to the task here or
> >> why you even need it given you are doing your own hashing?
> > Well, we thought of using VN for comparing operands for cases
> > operand_equal_p would not
> > work. Actually, VN seems not to be required for test-cases in PR
> > because both conditions
> > are _4 != 0 (_35 = _4 != 0 and in cond_expr), which works to match
> > with operand_equal_p.
>
> Right, that's why I was suggesting in the earlier thread that we
> treat value numbering as a follow-on.  But...
>
> > Input to vectorizer is:
> >
> >  <bb 3> [local count: 1063004407]:
> >   # i_20 = PHI <i_16(7), 0(15)>
> >   # ivtmp_19 = PHI <ivtmp_9(7), 100(15)>
> >   _1 = (long unsigned int) i_20;
> >   _2 = _1 * 4;
> >   _3 = y_11(D) + _2;
> >   _4 = *_3;
> >   _5 = z_12(D) + _2;
> >   _35 = _4 != 0;
> >   iftmp.0_13 = .MASK_LOAD (_5, 32B, _35);
> >   iftmp.0_8 = _4 != 0 ? iftmp.0_13 : 10;
> >
> > In prepare_load_store_mask, we record (ne_expr, _4, 0) -> vec_mask in
> > cond_to_vec_mask,
> > and in vectorizable_condition, we look up (ne_expr, _4, 0) which does
> > not require VN
> > since operands are same.
> >
> > Initially, I was trying to change the generated vectorized code:
> >
> >   mask__35.8_43 = vect__4.7_41 != vect_cst__42;
> >   vect_iftmp.12_50 = VEC_COND_EXPR <vect__4.7_41 != vect_cst__48,
> > vect_iftmp.11_47, vect_cst__49>;
> >
> > where both conditions are equivalent because vect_cst__42 and
> > vect_cst__48 are zero vectors but operand_equal_p failed to catch
> > those.
> >
> > Sorry, I mixed up later between scalar and vector stmts.
> > I wonder if we then need VN ? Perhaps there might be other cases where
> > operands of scalar
> > conditions may be equivalent but not match with operand_equal_p ?
> > In the attached patch, changing operator==, to compare using
> > operand_equal_p works for the tests.
>
> ...those are only very simple cases :-) The problem is that ifcvt
> doesn't leave the code free of redundancies, and pattern stmt generation
> can create redundancies too.  Of course, we could fix those instead.
>
> E.g. for:
>
> void
> f (short *restrict x, short *restrict a, short *restrict b, short *restrict c)
> {
>   for (int i = 0; i < 100; ++i)
>     if (a[i] >= 1)
>       x[i] = b[i] >= 1 ? a[i] : c[i];
> }
>
> ifcvt produces:
>
>   <bb 3> [local count: 1063004407]:
>   # i_34 = PHI <i_30(10), 0(21)>
>   # ivtmp_5 = PHI <ivtmp_6(10), 100(21)>
>   _1 = (long unsigned int) i_34;
>   _2 = _1 * 2;
>   _3 = a_23(D) + _2;
>   _4 = *_3;
>   _7 = b_24(D) + _2;
>   _49 = _4 > 0;
>   _8 = .MASK_LOAD (_7, 16B, _49);
>   _12 = _4 > 0;
>   _13 = _8 > 0;
>   _9 = _12 & _13;
>   _10 = _4 > 0;
>   _11 = _8 > 0;
>   _27 = ~_11;
>   _15 = _10 & _27;
>   _14 = c_25(D) + _2;
>   iftmp.0_26 = .MASK_LOAD (_14, 16B, _15);
>   iftmp.0_19 = _9 ? _4 : iftmp.0_26;
>   _17 = x_28(D) + _2;
>   _50 = _4 > 0;
>   .MASK_STORE (_17, 16B, _50, iftmp.0_19);
>   i_30 = i_34 + 1;
>   ivtmp_6 = ivtmp_5 - 1;
>   if (ivtmp_6 != 0)
>     goto <bb 10>; [98.99%]
>   else
>     goto <bb 9>; [1.01%]
>
>   <bb 10> [local count: 1052266994]:
>   goto <bb 3>; [100.00%]
>
> which has 4 copies of _4 > 0 (a[i] > 0) and 2 copies of _8 > 0 (b[i] > 0).

Huh.  if-conversion does

  /* Now all statements are if-convertible.  Combine all the basic
     blocks into one huge basic block doing the if-conversion
     on-the-fly.  */
  combine_blocks (loop);

  /* Delete dead predicate computations.  */
  ifcvt_local_dce (loop->header);

  /* Perform local CSE, this esp. helps the vectorizer analysis if loads
     and stores are involved.  CSE only the loop body, not the entry
     PHIs, those are to be kept in sync with the non-if-converted copy.
     ???  We'll still keep dead stores though.  */
  exit_bbs = BITMAP_ALLOC (NULL);
  bitmap_set_bit (exit_bbs, single_exit (loop)->dest->index);
  bitmap_set_bit (exit_bbs, loop->latch->index);
  todo |= do_rpo_vn (cfun, loop_preheader_edge (loop), exit_bbs);

which should remove those redundant _4 > 0 checks.  In fact when I
run this on x86_64 with -mavx512bw I see

  <bb 3> [local count: 1063004407]:
  # i_25 = PHI <i_20(9), 0(21)>
  # ivtmp_24 = PHI <ivtmp_12(9), 100(21)>
  _1 = (long unsigned int) i_25;
  _2 = _1 * 2;
  _3 = a_14(D) + _2;
  _4 = *_3;
  _5 = b_15(D) + _2;
  _49 = _4 > 0;
  _6 = .MASK_LOAD (_5, 16B, _49);
  _22 = _6 > 0;
  _28 = ~_22;
  _29 = _28 & _49;
  _7 = c_16(D) + _2;
  iftmp.0_17 = .MASK_LOAD (_7, 16B, _29);
  iftmp.0_10 = _29 ? iftmp.0_17 : _4;
  _8 = x_18(D) + _2;
  .MASK_STORE (_8, 16B, _49, iftmp.0_10);
  i_20 = i_25 + 1;
  ivtmp_12 = ivtmp_24 - 1;
  if (ivtmp_12 != 0)

after if-conversion (that should be the case already on the GCC 9 branch).

> Looking through the definition of an SSA name means that we can cope
> with these redundancies for single comparisons, but not for comparisons
> combined through & and |.  This hurts most when trying to decide whether
> to invert comparisons.
>
> But maybe value numbering won't cope with that either :-)

Not sure, it definitely doesn't do arbitrary re-association
(but commutative ops are handled).  So it cannot prove
a & (b & c) == (c & a) & b, but it can prove a & b == b & a.
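
To make that concrete, a toy sketch (nothing to do with how
tree-ssa-sccvn is actually implemented): if the lookup key canonicalises
commutative operands but nothing re-associates nested expressions, then
a & b and b & a get the same value number while a & (b & c) and
(c & a) & b do not.

// Toy value numbering: commutative operands are canonicalised by sorting,
// nested expressions are not re-associated.  Purely illustrative.
#include <algorithm>
#include <cassert>
#include <map>
#include <string>

static int next_vn;
static std::map<std::string, int> vn_table;

// Value-number "x & y"; x and y are operand names or prior value numbers
// rendered as strings.
static int
vn_bit_and (std::string x, std::string y)
{
  if (y < x)
    std::swap (x, y);   // canonical operand order for '&'
  std::string key = "&(" + x + "," + y + ")";
  auto it = vn_table.find (key);
  if (it != vn_table.end ())
    return it->second;
  return vn_table[key] = next_vn++;
}

int main ()
{
  // a & b and b & a collapse to one value number ...
  assert (vn_bit_and ("a", "b") == vn_bit_and ("b", "a"));

  // ... but a & (b & c) and (c & a) & b keep distinct value numbers,
  // because the outer '&' operands differ and nothing re-associates them.
  int lhs = vn_bit_and ("a", std::to_string (vn_bit_and ("b", "c")));
  int rhs = vn_bit_and (std::to_string (vn_bit_and ("c", "a")), "b");
  assert (lhs != rhs);
  return 0;
}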

Richard.

>
> Thanks,
> Richard
Reply | Threaded
Open this post in threaded view
|

Re: [SVE] PR86753

Richard Sandiford-9
Richard Biener <[hidden email]> writes:

> On Tue, Aug 27, 2019 at 11:58 AM Richard Sandiford
> <[hidden email]> wrote:
>> ifcvt produces:
>>
>>   <bb 3> [local count: 1063004407]:
>>   # i_34 = PHI <i_30(10), 0(21)>
>>   # ivtmp_5 = PHI <ivtmp_6(10), 100(21)>
>>   _1 = (long unsigned int) i_34;
>>   _2 = _1 * 2;
>>   _3 = a_23(D) + _2;
>>   _4 = *_3;
>>   _7 = b_24(D) + _2;
>>   _49 = _4 > 0;
>>   _8 = .MASK_LOAD (_7, 16B, _49);
>>   _12 = _4 > 0;
>>   _13 = _8 > 0;
>>   _9 = _12 & _13;
>>   _10 = _4 > 0;
>>   _11 = _8 > 0;
>>   _27 = ~_11;
>>   _15 = _10 & _27;
>>   _14 = c_25(D) + _2;
>>   iftmp.0_26 = .MASK_LOAD (_14, 16B, _15);
>>   iftmp.0_19 = _9 ? _4 : iftmp.0_26;
>>   _17 = x_28(D) + _2;
>>   _50 = _4 > 0;
>>   .MASK_STORE (_17, 16B, _50, iftmp.0_19);
>>   i_30 = i_34 + 1;
>>   ivtmp_6 = ivtmp_5 - 1;
>>   if (ivtmp_6 != 0)
>>     goto <bb 10>; [98.99%]
>>   else
>>     goto <bb 9>; [1.01%]
>>
>>   <bb 10> [local count: 1052266994]:
>>   goto <bb 3>; [100.00%]
>>
>> which has 4 copies of _4 > 0 (a[i] > 0) and 2 copies of _8 > 0 (b[i] > 0).
>
> Huh.  if-conversion does
>
>   /* Now all statements are if-convertible.  Combine all the basic
>      blocks into one huge basic block doing the if-conversion
>      on-the-fly.  */
>   combine_blocks (loop);
>
>   /* Delete dead predicate computations.  */
>   ifcvt_local_dce (loop->header);
>
>   /* Perform local CSE, this esp. helps the vectorizer analysis if loads
>      and stores are involved.  CSE only the loop body, not the entry
>      PHIs, those are to be kept in sync with the non-if-converted copy.
>      ???  We'll still keep dead stores though.  */
>   exit_bbs = BITMAP_ALLOC (NULL);
>   bitmap_set_bit (exit_bbs, single_exit (loop)->dest->index);
>   bitmap_set_bit (exit_bbs, loop->latch->index);
>   todo |= do_rpo_vn (cfun, loop_preheader_edge (loop), exit_bbs);
>
> which should remove those redundant _4 > 0 checks.  In fact when I
> run this on x86_64 with -mavx512bw I see
>
>   <bb 3> [local count: 1063004407]:
>   # i_25 = PHI <i_20(9), 0(21)>
>   # ivtmp_24 = PHI <ivtmp_12(9), 100(21)>
>   _1 = (long unsigned int) i_25;
>   _2 = _1 * 2;
>   _3 = a_14(D) + _2;
>   _4 = *_3;
>   _5 = b_15(D) + _2;
>   _49 = _4 > 0;
>   _6 = .MASK_LOAD (_5, 16B, _49);
>   _22 = _6 > 0;
>   _28 = ~_22;
>   _29 = _28 & _49;
>   _7 = c_16(D) + _2;
>   iftmp.0_17 = .MASK_LOAD (_7, 16B, _29);
>   iftmp.0_10 = _29 ? iftmp.0_17 : _4;
>   _8 = x_18(D) + _2;
>   .MASK_STORE (_8, 16B, _49, iftmp.0_10);
>   i_20 = i_25 + 1;
>   ivtmp_12 = ivtmp_24 - 1;
>   if (ivtmp_12 != 0)
>
> after if-conversion (that should be the case already on the GCC 9 branch).

Gah, sorry for the noise.  Turns out I still had a local change that was
trying to poke the patch into doing something wrong.  Will try to check
my facts more carefully next time.

The redundant pattern statements I was thinking of come from
vect_recog_mask_conversion_pattern, but I guess that isn't so
interesting here.

So yeah, let's drop this whole vn thing for now...

Thanks,
Richard
Reply | Threaded
Open this post in threaded view
|

Re: [SVE] PR86753

Prathamesh Kulkarni-2
On Tue, 27 Aug 2019 at 17:29, Richard Sandiford
<[hidden email]> wrote:

>
> Richard Biener <[hidden email]> writes:
> > On Tue, Aug 27, 2019 at 11:58 AM Richard Sandiford
> > <[hidden email]> wrote:
> >> ifcvt produces:
> >>
> >>   <bb 3> [local count: 1063004407]:
> >>   # i_34 = PHI <i_30(10), 0(21)>
> >>   # ivtmp_5 = PHI <ivtmp_6(10), 100(21)>
> >>   _1 = (long unsigned int) i_34;
> >>   _2 = _1 * 2;
> >>   _3 = a_23(D) + _2;
> >>   _4 = *_3;
> >>   _7 = b_24(D) + _2;
> >>   _49 = _4 > 0;
> >>   _8 = .MASK_LOAD (_7, 16B, _49);
> >>   _12 = _4 > 0;
> >>   _13 = _8 > 0;
> >>   _9 = _12 & _13;
> >>   _10 = _4 > 0;
> >>   _11 = _8 > 0;
> >>   _27 = ~_11;
> >>   _15 = _10 & _27;
> >>   _14 = c_25(D) + _2;
> >>   iftmp.0_26 = .MASK_LOAD (_14, 16B, _15);
> >>   iftmp.0_19 = _9 ? _4 : iftmp.0_26;
> >>   _17 = x_28(D) + _2;
> >>   _50 = _4 > 0;
> >>   .MASK_STORE (_17, 16B, _50, iftmp.0_19);
> >>   i_30 = i_34 + 1;
> >>   ivtmp_6 = ivtmp_5 - 1;
> >>   if (ivtmp_6 != 0)
> >>     goto <bb 10>; [98.99%]
> >>   else
> >>     goto <bb 9>; [1.01%]
> >>
> >>   <bb 10> [local count: 1052266994]:
> >>   goto <bb 3>; [100.00%]
> >>
> >> which has 4 copies of _4 > 0 (a[i] > 0) and 2 copies of _8 > 0 (b[i] > 0).
> >
> > Huh.  if-conversion does
> >
> >   /* Now all statements are if-convertible.  Combine all the basic
> >      blocks into one huge basic block doing the if-conversion
> >      on-the-fly.  */
> >   combine_blocks (loop);
> >
> >   /* Delete dead predicate computations.  */
> >   ifcvt_local_dce (loop->header);
> >
> >   /* Perform local CSE, this esp. helps the vectorizer analysis if loads
> >      and stores are involved.  CSE only the loop body, not the entry
> >      PHIs, those are to be kept in sync with the non-if-converted copy.
> >      ???  We'll still keep dead stores though.  */
> >   exit_bbs = BITMAP_ALLOC (NULL);
> >   bitmap_set_bit (exit_bbs, single_exit (loop)->dest->index);
> >   bitmap_set_bit (exit_bbs, loop->latch->index);
> >   todo |= do_rpo_vn (cfun, loop_preheader_edge (loop), exit_bbs);
> >
> > which should remove those redundant _4 > 0 checks.  In fact when I
> > run this on x86_64 with -mavx512bw I see
> >
> >   <bb 3> [local count: 1063004407]:
> >   # i_25 = PHI <i_20(9), 0(21)>
> >   # ivtmp_24 = PHI <ivtmp_12(9), 100(21)>
> >   _1 = (long unsigned int) i_25;
> >   _2 = _1 * 2;
> >   _3 = a_14(D) + _2;
> >   _4 = *_3;
> >   _5 = b_15(D) + _2;
> >   _49 = _4 > 0;
> >   _6 = .MASK_LOAD (_5, 16B, _49);
> >   _22 = _6 > 0;
> >   _28 = ~_22;
> >   _29 = _28 & _49;
> >   _7 = c_16(D) + _2;
> >   iftmp.0_17 = .MASK_LOAD (_7, 16B, _29);
> >   iftmp.0_10 = _29 ? iftmp.0_17 : _4;
> >   _8 = x_18(D) + _2;
> >   .MASK_STORE (_8, 16B, _49, iftmp.0_10);
> >   i_20 = i_25 + 1;
> >   ivtmp_12 = ivtmp_24 - 1;
> >   if (ivtmp_12 != 0)
> >
> > after if-conversion (that should be the case already on the GCC 9 branch).
>
> Gah, sorry for the noise.  Turns out I still had a local change that was
> trying to poke the patch into doing something wrong.  Will try to check
> my facts more carefully next time.
>
> The redundant pattern statements I was thinking of come from
> vect_recog_mask_conversion_pattern, but I guess that isn't so
> interesting here.
>
> So yeah, let's drop this whole vn thing for now...
The attached version drops VN and uses operand_equal_p for comparison.
Does it look OK?

Thanks,
Prathamesh
>
> Thanks,
> Richard

pr86753-9.diff (15K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: [SVE] PR86753

Richard Sandiford-9
Richard should have the final say, but some comments...

Prathamesh Kulkarni <[hidden email]> writes:

> diff --git a/gcc/tree-vect-stmts.c b/gcc/tree-vect-stmts.c
> index 1e2dfe5d22d..862206b3256 100644
> --- a/gcc/tree-vect-stmts.c
> +++ b/gcc/tree-vect-stmts.c
> @@ -1989,17 +1989,31 @@ check_load_store_masking (loop_vec_info loop_vinfo, tree vectype,
>  
>  static tree
>  prepare_load_store_mask (tree mask_type, tree loop_mask, tree vec_mask,
> - gimple_stmt_iterator *gsi)
> + gimple_stmt_iterator *gsi, tree mask,
> + cond_vmask_map_type *cond_to_vec_mask)

"scalar_mask" might be a better name.  But maybe we should key off the
vector mask after all, now that we're relying on the code having no
redundancies.

Passing the vinfo would be better than passing the cond_vmask_map_type
directly.

>  {
>    gcc_assert (useless_type_conversion_p (mask_type, TREE_TYPE (vec_mask)));
>    if (!loop_mask)
>      return vec_mask;
>  
>    gcc_assert (TREE_TYPE (loop_mask) == mask_type);
> +
> +  tree *slot = 0;
> +  if (cond_to_vec_mask)

The pointer should never be null in this context.

> +    {
> +      cond_vmask_key cond (mask, loop_mask);
> +      slot = &cond_to_vec_mask->get_or_insert (cond);
> +      if (*slot)
> + return *slot;
> +    }
> +
>    tree and_res = make_temp_ssa_name (mask_type, NULL, "vec_mask_and");
>    gimple *and_stmt = gimple_build_assign (and_res, BIT_AND_EXPR,
>    vec_mask, loop_mask);
>    gsi_insert_before (gsi, and_stmt, GSI_SAME_STMT);
> +
> +  if (slot)
> +    *slot = and_res;
>    return and_res;
>  }
> [...]
> @@ -9975,6 +9997,38 @@ vectorizable_condition (stmt_vec_info stmt_info, gimple_stmt_iterator *gsi,
>    /* Handle cond expr.  */
>    for (j = 0; j < ncopies; j++)
>      {
> +      tree vec_mask = NULL_TREE;
> +
> +      if (loop_vinfo && LOOP_VINFO_FULLY_MASKED_P (loop_vinfo)

Nit: one condition per line when the whole thing doesn't fit on a single line.

> +  && TREE_CODE_CLASS (TREE_CODE (cond_expr)) == tcc_comparison

Why restrict this to embedded comparisons?  It should work for separate
comparisons too.

> +  && loop_vinfo->cond_to_vec_mask)

This should always be nonnull given the above.

> + {
> +  vec_loop_masks *masks = &LOOP_VINFO_MASKS (loop_vinfo);
> +  if (masks)

This is never null.

> +    {
> +      tree loop_mask = vect_get_loop_mask (gsi, masks,
> +   ncopies, vectype, j);
> +
> +      cond_vmask_key cond (cond_expr, loop_mask);
> +      tree *slot = loop_vinfo->cond_to_vec_mask->get (cond);
> +      if (slot && *slot)
> + vec_mask = *slot;
> +      else
> + {
> +  cond.cond_ops.code
> +    = invert_tree_comparison (cond.cond_ops.code, true);
> +  slot = loop_vinfo->cond_to_vec_mask->get (cond);
> +  if (slot && *slot)
> +    {
> +      vec_mask = *slot;
> +      tree tmp = then_clause;
> +      then_clause = else_clause;
> +      else_clause = tmp;

Can use std::swap.

> +    }
> + }
> +    }
> + }
> +
>        stmt_vec_info new_stmt_info = NULL;
>        if (j == 0)
>   {
> @@ -10054,6 +10108,8 @@ vectorizable_condition (stmt_vec_info stmt_info, gimple_stmt_iterator *gsi,
>  
>    if (masked)
>      vec_compare = vec_cond_lhs;
> +  else if (vec_mask)
> +    vec_compare = vec_mask;

If we do drop the comparison check above, this should win over "masked".

> @@ -193,6 +194,81 @@ public:
>    poly_uint64 min_value;
>  };
>  
> +struct cond_vmask_key

I'm no good at naming things, but since "vmask" doesn't occur elsewhere
in target-independent code, how about "vec_masked_cond_key"?

> +{
> +  cond_vmask_key (tree t, tree loop_mask_)
> +    : cond_ops (t), loop_mask (loop_mask_)
> +  {}
> +
> +  hashval_t hash () const
> +  {
> +    inchash::hash h;
> +    h.add_int (cond_ops.code);
> +    h.add_int (TREE_HASH (cond_ops.op0));
> +    h.add_int (TREE_HASH (cond_ops.op1));

These two need to use inchash::add_expr, since you're hashing for
operand_equal_p.

> +    h.add_int (TREE_HASH (loop_mask));
> +    return h.end ();
> +  }
> +
> +  void mark_empty ()
> +  {
> +    loop_mask = NULL_TREE;
> +  }
> +
> +  bool is_empty ()
> +  {
> +    return loop_mask == NULL_TREE;
> +  }
> +
> +  tree_cond_ops cond_ops;
> +  tree loop_mask;
> +};
> +
> +inline bool operator== (const cond_vmask_key& c1, const cond_vmask_key& c2)
> +{
> +  return c1.loop_mask == c2.loop_mask
> + && c1.cond_ops == c2.cond_ops;

Multi-line expressions should be in brackets (or just put this one on
a single line).

> +}
> +
> +struct cond_vmask_key_traits

Might as well make this:

template<>
struct default_hash_traits<cond_vmask_key>

and then you can drop the third template parameter from hash_map.

> +{
> +  typedef cond_vmask_key value_type;
> +  typedef cond_vmask_key compare_type;
> +
> +  static inline hashval_t hash (value_type v)
> +  {
> +    return v.hash ();
> +  }
> +
> +  static inline bool equal (value_type existing, value_type candidate)
> +  {
> +    return existing == candidate;
> +  }
> +
> +  static inline void mark_empty (value_type& v)
> +  {
> +    v.mark_empty ();
> +  }
> +
> +  static inline bool is_empty (value_type v)
> +  {
> +    return v.is_empty ();
> +  }

Making hash (), mark_empty () and is_empty () forward to cond_vmask_key
functions of the same name seems unnecessary.  I think we should just put
the implementation here and not define the functions in cond_vmask_key
itself.

> +
> +  static void mark_deleted (value_type&) {}
> +
> +  static inline bool is_deleted (value_type)
> +  {
> +    return false;
> +  }
> +
> +  static inline void remove (value_type &) {}
> +};
> +
> +typedef hash_map<cond_vmask_key, tree,
> + simple_hashmap_traits <cond_vmask_key_traits, tree> >
> +  cond_vmask_map_type;
> +
>  /* Vectorizer state shared between different analyses like vector sizes
>     of the same CFG region.  */
>  class vec_info_shared {
> @@ -255,6 +331,8 @@ public:
>    /* Cost data used by the target cost model.  */
>    void *target_cost_data;
>  
> +  cond_vmask_map_type *cond_to_vec_mask;
> +
>  private:
>    stmt_vec_info new_stmt_vec_info (gimple *stmt);
>    void set_vinfo_for_stmt (gimple *, stmt_vec_info);
> diff --git a/gcc/tree.c b/gcc/tree.c
> index 8f80012c6e8..32a8fcf1eb8 100644
> --- a/gcc/tree.c
> +++ b/gcc/tree.c
> @@ -15204,6 +15204,44 @@ max_object_size (void)
>    return TYPE_MAX_VALUE (ptrdiff_type_node);
>  }
>  
> +/* If code(T) is comparison op or def of comparison stmt,
> +   extract it's operands.
> +   Else return <NE_EXPR, T, 0>.  */
> +
> +tree_cond_ops::tree_cond_ops (tree t)
> +{
> +  if (TREE_CODE_CLASS (TREE_CODE (t)) == tcc_comparison)
> +    {
> +      this->code = TREE_CODE (t);
> +      this->op0 = TREE_OPERAND (t, 0);
> +      this->op1 = TREE_OPERAND (t, 1);
> +      return;
> +    }
> +
> +  else if (TREE_CODE (t) == SSA_NAME)

Better as just an "if", given the early return above.

> +    {
> +      gassign *stmt = dyn_cast<gassign *> (SSA_NAME_DEF_STMT (t));
> +      if (stmt)
> + {
> +  tree_code code = gimple_assign_rhs_code (stmt);
> +  if (TREE_CODE_CLASS (code) == tcc_comparison)
> +    {
> +      this->code = code;
> +      this->op0 = gimple_assign_rhs1 (stmt);
> +      this->op1 = gimple_assign_rhs2 (stmt);
> +      return;
> +    }
> + }
> +
> +      this->code = NE_EXPR;
> +      this->op0 = t;
> +      this->op1 = build_zero_cst (TREE_TYPE (t));

Think we should use this as the default for non-SSA_NAMEs too,
rather than assert below.

> +    }
> +
> +  else
> +    gcc_unreachable ();
> +}
> +
>  #if CHECKING_P
>  
>  namespace selftest {
> diff --git a/gcc/tree.h b/gcc/tree.h
> index 94dbb95a78a..e6d6e9541c3 100644
> --- a/gcc/tree.h
> +++ b/gcc/tree.h
> @@ -6141,4 +6141,25 @@ public:
>    operator location_t () const { return m_combined_loc; }
>  };
>  
> +struct tree_cond_ops
> +{
> +  tree_code code;
> +  tree op0;
> +  tree op1;
> +
> +  tree_cond_ops (tree);
> +};
> +
> +/* ??? Not sure if it's good idea to include fold-const.h
> +       only for operand_equal_p ? */

Maybe put the new stuff in fold-const.h?

> +extern bool operand_equal_p (const_tree, const_tree, unsigned int);
> +
> +inline bool
> +operator== (const tree_cond_ops& o1, const tree_cond_ops &o2)
> +{
> +  return o1.code == o2.code
> + && operand_equal_p (o1.op0, o2.op0, 0)
> + && operand_equal_p (o1.op1, o2.op1, 0);

Multi-line expression should be enclosed in brackets.

Thanks,
Richard

> +}
> +
>  #endif  /* GCC_TREE_H  */

Reply | Threaded
Open this post in threaded view
|

Re: [SVE] PR86753

Prathamesh Kulkarni-2
On Tue, 27 Aug 2019 at 21:14, Richard Sandiford
<[hidden email]> wrote:

>
> Richard should have the final say, but some comments...
>
> Prathamesh Kulkarni <[hidden email]> writes:
> > diff --git a/gcc/tree-vect-stmts.c b/gcc/tree-vect-stmts.c
> > index 1e2dfe5d22d..862206b3256 100644
> > --- a/gcc/tree-vect-stmts.c
> > +++ b/gcc/tree-vect-stmts.c
> > @@ -1989,17 +1989,31 @@ check_load_store_masking (loop_vec_info loop_vinfo, tree vectype,
> >
> >  static tree
> >  prepare_load_store_mask (tree mask_type, tree loop_mask, tree vec_mask,
> > -                      gimple_stmt_iterator *gsi)
> > +                      gimple_stmt_iterator *gsi, tree mask,
> > +                      cond_vmask_map_type *cond_to_vec_mask)
>
> "scalar_mask" might be a better name.  But maybe we should key off the
> vector mask after all, now that we're relying on the code having no
> redundancies.
>
> Passing the vinfo would be better than passing the cond_vmask_map_type
> directly.
>
> >  {
> >    gcc_assert (useless_type_conversion_p (mask_type, TREE_TYPE (vec_mask)));
> >    if (!loop_mask)
> >      return vec_mask;
> >
> >    gcc_assert (TREE_TYPE (loop_mask) == mask_type);
> > +
> > +  tree *slot = 0;
> > +  if (cond_to_vec_mask)
>
> The pointer should never be null in this context.
Disabling the check for NULL results in a segfault with cond_arith_4.c,
because we reach prepare_load_store_mask via vect_schedule_slp, called
from here in vect_transform_loop:
 /* Schedule the SLP instances first, then handle loop vectorization
     below.  */
  if (!loop_vinfo->slp_instances.is_empty ())
    {
      DUMP_VECT_SCOPE ("scheduling SLP instances");
      vect_schedule_slp (loop_vinfo);
    }

which runs before the bb processing loop.

>
> > +    {
> > +      cond_vmask_key cond (mask, loop_mask);
> > +      slot = &cond_to_vec_mask->get_or_insert (cond);
> > +      if (*slot)
> > +     return *slot;
> > +    }
> > +
> >    tree and_res = make_temp_ssa_name (mask_type, NULL, "vec_mask_and");
> >    gimple *and_stmt = gimple_build_assign (and_res, BIT_AND_EXPR,
> >                                         vec_mask, loop_mask);
> >    gsi_insert_before (gsi, and_stmt, GSI_SAME_STMT);
> > +
> > +  if (slot)
> > +    *slot = and_res;
> >    return and_res;
> >  }
> > [...]
> > @@ -9975,6 +9997,38 @@ vectorizable_condition (stmt_vec_info stmt_info, gimple_stmt_iterator *gsi,
> >    /* Handle cond expr.  */
> >    for (j = 0; j < ncopies; j++)
> >      {
> > +      tree vec_mask = NULL_TREE;
> > +
> > +      if (loop_vinfo && LOOP_VINFO_FULLY_MASKED_P (loop_vinfo)
>
> Nit: one condition per line when the whole thing doesn't fit on a single line.
>
> > +       && TREE_CODE_CLASS (TREE_CODE (cond_expr)) == tcc_comparison
>
> Why restrict this to embedded comparisons?  It should work for separate
> comparisons too.
>
> > +       && loop_vinfo->cond_to_vec_mask)
>
> This should always be nonnull given the above.
>
> > +     {
> > +       vec_loop_masks *masks = &LOOP_VINFO_MASKS (loop_vinfo);
> > +       if (masks)
>
> This is never null.
>
> > +         {
> > +           tree loop_mask = vect_get_loop_mask (gsi, masks,
> > +                                                ncopies, vectype, j);
> > +
> > +           cond_vmask_key cond (cond_expr, loop_mask);
> > +           tree *slot = loop_vinfo->cond_to_vec_mask->get (cond);
> > +           if (slot && *slot)
> > +             vec_mask = *slot;
> > +           else
> > +             {
> > +               cond.cond_ops.code
> > +                 = invert_tree_comparison (cond.cond_ops.code, true);
> > +               slot = loop_vinfo->cond_to_vec_mask->get (cond);
> > +               if (slot && *slot)
> > +                 {
> > +                   vec_mask = *slot;
> > +                   tree tmp = then_clause;
> > +                   then_clause = else_clause;
> > +                   else_clause = tmp;
>
> Can use std::swap.
>
> > +                 }
> > +             }
> > +         }
> > +     }
> > +
> >        stmt_vec_info new_stmt_info = NULL;
> >        if (j == 0)
> >       {
> > @@ -10054,6 +10108,8 @@ vectorizable_condition (stmt_vec_info stmt_info, gimple_stmt_iterator *gsi,
> >
> >         if (masked)
> >           vec_compare = vec_cond_lhs;
> > +       else if (vec_mask)
> > +         vec_compare = vec_mask;
>
> If we do drop the comparison check above, this should win over "masked".
>
> > @@ -193,6 +194,81 @@ public:
> >    poly_uint64 min_value;
> >  };
> >
> > +struct cond_vmask_key
>
> I'm no good at naming things, but since "vmask" doesn't occur elsewhere
> in target-independent code, how about "vec_masked_cond_key"?
>
> > +{
> > +  cond_vmask_key (tree t, tree loop_mask_)
> > +    : cond_ops (t), loop_mask (loop_mask_)
> > +  {}
> > +
> > +  hashval_t hash () const
> > +  {
> > +    inchash::hash h;
> > +    h.add_int (cond_ops.code);
> > +    h.add_int (TREE_HASH (cond_ops.op0));
> > +    h.add_int (TREE_HASH (cond_ops.op1));
>
> These two need to use inchash::add_expr, since you're hashing for
> operand_equal_p.
>
> > +    h.add_int (TREE_HASH (loop_mask));
> > +    return h.end ();
> > +  }
> > +
> > +  void mark_empty ()
> > +  {
> > +    loop_mask = NULL_TREE;
> > +  }
> > +
> > +  bool is_empty ()
> > +  {
> > +    return loop_mask == NULL_TREE;
> > +  }
> > +
> > +  tree_cond_ops cond_ops;
> > +  tree loop_mask;
> > +};
> > +
> > +inline bool operator== (const cond_vmask_key& c1, const cond_vmask_key& c2)
> > +{
> > +  return c1.loop_mask == c2.loop_mask
> > +      && c1.cond_ops == c2.cond_ops;
>
> Multi-line expressions should be in brackets (or just put this one on
> a single line).
>
> > +}
> > +
> > +struct cond_vmask_key_traits
>
> Might as well make this:
>
> template<>
> struct default_hash_traits<cond_vmask_key>
>
> and then you can drop the third template parameter from hash_map.
>
> > +{
> > +  typedef cond_vmask_key value_type;
> > +  typedef cond_vmask_key compare_type;
> > +
> > +  static inline hashval_t hash (value_type v)
> > +  {
> > +    return v.hash ();
> > +  }
> > +
> > +  static inline bool equal (value_type existing, value_type candidate)
> > +  {
> > +    return existing == candidate;
> > +  }
> > +
> > +  static inline void mark_empty (value_type& v)
> > +  {
> > +    v.mark_empty ();
> > +  }
> > +
> > +  static inline bool is_empty (value_type v)
> > +  {
> > +    return v.is_empty ();
> > +  }
>
> Making hash (), mask_empty () and is_empty () forward to cond_vmask_key
> functions of the same name seems unnecessary.  I think we should just put
> the implementation here and not define the functions in cond_vmask_key
> itself.
>
> > +
> > +  static void mark_deleted (value_type&) {}
> > +
> > +  static inline bool is_deleted (value_type)
> > +  {
> > +    return false;
> > +  }
> > +
> > +  static inline void remove (value_type &) {}
> > +};
> > +
> > +typedef hash_map<cond_vmask_key, tree,
> > +              simple_hashmap_traits <cond_vmask_key_traits, tree> >
> > +  cond_vmask_map_type;
> > +
> >  /* Vectorizer state shared between different analyses like vector sizes
> >     of the same CFG region.  */
> >  class vec_info_shared {
> > @@ -255,6 +331,8 @@ public:
> >    /* Cost data used by the target cost model.  */
> >    void *target_cost_data;
> >
> > +  cond_vmask_map_type *cond_to_vec_mask;
> > +
> >  private:
> >    stmt_vec_info new_stmt_vec_info (gimple *stmt);
> >    void set_vinfo_for_stmt (gimple *, stmt_vec_info);
> > diff --git a/gcc/tree.c b/gcc/tree.c
> > index 8f80012c6e8..32a8fcf1eb8 100644
> > --- a/gcc/tree.c
> > +++ b/gcc/tree.c
> > @@ -15204,6 +15204,44 @@ max_object_size (void)
> >    return TYPE_MAX_VALUE (ptrdiff_type_node);
> >  }
> >
> > +/* If code(T) is comparison op or def of comparison stmt,
> > +   extract it's operands.
> > +   Else return <NE_EXPR, T, 0>.  */
> > +
> > +tree_cond_ops::tree_cond_ops (tree t)
> > +{
> > +  if (TREE_CODE_CLASS (TREE_CODE (t)) == tcc_comparison)
> > +    {
> > +      this->code = TREE_CODE (t);
> > +      this->op0 = TREE_OPERAND (t, 0);
> > +      this->op1 = TREE_OPERAND (t, 1);
> > +      return;
> > +    }
> > +
> > +  else if (TREE_CODE (t) == SSA_NAME)
>
> Better as just an "if", given the early return above.
>
> > +    {
> > +      gassign *stmt = dyn_cast<gassign *> (SSA_NAME_DEF_STMT (t));
> > +      if (stmt)
> > +     {
> > +       tree_code code = gimple_assign_rhs_code (stmt);
> > +       if (TREE_CODE_CLASS (code) == tcc_comparison)
> > +         {
> > +           this->code = code;
> > +           this->op0 = gimple_assign_rhs1 (stmt);
> > +           this->op1 = gimple_assign_rhs2 (stmt);
> > +           return;
> > +         }
> > +     }
> > +
> > +      this->code = NE_EXPR;
> > +      this->op0 = t;
> > +      this->op1 = build_zero_cst (TREE_TYPE (t));
>
> Think we should use this as the default for non-SSA_NAMEs too,
> rather than assert below.
>
> > +    }
> > +
> > +  else
> > +    gcc_unreachable ();
> > +}
> > +
> >  #if CHECKING_P
> >
> >  namespace selftest {
> > diff --git a/gcc/tree.h b/gcc/tree.h
> > index 94dbb95a78a..e6d6e9541c3 100644
> > --- a/gcc/tree.h
> > +++ b/gcc/tree.h
> > @@ -6141,4 +6141,25 @@ public:
> >    operator location_t () const { return m_combined_loc; }
> >  };
> >
> > +struct tree_cond_ops
> > +{
> > +  tree_code code;
> > +  tree op0;
> > +  tree op1;
> > +
> > +  tree_cond_ops (tree);
> > +};
> > +
> > +/* ??? Not sure if it's good idea to include fold-const.h
> > +       only for operand_equal_p ? */
>
> Maybe put the new stuff in fold-const.h?
>
> > +extern bool operand_equal_p (const_tree, const_tree, unsigned int);
> > +
> > +inline bool
> > +operator== (const tree_cond_ops& o1, const tree_cond_ops &o2)
> > +{
> > +  return o1.code == o2.code
> > +      && operand_equal_p (o1.op0, o2.op0, 0)
> > +      && operand_equal_p (o1.op1, o2.op1, 0);
>
> Multi-line expression should be enclosed in brackets.
Thanks for the suggestions, I tried addressing these in the attached patch.
Does it look OK?

Thanks,
Prathamesh
>
> Thanks,
> Richard
>
> > +}
> > +
> >  #endif  /* GCC_TREE_H  */
>

pr86753-10.diff (14K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: [SVE] PR86753

Richard Sandiford-9
Prathamesh Kulkarni <[hidden email]> writes:

> On Tue, 27 Aug 2019 at 21:14, Richard Sandiford
> <[hidden email]> wrote:
>>
>> Richard should have the final say, but some comments...
>>
>> Prathamesh Kulkarni <[hidden email]> writes:
>> > diff --git a/gcc/tree-vect-stmts.c b/gcc/tree-vect-stmts.c
>> > index 1e2dfe5d22d..862206b3256 100644
>> > --- a/gcc/tree-vect-stmts.c
>> > +++ b/gcc/tree-vect-stmts.c
>> > @@ -1989,17 +1989,31 @@ check_load_store_masking (loop_vec_info loop_vinfo, tree vectype,
>> >
>> >  static tree
>> >  prepare_load_store_mask (tree mask_type, tree loop_mask, tree vec_mask,
>> > -                      gimple_stmt_iterator *gsi)
>> > +                      gimple_stmt_iterator *gsi, tree mask,
>> > +                      cond_vmask_map_type *cond_to_vec_mask)
>>
>> "scalar_mask" might be a better name.  But maybe we should key off the
>> vector mask after all, now that we're relying on the code having no
>> redundancies.
>>
>> Passing the vinfo would be better than passing the cond_vmask_map_type
>> directly.
>>
>> >  {
>> >    gcc_assert (useless_type_conversion_p (mask_type, TREE_TYPE (vec_mask)));
>> >    if (!loop_mask)
>> >      return vec_mask;
>> >
>> >    gcc_assert (TREE_TYPE (loop_mask) == mask_type);
>> > +
>> > +  tree *slot = 0;
>> > +  if (cond_to_vec_mask)
>>
>> The pointer should never be null in this context.
> Disabling check for NULL results in segfault with cond_arith_4.c because we
> reach prepare_load_store_mask via vect_schedule_slp, called from
> here in vect_transform_loop:
>  /* Schedule the SLP instances first, then handle loop vectorization
>      below.  */
>   if (!loop_vinfo->slp_instances.is_empty ())
>     {
>       DUMP_VECT_SCOPE ("scheduling SLP instances");
>       vect_schedule_slp (loop_vinfo);
>     }
>
> which is before bb processing loop.

We want this optimisation to be applied to SLP too though, especially
since non-SLP will be going away at some point.

But as Richard says, the problem with SLP is that the statements aren't
traversed in block order, so I guess we can't do the on-the-fly
redundancy elimination there...

Maybe an alternative would be to record during the analysis phase which
scalar conditions need which loop masks.  Statements that need a loop
mask currently do:

      vect_record_loop_mask (loop_vinfo, masks, ncopies, vectype);

If we also pass the scalar condition, we can maintain a hash_set of
<condition, ncopies> pairs, representing the conditions that have
loop masks applied at some point in the vectorised code.  The COND_EXPR
code can use that set to decide whether to apply the loop mask or not.

Trying to avoid duplicate ANDs with the loop mask would then become a
separate follow-on change.  Not sure whether it's worth it on its own.
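
A rough standalone sketch of that bookkeeping (illustrative names only,
not the actual vectorizer structures; the real keys would be GIMPLE
trees rather than strings): the analysis phase records the scalar
condition alongside each loop-mask request, and the COND_EXPR code
later just queries the set.

// Illustrative sketch: track which (scalar condition, ncopies) pairs had a
// loop mask applied during analysis, then query that when vectorizing a
// COND_EXPR.  Not GCC code.
#include <cstddef>
#include <functional>
#include <iostream>
#include <string>
#include <unordered_set>
#include <utility>

struct cond_ncopies_hash
{
  std::size_t operator() (const std::pair<std::string, unsigned> &p) const
  {
    return std::hash<std::string> () (p.first) * 31 + p.second;
  }
};

typedef std::unordered_set<std::pair<std::string, unsigned>,
                           cond_ncopies_hash> masked_cond_set;

// Analysis phase: called alongside the loop-mask recording, with the scalar
// condition guarding the masked statement.
static void
record_masked_condition (masked_cond_set &set, const std::string &cond,
                         unsigned ncopies)
{
  set.emplace (cond, ncopies);
}

// Transform phase: should the COND_EXPR's comparison be ANDed with the loop
// mask, so that it matches the mask already used by the masked load/store?
static bool
cond_uses_loop_mask_p (const masked_cond_set &set, const std::string &cond,
                       unsigned ncopies)
{
  return set.count (std::make_pair (cond, ncopies)) != 0;
}

int main ()
{
  masked_cond_set set;
  record_masked_condition (set, "_4 != 0", 1);   // e.g. from a .MASK_LOAD

  if (cond_uses_loop_mask_p (set, "_4 != 0", 1))
    std::cout << "apply the loop mask to the VEC_COND_EXPR\n";
  else
    std::cout << "leave the VEC_COND_EXPR comparison unmasked\n";
}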

Thanks,
Richard