[ARM] Implement division using vrecpe, vrecps

classic Classic list List threaded Threaded
7 messages Options
Reply | Threaded
Open this post in threaded view
|

[ARM] Implement division using vrecpe, vrecps

Prathamesh Kulkarni-2
Hi,
This is a rebased version of patch that adds a pattern to neon.md for
implementing division with multiplication by reciprocal using
vrecpe/vrecps with -funsafe-math-optimizations excluding -Os.
The newly added test-cases are not vectorized on armeb target with
-O2. I posted the analysis for that here:
https://gcc.gnu.org/ml/gcc-patches/2016-05/msg01765.html

Briefly, the difference between little and big-endian vectorizer is in
arm_builtin_support_vector_misalignment() which calls
default_builtin_support_vector_misalignment() for big-endian case, and
that returns false because
movmisalign_optab does not exist for V2SF mode. This isn't observed
with -O3 because loop peeling for alignment gets enabled.

It seems that the test cases in patch appear unsupported on armeb,
after r221677 thus this patch requires no changes to
target-supports.exp to adjust for armeb (unlike last time which
stalled the patch).

Bootstrap+tested on arm-linux-gnueabihf.
Cross-tested on arm*-*-* variants.
OK for trunk ?

Thanks,
Prathamesh

tcwg-319-3.txt (4K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: [ARM] Implement division using vrecpe, vrecps

Prathamesh Kulkarni-2
On Fri, 26 Oct 2018 at 10:34, Prathamesh Kulkarni
<[hidden email]> wrote:

>
> Hi,
> This is a rebased version of patch that adds a pattern to neon.md for
> implementing division with multiplication by reciprocal using
> vrecpe/vrecps with -funsafe-math-optimizations excluding -Os.
> The newly added test-cases are not vectorized on armeb target with
> -O2. I posted the analysis for that here:
> https://gcc.gnu.org/ml/gcc-patches/2016-05/msg01765.html
>
> Briefly, the difference between little and big-endian vectorizer is in
> arm_builtin_support_vector_misalignment() which calls
> default_builtin_support_vector_misalignment() for big-endian case, and
> that returns false because
> movmisalign_optab does not exist for V2SF mode. This isn't observed
> with -O3 because loop peeling for alignment gets enabled.
>
> It seems that the test cases in patch appear unsupported on armeb,
> after r221677 thus this patch requires no changes to
> target-supports.exp to adjust for armeb (unlike last time which
> stalled the patch).
>
> Bootstrap+tested on arm-linux-gnueabihf.
> Cross-tested on arm*-*-* variants.
> OK for trunk ?
ping: https://gcc.gnu.org/ml/gcc-patches/2018-10/msg01645.html

Thanks,
Prathamesh
>
> Thanks,
> Prathamesh
Reply | Threaded
Open this post in threaded view
|

Re: [ARM] Implement division using vrecpe, vrecps

Wilco Dijkstra-2
In reply to this post by Prathamesh Kulkarni-2
Prathamesh Kulkarni wrote:

> This is a rebased version of patch that adds a pattern to neon.md for
> implementing division with multiplication by reciprocal using
> vrecpe/vrecps with -funsafe-math-optimizations excluding -Os.
> The newly added test-cases are not vectorized on armeb target with
> -O2. I posted the analysis for that here:
> https://gcc.gnu.org/ml/gcc-patches/2016-05/msg01765.html

I don't think doing this unconditionally for any CPU is a good idea. On AArch64
we don't enable this for any core since it's not really faster (newer CPUs have
significantly improved division and the reciprocal instructions reduce throughput
of other FMAs). On wrf doing reciprocal square root is far better than reciprocal
division, but it's only faster on some specific CPUs, so it's not enabled by default.

Wilco
Reply | Threaded
Open this post in threaded view
|

Re: [ARM] Implement division using vrecpe, vrecps

Prathamesh Kulkarni-2
On Fri, 2 Nov 2018 at 19:08, Wilco Dijkstra <[hidden email]> wrote:

>
> Prathamesh Kulkarni wrote:
>
> > This is a rebased version of patch that adds a pattern to neon.md for
> > implementing division with multiplication by reciprocal using
> > vrecpe/vrecps with -funsafe-math-optimizations excluding -Os.
> > The newly added test-cases are not vectorized on armeb target with
> > -O2. I posted the analysis for that here:
> > https://gcc.gnu.org/ml/gcc-patches/2016-05/msg01765.html
>
> I don't think doing this unconditionally for any CPU is a good idea. On AArch64
> we don't enable this for any core since it's not really faster (newer CPUs have
> significantly improved division and the reciprocal instructions reduce throughput
> of other FMAs). On wrf doing reciprocal square root is far better than reciprocal
> division, but it's only faster on some specific CPUs, so it's not enabled by default.
Hi Wilco,
Thanks for the suggestions. The last time I benchmarked the patch
(around Jan 2016)
I got following results with the patch for SPEC2006:

a15: +0.64% overall, 481.wrf: +6.46%
a53: +0.21% overall, 416.gamess: -1.39%, 481.wrf: +6.76%
a57: +0.35% overall, 481.wrf: +3.84%
(https://gcc.gnu.org/ml/gcc-patches/2016-01/msg01209.html)

Do these numbers look acceptable ?
I am benchmarking the patch on ToT, and will report if there are any
performance improvements found with the patch.

Thanks,
Prathamesh
>
> Wilco
Reply | Threaded
Open this post in threaded view
|

Re: [ARM] Implement division using vrecpe, vrecps

Wilco Dijkstra-2
Hi Prathamesh,

Prathamesh Kulkarni wrote:

> Thanks for the suggestions. The last time I benchmarked the patch
> (around Jan 2016)
> I got following results with the patch for SPEC2006:
>
> a15: +0.64% overall, 481.wrf: +6.46%
> a53: +0.21% overall, 416.gamess: -1.39%, 481.wrf: +6.76%
> a57: +0.35% overall, 481.wrf: +3.84%
> (https://gcc.gnu.org/ml/gcc-patches/2016-01/msg01209.html)
>
> Do these numbers look acceptable ?
> I am benchmarking the patch on ToT, and will report if there are any
> performance improvements found with the patch.

Yes those results are quite good - in fact they seemed too good to be true at first.
However looking at arm/neon.md there isn't a division pattern. So I think it's worth
mentioning in the description that your patch actually adds vectorization of
division. Disassembling the AArch64 wrf binary shows several hundred vector
division instructions - so the speedup makes sense now since many more loops
are being vectorized.

It's a shame this pattern wasn't added many years ago... It's a good idea to add a
vectorized (r)sqrt too as this will improve wrf even further.

Wilco
Reply | Threaded
Open this post in threaded view
|

Re: [ARM] Implement division using vrecpe, vrecps

Ramana Radhakrishnan-4
In reply to this post by Prathamesh Kulkarni-2
On 26/10/2018 06:04, Prathamesh Kulkarni wrote:

> Hi,
> This is a rebased version of patch that adds a pattern to neon.md for
> implementing division with multiplication by reciprocal using
> vrecpe/vrecps with -funsafe-math-optimizations excluding -Os.
> The newly added test-cases are not vectorized on armeb target with
> -O2. I posted the analysis for that here:
> https://gcc.gnu.org/ml/gcc-patches/2016-05/msg01765.html
>
> Briefly, the difference between little and big-endian vectorizer is in
> arm_builtin_support_vector_misalignment() which calls
> default_builtin_support_vector_misalignment() for big-endian case, and
> that returns false because
> movmisalign_optab does not exist for V2SF mode. This isn't observed
> with -O3 because loop peeling for alignment gets enabled.
>
> It seems that the test cases in patch appear unsupported on armeb,
> after r221677 thus this patch requires no changes to
> target-supports.exp to adjust for armeb (unlike last time which
> stalled the patch).
>
> Bootstrap+tested on arm-linux-gnueabihf.
> Cross-tested on arm*-*-* variants.
> OK for trunk ?
>
> Thanks,
> Prathamesh
>
>
> tcwg-319-3.txt
>
> 2018-10-26  Prathamesh Kulkarni<[hidden email]>
>
> * config/arm/neon.md (div<mode>3): New pattern.
>
> testsuite/
> * gcc.target/arm/neon-vect-div-1.c: New test.
> * gcc.target/arm/neon-vect-div-2.c: Likewise.
>
> diff --git a/gcc/config/arm/neon.md b/gcc/config/arm/neon.md
> index 5aeee4b08c1..25ed45d381a 100644
> --- a/gcc/config/arm/neon.md
> +++ b/gcc/config/arm/neon.md
> @@ -620,6 +620,38 @@
>                       (const_string "neon_mul_<V_elem_ch><q>")))]
>   )
>  
> +/* Perform division using multiply-by-reciprocal.
> +   Reciprocal is calculated using Newton-Raphson method.
> +   Enabled with -funsafe-math-optimizations -freciprocal-math
> +   and disabled for -Os since it increases code size .  */ > +
> +(define_expand "div<mode>3"
> +  [(set (match_operand:VCVTF 0 "s_register_operand" "=w")
> +        (div:VCVTF (match_operand:VCVTF 1 "s_register_operand" "w")
> +  (match_operand:VCVTF 2 "s_register_operand" "w")))]
> +  "TARGET_NEON && !optimize_size
> +   && flag_unsafe_math_optimizations && flag_reciprocal_math"

I would prefer this to be more granular than
flag_unsafe_math_optimization && flag_reciprocal_math which really is
flag_reciprocal_math as it is turned on by default with
funsafe-math-optimizations.

I think this should really be just flag_reciprocal_math.


Otherwise ok.

regards
Ramana




> +  {
> +    rtx rec = gen_reg_rtx (<MODE>mode);
> +    rtx vrecps_temp = gen_reg_rtx (<MODE>mode);
> +
> +    /* Reciprocal estimate.  */
> +    emit_insn (gen_neon_vrecpe<mode> (rec, operands[2]));
> +
> +    /* Perform 2 iterations of newton-raphson method.  */
> +    for (int i = 0; i < 2; i++)
> +      {
> + emit_insn (gen_neon_vrecps<mode> (vrecps_temp, rec, operands[2]));
> + emit_insn (gen_mul<mode>3 (rec, rec, vrecps_temp));
> +      }
> +
> +    /* We now have reciprocal in rec, perform operands[0] = operands[1] * rec.  */
> +    emit_insn (gen_mul<mode>3 (operands[0], operands[1], rec));
> +    DONE;
> +  }
> +)
> +
> +
>   (define_insn "mul<mode>3add<mode>_neon"
>     [(set (match_operand:VDQW 0 "s_register_operand" "=w")
>           (plus:VDQW (mult:VDQW (match_operand:VDQW 2 "s_register_operand" "w")
> diff --git a/gcc/testsuite/gcc.target/arm/neon-vect-div-1.c b/gcc/testsuite/gcc.target/arm/neon-vect-div-1.c
> new file mode 100644
> index 00000000000..50d04b4175b
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/arm/neon-vect-div-1.c
> @@ -0,0 +1,16 @@
> +/* Test pattern div<mode>3.  */
> +/* { dg-do compile } */
> +/* { dg-require-effective-target arm_neon_ok } */
> +/* { dg-require-effective-target vect_hw_misalign } */
> +/* { dg-options "-O2 -ftree-vectorize -funsafe-math-optimizations -fdump-tree-vect-details" } */
> +/* { dg-add-options arm_neon } */
> +
> +void
> +foo (int len, float * __restrict p, float *__restrict x)
> +{
> +  len = len & ~31;
> +  for (int i = 0; i < len; i++)
> +    p[i] = p[i] / x[i];
> +}
> +
> +/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */
> diff --git a/gcc/testsuite/gcc.target/arm/neon-vect-div-2.c b/gcc/testsuite/gcc.target/arm/neon-vect-div-2.c
> new file mode 100644
> index 00000000000..606f54b4e0e
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/arm/neon-vect-div-2.c
> @@ -0,0 +1,16 @@
> +/* Test pattern div<mode>3.  */
> +/* { dg-do compile } */
> +/* { dg-require-effective-target arm_neon_ok } */
> +/* { dg-require-effective-target vect_hw_misalign } */
> +/* { dg-options "-O3 -ftree-vectorize -funsafe-math-optimizations -fdump-tree-vect-details -fno-reciprocal-math" } */
> +/* { dg-add-options arm_neon } */
> +
> +void
> +foo (int len, float * __restrict p, float *__restrict x)
> +{
> +  len = len & ~31;
> +  for (int i = 0; i < len; i++)
> +    p[i] = p[i] / x[i];
> +}
> +
> +/* { dg-final { scan-tree-dump-not "vectorized 1 loops" "vect" } } */
>

Reply | Threaded
Open this post in threaded view
|

Re: [ARM] Implement division using vrecpe, vrecps

Prathamesh Kulkarni-2
On Mon, 5 Nov 2018 at 19:22, Ramana Radhakrishnan
<[hidden email]> wrote:

>
> On 26/10/2018 06:04, Prathamesh Kulkarni wrote:
> > Hi,
> > This is a rebased version of patch that adds a pattern to neon.md for
> > implementing division with multiplication by reciprocal using
> > vrecpe/vrecps with -funsafe-math-optimizations excluding -Os.
> > The newly added test-cases are not vectorized on armeb target with
> > -O2. I posted the analysis for that here:
> > https://gcc.gnu.org/ml/gcc-patches/2016-05/msg01765.html
> >
> > Briefly, the difference between little and big-endian vectorizer is in
> > arm_builtin_support_vector_misalignment() which calls
> > default_builtin_support_vector_misalignment() for big-endian case, and
> > that returns false because
> > movmisalign_optab does not exist for V2SF mode. This isn't observed
> > with -O3 because loop peeling for alignment gets enabled.
> >
> > It seems that the test cases in patch appear unsupported on armeb,
> > after r221677 thus this patch requires no changes to
> > target-supports.exp to adjust for armeb (unlike last time which
> > stalled the patch).
> >
> > Bootstrap+tested on arm-linux-gnueabihf.
> > Cross-tested on arm*-*-* variants.
> > OK for trunk ?
> >
> > Thanks,
> > Prathamesh
> >
> >
> > tcwg-319-3.txt
> >
> > 2018-10-26  Prathamesh Kulkarni<[hidden email]>
> >
> >       * config/arm/neon.md (div<mode>3): New pattern.
> >
> > testsuite/
> >       * gcc.target/arm/neon-vect-div-1.c: New test.
> >       * gcc.target/arm/neon-vect-div-2.c: Likewise.
> >
> > diff --git a/gcc/config/arm/neon.md b/gcc/config/arm/neon.md
> > index 5aeee4b08c1..25ed45d381a 100644
> > --- a/gcc/config/arm/neon.md
> > +++ b/gcc/config/arm/neon.md
> > @@ -620,6 +620,38 @@
> >                       (const_string "neon_mul_<V_elem_ch><q>")))]
> >   )
> >
> > +/* Perform division using multiply-by-reciprocal.
> > +   Reciprocal is calculated using Newton-Raphson method.
> > +   Enabled with -funsafe-math-optimizations -freciprocal-math
> > +   and disabled for -Os since it increases code size .  */ > +
> > +(define_expand "div<mode>3"
> > +  [(set (match_operand:VCVTF 0 "s_register_operand" "=w")
> > +        (div:VCVTF (match_operand:VCVTF 1 "s_register_operand" "w")
> > +               (match_operand:VCVTF 2 "s_register_operand" "w")))]
> > +  "TARGET_NEON && !optimize_size
> > +   && flag_unsafe_math_optimizations && flag_reciprocal_math"
>
> I would prefer this to be more granular than
> flag_unsafe_math_optimization && flag_reciprocal_math which really is
> flag_reciprocal_math as it is turned on by default with
> funsafe-math-optimizations.
>
> I think this should really be just flag_reciprocal_math.
>
>
> Otherwise ok.
Thanks, committed it in r265948 after removing check for
flag_unsafe_math_optimizations.

Regards,
Prathamesh

>
> regards
> Ramana
>
>
>
>
> > +  {
> > +    rtx rec = gen_reg_rtx (<MODE>mode);
> > +    rtx vrecps_temp = gen_reg_rtx (<MODE>mode);
> > +
> > +    /* Reciprocal estimate.  */
> > +    emit_insn (gen_neon_vrecpe<mode> (rec, operands[2]));
> > +
> > +    /* Perform 2 iterations of newton-raphson method.  */
> > +    for (int i = 0; i < 2; i++)
> > +      {
> > +     emit_insn (gen_neon_vrecps<mode> (vrecps_temp, rec, operands[2]));
> > +     emit_insn (gen_mul<mode>3 (rec, rec, vrecps_temp));
> > +      }
> > +
> > +    /* We now have reciprocal in rec, perform operands[0] = operands[1] * rec.  */
> > +    emit_insn (gen_mul<mode>3 (operands[0], operands[1], rec));
> > +    DONE;
> > +  }
> > +)
> > +
> > +
> >   (define_insn "mul<mode>3add<mode>_neon"
> >     [(set (match_operand:VDQW 0 "s_register_operand" "=w")
> >           (plus:VDQW (mult:VDQW (match_operand:VDQW 2 "s_register_operand" "w")
> > diff --git a/gcc/testsuite/gcc.target/arm/neon-vect-div-1.c b/gcc/testsuite/gcc.target/arm/neon-vect-div-1.c
> > new file mode 100644
> > index 00000000000..50d04b4175b
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/arm/neon-vect-div-1.c
> > @@ -0,0 +1,16 @@
> > +/* Test pattern div<mode>3.  */
> > +/* { dg-do compile } */
> > +/* { dg-require-effective-target arm_neon_ok } */
> > +/* { dg-require-effective-target vect_hw_misalign } */
> > +/* { dg-options "-O2 -ftree-vectorize -funsafe-math-optimizations -fdump-tree-vect-details" } */
> > +/* { dg-add-options arm_neon } */
> > +
> > +void
> > +foo (int len, float * __restrict p, float *__restrict x)
> > +{
> > +  len = len & ~31;
> > +  for (int i = 0; i < len; i++)
> > +    p[i] = p[i] / x[i];
> > +}
> > +
> > +/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */
> > diff --git a/gcc/testsuite/gcc.target/arm/neon-vect-div-2.c b/gcc/testsuite/gcc.target/arm/neon-vect-div-2.c
> > new file mode 100644
> > index 00000000000..606f54b4e0e
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/arm/neon-vect-div-2.c
> > @@ -0,0 +1,16 @@
> > +/* Test pattern div<mode>3.  */
> > +/* { dg-do compile } */
> > +/* { dg-require-effective-target arm_neon_ok } */
> > +/* { dg-require-effective-target vect_hw_misalign } */
> > +/* { dg-options "-O3 -ftree-vectorize -funsafe-math-optimizations -fdump-tree-vect-details -fno-reciprocal-math" } */
> > +/* { dg-add-options arm_neon } */
> > +
> > +void
> > +foo (int len, float * __restrict p, float *__restrict x)
> > +{
> > +  len = len & ~31;
> > +  for (int i = 0; i < len; i++)
> > +    p[i] = p[i] / x[i];
> > +}
> > +
> > +/* { dg-final { scan-tree-dump-not "vectorized 1 loops" "vect" } } */
> >
>