PPC64 libmvec implementation of sincos

PPC64 libmvec implementation of sincos

GT
I am attempting to create a vector version of sincos for PPC64.
The relevant discussion thread is on the GLIBC libc-alpha mailing list.
Navigate it beginning at https://sourceware.org/ml/libc-alpha/2019-09/msg00334.html

The intention is to reuse as much as possible from the existing GCC implementation of other libmvec functions.
My questions are: which function(s) in GCC do the following?

1. Gather scalar function input arguments, from multiple loop iterations, into a single vector input argument for the vector function version?
2. Distribute scalar function outputs, to appropriate loop iteration result, from the single vector function output result?

I am referring especially to vectorization of sin and cos.
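
As a rough illustration of what I mean by "gather" and "distribute" (the vector symbol below is only an assumed placeholder in the style of the existing x86_64 libmvec names, not the actual PPC64 ABI; remainder iterations are left out of the sketch):

#include <math.h>

typedef double v2df __attribute__ ((vector_size (16)));

/* assumed 2-lane vector variant of sin; the real PPC64 name may differ */
extern v2df _ZGVbN2v_sin (v2df);

void scalar_loop (double *a, const double *b, int n)
{
  for (int i = 0; i < n; i++)
    a[i] = sin (b[i]);                /* one scalar call per iteration */
}

void vector_equivalent (double *a, const double *b, int n)
{
  for (int i = 0; i + 1 < n; i += 2)
    {
      v2df vb = { b[i], b[i + 1] };   /* gather two scalar inputs */
      v2df va = _ZGVbN2v_sin (vb);    /* one vector call */
      a[i] = va[0];                   /* distribute the two results */
      a[i + 1] = va[1];
    }
}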

Thanks.
Bert Tenjy.

Re: PPC64 libmvec implementation of sincos

Szabolcs Nagy-2
On 27/09/2019 20:23, GT wrote:

> I am attempting to create a vector version of sincos for PPC64.
> The relevant discussion thread is on the GLIBC libc-alpha mailing list.
> Navigate it beginning at https://sourceware.org/ml/libc-alpha/2019-09/msg00334.html
>
> The intention is to reuse as much as possible from the existing GCC implementation of other libmvec functions.
> My questions are: Which function(s) in GCC;
>
> 1. Gather scalar function input arguments, from multiple loop iterations, into a single vector input argument for the vector function version?
> 2. Distribute scalar function outputs, to appropriate loop iteration result, from the single vector function output result?
>
> I am referring especially to vectorization of sin and cos.

i wonder if gcc can auto-vectorize scalar sincos
calls, the vectorizer seems to want the calls to
have no side-effect, but attribute pure or const
is not appropriate for sincos (which has no return
value but takes writable pointer args)

"#pragma omp simd" on a loop seems to work but i
could not get unannotated sincos loops to vectorize.

it seems it would be nice if we could add pure/const
somehow (maybe to the simd variant only? afaik openmp
requires no sideeffects for simd variants, but that's
probably only for explicitly marked loops?)
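
A minimal sketch of the kind of annotated loop meant here (assuming glibc's sincos, which needs _GNU_SOURCE to be declared):

#define _GNU_SOURCE           /* for the sincos declaration */
#include <math.h>

void vsincos (double *restrict s, double *restrict c,
              const double *restrict x, int n)
{
  /* with -fopenmp-simd this annotation lets the vectorizer ignore the
     unknown side effects of the call; without it the loop stays scalar */
  #pragma omp simd
  for (int i = 0; i < n; i++)
    sincos (x[i], &s[i], &c[i]);
}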

Re: PPC64 libmvec implementation of sincos

GT
‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
On Monday, September 30, 2019 9:52 AM, Szabolcs Nagy <[hidden email]> wrote:

> On 27/09/2019 20:23, GT wrote:
>
> > I am attempting to create a vector version of sincos for PPC64.
> > The relevant discussion thread is on the GLIBC libc-alpha mailing list.
> > Navigate it beginning at https://sourceware.org/ml/libc-alpha/2019-09/msg00334.html
> > The intention is to reuse as much as possible from the existing GCC implementation of other libmvec functions.
> > My questions are: Which function(s) in GCC;
> >
> > 1.  Gather scalar function input arguments, from multiple loop iterations, into a single vector input argument for the vector function version?
> > 2.  Distribute scalar function outputs, to appropriate loop iteration result, from the single vector function output result?
> >
> > I am referring especially to vectorization of sin and cos.
>
> i wonder if gcc can auto-vectorize scalar sincos
> calls, the vectorizer seems to want the calls to
> have no side-effect, but attribute pure or const
> is not appropriate for sincos (which has no return
> value but takes writable pointer args)

1.  Do you mean whether x86_64 already does auto-vectorize sincos?
2.  Where in the code do you see the vectorizer require no side-effect?

> "#pragma omp simd" on a loop seems to work but i
> could not get unannotated sincos loops to vectorize.
>
> it seems it would be nice if we could add pure/const
> somehow (maybe to the simd variant only? afaik openmp
> requires no sideeffects for simd variants, but that's
> probably only for explicitly marked loops?)

1. Example 1 and Example 2 at https://sourceware.org/glibc/wiki/libmvec show the 2 different
ways to activate auto-vectorization. When you refer to "unannotated sincos", which of
the 2 techniques do you mean?
2. Which function was auto-vectorized by "pragma omp simd" in the loop?

Re: PPC64 libmvec implementation of sincos

Szabolcs Nagy-2
On 30/09/2019 18:30, GT wrote:

> ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
> On Monday, September 30, 2019 9:52 AM, Szabolcs Nagy <[hidden email]> wrote:
>
>> On 27/09/2019 20:23, GT wrote:
>>
>>> I am attempting to create a vector version of sincos for PPC64.
>>> The relevant discussion thread is on the GLIBC libc-alpha mailing list.
>>> Navigate it beginning at https://sourceware.org/ml/libc-alpha/2019-09/msg00334.html
>>> The intention is to reuse as much as possible from the existing GCC implementation of other libmvec functions.
>>> My questions are: Which function(s) in GCC;
>>>
>>> 1.  Gather scalar function input arguments, from multiple loop iterations, into a single vector input argument for the vector function version?
>>> 2.  Distribute scalar function outputs, to appropriate loop iteration result, from the single vector function output result?
>>>
>>> I am referring especially to vectorization of sin and cos.
>>
>> i wonder if gcc can auto-vectorize scalar sincos
>> calls, the vectorizer seems to want the calls to
>> have no side-effect, but attribute pure or const
>> is not appropriate for sincos (which has no return
>> value but takes writable pointer args)
>
> 1.  Do you mean whether x86_64 already does auto-vectorize sincos?

any current target with simd attribute or omp declare simd support.

> 2.  Where in the code do you see the vectorizer require no side-effect?

i don't know where it is in the code, but

__attribute__((simd)) float foo (float);

void bar (float *restrict a, float *restrict b)
{
        for(int i=0; i<4000; i++)
                a[i] = foo (b[i]);
}

is not vectorized, however it gets vectorized if

i add __attribute__((const)) to foo
OR
if i add '#pragma omp simd' to the loop and compile with
-fopenmp-simd.

(which makes sense to me: you don't want to vectorize
if you don't know the side-effects, otoh, there is no
attribute to say that i know there will be no side-effects
in functions taking pointer arguments so i don't see
how sincos can get vectorized)
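
For reference, a self-contained sketch of the two working variants (foo2/bar2 are just renamed copies so both fit in one file):

/* variant 1: additionally promise no side effects, then plain
   -O2 -ftree-vectorize is enough */
__attribute__((simd, const)) float foo (float);

void bar (float *restrict a, float *restrict b)
{
        for(int i=0; i<4000; i++)
                a[i] = foo (b[i]);
}

/* variant 2: keep the original declaration, annotate the loop instead,
   and compile with -fopenmp-simd */
__attribute__((simd)) float foo2 (float);

void bar2 (float *restrict a, float *restrict b)
{
        #pragma omp simd
        for(int i=0; i<4000; i++)
                a[i] = foo2 (b[i]);
}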

>> "#pragma omp simd" on a loop seems to work but i
>> could not get unannotated sincos loops to vectorize.
>>
>> it seems it would be nice if we could add pure/const
>> somehow (maybe to the simd variant only? afaik openmp
>> requires no sideeffects for simd variants, but that's
>> probably only for explicitly marked loops?)
>
> 1. Example 1 and Example 2 at https://sourceware.org/glibc/wiki/libmvec show the 2 different
> ways to activate auto-vectorization. When you refer to "unannotated sincos", which of
> the 2 techniques do you mean?

example 1 annotates the loop with #pragma omp simd.
(and requires -fopenmp-simd cflag to work)

example 2 is my goal where -ftree-vectorize is enough
without annotation.

> 2. Which function was auto-vectorized by "pragma omp simd" in the loop?

see example above.

Re: PPC64 libmvec implementation of sincos

Richard Biener-2
On September 30, 2019 3:52:52 PM GMT+02:00, Szabolcs Nagy <[hidden email]> wrote:

>On 27/09/2019 20:23, GT wrote:
>> I am attempting to create a vector version of sincos for PPC64.
>> The relevant discussion thread is on the GLIBC libc-alpha mailing list.
>> Navigate it beginning at https://sourceware.org/ml/libc-alpha/2019-09/msg00334.html
>>
>> The intention is to reuse as much as possible from the existing GCC implementation of other libmvec functions.
>> My questions are: Which function(s) in GCC;
>>
>> 1. Gather scalar function input arguments, from multiple loop iterations, into a single vector input argument for the vector function version?
>> 2. Distribute scalar function outputs, to appropriate loop iteration result, from the single vector function output result?
>>
>> I am referring especially to vectorization of sin and cos.
>
>i wonder if gcc can auto-vectorize scalar sincos
>calls, the vectorizer seems to want the calls to
>have no side-effect, but attribute pure or const
>is not appropriate for sincos (which has no return
>value but takes writable pointer args)

We have __builtin_cexpi for that, but I am not sure whether any of the mechanisms can provide a mapping to a vectorized variant.
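
(As a reminder of the semantics: cexpi (x) computes cos (x) + i*sin (x), so a sincos loop can be expressed through the builtin roughly as below. This is only a sketch, not the vectorizer's internal representation.)

void f (double *restrict s, double *restrict c,
        const double *restrict x, int n)
{
  for (int i = 0; i < n; i++)
    {
      /* GCC's sincos optimization folds the paired calls into this
         internal builtin */
      _Complex double v = __builtin_cexpi (x[i]);
      s[i] = __imag__ v;   /* sin (x[i]) */
      c[i] = __real__ v;   /* cos (x[i]) */
    }
}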

>"#pragma omp simd" on a loop seems to work but i
>could not get unannotated sincos loops to vectorize.
>
>it seems it would be nice if we could add pure/const
>somehow (maybe to the simd variant only? afaik openmp
>requires no sideeffects for simd variants, but that's
>probably only for explicitly marked loops?)


Re: PPC64 libmvec implementation of sincos

GT
> >
> > i wonder if gcc can auto-vectorize scalar sincos
> > calls, the vectorizer seems to want the calls to
> > have no side-effect, but attribute pure or const
> > is not appropriate for sincos (which has no return
> > value but takes writable pointer args)
>
> We have __builtin_cexpi for that but not sure if any of the mechanisms can provide a mapping to a vectorized variant.
>

1. Using flags -fopt-info-all and -fopt-info-internals, the failure to vectorize sincos
is reported as "unsupported data-type: complex double". The default GCC behavior is to
replace sincos calls with calls to __builtin_cexpi.

2. Using flags -fno-builtin-sincos and -fno-builtin-cexpi, the failure to vectorize
sincos is different. In this case, the failure to vectorize is due to "number of iterations
could not be computed". No calls to __builtin_cexpi; sincos calls retained.

Questions:
1. Should we aim to provide a vectorized version of __builtin_cexpi? If so, it would have
to be a PPC64-only vector __builtin_cexpi, right?

2. Or should we require that vectorized sincos be available only when -fno-builtin-sincos flag
is used in compilation?

I don't think we need to fix both types of vectorization failures in order to obtain sincos
vectorization.

Thanks.
Bert.

Re: PPC64 libmvec implementation of sincos

Richard Biener-2
On Mon, Nov 25, 2019 at 5:53 PM GT <[hidden email]> wrote:

>
> > >
> > > i wonder if gcc can auto-vectorize scalar sincos
> > > calls, the vectorizer seems to want the calls to
> > > have no side-effect, but attribute pure or const
> > > is not appropriate for sincos (which has no return
> > > value but takes writable pointer args)
> >
> > We have __builtin_cexpi for that but not sure if any of the mechanisms can provide a mapping to a vectorized variant.
> >
>
> 1. Using flags -fopt-info-all and -fopt-info-internals, the failure to vectorize sincos
> is reported as "unsupported data-type: complex double". The default GCC behavior is to
> replace sincos calls with calls to __builtin_cexpi.
>
> 2. Using flags -fno-builtin-sincos and -fno-builtin-cexpi, the failure to vectorize
> sincos is different. In this case, the failure to vectorize is due to "number of iterations
> could not be computed". No calls to __builtin_cexpi; sincos calls retained.
>
> Questions:
> 1. Should we aim to provide a vectorized version of __builtin_cexpi? If so, it would have
> to be a PPC64-only vector __builtin-cexpi, right?
>
> 2. Or should we require that vectorized sincos be available only when -fno-builtin-sincos flag
> is used in compilation?
>
> I don't think we need to fix both types of vectorization failures in order to obtain sincos
> vectorization.

I think we should have a vectorized cexpi since that has a sane ABI.
The complex return type of cexpi makes it a little awkward for the vectorizer,
but handling this should be manageable.  It's a bit difficult to expose
complex types to the vectorizer since most cases are lowered early.
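
(Purely illustrative, to show where the awkwardness sits: one conceivable shape for a 2-lane variant splits the complex lanes into separate real/imaginary vectors. Every name below is hypothetical; the actual vector-ABI handling of the complex return is exactly the open question.)

typedef double v2df __attribute__ ((vector_size (16)));

/* hypothetical return layout: real parts (cos) and imaginary parts (sin)
   of the two lanes, kept in separate vector registers */
typedef struct { v2df re; v2df im; } v2dc;

/* hypothetical symbol name */
extern v2dc _ZGVbN2v_cexpi (v2df x);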

Richard.

> Thanks.
> Bert.

Re: PPC64 libmvec implementation of sincos

GT
‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
On Wednesday, November 27, 2019 3:19 AM, Richard Biener <[hidden email]> wrote:

...

> > Questions:
> >
> > 1.  Should we aim to provide a vectorized version of __builtin_cexpi? If so, it would have
> >     to be a PPC64-only vector __builtin-cexpi, right?
> >
> > 2.  Or should we require that vectorized sincos be available only when -fno-builtin-sincos flag
> >     is used in compilation?
> >
> >
> > I don't think we need to fix both types of vectorization failures in order to obtain sincos
> > vectorization.
>
> I think we should have a vectorized cexpi since that has a sane ABI.
> The complex return type of cexpi makes it a little awkward for the vectorizer,
> but handling this should be manageable. It's a bit difficult to expose
> complex types to the vectorizer since most cases are lowered early.
>

I'm trying to identify the source code which needs modification but I need help proceeding.

I am comparing two compilations: The first is a simple file with a call to sin in a loop.
Vectorization succeeds. The second is an almost identical file but with a call to sincos
in the loop. Vectorization fails.

In gdb, the earliest code location where the two compilations differ is in function
number_of_iterations_exit_assumptions in file tree-ssa-loop-niter.c. Line

op0 = gimple_cond_lhs (stmt);

returns a tree which, when analyzed in function instantiate_scev_r (in file tree-scalar-evolution.c),
results in the first branch of the switch being taken for sincos. For sin, the second branch of the
switch is taken.

How can I correlate stmt in the source line above to the relevant line in any dump among those created
using debugging dump option -fdump-tree-all?

Thanks.
Bert.

Re: PPC64 libmvec implementation of sincos

Richard Biener-2
On Wed, Dec 4, 2019 at 9:53 PM GT <[hidden email]> wrote:

>
> ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
> On Wednesday, November 27, 2019 3:19 AM, Richard Biener <[hidden email]> wrote:
>
> ...
>
> > > Questions:
> > >
> > > 1.  Should we aim to provide a vectorized version of __builtin_cexpi? If so, it would have
> > >     to be a PPC64-only vector __builtin-cexpi, right?
> > >
> > > 2.  Or should we require that vectorized sincos be available only when -fno-builtin-sincos flag
> > >     is used in compilation?
> > >
> > >
> > > I don't think we need to fix both types of vectorization failures in order to obtain sincos
> > > vectorization.
> >
> > I think we should have a vectorized cexpi since that has a sane ABI.
> > The complex return type of cexpi makes it a little awkward for the vectorizer,
> > but handling this should be manageable. It's a bit difficult to expose
> > complex types to the vectorizer since most cases are lowered early.
> >
>
> I'm trying to identify the source code which needs modification but I need help proceeding.
>
> I am comparing two compilations: The first is a simple file with a call to sin in a loop.
> Vectorization succeeds. The second is an almost identical file but with a call to sincos
> in the loop. Vectorization fails.
>
> In gdb, the earliest code location where the two compilations differ is in function
> number_of_iterations_exit_assumptions in file tree-ssa-loop-niter.c. Line
>
> op0 = gimple_cond_lhs (stmt);
>
> returns a tree which when analyzed in function instantiate_scev_r (in file tree-scalar-evolution.c)
> results in the first branch of the switch being taken for sincos. For sin, the 2nd branch of the
> switch is taken.
>
> How can I correlate stmt in the source line above to the relevant line in any dump among those created
> using debugging dump option -fdump-tree-all?

grep ;)
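
(Spelled out, assuming the gdb session is attached to cc1 and stopped at that line: GCC's own pretty-printer, debug_gimple_stmt, can print the statement, and the same text can then be searched for in the -fdump-tree-all output. The file name below is a placeholder.)

(gdb) call debug_gimple_stmt (stmt)
$ grep -n "<statement as printed above>" testcase.c.*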

Can you provide a testcase with a simd attribute annotated cexpi that
one can play with?

Richard.

>
> Thanks.
> Bert.

Re: PPC64 libmvec implementation of sincos

GT

‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
On Thursday, December 5, 2019 4:44 AM, Richard Biener <[hidden email]> wrote:

...
...
...

> >
> > I'm trying to identify the source code which needs modification but I need help proceeding.
> > I am comparing two compilations: The first is a simple file with a call to sin in a loop.
> > Vectorization succeeds. The second is an almost identical file but with a call to sincos
> > in the loop. Vectorization fails.
> > In gdb, the earliest code location where the two compilations differ is in function
> > number_of_iterations_exit_assumptions in file tree-ssa-loop-niter.c. Line
> > op0 = gimple_cond_lhs (stmt);
> > returns a tree which when analyzed in function instantiate_scev_r (in file tree-scalar-evolution.c)
> > results in the first branch of the switch being taken for sincos. For sin, the 2nd branch of the
> > switch is taken.
> > How can I correlate stmt in the source line above to the relevant line in any dump among those created
> > using debugging dump option -fdump-tree-all?
>
> grep ;)
>
> Can you provide a testcase with a simd attribute annotated cexpi that
> one can play with?
>

On an x86_64 system, run Example 2 at this link:

sourceware.org/glibc/wiki/libmvec

After verifying vectorization (by finding a name with prefix _ZGV and suffix _sin in a.out), replace
the call to sin by one to sincos. The file should be similar to this:

================

#define _GNU_SOURCE   /* sincos is a GNU extension; needed for its declaration */
#include <math.h>

int N = 3200;
double c[3200];
double b[3200];
double a[3200];

int main (void)
{
  int i;

  for (i = 0; i < N; i += 1)
  {
    sincos (a[i], &b[i], &c[i]);
  }

  return (0);
}

================

In addition to the options shown in Example 2, I passed GCC the flags -fopt-info-all, -fopt-info-internals and
-fdump-tree-all to obtain more verbose messages.
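
(For reference, an invocation along these lines; the Example 2 options are filled in from memory and the file name is a placeholder, so treat it as indicative only:)

gcc sincos-test.c -O2 -ftree-vectorize -ffast-math -lm \
    -fopt-info-all -fopt-info-internals -fdump-tree-all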

That should show vectorization failing for sincos, and diagnostics on the screen indicating reason(s) for
the failure.

Performing the runs on PPC64 requires building both GCC and GLIBC with modifications not yet accepted
into the main development branches of the projects.

Please let me know if you are able to run on x86_64; if not, then perhaps I can push the local GCC
changes to some github repository. GLIBC changes are available at branch tuliom/libmvec of the
development repository.

Bert.