OpenACC support in 4.9

Evgeny Gavrin
Hi, all!

What do you think about support of OpenACC 1.0
(http://www.openacc-standard.org/) in gcc?

We're from Samsung Electronics, and we can possibly allocate one or two
full-time engineers for this task. We can try to implement it for the 4.9
branch by the end of Stage 1.
In my view, it's going to require some organizational help from your
side: a development branch, at least.

Are you interested in getting OpenACC 1.0 support into the 4.9 branch?

--
Thanks,
     Evgeny.

Re: OpenACC support in 4.9

Tobias Burnus
Evgeny Gavrin wrote:
> What do you think about support of OpenACC 1.0
> (http://www.openacc-standard.org/) in gcc?

I like the idea - though, I wonder whether OpenMP 4.0's "target"* would
be the better choice as it looks a bit more flexible and better defined.
(Conceptually, they are very similar; I think the
middle-end/back-end/library part would even be the same.)

Which accelerators do you intend to handle? "Accelerator" is a rather
broad term, covering DSPs, GPUs, Intel's MIC, ...

Tobias

* http://openmp.org/wp/2013/03/openmp-40-rc2/  (GCC supports OpenMP 3.1.
Support for OpenMP 4.0 is currently implemented in GCC's gomp-4_0-branch
- but I don't think anyone will work on OpenMP 4.0's 'target' feature
soon, as enough work on the non-'target' features remains.)

PS: A nonexclusive copyright assignment to the Free Software Foundation
is required for any nontrivial code contribution. See
http://gcc.gnu.org/contribute.html#legal  - After the copyright issues
are sorted out, you can apply for commit rights (write after approval
status), which are sufficient to create a development branch.

Re: OpenACC support in 4.9

Jeff Law
On 05/06/2013 07:41 AM, Tobias Burnus wrote:
> Evgeny Gavrin wrote:
>> What do you think about support of OpenACC 1.0
>> (http://www.openacc-standard.org/) in gcc?
>
> I like the idea - though, I wonder whether OpenMP 4.0's "target"* would
> be the better choice as it looks a bit more flexible and better defined.
> (Conceptually, they are very similar; I think the
> middle-end/back-end/library part would even be the same.)
We're certainly hoping that OpenACC & OpenMP 4 & Cilk+ can share certain
parts of their implementations.  We're already seeing OpenMP 4 and Cilk
starting to converge on some stuff.

In a perfect world, there'd only be one standard for this stuff.  That's
not likely, so I'd be happy with parsing/FE kinds of things being
specific to each standard with everything from initial gimple generation
through the optimizers being shared.  That may not ultimately be
possible, but I think that's the right way to look at the work.

Jeff

Re: OpenACC support in 4.9

Torvald Riegel
In reply to this post by Evgeny Gavrin
On Mon, 2013-05-06 at 16:17 +0400, Evgeny Gavrin wrote:
> What do you think about support of OpenACC 1.0
> (http://www.openacc-standard.org/) in gcc?

Is there a specific reason for targeting 1.0 instead of 2.0 (besides 2.0
still being declared as a draft)?

Also, adding to Tobias' question: Which hardware or existing runtimes do
you intend to target in the implementation?

Torvald


Re: OpenACC support in 4.9

Richard Biener
In reply to this post by Jeff Law
On Mon, May 6, 2013 at 5:16 PM, Jeff Law <[hidden email]> wrote:

> On 05/06/2013 07:41 AM, Tobias Burnus wrote:
>>
>> Evgeny Gavrin wrote:
>>>
>>> What do you think about support of OpenACC 1.0
>>> (http://www.openacc-standard.org/) in gcc?
>>
>>
>> I like the idea - though, I wonder whether OpenMP 4.0's "target"* would
>> be the better choice as it looks a bit more flexible and better defined.
>> (Conceptually, they are very similar; I think the
>> middle-end/back-end/library part would even be the same.)
>
> We're certainly hoping that OpenACC & OpenMP 4 & Cilk+ can share certain
> parts of their implementations.  We're already seeing OpenMP 4 and Cilk
> starting to converge on some stuff.
>
> In a perfect world, there'd only be one standard for this stuff.  That's not
> likely, so I'd be happy with parsing/FE kinds of things being specific to
> each standard with everything from initial gimple generation through the
> optimizers being shared.  That may not ultimately be possible, but I think
> that's the right way to look at the work.

We're going to look at supporting HSA from GCC (which would make it
more or less trivial to also target openCL I think) and also hope to leverage
parts of the GOMP infrastructure for this (GOMP is currently the only
way to annotate parallel regions, apart from autodetecting them).  If Cilk+
and OpenACC provide additional ways of annotating parallel regions then
it would be nice to have the middle-end see only a single consistent
representation of a parallel region.

Richard.


Re: OpenACC support in 4.9

Evgeny Gavrin
In reply to this post by Tobias Burnus
Hi, all!

 > Which accelerators do you intend to handle? "Accelerator" is a rather
 > broad term, covering DSPs, GPUs, Intel's MIC, ...
The idea is to emit OpenCL from high GIMPLE, for now. So, any device
that has OpenCL support can be utilized by OpenACC.
Maybe we'll be able to reuse some parts from the Graphite/OpenGPU
projects, but this is not clear for now.
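For illustration, a simple "#pragma acc kernels" vector-add loop might be
lowered to an OpenCL kernel roughly like this (a hand-written sketch of
the intended output, not actual compiler output; the kernel name is made
up):

__kernel void
acc_kernel_0 (__global float *a, __global const float *b, int n)
{
  int i = get_global_id (0);   /* one work-item per loop iteration */
  if (i < n)                   /* guard against a rounded-up global size */
    a[i] = a[i] + b[i];
}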

 > Is there a specific reason for targeting 1.0 instead of 2.0 (besides 2.0
 > still being declared as a draft)?
You've named the main reason why we're targeting OpenACC 1.0 - it's stable,
and it's a good starting point for the initial implementation. BTW,
OpenACC 2.0 doesn't differ much from the previous version; the major
improvements cover only the runtime library.

 > - but I don't think anyone will work on OpenMP 4.0's 'target' feature
 > soon, as enough work on the non-'target' features remains.
OpenMP's 'target' is definitely inspired by OpenACC. So, I think it'll
be possible to reuse/share most of the back-end (BE) part of the OpenACC
implementation, once it's finished.

 > I like the idea - though, I wonder whether OpenMP 4.0's "target"* would
 > be the better choice as it looks a bit more flexible and better defined.
 > (Conceptually, they are very similar; I think the
 > middle-end/back-end/library part would even be the same.)
OpenMP 4.0 is still a draft. Meanwhile, OpenACC is supported by several
commercial compilers, so our implementation can be compared against them
for achieved performance. In addition, there are some side projects using
OpenACC, which will help to test the implementation on real code.

Thanks,
     Evgeny.


Re: OpenACC support in 4.9

Tobias Burnus
In reply to this post by Richard Biener
Richard Biener wrote:
> We're going to look at supporting HSA from GCC (which would make it
> more or less trivial to also target openCL I think)

For the friends of link-time optimization (LTO):

Unless I missed some fine point in OpenACC and OpenMP's target, they
only work with directives which are locally visible. Thus, if one does a
function call in the device/target section, it can only be placed on the
accelerator if the function can be inlined.

Thus, it would be useful if LTO could be used to inline such functions
into device code. I know one OpenACC code which calls functions in
different translation units (TUs) - and the Cray compiler handles this
via LTO. Thus, it would be great if the HSA/OpenMP target/OpenACC
middle-end infrastructure could do likewise, which also means deferring
the error that an external function cannot be used to the middle-end/LTO
FE instead of emitting it in the language FE. - In the mentioned code,
the called function does not have any OpenACC annotation but only
consists of constructs which are permitted on the accelerator - thus, no
automatic generation of accelerator code happens for that TU.
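To make the cross-TU case concrete, a minimal sketch (file and function
names are made up):

/* file1.c */
extern float sq (float);            /* defined in file2.c, another TU */

void
compute (float *a, int n)
{
#pragma acc kernels
  for (int i = 0; i < n; i++)
    a[i] = sq (a[i]);               /* placeable on the accelerator only
                                       if sq can be inlined, e.g. via LTO */
}

/* file2.c */
float sq (float x) { return x * x; }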

(I just want to mention this to ensure that this kind of LTO/accelerator
inlining is kept in mind when implementing the infrastructure for
HSA/OpenACC/OpenMP target/OpenCL - even if cross-TU inlining is not
supported initially.)

Tobias

Re: OpenACC support in 4.9

Richard Biener
On Tue, May 7, 2013 at 11:02 AM, Tobias Burnus <[hidden email]> wrote:

> Richard Biener wrote:
>>
>> We're going to look at supporting HSA from GCC (which would make it more
>> or less trivial to also target openCL I think)
>
>
> For the friends of link-time optimization (LTO):
>
> Unless I missed some fine point in OpenACC and OpenMP's target, they only
> work with directives which are locally visible. Thus, if one does a function
> call in the device/target section, it can only be placed on the accelerator
> if the function can be inlined.
>
> [...]

In my view we'd get the "regular" OpenMP processing done during omp
lowering/expansion (which happens before LTO) which should mark the
generated worker functions appropriately.  Emitting accelerator code should
then happen at LTRANS time, thus after all IPA inlining took place.  The
interesting bits we can borrow from OMP are basically the marking of functions
that are a) interesting, b) possible to transform.  Unmarked functions / loops
will have to go the autopar way, thus we have to prove via dependence analysis
that executing iterations in parallel is possible.

Richard.


Re: OpenACC support in 4.9

Richard Biener
On Tue, May 7, 2013 at 12:42 PM, Richard Biener
<[hidden email]> wrote:

> [...]
>
> In my view we'd get the "regular" OpenMP processing done during omp
> lowering/expansion (which happens before LTO) which should mark the
> generated worker functions appropriately.  Emitting accelerator code should
> then happen at LTRANS time, thus after all IPA inlining took place.  The
> interesting bits we can borrow from OMP are basically the marking of functions
> that are a) interesting, b) possible to transform.  Unmarked functions / loops
> will have to go the autopar way, thus we have to prove via dependence analysis
> that executing iterations in parallel is possible.

Btw, we plan to re-use the GOMP runtime as otherwise any synchronisation
between accelerator code and regular thread code is impossible.  Which
means changing the GOMP runtime in a way to be able to pass a descriptor
which eventually has accelerator code (and a fallback regular function so
you can disable accelerator usage at runtime).

Richard.


Re: OpenACC support in 4.9

Jakub Jelinek
In reply to this post by Tobias Burnus
On Tue, May 07, 2013 at 11:02:08AM +0200, Tobias Burnus wrote:

> Richard Biener wrote:
> >We're going to look at supporting HSA from GCC (which would make
> >it more or less trivial to also target openCL I think)
>
> For the friends of link-time optimization (LTO):
>
> Unless I missed some fine point in OpenACC and OpenMP's target, they
> only work with directives which are locally visible. Thus, if one
> does a function call in the device/target section, it can only be
> placed on the accelerator if the function can be inlined.

No, OpenMP 4.0 has
#pragma omp declare target
...
#pragma omp end declare target
where you can define/declare functions and variables in that ... and those
are all marked for cloning for the target device (guess parsing of
the above construct is going to just add "omp declare target" attribute
to all those variables/functions and we'd then just clone the functions
and map the variables into the target code).
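For instance (a minimal sketch; the function and variable names are
invented):

#pragma omp declare target
int coeff;                               /* mapped to the device */
int scale (int x) { return x * coeff; }  /* cloned for the device */
#pragma omp end declare target

void
run (int *v, int n)
{
#pragma omp target map(tofrom: v[0:n])
  for (int i = 0; i < n; i++)
    v[i] = scale (v[i]);
}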

The target code can perhaps be done by streaming LTO IL for the target
device into special sections during normal compilation; a linker plugin
invoked for -fopenmp could then collect those, if any, compile the code
with a different backend, and link both the host program and whatever
is needed for the target device.

        Jakub

Re: OpenACC support in 4.9

Torvald Riegel
In reply to this post by Evgeny Gavrin
On Tue, 2013-05-07 at 13:00 +0400, Evgeny Gavrin wrote:
> Hi, all!
>
>  > Which accelerators do you intend to handle? "Accelerator" is a rather
>  > broad term, covering DSPs, GPUs, Intel's MIC, ...
> The idea is to emit OpenCL from high GIMPLE, for now. So, any device
> that has OpenCL support can be utilized by OpenACC.
> Maybe we'll be able to reuse some parts from the Graphite/OpenGPU
> projects, but this is not clear for now.

I don't disagree that this could be useful for a proof-of-concept, but
I'm wondering whether this is really useful to our users in the long
term.  We don't have any OpenCL implementation in GCC, so if we'd use
OpenCL below OpenACC, we'd bring in dependencies on the OpenCL
implementation, and GCC's OpenACC support would be like a
close-to-frontend translation layer to an OpenCL implementation (with
probably a glue library component too).

Also, if the representation that we eventually want to have in GCC for
accelerators isn't quite like OpenCL, then we'd be translating OpenACC
to this representation back to OpenCL.

Perhaps the HSA efforts that Richard mentioned could be a useful
lower-level target too?  Samsung is listed as an HSA founding member;
are you involved with HSA?

>  > Is there a specific reason for targeting 1.0 instead of 2.0 (besides 2.0
>  > still being declared as a draft)?
> You've named the main reason why we're targeting OpenACC 1.0 - it's stable,
> and it's a good starting point for the initial implementation. BTW,
> OpenACC 2.0 doesn't differ much from the previous version; the major
> improvements cover only the runtime library.

True, most of the differences seem to be in the Data API and such.
However, when I last looked at the list of proposed changes, there was a
proposal to allow calling non-inlined code.  It seems that this
would need to be part of general accelerator support (unless you'd want
to *require* inlining and LTO).  Vectorization features such as Cilk
also have something similar ("elemental functions"), so there seems to
be a need for this.

BTW, what's the state of OpenACC in general?  OpenACC 1.0 has been
released, but I had a few open questions after reading it a while ago.
Is 1.0 supposed to be a first draft, or indeed something that's expected
to be stable and useful for a long time?

>  > - but I don't think anyone will work on OpenMP 4.0's 'target' feature
>  > soon, as enough work on the non-'target' features remains.
> OpenMP's 'target' is definitely inspired by OpenACC. So, I think it'll
> be possible to reuse/share most of the back-end (BE) part of the OpenACC
> implementation, once it's finished.

I'd agree with Jeff regarding this aspect: Even if we can't get a single
language-level standard for this in practice, we should try hard to have
somewhat compatible semantics in the different standards, and have an
internal representation in GCC that can represent those semantics.  It
seems unlikely that we can or want to support several incompatible
semantics, or have several different sets of middle-end IRs for
accelerators.  Thus, for GCC, I believe that the semantics that we want
to support are an important open question; those will also be affected
by the targets that we want to support (whether hardware or something like
OpenCL).

Torvald


Re: OpenACC support in 4.9

Torvald Riegel
In reply to this post by Jakub Jelinek
On Tue, 2013-05-07 at 17:34 +0200, Jakub Jelinek wrote:

> [...]
>
> No, OpenMP 4.0 has
> #pragma omp declare target
> ...
> #pragma omp end declare target
> where you can define/declare functions and variables in that ... and those
> are all marked for cloning for the target device (I guess parsing of
> the above construct is going to just add "omp declare target" attribute
> to all those variables/functions and we'd then just clone the functions
> and map the variables into the target code).

Additional examples of such "special" functions are (1) the OpenMP /
Cilk+ SIMD functions (aka "elemental functions"), for which programmers
make assertions such as that they can do with weaker forward-progress
guarantees (e.g., no locks in SIMD code), and (2) transaction-safe
functions for TM (but there the programmer doesn't make an assertion,
but a requirement).  Both might or might not need special code being
generated.  OpenACC 2.0 also proposes a similar feature (but the
description didn't seem like a finished spec back when I read it).
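A minimal sketch of such a SIMD function in OpenMP 4.0 RC syntax (the
function itself is made up):

#pragma omp declare simd
float ef (float x)     /* vector clones are generated alongside the
                          scalar version; no locks allowed inside */
{
  return x * x + 1.0f;
}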

So, this isn't just about accelerators.


Torvald


Re: OpenACC support in 4.9

Torvald Riegel
In reply to this post by Richard Biener
On Tue, 2013-05-07 at 10:27 +0200, Richard Biener wrote:

> [...]
>
> We're going to look at supporting HSA from GCC

Could you elaborate on those plans?

>  (which would make it
> more or less trivial to also target openCL I think) and also hope to leverage
> parts of the GOMP infrastructure for this

Are you thinking about leveraging the compiler side of GOMP, or libgomp?
I can see reasons for the former, but I'm not sure the latter is the
best approach to support for HSA.

> (GOMP is currently the only
> way to annotate parallel regions, apart from autodetecting them).  If Cilk+
> and OpenACC provide additional ways of annotating parallel regions then
> it would be nice to have the middle-end see only a single consistent
> representation of a parallel region.

I agree that having one way of annotating parallel regions or tasks in
code would be useful.  There's also the TM infrastructure, which isn't
about parallelism but very much about annotated regions with additional
constraints on code in the regions, etc.; so it might perhaps be useful
too.  I believe it's the latter that's important here (and HW
heterogeneity), not whether we want to execute them in parallel or not
(i.e., you don't need language constructs to support parallel
execution...).


Torvald



Re: OpenACC support in 4.9

Torvald Riegel
In reply to this post by Richard Biener
On Tue, 2013-05-07 at 12:46 +0200, Richard Biener wrote:

> [...]
>
> Btw, we plan to re-use the GOMP runtime as otherwise any synchronisation
> between accelerator code and regular thread code is impossible.

I can't follow this line of reasoning.  Can you elaborate?  Which kind
of synchronization are you referring to?

As far as parallel execution and resource management is concerned,
libgomp has just the kinds of scheduler that you need in the OpenMP rule
set.  Work-stealing schedulers such as Cilk's are others, and might
actually become the more common approach.  And there are other thread
pools that programs might use; e.g., there's lots of discussion about
all this in ISO C++ study group 1 on parallelism and concurrency, and
several different proposals.

With that in mind, I'm wondering whether the cooperative scheduling that
we likely need should be at a lower level than libgomp or the Cilk
runtime.  Otherwise, libgomp needs to become the scheduler that runs
them all (that is, if you want it to work well when combined with other
abstractions for parallelism), and I'm not sure whether that's the right
approach.

> Which
> means changing the GOMP runtime in a way to be able to pass a descriptor
> which eventually has accelerator code (and a fallback regular function so
> you can disable accelerator usage at runtime).

It probably should be a list of different codes -- you might have more
than one suitable accelerator available.

BTW: What about putting this topic on the Cauldron agenda?  Is there
still time available to discuss what GCC might do regarding accelerators
and HW heterogeneity?


Torvald


Re: OpenACC support in 4.9

Richard Biener
In reply to this post by Torvald Riegel
On Wed, May 8, 2013 at 10:12 PM, Torvald Riegel <[hidden email]> wrote:

> On Tue, 2013-05-07 at 10:27 +0200, Richard Biener wrote:
>> [...]
>>
>> We're going to look at supporting HSA from GCC
>
> Could you elaborate on those plans?

The ultimate goal is to transparently offload computations to HSA, thus
have GCC generate HSAIL/BRIG code from GIMPLE and dispatch to the
HSA runtime.

>>  (which would make it
>> more or less trivial to also target openCL I think) and also hope to leverage
>> parts of the GOMP infrastructure for this
>
> Are you thinking about leveraging the compiler side of GOMP, or libgomp?
> I can see reasons for the former, but I'm not sure the latter is the
> best approach to support for HSA.

The plan is to first rely on OMP annotations to identify possible code regions
that can be offloaded.  As not all OMP-annotated regions will be offloadable,
there will be a mix of HSA and OMP thread code running at the same time,
which means there has to be synchronization between the two if you
consider, for example:

#omp parallel
  for (;;)
    ... HSA possible ...

#omp parallel
  for (;;)
    ... HSA not possible ...

#omp barrier

and you don't want to nest HSA inside OMP implementation-wise.  Thus
the idea is to teach libgomp that there is, besides threads, also HSA
machinery (it's an idea on paper, we'll see how that can work out).

>> (GOMP is currently the only
>> way to annotate parallel regions, apart from autodetecting them).  If Cilk+
>> and OpenACC provide additional ways of annotating parallel regions then
>> it would be nice to have the middle-end see only a single consistent
>> representation of a parallel region.
>
> I agree that having one way of annotating parallel regions or tasks in
> code would be useful.  There's also the TM infrastructure, which isn't
> about parallelism but very much about annotated regions with additional
> constraints on code in the regions, etc.; so it might perhaps be useful
> too.  I believe it's the latter that's important here (and HW
> heterogeneity), not whether we want to execute them in parallel or not
> (i.e., you don't need language constructs to support parallel
> execution...).

There are several pieces to the picture here - first is source-level
annotation, where we have multiple variants (OMP, Cilk+, ...?); second is
annotation on the GIMPLE level, where we should have only one; third is
code generation, where we will end up supporting many (HSA, openCL, ...);
and fourth is the runtime, where having multiple ones will create the
issue of mixed operation (from a single source-level annotation), so I'd
prefer to have a single runtime targeted by GCC (but the runtime itself
eventually dispatching to HW-specific runtimes).

Richard.


Re: OpenACC support in 4.9

Richard Biener
In reply to this post by Torvald Riegel
On Wed, May 8, 2013 at 10:25 PM, Torvald Riegel <[hidden email]> wrote:

> On Tue, 2013-05-07 at 12:46 +0200, Richard Biener wrote:
>> [...]
>>
>> Btw, we plan to re-use the GOMP runtime as otherwise any synchronisation
>> between accelerator code and regular thread code is impossible.
>
> I can't follow this line of reasoning.  Can you elaborate?  Which kind
> of synchronization are you referring to?
>
> As far as parallel execution and resource management is concerned,
> libgomp has just the kinds of scheduler that you need in the OpenMP rule
> set.  Work-stealing schedulers such as Cilk's are others, and might
> actually become the more common approach.  And there are other thread
> pools that programs might use; e.g., there's lots of discussion about
> all this in ISO C++ study group 1 on parallelism and concurrency, and
> several different proposals.
>
> With that in mind, I'm wondering whether the cooperative scheduling that
> we likely need should be at a lower level than libgomp or the Cilk
> runtime.  Otherwise, libgomp needs to become the scheduler that runs
> them all (that is, if you want it to work well when combined with other
> abstractions for parallelism), and I'm not sure whether that's the right
> approach.

See my other mail.

>> Which
>> means changing the GOMP runtime in a way to be able to pass a descriptor
>> which eventually has accelerator code (and a fallback regular function so
>> you can disable accelerator usage at runtime).
>
> It probably should be a list of different codes -- you might have more
> than one suitable accelerator available.

Of course.  And the descriptor should be versioned to avoid future ABI
changes.  Note that I'd always generate code for the CPU as fallback.
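As a rough sketch of what such a descriptor could look like (all names
invented; nothing like this exists in libgomp today):

/* Hypothetical descriptor the compiler would hand to the runtime.  */
struct gomp_target_desc
{
  unsigned int version;            /* bumped on ABI changes */
  void (*host_fallback) (void *);  /* always present: plain CPU code */
  unsigned int num_variants;       /* entries in variants[] below */
  struct gomp_target_variant
  {
    int device_kind;               /* e.g. HSA, OpenCL, MIC */
    const void *code;              /* device-specific code blob */
  } variants[];                    /* one entry per suitable accelerator */
};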

> BTW: What about putting this topic on the Cauldron agenda?  Is there
> still time available to discuss what GCC might do regarding accelerators
> and HW heterogeneity?

I am not able to attend, but certainly the topic is interesting.

Richard.


Re: OpenACC support in 4.9

Jakub Jelinek
On Fri, May 10, 2013 at 11:00:29AM +0200, Richard Biener wrote:

> >> Which
> >> means changing the GOMP runtime in a way to be able to pass a descriptor
> >> which eventually has accelerator code (and a fallback regular function so
> >> you can disable accelerator usage at runtime).
> >
> > It probably should be a list of different codes -- you might have more
> > than one suitable accelerator available.
>
> Of course.  And the descriptor should be versioned to avoid future ABI
> changes.  Note that I'd always generate code for the CPU as fallback.

If one uses the OpenMP 4.0 accelerator pragmas, then that is the required
behavior: if the code is for whatever reason not possible to run on the
accelerator, it should be executed on the host (and the vars can in that
case be shared between the standard host code and the code targeted for
the accelerator but actually executed on the host).  Otherwise, the OpenMP
runtime as well as the pragmas have a way to choose which accelerator you
want to run something on, as a device id (integer), so the OpenMP runtime
library should maintain the list of supported accelerators (say, if you
have two Intel MIC cards and two AMD GPGPU devices), and probably we'll
need a compiler switch to say which kinds of accelerators we want to
generate code for; plus, the runtime could have dlopened plugins for each
of the accelerator kinds.
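For example (a sketch; device numbering is implementation-defined):

#include <omp.h>

void
offload (float *a, int n)
{
  if (omp_get_num_devices () > 0)   /* accelerators known to the runtime */
    {
#pragma omp target device(0) map(tofrom: a[0:n])  /* pick one by device id */
      for (int i = 0; i < n; i++)
        a[i] *= 2.0f;
    }
  else
    for (int i = 0; i < n; i++)     /* host fallback */
      a[i] *= 2.0f;
}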

> > BTW: What about putting this topic on the Cauldron agenda?  Is there
> > still time available to discuss what GCC might do regarding accelerators
> > and HW heterogeneity?

Yeah, it would be nice to discuss that.  By that time I hope we'll be at
least through parsing of the accelerator stuff, so we can then design what
to do with that info.

        Jakub

Re: OpenACC support in 4.9

Tobias Burnus
Jakub Jelinek wrote:
[Fallback generation of CPU code]
> If one uses the OpenMP 4.0 accelerator pragmas, then that is the required
> behavior: if the code is for whatever reason not possible to run on the
> accelerator, it should be executed on the host [...]
(I haven't checked, but is this a compile-time or run-time requirement?)

> Otherwise, the OpenMP runtime as well as the pragmas have a way to choose
> which accelerator you want to run something on, as a device id (integer),
> so the OpenMP runtime library should maintain the list of supported
> accelerators (say, if you have two Intel MIC cards and two AMD GPGPU
> devices), and probably we'll need a compiler switch to say which kinds of
> accelerators we want to generate code for; plus, the runtime could have
> dlopened plugins for each of the accelerator kinds.

At least two OpenACC implementations I know of fail hard when the GPU is
not available (nonexistent, or the /dev/... node doesn't have the right
permissions). And three of them fail at compile time with an error
message if an expression within a device section is not possible (e.g.
calling some non-device/non-inlinable function).

While it is convenient to have a CPU fallback, it would be nice to know
whether some code actually uses the accelerator - both at compile time
and at run time. Otherwise, one thinks the GPU is used - without
realizing that it isn't because, e.g., the device permissions are wrong -
or one forgot to declare a certain function as a target function.

Besides a flag which tells the compiler for which accelerator the code
should be generated, additional flags should be handled as well, e.g. for
different versions of the accelerator. For instance, one accelerator
model of the same series might support double-precision variables while
another might not. - I assume that falling back to the CPU if the
accelerator doesn't support a certain feature won't work, and one will
get an error in this case.


Is there actually a need to handle multiple accelerators
simultaneously? My impression is that both OpenACC and OpenMP 4 assume
that there is only one kind of accelerator available besides the host.
If I missed some fine print, or something else requires that there be
multiple different accelerators, it will get more complicated -
especially for those code sections where the user didn't explicitly
specify which one should be used.


Finally, one should think about debugging. It is not really clear (to
me) how to handle this best, but as the compiler generates quite some
additional code (e.g. for copying the data around) and as printf
debugging doesn't work on GPUs, it is not that easy. I wonder whether
there should be an optional library like libgomp_debug which adds
additional sanity checks (e.g. related to copying data to/from the GPU)
and which allows one to print diagnostic output when one sets an
environment variable.

Tobias

Re: OpenACC support in 4.9

Dinar Temirbulatov
Another interesting use-case for OpenACC and OpenMP is mixing both
standards' annotations for the same loop:
 // Compute matrix multiplication.
#pragma omp parallel for default(none) shared(A,B,C,size)
#pragma acc kernels pcopyin(A[0:size][0:size],B[0:size][0:size]) \
  pcopyout(C[0:size][0:size])
  for (int i = 0; i < size; ++i) {
    for (int j = 0; j < size; ++j) {
      float tmp = 0.;
      for (int k = 0; k < size; ++k) {
        tmp += A[i][k] * B[k][j];
      }
      C[i][j] = tmp;
    }
  }
This means that OpenACC pragmas should be parsed before the OpenMP pass
(in case both standards are enabled), before the OpenMP pass changes the
annotated GIMPLE statements irrecoverably. In my view this use-case
could be handled, for example, in this way:
We could add some temporary variable, for example
"expand_gimple_with_openmp", and change the example above to something
like this just before the OpenMP pass:


if (expand_gimple_with_openmp)
  {
#pragma omp parallel for default(none) shared(A,B,C,size)
    for (int i = 0; i < size; ++i) {
      for (int j = 0; j < size; ++j) {
        float tmp = 0.;
        for (int k = 0; k < size; ++k) {
          tmp += A[i][k] * B[k][j];
        }
        C[i][j] = tmp;
      }
    }
  }
else
  {
#pragma acc kernels pcopyin(A[0:size][0:size],B[0:size][0:size]) \
  pcopyout(C[0:size][0:size])
    for (int i = 0; i < size; ++i) {
      for (int j = 0; j < size; ++j) {
        float tmp = 0.;
        for (int k = 0; k < size; ++k) {
          tmp += A[i][k] * B[k][j];
        }
        C[i][j] = tmp;
      }
    }
  }
Later, at the Graphite pass, we could recognize that the statement is a
SCoP and produce a kernel for it; then we could assume that the
expand_gimple_with_openmp heuristic is false and eliminate the OpenMP
version of the loop, or vice versa. But we have to make sure that
optimization passes do not change our OpenACC GIMPLE so that it becomes
unparallelizable.
                               thanks, Dinar.


Re: OpenACC support in 4.9

Richard Biener
In reply to this post by Tobias Burnus
Dinar Temirbulatov <[hidden email]> wrote:

> Another interesting use-case for OpenACC and OpenMP is mixing both
> standards' annotations for the same loop: [...]

No, the point is that we want a middle-end annotation that covers both at the same time.
Otherwise factoring in others will quickly get unmanageable.

Richard.
