[PATCH 0/6, OpenACC, libgomp] Async re-work

classic Classic list List threaded Threaded
10 messages Options
Reply | Threaded
Open this post in threaded view
|

[PATCH 0/6, OpenACC, libgomp] Async re-work

Chung-Lin Tang-5
This patch is a re-organization of OpenACC asynchronous queues. The previous style of implementation
was essentially re-defining the entire async API inside the plugin-interface, and relaying all such
API calls to the target plugin, which is awkward in design; it requires (each) target plugin to
essentially re-implement large portions of the async functionality to support OpenACC, and the
way it uses a state-setting style to "select/de-select" asynchronous queues for operations litters
a lot of code paths.

The new design proposed here in this patch declares a "struct goacc_asyncqueue*" opaque type in libgomp.h,
and re-defines the plugin interface to a few operations (e.g. construct/destruct/test/synchronize/etc.)
on this async-queue type, all details are target-dependent inside the specific plugin/plugin-<target>.c file.

Also included in this patch is the code for the acc_get/set_default_async API functions in OpenACC 2.5.
It's a minor part of this patch, but since some code was merge together, I'm submitting it together here.

Testing has been done with offloading enabled. The results are mostly okay, but with a few issues
with either yet incomplete submission of our testsuite adjustment patches, or other independent problems.
Seeking permission to commit this to trunk first.

Thanks,
Chung-Lin

2018-09-25  Chung-Lin Tang  <[hidden email]>

        include/
        * gomp-constants.h (GOMP_ASYNC_DEFAULT): Define.
        (GOMP_VERSION): Increment for new plugin interface changes.

        libgomp/
        * libgomp-plugin.h (struct goacc_asyncqueue): Declare.
        (struct goacc_asyncqueue_list): Likewise.
        (goacc_aq): Likewise.
        (goacc_aq_list): Likewise.
        (GOMP_OFFLOAD_openacc_register_async_cleanup): Remove.
        (GOMP_OFFLOAD_openacc_async_test): Remove.
        (GOMP_OFFLOAD_openacc_async_test_all): Remove.
        (GOMP_OFFLOAD_openacc_async_wait): Remove.
        (GOMP_OFFLOAD_openacc_async_wait_async): Remove.
        (GOMP_OFFLOAD_openacc_async_wait_all): Remove.
        (GOMP_OFFLOAD_openacc_async_wait_all_async): Remove.
        (GOMP_OFFLOAD_openacc_async_set_async): Remove.
        (GOMP_OFFLOAD_openacc_exec): Adjust declaration.
        (GOMP_OFFLOAD_openacc_cuda_get_stream): Likewise.
        (GOMP_OFFLOAD_openacc_cuda_set_stream): Likewise.

        (GOMP_OFFLOAD_openacc_async_exec): Declare.
        (GOMP_OFFLOAD_openacc_async_construct): Declare.
        (GOMP_OFFLOAD_openacc_async_destruct): Declare.
        (GOMP_OFFLOAD_openacc_async_test): Declare.
        (GOMP_OFFLOAD_openacc_async_synchronize): Declare.
        (GOMP_OFFLOAD_openacc_async_serialize): Declare.
        (GOMP_OFFLOAD_openacc_async_queue_callback): Declare.
        (GOMP_OFFLOAD_openacc_async_host2dev): Declare.
        (GOMP_OFFLOAD_openacc_async_dev2host): Declare.

        * libgomp.h (struct acc_dispatch_t): Define 'async' sub-struct.
        (gomp_acc_insert_pointer): Adjust declaration.
        (gomp_copy_host2dev): New declaration.
        (gomp_copy_dev2host): Likewise.
        (gomp_map_vars_async): Likewise.
        (gomp_unmap_tgt): Likewise.
        (gomp_unmap_vars_async): Likewise.
        (gomp_fini_device): Likewise.

        * libgomp.map (OACC_2.5): Add acc_get_default_async,
        acc_get_default_async_h_, acc_set_default_async, and
        acc_set_default_async_h_.
        (GOMP_PLUGIN_1.0): Remove GOMP_PLUGIN_async_unmap_vars.

        * oacc-async.c (get_goacc_thread): New function.
        (get_goacc_thread_device): New function.
        (lookup_goacc_asyncqueue): New function.
        (get_goacc_asyncqueue): New function.
        (acc_async_test): Adjust code to use new async design.
        (acc_async_test_all): Likewise.
        (acc_wait): Likewise.
        (acc_wait_async): Likewise.
        (acc_wait_all): Likewise.
        (acc_wait_all_async): Likewise.
        (acc_get_default_async): New API function.
        (acc_set_default_async): Likewise.
        (goacc_async_unmap_tgt): New function.
        (goacc_async_copyout_unmap_vars): Likewise.
        (goacc_async_free): Likewise.
        (goacc_init_asyncqueues): Likewise.
        (goacc_fini_asyncqueues): Likewise.
        * oacc-cuda.c (acc_get_cuda_stream): Adjust code to use new async
        design.
        (acc_set_cuda_stream): Likewise.
        * oacc-host.c (host_openacc_exec): Adjust parameters, remove 'async'.
        (host_openacc_register_async_cleanup): Remove.
        (host_openacc_async_exec): New function.
        (host_openacc_async_test): Adjust parameters.
        (host_openacc_async_test_all): Remove.
        (host_openacc_async_wait): Remove.
        (host_openacc_async_wait_async): Remove.
        (host_openacc_async_wait_all): Remove.
        (host_openacc_async_wait_all_async): Remove.
        (host_openacc_async_set_async): Remove.
        (host_openacc_async_synchronize): New function.
        (host_openacc_async_serialize): New function.
        (host_openacc_async_host2dev): New function.
        (host_openacc_async_dev2host): New function.
        (host_openacc_async_queue_callback): New function.
        (host_openacc_async_construct): New function.
        (host_openacc_async_destruct): New function.
        (struct gomp_device_descr host_dispatch): Remove initialization of old
        interface, add intialization of new async sub-struct.
        * oacc-init.c (acc_shutdown_1): Adjust to use gomp_fini_device.
        (goacc_attach_host_thread_to_device): Remove old async code usage, add
        initialization of per-thread default_async.
        * oacc-int.h (struct goacc_thread): Add default_async field.
        (goacc_init_asyncqueues): New declaration.
        (goacc_fini_asyncqueues): Likewise.
        (goacc_async_copyout_unmap_vars): Likewise.
        (goacc_async_free): Likewise.
        (get_goacc_asyncqueue): Likewise.
        (lookup_goacc_asyncqueue): Likewise.

        * oacc-mem.c (memcpy_tofrom_device): Adjust code to use new async
        design.
        (present_create_copy): Adjust code to use new async design.
        (delete_copyout): Likewise.
        (update_dev_host): Likewise.
        (gomp_acc_insert_pointer): Add async parameter, adjust code to use new
        async design.
        (gomp_acc_remove_pointer): Adjust code to use new async design.
        * oacc-parallel.c (GOACC_parallel_keyed): Adjust code to use new async
        design, adjust profiling bits, interpret launch op as signed 16-bit
        field.
        (GOACC_enter_exit_data): Handle -1 as waits num, adjust code to use new
        async design.
        (goacc_wait): Adjust code to use new async design.
        (GOACC_update): Likewise.
        * oacc-plugin.c (GOMP_PLUGIN_async_unmap_vars): Remove.

        * target.c (goacc_device_copy_async): New function.
        (gomp_copy_host2dev): Remove 'static', add goacc_asyncqueue parameter,
        add goacc_device_copy_async case.
        (gomp_copy_dev2host): Likewise.
        (gomp_map_vars_existing): Add goacc_asyncqueue parameter, adjust code.
        (gomp_map_pointer): Likewise.
        (gomp_map_fields_existing): Likewise.
        (gomp_map_vars): Add function for compatiblity.
        (gomp_map_vars_async): Adapt from gomp_map_vars, add goacc_asyncqueue
        parameter.
        (gomp_unmap_tgt): Remove statis, add attribute_hidden.
        (gomp_unmap_vars): Add function for compatiblity.
        (gomp_unmap_vars_async): Adapt from gomp_unmap_vars, add
        goacc_asyncqueue parameter.
        (gomp_fini_device): New function.
        (gomp_exit_data): Adjust gomp_copy_dev2host call.
        (gomp_load_plugin_for_device): Remove old interface, adjust to load
        new async interface.
        (gomp_target_fini): Adjust code to call gomp_fini_device.

        * plugin/plugin-nvptx.c (struct cuda_map): Remove.
        (struct ptx_stream): Remove.
        (struct nvptx_thread): Remove current_stream field.
        (cuda_map_create): Remove.
        (cuda_map_destroy): Remove.
        (map_init): Remove.
        (map_fini): Remove.
        (map_pop): Remove.
        (map_push): Remove.
        (struct goacc_asyncqueue): Define.
        (struct nvptx_callback): Define.
        (struct ptx_free_block): Define.
        (struct ptx_device): Remove null_stream, active_streams, async_streams,
        stream_lock, and next fields.
        (enum ptx_event_type): Remove.
        (struct ptx_event): Remove.
        (ptx_event_lock): Remove.
        (ptx_events): Remove.
        (init_streams_for_device): Remove.
        (fini_streams_for_device): Remove.
        (select_stream_for_async): Remove.
        (nvptx_init): Remove ptx_events and ptx_event_lock references.
        (nvptx_attach_host_thread_to_device): Remove CUDA_ERROR_NOT_PERMITTED
        case.
        (nvptx_open_device): Add free_blocks initialization, remove
        init_streams_for_device call.
        (nvptx_close_device): Remove fini_streams_for_device call, add
        free_blocks destruct code.
        (event_gc): Remove.
        (event_add): Remove.
        (nvptx_exec): Adjust parameters and code.
        (nvptx_free): Likewise.
        (nvptx_host2dev): Remove.
        (nvptx_dev2host): Remove.
        (nvptx_set_async): Remove.
        (nvptx_async_test): Remove.
        (nvptx_async_test_all): Remove.
        (nvptx_wait): Remove.
        (nvptx_wait_async): Remove.
        (nvptx_wait_all): Remove.
        (nvptx_wait_all_async): Remove.
        (nvptx_get_cuda_stream): Remove.
        (nvptx_set_cuda_stream): Remove.
        (GOMP_OFFLOAD_alloc): Adjust code.
        (GOMP_OFFLOAD_free): Likewise.
        (GOMP_OFFLOAD_openacc_register_async_cleanup): Remove.
        (GOMP_OFFLOAD_openacc_exec): Adjust parameters and code.
        (GOMP_OFFLOAD_openacc_async_test_all): Remove.
        (GOMP_OFFLOAD_openacc_async_wait): Remove.
        (GOMP_OFFLOAD_openacc_async_wait_async): Remove.
        (GOMP_OFFLOAD_openacc_async_wait_all): Remove.
        (GOMP_OFFLOAD_openacc_async_wait_all_async): Remove.
        (GOMP_OFFLOAD_openacc_async_set_async): Remove.
        (cuda_free_argmem): New function.
        (GOMP_OFFLOAD_openacc_async_exec): New plugin hook function.
        (GOMP_OFFLOAD_openacc_create_thread_data): Adjust code.
        (GOMP_OFFLOAD_openacc_cuda_get_stream): Adjust code.
        (GOMP_OFFLOAD_openacc_cuda_set_stream): Adjust code.
        (GOMP_OFFLOAD_openacc_async_construct): New plugin hook function.
        (GOMP_OFFLOAD_openacc_async_destruct): New plugin hook function.
        (GOMP_OFFLOAD_openacc_async_test): Remove and re-implement.
        (GOMP_OFFLOAD_openacc_async_synchronize): New plugin hook function.
        (GOMP_OFFLOAD_openacc_async_serialize): New plugin hook function.
        (GOMP_OFFLOAD_openacc_async_queue_callback): New plugin hook function.
        (cuda_callback_wrapper): New function.
        (cuda_memcpy_sanity_check): New function.
        (GOMP_OFFLOAD_host2dev): Remove and re-implement.
        (GOMP_OFFLOAD_dev2host): Remove and re-implement.
        (GOMP_OFFLOAD_openacc_async_host2dev): New plugin hook function.
        (GOMP_OFFLOAD_openacc_async_dev2host): New plugin hook function.

        * testsuite/libgomp.oacc-c-c++-common/data-2-lib.c: Adjust testcase.
        * testsuite/libgomp.oacc-c-c++-common/data-2.c: Likewise.
        * testsuite/libgomp.oacc-c-c++-common/data-3.c: Likewise.
        * testsuite/libgomp.oacc-c-c++-common/lib-71.c: Likewise.
        * testsuite/libgomp.oacc-c-c++-common/lib-77.c: Likewise.
        * testsuite/libgomp.oacc-c-c++-common/lib-79.c: Likewise.
        * testsuite/libgomp.oacc-c-c++-common/lib-81.c: Likewise.
        * testsuite/libgomp.oacc-fortran/lib-12.f90: Likewise.

Reply | Threaded
Open this post in threaded view
|

Re: [PATCH 0/6, OpenACC, libgomp] Async re-work

Thomas Schwinge-8
Hi Chung-Lin!

On Tue, 25 Sep 2018 21:09:49 +0800, Chung-Lin Tang <[hidden email]> wrote:
> This patch is a re-organization of OpenACC asynchronous queues.

Thanks!

> The previous style of implementation
> was essentially re-defining the entire async API inside the plugin-interface, and relaying all such
> API calls to the target plugin, which is awkward in design; it requires (each) target plugin to
> essentially re-implement large portions of the async functionality to support OpenACC, and the
> way it uses a state-setting style to "select/de-select" asynchronous queues for operations litters
> a lot of code paths.
>
> The new design proposed here in this patch declares a "struct goacc_asyncqueue*" opaque type in libgomp.h,
> and re-defines the plugin interface to a few operations (e.g. construct/destruct/test/synchronize/etc.)
> on this async-queue type, all details are target-dependent inside the specific plugin/plugin-<target>.c file.

Conceptually, ACK.


> Also included in this patch is the code for the acc_get/set_default_async API functions in OpenACC 2.5.
> It's a minor part of this patch, but since some code was merge together, I'm submitting it together here.

As I requested, I'm reviewing those changes separately, and have backed
out those changes in my working copy.


> Testing has been done with offloading enabled. The results are mostly okay, but with a few issues
> with either yet incomplete submission of our testsuite adjustment patches, or other independent problems.

We'll need to understand these.  


> Seeking permission to commit this to trunk first.

A few things will need to be clarified.


For example, for the simple program:

    int main(void)
    {
    #pragma acc parallel async(1)
      ;
    #pragma acc wait
   
      return 0;
    }

..., I'm seeing memory corruption, which (oaccasionally...) shows up as
an abort due to "free" complaining, but also reproduces more reliably
with "valgrind".  It also reproduces on openacc-gcc-8-branch:

    $ valgrind ./a.out
    [...]
    ==26392== Invalid read of size 8
    ==26392==    at 0x4E653B0: goacc_async_unmap_tgt (oacc-async.c:368)
    ==26392==    by 0x5C90901: cuda_callback_wrapper (plugin-nvptx.c:1648)
    ==26392==    by 0x6066B8D: ??? (in /usr/lib/x86_64-linux-gnu/libcuda.so.390.77)
    ==26392==    by 0x607A10F: ??? (in /usr/lib/x86_64-linux-gnu/libcuda.so.390.77)
    ==26392==    by 0x50816DA: start_thread (pthread_create.c:463)
    ==26392==    by 0x53BA88E: clone (clone.S:95)
    ==26392==  Address 0x8d19f50 is 0 bytes inside a block of size 64 free'd
    ==26392==    at 0x4C30D3B: free (vg_replace_malloc.c:530)
    ==26392==    by 0x4E65BEE: goacc_async_copyout_unmap_vars (oacc-async.c:383)
    ==26392==    by 0x4E607C9: GOACC_parallel_keyed_internal (oacc-parallel.c:403)
    ==26392==    by 0x4E60EAA: GOACC_parallel_keyed_v2 (oacc-parallel.c:439)
    ==26392==    by 0x40094F: ??? (in [...]/a.out)
    ==26392==    by 0x52BAB96: (below main) (libc-start.c:310)
    ==26392==  Block was alloc'd at
    ==26392==    at 0x4C2FB0F: malloc (vg_replace_malloc.c:299)
    ==26392==    by 0x4E47538: gomp_malloc (alloc.c:37)
    ==26392==    by 0x4E5AEEB: gomp_map_vars_async (target.c:731)
    ==26392==    by 0x4E60C2B: GOACC_parallel_keyed_internal (oacc-parallel.c:345)
    ==26392==    by 0x4E60EAA: GOACC_parallel_keyed_v2 (oacc-parallel.c:439)
    ==26392==    by 0x40094F: ??? (in [...]/a.out)
    ==26392==    by 0x52BAB96: (below main) (libc-start.c:310)
    [...]

Per my understanding, the problem is that, called from
libgomp/oacc-async.c:goacc_async_copyout_unmap_vars,
libgomp/target.c:gomp_unmap_vars_async runs into:

      if (tgt->list_count == 0)
        {
          free (tgt);
          return;
        }

..., and then goacc_async_copyout_unmap_vars does:

      devicep->openacc.async.queue_callback_func (aq, goacc_async_unmap_tgt,
                                                  (void *) tgt);

..., which will then call libgomp/oacc-async.c:goacc_async_unmap_tgt:

    static void
    goacc_async_unmap_tgt (void *ptr)
    {
      struct target_mem_desc *tgt = (struct target_mem_desc *) ptr;
   
      if (tgt->refcount > 1)
        tgt->refcount--;
      else
        gomp_unmap_tgt (tgt);
    }

..., where the "Invalid read of size 8" happens, and which eventually
would try to "free (tgt)" again, via libgomp/target.c:gomp_unmap_tgt:

    attribute_hidden void
    gomp_unmap_tgt (struct target_mem_desc *tgt)
    {
      /* Deallocate on target the tgt->tgt_start .. tgt->tgt_end region.  */
      if (tgt->tgt_end)
        gomp_free_device_memory (tgt->device_descr, tgt->to_free);
   
      free (tgt->array);
      free (tgt);
    }

Is the "free (tgt)" in libgomp/target.c:gomp_unmap_vars_async wrong, or
something else?


Grüße
 Thomas
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH 0/6, OpenACC, libgomp] Async re-work

Julian Brown-2
On Thu, 6 Dec 2018 21:42:14 +0100
Thomas Schwinge <[hidden email]> wrote:

> [...]
> ..., where the "Invalid read of size 8" happens, and which eventually
> would try to "free (tgt)" again, via libgomp/target.c:gomp_unmap_tgt:
>
>     attribute_hidden void
>     gomp_unmap_tgt (struct target_mem_desc *tgt)
>     {
>       /* Deallocate on target the tgt->tgt_start .. tgt->tgt_end
> region.  */ if (tgt->tgt_end)
>         gomp_free_device_memory (tgt->device_descr, tgt->to_free);
>    
>       free (tgt->array);
>       free (tgt);
>     }
>
> Is the "free (tgt)" in libgomp/target.c:gomp_unmap_vars_async wrong,
> or something else?

It might be worth trying this with the refcounting changes in the
attach/detach patch.

Julian
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH 0/6, OpenACC, libgomp] Async re-work

Julian Brown-2
On Thu, 6 Dec 2018 22:22:46 +0000
Julian Brown <[hidden email]> wrote:

> On Thu, 6 Dec 2018 21:42:14 +0100
> Thomas Schwinge <[hidden email]> wrote:
>
> > [...]
> > ..., where the "Invalid read of size 8" happens, and which
> > eventually would try to "free (tgt)" again, via
> > libgomp/target.c:gomp_unmap_tgt:
> >
> >     attribute_hidden void
> >     gomp_unmap_tgt (struct target_mem_desc *tgt)
> >     {
> >       /* Deallocate on target the tgt->tgt_start .. tgt->tgt_end
> > region.  */ if (tgt->tgt_end)
> >         gomp_free_device_memory (tgt->device_descr, tgt->to_free);
> >    
> >       free (tgt->array);
> >       free (tgt);
> >     }
> >
> > Is the "free (tgt)" in libgomp/target.c:gomp_unmap_vars_async wrong,
> > or something else?  
>
> It might be worth trying this with the refcounting changes in the
> attach/detach patch.

...oh, also make sure you have this patch in the series you're testing
with:

https://gcc.gnu.org/ml/gcc-patches/2018-08/msg01973.html

else your "wait" will be ignored, IIUC.

Julian
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH 0/6, OpenACC, libgomp] Async re-work

Chung-Lin Tang-5
On 2018/12/7 6:26 AM, Julian Brown wrote:

> On Thu, 6 Dec 2018 22:22:46 +0000
> Julian Brown <[hidden email]> wrote:
>
>> On Thu, 6 Dec 2018 21:42:14 +0100
>> Thomas Schwinge <[hidden email]> wrote:
>>
>>> [...]
>>> ..., where the "Invalid read of size 8" happens, and which
>>> eventually would try to "free (tgt)" again, via
>>> libgomp/target.c:gomp_unmap_tgt:
>>>
>>>      attribute_hidden void
>>>      gomp_unmap_tgt (struct target_mem_desc *tgt)
>>>      {
>>>        /* Deallocate on target the tgt->tgt_start .. tgt->tgt_end
>>> region.  */ if (tgt->tgt_end)
>>>          gomp_free_device_memory (tgt->device_descr, tgt->to_free);
>>>      
>>>        free (tgt->array);
>>>        free (tgt);
>>>      }
>>>
>>> Is the "free (tgt)" in libgomp/target.c:gomp_unmap_vars_async wrong,
>>> or something else?
>>
>> It might be worth trying this with the refcounting changes in the
>> attach/detach patch.
>
> ...oh, also make sure you have this patch in the series you're testing
> with:
>
> https://gcc.gnu.org/ml/gcc-patches/2018-08/msg01973.html
>
> else your "wait" will be ignored, IIUC.
>
> Julian

Hi Thomas,
just first asking if you tried Julian's patch during this time, and if so did it do anything different?
(and apologies for missing responding this part for so long :P )

Chung-Lin
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH 0/6, OpenACC, libgomp] Async re-work

Thomas Schwinge-8
Hi!

On Thu, 13 Dec 2018 23:28:49 +0800, Chung-Lin Tang <[hidden email]> wrote:

> On 2018/12/7 6:26 AM, Julian Brown wrote:
> > On Thu, 6 Dec 2018 22:22:46 +0000
> > Julian Brown <[hidden email]> wrote:
> >
> >> On Thu, 6 Dec 2018 21:42:14 +0100
> >> Thomas Schwinge <[hidden email]> wrote:
> >>
> >>> [...]
> >>> ..., where the "Invalid read of size 8" happens, and which
> >>> eventually would try to "free (tgt)" again, via
> >>> libgomp/target.c:gomp_unmap_tgt:
> >>>
> >>>      attribute_hidden void
> >>>      gomp_unmap_tgt (struct target_mem_desc *tgt)
> >>>      {
> >>>        /* Deallocate on target the tgt->tgt_start .. tgt->tgt_end
> >>> region.  */ if (tgt->tgt_end)
> >>>          gomp_free_device_memory (tgt->device_descr, tgt->to_free);
> >>>      
> >>>        free (tgt->array);
> >>>        free (tgt);
> >>>      }
> >>>
> >>> Is the "free (tgt)" in libgomp/target.c:gomp_unmap_vars_async wrong,
> >>> or something else?
> >>
> >> It might be worth trying this with the refcounting changes in the
> >> attach/detach patch.

Well, which exactly?

> > ...oh, also make sure you have this patch in the series you're testing
> > with:
> >
> > https://gcc.gnu.org/ml/gcc-patches/2018-08/msg01973.html
> >
> > else your "wait" will be ignored, IIUC.

Thanks, and right, and yes, I got that one included.

> just first asking if you tried Julian's patch during this time, and if so did it do anything different?

I did not test with all the attach/detach patches on top of this one
here.  That's too many changes at once.


Grüße
 Thomas
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH 0/6, OpenACC, libgomp] Async re-work

Thomas Schwinge-8
In reply to this post by Thomas Schwinge-8
Hi Chung-Lin!

On Thu, 06 Dec 2018 21:42:14 +0100, I wrote:
> On Tue, 25 Sep 2018 21:09:49 +0800, Chung-Lin Tang <[hidden email]> wrote:
> > Also included in this patch is the code for the acc_get/set_default_async API functions in OpenACC 2.5.
> > It's a minor part of this patch, but since some code was merge together, I'm submitting it together here.
>
> As I requested, I'm reviewing those changes separately, and have backed
> out those changes in my working copy.

... as follows:

commit 79b89a5214dc2624a52f0593bbfad5cefed0c025
Author: Thomas Schwinge <[hidden email]>
Date:   Thu Dec 6 15:57:46 2018 +0100

    into async re-work: revert default_async changes
---
 include/gomp-constants.h                           |   1 -
 libgomp/libgomp.map                                |   4 -
 libgomp/oacc-async.c                               |  19 +-
 libgomp/oacc-init.c                                |   2 -
 libgomp/oacc-int.h                                 |   3 -
 libgomp/openacc.f90                                |  22 +-
 libgomp/openacc.h                                  |   3 -
 libgomp/openacc_lib.h                              |  13 -
 .../libgomp.oacc-c-c++-common/asyncwait-2.c        | 904 ---------------------
 9 files changed, 2 insertions(+), 969 deletions(-)

diff --git include/gomp-constants.h include/gomp-constants.h
index acd25851bcc7..1021306ed661 100644
--- include/gomp-constants.h
+++ include/gomp-constants.h
@@ -160,7 +160,6 @@ enum gomp_map_kind
 /* Asynchronous behavior.  Keep in sync with
    libgomp/{openacc.h,openacc.f90,openacc_lib.h}:acc_async_t.  */
 
-#define GOMP_ASYNC_DEFAULT 0
 #define GOMP_ASYNC_NOVAL -1
 #define GOMP_ASYNC_SYNC -2
 
diff --git libgomp/libgomp.map libgomp/libgomp.map
index c5e1b876fccd..d2381da3bf07 100644
--- libgomp/libgomp.map
+++ libgomp/libgomp.map
@@ -464,12 +464,8 @@ OACC_2.5 {
  acc_delete_finalize_async_32_h_;
  acc_delete_finalize_async_64_h_;
  acc_delete_finalize_async_array_h_;
- acc_get_default_async;
- acc_get_default_async_h_;
  acc_memcpy_from_device_async;
  acc_memcpy_to_device_async;
- acc_set_default_async;
- acc_set_default_async_h_;
  acc_update_device_async;
  acc_update_device_async_32_h_;
  acc_update_device_async_64_h_;
diff --git libgomp/oacc-async.c libgomp/oacc-async.c
index 68aaf199a27e..553082fe3d4a 100644
--- libgomp/oacc-async.c
+++ libgomp/oacc-async.c
@@ -60,7 +60,7 @@ lookup_goacc_asyncqueue (struct goacc_thread *thr, bool create, int async)
   /* The special value acc_async_noval (-1) maps to the thread-specific
      default async stream.  */
   if (async == acc_async_noval)
-    async = thr->default_async;
+    async = 0; //TODO thr->default_async;
 
   if (async == acc_async_sync)
     return NULL;
@@ -221,23 +221,6 @@ acc_wait_all_async (int async)
   gomp_mutex_unlock (&thr->dev->openacc.async.lock);
 }
 
-int
-acc_get_default_async (void)
-{
-  struct goacc_thread *thr = get_goacc_thread ();
-  return thr->default_async;
-}
-
-void
-acc_set_default_async (int async)
-{
-  if (async < acc_async_sync)
-    gomp_fatal ("invalid async argument: %d", async);
-
-  struct goacc_thread *thr = get_goacc_thread ();
-  thr->default_async = async;
-}
-
 static void
 goacc_async_unmap_tgt (void *ptr)
 {
diff --git libgomp/oacc-init.c libgomp/oacc-init.c
index 2c2f91ce3c2c..c40f48829078 100644
--- libgomp/oacc-init.c
+++ libgomp/oacc-init.c
@@ -426,8 +426,6 @@ goacc_attach_host_thread_to_device (int ord)
   
   thr->target_tls
     = acc_dev->openacc.create_thread_data_func (ord);
-
-  thr->default_async = acc_async_default;
 }
 
 /* OpenACC 2.0a (3.2.12, 3.2.13) doesn't specify whether the serialization of
diff --git libgomp/oacc-int.h libgomp/oacc-int.h
index 3354eb654ce9..97f3fc8a61ed 100644
--- libgomp/oacc-int.h
+++ libgomp/oacc-int.h
@@ -73,9 +73,6 @@ struct goacc_thread
 
   /* Target-specific data (used by plugin).  */
   void *target_tls;
-
-  /* Default OpenACC async queue for current thread, exported to plugin.  */
-  int default_async;
 };
 
 #if defined HAVE_TLS || defined USE_EMUTLS
diff --git libgomp/openacc.f90 libgomp/openacc.f90
index 7d31ee689479..7c809fe00738 100644
--- libgomp/openacc.f90
+++ libgomp/openacc.f90
@@ -51,10 +51,9 @@ module openacc_kinds
 
   integer, parameter :: acc_handle_kind = int32
 
-  public :: acc_async_default, acc_async_noval, acc_async_sync
+  public :: acc_async_noval, acc_async_sync
 
   ! Keep in sync with include/gomp-constants.h.
-  integer (acc_handle_kind), parameter :: acc_async_default = 0
   integer (acc_handle_kind), parameter :: acc_async_noval = -1
   integer (acc_handle_kind), parameter :: acc_async_sync = -2
 
@@ -93,16 +92,6 @@ module openacc_internal
       integer (acc_device_kind) d
     end function
 
-    subroutine acc_set_default_async_h (a)
-      import
-      integer a
-    end subroutine
-
-    function acc_get_default_async_h ()
-      import
-      integer acc_get_default_async_h
-    end function
-
     function acc_async_test_h (a)
       logical acc_async_test_h
       integer a
@@ -731,7 +720,6 @@ module openacc
 
   public :: acc_get_num_devices, acc_set_device_type, acc_get_device_type
   public :: acc_set_device_num, acc_get_device_num, acc_async_test
-  public :: acc_set_default_async, acc_get_default_async
   public :: acc_async_test_all
   public :: acc_wait, acc_async_wait, acc_wait_async
   public :: acc_wait_all, acc_async_wait_all, acc_wait_all_async
@@ -764,14 +752,6 @@ module openacc
     procedure :: acc_get_device_num_h
   end interface
 
-  interface acc_set_default_async
-    procedure :: acc_set_default_async_h
-  end interface
-
-  interface acc_get_default_async
-    procedure :: acc_get_default_async_h
-  end interface
-
   interface acc_async_test
     procedure :: acc_async_test_h
   end interface
diff --git libgomp/openacc.h libgomp/openacc.h
index ede59d76c862..f61bb77f9f3e 100644
--- libgomp/openacc.h
+++ libgomp/openacc.h
@@ -63,7 +63,6 @@ typedef enum acc_device_t {
 
 typedef enum acc_async_t {
   /* Keep in sync with include/gomp-constants.h.  */
-  acc_async_default = 0,
   acc_async_noval = -1,
   acc_async_sync  = -2
 } acc_async_t;
@@ -73,8 +72,6 @@ void acc_set_device_type (acc_device_t) __GOACC_NOTHROW;
 acc_device_t acc_get_device_type (void) __GOACC_NOTHROW;
 void acc_set_device_num (int, acc_device_t) __GOACC_NOTHROW;
 int acc_get_device_num (acc_device_t) __GOACC_NOTHROW;
-void acc_set_default_async (int) __GOACC_NOTHROW;
-int acc_get_default_async (void) __GOACC_NOTHROW;
 int acc_async_test (int) __GOACC_NOTHROW;
 int acc_async_test_all (void) __GOACC_NOTHROW;
 void acc_wait (int) __GOACC_NOTHROW;
diff --git libgomp/openacc_lib.h libgomp/openacc_lib.h
index 75a693937967..820d987d72e2 100644
--- libgomp/openacc_lib.h
+++ libgomp/openacc_lib.h
@@ -46,7 +46,6 @@
       integer, parameter :: acc_handle_kind = 4
 
 !     Keep in sync with include/gomp-constants.h.
-      integer (acc_handle_kind), parameter :: acc_async_default = 0
       integer (acc_handle_kind), parameter :: acc_async_noval = -1
       integer (acc_handle_kind), parameter :: acc_async_sync = -2
 
@@ -90,18 +89,6 @@
         end function
       end interface
 
-      interface acc_set_default_async
-        subroutine acc_set_default_async_h (a)
-          integer a
-        end subroutine
-      end interface
-
-      interface acc_get_default_async
-        function acc_get_default_async_h ()
-          integer acc_get_default_async_h
-        end function
-      end interface
-
       interface acc_async_test
         function acc_async_test_h (a)
           logical acc_async_test_h
diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/asyncwait-2.c libgomp/testsuite/libgomp.oacc-c-c++-common/asyncwait-2.c
deleted file mode 100644
index 94205407d41d..000000000000
--- libgomp/testsuite/libgomp.oacc-c-c++-common/asyncwait-2.c
+++ /dev/null
@@ -1,904 +0,0 @@
-/* { dg-do run { target openacc_nvidia_accel_selected } } */
-/* { dg-additional-options "-lcuda" } */
-
-#include <openacc.h>
-#include <stdlib.h>
-#include <cuda.h>
-
-#include <stdio.h>
-#include <time.h>
-#include <sys/time.h>
-
-int
-main (int argc, char **argv)
-{
-    CUresult r;
-    CUstream stream1;
-    int N = 128; //1024 * 1024;
-    float *a, *b, *c, *d, *e;
-    int i;
-    int nbytes;
-
-    srand (time (NULL));
-    int s = rand () % 100;
-
-    acc_init (acc_device_nvidia);
-
-    nbytes = N * sizeof (float);
-
-    a = (float *) malloc (nbytes);
-    b = (float *) malloc (nbytes);
-    c = (float *) malloc (nbytes);
-    d = (float *) malloc (nbytes);
-    e = (float *) malloc (nbytes);
-
-    for (i = 0; i < N; i++)
-    {
-        a[i] = 3.0;
-        b[i] = 0.0;
-    }
-
-    acc_set_default_async (s);
-
-#pragma acc data copy (a[0:N]) copy (b[0:N]) copyin (N)
-    {
-
-#pragma acc parallel async
-    {
-        int ii;
-
-        for (ii = 0; ii < N; ii++)
-            b[ii] = a[ii];
-    }
-
-#pragma acc wait
-
-    }
-
-    for (i = 0; i < N; i++)
-    {
-        if (a[i] != 3.0)
-            abort ();
-
-        if (b[i] != 3.0)
-            abort ();
-    }
-
-    for (i = 0; i < N; i++)
-    {
-        a[i] = 2.0;
-        b[i] = 0.0;
-    }
-
-#pragma acc data copy (a[0:N]) copy (b[0:N]) copyin (N)
-    {
-
-#pragma acc parallel async
-    {
-        int ii;
-
-        for (ii = 0; ii < N; ii++)
-            b[ii] = a[ii];
-    }
-
-#pragma acc wait (s)
-
-    }
-
-    for (i = 0; i < N; i++)
-    {
-        if (a[i] != 2.0)
-            abort ();
-
-        if (b[i] != 2.0)
-            abort ();
-    }
-
-    for (i = 0; i < N; i++)
-    {
-        a[i] = 3.0;
-        b[i] = 0.0;
-        c[i] = 0.0;
-        d[i] = 0.0;
-    }
-
-#pragma acc data copy (a[0:N]) copy (b[0:N]) copy (c[0:N]) copy (d[0:N]) copyin (N)
-    {
-
-#pragma acc parallel async
-    {
-        int ii;
-
-        for (ii = 0; ii < N; ii++)
-            b[ii] = (a[ii] * a[ii] * a[ii]) / a[ii];
-    }
-
-#pragma acc parallel async
-    {
-        int ii;
-
-        for (ii = 0; ii < N; ii++)
-            c[ii] = (a[ii] + a[ii] + a[ii] + a[ii]) / a[ii];
-    }
-
-
-#pragma acc parallel async
-    {
-        int ii;
-
-        for (ii = 0; ii < N; ii++)
-            d[ii] = ((a[ii] * a[ii] + a[ii]) / a[ii]) - a[ii];
-    }
-
-#pragma acc wait (s)
-
-    }
-
-    for (i = 0; i < N; i++)
-    {
-        if (a[i] != 3.0)
-            abort ();
-
-        if (b[i] != 9.0)
-            abort ();
-
-        if (c[i] != 4.0)
-            abort ();
-
-        if (d[i] != 1.0)
-            abort ();
-    }
-
-    for (i = 0; i < N; i++)
-    {
-        a[i] = 2.0;
-        b[i] = 0.0;
-        c[i] = 0.0;
-        d[i] = 0.0;
-        e[i] = 0.0;
-    }
-
-#pragma acc data copy (a[0:N], b[0:N], c[0:N], d[0:N], e[0:N]) copyin (N)
-    {
-
-#pragma acc parallel async
-    {
-        int ii;
-
-        for (ii = 0; ii < N; ii++)
-            b[ii] = (a[ii] * a[ii] * a[ii]) / a[ii];
-    }
-
-#pragma acc parallel async
-    {
-        int ii;
-
-        for (ii = 0; ii < N; ii++)
-            c[ii] = (a[ii] + a[ii] + a[ii] + a[ii]) / a[ii];
-    }
-
-#pragma acc parallel async
-    {
-        int ii;
-
-        for (ii = 0; ii < N; ii++)
-            d[ii] = ((a[ii] * a[ii] + a[ii]) / a[ii]) - a[ii];
-    }
-
-#pragma acc parallel wait (s) async (s)
-    {
-        int ii;
-
-        for (ii = 0; ii < N; ii++)
-            e[ii] = a[ii] + b[ii] + c[ii] + d[ii];
-    }
-
-#pragma acc wait (s)
-
-    }
-
-    for (i = 0; i < N; i++)
-    {
-        if (a[i] != 2.0)
-            abort ();
-
-        if (b[i] != 4.0)
-            abort ();
-
-        if (c[i] != 4.0)
-            abort ();
-
-        if (d[i] != 1.0)
-            abort ();
-
-        if (e[i] != 11.0)
-            abort ();
-    }
-
-
-    r = cuStreamCreate (&stream1, CU_STREAM_NON_BLOCKING);
-    if (r != CUDA_SUCCESS)
-    {
-        fprintf (stderr, "cuStreamCreate failed: %d\n", r);
-        abort ();
-    }
-
-    acc_set_cuda_stream (1, stream1);
-
-    for (i = 0; i < N; i++)
-    {
-        a[i] = 5.0;
-        b[i] = 0.0;
-    }
-
-#pragma acc data copy (a[0:N], b[0:N]) copyin (N)
-    {
-
-#pragma acc parallel async
-    {
-        int ii;
-
-        for (ii = 0; ii < N; ii++)
-            b[ii] = a[ii];
-    }
-
-#pragma acc wait (s)
-
-    }
-
-    for (i = 0; i < N; i++)
-    {
-        if (a[i] != 5.0)
-            abort ();
-
-        if (b[i] != 5.0)
-            abort ();
-    }
-
-    for (i = 0; i < N; i++)
-    {
-        a[i] = 7.0;
-        b[i] = 0.0;
-        c[i] = 0.0;
-        d[i] = 0.0;
-    }
-
-#pragma acc data copy (a[0:N]) copy (b[0:N]) copy (c[0:N]) copy (d[0:N]) copyin (N)
-    {
-
-#pragma acc parallel async
-    {
-        int ii;
-
-        for (ii = 0; ii < N; ii++)
-            b[ii] = (a[ii] * a[ii] * a[ii]) / a[ii];
-    }
-
-#pragma acc parallel async
-    {
-        int ii;
-
-        for (ii = 0; ii < N; ii++)
-            c[ii] = (a[ii] + a[ii] + a[ii] + a[ii]) / a[ii];
-    }
-
-#pragma acc parallel async
-    {
-        int ii;
-
-        for (ii = 0; ii < N; ii++)
-            d[ii] = ((a[ii] * a[ii] + a[ii]) / a[ii]) - a[ii];
-    }
-
-#pragma acc wait (s)
-
-    }
-
-    for (i = 0; i < N; i++)
-    {
-        if (a[i] != 7.0)
-            abort ();
-
-        if (b[i] != 49.0)
-            abort ();
-
-        if (c[i] != 4.0)
-            abort ();
-
-        if (d[i] != 1.0)
-            abort ();
-    }
-
-    for (i = 0; i < N; i++)
-    {
-        a[i] = 3.0;
-        b[i] = 0.0;
-        c[i] = 0.0;
-        d[i] = 0.0;
-        e[i] = 0.0;
-    }
-
-#pragma acc data copy (a[0:N], b[0:N], c[0:N], d[0:N], e[0:N]) copyin (N)
-    {
-
-#pragma acc parallel async
-    {
-        int ii;
-
-        for (ii = 0; ii < N; ii++)
-            b[ii] = (a[ii] * a[ii] * a[ii]) / a[ii];
-    }
-
-#pragma acc parallel async
-    {
-        int ii;
-
-        for (ii = 0; ii < N; ii++)
-            c[ii] = (a[ii] + a[ii] + a[ii] + a[ii]) / a[ii];
-    }
-
-#pragma acc parallel async
-    {
-        int ii;
-
-        for (ii = 0; ii < N; ii++)
-            d[ii] = ((a[ii] * a[ii] + a[ii]) / a[ii]) - a[ii];
-    }
-
-#pragma acc parallel wait (s) async (s)
-    {
-        int ii;
-
-        for (ii = 0; ii < N; ii++)
-            e[ii] = a[ii] + b[ii] + c[ii] + d[ii];
-    }
-
-#pragma acc wait (s)
-
-    }
-
-    for (i = 0; i < N; i++)
-    {
-        if (a[i] != 3.0)
-            abort ();
-
-        if (b[i] != 9.0)
-            abort ();
-
-        if (c[i] != 4.0)
-            abort ();
-
-        if (d[i] != 1.0)
-            abort ();
-
-        if (e[i] != 17.0)
-            abort ();
-    }
-
-    for (i = 0; i < N; i++)
-    {
-        a[i] = 4.0;
-        b[i] = 0.0;
-        c[i] = 0.0;
-        d[i] = 0.0;
-        e[i] = 0.0;
-    }
-
-#pragma acc data copyin (a[0:N], b[0:N], c[0:N]) copyin (N)
-    {
-
-#pragma acc parallel async
-    {
-        int ii;
-
-        for (ii = 0; ii < N; ii++)
-            b[ii] = (a[ii] * a[ii] * a[ii]) / a[ii];
-    }
-
-#pragma acc parallel async
-    {
-        int ii;
-
-        for (ii = 0; ii < N; ii++)
-            c[ii] = (a[ii] + a[ii] + a[ii] + a[ii]) / a[ii];
-    }
-
-#pragma acc update host (a[0:N], b[0:N], c[0:N]) wait (s)
-
-    }
-
-    for (i = 0; i < N; i++)
-    {
-        if (a[i] != 4.0)
-            abort ();
-
-        if (b[i] != 16.0)
-            abort ();
-
-        if (c[i] != 4.0)
-            abort ();
-    }
-
-
-    for (i = 0; i < N; i++)
-    {
-        a[i] = 5.0;
-        b[i] = 0.0;
-        c[i] = 0.0;
-        d[i] = 0.0;
-        e[i] = 0.0;
-    }
-
-#pragma acc data copyin (a[0:N], b[0:N], c[0:N]) copyin (N)
-    {
-
-#pragma acc parallel async
-    {
-        int ii;
-
-        for (ii = 0; ii < N; ii++)
-            b[ii] = (a[ii] * a[ii] * a[ii]) / a[ii];
-    }
-
-#pragma acc parallel async
-    {
-        int ii;
-
-        for (ii = 0; ii < N; ii++)
-            c[ii] = (a[ii] + a[ii] + a[ii] + a[ii]) / a[ii];
-    }
-
-#pragma acc update host (a[0:N], b[0:N], c[0:N]) async
-
-#pragma acc wait (s)
-
-    }
-
-    for (i = 0; i < N; i++)
-    {
-        if (a[i] != 5.0)
-            abort ();
-
-        if (b[i] != 25.0)
-            abort ();
-
-        if (c[i] != 4.0)
-            abort ();
-    }
-
-    for (i = 0; i < N; i++)
-    {
-        a[i] = 3.0;
-        b[i] = 0.0;
-    }
-
-#pragma acc data copy (a[0:N]) copy (b[0:N]) copyin (N)
-    {
-
-#pragma acc kernels async
-    {
-        int ii;
-
-        for (ii = 0; ii < N; ii++)
-            b[ii] = a[ii];
-    }
-
-#pragma acc wait
-
-    }
-
-    for (i = 0; i < N; i++)
-    {
-        if (a[i] != 3.0)
-            abort ();
-
-        if (b[i] != 3.0)
-            abort ();
-    }
-
-    for (i = 0; i < N; i++)
-    {
-        a[i] = 2.0;
-        b[i] = 0.0;
-    }
-
-#pragma acc data copy (a[0:N]) copy (b[0:N]) copyin (N)
-    {
-
-#pragma acc kernels async
-    {
-        int ii;
-
-        for (ii = 0; ii < N; ii++)
-            b[ii] = a[ii];
-    }
-
-#pragma acc wait (s)
-
-    }
-
-    for (i = 0; i < N; i++)
-    {
-        if (a[i] != 2.0)
-            abort ();
-
-        if (b[i] != 2.0)
-            abort ();
-    }
-
-    for (i = 0; i < N; i++)
-    {
-        a[i] = 3.0;
-        b[i] = 0.0;
-        c[i] = 0.0;
-        d[i] = 0.0;
-    }
-
-#pragma acc data copy (a[0:N]) copy (b[0:N]) copy (c[0:N]) copy (d[0:N]) copyin (N)
-    {
-
-#pragma acc kernels async
-    {
-        int ii;
-
-        for (ii = 0; ii < N; ii++)
-            b[ii] = (a[ii] * a[ii] * a[ii]) / a[ii];
-    }
-
-#pragma acc kernels async
-    {
-        int ii;
-
-        for (ii = 0; ii < N; ii++)
-            c[ii] = (a[ii] + a[ii] + a[ii] + a[ii]) / a[ii];
-    }
-
-
-#pragma acc kernels async
-    {
-        int ii;
-
-        for (ii = 0; ii < N; ii++)
-            d[ii] = ((a[ii] * a[ii] + a[ii]) / a[ii]) - a[ii];
-    }
-
-#pragma acc wait (s)
-
-    }
-
-    for (i = 0; i < N; i++)
-    {
-        if (a[i] != 3.0)
-            abort ();
-
-        if (b[i] != 9.0)
-            abort ();
-
-        if (c[i] != 4.0)
-            abort ();
-
-        if (d[i] != 1.0)
-            abort ();
-    }
-
-    for (i = 0; i < N; i++)
-    {
-        a[i] = 2.0;
-        b[i] = 0.0;
-        c[i] = 0.0;
-        d[i] = 0.0;
-        e[i] = 0.0;
-    }
-
-#pragma acc data copy (a[0:N], b[0:N], c[0:N], d[0:N], e[0:N]) copyin (N)
-    {
-
-#pragma acc kernels async
-    {
-        int ii;
-
-        for (ii = 0; ii < N; ii++)
-            b[ii] = (a[ii] * a[ii] * a[ii]) / a[ii];
-    }
-
-#pragma acc kernels async
-    {
-        int ii;
-
-        for (ii = 0; ii < N; ii++)
-            c[ii] = (a[ii] + a[ii] + a[ii] + a[ii]) / a[ii];
-    }
-
-#pragma acc kernels async
-    {
-        int ii;
-
-        for (ii = 0; ii < N; ii++)
-            d[ii] = ((a[ii] * a[ii] + a[ii]) / a[ii]) - a[ii];
-    }
-
-#pragma acc kernels wait (s) async (s)
-    {
-        int ii;
-
-        for (ii = 0; ii < N; ii++)
-            e[ii] = a[ii] + b[ii] + c[ii] + d[ii];
-    }
-
-#pragma acc wait (s)
-
-    }
-
-    for (i = 0; i < N; i++)
-    {
-        if (a[i] != 2.0)
-            abort ();
-
-        if (b[i] != 4.0)
-            abort ();
-
-        if (c[i] != 4.0)
-            abort ();
-
-        if (d[i] != 1.0)
-            abort ();
-
-        if (e[i] != 11.0)
-            abort ();
-    }
-
-
-    r = cuStreamCreate (&stream1, CU_STREAM_NON_BLOCKING);
-    if (r != CUDA_SUCCESS)
-    {
-        fprintf (stderr, "cuStreamCreate failed: %d\n", r);
-        abort ();
-    }
-
-    acc_set_cuda_stream (1, stream1);
-
-    for (i = 0; i < N; i++)
-    {
-        a[i] = 5.0;
-        b[i] = 0.0;
-    }
-
-#pragma acc data copy (a[0:N], b[0:N]) copyin (N)
-    {
-
-#pragma acc kernels async
-    {
-        int ii;
-
-        for (ii = 0; ii < N; ii++)
-            b[ii] = a[ii];
-    }
-
-#pragma acc wait (s)
-
-    }
-
-    for (i = 0; i < N; i++)
-    {
-        if (a[i] != 5.0)
-            abort ();
-
-        if (b[i] != 5.0)
-            abort ();
-    }
-
-    for (i = 0; i < N; i++)
-    {
-        a[i] = 7.0;
-        b[i] = 0.0;
-        c[i] = 0.0;
-        d[i] = 0.0;
-    }
-
-#pragma acc data copy (a[0:N]) copy (b[0:N]) copy (c[0:N]) copy (d[0:N]) copyin (N)
-    {
-
-#pragma acc kernels async
-    {
-        int ii;
-
-        for (ii = 0; ii < N; ii++)
-            b[ii] = (a[ii] * a[ii] * a[ii]) / a[ii];
-    }
-
-#pragma acc kernels async
-    {
-        int ii;
-
-        for (ii = 0; ii < N; ii++)
-            c[ii] = (a[ii] + a[ii] + a[ii] + a[ii]) / a[ii];
-    }
-
-#pragma acc kernels async
-    {
-        int ii;
-
-        for (ii = 0; ii < N; ii++)
-            d[ii] = ((a[ii] * a[ii] + a[ii]) / a[ii]) - a[ii];
-    }
-
-#pragma acc wait (s)
-
-    }
-
-    for (i = 0; i < N; i++)
-    {
-        if (a[i] != 7.0)
-            abort ();
-
-        if (b[i] != 49.0)
-            abort ();
-
-        if (c[i] != 4.0)
-            abort ();
-
-        if (d[i] != 1.0)
-            abort ();
-    }
-
-    for (i = 0; i < N; i++)
-    {
-        a[i] = 3.0;
-        b[i] = 0.0;
-        c[i] = 0.0;
-        d[i] = 0.0;
-        e[i] = 0.0;
-    }
-
-#pragma acc data copy (a[0:N], b[0:N], c[0:N], d[0:N], e[0:N]) copyin (N)
-    {
-
-#pragma acc kernels async
-    {
-        int ii;
-
-        for (ii = 0; ii < N; ii++)
-            b[ii] = (a[ii] * a[ii] * a[ii]) / a[ii];
-    }
-
-#pragma acc kernels async
-    {
-        int ii;
-
-        for (ii = 0; ii < N; ii++)
-            c[ii] = (a[ii] + a[ii] + a[ii] + a[ii]) / a[ii];
-    }
-
-#pragma acc kernels async
-    {
-        int ii;
-
-        for (ii = 0; ii < N; ii++)
-            d[ii] = ((a[ii] * a[ii] + a[ii]) / a[ii]) - a[ii];
-    }
-
-#pragma acc kernels wait (s) async (s)
-    {
-        int ii;
-
-        for (ii = 0; ii < N; ii++)
-            e[ii] = a[ii] + b[ii] + c[ii] + d[ii];
-    }
-
-#pragma acc wait (s)
-
-    }
-
-    for (i = 0; i < N; i++)
-    {
-        if (a[i] != 3.0)
-            abort ();
-
-        if (b[i] != 9.0)
-            abort ();
-
-        if (c[i] != 4.0)
-            abort ();
-
-        if (d[i] != 1.0)
-            abort ();
-
-        if (e[i] != 17.0)
-            abort ();
-    }
-
-    for (i = 0; i < N; i++)
-    {
-        a[i] = 4.0;
-        b[i] = 0.0;
-        c[i] = 0.0;
-        d[i] = 0.0;
-        e[i] = 0.0;
-    }
-
-#pragma acc data copyin (a[0:N], b[0:N], c[0:N]) copyin (N)
-    {
-
-#pragma acc kernels async
-    {
-        int ii;
-
-        for (ii = 0; ii < N; ii++)
-            b[ii] = (a[ii] * a[ii] * a[ii]) / a[ii];
-    }
-
-#pragma acc kernels async
-    {
-        int ii;
-
-        for (ii = 0; ii < N; ii++)
-            c[ii] = (a[ii] + a[ii] + a[ii] + a[ii]) / a[ii];
-    }
-
-#pragma acc update host (a[0:N], b[0:N], c[0:N]) wait (s)
-
-    }
-
-    for (i = 0; i < N; i++)
-    {
-        if (a[i] != 4.0)
-            abort ();
-
-        if (b[i] != 16.0)
-            abort ();
-
-        if (c[i] != 4.0)
-            abort ();
-    }
-
-
-    for (i = 0; i < N; i++)
-    {
-        a[i] = 5.0;
-        b[i] = 0.0;
-        c[i] = 0.0;
-        d[i] = 0.0;
-        e[i] = 0.0;
-    }
-
-#pragma acc data copyin (a[0:N], b[0:N], c[0:N]) copyin (N)
-    {
-
-#pragma acc kernels async
-    {
-        int ii;
-
-        for (ii = 0; ii < N; ii++)
-            b[ii] = (a[ii] * a[ii] * a[ii]) / a[ii];
-    }
-
-#pragma acc kernels async
-    {
-        int ii;
-
-        for (ii = 0; ii < N; ii++)
-            c[ii] = (a[ii] + a[ii] + a[ii] + a[ii]) / a[ii];
-    }
-
-#pragma acc update host (a[0:N], b[0:N], c[0:N]) async
-
-#pragma acc wait (s)
-
-    }
-
-    for (i = 0; i < N; i++)
-    {
-        if (a[i] != 5.0)
-            abort ();
-
-        if (b[i] != 25.0)
-            abort ();
-
-        if (c[i] != 4.0)
-            abort ();
-    }
-
-    acc_shutdown (acc_device_nvidia);
-
-    return 0;
-}


Grüße
 Thomas
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH 0/6, OpenACC, libgomp] Async re-work

Thomas Schwinge-8
In reply to this post by Chung-Lin Tang-5
Hi Chung-Lin!

A little bit of documentation starter update for you to include.  Please
make sure that all relevant functions have such comments addded.

commit 7e0896281d155e1544751f43c1eaace8e005e019
Author: Thomas Schwinge <[hidden email]>
Date:   Thu Dec 13 17:59:46 2018 +0100

    [WIP] into async re-work: documentation
---
 libgomp/libgomp.h             | 3 +++
 libgomp/oacc-async.c          | 7 +++++++
 libgomp/plugin/plugin-nvptx.c | 4 ++--
 libgomp/target.c              | 3 +++
 4 files changed, 15 insertions(+), 2 deletions(-)

diff --git libgomp/libgomp.h libgomp/libgomp.h
index 8b74d6368389..574fcd1ee4ad 100644
--- libgomp/libgomp.h
+++ libgomp/libgomp.h
@@ -949,6 +949,9 @@ typedef struct acc_dispatch_t
   __typeof (GOMP_OFFLOAD_openacc_exec) *exec_func;
 
   struct {
+    /* Once created and put into the "active" list, asyncqueues are then never
+       destructed and removed from the "active" list, other than if the TODO
+       device is shut down.  */
     gomp_mutex_t lock;
     int nasyncqueue;
     struct goacc_asyncqueue **asyncqueue;
diff --git libgomp/oacc-async.c libgomp/oacc-async.c
index b091ba2460ac..0f5f74bdf836 100644
--- libgomp/oacc-async.c
+++ libgomp/oacc-async.c
@@ -280,6 +280,10 @@ goacc_async_free (struct gomp_device_descr *devicep,
     devicep->openacc.async.queue_callback_func (aq, free, ptr);
 }
 
+/* This function initializes the asyncqueues for the device specified by
+   DEVICEP.  TODO DEVICEP must be locked on entry, and remains locked on
+   return.  */
+
 attribute_hidden void
 goacc_init_asyncqueues (struct gomp_device_descr *devicep)
 {
@@ -289,6 +293,9 @@ goacc_init_asyncqueues (struct gomp_device_descr *devicep)
   devicep->openacc.async.active = NULL;
 }
 
+/* This function finalizes the asyncqueues for the device specified by DEVICEP.
+   TODO DEVICEP must be locked on entry, and remains locked on return.  */
+
 attribute_hidden bool
 goacc_fini_asyncqueues (struct gomp_device_descr *devicep)
 {
diff --git libgomp/plugin/plugin-nvptx.c libgomp/plugin/plugin-nvptx.c
index 7b658264b8e7..577ed39ef3f6 100644
--- libgomp/plugin/plugin-nvptx.c
+++ libgomp/plugin/plugin-nvptx.c
@@ -1340,14 +1340,14 @@ GOMP_OFFLOAD_openacc_cuda_get_current_context (void)
   return nvptx_get_current_cuda_context ();
 }
 
-/* NOTE: This returns a CUstream, not a ptx_stream pointer.  */
+/* This returns a CUstream.  */
 void *
 GOMP_OFFLOAD_openacc_cuda_get_stream (struct goacc_asyncqueue *aq)
 {
   return (void *) aq->cuda_stream;
 }
 
-/* NOTE: This takes a CUstream, not a ptx_stream pointer.  */
+/* This takes a CUstream.  */
 int
 GOMP_OFFLOAD_openacc_cuda_set_stream (struct goacc_asyncqueue *aq, void *stream)
 {
diff --git libgomp/target.c libgomp/target.c
index e67d9248ae0b..96df1890a729 100644
--- libgomp/target.c
+++ libgomp/target.c
@@ -1506,6 +1506,9 @@ gomp_init_device (struct gomp_device_descr *devicep)
   devicep->state = GOMP_DEVICE_INITIALIZED;
 }
 
+/* This function finalizes the target device, specified by DEVICEP.  DEVICEP
+   must be locked on entry, and remains locked on return.  */
+
 attribute_hidden bool
 gomp_fini_device (struct gomp_device_descr *devicep)
 {


Grüße
 Thomas
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH 0/6, OpenACC, libgomp] Async re-work

Chung-Lin Tang-5
In reply to this post by Thomas Schwinge-8
On 2018/12/13 11:51 PM, Thomas Schwinge wrote:

> On Thu, 13 Dec 2018 23:28:49 +0800, Chung-Lin Tang<[hidden email]>  wrote:
>> On 2018/12/7 6:26 AM, Julian Brown wrote:
>>> On Thu, 6 Dec 2018 22:22:46 +0000
>>> Julian Brown<[hidden email]>  wrote:
>>>
>>>> On Thu, 6 Dec 2018 21:42:14 +0100
>>>> Thomas Schwinge<[hidden email]>  wrote:
>>>>
>>>>> [...]
>>>>> ..., where the "Invalid read of size 8" happens, and which
>>>>> eventually would try to "free (tgt)" again, via
>>>>> libgomp/target.c:gomp_unmap_tgt:
>>>>>
>>>>>       attribute_hidden void
>>>>>       gomp_unmap_tgt (struct target_mem_desc *tgt)
>>>>>       {
>>>>>         /* Deallocate on target the tgt->tgt_start .. tgt->tgt_end
>>>>> region.  */ if (tgt->tgt_end)
>>>>>           gomp_free_device_memory (tgt->device_descr, tgt->to_free);
>>>>>      
>>>>>         free (tgt->array);
>>>>>         free (tgt);
>>>>>       }
>>>>>
>>>>> Is the "free (tgt)" in libgomp/target.c:gomp_unmap_vars_async wrong,
>>>>> or something else?
I think I understand the problem now. In gomp_unmap_vars_async(), in the case of
tgt->list_count == 0 (i.e. no map arguments at all) the code should simply free the tgt
and return, while the code in goacc_async_copyout_unmap_vars() didn't handle this case
and always scheduled an asynchronous free of the tgt later, causing that valgrind error
you see.

I am still testing the attached patch, but I think it is the right fix: I reviewed what I
wrote and it seemed the way I organized things into a goacc_async_copyout_unmap_vars() routine,
including the hackish refcount++, etc. is simply unneeded. I have deleted those stuff
and consolidated things back into gomp_unmap_vars_async().

I'll update the whole patches later after complete testing, the attached is the patch atop
of the prior async patches. (the small program you gave above does pass valgrind now)

Julian, I didn't try the OG8 refcount changes, it's just too large a set of changes to
reason about in so short time, maybe later when we are prepared to fix things completely as
you noted what those patches were capable of.

Chung-Lin







async-unmap-fix.diff (4K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH 0/6, OpenACC, libgomp] Async re-work

Thomas Schwinge
In reply to this post by Chung-Lin Tang-5
Hi Chung-Lin!

On Tue, 25 Sep 2018 21:09:49 +0800, Chung-Lin Tang <[hidden email]> wrote:
> This patch is a re-organization of OpenACC asynchronous queues.

Again, many thanks for that!

In addition to the review emails I just posted, I've also put all that
stuff into a GitHub branch:
<https://github.com/tschwinge/gcc/tree/wip-async_re-work>.

This also includes some more "into async re-work: replicate [...]"
commits to adjust your work for preparational things that I plan to
commit before.  I split these out intentionally, so that you can easily
see/review these changes.


Grüße
 Thomas