[PATCH 0/6, OpenACC, libgomp] Async re-work

classic Classic list List threaded Threaded
4 messages Options
Reply | Threaded
Open this post in threaded view
|

[PATCH 0/6, OpenACC, libgomp] Async re-work

Chung-Lin Tang-5
This patch is a re-organization of OpenACC asynchronous queues. The previous style of implementation
was essentially re-defining the entire async API inside the plugin-interface, and relaying all such
API calls to the target plugin, which is awkward in design; it requires (each) target plugin to
essentially re-implement large portions of the async functionality to support OpenACC, and the
way it uses a state-setting style to "select/de-select" asynchronous queues for operations litters
a lot of code paths.

The new design proposed here in this patch declares a "struct goacc_asyncqueue*" opaque type in libgomp.h,
and re-defines the plugin interface to a few operations (e.g. construct/destruct/test/synchronize/etc.)
on this async-queue type, all details are target-dependent inside the specific plugin/plugin-<target>.c file.

Also included in this patch is the code for the acc_get/set_default_async API functions in OpenACC 2.5.
It's a minor part of this patch, but since some code was merge together, I'm submitting it together here.

Testing has been done with offloading enabled. The results are mostly okay, but with a few issues
with either yet incomplete submission of our testsuite adjustment patches, or other independent problems.
Seeking permission to commit this to trunk first.

Thanks,
Chung-Lin

2018-09-25  Chung-Lin Tang  <[hidden email]>

        include/
        * gomp-constants.h (GOMP_ASYNC_DEFAULT): Define.
        (GOMP_VERSION): Increment for new plugin interface changes.

        libgomp/
        * libgomp-plugin.h (struct goacc_asyncqueue): Declare.
        (struct goacc_asyncqueue_list): Likewise.
        (goacc_aq): Likewise.
        (goacc_aq_list): Likewise.
        (GOMP_OFFLOAD_openacc_register_async_cleanup): Remove.
        (GOMP_OFFLOAD_openacc_async_test): Remove.
        (GOMP_OFFLOAD_openacc_async_test_all): Remove.
        (GOMP_OFFLOAD_openacc_async_wait): Remove.
        (GOMP_OFFLOAD_openacc_async_wait_async): Remove.
        (GOMP_OFFLOAD_openacc_async_wait_all): Remove.
        (GOMP_OFFLOAD_openacc_async_wait_all_async): Remove.
        (GOMP_OFFLOAD_openacc_async_set_async): Remove.
        (GOMP_OFFLOAD_openacc_exec): Adjust declaration.
        (GOMP_OFFLOAD_openacc_cuda_get_stream): Likewise.
        (GOMP_OFFLOAD_openacc_cuda_set_stream): Likewise.

        (GOMP_OFFLOAD_openacc_async_exec): Declare.
        (GOMP_OFFLOAD_openacc_async_construct): Declare.
        (GOMP_OFFLOAD_openacc_async_destruct): Declare.
        (GOMP_OFFLOAD_openacc_async_test): Declare.
        (GOMP_OFFLOAD_openacc_async_synchronize): Declare.
        (GOMP_OFFLOAD_openacc_async_serialize): Declare.
        (GOMP_OFFLOAD_openacc_async_queue_callback): Declare.
        (GOMP_OFFLOAD_openacc_async_host2dev): Declare.
        (GOMP_OFFLOAD_openacc_async_dev2host): Declare.

        * libgomp.h (struct acc_dispatch_t): Define 'async' sub-struct.
        (gomp_acc_insert_pointer): Adjust declaration.
        (gomp_copy_host2dev): New declaration.
        (gomp_copy_dev2host): Likewise.
        (gomp_map_vars_async): Likewise.
        (gomp_unmap_tgt): Likewise.
        (gomp_unmap_vars_async): Likewise.
        (gomp_fini_device): Likewise.

        * libgomp.map (OACC_2.5): Add acc_get_default_async,
        acc_get_default_async_h_, acc_set_default_async, and
        acc_set_default_async_h_.
        (GOMP_PLUGIN_1.0): Remove GOMP_PLUGIN_async_unmap_vars.

        * oacc-async.c (get_goacc_thread): New function.
        (get_goacc_thread_device): New function.
        (lookup_goacc_asyncqueue): New function.
        (get_goacc_asyncqueue): New function.
        (acc_async_test): Adjust code to use new async design.
        (acc_async_test_all): Likewise.
        (acc_wait): Likewise.
        (acc_wait_async): Likewise.
        (acc_wait_all): Likewise.
        (acc_wait_all_async): Likewise.
        (acc_get_default_async): New API function.
        (acc_set_default_async): Likewise.
        (goacc_async_unmap_tgt): New function.
        (goacc_async_copyout_unmap_vars): Likewise.
        (goacc_async_free): Likewise.
        (goacc_init_asyncqueues): Likewise.
        (goacc_fini_asyncqueues): Likewise.
        * oacc-cuda.c (acc_get_cuda_stream): Adjust code to use new async
        design.
        (acc_set_cuda_stream): Likewise.
        * oacc-host.c (host_openacc_exec): Adjust parameters, remove 'async'.
        (host_openacc_register_async_cleanup): Remove.
        (host_openacc_async_exec): New function.
        (host_openacc_async_test): Adjust parameters.
        (host_openacc_async_test_all): Remove.
        (host_openacc_async_wait): Remove.
        (host_openacc_async_wait_async): Remove.
        (host_openacc_async_wait_all): Remove.
        (host_openacc_async_wait_all_async): Remove.
        (host_openacc_async_set_async): Remove.
        (host_openacc_async_synchronize): New function.
        (host_openacc_async_serialize): New function.
        (host_openacc_async_host2dev): New function.
        (host_openacc_async_dev2host): New function.
        (host_openacc_async_queue_callback): New function.
        (host_openacc_async_construct): New function.
        (host_openacc_async_destruct): New function.
        (struct gomp_device_descr host_dispatch): Remove initialization of old
        interface, add intialization of new async sub-struct.
        * oacc-init.c (acc_shutdown_1): Adjust to use gomp_fini_device.
        (goacc_attach_host_thread_to_device): Remove old async code usage, add
        initialization of per-thread default_async.
        * oacc-int.h (struct goacc_thread): Add default_async field.
        (goacc_init_asyncqueues): New declaration.
        (goacc_fini_asyncqueues): Likewise.
        (goacc_async_copyout_unmap_vars): Likewise.
        (goacc_async_free): Likewise.
        (get_goacc_asyncqueue): Likewise.
        (lookup_goacc_asyncqueue): Likewise.

        * oacc-mem.c (memcpy_tofrom_device): Adjust code to use new async
        design.
        (present_create_copy): Adjust code to use new async design.
        (delete_copyout): Likewise.
        (update_dev_host): Likewise.
        (gomp_acc_insert_pointer): Add async parameter, adjust code to use new
        async design.
        (gomp_acc_remove_pointer): Adjust code to use new async design.
        * oacc-parallel.c (GOACC_parallel_keyed): Adjust code to use new async
        design, adjust profiling bits, interpret launch op as signed 16-bit
        field.
        (GOACC_enter_exit_data): Handle -1 as waits num, adjust code to use new
        async design.
        (goacc_wait): Adjust code to use new async design.
        (GOACC_update): Likewise.
        * oacc-plugin.c (GOMP_PLUGIN_async_unmap_vars): Remove.

        * target.c (goacc_device_copy_async): New function.
        (gomp_copy_host2dev): Remove 'static', add goacc_asyncqueue parameter,
        add goacc_device_copy_async case.
        (gomp_copy_dev2host): Likewise.
        (gomp_map_vars_existing): Add goacc_asyncqueue parameter, adjust code.
        (gomp_map_pointer): Likewise.
        (gomp_map_fields_existing): Likewise.
        (gomp_map_vars): Add function for compatiblity.
        (gomp_map_vars_async): Adapt from gomp_map_vars, add goacc_asyncqueue
        parameter.
        (gomp_unmap_tgt): Remove statis, add attribute_hidden.
        (gomp_unmap_vars): Add function for compatiblity.
        (gomp_unmap_vars_async): Adapt from gomp_unmap_vars, add
        goacc_asyncqueue parameter.
        (gomp_fini_device): New function.
        (gomp_exit_data): Adjust gomp_copy_dev2host call.
        (gomp_load_plugin_for_device): Remove old interface, adjust to load
        new async interface.
        (gomp_target_fini): Adjust code to call gomp_fini_device.

        * plugin/plugin-nvptx.c (struct cuda_map): Remove.
        (struct ptx_stream): Remove.
        (struct nvptx_thread): Remove current_stream field.
        (cuda_map_create): Remove.
        (cuda_map_destroy): Remove.
        (map_init): Remove.
        (map_fini): Remove.
        (map_pop): Remove.
        (map_push): Remove.
        (struct goacc_asyncqueue): Define.
        (struct nvptx_callback): Define.
        (struct ptx_free_block): Define.
        (struct ptx_device): Remove null_stream, active_streams, async_streams,
        stream_lock, and next fields.
        (enum ptx_event_type): Remove.
        (struct ptx_event): Remove.
        (ptx_event_lock): Remove.
        (ptx_events): Remove.
        (init_streams_for_device): Remove.
        (fini_streams_for_device): Remove.
        (select_stream_for_async): Remove.
        (nvptx_init): Remove ptx_events and ptx_event_lock references.
        (nvptx_attach_host_thread_to_device): Remove CUDA_ERROR_NOT_PERMITTED
        case.
        (nvptx_open_device): Add free_blocks initialization, remove
        init_streams_for_device call.
        (nvptx_close_device): Remove fini_streams_for_device call, add
        free_blocks destruct code.
        (event_gc): Remove.
        (event_add): Remove.
        (nvptx_exec): Adjust parameters and code.
        (nvptx_free): Likewise.
        (nvptx_host2dev): Remove.
        (nvptx_dev2host): Remove.
        (nvptx_set_async): Remove.
        (nvptx_async_test): Remove.
        (nvptx_async_test_all): Remove.
        (nvptx_wait): Remove.
        (nvptx_wait_async): Remove.
        (nvptx_wait_all): Remove.
        (nvptx_wait_all_async): Remove.
        (nvptx_get_cuda_stream): Remove.
        (nvptx_set_cuda_stream): Remove.
        (GOMP_OFFLOAD_alloc): Adjust code.
        (GOMP_OFFLOAD_free): Likewise.
        (GOMP_OFFLOAD_openacc_register_async_cleanup): Remove.
        (GOMP_OFFLOAD_openacc_exec): Adjust parameters and code.
        (GOMP_OFFLOAD_openacc_async_test_all): Remove.
        (GOMP_OFFLOAD_openacc_async_wait): Remove.
        (GOMP_OFFLOAD_openacc_async_wait_async): Remove.
        (GOMP_OFFLOAD_openacc_async_wait_all): Remove.
        (GOMP_OFFLOAD_openacc_async_wait_all_async): Remove.
        (GOMP_OFFLOAD_openacc_async_set_async): Remove.
        (cuda_free_argmem): New function.
        (GOMP_OFFLOAD_openacc_async_exec): New plugin hook function.
        (GOMP_OFFLOAD_openacc_create_thread_data): Adjust code.
        (GOMP_OFFLOAD_openacc_cuda_get_stream): Adjust code.
        (GOMP_OFFLOAD_openacc_cuda_set_stream): Adjust code.
        (GOMP_OFFLOAD_openacc_async_construct): New plugin hook function.
        (GOMP_OFFLOAD_openacc_async_destruct): New plugin hook function.
        (GOMP_OFFLOAD_openacc_async_test): Remove and re-implement.
        (GOMP_OFFLOAD_openacc_async_synchronize): New plugin hook function.
        (GOMP_OFFLOAD_openacc_async_serialize): New plugin hook function.
        (GOMP_OFFLOAD_openacc_async_queue_callback): New plugin hook function.
        (cuda_callback_wrapper): New function.
        (cuda_memcpy_sanity_check): New function.
        (GOMP_OFFLOAD_host2dev): Remove and re-implement.
        (GOMP_OFFLOAD_dev2host): Remove and re-implement.
        (GOMP_OFFLOAD_openacc_async_host2dev): New plugin hook function.
        (GOMP_OFFLOAD_openacc_async_dev2host): New plugin hook function.

        * testsuite/libgomp.oacc-c-c++-common/data-2-lib.c: Adjust testcase.
        * testsuite/libgomp.oacc-c-c++-common/data-2.c: Likewise.
        * testsuite/libgomp.oacc-c-c++-common/data-3.c: Likewise.
        * testsuite/libgomp.oacc-c-c++-common/lib-71.c: Likewise.
        * testsuite/libgomp.oacc-c-c++-common/lib-77.c: Likewise.
        * testsuite/libgomp.oacc-c-c++-common/lib-79.c: Likewise.
        * testsuite/libgomp.oacc-c-c++-common/lib-81.c: Likewise.
        * testsuite/libgomp.oacc-fortran/lib-12.f90: Likewise.

Reply | Threaded
Open this post in threaded view
|

Re: [PATCH 0/6, OpenACC, libgomp] Async re-work

Thomas Schwinge-8
Hi Chung-Lin!

On Tue, 25 Sep 2018 21:09:49 +0800, Chung-Lin Tang <[hidden email]> wrote:
> This patch is a re-organization of OpenACC asynchronous queues.

Thanks!

> The previous style of implementation
> was essentially re-defining the entire async API inside the plugin-interface, and relaying all such
> API calls to the target plugin, which is awkward in design; it requires (each) target plugin to
> essentially re-implement large portions of the async functionality to support OpenACC, and the
> way it uses a state-setting style to "select/de-select" asynchronous queues for operations litters
> a lot of code paths.
>
> The new design proposed here in this patch declares a "struct goacc_asyncqueue*" opaque type in libgomp.h,
> and re-defines the plugin interface to a few operations (e.g. construct/destruct/test/synchronize/etc.)
> on this async-queue type, all details are target-dependent inside the specific plugin/plugin-<target>.c file.

Conceptually, ACK.


> Also included in this patch is the code for the acc_get/set_default_async API functions in OpenACC 2.5.
> It's a minor part of this patch, but since some code was merge together, I'm submitting it together here.

As I requested, I'm reviewing those changes separately, and have backed
out those changes in my working copy.


> Testing has been done with offloading enabled. The results are mostly okay, but with a few issues
> with either yet incomplete submission of our testsuite adjustment patches, or other independent problems.

We'll need to understand these.  


> Seeking permission to commit this to trunk first.

A few things will need to be clarified.


For example, for the simple program:

    int main(void)
    {
    #pragma acc parallel async(1)
      ;
    #pragma acc wait
   
      return 0;
    }

..., I'm seeing memory corruption, which (oaccasionally...) shows up as
an abort due to "free" complaining, but also reproduces more reliably
with "valgrind".  It also reproduces on openacc-gcc-8-branch:

    $ valgrind ./a.out
    [...]
    ==26392== Invalid read of size 8
    ==26392==    at 0x4E653B0: goacc_async_unmap_tgt (oacc-async.c:368)
    ==26392==    by 0x5C90901: cuda_callback_wrapper (plugin-nvptx.c:1648)
    ==26392==    by 0x6066B8D: ??? (in /usr/lib/x86_64-linux-gnu/libcuda.so.390.77)
    ==26392==    by 0x607A10F: ??? (in /usr/lib/x86_64-linux-gnu/libcuda.so.390.77)
    ==26392==    by 0x50816DA: start_thread (pthread_create.c:463)
    ==26392==    by 0x53BA88E: clone (clone.S:95)
    ==26392==  Address 0x8d19f50 is 0 bytes inside a block of size 64 free'd
    ==26392==    at 0x4C30D3B: free (vg_replace_malloc.c:530)
    ==26392==    by 0x4E65BEE: goacc_async_copyout_unmap_vars (oacc-async.c:383)
    ==26392==    by 0x4E607C9: GOACC_parallel_keyed_internal (oacc-parallel.c:403)
    ==26392==    by 0x4E60EAA: GOACC_parallel_keyed_v2 (oacc-parallel.c:439)
    ==26392==    by 0x40094F: ??? (in [...]/a.out)
    ==26392==    by 0x52BAB96: (below main) (libc-start.c:310)
    ==26392==  Block was alloc'd at
    ==26392==    at 0x4C2FB0F: malloc (vg_replace_malloc.c:299)
    ==26392==    by 0x4E47538: gomp_malloc (alloc.c:37)
    ==26392==    by 0x4E5AEEB: gomp_map_vars_async (target.c:731)
    ==26392==    by 0x4E60C2B: GOACC_parallel_keyed_internal (oacc-parallel.c:345)
    ==26392==    by 0x4E60EAA: GOACC_parallel_keyed_v2 (oacc-parallel.c:439)
    ==26392==    by 0x40094F: ??? (in [...]/a.out)
    ==26392==    by 0x52BAB96: (below main) (libc-start.c:310)
    [...]

Per my understanding, the problem is that, called from
libgomp/oacc-async.c:goacc_async_copyout_unmap_vars,
libgomp/target.c:gomp_unmap_vars_async runs into:

      if (tgt->list_count == 0)
        {
          free (tgt);
          return;
        }

..., and then goacc_async_copyout_unmap_vars does:

      devicep->openacc.async.queue_callback_func (aq, goacc_async_unmap_tgt,
                                                  (void *) tgt);

..., which will then call libgomp/oacc-async.c:goacc_async_unmap_tgt:

    static void
    goacc_async_unmap_tgt (void *ptr)
    {
      struct target_mem_desc *tgt = (struct target_mem_desc *) ptr;
   
      if (tgt->refcount > 1)
        tgt->refcount--;
      else
        gomp_unmap_tgt (tgt);
    }

..., where the "Invalid read of size 8" happens, and which eventually
would try to "free (tgt)" again, via libgomp/target.c:gomp_unmap_tgt:

    attribute_hidden void
    gomp_unmap_tgt (struct target_mem_desc *tgt)
    {
      /* Deallocate on target the tgt->tgt_start .. tgt->tgt_end region.  */
      if (tgt->tgt_end)
        gomp_free_device_memory (tgt->device_descr, tgt->to_free);
   
      free (tgt->array);
      free (tgt);
    }

Is the "free (tgt)" in libgomp/target.c:gomp_unmap_vars_async wrong, or
something else?


Grüße
 Thomas
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH 0/6, OpenACC, libgomp] Async re-work

Julian Brown-2
On Thu, 6 Dec 2018 21:42:14 +0100
Thomas Schwinge <[hidden email]> wrote:

> [...]
> ..., where the "Invalid read of size 8" happens, and which eventually
> would try to "free (tgt)" again, via libgomp/target.c:gomp_unmap_tgt:
>
>     attribute_hidden void
>     gomp_unmap_tgt (struct target_mem_desc *tgt)
>     {
>       /* Deallocate on target the tgt->tgt_start .. tgt->tgt_end
> region.  */ if (tgt->tgt_end)
>         gomp_free_device_memory (tgt->device_descr, tgt->to_free);
>    
>       free (tgt->array);
>       free (tgt);
>     }
>
> Is the "free (tgt)" in libgomp/target.c:gomp_unmap_vars_async wrong,
> or something else?

It might be worth trying this with the refcounting changes in the
attach/detach patch.

Julian
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH 0/6, OpenACC, libgomp] Async re-work

Julian Brown-2
On Thu, 6 Dec 2018 22:22:46 +0000
Julian Brown <[hidden email]> wrote:

> On Thu, 6 Dec 2018 21:42:14 +0100
> Thomas Schwinge <[hidden email]> wrote:
>
> > [...]
> > ..., where the "Invalid read of size 8" happens, and which
> > eventually would try to "free (tgt)" again, via
> > libgomp/target.c:gomp_unmap_tgt:
> >
> >     attribute_hidden void
> >     gomp_unmap_tgt (struct target_mem_desc *tgt)
> >     {
> >       /* Deallocate on target the tgt->tgt_start .. tgt->tgt_end
> > region.  */ if (tgt->tgt_end)
> >         gomp_free_device_memory (tgt->device_descr, tgt->to_free);
> >    
> >       free (tgt->array);
> >       free (tgt);
> >     }
> >
> > Is the "free (tgt)" in libgomp/target.c:gomp_unmap_vars_async wrong,
> > or something else?  
>
> It might be worth trying this with the refcounting changes in the
> attach/detach patch.

...oh, also make sure you have this patch in the series you're testing
with:

https://gcc.gnu.org/ml/gcc-patches/2018-08/msg01973.html

else your "wait" will be ignored, IIUC.

Julian