[PATCH 6/6, OpenACC, libgomp] Async re-work, nvptx changes

classic Classic list List threaded Threaded
5 messages Options
Reply | Threaded
Open this post in threaded view
|

[PATCH 6/6, OpenACC, libgomp] Async re-work, nvptx changes

Chung-Lin Tang-5
Hi Tom,
this patch removes large portions of plugin/plugin-nvptx.c, since a lot of it is
now in oacc-async.c now. The new code is essentially a NVPTX/CUDA-specific implementation
of the new-style goacc_asyncqueues.

Also, some needed functions in cuda-lib.def are added. The cuda.h function has also
been updated to build independently without a CUDA installation.

Thanks,
Chung-Lin

        * plugin/plugin-nvptx.c (struct cuda_map): Remove.
        (struct ptx_stream): Remove.
        (struct nvptx_thread): Remove current_stream field.
        (cuda_map_create): Remove.
        (cuda_map_destroy): Remove.
        (map_init): Remove.
        (map_fini): Remove.
        (map_pop): Remove.
        (map_push): Remove.
        (struct goacc_asyncqueue): Define.
        (struct nvptx_callback): Define.
        (struct ptx_free_block): Define.
        (struct ptx_device): Remove null_stream, active_streams, async_streams,
        stream_lock, and next fields.
        (enum ptx_event_type): Remove.
        (struct ptx_event): Remove.
        (ptx_event_lock): Remove.
        (ptx_events): Remove.
        (init_streams_for_device): Remove.
        (fini_streams_for_device): Remove.
        (select_stream_for_async): Remove.
        (nvptx_init): Remove ptx_events and ptx_event_lock references.
        (nvptx_attach_host_thread_to_device): Remove CUDA_ERROR_NOT_PERMITTED
        case.
        (nvptx_open_device): Add free_blocks initialization, remove
        init_streams_for_device call.
        (nvptx_close_device): Remove fini_streams_for_device call, add
        free_blocks destruct code.
        (event_gc): Remove.
        (event_add): Remove.
        (nvptx_exec): Adjust parameters and code.
        (nvptx_free): Likewise.
        (nvptx_host2dev): Remove.
        (nvptx_dev2host): Remove.
        (nvptx_set_async): Remove.
        (nvptx_async_test): Remove.
        (nvptx_async_test_all): Remove.
        (nvptx_wait): Remove.
        (nvptx_wait_async): Remove.
        (nvptx_wait_all): Remove.
        (nvptx_wait_all_async): Remove.
        (nvptx_get_cuda_stream): Remove.
        (nvptx_set_cuda_stream): Remove.
        (GOMP_OFFLOAD_alloc): Adjust code.
        (GOMP_OFFLOAD_free): Likewise.
        (GOMP_OFFLOAD_openacc_register_async_cleanup): Remove.
        (GOMP_OFFLOAD_openacc_exec): Adjust parameters and code.
        (GOMP_OFFLOAD_openacc_async_test_all): Remove.
        (GOMP_OFFLOAD_openacc_async_wait): Remove.
        (GOMP_OFFLOAD_openacc_async_wait_async): Remove.
        (GOMP_OFFLOAD_openacc_async_wait_all): Remove.
        (GOMP_OFFLOAD_openacc_async_wait_all_async): Remove.
        (GOMP_OFFLOAD_openacc_async_set_async): Remove.
        (cuda_free_argmem): New function.
        (GOMP_OFFLOAD_openacc_async_exec): New plugin hook function.
        (GOMP_OFFLOAD_openacc_create_thread_data): Adjust code.
        (GOMP_OFFLOAD_openacc_cuda_get_stream): Adjust code.
        (GOMP_OFFLOAD_openacc_cuda_set_stream): Adjust code.
        (GOMP_OFFLOAD_openacc_async_construct): New plugin hook function.
        (GOMP_OFFLOAD_openacc_async_destruct): New plugin hook function.
        (GOMP_OFFLOAD_openacc_async_test): Remove and re-implement.
        (GOMP_OFFLOAD_openacc_async_synchronize): New plugin hook function.
        (GOMP_OFFLOAD_openacc_async_serialize): New plugin hook function.
        (GOMP_OFFLOAD_openacc_async_queue_callback): New plugin hook function.
        (cuda_callback_wrapper): New function.
        (cuda_memcpy_sanity_check): New function.
        (GOMP_OFFLOAD_host2dev): Remove and re-implement.
        (GOMP_OFFLOAD_dev2host): Remove and re-implement.
        (GOMP_OFFLOAD_openacc_async_host2dev): New plugin hook function.
        (GOMP_OFFLOAD_openacc_async_dev2host): New plugin hook function.

async-06.nvptx.patch (57K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH 6/6, OpenACC, libgomp] Async re-work, nvptx changes

Tom de Vries-6
On 9/25/18 3:11 PM, Chung-Lin Tang wrote:
> Hi Tom,
> this patch removes large portions of plugin/plugin-nvptx.c, since a lot
> of it is
> now in oacc-async.c now.

Yay!

> The new code is essentially a
> NVPTX/CUDA-specific implementation
> of the new-style goacc_asyncqueues.
>
> Also, some needed functions in cuda-lib.def are added. The cuda.h
> function has also
> been updated to build independently without a CUDA installation.
>

I see these formatting issues:
...
$ check_GNU_style.sh async-06.nvptx.patch

There should be exactly one space between function name and parenthesis.
35:+CUresult cuStreamAddCallback(CUstream, CUstreamCallback, void *,
unsigned int);

Trailing operator.
1320:+      struct nvptx_thread *nvthd =
...

Otherwise, OK.

Thanks,
- Tom


> Thanks,
> Chung-Lin
>
>     * plugin/plugin-nvptx.c (struct cuda_map): Remove.
>     (struct ptx_stream): Remove.
>     (struct nvptx_thread): Remove current_stream field.
>     (cuda_map_create): Remove.
>     (cuda_map_destroy): Remove.
>     (map_init): Remove.
>     (map_fini): Remove.
>     (map_pop): Remove.
>     (map_push): Remove.
>     (struct goacc_asyncqueue): Define.
>     (struct nvptx_callback): Define.
>     (struct ptx_free_block): Define.
>     (struct ptx_device): Remove null_stream, active_streams, async_streams,
>     stream_lock, and next fields.
>     (enum ptx_event_type): Remove.
>     (struct ptx_event): Remove.
>     (ptx_event_lock): Remove.
>     (ptx_events): Remove.
>     (init_streams_for_device): Remove.
>     (fini_streams_for_device): Remove.
>     (select_stream_for_async): Remove.
>     (nvptx_init): Remove ptx_events and ptx_event_lock references.
>     (nvptx_attach_host_thread_to_device): Remove CUDA_ERROR_NOT_PERMITTED
>     case.
>     (nvptx_open_device): Add free_blocks initialization, remove
>     init_streams_for_device call.
>     (nvptx_close_device): Remove fini_streams_for_device call, add
>     free_blocks destruct code.
>     (event_gc): Remove.
>     (event_add): Remove.
>     (nvptx_exec): Adjust parameters and code.
>     (nvptx_free): Likewise.
>     (nvptx_host2dev): Remove.
>     (nvptx_dev2host): Remove.
>     (nvptx_set_async): Remove.
>     (nvptx_async_test): Remove.
>     (nvptx_async_test_all): Remove.
>     (nvptx_wait): Remove.
>     (nvptx_wait_async): Remove.
>     (nvptx_wait_all): Remove.
>     (nvptx_wait_all_async): Remove.
>     (nvptx_get_cuda_stream): Remove.
>     (nvptx_set_cuda_stream): Remove.
>     (GOMP_OFFLOAD_alloc): Adjust code.
>     (GOMP_OFFLOAD_free): Likewise.
>     (GOMP_OFFLOAD_openacc_register_async_cleanup): Remove.
>     (GOMP_OFFLOAD_openacc_exec): Adjust parameters and code.
>     (GOMP_OFFLOAD_openacc_async_test_all): Remove.
>     (GOMP_OFFLOAD_openacc_async_wait): Remove.
>     (GOMP_OFFLOAD_openacc_async_wait_async): Remove.
>     (GOMP_OFFLOAD_openacc_async_wait_all): Remove.
>     (GOMP_OFFLOAD_openacc_async_wait_all_async): Remove.
>     (GOMP_OFFLOAD_openacc_async_set_async): Remove.
>     (cuda_free_argmem): New function.
>     (GOMP_OFFLOAD_openacc_async_exec): New plugin hook function.
>     (GOMP_OFFLOAD_openacc_create_thread_data): Adjust code.
>     (GOMP_OFFLOAD_openacc_cuda_get_stream): Adjust code.
>     (GOMP_OFFLOAD_openacc_cuda_set_stream): Adjust code.
>     (GOMP_OFFLOAD_openacc_async_construct): New plugin hook function.
>     (GOMP_OFFLOAD_openacc_async_destruct): New plugin hook function.
>     (GOMP_OFFLOAD_openacc_async_test): Remove and re-implement.
>     (GOMP_OFFLOAD_openacc_async_synchronize): New plugin hook function.
>     (GOMP_OFFLOAD_openacc_async_serialize): New plugin hook function.
>     (GOMP_OFFLOAD_openacc_async_queue_callback): New plugin hook function.
>     (cuda_callback_wrapper): New function.
>     (cuda_memcpy_sanity_check): New function.
>     (GOMP_OFFLOAD_host2dev): Remove and re-implement.
>     (GOMP_OFFLOAD_dev2host): Remove and re-implement.
>     (GOMP_OFFLOAD_openacc_async_host2dev): New plugin hook function.
>     (GOMP_OFFLOAD_openacc_async_dev2host): New plugin hook function.
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH 6/6, OpenACC, libgomp] Async re-work, nvptx changes

Thomas Schwinge-8
In reply to this post by Chung-Lin Tang-5
Hi Chung-Lin!

On Tue, 25 Sep 2018 21:11:58 +0800, Chung-Lin Tang <[hidden email]> wrote:
> Hi Tom,
> this patch removes large portions of plugin/plugin-nvptx.c, since a lot of it is
> now in oacc-async.c now. The new code is essentially a NVPTX/CUDA-specific implementation
> of the new-style goacc_asyncqueues.

> --- a/libgomp/plugin/plugin-nvptx.c
> +++ b/libgomp/plugin/plugin-nvptx.c

> +struct goacc_asyncqueue *
> +GOMP_OFFLOAD_openacc_async_construct (void)
> +{
> +  struct goacc_asyncqueue *aq
> +    = GOMP_PLUGIN_malloc (sizeof (struct goacc_asyncqueue));
> +  aq->cuda_stream = NULL;
> +  CUDA_CALL_ASSERT (cuStreamCreate, &aq->cuda_stream, CU_STREAM_DEFAULT);

Curiously (this was the same in the code before): does this have to be
"CU_STREAM_DEFAULT" instead of "CU_STREAM_NON_BLOCKING", because we want
to block anything from running in parallel with "acc_async_sync" GPU
kernels, that use the "NULL" stream?  (Not asking you to change this now,
but I wonder if this is overly strict?)

> +  if (aq->cuda_stream == NULL)
> +    GOMP_PLUGIN_fatal ("CUDA stream create NULL\n");

Can this actually happen, given the "CUDA_CALL_ASSERT" usage above?

> +  CUDA_CALL_ASSERT (cuStreamSynchronize, aq->cuda_stream);

Why is the synchronization needed here?

> +  return aq;
> +}


Grüße
 Thomas
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH 6/6, OpenACC, libgomp] Async re-work, nvptx changes

Chung-Lin Tang-5
On 2018/12/7 04:57 AM, Thomas Schwinge wrote>> ---
a/libgomp/plugin/plugin-nvptx.c

>> +++ b/libgomp/plugin/plugin-nvptx.c
>
>> +struct goacc_asyncqueue *
>> +GOMP_OFFLOAD_openacc_async_construct (void)
>> +{
>> +  struct goacc_asyncqueue *aq
>> +    = GOMP_PLUGIN_malloc (sizeof (struct goacc_asyncqueue));
>> +  aq->cuda_stream = NULL;
>> +  CUDA_CALL_ASSERT (cuStreamCreate, &aq->cuda_stream, CU_STREAM_DEFAULT);
>
> Curiously (this was the same in the code before): does this have to be
> "CU_STREAM_DEFAULT" instead of "CU_STREAM_NON_BLOCKING", because we want
> to block anything from running in parallel with "acc_async_sync" GPU
> kernels, that use the "NULL" stream?  (Not asking you to change this now,
> but I wonder if this is overly strict?)

IIUC, this non-blocking only pertains to the "Legacy Default Stream" in
CUDA, which we're pretty much ignoring; we should be using the newer
per-thread default stream model. We could review this issue later.

>> +  if (aq->cuda_stream == NULL)
>> +    GOMP_PLUGIN_fatal ("CUDA stream create NULL\n");
>
> Can this actually happen, given the "CUDA_CALL_ASSERT" usage above?

Hmm, yeah I think this is superfluous too...

>> +  CUDA_CALL_ASSERT (cuStreamSynchronize, aq->cuda_stream);
>
> Why is the synchronization needed here?

I don't remember, could likely be something added during debug.
I'll remove this and test if things are okay.

Thanks,
Chung-Lin
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH 6/6, OpenACC, libgomp] Async re-work, nvptx changes (revised, v2)

Chung-Lin Tang-5
On 2018/12/10 6:02 PM, Chung-Lin Tang wrote:

> On 2018/12/7 04:57 AM, Thomas Schwinge wrote>> --- a/libgomp/plugin/plugin-nvptx.c
>>> +++ b/libgomp/plugin/plugin-nvptx.c
>>
>>> +struct goacc_asyncqueue *
>>> +GOMP_OFFLOAD_openacc_async_construct (void)
>>> +{
>>> +  struct goacc_asyncqueue *aq
>>> +    = GOMP_PLUGIN_malloc (sizeof (struct goacc_asyncqueue));
>>> +  aq->cuda_stream = NULL;
>>> +  CUDA_CALL_ASSERT (cuStreamCreate, &aq->cuda_stream, CU_STREAM_DEFAULT);
>>
>> Curiously (this was the same in the code before): does this have to be
>> "CU_STREAM_DEFAULT" instead of "CU_STREAM_NON_BLOCKING", because we want
>> to block anything from running in parallel with "acc_async_sync" GPU
>> kernels, that use the "NULL" stream?  (Not asking you to change this now,
>> but I wonder if this is overly strict?)
>
> IIUC, this non-blocking only pertains to the "Legacy Default Stream" in CUDA, which we're pretty much ignoring; we should be using the newer per-thread default stream model. We could review this issue later.
>
>>> +  if (aq->cuda_stream == NULL)
>>> +    GOMP_PLUGIN_fatal ("CUDA stream create NULL\n");
>>
>> Can this actually happen, given the "CUDA_CALL_ASSERT" usage above?
>
> Hmm, yeah I think this is superfluous too...
>
>>> +  CUDA_CALL_ASSERT (cuStreamSynchronize, aq->cuda_stream);
>>
>> Why is the synchronization needed here?
>
> I don't remember, could likely be something added during debug.
> I'll remove this and test if things are okay.
I have removed the above seemingly unneeded lines and re-tested, appears okay.
Also the formerly attached version seemed to for some reason had many conflicts
with older code, all resolved in this v2 nvptx part.

Thanks,
Chung-Lin

async-06.nvptx.v2.patch (57K) Download Attachment