[GSoC'19] First Evaluations: Implementing OpenMP Work Stealing Scheduling

김규래
Hi everyone,
I'll share my status for the GSoC first evaluation.
 
Current status of libgomp task system:
I'll first summarize my understanding of libgomp.
Please correct me if I'm wrong.
Currently libgomp has 3 different queues: children_queue, taskloop_queue and team_queue.
These three queues are protected using a big lock (team->task_lock).
The current 3-queue setup makes work-stealing hard to implement, because the queues must be kept synchronized with one another.
 
Implementation of competing systems:
The Intel OpenMP implementation [1] is simpler.
It uses a single queue for each thread and a single subroutine for dequeuing and executing the tasks [2, 3].
The taskgroup tasks and children tasks are only counted (not queued) [4, 5, 6].
So the taskwait or barrier-like constructs only have to check whether all the tasks of interest have completed.
This unifies the task queuing system and makes scheduling much simpler.

What to do on libgomp:
I think we should follow a similar path to libomp.
Instead of using 3 different queues, we could simply use one and only count the tasks of interest.
This should also reduce the synchronization overhead between the queues (such as in gomp_task_run_post_remove_taskgroup).
Then the big task_lock could be split into one lock per thread.
I'm currently implementing this, but I don't have testable results yet.
I'll share results as soon as they become available.

Any feedback would be much appreciated.

Ray Kim


[1] https://github.com/llvm-mirror/openmp
[2] https://github.com/llvm-mirror/openmp/blob/3f381f546ec9b065f9133d1fcd5d2711affb646a/runtime/src/kmp_tasking.cpp#L2889
[3] Each thread's dedicated queue. https://github.com/llvm-mirror/openmp/blob/bbb6f0170731679d690cee002be712f2d703b8fe/runtime/src/kmp.h#L2384
[4] The task-executing part of taskwait. https://github.com/llvm-mirror/openmp/blob/bbb6f0170731679d690cee002be712f2d703b8fe/runtime/src/kmp.h#L2187
[5] Atomic variable holding number of uncompleted children. https://github.com/llvm-mirror/openmp/blob/bbb6f0170731679d690cee002be712f2d703b8fe/runtime/src/kmp.h#L2352
[6] Atomic variable holding number of uncompleted tasks in taskgroup. https://github.com/llvm-mirror/openmp/blob/bbb6f0170731679d690cee002be712f2d703b8fe/runtime/src/kmp.h#L2188
 
Re: [GSoC'19] First Evaluations: Implementing OpenMP Work Stealing Scheduling

Jakub Jelinek
On Thu, Jun 27, 2019 at 05:20:56AM +0900, 김규래 wrote:
> I'll share my status for GSoC first evaluation.
>  
> Current status of libgomp task system:
> I'll first summarize my understanding of libgomp.
> Please correct me if I'm wrong.
> Currently libgomp has 3 different queues: children_queue, taskloop_queue and team_queue.
> These three queues are protected using a big lock (team->task_lock).

Indeed.  That lock protects those queues, and the task dependencies
(task->{depend_hash,depend_count,dependers}), *task->taskwait, *taskgroup
etc.

> The current 3-queue setup makes work-stealing hard to implement, because the queues must be kept synchronized with one another.
>  
> Implementation of competing systems:
> The Intel OpenMP implementation [1] is simpler.
> It uses a single queue for each thread and a single subroutine for dequeuing and executing the tasks [2, 3].
> The taskgroup tasks and children tasks are only counted (not queued) [4, 5, 6].
> So the taskwait or barrier-like constructs only have to check whether all the tasks of interest have completed.
> This unifies the task queuing system and makes scheduling much simpler.

I see the per-thread locks td_deque_lock in there, does it have anything
like the per team lock (or per taskgroup lock) for some cases or not?
What kind of locking does it use for task dependencies?  E.g. a task could
depend on tasks running in multiple threads and some further tasks still
queued in various threads' queues (plus some not yet satisfied tasks), for
the dependencies certainly in libgomp data structures all the bookkeeping
can't be done just with atomic accesses, so some mutex is needed.

> What to do on libgomp:
> I think we should follow a similar path to libomp.
> Instead of using 3 different queues, we could simply use one and only count the tasks of interest.

We already do have such counts at least in some cases, e.g.
taskgroup->num_children and sure, new ones can be added.
The replacement of the 3 queues with either a single one or
one per implicit task is the easy part.

> This should also reduce the synchronization overhead between the queues (such as in gomp_task_run_post_remove_taskgroup).
> Then the big task_lock lock could be split into one lock for each thread.

This is the hard part, we need to figure out the right locking scheme for
all the data accesses that need protection or use atomic accesses otherwise
where possible, or say __atomic_load outside of mutex protected area and
if it indicates there might be something of interest, take a lock and retry
under lock.
Where will a new task be placed when created?  Pick a random thread, take
that thread's lock and put it into that queue?  What about threads that are
currently sleeping and need to be woken up using gomp_sem_post, do we somehow
prefer them for the new task placement?  Do we consider priorities just
within each thread's queue and not globally?  I'm afraid that if we want to
avoid the team->task_lock altogether, or as much as possible, we have to;
and in the standard, priority is just a best effort, so it might be fine.
For the work-stealing, if the current thread doesn't have anything in its
own queue, will it pick a random other thread (or say remember what it tried
last or similar)?  And as said earlier, if there is no work to do by the
current thread, but it needs to wait for some other tasks to complete,
it shouldn't busy-wait forever, but needs to sleep and needs to be awakened if
there is further work for it or if the condition it is waiting on is finally
met (gomp_team_barrier_wake, taskwait->taskwait_sem,
taskgroup->taskgroup_sem).
Do we need any other locks to protect some state (say, do we need a
per-taskgroup lock, or can we get away with only atomic accesses)?

        Jakub
Re: [GSoC'19] First Evaluations: Implementing OpenMP Work Stealing Scheduling

김규래
> > Implementation of competing systems:
> > The Intel OpenMP implementation [1] is simpler.
> > It uses a single queue for each thread and a single subroutine for dequeuing and executing the tasks [2, 3].
> > The taskgroup tasks and children tasks are only counted (not queued) [4, 5, 6].
> > So the taskwait or barrier-like constructs only have to check whether all the tasks of interest have completed.
> > This unifies the task queuing system and makes scheduling much simpler.

> I see the per-thread locks td_deque_lock in there, does it have anything
> like the per team lock (or per taskgroup lock) for some cases or not?
> What kind of locking does it use for task dependencies?  E.g. a task could
> depend on tasks running in multiple threads and some further tasks still
> queued in various threads' queues (plus some not yet satisfied tasks), for
> the dependencies certainly in libgomp data structures all the bookkeeping
> can't be done just with atomic accesses, so some mutex is needed.
 
It doesn't seem like there are such locks.
Everything done outside of the queue accesses consists of atomic count reads.
The dependencies, however, are handled in an interesting way:
a mutex is created for each dependency, and a thread tries to acquire all of a task's dependency mutexes before selecting that task [1, 2, 3].
 

> Where will a new task be placed when created?  Pick a random thread, take
> that thread's lock and put it into that queue?  What about threads that are
> currently sleeping and need to be woken up using gomp_sem_post, do we somehow
> prefer them for the new task placement?  Do we consider priorities just
> within each thread's queue and not globally?  I'm afraid that if we want to
> avoid the team->task_lock altogether, or as much as possible, we have to;
> and in the standard, priority is just a best effort, so it might be fine.
> For the work-stealing, if the current thread doesn't have anything in its
> own queue, will it pick a random other thread (or say remember what it tried
> last or similar)?  And as said earlier, if there is no work to do by the
> current thread, but it needs to wait for some other tasks to complete,
> it shouldn't busy-wait forever, but needs to sleep and needs to be awakened if
> there is further work for it or if the condition it is waiting on is finally
> met (gomp_team_barrier_wake, taskwait->taskwait_sem,
> taskgroup->taskgroup_sem).
 
I think this is more of a policy issue.
libomp currently queues children tasks on the current thread's queue.
When a thread has no work to do (or the current task is waiting on dependencies),
it chooses a victim uniformly at random, tests the dependencies, and steals a task.
We could experiment with different configurations once we have a running prototype.
There's also some recent literature in this area (locality-aware victim selection and the like).
As for priorities, libomp doesn't actually seem to care about priority at all.
Since the OpenMP standard doesn't require strict priorities, I think we can relax the priority constraints to a per-thread basis.
 
> Another option, which I guess starts to go out of scope of your gsoc, is
> parallel depth first (PDF) search (Blelloch 1999) as an alternative to work
> stealing. Here's a presentation about some recent work in this area,
> although for Julia and not OpenMP (no idea if PDF would fit with OpenMP at
> all): https://www.youtube.com/watch?v=YdiZa0Y3F3c 
 
I think PDF is also an interesting idea.
It seems the Julia developers implemented PDF by assigning higher priorities to children tasks.
If that is the case, it should be trivial to implement, since we already use priority queues (and libomp doesn't).
 
> > What to do on libgomp:
> > I think we should follow a similar path to libomp.
> > Instead of using 3 different queues, we could simply use one and only count the tasks of interest.

> We already do have such counts at least in some cases, e.g.
> taskgroup->num_children and sure, new ones can be added.
> The replacement of the 3 queues with either a single one or
> a per implicit task is the easy part.
 
I'll first try to get libgomp working without the child and taskgroup queues.
 
Ray Kim
 
[1] Check done before stealing/executing a task. https://github.com/llvm-mirror/openmp/blob/a31ae2e44a8f91ca20aa03319603829d994a1635/runtime/src/kmp_tasking.cpp#L2721
[2] Acquisition of all the dependency mutexes. https://github.com/llvm-mirror/openmp/blob/a31ae2e44a8f91ca20aa03319603829d994a1635/runtime/src/kmp_tasking.cpp#L284
[3] Construction of mutexes for each dependency. https://github.com/llvm-mirror/openmp/blob/7e084d9fdffa07d74a4d4769c3681fbfbe86e6f3/runtime/src/kmp_taskdeps.cpp#L237