Parallelize GCC with Threads -- First Evaluation


Parallelize GCC with Threads -- First Evaluation

Giuliano Belinassi-2
Hi everyone,

I am attaching the first evaluation report here publicly for gathering
feedback. The file is in markdown format and it can easily be converted to
PDF for better visualization.

I am also open to suggestions and ideas in order to improve the current project :-)

My branch can be seen here: https://gitlab.com/flusp/gcc/tree/giulianob_parallel

Giuliano

Attachment: gsoc_parallel_deliver_1.md (12K)

Re: Parallelize GCC with Threads -- First Evaluation

nick-2


On 2019-06-24 8:59 a.m., Giuliano Belinassi wrote:

> Hi,
>
> Parallelize GCC with Threads -- First Evaluation
>
> Hi everyone,
>
> I am attaching the first evaluation report here publicly for gathering
> feedback. The file is in markdown format and it can be easily be converted to
> PDF for better visualization.
>
> I am also open to suggestions and ideas in order to improve the current project :-)
>
> My branch can be seen here: https://gitlab.com/flusp/gcc/tree/giulianob_parallel
>
> Giuliano
>

Giuliano,

Three things. First, your original proposal was just for expand_all_functions,
so I don't know if it has been extended now, but there are other parts in my
research, so the title was a little confusing.

I'm assuming this is outside the scope of the current project, but does your
all_rtl_passes function help out with architecture-specific tuning flags? My
research seems to indicate that's one area of shared state.

In addition, for memory (this may be really hard to do), can you have a
signaler that tells each phase what data to pass on, thereby notifying what
needs to be passed on to the next pass? So if expand_all_functions needs to
pass x, the signaler will notify the pass and just swap the values into the
pass, lock-less if possible, or just kill it off if not. This would mean
writing a GENERIC-to-RTL final-passes signaler, which may take too long
considering the scope of the project.

Again, that's just off the top of my head, so it may be a really bad idea.

Nick

P.S. Good luck, though.

Re: Parallelize GCC with Threads -- First Evaluation

Richard Biener
In reply to this post by Giuliano Belinassi-2
On Mon, 24 Jun 2019, Giuliano Belinassi wrote:

> Hi,
>
> Parallelize GCC with Threads -- First Evaluation
>
> Hi everyone,
>
> I am attaching the first evaluation report here publicly for gathering
> feedback. The file is in markdown format and it can be easily be converted to
> PDF for better visualization.
>
> I am also open to suggestions and ideas in order to improve the current project :-)
>
> My branch can be seen here: https://gitlab.com/flusp/gcc/tree/giulianob_parallel

Thanks for your work on this!

I didn't think of the default_obstack and input_location issues,
so it's good we get to know these.  The bitmap default obstack
is merely a convenience, and fortunately the (content!) lifetime is
constrained to be per-pass, though that's not fully enforced.  Your
solution is fine, I think, and besides, when parallelizing the IPA phase
it should not be worse than making the default obstack per-thread
context, given how many times we zap it.

input_location shouldn't really be used...  but oh well.

As of per-pass global state I expect that eventually all
global variables in the individual passes are a problem
and in nearly all cases they are global out of laziness
to pass down state across functions.

You identified some global state in infrastructure code
which are the more interesting cases, most relevant for
GIMPLE are the ones in tree-ssa-operands.c, tree-cfg.c
and et-forest.c I guess.

For individual passes a first step would be to wrap
all globals into a struct we can allocate at ::execute
time and pass down as pointer.  That's slightly less
intrusive than wrapping all of the pass in a class
but functionally equivalent.
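The wrapping Richard describes could look roughly like the sketch below; all
names here (`pass_foo_state`, `transform_stmt`, `execute_pass_foo`) are made
up for illustration and are not real GCC code:

```cpp
/* Hypothetical sketch: instead of file-scope globals in a pass, collect
   them in a state struct allocated at ::execute time and passed down by
   pointer, so two threads running the pass don't share state.  */

struct pass_foo_state
{
  int n_transformed = 0;   /* was a file-scope global in the pass  */
  bool changed = false;    /* likewise  */
};

/* A helper that previously touched the globals now takes the state.  */
static void
transform_stmt (pass_foo_state *st)
{
  st->n_transformed++;
  st->changed = true;
}

/* The ::execute entry point owns fresh per-invocation state.  */
static unsigned int
execute_pass_foo ()
{
  pass_foo_state st;
  transform_stmt (&st);
  transform_stmt (&st);
  return st.changed ? st.n_transformed : 0;
}
```

Functionally this is the same as wrapping the pass in a class with the globals
as members; only the plumbing differs.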

<<<
1. The GCC `object_pool_allocator`

    There is also the GCC object_pool_allocator, which is used to allocate
    some objects. Since these objects may be used later in the compilation
    by other threads, we can't simply make them private to each thread.
    Therefore I added a threadsafe_object_pool_allocator object that
    currently uses locks to guarantee safety; however, I am not able to
    check its correctness. This is also not efficient and might require a
    better approach later.
>>>

I guess the same applies to the GC allocator - to make these
more efficient we'd have a per-thread freelist we can allocate
from without locking and which we'd, once empty, fill from the
main pool in larger chunks with locking.  At thread finalization
we have to return the freelist to the main allocator then
and for the GC allocator possibly at garbage collection time.
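A minimal sketch of that scheme, with illustrative names rather than the
actual GCC allocator interfaces: each thread pops from a thread-local freelist
without locking, refills it in one locked batch when empty, and hands the
leftovers back at finalization.

```cpp
#include <mutex>
#include <vector>

/* A unit of allocation; real GCC pools allocate typed objects.  */
struct chunk { chunk *next; };

static std::mutex pool_lock;
static std::vector<chunk *> main_pool;          /* shared, lock-protected  */

static thread_local chunk *freelist = nullptr;  /* per-thread, lock-free  */

/* Move up to N chunks from the shared pool to the thread-local freelist,
   taking the lock once per refill instead of once per allocation.  */
static void
refill (unsigned n)
{
  std::lock_guard<std::mutex> g (pool_lock);
  while (n-- != 0 && !main_pool.empty ())
    {
      chunk *c = main_pool.back ();
      main_pool.pop_back ();
      c->next = freelist;
      freelist = c;
    }
}

static chunk *
thread_alloc ()
{
  if (!freelist)
    refill (16);                /* amortize the locking cost  */
  if (!freelist)
    return new chunk ();        /* pool exhausted: plain allocation  */
  chunk *c = freelist;
  freelist = c->next;
  return c;
}

/* At thread finalization (or GC time) return unused chunks so that
   other threads can reuse them.  */
static void
thread_finalize ()
{
  std::lock_guard<std::mutex> g (pool_lock);
  while (freelist)
    {
      chunk *c = freelist;
      freelist = c->next;
      main_pool.push_back (c);
    }
}
```

The fast path never touches the lock, which is the whole point: contention is
paid once per batch of 16 allocations instead of on every one.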

This scheme may also work for the bitmap default obstack
and its freelist.  I would also suggest when you run into
a specific issue with the default obstack to use a separate
obstack in the respective area.  You mention issues with LTO,
are they reproducible on the branch?  I suppose you are
currently testing GCC with num_threads = 1 to make sure you
are not introducing non-threading related issues?

Thanks again,
Richard.

Re: Parallelize GCC with Threads -- First Evaluation

Giuliano Belinassi-2
Hi, Richard

On 06/25, Richard Biener wrote:

> On Mon, 24 Jun 2019, Giuliano Belinassi wrote:
>
> > Hi,
> >
> > Parallelize GCC with Threads -- First Evaluation
> >
> > Hi everyone,
> >
> > I am attaching the first evaluation report here publicly for gathering
> > feedback. The file is in markdown format and it can be easily be converted to
> > PDF for better visualization.
> >
> > I am also open to suggestions and ideas in order to improve the current project :-)
> >
> > My branch can be seen here: https://gitlab.com/flusp/gcc/tree/giulianob_parallel
>
> Thanks for your work on this!
>
> I didn't think of the default_obstack and input_location issues
> so it's good we get to know these.  The bitmap default obstack
> is merely a convenience and fortunately (content!) lifetime is constrained
> to be per-pass though that's not fully enforced.  Your solution is fine
> I think and besides when parallelizing the IPA phase should be not
> worse than making the default-obstack per thread context given how
> many times we zap it.

This is something that is bothering me. In the current state, when I
declare the default_obstack as TLS, several LTO tests fail in
ssa_verify and I am not sure why yet. I will investigate this further.

As for the IPA phase, do you mean expand_ipa or IPA itself? If
expand_ipa falls under the interprocedural-optimization scheme, the current
strategy will work, of course.

As for IPA itself, I am still not sure whether it is parallelizable;
however, I've seen some academic work on parallelizing dataflow
analysis, which may give interesting ideas about how to handle the call
graphs. This may also be interesting because the LTO WHOPR is
entirely sequential, AFAIK.

>
> input_location shouldn't really be used...  but oh well.
>
> As of per-pass global state I expect that eventually all
> global variables in the individual passes are a problem
> and in nearly all cases they are global out of laziness
> to pass down state across functions.
>
> You identified some global state in infrastructure code
> which are the more interesting cases, most relevant for
> GIMPLE are the ones in tree-ssa-operands.c, tree-cfg.c
> and et-forest.c I guess.
>
> For individual passes a first step would be to wrap
> all globals into a struct we can allocate at ::execute
> time and pass down as pointer.  That's slightly less
> intrusive than wrapping all of the pass in a class
> but functionally equivalent.

Sounds really good, but this will also require a lot of code change.
Hopefully most of it can be solved without any unpleasant surprises.

>
> <<<
> 1. The GCC `object_pool_allocator`
>
>     There is also the GCC object_pool_allocator, which is used to allocate
> some
>     objects. Since these objects may be used later in the compilation by
> other
>     threads, we can't simply make them private to each thread. Therefore I
> added a
>     threadsafe_object_pool_allocator object that currently uses locks
> guarantee
>     safety, however I am not able to check its correctness. This is also
>     not efficient and might require a better approach later.
> >>>
>
> I guess the same applies to the GC allocator - to make these
> more efficient we'd have a per-thread freelist we can allocate
> from without locking and which we'd, once empty, fill from the
> main pool in larger chunks with locking.  At thread finalization
> we have to return the freelist to the main allocator then
> and for the GC allocator possibly at garbage collection time.

This also sounds good. If I understand correctly, the sequential
allocation cost will be amortized, as we could ask for a chunk that is
2x bigger than what each thread currently has. (I am assuming allocation
is O(1).)

>
> This scheme may also work for the bitmap default obstack
> and its freelist.  I would also suggest when you run into
> a specific issue with the default obstack to use a separate
> obstack in the respective area.  You mention issues with LTO,
> are the reproducible on the branch?  I suppose you are
> currently testing GCC with num_threads = 1 to make sure you
> are not introducing non-threading related issues?

Yes. The issue with LTO is that obstack, as I mentioned earlier. It is
reproducible on my branch: just compile it without any changes, then
declare the obstack TLS, recompile, and run the gcc.dg tests.

>
> Thanks again,
> Richard.

Giuliano.

Re: Parallelize GCC with Threads -- First Evaluation

Richard Biener
On Tue, 25 Jun 2019, Giuliano Belinassi wrote:

> Hi, Richard
>
> On 06/25, Richard Biener wrote:
> > On Mon, 24 Jun 2019, Giuliano Belinassi wrote:
> >
> > > Hi,
> > >
> > > Parallelize GCC with Threads -- First Evaluation
> > >
> > > Hi everyone,
> > >
> > > I am attaching the first evaluation report here publicly for gathering
> > > feedback. The file is in markdown format and it can be easily be converted to
> > > PDF for better visualization.
> > >
> > > I am also open to suggestions and ideas in order to improve the current project :-)
> > >
> > > My branch can be seen here: https://gitlab.com/flusp/gcc/tree/giulianob_parallel
> >
> > Thanks for your work on this!
> >
> > I didn't think of the default_obstack and input_location issues
> > so it's good we get to know these.  The bitmap default obstack
> > is merely a convenience and fortunately (content!) lifetime is constrained
> > to be per-pass though that's not fully enforced.  Your solution is fine
> > I think and besides when parallelizing the IPA phase should be not
> > worse than making the default-obstack per thread context given how
> > many times we zap it.
>
> This is something that is bothering me. In this current state, when I
> declare the default_obstack as TLS, several LTO tests fails in
> ssa_verify  and I am not sure why yet. I will investigate futher in
> this.

If you tell me how to reproduce on your branch I can have a look as well.

> As for IPA phase, you mean the expand_ipa or the IPA itself? If
> expand_ipa falls in Inter Process otimization scheme the current
> strategy will work, of course.

I meant IPA itself.

> As for IPA itself, I am still not sure if it is parallelizable yet,
> however, I've seen some academic works parallelizing Dataflow
> Analysis, which can give interesting ideas about how to handle the call
> graphs (?). But this may be something interesting as the LTO WHOPR is
> entirely sequential AFIK.
>
> >
> > input_location shouldn't really be used...  but oh well.
> >
> > As of per-pass global state I expect that eventually all
> > global variables in the individual passes are a problem
> > and in nearly all cases they are global out of laziness
> > to pass down state across functions.
> >
> > You identified some global state in infrastructure code
> > which are the more interesting cases, most relevant for
> > GIMPLE are the ones in tree-ssa-operands.c, tree-cfg.c
> > and et-forest.c I guess.
> >
> > For individual passes a first step would be to wrap
> > all globals into a struct we can allocate at ::execute
> > time and pass down as pointer.  That's slightly less
> > intrusive than wrapping all of the pass in a class
> > but functionally equivalent.
>
> Sounds really good, but this will also require a lot of code change.

Yeah, it's really a monkey's job ;)  That's one reason I originally proposed
the pipeline approach: to avoid the need to fix all of these.

> Hopefully most of it can be solved without any unpleasant surprises.

I would guess so, it's also a generally desirable cleanup.

> >
> > <<<
> > 1. The GCC `object_pool_allocator`
> >
> >     There is also the GCC object_pool_allocator, which is used to allocate
> > some
> >     objects. Since these objects may be used later in the compilation by
> > other
> >     threads, we can't simply make them private to each thread. Therefore I
> > added a
> >     threadsafe_object_pool_allocator object that currently uses locks
> > guarantee
> >     safety, however I am not able to check its correctness. This is also
> >     not efficient and might require a better approach later.
> > >>>
> >
> > I guess the same applies to the GC allocator - to make these
> > more efficient we'd have a per-thread freelist we can allocate
> > from without locking and which we'd, once empty, fill from the
> > main pool in larger chunks with locking.  At thread finalization
> > we have to return the freelist to the main allocator then
> > and for the GC allocator possibly at garbage collection time.
>
> This also sounds good. If I understeand correctly, the sequential
> allocation state will be amortized, as we could ask for an object which
> is 2x bigger than each thread currently has. (I am assuming allocation
> is O(1))

Yeah, it's mostly trading (freelist) memory for speed.

> >
> > This scheme may also work for the bitmap default obstack
> > and its freelist.  I would also suggest when you run into
> > a specific issue with the default obstack to use a separate
> > obstack in the respective area.  You mention issues with LTO,
> > are the reproducible on the branch?  I suppose you are
> > currently testing GCC with num_threads = 1 to make sure you
> > are not introducing non-threading related issues?
>
> Yes. The issue with LTO is that obstack, as I mentioned earlier. It is
> reproducible in my branch, just compile it without any changes, then
> declare the obstack TLS, recompile the changes and run gcc.dg tests.

Trying that, but I cannot reproduce any failure so far.  Did you adjust
bitmap_default_obstack_depth to be TLS as well?

Richard.

Re: Parallelize GCC with Threads -- First Evaluation

Giuliano Belinassi-2
In reply to this post by nick-2
Hi

On 06/24, nick wrote:

>
>
> On 2019-06-24 8:59 a.m., Giuliano Belinassi wrote:
> > Hi,
> >
> > Parallelize GCC with Threads -- First Evaluation
> >
> > Hi everyone,
> >
> > I am attaching the first evaluation report here publicly for gathering
> > feedback. The file is in markdown format and it can be easily be converted to
> > PDF for better visualization.
> >
> > I am also open to suggestions and ideas in order to improve the current project :-)
> >
> > My branch can be seen here: https://gitlab.com/flusp/gcc/tree/giulianob_parallel
> >
> > Giuliano
> >
>
> Guiliano,
>
> Three things first your original proposal was just for expand_all_functions so don't
> know if it's extended out now but there's other parts in my research so the title
> was a little confusing.

Everything I am doing is aimed at parallelizing this function. Notice
that in trunk there is a call to node->expand(), and in order to expand
two nodes in parallel I have to explore all the shared state inside it,
including the passes. Also, my work so far is focused on GIMPLE.
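For illustration only, the end goal of expanding several nodes concurrently
might be sketched as below; `node_stub` is a stand-in for the real cgraph node
class, and `expand_all_parallel` is an invented name, not the branch's code:

```cpp
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

/* Stand-in for a cgraph node whose expand () runs the GIMPLE passes.  */
struct node_stub
{
  std::atomic<int> *counter;
  void expand () { ++*counter; }    /* placeholder for real expansion  */
};

/* Expand all nodes with NUM_THREADS workers pulling indices from a
   shared atomic counter.  This only works once per-pass shared state
   (obstacks, pools, pass globals) is made thread-safe.  */
static void
expand_all_parallel (std::vector<node_stub> &nodes, unsigned num_threads)
{
  std::atomic<std::size_t> next (0);
  std::vector<std::thread> workers;
  for (unsigned i = 0; i < num_threads; i++)
    workers.emplace_back ([&] {
      for (std::size_t j; (j = next++) < nodes.size (); )
        nodes[j].expand ();
    });
  for (std::thread &t : workers)
    t.join ();
}
```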

>
> I'm assuming this is outside the scope of the current project but does your all_rtl_passes
> function help out with architecture specific tuning flags as it seems that my research
> states that's one area of shared state.

Sorry, but I am not sure I understand what you meant. When splitting
all_passes into all_passes and all_rtl_passes, I didn't touch it, and I
am assuming it is working, since all tests are passing except the three
I told Richard about. As for tuning flags, I documented a few of them,
but I am currently ignoring them since I marked them as a backend
dependency.

>
> In addition for memory this may be really hard to do but can you have a signaler
> that tells each phase what  data to pass on therefore notifying what needs to be
> passed on to the next pass. So if expand_all_functions needs to pass x the signaler
> will notify the pass and just swap the values into the pass lock less if possible
> or just kill it off if not. This would mean writing a GENERIC to RTL final passes
> signaler which may take too long considering the scope of the project.

I am assuming you are talking about the pass-pipeline approach Richi
recommended to avoid the per-pass global state. I don't think a signal
is a good idea here, as we can lose signals if we are not careful
enough. What I would do is use a producer-consumer queue on each
pass, passing the function to optimize and a struct with everything the
pass needs. Or protect the global state with a binary semaphore: when the
pass is running, it decrements the semaphore, and increments it after its
work is done. When the previous pass wants to send information, I
decrement the semaphore, change the variables, and increment it again.
This is pretty similar to how a mutex works.

>
> Again that's just off the top of my head so it may be a really bad idea,
>
> Nick
>
> P.S. Good luck through.
Thank you,
Giuliano

Re: Parallelize GCC with Threads -- First Evaluation

Giuliano Belinassi-2
In reply to this post by Richard Biener
On 06/25, Richard Biener wrote:

> On Tue, 25 Jun 2019, Giuliano Belinassi wrote:
>
> > Hi, Richard
> >
> > On 06/25, Richard Biener wrote:
> > > On Mon, 24 Jun 2019, Giuliano Belinassi wrote:
> > >
> > > > Hi,
> > > >
> > > > Parallelize GCC with Threads -- First Evaluation
> > > >
> > > > Hi everyone,
> > > >
> > > > I am attaching the first evaluation report here publicly for gathering
> > > > feedback. The file is in markdown format and it can be easily be converted to
> > > > PDF for better visualization.
> > > >
> > > > I am also open to suggestions and ideas in order to improve the current project :-)
> > > >
> > > > My branch can be seen here: https://gitlab.com/flusp/gcc/tree/giulianob_parallel
> > >
> > > Thanks for your work on this!
> > >
> > > I didn't think of the default_obstack and input_location issues
> > > so it's good we get to know these.  The bitmap default obstack
> > > is merely a convenience and fortunately (content!) lifetime is constrained
> > > to be per-pass though that's not fully enforced.  Your solution is fine
> > > I think and besides when parallelizing the IPA phase should be not
> > > worse than making the default-obstack per thread context given how
> > > many times we zap it.
> >
> > This is something that is bothering me. In this current state, when I
> > declare the default_obstack as TLS, several LTO tests fails in
> > ssa_verify  and I am not sure why yet. I will investigate futher in
> > this.
>
> If you tell me how to reproduce on your branch I can have a look as well.

To reproduce it, clone the repository and check out the parallel branch:

```
git clone https://gitlab.com/flusp/gcc.git
cd gcc
git checkout giulianob_parallel
```

Then create a build folder (build_parallel, for instance) and, from inside
it, configure and compile:

```
../configure --disable-bootstrap --enable-languages=c
make -j<NUM_JOBS>
```

Then apply

```
diff --git a/gcc/bitmap.c b/gcc/bitmap.c
index 894aefa13de..87e6c7ac6fc 100644
--- a/gcc/bitmap.c
+++ b/gcc/bitmap.c
@@ -65,7 +65,7 @@ release_overhead (bitmap b, size_t amount, bool remove_from_map)
 
 /* Global data */
 bitmap_element bitmap_zero_bits;  /* An element of all zero bits.  */
-bitmap_obstack bitmap_default_obstack;    /* The default bitmap obstack.  */
+__thread bitmap_obstack bitmap_default_obstack;    /* The default bitmap obstack.  */
 static int bitmap_default_obstack_depth;
 static GTY((deletable)) bitmap_element *bitmap_ggc_free; /* Freelist of
            GC'd elements.  */
diff --git a/gcc/bitmap.h b/gcc/bitmap.h
index 39f509db611..b31e99ffcd0 100644
--- a/gcc/bitmap.h
+++ b/gcc/bitmap.h
@@ -359,7 +359,7 @@ struct GTY(()) bitmap_head {
 
 /* Global data */
 extern bitmap_element bitmap_zero_bits; /* Zero bitmap element */
-extern bitmap_obstack bitmap_default_obstack;   /* Default bitmap obstack */
+extern __thread bitmap_obstack bitmap_default_obstack;   /* Default bitmap obstack */
 
 /* Change the view of the bitmap to list, or tree.  */
 void bitmap_list_view (bitmap);
```

then recompile.

Just a side note: while I was writing this, Richi found the issue. The
obstack_depth also needs to be marked as TLS. Why it only crashed LTO
is a mystery, though.


>
> > As for IPA phase, you mean the expand_ipa or the IPA itself? If
> > expand_ipa falls in Inter Process otimization scheme the current
> > strategy will work, of course.
>
> I meant IPA itself.
>
> > As for IPA itself, I am still not sure if it is parallelizable yet,
> > however, I've seen some academic works parallelizing Dataflow
> > Analysis, which can give interesting ideas about how to handle the call
> > graphs (?). But this may be something interesting as the LTO WHOPR is
> > entirely sequential AFIK.
> >
> > >
> > > input_location shouldn't really be used...  but oh well.
> > >
> > > As of per-pass global state I expect that eventually all
> > > global variables in the individual passes are a problem
> > > and in nearly all cases they are global out of laziness
> > > to pass down state across functions.
> > >
> > > You identified some global state in infrastructure code
> > > which are the more interesting cases, most relevant for
> > > GIMPLE are the ones in tree-ssa-operands.c, tree-cfg.c
> > > and et-forest.c I guess.
> > >
> > > For individual passes a first step would be to wrap
> > > all globals into a struct we can allocate at ::execute
> > > time and pass down as pointer.  That's slightly less
> > > intrusive than wrapping all of the pass in a class
> > > but functionally equivalent.
> >
> > Sounds really good, but this will also require a lot of code change.
>
> Yeah, it's really a monkeys job ;)  One reason I originally proposed
> the pipeline approach to avoid the need to fix all of these.
>
> > Hopefully most of it can be solved without any unpleasant surprises.
>
> I would guess so, it's also a generally desirable cleanup.
>
> > >
> > > <<<
> > > 1. The GCC `object_pool_allocator`
> > >
> > >     There is also the GCC object_pool_allocator, which is used to allocate
> > > some
> > >     objects. Since these objects may be used later in the compilation by
> > > other
> > >     threads, we can't simply make them private to each thread. Therefore I
> > > added a
> > >     threadsafe_object_pool_allocator object that currently uses locks
> > > guarantee
> > >     safety, however I am not able to check its correctness. This is also
> > >     not efficient and might require a better approach later.
> > > >>>
> > >
> > > I guess the same applies to the GC allocator - to make these
> > > more efficient we'd have a per-thread freelist we can allocate
> > > from without locking and which we'd, once empty, fill from the
> > > main pool in larger chunks with locking.  At thread finalization
> > > we have to return the freelist to the main allocator then
> > > and for the GC allocator possibly at garbage collection time.
> >
> > This also sounds good. If I understeand correctly, the sequential
> > allocation state will be amortized, as we could ask for an object which
> > is 2x bigger than each thread currently has. (I am assuming allocation
> > is O(1))
>
> Yeah, it's mostly trading (freelist) memory for speed.
>
> > >
> > > This scheme may also work for the bitmap default obstack
> > > and its freelist.  I would also suggest when you run into
> > > a specific issue with the default obstack to use a separate
> > > obstack in the respective area.  You mention issues with LTO,
> > > are the reproducible on the branch?  I suppose you are
> > > currently testing GCC with num_threads = 1 to make sure you
> > > are not introducing non-threading related issues?
> >
> > Yes. The issue with LTO is that obstack, as I mentioned earlier. It is
> > reproducible in my branch, just compile it without any changes, then
> > declare the obstack TLS, recompile the changes and run gcc.dg tests.
>
> Trying that but cannot reproduce any failure sofar.  Did you adjust
> bitmap_default_obstack_depth to be TLS as well?

No. You found the issue :P

>
> Richard.

Re: Parallelize GCC with Threads -- First Evaluation

nick-2
In reply to this post by Giuliano Belinassi-2


On 2019-06-25 9:40 a.m., Giuliano Belinassi wrote:

> Hi
>
> On 06/24, nick wrote:
>>
>>
>> On 2019-06-24 8:59 a.m., Giuliano Belinassi wrote:
>>> Hi,
>>>
>>> Parallelize GCC with Threads -- First Evaluation
>>>
>>> Hi everyone,
>>>
>>> I am attaching the first evaluation report here publicly for gathering
>>> feedback. The file is in markdown format and it can be easily be converted to
>>> PDF for better visualization.
>>>
>>> I am also open to suggestions and ideas in order to improve the current project :-)
>>>
>>> My branch can be seen here: https://gitlab.com/flusp/gcc/tree/giulianob_parallel
>>>
>>> Giuliano
>>>
>>
>> Guiliano,
>>
>> Three things first your original proposal was just for expand_all_functions so don't
>> know if it's extended out now but there's other parts in my research so the title
>> was a little confusing.
>
> Everything that I am doing is to parallelize this function. Notice that
> in trunk there is a call to node->expand(), and in order to expand two
> nodes in parallel I have to explore all shared states inside these,
> including the passes. Also my work until now is focused in GIMPLE.
>
>>
>> I'm assuming this is outside the scope of the current project but does your all_rtl_passes
>> function help out with architecture specific tuning flags as it seems that my research
>> states that's one area of shared state.
>
> Sorry, but I am not sure if I understeand what you meant. When splitting
> all_passes into all_passes and all_rtl_passes, I didn't touch it and I
> am assuming it is working as all tests except three which I told Richard
> are passing. As for tuning flags, I documented a few of then but I am
> currently ignoring it since I marked it as backend dependency.
>
Exactly, that's fine; it's been part of my research these days when I've had
time. If you could pass along the documentation of which backend passes are
still not getting touched, that would be great, as I'm going to inspect those
later this week, probably. IPA, as you're finding out now, and SSA lowering
seem to have issues too, so I'm going to look into those as well.
 

>>
>> In addition for memory this may be really hard to do but can you have a signaler
>> that tells each phase what  data to pass on therefore notifying what needs to be
>> passed on to the next pass. So if expand_all_functions needs to pass x the signaler
>> will notify the pass and just swap the values into the pass lock less if possible
>> or just kill it off if not. This would mean writing a GENERIC to RTL final passes
>> signaler which may take too long considering the scope of the project.
>
> I am assuming you are talking about the pass-pipeline approach Richi
> recommended to avoid the per-pass global-states. I don't think a signal
> is a good idea here as we can lose signals around if we are not careful
> enough. What I would do is to use a producer-consumer queue on each
> pass, passing the function to optimize and a struct whith everything the
> pass needs. Or protect the global states with a binary semaphore: when the
> pass is running, it decrements the semaphore, and increments it after it
> works is done. When the previous pass wants to send information, I
> decrement the semaphore, change the variables, and increment it again.
> This is pretty similar of how a mutex work.
>
>>
That was basically my idea fleshed out, so it's fine now.
>> Again that's just off the top of my head so it may be a really bad idea,
>>
>> Nick
>>
>> P.S. Good luck through.
> Thank you,
> Giuliano
>

Sorry for not being clearer,
Nick