[Bug rtl-optimization/87902] New: [9 Regression] Shrink-wrapping multiple conditions

classic Classic list List threaded Threaded
10 messages Options
Reply | Threaded
Open this post in threaded view
|

[Bug rtl-optimization/87902] New: [9 Regression] Shrink-wrapping multiple conditions

asolokha at gmx dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87902

            Bug ID: 87902
           Summary: [9 Regression] Shrink-wrapping multiple conditions
           Product: gcc
           Version: 9.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: rtl-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: iii at linux dot ibm.com
                CC: krebbel at gcc dot gnu.org
  Target Milestone: ---
            Target: s390x-linux-gnu

I've noticed that r265830 fails to shrink-wrap multiple early returns in
gcc/testsuite/gcc.target/s390/nobp-return-mem-z900.c, while r264877 managed to
do so just fine.

After reload we end up with the following code for those conditions:

;; basic block 2
(note 5 1 3 2 [bb 2] NOTE_INSN_BASIC_BLOCK)
(note 3 5 2 2 NOTE_INSN_FUNCTION_BEG)
(insn 2 3 7 2 (set (reg/v:DI 12 %r12 [orig:63 aD.2191+-4 ] [63])
        (reg:DI 2 %r2 [72]))
"gcc/testsuite/gcc.target/s390/nobp-return-mem-z900.c":14:1 1269 {*movdi_64}
     (nil))
(insn 7 2 8 2 (set (reg:CCZ 33 %cc)
        (compare:CCZ (reg:SI 12 %r12 [orig:63 aD.2191 ] [63])
            (const_int 42 [0x2a])))
"gcc/testsuite/gcc.target/s390/nobp-return-mem-z900.c":17:6 1232 {*cmpsi_cct}
     (nil))
(jump_insn 8 7 9 2 (set (pc)
        (if_then_else (eq (reg:CCZ 33 %cc)
                (const_int 0 [0]))
            (label_ref:DI 33)
            (pc))) "gcc/testsuite/gcc.target/s390/nobp-return-mem-z900.c":17:6
1896 {*cjump_64}
     (int_list:REG_BR_PROB 225163668 (nil))
 -> 33)

;; basic block 3
(note 9 8 12 3 [bb 3] NOTE_INSN_BASIC_BLOCK)
(insn 12 9 13 3 (set (reg:CCS 33 %cc)
        (compare:CCS (reg:SI 12 %r12 [orig:63 aD.2191 ] [63])
            (const_int 0 [0])))
"gcc/testsuite/gcc.target/s390/nobp-return-mem-z900.c":20:3 1222
{*tstsi_cconly2}
     (nil))
(jump_insn 13 12 14 3 (set (pc)
        (if_then_else (le (reg:CCS 33 %cc)
                (const_int 0 [0]))
            (label_ref:DI 33)
            (pc))) "gcc/testsuite/gcc.target/s390/nobp-return-mem-z900.c":20:3
1896 {*cjump_64}
     (int_list:REG_BR_PROB 118111604 (nil))
 -> 33)

Note that comparisons use a copy in caller-saved %r12, and not %r2.  Then,
prepare_shrink_wrap () successfully propagates it to basic block 2. Basic block
3 is not affected - this seems to be by design, since prepare_shrink_wrap ()
only concerns itself with the first basic block.

In the past reload used to eliminate the copy and use %r2 directly in both
comparisons, but this seems to be no longer the case.
Reply | Threaded
Open this post in threaded view
|

[Bug rtl-optimization/87902] [9 Regression] Shrink-wrapping multiple conditions

asolokha at gmx dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87902

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|---                         |9.0
Reply | Threaded
Open this post in threaded view
|

[Bug rtl-optimization/87902] [9 Regression] Shrink-wrapping multiple conditions

asolokha at gmx dot com
In reply to this post by asolokha at gmx dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87902

--- Comment #1 from Ilya Leoshkevich <iii at linux dot ibm.com> ---
Bisect points to r265398: combine: Do not combine moves from hard
registers.

I wonder what would be the best place to fix this?  I was thinking about
making shrink-wrapping try harder by not limiting the processing to the
first basic block.
Reply | Threaded
Open this post in threaded view
|

[Bug rtl-optimization/87902] [9 Regression] Shrink-wrapping multiple conditions

asolokha at gmx dot com
In reply to this post by asolokha at gmx dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87902

--- Comment #2 from Segher Boessenkool <segher at gcc dot gnu.org> ---
So why does it use r12 there if it could use r2?  That's an RA problem.
This is related to PR87708, in a way.

prepare_shrinkwrap needs a good overhaul.  Moving all copies down also
*degrades* code quality, more often if you don't restrict it to the
first BB.
Reply | Threaded
Open this post in threaded view
|

[Bug rtl-optimization/87902] [9 Regression] Shrink-wrapping multiple conditions

asolokha at gmx dot com
In reply to this post by asolokha at gmx dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87902

--- Comment #3 from Ilya Leoshkevich <iii at linux dot ibm.com> ---
Judging by the following comment in lra-coalesce.c, RA doesn't do this
intentionally:

   Here we coalesce only spilled pseudos.  Coalescing non-spilled
   pseudos (with different hard regs) might result in spilling
   additional pseudos because of possible conflicts with other
   non-spilled pseudos and, as a consequence, in more constraint
   passes and even LRA infinite cycling.  Trivial the same hard
   register moves will be removed by subsequent compiler passes.

In which cases would moving copies down in prepare_shrink_wrap () make
the code worse?
Reply | Threaded
Open this post in threaded view
|

[Bug rtl-optimization/87902] [9 Regression] Shrink-wrapping multiple conditions

asolokha at gmx dot com
In reply to this post by asolokha at gmx dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87902

--- Comment #4 from Segher Boessenkool <segher at gcc dot gnu.org> ---
All instructions that depend on the new registers can start later, too, if
you move all new registers down.  If you move copies from hard registers
down it is much worse: you are extending the lifetime of those hard regs
so that nothing else can live there, but more likely, it will have to be
spilled even.
Reply | Threaded
Open this post in threaded view
|

[Bug rtl-optimization/87902] [9 Regression] Shrink-wrapping multiple conditions

asolokha at gmx dot com
In reply to this post by asolokha at gmx dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87902

--- Comment #5 from Ilya Leoshkevich <iii at linux dot ibm.com> ---
By the time shrink-wrapping is performed, which is after LRA
(pass_thread_prologue_and_epilogue, to be precise), aren't all spilling
decisions already made?  Because if that's true, we have to be
conservative in prepare_shrink_wrap () anyway, and move down copies only
when the parameter register still contains the parameter value.
Reply | Threaded
Open this post in threaded view
|

[Bug rtl-optimization/87902] [9 Regression] Shrink-wrapping multiple conditions

asolokha at gmx dot com
In reply to this post by asolokha at gmx dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87902

--- Comment #6 from Segher Boessenkool <segher at gcc dot gnu.org> ---
Oh sure, if all you want to do is extend the prepare_shrinkwrap function,
that just works there and it doesn't need to do a lot of profitability
trade-offs.  However it isn't very effective there.  It's better to do it
just before register allocation.  IRA tries to do a little, too, also not
very effective :-(

If you want to just extend prepare_shrinkwrap, so that it handles more than
just the first BB, what order should it try?  Should it be just greedy, or
should it look how it can get best gain?

Shrink-wrapping could wrap about 3x as many BBs as it does currently, but
just extending prepare_shrinkwrap doesn't get very far.  Which is not an
argument to not do a better job there, of course ;-)
Reply | Threaded
Open this post in threaded view
|

[Bug rtl-optimization/87902] [9 Regression] Shrink-wrapping multiple conditions

asolokha at gmx dot com
In reply to this post by asolokha at gmx dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87902

--- Comment #7 from Ilya Leoshkevich <iii at linux dot ibm.com> ---
Apparently, for this specific case doing more of hard register copy
propagation is enough.  I've just tried running pass_cprop_hardreg
before pass_thread_prologue_and_epilogue, and it helped.

So, would running a mini-cprop_hardreg instead of just
copyprop_hardreg_forward_bb_without_debug_insn (entry_block) be
reasonable here?  Something along the lines of:

- Do something like pre_and_rev_post_order_compute_fn (), but do not go
  further from bbs which contain insns satisfying
  requires_stack_frame_p (), since shrink-wrapping cannot happen past
  those anyway.

  Same for bbs which have more than 1 predecessor, since
  cprop_hardreg forgets everything it saw when it encounters those.  Not
  sure if a reasonable merge function can be defined for struct
  value_data to improve this?

  Maybe also stop completely when a certain number of bbs is found.

- Do something like pass_cprop_hardreg::execute (), but use only bbs
  computed during the previous step.  Btw, would reverse postorder be
  the "more intelligent queuing of blocks" mentioned in
  pass_cprop_hardreg::execute ()?



When you say that what IRA does is not effective, do you mean just the
need to track indirect hard register copies, or can it be improved even
further?
Reply | Threaded
Open this post in threaded view
|

[Bug rtl-optimization/87902] [9 Regression] Shrink-wrapping multiple conditions

asolokha at gmx dot com
In reply to this post by asolokha at gmx dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87902

--- Comment #8 from Segher Boessenkool <segher at gcc dot gnu.org> ---
(In reply to Ilya Leoshkevich from comment #7)

> Apparently, for this specific case doing more of hard register copy
> propagation is enough.  I've just tried running pass_cprop_hardreg
> before pass_thread_prologue_and_epilogue, and it helped.
>
> So, would running a mini-cprop_hardreg instead of just
> copyprop_hardreg_forward_bb_without_debug_insn (entry_block) be
> reasonable here?  Something along the lines of:
>
> - Do something like pre_and_rev_post_order_compute_fn (), but do not go
>   further from bbs which contain insns satisfying
>   requires_stack_frame_p (), since shrink-wrapping cannot happen past
>   those anyway.

I don't think that is true.  Separate shrink-wrapping...

>   Same for bbs which have more than 1 predecessor, since
>   cprop_hardreg forgets everything it saw when it encounters those.  Not
>   sure if a reasonable merge function can be defined for struct
>   value_data to improve this?
>
>   Maybe also stop completely when a certain number of bbs is found.
>
> - Do something like pass_cprop_hardreg::execute (), but use only bbs
>   computed during the previous step.

I think running the normal cprop_hardreg here is fine.  Or is it so
expensive?

>   Btw, would reverse postorder be
>   the "more intelligent queuing of blocks" mentioned in
>   pass_cprop_hardreg::execute ()?

Maybe?  It's not totally clear what is wanted here.

> When you say that what IRA does is not effective, do you mean just the
> need to track indirect hard register copies, or can it be improved even
> further?

I mean that split_live_ranges_for_shrink_wrap does not help much.