[Bug tree-optimization/78200] New: [7 regression]: 429.mcf of cpu2006 regresses in GCC trunk for avx2 target.

thiago at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78200

            Bug ID: 78200
           Summary: [7 regression]: 429.mcf of cpu2006 regresses in GCC
                    trunk for avx2 target.
           Product: gcc
           Version: tree-ssa
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: venkataramanan.kumar at amd dot com
  Target Milestone: ---

Noticed a 5% regression in 429.mcf of CPU2006 on x86_64 AVX2 (bdver4) with GCC
trunk, gcc version 7.0.0 20161028 (experimental) (GCC).

Flags used: -O3 -mavx2 -mprefer-avx128

Not seen with GCC 6.1 or with GCC trunk for -O3 -mavx -mprefer-avx128

An assembly difference is observed in the hot function primal_bea_mpp of
pbeampp.c.

-O3 -mavx -mprefer-avx128               -O3 -mavx2 -mprefer-avx128

.L98:                                 |  .L98:
  ------------------------------------|          jle     .L97      <== order of comparison
          cmpl    $2, %r9d            |          cmpl    $2, %r9d      is different.
          jne     .L97                |          jne     .L97
          testq   %rdi, %rdi          |  -----------------------------------
          jle     .L97                |  -----------------------------------
  .L99:                               |  .L99:
          addq    $1, %r13            |          addq    $1, %r13
          movq    %rdi, %r12          |          movq    %rdi, %r12
          movq    perm(,%r13,8), %r9  |          movq    perm(,%r13,8), %r9
          sarq    $63, %r12           |          sarq    $63, %r12
          movq    %rdi, 8(%r9)        |          movq    %rdi, 8(%r9)
+ +-- 12 lines: xorq %r12, %rdi-------|+ +-- 12 lines: xorq %r12, %rdi------
          jle     .L97                |          jle     .L97
          movq    8(%rax), %r14       |          movq    8(%rax), %r14
          movq    (%rax), %rdi        |          movq    (%rax), %rdi
          subq    (%r14), %rdi        |          subq    (%r14), %rdi
          movq    16(%rax), %r14      |          movq    16(%rax), %r14
          addq    (%r14), %rdi        |          addq    (%r14), %rdi
          jns     .L98                |          cmpq    $0, %rdi
  ------------------------------------|          jge     .L98


The GIMPLE optimized dump shows:

GCC trunk -O3 -mavx -mprefer-avx128
;;   basic block 20, loop depth 2, count 0, freq 1067, maybe hot
;;   Invalid sum of incoming frequencies 1216, should be 1067
;;    prev block 19, next block 21, flags: (NEW, REACHABLE, VISITED)
;;    pred:       18 [64.0%]  (FALSE_VALUE,EXECUTABLE)
  # RANGE [0, 1]
  _496 = _512 == 2;
  # RANGE [0, 1]
  _495 = red_cost_503 > 0;
  # RANGE [0, 1]
  _494 = _495 & _496;
  if (_494 != 0)
    goto <bb 21>;
  else
    goto <bb 22>;


GCC trunk -O3 -mavx2 -mprefer-avx128

;;   basic block 20, loop depth 2, count 0, freq 1067, maybe hot
;;   Invalid sum of incoming frequencies 1216, should be 1067
;;    prev block 19, next block 21, flags: (NEW, REACHABLE, VISITED)
;;    pred:       18 [64.0%]  (FALSE_VALUE,EXECUTABLE)
  # RANGE [0, 1]
  _496 = _512 == 2;
  # RANGE [0, 1]
  _495 = red_cost_503 > 0;  
  # RANGE [0, 1]
  _494 = _495 & _496; <== operation order is different on AVX2.
  if (_494 != 0)
    goto <bb 21>;
  else
    goto <bb 22>;

The operation order is changed in pbeampp.c.171t.reassoc2:
;;   basic block 20, loop depth 2, count 0, freq 1067, maybe hot
;;   Invalid sum of incoming frequencies 1216, should be 1067
;;    prev block 19, next block 21, flags: (NEW, REACHABLE, VISITED)
;;    pred:       18 [64.0%]  (FALSE_VALUE,EXECUTABLE)
  _496 = _512 == 2;
  _495 = red_cost_503 > 0;
  _494 = _495 & _496;
  if (_494 != 0)
    goto <bb 21>;
  else
    goto <bb 22>;

Looking further back, I found that tree if-conversion generates
non-canonical GIMPLE.
pbeampp.c.155t.ifcvt

;;   basic block 27, loop depth 2, count 0, freq 1067, maybe hot
;;   Invalid sum of incoming frequencies 1216, should be 1067
;;    prev block 26, next block 28, flags: (NEW, REACHABLE, VISITED)
;;    pred:       25 [64.0%]  (FALSE_VALUE,EXECUTABLE)
  _496 = _512 == 2;
  _495 = red_cost_503 > 0;
  _494 = _496 & _495;    <== comparison order is the same, but the LHS of "&" has the greater SSA number.
  if (_494 != 0)
    goto <bb 28>;
  else
    goto <bb 29>;


pbeampp.c.154t.ch_vect
;;   basic block 23, loop depth 2, count 0, freq 1067, maybe hot
;;   Invalid sum of incoming frequencies 1216, should be 1067
;;    prev block 22, next block 24, flags: (NEW, REACHABLE, VISITED)
;;    pred:       21 [64.0%]  (FALSE_VALUE,EXECUTABLE)
  _340 = _23 == 2;
  _341 = red_cost_86 > 0;
  _338 = _340 & _341;  <==  comparison order is same here.
  if (_338 != 0)
    goto <bb 24>;
  else
    goto <bb 25>;



Compiling pbeampp.c with -O3 -mavx2 -mprefer-avx128 -fno-tree-loop-if-convert
and the rest of the benchmark with -O3 -mavx2 -mprefer-avx128 brings the score
back to that of -O3 -mavx or GCC 6.1 -O3 -mavx2.
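
The "greater number" remark on the ifcvt dump refers to the SSA version numbers of the two operands of "&". As a rough illustration only, and assuming that the canonical form of a commutative operation puts the operand with the lower SSA version first (which is what the reassoc2 dump above ends up with, _495 & _496), the check could be sketched like this. The struct and function names are hypothetical and are not GCC internals.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for an SSA name such as _495; only the version
   number matters for this sketch. */
struct ssa_name {
    unsigned version;
};

/* Returns true if the operands of a commutative operation (e.g. "&") are
   already in the assumed canonical order: lower SSA version first. */
static bool is_canonical_order(const struct ssa_name *op0,
                               const struct ssa_name *op1)
{
    assert(op0 && op1);
    return op0->version <= op1->version;
}

/* Swaps the operands in place when they are not in the assumed canonical
   order.  Returns true if a swap happened. */
static bool canonicalize_commutative(struct ssa_name *op0,
                                     struct ssa_name *op1)
{
    if (is_canonical_order(op0, op1))
        return false;
    struct ssa_name tmp = *op0;
    *op0 = *op1;
    *op1 = tmp;
    return true;
}
```

With this model, the ifcvt form _496 & _495 is swapped, while the ch_vect form _340 & _341 is left alone.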

[Bug tree-optimization/78200] [7 regression]: 429.mcf of cpu2006 regresses in GCC trunk for avx2 target.

thiago at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78200

--- Comment #1 from Venkataramanan <venkataramanan.kumar at amd dot com> ---
(In reply to Venkataramanan from comment #0)

> Noticed 5% regression with 429.mcf of cpu2006 on x86_64 AVX2 (bdver4) with
> GCC trunk gcc version 7.0.0 20161028 (experimental) (GCC).
>
> Flag used is -O3 -mavx2 -mprefer-avx128
>
> Not seen with GCC 6.1 or with GCC trunk for -O3 -mavx -mprefer-avx128
>
> Assembly difference is observed in hot function primal_bea_mpp of pbeampp.c.
>
> -O3 -mavx -mprefer-avx128               -O3 -mavx2 -mprefer-avx128
>
> .L98:                                 |  .L98:
>   ------------------------------------|          jle     .L97      <== order of comparison
>           cmpl    $2, %r9d            |          cmpl    $2, %r9d      is different.
>           jne     .L97                |          jne     .L97
>           testq   %rdi, %rdi          |  -----------------------------------
>           jle     .L97                |  -----------------------------------
>   .L99:                               |  .L99:
>           addq    $1, %r13            |          addq    $1, %r13
>           movq    %rdi, %r12          |          movq    %rdi, %r12
>           movq    perm(,%r13,8), %r9  |          movq    perm(,%r13,8), %r9
>           sarq    $63, %r12           |          sarq    $63, %r12
>           movq    %rdi, 8(%r9)        |          movq    %rdi, 8(%r9)
> + +-- 12 lines: xorq %r12, %rdi-------|+ +-- 12 lines: xorq %r12, %rdi------
>           jle     .L97                |          jle     .L97
>           movq    8(%rax), %r14       |          movq    8(%rax), %r14
>           movq    (%rax), %rdi        |          movq    (%rax), %rdi
>           subq    (%r14), %rdi        |          subq    (%r14), %rdi
>           movq    16(%rax), %r14      |          movq    16(%rax), %r14
>           addq    (%r14), %rdi        |          addq    (%r14), %rdi
>           jns     .L98                |          cmpq    $0, %rdi
>   ------------------------------------|          jge     .L98
>
>
> The GIMPLE optimized dump shows:
>
> GCC trunk -O3 -mavx -mprefer-avx128
> ;;   basic block 20, loop depth 2, count 0, freq 1067, maybe hot
> ;;   Invalid sum of incoming frequencies 1216, should be 1067
> ;;    prev block 19, next block 21, flags: (NEW, REACHABLE, VISITED)
> ;;    pred:       18 [64.0%]  (FALSE_VALUE,EXECUTABLE)
>   # RANGE [0, 1]
>   _496 = _512 == 2;
>   # RANGE [0, 1]
>   _495 = red_cost_503 > 0;
>   # RANGE [0, 1]
>   _494 = _495 & _496;
>   if (_494 != 0)
>     goto <bb 21>;
>   else
>     goto <bb 22>;
>
For GCC trunk with -O3 -mavx -mprefer-avx128, the optimized dump looks like this:
;;   basic block 20, loop depth 2, count 0, freq 1067, maybe hot
;;   Invalid sum of incoming frequencies 1216, should be 1067
;;    prev block 19, next block 21, flags: (NEW, REACHABLE, VISITED)
;;    pred:       18 [64.0%]  (FALSE_VALUE,EXECUTABLE)
  # RANGE [0, 1]
  _340 = _23 == 2;
  # RANGE [0, 1]
  _341 = red_cost_86 > 0;
  # RANGE [0, 1]
  _338 = _340 & _341;
  if (_338 != 0)
    goto <bb 21>;
  else
    goto <bb 22>;

[Bug rtl-optimization/78200] [7 Regression] 429.mcf of cpu2006 regresses in GCC trunk for avx2 target.

thiago at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78200

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Target|                            |x86_64-*-*
             Status|UNCONFIRMED                 |WAITING
            Version|tree-ssa                    |7.0
           Keywords|                            |missed-optimization
   Last reconfirmed|                            |2016-11-04
          Component|tree-optimization           |rtl-optimization
                 CC|richard.guenther at gmail dot com  |rguenth at gcc dot gnu.org
     Ever confirmed|0                           |1
            Summary|[7 regression]: 429.mcf of  |[7 Regression] 429.mcf of
                   |cpu2006 regresses in GCC    |cpu2006 regresses in GCC
                   |trunk for avx2 target.      |trunk for avx2 target.
   Target Milestone|---                         |7.0

--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
The issue is the missed cmp-and-branch fusion, which a) you should enable in
the first place, e.g. via -mtune=bdver4, and b) is apparently made
impossible/hard by changes in the original RTL expansion with regard to a
complex condition.

Please attach a testcase.

[Bug rtl-optimization/78200] [7 Regression] 429.mcf of cpu2006 regresses in GCC trunk for avx2 target.

thiago at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78200

--- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> ---
http://gcc.opensuse.org/SPEC/CINT/sb-czerny-head-64-2006/index.html

confirms the regression (even bigger with -Ofast -flto -march=native which
is a skylake here).  Likewise
http://gcc.opensuse.org/SPEC/CINT/sb-megrez-head-64-2006/index.html (though
peak makes it hard to spot in the graph).

[Bug rtl-optimization/78200] [7 Regression] 429.mcf of cpu2006 regresses in GCC trunk for avx2 target.

thiago at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78200

--- Comment #4 from Venkataramanan <venkataramanan.kumar at amd dot com> ---
Created attachment 39976
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=39976&action=edit
Test case for noncanonical gimple formation at tree if conversion.

The test case is simulated from primal_bea_mpp of 429.mcf.

gcc version 7.0.0 20161106

---- snip from test.c.155t.ifcvt----
;;   basic block 30, loop depth 1, count 0, freq 407, maybe hot
;;    prev block 29, next block 31, flags: (NEW, REACHABLE, VISITED)
;;    pred:       28 [64.0%]  (FALSE_VALUE,EXECUTABLE)
  _96 = _67 == 2;
  _29 = if_conversion_var_66 > 0;
  _12 = _96 & _29; <== Non canonical gimple
  if (_12 != 0)
    goto <bb 31>;
  else
    goto <bb 32>;
-----snip ends-----

[Bug rtl-optimization/78200] [7 Regression] 429.mcf of cpu2006 regresses in GCC trunk for avx2 target.

thiago at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78200

--- Comment #5 from Richard Biener <rguenth at gcc dot gnu.org> ---
So it's

addq    (%r14), %rdi
jns     .L98

.L98:
cmpl    $2, %r9d
jne     .L97
testq   %rdi, %rdi
jle     .L97

vs.

addq    (%r14), %rdi
cmpq    $0, %rdi
jge     .L98

.L98:
jle     .L97
cmpl    $2, %r9d
jne     .L97

I would guess the order of the branches after .L98 in the first case is
unrelated (to be checked).

[Bug rtl-optimization/78200] [7 Regression] 429.mcf of cpu2006 regresses in GCC trunk for avx2 target.

thiago at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78200

--- Comment #6 from Richard Biener <rguenth at gcc dot gnu.org> ---
So the non-canonical GIMPLE is actually created by loop versioning and
update-ssa replacing uses with new defs but not re-canonicalizing operand
order.  Not gimplification as I speculated.

[Bug rtl-optimization/78200] [7 Regression] 429.mcf of cpu2006 regresses in GCC trunk for avx2 target.

thiago at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78200

--- Comment #7 from Venkataramanan <venkataramanan.kumar at amd dot com> ---
Bisecting shows that the non-canonical GIMPLE generation starts at r238370.

--snip--
commit f3dce1cdd016e16cf9dc051d127bdf6eb58430fc
Author: rguenth <rguenth@138bc75d-0d04-0410-961f-82ee72b054a4>
Date:   Fri Jul 15 10:53:29 2016 +0000

    2016-07-15  Richard Biener  <[hidden email]>

        * tree-ssa-pre.c (get_representative_for): Make sure to return
        the value number of SSA names.
        (phi_translate_1): get_representative_for cannot return NULL.
        (do_pre_regular_insertion): Remove redundant call to
        fully_constant_expression.
        (do_pre_partial_partial_insertion): Likewise.


    git-svn-id: svn+ssh://gcc.gnu.org/svn/gcc/trunk@238370 138bc75d-0d04-0410-961f-82ee72b054a4
--snip--

r238370 test.c.148t.ifcvt
    ;;   basic block 30, loop depth 1, count 0, freq 407, maybe hot
    ;;    prev block 29, next block 31, flags: (NEW, REACHABLE)
    ;;    pred:       28 [64.0%]  (FALSE_VALUE,EXECUTABLE)
      _71 = _109 == 2;
      _15 = if_conversion_var_108 > 0;
      _70 = _71 & _15;
      if (_70 != 0)
        goto <bb 31>;
      else
        goto <bb 32>;


r238369 test.c.148t.ifcvt

  ;;   basic block 30, loop depth 1, count 0, freq 407, maybe hot
    ;;    prev block 29, next block 31, flags: (NEW, REACHABLE)
    ;;    pred:       28 [64.0%]  (FALSE_VALUE,EXECUTABLE)
      _26 = _73 == 2;
      _38 = if_conversion_var_72 > 0;
      _86 = _26 & _38;
      if (_86 != 0)
        goto <bb 31>;
      else
        goto <bb 32>;
    ;;    succ:       31 [34.0%]  (TRUE_VALUE,EXECUTABLE)
    ;;                32 [66.0%]  (FALSE_VALUE,EXECUTABLE)

[Bug rtl-optimization/78200] [7 Regression] 429.mcf of cpu2006 regresses in GCC trunk for avx2 target.

thiago at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78200

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|WAITING                     |NEW

--- Comment #8 from Richard Biener <rguenth at gcc dot gnu.org> ---
if-combine is combining parts of

            if( (red_cost < 0 && arc->ident == 1)
                || (red_cost > 0 && arc->ident == 2) )
            {

to

  if (red_cost_86 < 0)
    goto <bb 17>;
  else
    goto <bb 18>;

  <bb 17>:
  if (_23 == 1)
    goto <bb 19>;
  else
    goto <bb 20>;

  <bb 18>:
  _340 = _23 == 2;
  _341 = red_cost_86 > 0;
  _338 = _340 & _341;
  if (_338 != 0)
    goto <bb 19>;
  else
    goto <bb 20>;

the guard could be written as

  red_cost != 0 && arc->ident == 1 + (red_cost > 0)

before if-combine we see

  red_cost_86 = _27 + _29;
  if (red_cost_86 < 0)
    goto <bb 17>;
  else
    goto <bb 18>;

  <bb 17>:
  if (_23 == 1)
    goto <bb 20>;
  else
    goto <bb 21>;

  <bb 18>:
  if (red_cost_86 > 0)
    goto <bb 19>;
  else
    goto <bb 21>;

  <bb 19>:
  if (_23 == 2)
    goto <bb 20>;
  else
    goto <bb 21>;
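
The rewritten guard suggested above can be checked against the original by brute force. The following is only a sketch: the values 1 and 2 for the two ident states are taken from the comparisons in the dumps, and the helper names (guard_original, guard_rewritten, count_mismatches) are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Values taken from the comparisons in the dumps above (ident == 1 and
   ident == 2). */
enum { AT_LOWER = 1, AT_UPPER = 2 };

/* The guard as written in the benchmark source quoted above. */
static bool guard_original(long red_cost, int ident)
{
    return (red_cost < 0 && ident == AT_LOWER)
        || (red_cost > 0 && ident == AT_UPPER);
}

/* The single-comparison form suggested above. */
static bool guard_rewritten(long red_cost, int ident)
{
    return red_cost != 0 && ident == 1 + (red_cost > 0);
}

/* Counts disagreements over a small grid of inputs; 0 means the two forms
   agree on every tested input. */
static int count_mismatches(long max_abs_cost, int max_ident)
{
    assert(max_abs_cost >= 0 && max_ident >= 0);
    int mismatches = 0;
    for (long c = -max_abs_cost; c <= max_abs_cost; c++)
        for (int id = 0; id <= max_ident; id++)
            if (guard_original(c, id) != guard_rewritten(c, id))
                mismatches++;
    return mismatches;
}
```

Only the sign of red_cost matters to either form, so a small grid such as count_mismatches(5, 3) already covers every distinct case.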

[Bug rtl-optimization/78200] [7 Regression] 429.mcf of cpu2006 regresses in GCC trunk for avx2 target.

thiago at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78200

--- Comment #9 from Richard Biener <rguenth at gcc dot gnu.org> ---
So RTL expansion ends up in

          /* If jumps are cheap and the target does not support conditional
             compare, turn some more codes into jumpy sequences.  */
          else if (BRANCH_COST (optimize_insn_for_speed_p (), false) < 4
                   && targetm.gen_ccmp_first == NULL)
            {
              if ((code2 == BIT_AND_EXPR
                   && TYPE_PRECISION (TREE_TYPE (op0)) == 1
                   && TREE_CODE (gimple_assign_rhs2 (second)) != INTEGER_CST)
                  || code2 == TRUTH_AND_EXPR)
                {
                  code = TRUTH_ANDIF_EXPR;
                  op0 = gimple_assign_rhs1 (second);
                  op1 = gimple_assign_rhs2 (second);

where we could adjust operand order based on the immediately dominating
condition.

Unfortunately something as simple as

                  /* We'll expand RTL for op0 first, see if we'd better
                     expand RTL for op1 first.  */
                  if (TREE_CODE (op1) == SSA_NAME
                      && single_pred_p (bb))
                    {
                      gimple *def1 = SSA_NAME_DEF_STMT (op1);
                      if (is_gimple_assign (def1)
                          && TREE_CODE_CLASS (gimple_assign_rhs_code (def1)) == tcc_comparison)
                        {
                          basic_block pred = single_pred (bb);
                          gimple *last = last_stmt (pred);
                          if (last
                              && gimple_code (last) == GIMPLE_COND
                              && gimple_assign_rhs1 (def1) == gimple_cond_lhs (last))
                            std::swap (op0, op1);
                        }
                    }

doesn't work as the predecessor is no longer in GIMPLE (we dropped the seq
for the GIMPLE stmts and GIMPLE_CONDs have no DEFs...).  Also I'm not sure
the half-way RTL CFG will still point to the original block.

Of course the above heuristic is really only applicable if there's not much
code expanded between this jump and the one in the predecessor.

OTOH if we have the BRANCH_COST check during RTL expansion (similar to
what we have for LOGICAL_OP_NON_SHORT_CIRCUIT in fold-const.c) then maybe
if-combining shouldn't combine the conditionals.  There's a slight disconnect
here, the above is BRANCH_COST < 4 while the other is BRANCH_COST >= 2 ...

The cfgexpand code could also be done as a pre-pass on the IL turning
the straight-line code back to CFG (I guess that's a good idea anyway).
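
To make the "slight disconnect" between the two thresholds concrete, here is a toy model. It uses only the two conditions quoted above (BRANCH_COST >= 2 on one side, BRANCH_COST < 4 on the other); the function names are hypothetical, and the targetm.gen_ccmp_first == NULL part of the real check is simply assumed to hold.

```c
#include <assert.h>
#include <stdbool.h>

/* Models the BRANCH_COST >= 2 side: conditionals are combined into
   straight-line code. */
static bool folds_to_straight_line(int branch_cost)
{
    assert(branch_cost >= 0);
    return branch_cost >= 2;
}

/* Models the RTL expansion side: a straight-line "&" is turned back into a
   jumpy sequence when BRANCH_COST < 4.  The conditional-compare check of the
   real code is not modelled. */
static bool expands_to_jumps(int branch_cost)
{
    assert(branch_cost >= 0);
    return branch_cost < 4;
}

/* True when both decisions fire: the condition is combined early and then
   split into jumps again at expansion. */
static bool combined_then_split(int branch_cost)
{
    return folds_to_straight_line(branch_cost)
        && expands_to_jumps(branch_cost);
}
```

Under this model, branch costs of 2 and 3 are the ones where the conditionals are first combined and later split again.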

[Bug rtl-optimization/78200] [7 Regression] 429.mcf of cpu2006 regresses in GCC trunk for avx2 target.

thiago at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78200

--- Comment #10 from Richard Biener <rguenth at gcc dot gnu.org> ---
OTOH we _do_ have initial RTL

(insn 167 166 168 20 (set (reg:CCGOC 17 flags)
        (compare:CCGOC (reg/v:DI 217 [ red_cost ])
            (const_int 0 [0]))) "pbeampp.c":42 -1
     (nil))
(jump_insn 168 167 169 20 (set (pc)
        (if_then_else (ge (reg:CCGOC 17 flags)
                (const_int 0 [0]))
            (label_ref 175)
            (pc))) "pbeampp.c":42 -1
     (int_list:REG_BR_PROB 6400 (nil))
 -> 175)
;;  succ:       21 [36.0%]  (FALLTHRU)
;;              23 [64.0%]

;; basic block 23, loop depth 2, count 0, freq 1067, maybe hot
;; Invalid sum of incoming frequencies 1216, should be 1067
;;  prev block 22, next block 24, flags: (NEW, REACHABLE, RTL, MODIFIED, VISITED)
;;  pred:       20 [64.0%]
(code_label 175 173 176 23 98 "" [1 uses])
(note 176 175 177 23 [bb 23] NOTE_INSN_BASIC_BLOCK)
(insn 177 176 178 23 (set (reg:CCNO 17 flags)
        (compare:CCNO (reg/v:DI 217 [ red_cost ])
            (const_int 0 [0]))) "pbeampp.c":42 -1
     (nil))
(insn 178 177 179 23 (set (reg:QI 273)
        (gt:QI (reg:CCNO 17 flags)
            (const_int 0 [0]))) "pbeampp.c":42 -1
     (nil))
(insn 179 178 180 23 (set (reg:CCZ 17 flags)
        (compare:CCZ (reg:QI 273)
            (const_int 0 [0]))) "pbeampp.c":42 -1
     (nil))
(jump_insn 180 179 587 23 (set (pc)
        (if_then_else (eq (reg:CCZ 17 flags)
                (const_int 0 [0]))
            (label_ref 196)
            (pc))) "pbeampp.c":42 -1
     (int_list:REG_BR_PROB 3300 (nil))
 -> 196)

that is, it compares in a sensible order allowing for combining (which
apparently is what causes the code to run slower, for reasons not yet
explored).

Expanding the other way around does not have any justification IMHO
and thus the "fix" would be to the later stage where we combine
the compare with the one on the backedge.

The issue is CSE2 which does

(insn 167 166 168 21 (set (reg:CC 17 flags)
        (compare:CC (reg/v:DI 217 [ red_cost ])
            (const_int 0 [0]))) "pbeampp.c":42 8 {*cmpdi_1}
     (nil))
(jump_insn 168 167 169 21 (set (pc)
        (if_then_else (ge (reg:CC 17 flags)
                (const_int 0 [0]))
            (label_ref 175)
            (pc))) "pbeampp.c":42 635 {*jcc_1}
     (expr_list:REG_DEAD (reg:CC 17 flags)
        (int_list:REG_BR_PROB 6400 (nil)))
 -> 175)
...
(insn 178 176 179 24 (set (reg:QI 273)
        (gt:QI (reg:CC 17 flags)
            (const_int 0 [0]))) "pbeampp.c":42 631 {*setcc_qi}
     (expr_list:REG_DEAD (reg:CC 17 flags)
        (nil)))

thus changes the earlier compare to CC and re-uses that CCmode.  Note it's
still a mystery to me why this is slower (and I did not reproduce that myself
yet).

Then we combine it to

(insn 167 166 168 18 (set (reg:CC 17 flags)
        (compare:CC (reg/v:DI 217 [ red_cost ])
            (const_int 0 [0]))) "pbeampp.c":42 8 {*cmpdi_1}
     (nil))
(jump_insn 168 167 169 18 (set (pc)
        (if_then_else (ge (reg:CC 17 flags)
                (const_int 0 [0]))
            (label_ref 175)
            (pc))) "pbeampp.c":42 635 {*jcc_1}
     (int_list:REG_BR_PROB 6400 (nil))
 -> 175)
;;  succ:       19 [36.0%]  (FALLTHRU)
;;              20 [64.0%]


;; basic block 20, loop depth 0, count 0, freq 1067, maybe hot
;; Invalid sum of incoming frequencies 1216, should be 1067
(jump_insn 180 179 587 20 (set (pc)
        (if_then_else (le (reg:CC 17 flags)
                (const_int 0 [0]))
            (label_ref:DI 196)
            (pc))) "pbeampp.c":42 635 {*jcc_1}
     (int_list:REG_BR_PROB 3300 (expr_list:REG_DEAD (reg:CCZ 17 flags)
            (nil)))

[Bug rtl-optimization/78200] [7 Regression] 429.mcf of cpu2006 regresses in GCC trunk for avx2 target.

thiago at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78200

--- Comment #11 from Venkataramanan <venkataramanan.kumar at amd dot com> ---
Hi Richard


On a Haswell machine, the original run time for -O3 -mavx2 -mprefer-avx128 is:

real    2m35.325s
user    2m35.257s
sys     0m0.070s

Changing the assembly from  

.L98:
        jle     .L97
        cmpl    $2, %r9d
        jne     .L97
.L99:

to

.L98:
        cmpl    $2, %r9d
        jne     .L97
        cmpq    $0, %rdi
        jle     .L97
.L99:

improves the run time by about 8 seconds (roughly 5%):

real    2m27.224s
user    2m27.138s
sys     0m0.087s



[Bug rtl-optimization/78200] [7 Regression] 429.mcf of cpu2006 regresses in GCC trunk for avx2 target.

thiago at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78200

--- Comment #12 from Richard Biener <rguenth at gcc dot gnu.org> ---
Btw, I see with -mavx2

        addq    (%r9), %rax
        jns     .L90

.L90:
        je      .L92
        cmpl    $2, 24(%rdx)
        je      .L91

thus there is no extra cmpq $0, %rdi in the predecessor.

Note when I profile avx (base) vs. avx2 (peak) I see

 18.17%        451662  mcf_base.amd64-  mcf_base.amd64-m64-gcc42-nn  [.] refresh_potential
 18.12%        424592  mcf_base.amd64-  mcf_base.amd64-m64-gcc42-nn  [.] primal_bea_mpp
 17.96%        465325  mcf_peak.amd64-  mcf_peak.amd64-m64-gcc42-nn  [.] primal_bea_mpp
 14.93%        373309  mcf_peak.amd64-  mcf_peak.amd64-m64-gcc42-nn  [.] refresh_potential

plus a 3-run of avx (base) vs. avx2 (peak) gives me

429.mcf          9120        252       36.1 *    9120        264       34.6 S
429.mcf          9120        257       35.5 S    9120        253       36.0 S
429.mcf          9120        232       39.3 S    9120        258       35.4 *

which isn't really conclusive.

If you are trying to narrow down a regression GCC 6 vs. GCC 7 I wouldn't
look at flags but at profiling and what changed.

[Bug rtl-optimization/78200] [7 Regression] 429.mcf of cpu2006 regresses in GCC trunk for avx2 target.

thiago at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78200

--- Comment #13 from Richard Biener <rguenth at gcc dot gnu.org> ---
When I compare GCC 6 (r241818) against trunk (r241997) with -Ofast
-march=native (on Haswell) I get

429.mcf          9120        230       39.7 S    9120        240       38.0 *
429.mcf          9120        227       40.1 S    9120        244       37.4 S
429.mcf          9120        230       39.7 *    9120        237       38.6 S

That's a ~5% regression.  Profiling that shows

 20.89%        398295  mcf_peak.amd64-  mcf_peak.amd64-m64-gcc42-nn  [.] primal_bea_mpp
 18.34%        349619  mcf_base.amd64-  mcf_base.amd64-m64-gcc42-nn  [.] primal_bea_mpp
 15.45%        294258  mcf_peak.amd64-  mcf_peak.amd64-m64-gcc42-nn  [.] refresh_potential
 13.53%        257796  mcf_base.amd64-  mcf_base.amd64-m64-gcc42-nn  [.] refresh_potential
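
As a quick sanity check on the ~5% figure, the slowdown can be computed from the run times marked "*" in the table above (230 s and 240 s). This assumes the left column is GCC 6 and the right column is trunk, which the table itself does not state; the helper name is hypothetical.

```c
#include <assert.h>

/* Relative slowdown, in percent, of new_seconds compared to old_seconds. */
static double slowdown_percent(double old_seconds, double new_seconds)
{
    assert(old_seconds > 0.0);
    return (new_seconds - old_seconds) / old_seconds * 100.0;
}
```

Under that assumption, slowdown_percent(230.0, 240.0) comes out a little above 4%, in line with the rough figure quoted.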

[Bug rtl-optimization/78200] [7 Regression] 429.mcf of cpu2006 regresses in GCC trunk for avx2 target.

thiago at kde dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78200

--- Comment #14 from Venkataramanan <venkataramanan.kumar at amd dot com> ---
Between GCC 6.2.0 and GCC 7 (Nov/10/2016) I see three major differences in the
GIMPLE optimization dumps.

1. IPA inlining is more aggressive in GCC 7. It looks like it inlines more into
the hot function "primal_bea_mpp". However, completely disabling IPA inlining
still produces the regression in GCC 7.

2. Non-canonical GIMPLE formation at tree if-conversion. This was already
pointed out in comment 1.  Using -fno-tree-loop-if-convert in GCC 7 brings the
runtime back to that of GCC 6.

3. In GCC 7 the true block is kept after the "red_cost > 0" check.

            red_cost = arc->cost - arc->tail->potential + arc->head->potential;
            if( bea_is_dual_infeasible( arc, red_cost ) )
            {
         True block ==>       next++;
                perm[next]->a = arc;
                perm[next]->cost = red_cost;
                perm[next]->abs_cost = ABS(red_cost);

Note that bea_is_dual_infeasible expands to this check:
(red_cost < 0 && arc->ident == AT_LOWER)
|| (red_cost > 0 && arc->ident == AT_UPPER)


Tree VRP in GCC 6 moves the true block to the end of the GIMPLE blocks.

Now when I compile with GCC 6 and -fno-tree-vrp, the true block is at the same
position as in GCC 7 (trunk), and I get the regression with GCC 6 as well.

So the block movement being done in GCC 6 but not in GCC 7 is also an
interesting observation.
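
For readers without the benchmark sources at hand, the check and the "true block" from point 3 can be written out as a small stand-alone C sketch. This is not the mcf code: the reduced arc_t and basket_t layouts, the cost_t type, the helper name push_if_infeasible and the values AT_LOWER == 1 and AT_UPPER == 2 (inferred from the "== 1" and "== 2" compares in the GIMPLE dumps in this thread) are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed values: the dumps compare arc->ident against 1 and 2. */
enum { AT_LOWER = 1, AT_UPPER = 2 };

typedef long cost_t;                 /* assumed type */

typedef struct arc {                 /* hypothetical, reduced to the fields used here */
    cost_t cost;
    int ident;
} arc_t;

typedef struct basket {              /* hypothetical layout of a perm[] entry */
    arc_t *a;
    cost_t cost;
    cost_t abs_cost;
} basket_t;

#define ABS(x) (((x) >= 0) ? (x) : -(x))

/* The predicate quoted in the comment. */
static int bea_is_dual_infeasible(const arc_t *arc, cost_t red_cost)
{
    return (red_cost < 0 && arc->ident == AT_LOWER)
        || (red_cost > 0 && arc->ident == AT_UPPER);
}

/* The "true block": record the arc in the next perm[] slot.
   Returns the updated value of next. */
static long push_if_infeasible(basket_t **perm, long next,
                               arc_t *arc, cost_t red_cost)
{
    if (bea_is_dual_infeasible(arc, red_cost)) {
        next++;
        perm[next]->a = arc;
        perm[next]->cost = red_cost;
        perm[next]->abs_cost = ABS(red_cost);
    }
    return next;
}

/* Small self-check: one arc that is recorded, one that is skipped. */
static void check_push(void)
{
    basket_t slots[3] = { { NULL, 0, 0 }, { NULL, 0, 0 }, { NULL, 0, 0 } };
    basket_t *perm[3] = { &slots[0], &slots[1], &slots[2] };
    arc_t upper = { 5, AT_UPPER };
    long next = 0;

    next = push_if_infeasible(perm, next, &upper, 7);    /* block is entered */
    assert(next == 1 && perm[1]->a == &upper && perm[1]->abs_cost == 7);

    next = push_if_infeasible(perm, next, &upper, -7);   /* block is skipped */
    assert(next == 1);
}
```

Calling check_push() from a main function exercises both paths. The sketch takes red_cost as a parameter instead of computing it from the arc potentials, to keep the example short.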

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78200

--- Comment #15 from rguenther at suse dot de <rguenther at suse dot de> ---
On Fri, 11 Nov 2016, venkataramanan.kumar at amd dot com wrote:

> So block movement in GCC 6 and not done in GCC 7 is also interesting
> observation.

Block movement on GIMPLE doesn't really mean anything; what is
interesting is the (guessed) profile data on the blocks/edges.

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78200

--- Comment #16 from Venkataramanan <venkataramanan.kumar at amd dot com> ---
GCC 7 added an early threading pass and a GIMPLE thread pass before VRP. When I
disable these passes, tree-vrp is able to move the true block the same way as
in GCC 6.

It is again tree-if-convert that causes the moved block to come back to its
original position. Non-canonical GIMPLE also gets formed.

 <bb 27>:
  _496 = _512 == 2;
  _495 = red_cost_503 > 0;
  _494 = _496 & _495;
  if (_494 != 0)
    goto <bb 28>;
  else
    goto <bb 29>;

  <bb 28>:                             <== True block.
  _502 = basket_size_lsm.75_514 + 1;
  _501 = perm[_502];
  _501->a = arc_516;
  _501->cost = red_cost_503;
  _498 = ABS_EXPR <red_cost_503>;
  _501->abs_cost = _498;

If we turn off tree if-conversion, basic block reordering at RTL brings the
block back to its position. But we don't see the regression, since the check is
in order.

  <bb 20>:
  _344 = _23 == 2;
  _345 = red_cost_86 > 0;
  _342 = _344 & _345;
  if (_342 != 0)
    goto <bb 86>;
  else
    goto <bb 21>;

This makes me believe that swapping the operands causes the regression.
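
Both operand orders of the combined condition compute the same value, so any runtime difference has to come from the compare/branch sequence that gets generated, not from the logic. A minimal C sketch of that point follows. The function names are made up for illustration, and AT_UPPER == 2 is an assumption taken from the "== 2" compare in the dumps.

```c
#include <assert.h>

enum { AT_UPPER = 2 };   /* assumed from the "_23 == 2" compare in the dump */

/* Operand order as in the dump: the ident compare comes first. */
static int check_ident_first(long red_cost, int ident)
{
    int is_upper = (ident == AT_UPPER);
    int is_positive = (red_cost > 0);
    return is_upper & is_positive;
}

/* Swapped operand order: the red_cost compare comes first. */
static int check_cost_first(long red_cost, int ident)
{
    int is_positive = (red_cost > 0);
    int is_upper = (ident == AT_UPPER);
    return is_positive & is_upper;
}

/* Compare the two orders over a small range of inputs. */
static void check_orders_agree(void)
{
    for (long red_cost = -2; red_cost <= 2; red_cost++)
        for (int ident = 0; ident <= 3; ident++)
            assert(check_ident_first(red_cost, ident)
                   == check_cost_first(red_cost, ident));
}
```

Because `&` evaluates both operands, neither order skips work at the source level. Which compare ends up first in the final code is decided later, during expansion to RTL.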

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78200

--- Comment #17 from Venkataramanan <venkataramanan.kumar at amd dot com> ---
Looking at the check
               (red_cost < 0 && arc->ident == AT_LOWER)
            || (red_cost > 0 && arc->ident == AT_UPPER)

The order that if-combine created seems to be the best.

if (red_cost_86 < 0)
    goto <bb 17>;
  else
    goto <bb 18>;

  <bb 17>:
  if (_23 == 1)
    goto <bb 19>;
  else
    goto <bb 20>;

  <bb 18>:
  _340 = _23 == 2;
  _341 = red_cost_86 > 0;
  _338 = _340 & _341;
  if (_338 != 0)
    goto <bb 19>;
  else
    goto <bb 20>;

  <bb 19>:
  basket_size.5_30 = basket_size;
  _31 = basket_size.5_30 + 1;
  basket_size = _31;
  _32 = perm[_31];
  _32->a = arc_47;
  _32->cost = red_cost_86;
  _33 = ABS_EXPR <red_cost_86>;
  _32->abs_cost = _33;

If red_cost < 0 is false, then arc->ident == AT_UPPER is checked first. This
is better, since we know red_cost > 0 will always be true.


Non-canonical GIMPLE generated at if-conversion:
<bb 27>:
  _496 = _512 == 2;
  _495 = red_cost_503 > 0;
  _494 = _496 & _495;
  if (_494 != 0)
    goto <bb 28>;
  else
    goto <bb 29>;

Should we retain the correct order when we swap back to make it canonical?

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78200

--- Comment #18 from rguenther at suse dot de <rguenther at suse dot de> ---
On Tue, 15 Nov 2016, venkataramanan.kumar at amd dot com wrote:

> If red_cost < 0  is false then checking for arc->ident == AT_UPPER first. This
> is better, since we know red_cost >0 will always be true.

red_cost can be zero.  The "bad" order is best (just slower for some
still unknown reason).
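
The zero case can be made concrete with a short C sketch. Here shortcut_check is a hypothetical variant that drops the red_cost > 0 test once red_cost < 0 is known to be false; it is not something GCC generates, and AT_LOWER == 1 and AT_UPPER == 2 are assumed from the compares in the dumps.

```c
#include <assert.h>

enum { AT_LOWER = 1, AT_UPPER = 2 };   /* assumed from the dumps */

/* The full check as quoted in the thread. */
static int full_check(long red_cost, int ident)
{
    return (red_cost < 0 && ident == AT_LOWER)
        || (red_cost > 0 && ident == AT_UPPER);
}

/* Hypothetical shortcut: treats "not negative" as "positive". */
static int shortcut_check(long red_cost, int ident)
{
    if (red_cost < 0)
        return ident == AT_LOWER;
    return ident == AT_UPPER;
}

/* Count the inputs in [lo, hi] x {AT_LOWER, AT_UPPER} where the two differ. */
static int disagreements(long lo, long hi)
{
    int count = 0;
    for (long red_cost = lo; red_cost <= hi; red_cost++)
        for (int ident = AT_LOWER; ident <= AT_UPPER; ident++)
            if (full_check(red_cost, ident) != shortcut_check(red_cost, ident))
                count++;
    return count;
}

/* The only mismatch is red_cost == 0 with ident == AT_UPPER. */
static void check_zero_case(void)
{
    assert(full_check(0, AT_UPPER) == 0);
    assert(shortcut_check(0, AT_UPPER) == 1);
}
```

So the red_cost > 0 compare cannot simply be dropped on the path where red_cost < 0 is false; only the order of the two compares is open to choice.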

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78200

--- Comment #19 from Venkataramanan <venkataramanan.kumar at amd dot com> ---
(In reply to [hidden email] from comment #18)

> > If red_cost < 0  is false then checking for arc->ident == AT_UPPER first. This
> > is better, since we know red_cost >0 will always be true.
>
> red_cost can be zero.  The "bad" order is best (just slower for some
> still unknown reason).

It could be the case that red_cost is always > 0 but arc->ident is not
AT_UPPER, so checking like this (arc->ident == AT_UPPER && red_cost > 0) is
faster.

The static prediction also says that red_cost > 0 is true 88% of the time.

  if (red_cost_86 < 0)
    goto <bb 17>;
  else
    goto <bb 18>;
;;    succ:       17 [36.0%]  (TRUE_VALUE,EXECUTABLE)
;;                18 [64.0%]  (FALSE_VALUE,EXECUTABLE)

;;   basic block 17, loop depth 2, count 0, freq 684, maybe hot
;;    prev block 16, next block 18, flags: (NEW, REACHABLE, VISITED)
;;    pred:       16 [36.0%]  (TRUE_VALUE,EXECUTABLE)
  if (_23 == 1)
    goto <bb 20>;
  else
    goto <bb 21>;
;;    succ:       20 [34.0%]  (TRUE_VALUE,EXECUTABLE)
;;                21 [66.0%]  (FALSE_VALUE,EXECUTABLE)

;;   basic block 18, loop depth 2, count 0, freq 1217, maybe hot
;;    prev block 17, next block 19, flags: (NEW, REACHABLE, VISITED)
;;    pred:       16 [64.0%]  (FALSE_VALUE,EXECUTABLE)
  if (red_cost_86 > 0)
    goto <bb 19>;
  else
    goto <bb 21>;
;;    succ:       19 [87.7%]  (TRUE_VALUE,EXECUTABLE) <== ~88%
;;                21 [12.3%]  (FALSE_VALUE,EXECUTABLE)

;;   basic block 19, loop depth 2, count 0, freq 1067, maybe hot
;;    prev block 18, next block 20, flags: (NEW, REACHABLE, VISITED)
;;    pred:       18 [87.7%]  (TRUE_VALUE,EXECUTABLE)
  if (_23 == 2)
    goto <bb 20>;
  else
    goto <bb 21>;
;;    succ:       20 [34.0%]  (TRUE_VALUE,EXECUTABLE)
;;                21 [66.0%]  (FALSE_VALUE,EXECUTABLE)
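
The block frequencies in the dump above follow from the edge probabilities by simple multiplication, which a few lines of C can check. The constants are copied from the dump; the function names are hypothetical, and treating the guessed probabilities as independent is an assumption of this sketch, not a statement about how GCC uses them.

```c
#include <assert.h>

/* Edge probabilities copied from the dump. */
static const double P_NEGATIVE     = 0.360;  /* red_cost < 0 is true */
static const double P_NOT_NEGATIVE = 0.640;  /* red_cost < 0 is false */
static const double P_POSITIVE     = 0.877;  /* red_cost > 0, given not negative */
static const double P_IDENT_MATCH  = 0.340;  /* _23 == 1 or _23 == 2 */

/* Frequency a successor block inherits over one edge. */
static double edge_freq(double block_freq, double prob)
{
    return block_freq * prob;
}

/* Estimated chance that one evaluation of the check reaches the
   block that stores into perm[] (bb 20 in the dump). */
static double prob_true_block(void)
{
    return P_NEGATIVE * P_IDENT_MATCH
         + P_NOT_NEGATIVE * P_POSITIVE * P_IDENT_MATCH;
}

/* bb 18 has freq 1217; 87.7% of that should be close to bb 19's freq 1067. */
static void check_against_dump(void)
{
    double f = edge_freq(1217.0, P_POSITIVE);
    assert(f > 1066.5 && f < 1067.5);
}
```

With these numbers prob_true_block() comes out at about 0.313, so under the guessed profile roughly a third of the evaluations are expected to reach the store block.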