[PATCH v4, rs6000] Replace X-form addressing with D-form addressing in new pass for Power9

classic Classic list List threaded Threaded
2 messages Options
Reply | Threaded
Open this post in threaded view
|

[PATCH v4, rs6000] Replace X-form addressing with D-form addressing in new pass for Power9

Kelvin Nilsen-2

This patch adds a new optimization pass for rs6000 targets.

This new pass scans existing rtl expressions and replaces X-form loads and stores with rtl expressions that favor selection of the D-form instructions in contexts for which the D-form instructions are preferred.  The new pass runs after the RTL loop optimizations since loop unrolling often introduces opportunities for beneficial replacements of X-form addressing instructions.

For each of the new tests, multiple X-form instructions are replaced with D-form instructions, some addi instructions are replaced with add instructions, and some addi instructions are eliminated.  The typical improvement for the included tests is a decrease of 4.28% to 12.12% in the number of instructions executed on each iteration of the loop.  The optimization has not shown measurable improvement on specmark tests, presumably because the typical loops that are benefited by this optimization are memory bounded and this optimization does not eliminate memory loads or stores.  However, it is anticipated that multi-threaded workloads and measurements of total power and cooling costs for heavy server workloads would benefit.

This version 4 patch responds to feedback and numerous suggestions by Segher:

  1. Further improvements to comments and discussion of computational complexity.

  2. Changed the name of insn_sequence_no to luid.

  3. Fixed some typos in comments.

  4. Added macro-defined constants to enforce upper bounds on the sizes (and number of required iterations) for certain data structures.  The intent is to bound compile time for programs that represent large numbers of opportunities for D-form replacements.  This optimization pass ignores  parts of a source program that exceed these macro-defined size limits.

In a separate mail, I have sent discussion regarding the behavior of preceding passes and how this behavior relates to this new pass.

I have built and regression tested this patch on powerpc64le-unknown-linux target with no regressions.

Is this ok for trunk?

gcc/ChangeLog:

2019-10-25  Kelvin Nilsen  <[hidden email]>

        * config/rs6000/rs6000-p9dform.c: New file.
        * config/rs6000/rs6000-passes.def: Add pass_insert_dform.
        * config/rs6000/rs6000-protos.h
        (rs6000_target_supports_dform_offset_p): New function prototype.
        (make_pass_insert_dform): Likewise.
        * config/rs6000/rs6000.c (rs6000_target_supports_dform_offset_p):
        New function.
        * config/rs6000/t-rs6000 (rs6000-p9dform.o): New build target.
        * config.gcc: Add rs6000-p9dform.o object file.

gcc/testsuite/ChangeLog:

2019-10-25  Kelvin Nilsen  <[hidden email]>

        * gcc.target/powerpc/p9-dform-0.c: New test.
        * gcc.target/powerpc/p9-dform-1.c: New test.
        * gcc.target/powerpc/p9-dform-10.c: New test.
        * gcc.target/powerpc/p9-dform-11.c: New test.
        * gcc.target/powerpc/p9-dform-12.c: New test.
        * gcc.target/powerpc/p9-dform-13.c: New test.
        * gcc.target/powerpc/p9-dform-14.c: New test.
        * gcc.target/powerpc/p9-dform-15.c: New test.
        * gcc.target/powerpc/p9-dform-2.c: New test.
        * gcc.target/powerpc/p9-dform-3.c: New test.
        * gcc.target/powerpc/p9-dform-4.c: New test.
        * gcc.target/powerpc/p9-dform-5.c: New test.
        * gcc.target/powerpc/p9-dform-6.c: New test.
        * gcc.target/powerpc/p9-dform-7.c: New test.
        * gcc.target/powerpc/p9-dform-8.c: New test.
        * gcc.target/powerpc/p9-dform-9.c: New test.
        * gcc.target/powerpc/p9-dform-generic.h: New test.

Index: gcc/config/rs6000/rs6000-p9dform.c
===================================================================
--- gcc/config/rs6000/rs6000-p9dform.c (nonexistent)
+++ gcc/config/rs6000/rs6000-p9dform.c (working copy)
@@ -0,0 +1,1763 @@
+/* Subroutines used to transform array subscripting expressions into
+   forms that are more amenable to d-form instruction selection for p9
+   little-endian VSX code.
+   Copyright (C) 1991-2019 Free Software Foundation, Inc.
+
+   This file is part of GCC.
+
+   GCC is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published
+   by the Free Software Foundation; either version 3, or (at your
+   option) any later version.
+
+   GCC is distributed in the hope that it will be useful, but WITHOUT
+   ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
+   or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public
+   License for more details.
+
+   You should have received a copy of the GNU General Public License
+   along with GCC; see the file COPYING3.  If not see
+   <http://www.gnu.org/licenses/>.  */
+
+#include "config.h"
+#include "system.h"
+#include "coretypes.h"
+#include "backend.h"
+#include "rtl.h"
+#include "tree.h"
+#include "memmodel.h"
+#include "df.h"
+#include "tm_p.h"
+#include "ira.h"
+#include "print-tree.h"
+#include "varasm.h"
+#include "explow.h"
+#include "expr.h"
+#include "output.h"
+#include "tree-pass.h"
+#include "rtx-vector-builder.h"
+#include "cfgloop.h"
+
+#include "insn-config.h"
+#include "recog.h"
+
+#include "print-rtl.h"
+#include "tree-pretty-print.h"
+
+#include "genrtl.h"
+
+/* This pass transforms array indexing expressions from a form that
+   favors selection of X-form instructions into a form that favors
+   selection of D-form instructions.
+
+   Showing favor for D-form instructions is especially important when
+   targeting Power9, as the Power9 architecture added a number of new
+   D-form instruction capabilities.
+
+   Consider, for example, the following loop, excerpted from an actual
+   program:
+
+    double sacc, x[], y[], z[];
+    sacc = 0.00;
+    for (unsigned long long int i = 0; i < N; i++) {
+      z[i] = x[i] * y[i];
+      sacc += z[i];
+    }
+
+   Compile this program with the following gcc options which enable both
+   vectorization and loop unrolling:
+    -m64 -fdump-rtl-all-details -mcpu=power9 -mtune=power9 -funroll-loops -O3
+
+   Without this pass, this loop is represented by the following:
+
+   lxvx:       16
+ addi: 8
+ xvmuldp: 8
+ stxvx: 8
+ fmr: 8
+ xxpermdi: 8
+ fadd:       16
+ bdnz: 1
+      ___
+      total:   73 instructions
+
+.L3:
+ lxvx 0,29,11
+ lxvx 12,30,11
+ addi 12,11,16
+ addi 0,11,48
+ addi 5,11,64
+ addi 9,11,32
+ addi 6,11,80
+ addi 7,11,96
+ addi 8,11,112
+ lxvx 2,29,12
+ lxvx 3,30,12
+ lxvx 4,29,0
+ lxvx 5,30,0
+ lxvx 10,30,9
+ lxvx 11,29,5
+ xvmuldp 6,0,12
+ lxvx 13,30,5
+ lxvx 8,29,9
+ lxvx 27,29,6
+ lxvx 28,30,6
+ xvmuldp 7,2,3
+ lxvx 29,29,7
+ lxvx 30,30,7
+ xvmuldp 9,4,5
+ lxvx 3,30,8
+ lxvx 0,29,8
+ xvmuldp 8,8,10
+ xvmuldp 10,11,13
+ xvmuldp 11,27,28
+ xxpermdi 26,6,6,3
+ fmr 2,6
+ stxvx 6,31,11
+ xvmuldp 12,29,30
+ addi 11,11,128
+ fadd 1,26,1
+ xxpermdi 26,7,7,3
+ stxvx 7,31,12
+ fmr 27,7
+ xvmuldp 0,0,3
+ xxpermdi 30,9,9,3
+ fmr 31,9
+ stxvx 8,31,9
+ xxpermdi 13,10,10,3
+ xxpermdi 28,8,8,3
+ stxvx 9,31,0
+ stxvx 10,31,5
+ fadd 1,2,1
+ fmr 2,10
+ xxpermdi 3,11,11,3
+ stxvx 11,31,6
+ fmr 4,11
+ fmr 29,8
+ xxpermdi 5,12,12,3
+ fmr 6,12
+ stxvx 12,31,7
+ xxpermdi 9,0,0,3
+ fmr 8,0
+ fadd 7,26,1
+ stxvx 0,31,8
+ fadd 10,27,7
+ fadd 11,28,10
+ fadd 12,29,11
+ fadd 26,30,12
+ fadd 27,31,26
+ fadd 30,13,27
+ fadd 31,2,30
+ fadd 0,3,31
+ fadd 28,4,0
+ fadd 29,5,28
+ fadd 13,6,29
+ fadd 1,9,13
+ fadd 1,8,1
+ bdnz .L3
+
+   With this pass, the same loop is represented by:
+
+   lxvx:        4
+ lxv:       12
+ addi: 2
+ add: 3
+ xvmuldp: 8
+ stxvx: 2
+ stxv: 6
+ fmr: 8
+ xxpermdi: 8
+ fadd:       16
+ bdnz: 1
+      ___
+      total:   70 instructions
+
+.L3:
+ lxvx 0,29,6
+ lxvx 12,30,6
+ addi 10,6,16
+ add 7,29,10
+ add 8,30,10
+ lxvx 2,29,10
+ lxvx 3,30,10
+ add 11,31,10
+ lxv 4,16(7)
+ lxv 9,16(8)
+ xvmuldp 6,0,12
+ lxv 13,64(8)
+ lxv 5,32(7)
+ lxv 28,32(8)
+ lxv 11,64(7)
+ xvmuldp 7,2,3
+ lxv 30,80(7)
+ lxv 12,80(8)
+ lxv 31,48(8)
+ lxv 10,48(7)
+ xvmuldp 8,4,9
+ lxv 3,96(8)
+ lxv 0,96(7)
+ xvmuldp 9,5,28
+ xvmuldp 11,11,13
+ xxpermdi 29,6,6,3
+ fmr 4,6
+ stxvx 6,31,6
+ xvmuldp 12,30,12
+ addi 6,6,128
+ fadd 1,29,1
+ xxpermdi 28,7,7,3
+ fmr 29,7
+ stxvx 7,31,10
+ xvmuldp 10,10,31
+ xxpermdi 30,8,8,3
+ fmr 31,8
+ stxv 8,16(11)
+ xvmuldp 0,0,3
+ xxpermdi 5,11,11,3
+ fmr 6,11
+ xxpermdi 13,9,9,3
+ fadd 1,4,1
+ stxv 11,64(11)
+ xxpermdi 7,12,12,3
+ fmr 8,12
+ stxv 12,80(11)
+ fmr 2,9
+ xxpermdi 3,10,10,3
+ fmr 4,10
+ stxv 9,32(11)
+ stxv 10,48(11)
+ fadd 28,28,1
+ xxpermdi 9,0,0,3
+ fmr 10,0
+ stxv 0,96(11)
+ fadd 11,29,28
+ fadd 29,30,11
+ fadd 12,31,29
+ fadd 30,13,12
+ fadd 31,2,30
+ fadd 13,3,31
+ fadd 2,4,13
+ fadd 1,5,2
+ fadd 0,6,1
+ fadd 3,7,0
+ fadd 4,8,3
+ fadd 5,9,4
+ fadd 1,10,5
+ bdnz .L3
+
+   The optimized loop body replaces 12 lxvx instructions with lxv
+   instructions, 6 stxvx instructions with stxv, and has 3 fewer add
+   operations.
+
+   This pass runs immediately after pass_loop2.  Loops have already
+   been unrolled.  The pass searches for sequences of code of the following
+   form.  These code sequences often appear within the expanded loop bodies
+   that result from unrolling.  The memory access patterns below match
+   both load and store instructions.  The set of memory operations
+   that derive from the same originating expression are grouped together
+   by this algorithm into a collection identified within the code as an
+   equivalence class.
+
+   A0: *(array_base + offset)
+   ;; The above is known as an originating access to memory.
+
+   Aij: offset += constant (i, j)
+   ;; Between consecutive accesses to memory, there may appear zero or
+   ;; more constant adjustments to the memory offset subexpression.
+
+   Ai: *(array_base + offset)
+   ;; The memory address for each subsequent access to memory differs
+   ;; from the originating memory access by a constant offset, which
+   ;; is computed by adding together all of the preceding constant
+   ;; (i,j) values.
+
+   ;; In any given equivalence class, there may be multiple subsequent
+   ;; memory accesses, identifed as A2, A3, ... AN, and there may be
+   ;; multiple constant adjustments to the offset expression between
+   ;; each pair Ai-1 and Ai where the N intervening constant
+   ;; adjustments are identified as Aij for j ranging from 0 to N-1.
+
+   ;; It is required that each element of the matched pattern dominate
+   ;; the element that follows.  In other words, the flow through the
+   ;; various matched elements must be unconditional.  Otherwise, the
+   ;; matched elements cannot be considered to reside within the same
+   ;; equivalence class for purposes of this optimization.
+
+   This pass replaces the above-matched sequences with:
+
+   Ai: derived_pointer = array_base + offset
+       *(derived_pointer)
+
+   Aij: leave these alone.  expect that subsequent optimization deletes
+        this code as it may become dead (since we don't use the
+        indexing expression following our code transformations.)
+
+   Ai:
+   *(derived_pointer + constant_i)
+     (where constant_i equals sum of constant (n,j) for all n from 1
+      to i paired with all j from 1 to Kn,
+
+   For example, if the original code is of the form
+     int offset = some_var;
+     x0 = base[offset];
+     offset -= 8;
+     x1 = base[offset];
+     other_offset = offset + 16;
+     x2 = base[other_offset]
+
+   This is transformed into
+
+     iv_ptr = base+some_var;
+     x0 = iv_ptr[0];
+     x1 = iv_ptr[-8];
+     x2 = iv_ptr[8];
+
+   This pass makes no effort to assure that the calculated iv_ptr
+   value points into the middle of the range of addresses to be
+   spanned by the pointer.  We simply choose the value of the
+   addressing expression that dominates all the other addressing
+   instructions within the same equivalence class.  In theory, a more
+   judicious choice of the calculated iv_ptr value can result in fewer
+   instructions not matching the d-form addressing mode because the
+   constant offset exceeds the reach of the instruction's encoding
+   limits.  However, this is very rare and the extra effort required
+   to optimize the choice of the computed iv_ptr value is deemed not
+   worth the effort at the current time.
+
+   Note that there may be multiple equivalence classes, each
+   associated with the same or possibly a different array_base value
+   within each function that is processed by this optimization pass.  */
+
+/* This is based on the union-find logic in web.c.  web_entry_base is
+   defined in df.h.  */
+class indexing_web_entry: public web_entry_base
+{
+ public:
+  rtx_insn *insn;
+  basic_block bb;
+
+  /* A unique sequence number, identified as the local unique id, or
+     luid, is assigned to each instruction for the purpose of
+     simplifying domination tests.  Within each basic block, local
+     unique id numbers are assigned in strictly increasing order.
+     Thus, for any two instructions known to reside in the same basic
+     block, the instruction with a lower luid is known to dominate the
+     instruction with a higher luid.  */
+  unsigned int luid;
+
+  /* If this insn is relevant, it is a load or store with a memory
+     address that is comprised of a base pointer (e.g. the address of
+     an array or array slice) and an index expression (e.g. an index
+     within the array).  By convention, the base expression is assumed
+     to be the left argument to the addressing arithmetic plus
+     operator and the index expression is assumed to be the right
+     argument.  Symmetric presentation of the same two arguments of
+     the plus operator will not be recognized as belonging to the same
+     equivalence class.  The original_base_use and original_index_use
+     fields represent the numbers of the instructions that define the
+     base and index values which are summed together with a constant
+     value to determine the value of this instruction's memory
+     address.  */
+  unsigned int original_base_use;
+  unsigned int original_index_use;
+
+  /* If this insn is relevant, the register assigned by insn
+     original_base_use is original_base_reg.  The insn assigned by insn
+     original_index_use is original_index_reg.  */
+  unsigned int original_base_reg;
+  unsigned int original_index_reg;
+
+  /* If this insn is_relevant, this is the constant that is added to
+     the originating expression to calculate the value of this insn's
+     memory address.  */
+  int base_delta;
+  int index_delta;
+
+  /* If this insn is relevant, it belongs to an equivalence class.
+     The equivalence classes are identified by the definitions that
+     define the inputs to this insn.   */
+  unsigned int base_equivalence_hash;
+  unsigned int index_equivalence_hash;
+
+  /* When multiple insns fall within the same equivalence class, they
+     are linked together through this field.  The value UINT_MAX
+     represents the end of this list.  */
+  unsigned int equivalent_sibling;
+
+  /* Only instructions that represent loads or stores for which the
+     memory address computation is in a particular simple form are
+     considered relevant to this d-form optimization pass.
+
+     If a particular entry is identified as is_relevant == false, the
+     values of the following fields are all undefined: is_load,
+     is_store, is_originating, original_base_use, original_index_use,
+     original_base_reg, original_index_reg, base_delta, index_delta,
+     base_equivalence_hash, index_equivalence_hash, and
+     equivalent_sibling.  */
+  unsigned int is_relevant : 1;
+  unsigned int is_load : 1;
+  unsigned int is_store : 1;
+  unsigned int is_originating : 1;
+};
+
+/* How many relevant uses are contained within this function?  A
+   relevant insn is a memory store or load insn for which the memory
+   address is computed as the sum or difference of a base address and
+   an index expression and there is exactly one definition of the base
+   expression and there is exactly one definition of the index
+   expression.  The complexity of this pass is determined in part by
+   this number. */
+static unsigned int num_relevant_insns;
+
+/* What is the largest number of definitions that reach relevant
+   uses within this function?  The complexity of this pass
+   is determined in part by this number.  */
+static unsigned int max_use_defs;
+
+/* How many otherwise relevant instructions were rejected due to
+   overflow of the MAX_RELEVANT_INSNS or BOUND_MAX_USE_DEFS
+   limitations?  */
+static unsigned int rejected_otherwise_relevant;
+
+/* The complexity of this pass is O (num_relevant_insns *
+   max_use_defs).  If num_relevant_insns > MAX_RELEVANT_INSNS, suspend
+   further optimization so as to manage compilation complexity.  */
+#define MAX_RELEVANT_INSNS 128
+
+/* The complexity of this pass is O (num_relevant_insns *
+   max_use_defs).  If max_use_defs > BOUND_MAX_USE_DEFS, suspend
+   further optimization so as to manage compilation complexity.  */
+#define BOUND_MAX_USE_DEFS 8
+
+/* A relevant instruction is expected to make use of no more than 3
+   inputs: the base register, the index register, and, for store
+   operations, the register holding the value to be stored.  If an
+   instruction has more uses, consider that instruction to be not
+   relevant.  */
+#define BOUND_RELEVANT_USES 3
+
+/* Count how many definitions reach the use that is represented by the
+   DEF_LINK argument.  Update max_use_defs if returned count exceeds
+   previously encountered maximum count value.  */
+static unsigned int
+count_links (struct df_link *def_link)
+{
+  unsigned int count;
+  for (count = 0; def_link != NULL; count++)
+    def_link = def_link->next;
+  if (count > max_use_defs)
+    max_use_defs = count;
+  return count;
+}
+
+/* Helper comparison function for use by qsort.  */
+static int
+int_compare (const void *v1, const void *v2)
+{
+  const int *i1 = (const int *) v1;
+  const int *i2 = (const int *) v2;
+  return *i1 - *i2;
+}
+
+/* Calculate the hash value for the use represented by DEF_LINK, given
+   that COUNT definitions are known to reach this use.
+
+   Complexity is n*log(n) (for qsort) where n is COUNT.  */
+static unsigned int
+help_hash (unsigned int count, struct df_link *def_link)
+{
+  int *ids;
+  int i = 0;
+
+  ids = (int *) alloca (count * sizeof (ids[0]));
+
+  while (def_link != NULL)
+    {
+      ids[i++] = DF_REF_ID (def_link->ref);
+      def_link = def_link->next;
+    }
+
+  /* sort to put ids in ascending order. */
+  qsort ((void *) ids, count, sizeof (ids[0]), int_compare);
+
+  unsigned int result = 0;
+  for (unsigned i = 0; i < count; i++)
+    {
+      result = (result << 6) ^ ((result >> 28) & 0x0f);
+      result += ids[i];
+    }
+  return result;
+}
+
+/* Calculate a hash code that represents all of the definitions that
+   contribute to a given variable's computed value.  */
+static unsigned int
+equivalence_hash (struct df_link *def_link)
+{
+  unsigned int count = count_links (def_link);
+  return help_hash (count, def_link);
+}
+
+/* Set *header to point to the chain of originating definitions
+   corresponding to USE.  */
+static void
+overwrite_defs_header (indexing_web_entry *insn_entry, unsigned int uid,
+       df_ref use, struct df_link **header)
+{
+  struct df_link *def_link = DF_REF_CHAIN (use);
+  unsigned int uid2;
+
+  /* If there is no def, or the definition is artificial, or if there
+     are multiple definitions, this is an originating use.  */
+  if (!def_link || !def_link->ref
+      || DF_REF_IS_ARTIFICIAL (def_link->ref) || def_link->next)
+    *header = def_link;
+  else if ((uid2 = insn_entry[uid].original_base_use) > 0)
+    {
+      df_ref use2;
+      rtx_insn *insn2 = insn_entry[uid2].insn;
+      struct df_insn_info *insn_info2 = DF_INSN_INFO_GET (insn2);
+      use2 = DF_INSN_INFO_USES (insn_info2);
+      if (use2)
+ *header = DF_REF_CHAIN (use2);
+      else
+ *header = NULL;
+    }
+  else
+    *header = NULL;
+}
+
+/* Given that UID represents a "relevant" instruction, overwrite
+   *insn_base with a pointer to the chain of originating definitions
+   for the base expression.  Overwrite *insn_index with a pointer
+   to the chain of originating definitions for the index expression.  */
+static void
+help_find_defs (indexing_web_entry *insn_entry,
+ unsigned int uid, rtx base_reg, rtx index_reg,
+ struct df_link **insn_base, struct df_link **insn_index)
+{
+  rtx_insn *insn = insn_entry[uid].insn;
+  struct df_insn_info *insn_info = DF_INSN_INFO_GET (insn);
+  df_ref use;
+
+  /* Iterate over the uses consumed by this insn: loop iterates at
+     most 3 times, once for use representing the base register, once
+     for use representing the index register, and, for store
+     operations, once for the use representing the value to be stored
+     to memory.  */
+  FOR_EACH_INSN_INFO_USE (use, insn_info)
+    {
+      if (rtx_equal_p (DF_REF_REG (use), base_reg))
+ overwrite_defs_header (insn_entry, uid, use, insn_base);
+      else if (rtx_equal_p (DF_REF_REG (use), index_reg))
+ overwrite_defs_header (insn_entry, uid, use, insn_index);
+    }
+}
+
+/* Given that INSN is known to be relevant, find the linked list of
+   definitions that define the base register and the the linked list
+   of definitions that define the index register, and overwrite
+   *INSN_BASE and *INSN_INDEX with pointers to the two linked lists
+   respectively.
+
+   Since INSN is "relevant", it is known to represent a load or store
+   operation and the associated memory address is known to be
+   represented by a sum or difference of a base register and an index
+   offset register.  */
+static void
+find_defs (indexing_web_entry *insn_entry, rtx_insn *insn,
+   struct df_link **insn_base, struct df_link **insn_index)
+{
+  unsigned int uid = INSN_UID (insn);
+  rtx body = PATTERN (insn);
+  rtx mem = NULL;
+
+  gcc_assert (GET_CODE (body) == SET);
+
+  if (MEM_P (SET_SRC (body)))
+    mem = XEXP (SET_SRC (body), 0);
+  else if (MEM_P (SET_DEST (body)))
+    mem = XEXP (SET_DEST (body), 0);
+  else
+    gcc_unreachable ();
+
+  enum rtx_code code = GET_CODE (mem);
+  gcc_assert ((code == PLUS) || (code == MINUS));
+
+  rtx base_reg = XEXP (mem, 0);
+  rtx index_reg = XEXP (mem, 1);
+  if (REG_P (base_reg) && REG_P (index_reg))
+    help_find_defs (insn_entry, uid,
+    base_reg, index_reg, insn_base, insn_index);
+}
+
+/* Return non-zero if and only if USE represents a compile-time constant.  */
+static bool
+represents_constant_p (df_ref use)
+{
+  struct df_link *def_link = DF_REF_CHAIN (use);
+
+  /* If there is no definition, or the definition is artificial,
+     or if there are multiple definitions, this is an originating use
+     which is not a constant.  */
+  if (!def_link || !def_link->ref
+      || DF_REF_IS_ARTIFICIAL (def_link->ref) || def_link->next)
+    return false;
+  else
+    {
+      /* Consult the instruction that provides definition of this use.  */
+      rtx def_insn = DF_REF_INSN (def_link->ref);
+      rtx body = PATTERN (def_insn);
+      if (CONST_INT_P (body))
+ return true;
+      else if (GET_CODE (body) == SET)
+ {
+  /* Recurse on the uses that define the value of this definition.  */
+  struct df_insn_info *inner_insn_info = DF_INSN_INFO_GET (def_insn);
+  df_ref inner_use;
+  FOR_EACH_INSN_INFO_USE (inner_use, inner_insn_info)
+    {
+      if (!represents_constant_p (inner_use))
+ return false;
+    }
+  /* All uses are constant.  */
+  return true;
+ }
+      else
+ return false; /* Treat unrecognized codes as not constant.  */
+    }
+}
+
+/* This function helps analyze opportunities for copy propogation and
+   constant folding.
+
+   An originator represents the first point at which the value of
+   DEF_LINK is derived from potentially more than one input
+   definition, or the point at which DEF_LINK's value is defined by an
+   algebraic expression involving only constants,
+
+   If DEF_LINK's value depends on a constant combined with a single
+   variable or a simple propagation of a single variable, continue
+   the search for the originator by examining the origin of the source
+   variable's value.
+
+   The value of *ADJUSTMENT is overwritten with the constant value that is
+   added to the originator expression to obtain the value intended to
+   be represented by DEF_LINK.  In the case that find_true_originator
+   returns NULL, the value held in *ADJUSTMENT is undefined.
+
+   Returns NULL if there is no single true originator.  In general, the search
+   for an originator expression only spans SET operations that are
+   based on simple algebraic expressions, each of which involves no
+   more than one variable input.
+
+   Complexity: Linear in the number of definitions used by this
+   instruction (<= BOUND_MAX_USE_DEFS) multiplied (for recursive
+   calls) by the maximum depth of the potential copy propogation
+   chain.
+  */
+static rtx
+find_true_originator (struct df_link *def_link, long long int *adjustment)
+{
+  rtx def_insn = DF_REF_INSN (def_link->ref);
+
+  rtx inner_body = PATTERN (def_insn);
+  if (GET_CODE (inner_body) == SET)
+    {
+      struct df_insn_info *inner_insn_info = DF_INSN_INFO_GET (def_insn);
+      df_ref inner_use;
+
+      /* We're only happy with multiple uses if all but one represent
+ constant values.  */
+      int non_constant_uses = 0;
+      rtx result = NULL;
+      FOR_EACH_INSN_INFO_USE (inner_use, inner_insn_info)
+ {
+  if (!represents_constant_p (inner_use))
+    {
+      non_constant_uses++;
+      /* There should be only one non-constant use, and it should
+ satisfy find_true_originator.  */
+      struct df_link *def_link = DF_REF_CHAIN (inner_use);
+
+      /* If there is no definition, or the definition is artificial,
+ or there are multiple definitions, this is an
+ originating use.  */
+      if (!def_link || !def_link->ref
+  || DF_REF_IS_ARTIFICIAL (def_link->ref) || def_link->next)
+ result = def_insn;
+      else
+ result = find_true_originator (def_link, adjustment);
+    }
+ }
+
+      /* If non_constant_uses > 1, the value of result is not well
+ defined because it is overwritten during multiple iterations
+ of the above loop.  In the case that non_constant_uses > 1,
+ we ignore the result value computed above.  */
+      if (non_constant_uses == 1) {
+
+ /* If my SET looks like a simple register copy, or if it looks
+   like PLUS or MINUS of a constant and a register, this is
+   what we optimize.  Otherwise, punt.  */
+
+ if (result == NULL)
+  /* Doing constant arithmetic with unknown originator is not
+     useful.  */
+  return def_insn;
+
+ rtx source_expr = SET_SRC (inner_body);
+ int source_code = GET_CODE (source_expr);
+ if (source_code == PLUS)
+  {
+    rtx op1 = XEXP (source_expr, 0);
+    rtx op2 = XEXP (source_expr, 1);
+
+    if (CONST_INT_P (op1) && CONST_INT_P (op2))
+      *adjustment += INTVAL (op1);
+    else if (!CONST_INT_P (op1) && CONST_INT_P (op2))
+      *adjustment += INTVAL (op2);
+  }
+ else if (source_code == MINUS)
+  {
+    rtx op1 = XEXP (source_expr, 0);
+    rtx op2 = XEXP (source_expr, 1);
+
+    if (!CONST_INT_P (op1) && CONST_INT_P (op2))
+      *adjustment -= INTVAL (op1);
+    else
+      /* assumption is that *adjustment is added to a positive variable
+ expression, so don't optimize this rare condition.  */
+      result = def_insn;
+  }
+ else if (source_code != REG)
+  /* We don't handle ashift, UNSPEC, etc..  */
+  result = def_insn;
+ /* else, register copy expression does not impact adjustment.  */
+
+ return result;
+      }
+      else
+ /* Same behavior if there are too many non-constant inputs or if
+   all inputs are constant.  */
+ return def_insn;
+    }
+  else
+    /* This is not a SET.  It does not serve as a true originator. */
+    return NULL;
+}
+
+/* The size of the insn_entry array.  Note that this array does not
+   represent instructions created during this optimization pass.  */
+static unsigned int max_uid_at_start;
+
+/* Return true if and only if ELEMENT is on LIST.  This is linear in
+   the length of the list.  The list length is no longer than
+   BOUND_MAX_USE_DEFS.  */
+static bool
+in_use_list (struct df_link *list, struct df_link *element)
+{
+  while (list != NULL)
+    {
+      if (element->ref == list->ref)
+ return true;
+      list = list->next;
+    }
+  /* Got to end of list without finding element.  */
+  return false;
+}
+
+/* Return true iff the instruction represented by uid_1 is in the same
+   equivalence class as the instruction represented by uid_2.
+
+   Returns false generally in constant time (based on unequal hash
+   values).  Returning true, as currently implemented, requires
+   quadratic time in the number of definitions that contribute to
+   either the base or index expressions of the equivalence class.
+   Number of iterations is no greater than (BOUND_MAX_USE_DEFS *
+   BOUND_MAX_USE_DEFS).
+
+   To improve complexity to linear in the number of contributing
+   definitions, allocate and remember the sorted list of all
+   definitions for each relevant value.   */
+static bool
+equivalent_p (indexing_web_entry *insn_entry,
+      unsigned int uid_1, unsigned int uid_2)
+{
+  if ((insn_entry[uid_1].base_equivalence_hash !=
+       insn_entry[uid_2].base_equivalence_hash) ||
+      (insn_entry[uid_1].index_equivalence_hash !=
+       insn_entry[uid_2].index_equivalence_hash))
+    return false;
+
+  /* Hash codes match.  Check details.  */
+  rtx_insn *insn_1, *insn_2;
+  insn_1 = insn_entry[uid_1].insn;
+  insn_2 = insn_entry[uid_2].insn;
+
+  struct df_link *insn1_base_defs, *insn1_index_defs;
+  struct df_link *insn2_base_defs, *insn2_index_defs;
+
+  find_defs (insn_entry, insn_1, &insn1_base_defs, &insn1_index_defs);
+  find_defs (insn_entry, insn_2, &insn2_base_defs, &insn2_index_defs);
+
+  int base_count_1 = count_links (insn1_base_defs);
+  int index_count_1 = count_links (insn1_index_defs);
+  int base_count_2 = count_links (insn2_base_defs);
+  int index_count_2 = count_links (insn2_index_defs);
+
+  if ((base_count_1 != base_count_2) || (index_count_1 != index_count_2))
+    return false;
+
+  if (insn_entry [uid_1].original_base_reg
+      != insn_entry [uid_2].original_base_reg)
+    return false;
+  else if (insn_entry [uid_1].original_index_reg
+   != insn_entry [uid_2].original_index_reg)
+    return false;
+
+  /* Counts are the same.  Make sure elements match.  */
+  /* The following comparison code is quadratic in counts.  Improve to
+     n*log(n) by sorting the four arrays and comparing eleents pairwise.
+     Improve to linear by remembering the sorted contents of each of
+     the four arrays from when the hash values were first computed.
+     Since the typical sizes of the count variables are fairly small,
+     leave as is unless performance measurements justify increased
+     complexity.  */
+  while (insn1_base_defs != NULL)
+    {
+      if (!in_use_list (insn2_base_defs, insn1_base_defs))
+ return false;
+      insn1_base_defs = insn1_base_defs->next;
+    }
+  /* base patterns match, but stil need to consider index matches.  */
+  while (insn1_index_defs != NULL)
+    {
+      if (!in_use_list (insn2_index_defs, insn1_index_defs))
+ return false;
+      insn1_index_defs = insn1_index_defs->next;
+    }
+
+  return true;
+}
+
+/* Return true iff instruction E2 dominates instruction E1.  Note
+   that insn_dominated_by_p defined in ira.c is declared static and
+   requires initialization of auxilary data not computed in this
+   context. */
+static bool
+insn_dominated_by_p (indexing_web_entry *e1, indexing_web_entry *e2)
+{
+  basic_block bb1 = e1->bb;
+  basic_block bb2 = e2->bb;
+
+  if (bb1 == bb2)
+    return e2->luid <= e1->luid;
+  else
+    return dominated_by_p (CDI_DOMINATORS, bb1, bb2);
+}
+
+/* Confirm that every insn in the equivalence class that is
+   represented by the EQUIVALENCE_HASH [INDEX] linked list can be
+   represented using d-form memory addresses.  For any insn that
+   cannot be so represented, decrement the value of the
+   REPLACEMENT_COUNT argument.  Return the adjusted value of
+   REPLACEMENT_COUNT.
+
+   Complexity is linear in the size of the equivalence class.  */
+static unsigned int
+confirm_dform_insns (indexing_web_entry *insn_entry,
+     unsigned int *equivalence_hash, unsigned int index,
+     unsigned int the_dominator, unsigned replacement_count)
+{
+  unsigned int uid;
+  long long int dominator_delta = (insn_entry [the_dominator].base_delta
+   + insn_entry [the_dominator].index_delta);
+  for (uid = equivalence_hash [index]; uid != UINT_MAX;
+       uid = insn_entry [uid].equivalent_sibling)
+    {
+      if (uid != the_dominator)
+ {
+  long long int dominated_delta = (insn_entry [uid].base_delta
+   + insn_entry [uid].index_delta);
+  dominated_delta -= dominator_delta;
+
+  rtx_insn *insn = insn_entry [uid].insn;
+  rtx body = PATTERN (insn);
+  rtx mem;
+
+  gcc_assert (GET_CODE (body) == SET);
+
+  if (MEM_P (SET_SRC (body))) /* load */
+    {
+      mem = SET_SRC (body);
+      if (!rs6000_target_supports_dform_offset_p
+  (GET_MODE (mem), dominated_delta))
+ replacement_count--;
+    }
+  else
+    {
+      mem = SET_DEST (body); /* store */
+      if (!rs6000_target_supports_dform_offset_p
+  (GET_MODE (mem), dominated_delta))
+ replacement_count--;
+    }
+ }
+    }
+  return replacement_count;
+}
+
+/* Replace all xform insns in equivalence class with dform insns.
+
+   Complexity is linear in the size of the equivalence class.  */
+void replace_xform_insns (indexing_web_entry *insn_entry,
+  unsigned int *equivalence_hash,
+  unsigned int index,
+  unsigned int the_dominator)
+{
+  /* First, fix up the_dominator instruction.  */
+  rtx derived_ptr_reg = gen_reg_rtx (Pmode);
+  rtx_insn *insn = insn_entry [the_dominator].insn;
+  rtx body = PATTERN (insn);
+  rtx base_reg, index_reg;
+  rtx addr, mem;
+  rtx new_init_expr;
+
+  if (dump_file)
+    {
+      fprintf (dump_file,
+       "Endeavoring to replace originating insn %d: ", the_dominator);
+      print_inline_rtx (dump_file, insn, 2);
+      fprintf (dump_file, "\n");
+    }
+
+  gcc_assert (GET_CODE (body) == SET);
+  if (MEM_P (SET_SRC (body)))
+    {
+      /* originating instruction is a load */
+      mem = SET_SRC (body);
+      addr = XEXP (SET_SRC (body), 0);
+    }
+  else
+    { /* originating instruction is a store */
+      gcc_assert (MEM_P (SET_DEST (body)));
+      mem = SET_DEST (body);
+      addr = XEXP (SET_DEST (body), 0);
+    }
+
+  enum rtx_code code = GET_CODE (addr);
+  gcc_assert ((code == PLUS) || (code == MINUS));
+  base_reg = XEXP (addr, 0);
+  index_reg = XEXP (addr, 1);
+
+  if (code == PLUS)
+    new_init_expr = gen_rtx_PLUS (Pmode, base_reg, index_reg);
+  else
+    new_init_expr = gen_rtx_MINUS (Pmode, base_reg, index_reg);
+  new_init_expr = gen_rtx_SET (derived_ptr_reg, new_init_expr);
+
+  rtx_insn *new_insn = emit_insn_before (new_init_expr, insn);
+  set_block_for_insn (new_insn, BLOCK_FOR_INSN (insn));
+  INSN_CODE (new_insn) = -1; /* force re-recogniition. */
+  df_insn_rescan (new_insn);
+
+  if (dump_file)
+    {
+      fprintf (dump_file, "with insn %d: ", INSN_UID (new_insn));
+      print_inline_rtx (dump_file, new_insn, 2);
+      fprintf (dump_file, "\n");
+    }
+
+  /* If dominator_delta != 0, we need to make adjustments for dominator_delta
+     in the D-form constant offsets associated with the propagating
+     instructions.  */
+
+  rtx new_mem = gen_rtx_MEM (GET_MODE (mem), derived_ptr_reg);
+  MEM_COPY_ATTRIBUTES (new_mem, mem);
+  rtx new_expr;
+  if (insn_entry [the_dominator].is_load)
+    new_expr = gen_rtx_SET (SET_DEST (body), new_mem);
+  else
+    new_expr = gen_rtx_SET (new_mem, SET_SRC (body));
+
+  if (!validate_change (insn, &PATTERN(insn), new_expr, false))
+    { /* proposed change was rejected */
+      if (dump_file)
+ {
+  fprintf (dump_file,
+   "Dform optimization rejected by validate_change\n");
+  print_inline_rtx (dump_file, new_insn, 2);
+  fprintf (dump_file, "\n");
+ }
+    }
+  else if (dump_file)
+    {
+      fprintf (dump_file, "and with insn %d: ", INSN_UID (insn));
+      print_inline_rtx (dump_file, insn, 2);
+      fprintf (dump_file, "\n");
+    }
+
+  for (unsigned int uid = equivalence_hash [index]; uid != UINT_MAX;
+       uid = insn_entry [uid].equivalent_sibling)
+    {
+      if (uid != the_dominator)
+ {
+  long long int dominated_delta = (insn_entry [uid].base_delta
+   + insn_entry [uid].index_delta);
+  long long int dominator_delta
+    = (insn_entry [the_dominator].base_delta
+       + insn_entry [the_dominator].index_delta);
+  dominated_delta -= dominator_delta;
+
+  rtx_insn *insn = insn_entry [uid].insn;
+  rtx body = PATTERN (insn);
+  rtx mem;
+
+  if (dump_file)
+    {
+      fprintf (dump_file,
+       "Endeavoring to replace propagating insn %d: ", uid);
+      print_inline_rtx (dump_file, insn, 2);
+      fprintf (dump_file, "\n");
+    }
+
+  gcc_assert (GET_CODE (body) == SET);
+  if (MEM_P (SET_SRC (body))) /* load */
+    mem = SET_SRC (body);
+  else
+    mem = SET_DEST (body); /* store */
+
+  rtx ci = gen_rtx_raw_CONST_INT (Pmode, dominated_delta);
+  rtx addr_expr = gen_rtx_PLUS (Pmode,
+ derived_ptr_reg, ci);
+  rtx new_mem = gen_rtx_MEM (GET_MODE (mem), addr_expr);
+  MEM_COPY_ATTRIBUTES (new_mem, mem);
+
+  rtx new_expr;
+  if (insn_entry [uid].is_load)
+    new_expr = gen_rtx_SET (SET_DEST (body), new_mem);
+  else
+    new_expr = gen_rtx_SET (new_mem, SET_SRC (body));
+
+  if (!validate_change (insn, &PATTERN(insn), new_expr, false))
+    { /* proposed change was rejected */
+      if (dump_file)
+ {
+  fprintf (dump_file,
+   "Dform optimization rejected by validate_change\n");
+  print_inline_rtx (dump_file, new_expr, 2);
+  fprintf (dump_file, "\n");
+ }
+    }
+  else if (dump_file)
+    {
+      fprintf (dump_file, "with insn %d: ", INSN_UID (insn));
+      print_inline_rtx (dump_file, insn, 2);
+      fprintf (dump_file, "\n");
+    }
+ }
+    }
+}
+
+/* Organize all "relevant" instructions into equivalence classes.
+   Relevant instructions are instructions that load or store memory
+   where the memory address is represented by a sum or difference of a
+   base address register and an integer offset register.
+
+   An equivalence class holds all of the load and store operations
+   that refer to the same computed base and/or index variables plus or
+   minus some constant value.
+
+   All expressions in each equivalence class are replaced with d-form
+   instructions in the emitted code on P9 or above.
+
+   Calculation of hash functions is linear in the number of
+   definitions that potentially contribute to the computation of a
+   particular variable's value.
+
+   Assume hash table insertion and lookup is constant time (i.e. we
+   normally do not experience collision on hash values).
+
+   Complexity of of this function is O(m*n) where m is number of
+   equivalence classes and n is maximum number of entries in the
+   equivalence class.  Since each insn can be a member of only one
+   equivalence class, this is linear in the number of insns within
+   the function.  */
+static void
+build_and_fixup_equivalence_classes (indexing_web_entry *insn_entry)
+{
+  unsigned int i;
+  /* There can be no more equivalence classes than the total number of
+     instructions in the analyzed function.  Usually, there are far
+     fewer instructions because many of the instructions are not load
+     or store instructions, and some of those that are load and store
+     instructions may end up in the same equivalence class.  */
+  unsigned int *equivalence_hash =
+    (unsigned int *) alloca (max_uid_at_start * sizeof (unsigned int));
+
+  /* Initialize the equivalence_hash array.  */
+  for (i = 0; i < max_uid_at_start; i++)
+    equivalence_hash [i] = UINT_MAX;
+
+  /* Place each relevant instruction into an equivalence class, either
+     a class consisting only of itself, or a class that includes other
+     relevant instructions.
+
+     Complexity of this loop is O(m*n^2) where m is the number of relevant
+     instructions and n is number of definitions of the registers that
+     hold the base or index components of the memory operation's
+     address.  The number of iterations of the slow path through the
+     loop is less than m * (BOUND_MAX_USE_DEFS * BOUND_MAX_USE_DEFS).  */
+  for (unsigned int uid = 0; uid < max_uid_at_start; uid++)
+    {
+      if (insn_entry [uid].is_relevant)
+ {
+  unsigned int hash = ((insn_entry [uid].base_equivalence_hash
+ + insn_entry [uid].index_equivalence_hash)
+       % max_uid_at_start);
+
+  if (equivalence_hash [hash] == UINT_MAX)
+    { /* first mention of this class */
+      equivalence_hash [hash] = uid;
+      insn_entry [uid].equivalent_sibling = UINT_MAX;
+    }
+  else
+    {
+      while ((equivalence_hash [hash] != UINT_MAX)
+     && !equivalent_p (insn_entry, uid,
+       equivalence_hash [hash]))
+ hash = (hash + 1) % max_uid_at_start;
+
+      if (equivalence_hash [hash] != UINT_MAX)
+ {
+  /* Found an equivalent instruction. */
+  insn_entry [uid].equivalent_sibling =
+    equivalence_hash [hash];
+  equivalence_hash [hash] = uid;
+ }
+      else
+ {
+  /* Equivalence class doesn't yet exist.  */
+  equivalence_hash [hash] = uid;
+  insn_entry [uid].equivalent_sibling = UINT_MAX;
+ }
+    }
+ }
+    }
+
+  /* Scrutinize each equivalence class.  For any entries in the
+     equivalence class that are on conditional control flows (such
+     that they do not dominate the other entries), remove these
+     entries from the equivalence class.  */
+  for (unsigned int i = 0; i < max_uid_at_start; i++)
+    {
+      while (equivalence_hash [i] != UINT_MAX)
+ {
+  unsigned int the_dominator = equivalence_hash [i];
+  unsigned int uid;
+
+  if (dump_file && (dump_flags & TDF_DETAILS))
+    fprintf (dump_file, "Equivalence class consists of\n");
+
+  /* Find the dominator for this equivalence class.
+
+     Complexity of following loop body is O(m*n) where m is number
+     of equivalence classes and n is maximum number of entries
+     in the equivalence class.  */
+  for (uid = the_dominator; uid != UINT_MAX;
+       uid = insn_entry [uid].equivalent_sibling)
+    {
+      if (insn_dominated_by_p (&insn_entry [the_dominator],
+       &insn_entry [uid]))
+ the_dominator = uid;
+      if (dump_file && (dump_flags & TDF_DETAILS))
+ fprintf (dump_file, "  member: %d\n", uid);
+    }
+
+  unsigned int size_of_equivalence = 0;
+  unsigned int removed_partition = UINT_MAX;
+  unsigned int preceding_uid = UINT_MAX;
+  unsigned int next_uid;
+
+  /* Having found a dominator, remove from this equivalence
+     class any element that is not dominated by the_dominator.
+
+     Though this loop appears to be O(m*n), for m being number
+     of equivalence classes and n being maximum number of
+     entries in an equivalence class, it can also be described
+     as O(i), where i is the number of instructions in the
+     functions implementation.  Each iteration of the loop
+     deals with a distinct instruction.  */
+  for (uid = equivalence_hash [i]; uid != UINT_MAX; uid = next_uid)
+    {
+      next_uid = insn_entry [uid].equivalent_sibling;
+      if (!insn_dominated_by_p (&insn_entry [uid],
+ &insn_entry [the_dominator]))
+ {
+  /* insn uid thinks its in this equivalence class, but
+     it's not dominated by the_dominator, so remove it.  */
+  insn_entry [uid].equivalent_sibling = removed_partition;
+  removed_partition = uid;
+  if (preceding_uid == UINT_MAX)
+    equivalence_hash [i] = next_uid;
+  else
+    insn_entry [preceding_uid].equivalent_sibling = next_uid;
+ }
+      else
+ {
+  size_of_equivalence++;
+  preceding_uid = uid;
+ }
+    }
+
+  /* By same argument as above, confirm_dform_insns and
+     replace_xform_insns are also O(i), where i is the number
+     of instructions in the function.  */
+  if (confirm_dform_insns (insn_entry, equivalence_hash, i,
+   the_dominator, size_of_equivalence) > 1)
+    replace_xform_insns (insn_entry, equivalence_hash,
+ i, the_dominator);
+  else if (dump_file)
+    fprintf (dump_file,
+     "Abandoning dform optimization: too few dform insns\n");
+
+  /* if (removed_partition != UINT_MAX), need to reprocess the
+     contents of the removed_partition.  There may be
+     additional opportunity to optimize within the set of
+     insns that were not dominated by the selected dominator.
+
+     Each time through this loop, at least one dominator and
+     any instructions it dominates are "processed".  Anything
+     not dominated by the selected dominator remains in the
+     "removed partition".  The "removed partition" gets
+     smaller on each iteration, assuring eventual termination.  */
+  equivalence_hash [i] = removed_partition;
+ }
+    }
+}
+
+/* Assess whether the instruction represented by uid is relevant,
+   setting *is_relevant to true if so, and setting *is_originating to
+   true if this use is an originating definition.
+
+   If the insn is determined to be relevant and is_base is true, overwrite
+   the base_delta, original_base_reg, original_base_use, and
+   base_equivalence_hash fields of insn_entry[uid].
+
+   Otherwise, if the insn is determined to be relevant and is_base is
+   not true, overwrite the index_delta, original_index_reg,
+   original_index_use, and index_equivalence_hash fields of
+   insn_entry[uid].
+
+   In case the insn is not determined to be relevant, certain fields
+   fields of insn_entry[uid] may be overwritten with scratch values
+   that have no significance as these fields are not consulted
+   subequently in the case that the insn is not relevant.
+
+   Complexity: linear in the length of the number of entries on the
+   use-definition chain, as represented by argument USE.
+*/
+static void
+assess_use_relevance (bool is_base, rtx reg, bool *is_relevant,
+      bool *is_originating, df_ref use,
+      unsigned int uid, indexing_web_entry *insn_entry)
+{
+  if (dump_file && (dump_flags & TDF_DETAILS))
+    {
+      fprintf (dump_file, "Found use corresponding to %s\n",
+       is_base? "base": "index");
+      df_ref_debug (use, dump_file);
+    }
+  struct df_link *def_link = DF_REF_CHAIN (use);
+
+  /* If there is no definition, or the definition is artificial, or there
+     are multiple definitions, this is originating use.  */
+  if (!def_link || !def_link->ref
+      || DF_REF_IS_ARTIFICIAL (def_link->ref) || def_link->next)
+    {
+      if (dump_file && (dump_flags & TDF_DETAILS))
+ fprintf (dump_file, "Use is originating!\n");
+      *is_relevant = true;
+      *is_originating = true;
+      unsigned int hash = equivalence_hash (def_link);
+
+      if (is_base)
+ {
+  insn_entry[uid].base_delta = 0;
+  insn_entry[uid].original_base_reg = REGNO (reg);
+  insn_entry[uid].original_base_use = uid;
+  insn_entry[uid].base_equivalence_hash = hash;
+ }
+      else
+ {
+  insn_entry[uid].index_delta = 0;
+  insn_entry[uid].original_index_reg = REGNO (reg);
+  insn_entry[uid].original_index_use = uid;
+  insn_entry[uid].index_equivalence_hash = hash;
+ }
+    }
+  else
+    {
+      /* Only one definition.  Dig deeper.  */
+      long long int delta = 0;
+      rtx insn2 = find_true_originator (def_link, &delta);
+      if (insn2)
+ {
+  unsigned uid2 = INSN_UID (insn2);
+  df_ref use2;
+
+  if (dump_file  && (dump_flags & TDF_DETAILS))
+    fprintf (dump_file, "Use may propagate from %d\n", uid2);
+
+  struct df_insn_info *insn_info2 = DF_INSN_INFO_GET (insn2);
+
+  if (insn_info2)
+    use2 = DF_INSN_INFO_USES (insn_info2);
+  else
+    use2 = NULL;
+
+  if (!use2 || !DF_REF_NEXT_LOC (use2))
+    {
+      *is_originating = false;
+
+      rtx body = PATTERN (insn2);
+      gcc_assert (GET_CODE (body) == SET);
+      gcc_assert (REG_P (SET_DEST (body)));
+
+      if (is_base)
+ {
+  insn_entry[uid].original_base_reg = REGNO (SET_DEST (body));
+  insn_entry[uid].original_base_use  = uid2;
+  insn_entry[uid].base_delta = delta;
+ }
+      else /* !is_base means is_index.  */
+ {
+  insn_entry[uid].original_index_reg = REGNO (SET_DEST(body));
+  insn_entry[uid].original_index_use = uid2;
+  insn_entry[uid].index_delta = delta;
+ }
+
+      if (use2)
+ {
+  struct df_link *def_link = DF_REF_CHAIN (use2);
+  unsigned int hash = equivalence_hash (def_link);
+  *is_relevant = true;
+  if (is_base)
+    insn_entry[uid].base_equivalence_hash = hash;
+  else
+    insn_entry[uid].index_equivalence_hash = hash;
+ }
+      /* else, use is not relevant.  */
+
+      if (dump_file && (dump_flags & TDF_DETAILS))
+ fprintf (dump_file,
+ " propagates from originating insn %d"
+ " with delta: %lld\n", uid2, delta);
+    }
+  else if (dump_file && (dump_flags & TDF_DETAILS))
+    fprintf (dump_file, " Dependencies are too"
+     " complicated for this optimization\n");
+ }
+    }
+}
+
+/* Given that INSN represents a memory store or load operation, that
+   MEM is the expression that computes the address to or from which
+   memory is transferred and INSN_ENTRY holds the array representing
+   all of the indexing_web_entry structures associated with the
+   instructions of this function, set the is_relevant field of
+   INSN_ENTRY [UID] if the form of the memory address expression is
+   compatible with dform optimization.
+
+   If the insn is considered relevant, this function also initializes
+   the following fields of the corresponding insn_entry:
+
+ is_originating
+
+   If is_relevant and is_originating, we set:
+
+ original_base_reg
+ original_base_use
+ base_delta
+ base_equivalence_hash
+
+ original_index_reg = REGNO (index_reg);
+ original_index_use = uid;
+ index_delta = 0;
+ index_equivalence_hash
+
+   If is_relevant and !is_originating, we set:
+
+ original_index_reg
+ original_index_use
+ index_delta = (non-zero)
+ index_equivalence_hash (only if use2 != NULL)
+
+   Complexity is linear in the number of variables "used" by this
+   instruction, multiplied by the number of definitions of each
+   variable.  Loop iterations are no greater than
+   BOUND_RELEVANT_USES * BOUND_MAX_USE_DEFS.  */
+static void
+assess_relevance (rtx mem, rtx_insn *insn, indexing_web_entry *insn_entry)
+{
+  unsigned int uid = INSN_UID (insn);
+  rtx base_reg = XEXP (mem,0);
+  rtx index_reg = XEXP (mem, 1);
+
+  if (dump_file && (dump_flags & TDF_DETAILS))
+    {
+      fprintf (dump_file, " memory is base +/- index, ");
+      fprintf (dump_file, "base: ");
+      print_inline_rtx (dump_file, base_reg, 2);
+      fprintf (dump_file, "\n index: ");
+      print_inline_rtx (dump_file, index_reg, 2);
+      fprintf (dump_file, "\n");
+    }
+
+  if (REG_P (base_reg) && REG_P (index_reg))
+    {
+      struct df_insn_info *insn_info = DF_INSN_INFO_GET (insn);
+      /* Since insn is known to represent a sum or difference, this
+ insn is likely to use at least two input variables.  */
+
+      int num_base_defs = 0;
+      int num_index_defs = 0;
+      bool base_is_relevant = false;
+      bool index_is_relevant = false;
+      bool base_is_originating = false;
+      bool index_is_originating = false;
+      df_ref use;
+      int use_count = 0;
+
+      /* Iterate over the number of definitions used by this
+ instruction to find the definitions that correspond to the
+ base register and the index register.  */
+      FOR_EACH_INSN_INFO_USE (use, insn_info)
+ {
+  if (use_count++ >= BOUND_RELEVANT_USES)
+    {
+      /* Relevant insns generally have no more than 3 uses.  */
+      num_base_defs = 0;
+      num_index_defs = 0;
+      break;
+    }
+  else if (rtx_equal_p (DF_REF_REG (use), base_reg))
+    {
+      assess_use_relevance (true, base_reg, &base_is_relevant,
+   &base_is_originating,
+   use, uid, insn_entry);
+      num_base_defs++;
+    }
+  else if (rtx_equal_p (DF_REF_REG (use), index_reg))
+    {
+      assess_use_relevance (false, index_reg, &index_is_relevant,
+   &index_is_originating,
+   use, uid, insn_entry);
+      num_index_defs++;
+    }
+ }
+
+      /* This insn is only relevant if there is exactly one definition of
+ base and one definition of index and they are both considered to
+ be relevant.  */
+      if ((num_base_defs == 1) && (num_index_defs == 1)
+  && base_is_relevant && index_is_relevant
+  && (num_relevant_insns++ < MAX_RELEVANT_INSNS)
+  && (max_use_defs <= BOUND_MAX_USE_DEFS))
+ {
+  insn_entry[uid].is_relevant = true;
+  insn_entry[uid].is_originating =
+    (base_is_originating && index_is_originating);
+ }
+      else if (dump_file)
+ {
+  int rejection_count = 0;
+  fprintf (dump_file,
+   "Rejecting dform optimization of insn %d\n", uid);
+  if (num_base_defs != 1)
+    {
+      fprintf (dump_file, "Too %s (%d) base definitions\n",
+       (num_base_defs > 1)? "many": "few", num_base_defs);
+      rejection_count++;
+    }
+  if (num_index_defs != 1)
+    {
+      fprintf (dump_file, "Too %s (%d) index definitions\n",
+       (num_index_defs > 1)? "many": "few", num_index_defs);
+      rejection_count++;
+    }
+  if (!base_is_relevant)
+    {
+      fprintf (dump_file,
+       "The available base definition is not relevant\n");
+      rejection_count++;
+    }
+  if (!index_is_relevant)
+    {
+      fprintf (dump_file,
+       "The available index definition is not relevant\n");
+      rejection_count++;
+    }
+  if ((rejection_count == 0)
+      && (num_relevant_insns >= MAX_RELEVANT_INSNS))
+    {
+      fprintf (dump_file, "Too many relevant insns\n");
+      rejected_otherwise_relevant++;
+    }
+  else if (max_use_defs >= BOUND_MAX_USE_DEFS)
+    {
+      fprintf (dump_file, "Too many relevant use definitions\n");
+      rejected_otherwise_relevant++;
+    }
+ }
+      else
+ {
+  if ((num_base_defs == 1) && (num_index_defs == 1)
+      && base_is_relevant && index_is_relevant)
+    rejected_otherwise_relevant++;
+ }
+    }
+  else if (dump_file && (dump_flags & TDF_DETAILS))
+    fprintf (dump_file, " punting because base or index not registers\n");
+}
+
+
+/* Main entry point for this pass.
+
+   Complexity is linear in the number of instructions in the function
+   plus the complexity of build_and_fixup_equivalence_classes, which
+   is also linear in the number of instruction (since each instruction
+   is in no more than one equivalence class).  */
+unsigned int
+rs6000_insert_dform (function *fun)
+{
+  basic_block bb;
+  rtx_insn *insn, *curr_insn = 0;
+  indexing_web_entry *insn_entry;
+  unsigned int luid = 0;
+
+  num_relevant_insns = 0;
+  max_use_defs = 0;
+  rejected_otherwise_relevant = 0;
+
+  calculate_dominance_info (CDI_DOMINATORS);
+
+  /* Dataflow analysis for use-def chains.  */
+  df_set_flags (DF_RD_PRUNE_DEAD_DEFS);
+  df_chain_add_problem (DF_DU_CHAIN | DF_UD_CHAIN);
+  df_analyze ();
+
+  /* Since this pass creates new instructions, get_max_uid () may
+     return different values at different times during this pass.  The
+     insn_entry array represents only the instructions that were
+     present in this function's representation at the start of this
+     pass.  */
+  max_uid_at_start = get_max_uid ();
+  insn_entry = XCNEWVEC (indexing_web_entry, max_uid_at_start);
+
+  if (dump_file)
+    fprintf (dump_file, "Creating insn_entry array with %d entries\n",
+     max_uid_at_start);
+
+  /* The general approach is to:
+
+       1. Look for multiple array indexing expressions that refer to
+          the same array base address such as are represented by the
+  rtl excerpts below..
+
+       2. Group these into subsets for which the indexing expression
+          derives from the same initial_value + some accumulation of
+          constant values added thereto.
+
+      (cinsn 2 (set (reg/v/f:DI <27> [ x ])
+                    (reg:DI 3 [ x ])) "ddot-c.c":12
+       (expr_list:REG_DEAD (reg:DI 3 [ x ])))
+
+      ...
+
+      (cinsn 31 (set (reg:V2DF <35> [ vect__3.7 ])
+                     (mem:V2DF (plus:DI (reg/v/f:DI <27> [ x ])
+                                        (reg:DI <9> [ ivtmp.18 ]))
+                      [1 MEM[base: x_20(D), index: ivtmp.18_35,
+                       offset: 0B]+0 S16 A64])) "ddot-c.c":18)
+
+      ...
+
+      (cinsn 304 (set (reg:DI <70>)
+                      (plus:DI (reg:DI <9> [ ivtmp.18 ])
+                               (const_int 16)))
+       (expr_list:REG_DEAD (reg:DI <9> [ ivtmp.18 ])))
+      (cinsn 34 (set (reg:DI <9> [ ivtmp.18 ])
+                     (reg:DI <70>)))
+   */
+
+  /* Walk the insns to gather basic data: complexity of inner-nested
+     loop body is linear in total number of insns within function.  */
+  FOR_ALL_BB_FN (bb, fun)
+    {
+      if (dump_file && (dump_flags & TDF_DETAILS))
+ fprintf (dump_file, "Scrutinizing bb %d\n", bb->index);
+
+      FOR_BB_INSNS_SAFE (bb, insn, curr_insn)
+ {
+  unsigned int uid = INSN_UID (insn);
+
+  insn_entry[uid].insn = insn;
+  insn_entry[uid].bb = BLOCK_FOR_INSN (insn);
+  insn_entry[uid].luid = luid++;
+  insn_entry[uid].is_relevant = false;
+
+  if (dump_file && (dump_flags & TDF_DETAILS))
+    {
+      fprintf (dump_file, "\nLooking at insn: %d\n", uid);
+      df_dump_insn_top (insn, dump_file);
+      dump_insn_slim (dump_file, insn);
+      df_dump_insn_bottom (insn, dump_file);
+    }
+
+  /* First, look for all memory[base + index] expressions.
+   * Then group these by base.
+   * Then for all instructions in each group, scrutinize the index
+   * definition. Partition this group according to the origin
+   * variable upon which the the definitions of i are based.
+   *
+   * How do we define "origin variable"?
+   *
+   *  If i has multiple definitions, it is its own origin
+   *  variable.  Likewise, if i has a single definition and the
+   *  definition is NOT the sum or difference of a constant value
+   *  and some other variable, then i is its own origin variable.
+   *
+   *  Otherwise, i has the same origin variable as the expression
+   *  that represents its definition.
+   *
+   * After we've created these partitions, for each partition
+   * whose size is greater than 1:
+   *
+   *  1. introduce derived_ptr = base + origin_variable
+   *     immediately following the instruction that defines
+   *     origin_variable.
+   *
+   *  2. for each member of the partition, replace the expression
+   *     memory [base + index] with derived_ptr [constant], where
+   *     constant is the sum of all constant values added to the
+   *     origin variable to represent this particular value of i.  */
+  if (NONDEBUG_INSN_P (insn))
+    {
+      rtx body = PATTERN (insn);
+      rtx mem;
+      if ((GET_CODE (body) == SET) && MEM_P (SET_SRC (body)))
+ {
+  mem = XEXP (SET_SRC (body), 0);
+  insn_entry[uid].is_load = true;
+  insn_entry[uid].is_store = false;
+  if (dump_file && (dump_flags & TDF_DETAILS))
+    {
+      fprintf (dump_file,
+       " this insn is fetching data from memory: ");
+      print_inline_rtx (dump_file, mem, 2);
+      fprintf (dump_file, "\n");
+    }
+ }
+      else if ((GET_CODE (body) == SET) && MEM_P (SET_DEST (body)))
+ {
+  mem = XEXP (SET_DEST (body), 0);
+  insn_entry[uid].is_load = false;
+  insn_entry[uid].is_store = true;
+  if (dump_file && (dump_flags & TDF_DETAILS))
+    {
+      fprintf (dump_file,
+       " this insn is storing data to memory: ");
+      print_inline_rtx (dump_file, mem, 2);
+      fprintf (dump_file, "\n");
+    }
+ }
+      else
+ {
+  if (dump_file && (dump_flags & TDF_DETAILS))
+    fprintf (dump_file,
+     " this insn is neither load nor store\n");
+  continue; /* Not a load or store */
+ }
+
+      enum rtx_code code = GET_CODE (mem);
+      if ((code == PLUS) || (code == MINUS))
+ assess_relevance (mem, insn, insn_entry);
+      else if (dump_file && (dump_flags & TDF_DETAILS))
+ fprintf (dump_file,
+ " address not sum or difference of values\n");
+    }
+  /* else, this is a DEBUG_INSN_P (insn) so ignore it.  */
+ }
+
+ if (dump_file && (dump_flags & TDF_DETAILS))
+  fprintf (dump_file, "\n");
+    }
+
+  build_and_fixup_equivalence_classes (insn_entry);
+  free_dominance_info (CDI_DOMINATORS);
+
+  if (dump_file)
+    {
+      fprintf (dump_file, "\n");
+      fprintf (dump_file,
+       "Total number of potentially relevant instructions: %d\n",
+       num_relevant_insns);
+      fprintf (dump_file,
+       "Maximum number of definitions used by potentially "
+       "relevant instructions: %d\n", max_use_defs);
+      fprintf (dump_file,
+       "Total number of potentially relevant but rejected "
+       "instructions: %d\n\n", rejected_otherwise_relevant);
+    }
+
+  return 0;
+}  // anon namespace
+
+
+const pass_data pass_data_insert_dform =
+{
+  RTL_PASS, /* type */
+  "dform", /* name */
+  OPTGROUP_NONE, /* optinfo_flags, or could use OPTGROUP_LOOP */
+  TV_NONE, /* tv_id, or could use TV_LOOP_UNROLL */
+  0, /* properties_required */
+  0, /* properties_provided */
+  0, /* properties_destroyed */
+  0, /* todo_flags_start */
+  TODO_df_finish, /* todo_flags_finish */
+};
+
+class pass_insert_dform: public rtl_opt_pass
+{
+public:
+  pass_insert_dform(gcc::context *ctxt)
+    : rtl_opt_pass(pass_data_insert_dform, ctxt)
+  {}
+
+  /* opt_pass methods: */
+  virtual bool gate (function *)
+    {
+      // This is most relevant to P9 and subsequent targets since P9
+      // introduces new D-form instructions, but this may pay off on
+      // other architectures as well.  Additional experimentation with
+      // other targets may be worthwhile.
+      return (optimize > 0 && !BYTES_BIG_ENDIAN && TARGET_VSX
+      && TARGET_P9_VECTOR);
+    }
+
+  virtual unsigned int execute (function *fun)
+    {
+      return rs6000_insert_dform (fun);
+    }
+
+  opt_pass *clone ()
+    {
+      return new pass_insert_dform (m_ctxt);
+    }
+
+}; // class pass_insert_dform
+
+rtl_opt_pass *make_pass_insert_dform (gcc::context *ctxt)
+{
+  return new pass_insert_dform (ctxt);
+}
Index: gcc/config/rs6000/rs6000-passes.def
===================================================================
--- gcc/config/rs6000/rs6000-passes.def (revision 275051)
+++ gcc/config/rs6000/rs6000-passes.def (working copy)
@@ -22,6 +22,8 @@ along with GCC; see the file COPYING3.  If not see
    INSERT_PASS_AFTER (PASS, INSTANCE, TGT_PASS)
    INSERT_PASS_BEFORE (PASS, INSTANCE, TGT_PASS)
    REPLACE_PASS (PASS, INSTANCE, TGT_PASS)
+   Be advised: gawk program does not parse C comments if inserted below.
  */
 
   INSERT_PASS_BEFORE (pass_cse, 1, pass_analyze_swaps);
+  INSERT_PASS_AFTER (pass_loop2, 1, pass_insert_dform);
Index: gcc/config/rs6000/rs6000-protos.h
===================================================================
--- gcc/config/rs6000/rs6000-protos.h (revision 275051)
+++ gcc/config/rs6000/rs6000-protos.h (working copy)
@@ -47,6 +47,8 @@ extern bool legitimate_indirect_address_p (rtx, in
 extern bool legitimate_indexed_address_p (rtx, int);
 extern bool avoiding_indexed_address_p (machine_mode);
 extern rtx rs6000_force_indexed_or_indirect_mem (rtx x);
+extern bool rs6000_target_supports_dform_offset_p (machine_mode,
+   HOST_WIDE_INT);
 
 extern rtx rs6000_got_register (rtx);
 extern rtx find_addr_reg (rtx);
@@ -246,6 +248,8 @@ namespace gcc { class context; }
 class rtl_opt_pass;
 
 extern rtl_opt_pass *make_pass_analyze_swaps (gcc::context *);
+extern rtl_opt_pass *make_pass_insert_dform (gcc::context *);
+
 extern bool rs6000_sum_of_two_registers_p (const_rtx expr);
 extern bool rs6000_quadword_masked_address_p (const_rtx exp);
 extern rtx rs6000_gen_lvx (enum machine_mode, rtx, rtx);
Index: gcc/config/rs6000/rs6000.c
===================================================================
--- gcc/config/rs6000/rs6000.c (revision 275051)
+++ gcc/config/rs6000/rs6000.c (working copy)
@@ -8725,6 +8725,152 @@ rs6000_debug_legitimate_address_p (machine_mode mo
   return ret;
 }
 
+/* This function provides an approximation of which d-form addressing
+   expressions are valid on any given target configuration.  This
+   approximation guides optimization choices.  Secondary validation
+   of the addressing mode is performed before code generation.
+
+   Return true iff target has instructions to perform a memory
+   operation at the specified BYTE_OFFSET from an address held
+   in a general purpose register.  */
+bool
+rs6000_target_supports_dform_offset_p (machine_mode mode,
+       HOST_WIDE_INT byte_offset)
+{
+  const HOST_WIDE_INT max_16bit_signed = (0x7fff);
+  const HOST_WIDE_INT min_16bit_signed = -1 - max_16bit_signed;
+
+  /* Available d-form instructions with P1 (the original Power architecture):
+
+     lbz RT,D(RA) - load byte and zero d-form
+     lhz RT,D(RA) - load half word and zero d-form
+     lha RT,D(RA) - load half word algebraic d-form
+     lwz RT,D(RA) - load word and zero d-form
+     lfs FRT,D(RA) - load floating-point single d-form
+     lfd FRT,D(RA) - load floating-point double d-form
+
+     stb RS,D(RA) - store byte d-form
+     sth RS,D(RA) - store half word d-form
+     stfs FRS,D(RA) - store floating point single d-form
+     stfd FRS,D(RA) - store floating point double d-form  */
+
+  /* Available d-form instructions with PPC (prior to v2.00):
+     (option mpowerpc "existed in the past" but is now "always on"
+
+     lwa RT,DS(RA) - load word algebraic ds-form (2 bottom bits zero)
+     ld RT,DS(RA) - load double word ds-form (2 bottom bits zero)
+
+     std RS,DS(RA) - store double word ds-form (2 bottom bits zero)
+
+     Consider lwa redundant with insn available in prior processors.  */
+  switch (mode)
+    {
+    case E_QImode:
+    case E_HImode:
+    case E_SImode:
+    case E_SFmode:
+    case E_DFmode:
+      if (IN_RANGE (byte_offset, min_16bit_signed, max_16bit_signed))
+ return true;
+      break;
+
+    case E_DImode:
+      if (IN_RANGE (byte_offset, min_16bit_signed, max_16bit_signed)
+  && ((byte_offset & 0x03) == 0))
+ return true;
+      break;
+
+    default:
+      ;   /* Fall through to see if other instructions will work.  */
+
+    }
+
+  /* Available d-form instructions with v2.03:
+
+     lq RTp,DQ(RA) - load quadword dq-form (4 bottom bits zero)
+
+     stq RSp,DS(RA) - store quadword ds-form (2 bottom bits zero)
+
+     These instructions are not recommended for general use as they
+     are expected to be very inefficient.  Their design was apparently
+     motivated by a need to support atomic quad-word access, which is
+     difficult to implement even in hardware on some architectures.
+     Furthermore, the design of these instructions apparently does the
+     "wrong" thing with regards to swapping of double words on load
+     and store for little-endian targets.
+
+     Therefore, this routine assumes v2.03 does NOT support quadword
+     d-form addressing.  */
+
+  /* Available d-form instructions with v2.05
+
+     (There are some floating-point load and store double-pair
+      instructions.  Consider them "not available".  There are
+      described as phasing out, which means they are expected
+      to have poor performance.)  */
+
+  /* Available d-form instructions with 3.0
+
+     lxsd VRT,DS(RA) - Load VSX scalar double word ds-form (2 bottom bits zero)
+                       (redundant with lfd from P1)
+     lxssp VRT,DS(RA) - Load VSX scalar single precision ds-form
+                        (bottom 2 bits zero)
+                        (redundant with lfs from P1)
+     lxv XT,DQ(RA) - Load VSX Vector dq-form (4 bottom bits zero)
+                     (Works on little endian for any element type, but
+      does not preserve lanes.)
+
+     stxsd VRS,DS(RA) - Store VSX scalar double-word DS form
+                        (bottom 2 bits zero)
+                        (redundant with stfd from P1)
+     stxssp VRS,DS(RA) - Store VSX scalar single precision DS-form
+                         (bottom 2 bits zero)
+                         (redundant with stfs from P1)
+     stxv XS,DQ(RA) - Store VSX vector dq-form (4 bottom bits zero)
+                      (Works on little endian for any element type,
+       but does not preserve lanes.)
+
+     lxv and stxv load/store to/from any VSX register, including
+     registers that overlay with floating point and altivec register
+     sets.  */
+
+  if (rs6000_isa_flags & OPTION_MASK_MODULO) /* ISA 3.0 */
+    {
+      switch (mode)
+ {
+ case E_V16QImode:
+ case E_V8HImode:
+ case E_V4SFmode:
+ case E_V4SImode:
+ case E_V2DFmode:
+ case E_V2DImode:
+ case E_V1TImode:
+ case E_TImode:
+
+ case E_KFmode: /* ieee 754 128-bit floating point */
+ case E_IFmode: /* IBM extended 128-bit double */
+ case E_TFmode: /* 128-bit double (form depends on
+   gcc command line, which may be
+   either -mabi=ieeelongdouble (KF)
+   or -mabi=ibmlongdouble (IF). */
+  /* All 128-bit loads and stores are handled by lxv and stxv.  */
+  if (IN_RANGE (byte_offset, min_16bit_signed, max_16bit_signed)
+      && ((byte_offset & 0x0f) == 0))
+    return true;
+  break;
+
+ default:
+  ; /* fall through to see if other instructions will work.  */
+ }
+    }
+
+  /* Todo: add support for any new instructions provided by future
+     archictures when support for those future architectures is
+     enabled.  */
+
+  return false;
+}
+
 /* Implement TARGET_MODE_DEPENDENT_ADDRESS_P.  */
 
 static bool
Index: gcc/config/rs6000/t-rs6000
===================================================================
--- gcc/config/rs6000/t-rs6000 (revision 275051)
+++ gcc/config/rs6000/t-rs6000 (working copy)
@@ -47,6 +47,10 @@ rs6000-call.o: $(srcdir)/config/rs6000/rs6000-call
  $(COMPILE) $<
  $(POSTCOMPILE)
 
+rs6000-p9dform.o: $(srcdir)/config/rs6000/rs6000-p9dform.c
+ $(COMPILE) $<
+ $(POSTCOMPILE)
+
 $(srcdir)/config/rs6000/rs6000-tables.opt: $(srcdir)/config/rs6000/genopt.sh \
   $(srcdir)/config/rs6000/rs6000-cpus.def
  $(SHELL) $(srcdir)/config/rs6000/genopt.sh $(srcdir)/config/rs6000 > \
Index: gcc/config.gcc
===================================================================
--- gcc/config.gcc (revision 275051)
+++ gcc/config.gcc (working copy)
@@ -499,7 +499,7 @@ or1k*-*-*)
  ;;
 powerpc*-*-*)
  cpu_type=rs6000
- extra_objs="rs6000-string.o rs6000-p8swap.o rs6000-logue.o rs6000-call.o"
+ extra_objs="rs6000-string.o rs6000-p8swap.o rs6000-p9dform.o rs6000-logue.o rs6000-call.o"
  extra_headers="ppc-asm.h altivec.h htmintrin.h htmxlintrin.h"
  extra_headers="${extra_headers} bmi2intrin.h bmiintrin.h"
  extra_headers="${extra_headers} xmmintrin.h mm_malloc.h emmintrin.h"
Index: gcc/testsuite/gcc.target/powerpc/p9-dform-0.c
===================================================================
--- gcc/testsuite/gcc.target/powerpc/p9-dform-0.c (nonexistent)
+++ gcc/testsuite/gcc.target/powerpc/p9-dform-0.c (working copy)
@@ -0,0 +1,44 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target powerpc_p9vector_ok } */
+/* { dg-skip-if "" { powerpc*-*-aix* } } */
+/* { dg-options "-O3 -mdejagnu-cpu=power9 -funroll-loops" } */
+
+/* This test confirms that the dform instructions are selected in the
+   translation of this main program.  */
+
+extern void first_dummy ();
+extern void dummy (double sacc, int n);
+extern void other_dummy ();
+
+extern float opt_value;
+extern char *opt_desc;
+
+#define M 128
+#define N 512
+
+double x [N];
+double y [N];
+
+int main (int argc, char *argv []) {
+  double sacc;
+
+  first_dummy ();
+  for (int j = 0; j < M; j++) {
+
+    sacc = 0.00;
+    for (unsigned long long int i = 0; i < N; i++) {
+      sacc += x[i] * y[i];
+    }
+    dummy (sacc, N);
+  }
+  opt_value = ((float) N) * 2 * ((float) M);
+  opt_desc = "flops";
+  other_dummy ();
+}
+
+/* At time the dform optimization pass was merged with trunk, 12
+   lxv instructions were emitted in place of the same number of lxvx
+   instructions.  No need to require exactly this number, as it may
+   change when other optimization passes evolve.  */
+
+/* { dg-final { scan-assembler {\mlxv\M} } } */
Index: gcc/testsuite/gcc.target/powerpc/p9-dform-1.c
===================================================================
--- gcc/testsuite/gcc.target/powerpc/p9-dform-1.c (nonexistent)
+++ gcc/testsuite/gcc.target/powerpc/p9-dform-1.c (working copy)
@@ -0,0 +1,56 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target powerpc_p9vector_ok } */
+/* { dg-skip-if "" { powerpc*-*-aix* } } */
+/* { dg-options "-O3 -mdejagnu-cpu=power9 -funroll-loops" } */
+
+/* This test confirms that the dform instructions are selected in the
+   translation of this main program.  */
+
+extern void first_dummy ();
+extern void dummy (double sacc, int n);
+extern void other_dummy ();
+
+extern float opt_value;
+extern char *opt_desc;
+
+#define M 128
+#define N 512
+
+double x [N];
+double y [N];
+double z [N];
+
+int main (int argc, char *argv []) {
+  double sacc;
+
+  first_dummy ();
+  for (int j = 0; j < M; j++) {
+
+    sacc = 0.00;
+    for (unsigned long long int i = 0; i < N; i++) {
+      z[i] = x[i] * y[i];
+      sacc += z[i];
+    }
+    dummy (sacc, N);
+  }
+  opt_value = ((float) N) * 2 * ((float) M);
+  opt_desc = "flops";
+  other_dummy ();
+}
+
+
+
+/* At time the dform optimization pass was merged with trunk, 12
+   lxv instructions were emitted in place of the same number of lxvx
+   instructions.  No need to require exactly this number, as it may
+   change when other optimization passes evolve.  */
+
+/* { dg-final { scan-assembler {\mlxv\M} } } */
+
+/* At time the dform optimization pass was merged with trunk, 6
+   stxv instructions were emitted in place of the same number of stxvx
+   instructions.  No need to require exactly this number, as it may
+   change when other optimization passes evolve.  */
+
+/* { dg-final { scan-assembler {\mstxv\M} } } */
+
Index: gcc/testsuite/gcc.target/powerpc/p9-dform-10.c
===================================================================
--- gcc/testsuite/gcc.target/powerpc/p9-dform-10.c (nonexistent)
+++ gcc/testsuite/gcc.target/powerpc/p9-dform-10.c (working copy)
@@ -0,0 +1,13 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target powerpc_p9vector_ok } */
+/* { dg-skip-if "" { powerpc*-*-aix* } } */
+/* { dg-options "-O3 -mdejagnu-cpu=power9 -funroll-loops -Wall -no-pie" } */
+
+#define TYPE signed int
+#include "p9-dform-generic.h"
+
+/* The precise number of lxv and stxv instructions may be impacted by
+   complex interactions between optimization passes, but we expect at
+   least one of each.  */
+/* { dg-final { scan-assembler {\mlxv\M} } } */
+/* { dg-final { scan-assembler {\mstxv\M} } } */
Index: gcc/testsuite/gcc.target/powerpc/p9-dform-11.c
===================================================================
--- gcc/testsuite/gcc.target/powerpc/p9-dform-11.c (nonexistent)
+++ gcc/testsuite/gcc.target/powerpc/p9-dform-11.c (working copy)
@@ -0,0 +1,13 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target powerpc_p9vector_ok } */
+/* { dg-skip-if "" { powerpc*-*-aix* } } */
+/* { dg-options "-O3 -mdejagnu-cpu=power9 -funroll-loops -Wall -no-pie" } */
+
+#define TYPE unsigned long long
+#include "p9-dform-generic.h"
+
+/* The precise number of lxv and stxv instructions may be impacted by
+   complex interactions between optimization passes, but we expect at
+   least one of each.  */
+/* { dg-final { scan-assembler {\mld\M} } } */
+/* { dg-final { scan-assembler {\mstd\M} } } */
Index: gcc/testsuite/gcc.target/powerpc/p9-dform-12.c
===================================================================
--- gcc/testsuite/gcc.target/powerpc/p9-dform-12.c (nonexistent)
+++ gcc/testsuite/gcc.target/powerpc/p9-dform-12.c (working copy)
@@ -0,0 +1,13 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target powerpc_p9vector_ok } */
+/* { dg-skip-if "" { powerpc*-*-aix* } } */
+/* { dg-options "-O3 -mdejagnu-cpu=power9 -funroll-loops -Wall -no-pie" } */
+
+#define TYPE signed long long
+#include "p9-dform-generic.h"
+
+/* The precise number of lxv and stxv instructions may be impacted by
+   complex interactions between optimization passes, but we expect at
+   least one of each.  */
+/* { dg-final { scan-assembler {\mld\M} } } */
+/* { dg-final { scan-assembler {\mstd\M} } } */
Index: gcc/testsuite/gcc.target/powerpc/p9-dform-13.c
===================================================================
--- gcc/testsuite/gcc.target/powerpc/p9-dform-13.c (nonexistent)
+++ gcc/testsuite/gcc.target/powerpc/p9-dform-13.c (working copy)
@@ -0,0 +1,13 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target powerpc_p9vector_ok } */
+/* { dg-skip-if "" { powerpc*-*-aix* } } */
+/* { dg-options "-O3 -mdejagnu-cpu=power9 -funroll-loops -Wall -no-pie" } */
+
+#define TYPE unsigned __int128
+#include "p9-dform-generic.h"
+
+/* The precise number of lxv and stxv instructions may be impacted by
+   complex interactions between optimization passes, but we expect at
+   least one of each.  */
+/* { dg-final { scan-assembler {\mld\M} } } */
+/* { dg-final { scan-assembler {\mstd\M} } } */
Index: gcc/testsuite/gcc.target/powerpc/p9-dform-14.c
===================================================================
--- gcc/testsuite/gcc.target/powerpc/p9-dform-14.c (nonexistent)
+++ gcc/testsuite/gcc.target/powerpc/p9-dform-14.c (working copy)
@@ -0,0 +1,13 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target powerpc_p9vector_ok } */
+/* { dg-skip-if "" { powerpc*-*-aix* } } */
+/* { dg-options "-O3 -mdejagnu-cpu=power9 -funroll-loops -Wall -no-pie" } */
+
+#define TYPE signed __int128
+#include "p9-dform-generic.h"
+
+/* The precise number of lxv and stxv instructions may be impacted by
+   complex interactions between optimization passes, but we expect at
+   least one of each.  */
+/* { dg-final { scan-assembler {\mld\M} } } */
+/* { dg-final { scan-assembler {\mstd\M} } } */
Index: gcc/testsuite/gcc.target/powerpc/p9-dform-15.c
===================================================================
--- gcc/testsuite/gcc.target/powerpc/p9-dform-15.c (nonexistent)
+++ gcc/testsuite/gcc.target/powerpc/p9-dform-15.c (working copy)
@@ -0,0 +1,13 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target powerpc_p9vector_ok } */
+/* { dg-skip-if "" { powerpc*-*-aix* } } */
+/* { dg-options "-O3 -mdejagnu-cpu=power9 -funroll-loops -Wall -no-pie -mfloat128" } */
+
+#define TYPE __float128
+#include "p9-dform-generic.h"
+
+/* The precise number of lxv and stxv instructions may be impacted by
+   complex interactions between optimization passes, but we expect at
+   least one of each.  */
+/* { dg-final { scan-assembler {\mlxv\M} } } */
+/* { dg-final { scan-assembler {\mstxv\M} } } */
Index: gcc/testsuite/gcc.target/powerpc/p9-dform-2.c
===================================================================
--- gcc/testsuite/gcc.target/powerpc/p9-dform-2.c (nonexistent)
+++ gcc/testsuite/gcc.target/powerpc/p9-dform-2.c (working copy)
@@ -0,0 +1,14 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target powerpc_p9vector_ok } */
+/* { dg-skip-if "" { powerpc*-*-aix* } } */
+/* { dg-options "-O3 -mdejagnu-cpu=power9 -funroll-loops -Wall -no-pie" } */
+
+
+#define TYPE float
+#include "p9-dform-generic.h"
+
+/* The precise number of lxv and stxv instructions may be impacted by
+   complex interactions between optimization passes, but we expect at
+   least one of each.  */
+/* { dg-final { scan-assembler {\mlxv\M} } } */
+/* { dg-final { scan-assembler {\mstxv\M} } } */
Index: gcc/testsuite/gcc.target/powerpc/p9-dform-3.c
===================================================================
--- gcc/testsuite/gcc.target/powerpc/p9-dform-3.c (nonexistent)
+++ gcc/testsuite/gcc.target/powerpc/p9-dform-3.c (working copy)
@@ -0,0 +1,16 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target powerpc_p9vector_ok } */
+/* { dg-skip-if "" { powerpc*-*-aix* } } */
+/* { dg-options "-O3 -mdejagnu-cpu=power9 -funroll-loops -Wall -no-pie" } */
+
+#define TYPE double
+#include "p9-dform-generic.h"
+
+/* At time the dform optimization pass was merged with trunk, 6
+   lxv instructions were emitted in place of the same number of lxvx
+   instructions and 8 stxv instructions replace the same number of
+   stxvx instructions.  No need to require exactly this number, as it
+   may change when other optimization passes evolve.  */
+
+/* { dg-final { scan-assembler {\mlxv\M} } } */
+/* { dg-final { scan-assembler {\mstxv\M} } } */
Index: gcc/testsuite/gcc.target/powerpc/p9-dform-4.c
===================================================================
--- gcc/testsuite/gcc.target/powerpc/p9-dform-4.c (nonexistent)
+++ gcc/testsuite/gcc.target/powerpc/p9-dform-4.c (working copy)
@@ -0,0 +1,13 @@
+/* { dg-do compile { target { powerpc*-*-* } } } */
+/* { dg-require-effective-target powerpc_p9vector_ok } */
+/* { dg-skip-if "" { powerpc*-*-aix* } } */
+/* { dg-options "-O3 -mdejagnu-cpu=power9 -funroll-loops -Wall -no-pie" } */
+
+#define TYPE long double
+#include "p9-dform-generic.h"
+
+/* The precise number of lxv and stxv instructions may be impacted by
+   complex interactions between optimization passes, but we expect at
+   least one of each.  */
+/* { dg-final { scan-assembler {\mlfd\M} } } */
+/* { dg-final { scan-assembler {\mstfd\M} } } */
Index: gcc/testsuite/gcc.target/powerpc/p9-dform-5.c
===================================================================
--- gcc/testsuite/gcc.target/powerpc/p9-dform-5.c (nonexistent)
+++ gcc/testsuite/gcc.target/powerpc/p9-dform-5.c (working copy)
@@ -0,0 +1,13 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target powerpc_p9vector_ok } */
+/* { dg-skip-if "" { powerpc*-*-aix* } } */
+/* { dg-options "-O3 -mdejagnu-cpu=power9 -funroll-loops -Wall -no-pie" } */
+
+#define TYPE unsigned char
+#include "p9-dform-generic.h"
+
+/* The precise number of lxv and stxv instructions may be impacted by
+   complex interactions between optimization passes, but we expect at
+   least one of each.  */
+/* { dg-final { scan-assembler {\mlxv\M} } } */
+/* { dg-final { scan-assembler {\mstxv\M} } } */
Index: gcc/testsuite/gcc.target/powerpc/p9-dform-6.c
===================================================================
--- gcc/testsuite/gcc.target/powerpc/p9-dform-6.c (nonexistent)
+++ gcc/testsuite/gcc.target/powerpc/p9-dform-6.c (working copy)
@@ -0,0 +1,13 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target powerpc_p9vector_ok } */
+/* { dg-skip-if "" { powerpc*-*-aix* } } */
+/* { dg-options "-O3 -mdejagnu-cpu=power9 -funroll-loops -Wall -no-pie" } */
+
+#define TYPE signed char
+#include "p9-dform-generic.h"
+
+/* The precise number of lxv and stxv instructions may be impacted by
+   complex interactions between optimization passes, but we expect at
+   least one of each.  */
+/* { dg-final { scan-assembler {\mlxv\M} } } */
+/* { dg-final { scan-assembler {\mstxv\M} } } */
Index: gcc/testsuite/gcc.target/powerpc/p9-dform-7.c
===================================================================
--- gcc/testsuite/gcc.target/powerpc/p9-dform-7.c (nonexistent)
+++ gcc/testsuite/gcc.target/powerpc/p9-dform-7.c (working copy)
@@ -0,0 +1,13 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target powerpc_p9vector_ok } */
+/* { dg-skip-if "" { powerpc*-*-aix* } } */
+/* { dg-options "-O3 -mdejagnu-cpu=power9 -funroll-loops -Wall -no-pie" } */
+
+#define TYPE unsigned short
+#include "p9-dform-generic.h"
+
+/* The precise number of lxv and stxv instructions may be impacted by
+   complex interactions between optimization passes, but we expect at
+   least one of each.  */
+/* { dg-final { scan-assembler {\mlxv\M} } } */
+/* { dg-final { scan-assembler {\mstxv\M} } } */
Index: gcc/testsuite/gcc.target/powerpc/p9-dform-8.c
===================================================================
--- gcc/testsuite/gcc.target/powerpc/p9-dform-8.c (nonexistent)
+++ gcc/testsuite/gcc.target/powerpc/p9-dform-8.c (working copy)
@@ -0,0 +1,13 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target powerpc_p9vector_ok } */
+/* { dg-skip-if "" { powerpc*-*-aix* } } */
+/* { dg-options "-O3 -mdejagnu-cpu=power9 -funroll-loops -Wall -no-pie" } */
+
+#define TYPE signed short
+#include "p9-dform-generic.h"
+
+/* The precise number of lxv and stxv instructions may be impacted by
+   complex interactions between optimization passes, but we expect at
+   least one of each.  */
+/* { dg-final { scan-assembler {\mlxv\M} } } */
+/* { dg-final { scan-assembler {\mstxv\M} } } */
Index: gcc/testsuite/gcc.target/powerpc/p9-dform-9.c
===================================================================
--- gcc/testsuite/gcc.target/powerpc/p9-dform-9.c (nonexistent)
+++ gcc/testsuite/gcc.target/powerpc/p9-dform-9.c (working copy)
@@ -0,0 +1,13 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target powerpc_p9vector_ok } */
+/* { dg-skip-if "" { powerpc*-*-aix* } } */
+/* { dg-options "-O3 -mdejagnu-cpu=power9 -funroll-loops -Wall -no-pie" } */
+
+#define TYPE unsigned int
+#include "p9-dform-generic.h"
+
+/* The precise number of lxv and stxv instructions may be impacted by
+   complex interactions between optimization passes, but we expect at
+   least one of each.  */
+/* { dg-final { scan-assembler {\mlxv\M} } } */
+/* { dg-final { scan-assembler {\mstxv\M} } } */
Index: gcc/testsuite/gcc.target/powerpc/p9-dform-generic.h
===================================================================
--- gcc/testsuite/gcc.target/powerpc/p9-dform-generic.h (nonexistent)
+++ gcc/testsuite/gcc.target/powerpc/p9-dform-generic.h (working copy)
@@ -0,0 +1,34 @@
+
+#define ITERATIONS 1000000
+
+#define SIZE (16384/sizeof(TYPE))
+
+static TYPE x[SIZE] __attribute__ ((aligned (16)));
+static TYPE y[SIZE] __attribute__ ((aligned (16)));
+static TYPE a;
+
+void obfuscate(void *a, ...);
+
+static void __attribute__((noinline)) do_one(void)
+{
+  unsigned long i;
+
+  obfuscate(x, y, &a);
+
+  for (i = 0; i < SIZE; i++)
+    y[i] = a * x[i];
+
+  obfuscate(x, y, &a);
+
+}
+
+int main(void)
+{
+  unsigned long i;
+
+  for (i = 0; i < ITERATIONS; i++)
+    do_one();
+
+  return 0;
+
+}
Reply | Threaded
Open this post in threaded view
|

Re: Ping: [PATCH v4, rs6000] Replace X-form addressing with D-form addressing in new pass for Power9

Kelvin Nilsen-2


On 10/25/19 8:30 PM, Kelvin Nilsen wrote:

>
> This patch adds a new optimization pass for rs6000 targets.
>
> This new pass scans existing rtl expressions and replaces X-form loads and stores with rtl expressions that favor selection of the D-form instructions in contexts for which the D-form instructions are preferred.  The new pass runs after the RTL loop optimizations since loop unrolling often introduces opportunities for beneficial replacements of X-form addressing instructions.
>
> For each of the new tests, multiple X-form instructions are replaced with D-form instructions, some addi instructions are replaced with add instructions, and some addi instructions are eliminated.  The typical improvement for the included tests is a decrease of 4.28% to 12.12% in the number of instructions executed on each iteration of the loop.  The optimization has not shown measurable improvement on specmark tests, presumably because the typical loops that are benefited by this optimization are memory bounded and this optimization does not eliminate memory loads or stores.  However, it is anticipated that multi-threaded workloads and measurements of total power and cooling costs for heavy server workloads would benefit.
>
> This version 4 patch responds to feedback and numerous suggestions by Segher:
>
>   1. Further improvements to comments and discussion of computational complexity.
>
>   2. Changed the name of insn_sequence_no to luid.
>
>   3. Fixed some typos in comments.
>
>   4. Added macro-defined constants to enforce upper bounds on the sizes (and number of required iterations) for certain data structures.  The intent is to bound compile time for programs that represent large numbers of opportunities for D-form replacements.  This optimization pass ignores  parts of a source program that exceed these macro-defined size limits.
>
> In a separate mail, I have sent discussion regarding the behavior of preceding passes and how this behavior relates to this new pass.
>
> I have built and regression tested this patch on powerpc64le-unknown-linux target with no regressions.
>
> Is this ok for trunk?
>
> gcc/ChangeLog:
>
> 2019-10-25  Kelvin Nilsen  <[hidden email]>
>
> * config/rs6000/rs6000-p9dform.c: New file.
> * config/rs6000/rs6000-passes.def: Add pass_insert_dform.
> * config/rs6000/rs6000-protos.h
> (rs6000_target_supports_dform_offset_p): New function prototype.
> (make_pass_insert_dform): Likewise.
> * config/rs6000/rs6000.c (rs6000_target_supports_dform_offset_p):
> New function.
> * config/rs6000/t-rs6000 (rs6000-p9dform.o): New build target.
> * config.gcc: Add rs6000-p9dform.o object file.
>
> gcc/testsuite/ChangeLog:
>
> 2019-10-25  Kelvin Nilsen  <[hidden email]>
>
> * gcc.target/powerpc/p9-dform-0.c: New test.
> * gcc.target/powerpc/p9-dform-1.c: New test.
> * gcc.target/powerpc/p9-dform-10.c: New test.
> * gcc.target/powerpc/p9-dform-11.c: New test.
> * gcc.target/powerpc/p9-dform-12.c: New test.
> * gcc.target/powerpc/p9-dform-13.c: New test.
> * gcc.target/powerpc/p9-dform-14.c: New test.
> * gcc.target/powerpc/p9-dform-15.c: New test.
> * gcc.target/powerpc/p9-dform-2.c: New test.
> * gcc.target/powerpc/p9-dform-3.c: New test.
> * gcc.target/powerpc/p9-dform-4.c: New test.
> * gcc.target/powerpc/p9-dform-5.c: New test.
> * gcc.target/powerpc/p9-dform-6.c: New test.
> * gcc.target/powerpc/p9-dform-7.c: New test.
> * gcc.target/powerpc/p9-dform-8.c: New test.
> * gcc.target/powerpc/p9-dform-9.c: New test.
> * gcc.target/powerpc/p9-dform-generic.h: New test.
>
> Index: gcc/config/rs6000/rs6000-p9dform.c
> ===================================================================
> --- gcc/config/rs6000/rs6000-p9dform.c (nonexistent)
> +++ gcc/config/rs6000/rs6000-p9dform.c (working copy)
> @@ -0,0 +1,1763 @@
> +/* Subroutines used to transform array subscripting expressions into
> +   forms that are more amenable to d-form instruction selection for p9
> +   little-endian VSX code.
> +   Copyright (C) 1991-2019 Free Software Foundation, Inc.
> +
> +   This file is part of GCC.
> +
> +   GCC is free software; you can redistribute it and/or modify it
> +   under the terms of the GNU General Public License as published
> +   by the Free Software Foundation; either version 3, or (at your
> +   option) any later version.
> +
> +   GCC is distributed in the hope that it will be useful, but WITHOUT
> +   ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
> +   or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public
> +   License for more details.
> +
> +   You should have received a copy of the GNU General Public License
> +   along with GCC; see the file COPYING3.  If not see
> +   <http://www.gnu.org/licenses/>.  */
> +
> +#include "config.h"
> +#include "system.h"
> +#include "coretypes.h"
> +#include "backend.h"
> +#include "rtl.h"
> +#include "tree.h"
> +#include "memmodel.h"
> +#include "df.h"
> +#include "tm_p.h"
> +#include "ira.h"
> +#include "print-tree.h"
> +#include "varasm.h"
> +#include "explow.h"
> +#include "expr.h"
> +#include "output.h"
> +#include "tree-pass.h"
> +#include "rtx-vector-builder.h"
> +#include "cfgloop.h"
> +
> +#include "insn-config.h"
> +#include "recog.h"
> +
> +#include "print-rtl.h"
> +#include "tree-pretty-print.h"
> +
> +#include "genrtl.h"
> +
> +/* This pass transforms array indexing expressions from a form that
> +   favors selection of X-form instructions into a form that favors
> +   selection of D-form instructions.
> +
> +   Showing favor for D-form instructions is especially important when
> +   targeting Power9, as the Power9 architecture added a number of new
> +   D-form instruction capabilities.
> +
> +   Consider, for example, the following loop, excerpted from an actual
> +   program:
> +
> +    double sacc, x[], y[], z[];
> +    sacc = 0.00;
> +    for (unsigned long long int i = 0; i < N; i++) {
> +      z[i] = x[i] * y[i];
> +      sacc += z[i];
> +    }
> +
> +   Compile this program with the following gcc options which enable both
> +   vectorization and loop unrolling:
> +    -m64 -fdump-rtl-all-details -mcpu=power9 -mtune=power9 -funroll-loops -O3
> +
> +   Without this pass, this loop is represented by the following:
> +
> +   lxvx:       16
> + addi: 8
> + xvmuldp: 8
> + stxvx: 8
> + fmr: 8
> + xxpermdi: 8
> + fadd:       16
> + bdnz: 1
> +      ___
> +      total:   73 instructions
> +
> +.L3:
> + lxvx 0,29,11
> + lxvx 12,30,11
> + addi 12,11,16
> + addi 0,11,48
> + addi 5,11,64
> + addi 9,11,32
> + addi 6,11,80
> + addi 7,11,96
> + addi 8,11,112
> + lxvx 2,29,12
> + lxvx 3,30,12
> + lxvx 4,29,0
> + lxvx 5,30,0
> + lxvx 10,30,9
> + lxvx 11,29,5
> + xvmuldp 6,0,12
> + lxvx 13,30,5
> + lxvx 8,29,9
> + lxvx 27,29,6
> + lxvx 28,30,6
> + xvmuldp 7,2,3
> + lxvx 29,29,7
> + lxvx 30,30,7
> + xvmuldp 9,4,5
> + lxvx 3,30,8
> + lxvx 0,29,8
> + xvmuldp 8,8,10
> + xvmuldp 10,11,13
> + xvmuldp 11,27,28
> + xxpermdi 26,6,6,3
> + fmr 2,6
> + stxvx 6,31,11
> + xvmuldp 12,29,30
> + addi 11,11,128
> + fadd 1,26,1
> + xxpermdi 26,7,7,3
> + stxvx 7,31,12
> + fmr 27,7
> + xvmuldp 0,0,3
> + xxpermdi 30,9,9,3
> + fmr 31,9
> + stxvx 8,31,9
> + xxpermdi 13,10,10,3
> + xxpermdi 28,8,8,3
> + stxvx 9,31,0
> + stxvx 10,31,5
> + fadd 1,2,1
> + fmr 2,10
> + xxpermdi 3,11,11,3
> + stxvx 11,31,6
> + fmr 4,11
> + fmr 29,8
> + xxpermdi 5,12,12,3
> + fmr 6,12
> + stxvx 12,31,7
> + xxpermdi 9,0,0,3
> + fmr 8,0
> + fadd 7,26,1
> + stxvx 0,31,8
> + fadd 10,27,7
> + fadd 11,28,10
> + fadd 12,29,11
> + fadd 26,30,12
> + fadd 27,31,26
> + fadd 30,13,27
> + fadd 31,2,30
> + fadd 0,3,31
> + fadd 28,4,0
> + fadd 29,5,28
> + fadd 13,6,29
> + fadd 1,9,13
> + fadd 1,8,1
> + bdnz .L3
> +
> +   With this pass, the same loop is represented by:
> +
> +   lxvx:        4
> + lxv:       12
> + addi: 2
> + add: 3
> + xvmuldp: 8
> + stxvx: 2
> + stxv: 6
> + fmr: 8
> + xxpermdi: 8
> + fadd:       16
> + bdnz: 1
> +      ___
> +      total:   70 instructions
> +
> +.L3:
> + lxvx 0,29,6
> + lxvx 12,30,6
> + addi 10,6,16
> + add 7,29,10
> + add 8,30,10
> + lxvx 2,29,10
> + lxvx 3,30,10
> + add 11,31,10
> + lxv 4,16(7)
> + lxv 9,16(8)
> + xvmuldp 6,0,12
> + lxv 13,64(8)
> + lxv 5,32(7)
> + lxv 28,32(8)
> + lxv 11,64(7)
> + xvmuldp 7,2,3
> + lxv 30,80(7)
> + lxv 12,80(8)
> + lxv 31,48(8)
> + lxv 10,48(7)
> + xvmuldp 8,4,9
> + lxv 3,96(8)
> + lxv 0,96(7)
> + xvmuldp 9,5,28
> + xvmuldp 11,11,13
> + xxpermdi 29,6,6,3
> + fmr 4,6
> + stxvx 6,31,6
> + xvmuldp 12,30,12
> + addi 6,6,128
> + fadd 1,29,1
> + xxpermdi 28,7,7,3
> + fmr 29,7
> + stxvx 7,31,10
> + xvmuldp 10,10,31
> + xxpermdi 30,8,8,3
> + fmr 31,8
> + stxv 8,16(11)
> + xvmuldp 0,0,3
> + xxpermdi 5,11,11,3
> + fmr 6,11
> + xxpermdi 13,9,9,3
> + fadd 1,4,1
> + stxv 11,64(11)
> + xxpermdi 7,12,12,3
> + fmr 8,12
> + stxv 12,80(11)
> + fmr 2,9
> + xxpermdi 3,10,10,3
> + fmr 4,10
> + stxv 9,32(11)
> + stxv 10,48(11)
> + fadd 28,28,1
> + xxpermdi 9,0,0,3
> + fmr 10,0
> + stxv 0,96(11)
> + fadd 11,29,28
> + fadd 29,30,11
> + fadd 12,31,29
> + fadd 30,13,12
> + fadd 31,2,30
> + fadd 13,3,31
> + fadd 2,4,13
> + fadd 1,5,2
> + fadd 0,6,1
> + fadd 3,7,0
> + fadd 4,8,3
> + fadd 5,9,4
> + fadd 1,10,5
> + bdnz .L3
> +
> +   The optimized loop body replaces 12 lxvx instructions with lxv
> +   instructions, 6 stxvx instructions with stxv, and has 3 fewer add
> +   operations.
> +
> +   This pass runs immediately after pass_loop2.  Loops have already
> +   been unrolled.  The pass searches for sequences of code of the following
> +   form.  These code sequences often appear within the expanded loop bodies
> +   that result from unrolling.  The memory access patterns below match
> +   both load and store instructions.  The set of memory operations
> +   that derive from the same originating expression are grouped together
> +   by this algorithm into a collection identified within the code as an
> +   equivalence class.
> +
> +   A0: *(array_base + offset)
> +   ;; The above is known as an originating access to memory.
> +
> +   Aij: offset += constant (i, j)
> +   ;; Between consecutive accesses to memory, there may appear zero or
> +   ;; more constant adjustments to the memory offset subexpression.
> +
> +   Ai: *(array_base + offset)
> +   ;; The memory address for each subsequent access to memory differs
> +   ;; from the originating memory access by a constant offset, which
> +   ;; is computed by adding together all of the preceding constant
> +   ;; (i,j) values.
> +
> +   ;; In any given equivalence class, there may be multiple subsequent
> +   ;; memory accesses, identifed as A2, A3, ... AN, and there may be
> +   ;; multiple constant adjustments to the offset expression between
> +   ;; each pair Ai-1 and Ai where the N intervening constant
> +   ;; adjustments are identified as Aij for j ranging from 0 to N-1.
> +
> +   ;; It is required that each element of the matched pattern dominate
> +   ;; the element that follows.  In other words, the flow through the
> +   ;; various matched elements must be unconditional.  Otherwise, the
> +   ;; matched elements cannot be considered to reside within the same
> +   ;; equivalence class for purposes of this optimization.
> +
> +   This pass replaces the above-matched sequences with:
> +
> +   Ai: derived_pointer = array_base + offset
> +       *(derived_pointer)
> +
> +   Aij: leave these alone.  expect that subsequent optimization deletes
> +        this code as it may become dead (since we don't use the
> +        indexing expression following our code transformations.)
> +
> +   Ai:
> +   *(derived_pointer + constant_i)
> +     (where constant_i equals sum of constant (n,j) for all n from 1
> +      to i paired with all j from 1 to Kn,
> +
> +   For example, if the original code is of the form
> +     int offset = some_var;
> +     x0 = base[offset];
> +     offset -= 8;
> +     x1 = base[offset];
> +     other_offset = offset + 16;
> +     x2 = base[other_offset]
> +
> +   This is transformed into
> +
> +     iv_ptr = base+some_var;
> +     x0 = iv_ptr[0];
> +     x1 = iv_ptr[-8];
> +     x2 = iv_ptr[8];
> +
> +   This pass makes no effort to assure that the calculated iv_ptr
> +   value points into the middle of the range of addresses to be
> +   spanned by the pointer.  We simply choose the value of the
> +   addressing expression that dominates all the other addressing
> +   instructions within the same equivalence class.  In theory, a more
> +   judicious choice of the calculated iv_ptr value can result in fewer
> +   instructions not matching the d-form addressing mode because the
> +   constant offset exceeds the reach of the instruction's encoding
> +   limits.  However, this is very rare and the extra effort required
> +   to optimize the choice of the computed iv_ptr value is deemed not
> +   worth the effort at the current time.
> +
> +   Note that there may be multiple equivalence classes, each
> +   associated with the same or possibly a different array_base value
> +   within each function that is processed by this optimization pass.  */
> +
> +/* This is based on the union-find logic in web.c.  web_entry_base is
> +   defined in df.h.  */
> +class indexing_web_entry: public web_entry_base
> +{
> + public:
> +  rtx_insn *insn;
> +  basic_block bb;
> +
> +  /* A unique sequence number, identified as the local unique id, or
> +     luid, is assigned to each instruction for the purpose of
> +     simplifying domination tests.  Within each basic block, local
> +     unique id numbers are assigned in strictly increasing order.
> +     Thus, for any two instructions known to reside in the same basic
> +     block, the instruction with a lower luid is known to dominate the
> +     instruction with a higher luid.  */
> +  unsigned int luid;
> +
> +  /* If this insn is relevant, it is a load or store with a memory
> +     address that is comprised of a base pointer (e.g. the address of
> +     an array or array slice) and an index expression (e.g. an index
> +     within the array).  By convention, the base expression is assumed
> +     to be the left argument to the addressing arithmetic plus
> +     operator and the index expression is assumed to be the right
> +     argument.  Symmetric presentation of the same two arguments of
> +     the plus operator will not be recognized as belonging to the same
> +     equivalence class.  The original_base_use and original_index_use
> +     fields represent the numbers of the instructions that define the
> +     base and index values which are summed together with a constant
> +     value to determine the value of this instruction's memory
> +     address.  */
> +  unsigned int original_base_use;
> +  unsigned int original_index_use;
> +
> +  /* If this insn is relevant, the register assigned by insn
> +     original_base_use is original_base_reg.  The insn assigned by insn
> +     original_index_use is original_index_reg.  */
> +  unsigned int original_base_reg;
> +  unsigned int original_index_reg;
> +
> +  /* If this insn is_relevant, this is the constant that is added to
> +     the originating expression to calculate the value of this insn's
> +     memory address.  */
> +  int base_delta;
> +  int index_delta;
> +
> +  /* If this insn is relevant, it belongs to an equivalence class.
> +     The equivalence classes are identified by the definitions that
> +     define the inputs to this insn.   */
> +  unsigned int base_equivalence_hash;
> +  unsigned int index_equivalence_hash;
> +
> +  /* When multiple insns fall within the same equivalence class, they
> +     are linked together through this field.  The value UINT_MAX
> +     represents the end of this list.  */
> +  unsigned int equivalent_sibling;
> +
> +  /* Only instructions that represent loads or stores for which the
> +     memory address computation is in a particular simple form are
> +     considered relevant to this d-form optimization pass.
> +
> +     If a particular entry is identified as is_relevant == false, the
> +     values of the following fields are all undefined: is_load,
> +     is_store, is_originating, original_base_use, original_index_use,
> +     original_base_reg, original_index_reg, base_delta, index_delta,
> +     base_equivalence_hash, index_equivalence_hash, and
> +     equivalent_sibling.  */
> +  unsigned int is_relevant : 1;
> +  unsigned int is_load : 1;
> +  unsigned int is_store : 1;
> +  unsigned int is_originating : 1;
> +};
> +
> +/* How many relevant uses are contained within this function?  A
> +   relevant insn is a memory store or load insn for which the memory
> +   address is computed as the sum or difference of a base address and
> +   an index expression and there is exactly one definition of the base
> +   expression and there is exactly one definition of the index
> +   expression.  The complexity of this pass is determined in part by
> +   this number. */
> +static unsigned int num_relevant_insns;
> +
> +/* What is the largest number of definitions that reach relevant
> +   uses within this function?  The complexity of this pass
> +   is determined in part by this number.  */
> +static unsigned int max_use_defs;
> +
> +/* How many otherwise relevant instructions were rejected due to
> +   overflow of the MAX_RELEVANT_INSNS or BOUND_MAX_USE_DEFS
> +   limitations?  */
> +static unsigned int rejected_otherwise_relevant;
> +
> +/* The complexity of this pass is O (num_relevant_insns *
> +   max_use_defs).  If num_relevant_insns > MAX_RELEVANT_INSNS, suspend
> +   further optimization so as to manage compilation complexity.  */
> +#define MAX_RELEVANT_INSNS 128
> +
> +/* The complexity of this pass is O (num_relevant_insns *
> +   max_use_defs).  If max_use_defs > BOUND_MAX_USE_DEFS, suspend
> +   further optimization so as to manage compilation complexity.  */
> +#define BOUND_MAX_USE_DEFS 8
> +
> +/* A relevant instruction is expected to make use of no more than 3
> +   inputs: the base register, the index register, and, for store
> +   operations, the register holding the value to be stored.  If an
> +   instruction has more uses, consider that instruction to be not
> +   relevant.  */
> +#define BOUND_RELEVANT_USES 3
> +
> +/* Count how many definitions reach the use that is represented by the
> +   DEF_LINK argument.  Update max_use_defs if returned count exceeds
> +   previously encountered maximum count value.  */
> +static unsigned int
> +count_links (struct df_link *def_link)
> +{
> +  unsigned int count;
> +  for (count = 0; def_link != NULL; count++)
> +    def_link = def_link->next;
> +  if (count > max_use_defs)
> +    max_use_defs = count;
> +  return count;
> +}
> +
> +/* Helper comparison function for use by qsort.  */
> +static int
> +int_compare (const void *v1, const void *v2)
> +{
> +  const int *i1 = (const int *) v1;
> +  const int *i2 = (const int *) v2;
> +  return *i1 - *i2;
> +}
> +
> +/* Calculate the hash value for the use represented by DEF_LINK, given
> +   that COUNT definitions are known to reach this use.
> +
> +   Complexity is n*log(n) (for qsort) where n is COUNT.  */
> +static unsigned int
> +help_hash (unsigned int count, struct df_link *def_link)
> +{
> +  int *ids;
> +  int i = 0;
> +
> +  ids = (int *) alloca (count * sizeof (ids[0]));
> +
> +  while (def_link != NULL)
> +    {
> +      ids[i++] = DF_REF_ID (def_link->ref);
> +      def_link = def_link->next;
> +    }
> +
> +  /* sort to put ids in ascending order. */
> +  qsort ((void *) ids, count, sizeof (ids[0]), int_compare);
> +
> +  unsigned int result = 0;
> +  for (unsigned i = 0; i < count; i++)
> +    {
> +      result = (result << 6) ^ ((result >> 28) & 0x0f);
> +      result += ids[i];
> +    }
> +  return result;
> +}
> +
> +/* Calculate a hash code that represents all of the definitions that
> +   contribute to a given variable's computed value.  */
> +static unsigned int
> +equivalence_hash (struct df_link *def_link)
> +{
> +  unsigned int count = count_links (def_link);
> +  return help_hash (count, def_link);
> +}
> +
> +/* Set *header to point to the chain of originating definitions
> +   corresponding to USE.  */
> +static void
> +overwrite_defs_header (indexing_web_entry *insn_entry, unsigned int uid,
> +       df_ref use, struct df_link **header)
> +{
> +  struct df_link *def_link = DF_REF_CHAIN (use);
> +  unsigned int uid2;
> +
> +  /* If there is no def, or the definition is artificial, or if there
> +     are multiple definitions, this is an originating use.  */
> +  if (!def_link || !def_link->ref
> +      || DF_REF_IS_ARTIFICIAL (def_link->ref) || def_link->next)
> +    *header = def_link;
> +  else if ((uid2 = insn_entry[uid].original_base_use) > 0)
> +    {
> +      df_ref use2;
> +      rtx_insn *insn2 = insn_entry[uid2].insn;
> +      struct df_insn_info *insn_info2 = DF_INSN_INFO_GET (insn2);
> +      use2 = DF_INSN_INFO_USES (insn_info2);
> +      if (use2)
> + *header = DF_REF_CHAIN (use2);
> +      else
> + *header = NULL;
> +    }
> +  else
> +    *header = NULL;
> +}
> +
> +/* Given that UID represents a "relevant" instruction, overwrite
> +   *insn_base with a pointer to the chain of originating definitions
> +   for the base expression.  Overwrite *insn_index with a pointer
> +   to the chain of originating definitions for the index expression.  */
> +static void
> +help_find_defs (indexing_web_entry *insn_entry,
> + unsigned int uid, rtx base_reg, rtx index_reg,
> + struct df_link **insn_base, struct df_link **insn_index)
> +{
> +  rtx_insn *insn = insn_entry[uid].insn;
> +  struct df_insn_info *insn_info = DF_INSN_INFO_GET (insn);
> +  df_ref use;
> +
> +  /* Iterate over the uses consumed by this insn: loop iterates at
> +     most 3 times, once for use representing the base register, once
> +     for use representing the index register, and, for store
> +     operations, once for the use representing the value to be stored
> +     to memory.  */
> +  FOR_EACH_INSN_INFO_USE (use, insn_info)
> +    {
> +      if (rtx_equal_p (DF_REF_REG (use), base_reg))
> + overwrite_defs_header (insn_entry, uid, use, insn_base);
> +      else if (rtx_equal_p (DF_REF_REG (use), index_reg))
> + overwrite_defs_header (insn_entry, uid, use, insn_index);
> +    }
> +}
> +
> +/* Given that INSN is known to be relevant, find the linked list of
> +   definitions that define the base register and the the linked list
> +   of definitions that define the index register, and overwrite
> +   *INSN_BASE and *INSN_INDEX with pointers to the two linked lists
> +   respectively.
> +
> +   Since INSN is "relevant", it is known to represent a load or store
> +   operation and the associated memory address is known to be
> +   represented by a sum or difference of a base register and an index
> +   offset register.  */
> +static void
> +find_defs (indexing_web_entry *insn_entry, rtx_insn *insn,
> +   struct df_link **insn_base, struct df_link **insn_index)
> +{
> +  unsigned int uid = INSN_UID (insn);
> +  rtx body = PATTERN (insn);
> +  rtx mem = NULL;
> +
> +  gcc_assert (GET_CODE (body) == SET);
> +
> +  if (MEM_P (SET_SRC (body)))
> +    mem = XEXP (SET_SRC (body), 0);
> +  else if (MEM_P (SET_DEST (body)))
> +    mem = XEXP (SET_DEST (body), 0);
> +  else
> +    gcc_unreachable ();
> +
> +  enum rtx_code code = GET_CODE (mem);
> +  gcc_assert ((code == PLUS) || (code == MINUS));
> +
> +  rtx base_reg = XEXP (mem, 0);
> +  rtx index_reg = XEXP (mem, 1);
> +  if (REG_P (base_reg) && REG_P (index_reg))
> +    help_find_defs (insn_entry, uid,
> +    base_reg, index_reg, insn_base, insn_index);
> +}
> +
> +/* Return non-zero if and only if USE represents a compile-time constant.  */
> +static bool
> +represents_constant_p (df_ref use)
> +{
> +  struct df_link *def_link = DF_REF_CHAIN (use);
> +
> +  /* If there is no definition, or the definition is artificial,
> +     or if there are multiple definitions, this is an originating use
> +     which is not a constant.  */
> +  if (!def_link || !def_link->ref
> +      || DF_REF_IS_ARTIFICIAL (def_link->ref) || def_link->next)
> +    return false;
> +  else
> +    {
> +      /* Consult the instruction that provides definition of this use.  */
> +      rtx def_insn = DF_REF_INSN (def_link->ref);
> +      rtx body = PATTERN (def_insn);
> +      if (CONST_INT_P (body))
> + return true;
> +      else if (GET_CODE (body) == SET)
> + {
> +  /* Recurse on the uses that define the value of this definition.  */
> +  struct df_insn_info *inner_insn_info = DF_INSN_INFO_GET (def_insn);
> +  df_ref inner_use;
> +  FOR_EACH_INSN_INFO_USE (inner_use, inner_insn_info)
> +    {
> +      if (!represents_constant_p (inner_use))
> + return false;
> +    }
> +  /* All uses are constant.  */
> +  return true;
> + }
> +      else
> + return false; /* Treat unrecognized codes as not constant.  */
> +    }
> +}
> +
> +/* This function helps analyze opportunities for copy propogation and
> +   constant folding.
> +
> +   An originator represents the first point at which the value of
> +   DEF_LINK is derived from potentially more than one input
> +   definition, or the point at which DEF_LINK's value is defined by an
> +   algebraic expression involving only constants,
> +
> +   If DEF_LINK's value depends on a constant combined with a single
> +   variable or a simple propagation of a single variable, continue
> +   the search for the originator by examining the origin of the source
> +   variable's value.
> +
> +   The value of *ADJUSTMENT is overwritten with the constant value that is
> +   added to the originator expression to obtain the value intended to
> +   be represented by DEF_LINK.  In the case that find_true_originator
> +   returns NULL, the value held in *ADJUSTMENT is undefined.
> +
> +   Returns NULL if there is no single true originator.  In general, the search
> +   for an originator expression only spans SET operations that are
> +   based on simple algebraic expressions, each of which involves no
> +   more than one variable input.
> +
> +   Complexity: Linear in the number of definitions used by this
> +   instruction (<= BOUND_MAX_USE_DEFS) multiplied (for recursive
> +   calls) by the maximum depth of the potential copy propogation
> +   chain.
> +  */
> +static rtx
> +find_true_originator (struct df_link *def_link, long long int *adjustment)
> +{
> +  rtx def_insn = DF_REF_INSN (def_link->ref);
> +
> +  rtx inner_body = PATTERN (def_insn);
> +  if (GET_CODE (inner_body) == SET)
> +    {
> +      struct df_insn_info *inner_insn_info = DF_INSN_INFO_GET (def_insn);
> +      df_ref inner_use;
> +
> +      /* We're only happy with multiple uses if all but one represent
> + constant values.  */
> +      int non_constant_uses = 0;
> +      rtx result = NULL;
> +      FOR_EACH_INSN_INFO_USE (inner_use, inner_insn_info)
> + {
> +  if (!represents_constant_p (inner_use))
> +    {
> +      non_constant_uses++;
> +      /* There should be only one non-constant use, and it should
> + satisfy find_true_originator.  */
> +      struct df_link *def_link = DF_REF_CHAIN (inner_use);
> +
> +      /* If there is no definition, or the definition is artificial,
> + or there are multiple definitions, this is an
> + originating use.  */
> +      if (!def_link || !def_link->ref
> +  || DF_REF_IS_ARTIFICIAL (def_link->ref) || def_link->next)
> + result = def_insn;
> +      else
> + result = find_true_originator (def_link, adjustment);
> +    }
> + }
> +
> +      /* If non_constant_uses > 1, the value of result is not well
> + defined because it is overwritten during multiple iterations
> + of the above loop.  In the case that non_constant_uses > 1,
> + we ignore the result value computed above.  */
> +      if (non_constant_uses == 1) {
> +
> + /* If my SET looks like a simple register copy, or if it looks
> +   like PLUS or MINUS of a constant and a register, this is
> +   what we optimize.  Otherwise, punt.  */
> +
> + if (result == NULL)
> +  /* Doing constant arithmetic with unknown originator is not
> +     useful.  */
> +  return def_insn;
> +
> + rtx source_expr = SET_SRC (inner_body);
> + int source_code = GET_CODE (source_expr);
> + if (source_code == PLUS)
> +  {
> +    rtx op1 = XEXP (source_expr, 0);
> +    rtx op2 = XEXP (source_expr, 1);
> +
> +    if (CONST_INT_P (op1) && CONST_INT_P (op2))
> +      *adjustment += INTVAL (op1);
> +    else if (!CONST_INT_P (op1) && CONST_INT_P (op2))
> +      *adjustment += INTVAL (op2);
> +  }
> + else if (source_code == MINUS)
> +  {
> +    rtx op1 = XEXP (source_expr, 0);
> +    rtx op2 = XEXP (source_expr, 1);
> +
> +    if (!CONST_INT_P (op1) && CONST_INT_P (op2))
> +      *adjustment -= INTVAL (op1);
> +    else
> +      /* assumption is that *adjustment is added to a positive variable
> + expression, so don't optimize this rare condition.  */
> +      result = def_insn;
> +  }
> + else if (source_code != REG)
> +  /* We don't handle ashift, UNSPEC, etc..  */
> +  result = def_insn;
> + /* else, register copy expression does not impact adjustment.  */
> +
> + return result;
> +      }
> +      else
> + /* Same behavior if there are too many non-constant inputs or if
> +   all inputs are constant.  */
> + return def_insn;
> +    }
> +  else
> +    /* This is not a SET.  It does not serve as a true originator. */
> +    return NULL;
> +}
> +
> +/* The size of the insn_entry array.  Note that this array does not
> +   represent instructions created during this optimization pass.  */
> +static unsigned int max_uid_at_start;
> +
> +/* Return true if and only if ELEMENT is on LIST.  This is linear in
> +   the length of the list.  The list length is no longer than
> +   BOUND_MAX_USE_DEFS.  */
> +static bool
> +in_use_list (struct df_link *list, struct df_link *element)
> +{
> +  while (list != NULL)
> +    {
> +      if (element->ref == list->ref)
> + return true;
> +      list = list->next;
> +    }
> +  /* Got to end of list without finding element.  */
> +  return false;
> +}
> +
> +/* Return true iff the instruction represented by uid_1 is in the same
> +   equivalence class as the instruction represented by uid_2.
> +
> +   Returns false generally in constant time (based on unequal hash
> +   values).  Returning true, as currently implemented, requires
> +   quadratic time in the number of definitions that contribute to
> +   either the base or index expressions of the equivalence class.
> +   Number of iterations is no greater than (BOUND_MAX_USE_DEFS *
> +   BOUND_MAX_USE_DEFS).
> +
> +   To improve complexity to linear in the number of contributing
> +   definitions, allocate and remember the sorted list of all
> +   definitions for each relevant value.   */
> +static bool
> +equivalent_p (indexing_web_entry *insn_entry,
> +      unsigned int uid_1, unsigned int uid_2)
> +{
> +  if ((insn_entry[uid_1].base_equivalence_hash !=
> +       insn_entry[uid_2].base_equivalence_hash) ||
> +      (insn_entry[uid_1].index_equivalence_hash !=
> +       insn_entry[uid_2].index_equivalence_hash))
> +    return false;
> +
> +  /* Hash codes match.  Check details.  */
> +  rtx_insn *insn_1, *insn_2;
> +  insn_1 = insn_entry[uid_1].insn;
> +  insn_2 = insn_entry[uid_2].insn;
> +
> +  struct df_link *insn1_base_defs, *insn1_index_defs;
> +  struct df_link *insn2_base_defs, *insn2_index_defs;
> +
> +  find_defs (insn_entry, insn_1, &insn1_base_defs, &insn1_index_defs);
> +  find_defs (insn_entry, insn_2, &insn2_base_defs, &insn2_index_defs);
> +
> +  int base_count_1 = count_links (insn1_base_defs);
> +  int index_count_1 = count_links (insn1_index_defs);
> +  int base_count_2 = count_links (insn2_base_defs);
> +  int index_count_2 = count_links (insn2_index_defs);
> +
> +  if ((base_count_1 != base_count_2) || (index_count_1 != index_count_2))
> +    return false;
> +
> +  if (insn_entry [uid_1].original_base_reg
> +      != insn_entry [uid_2].original_base_reg)
> +    return false;
> +  else if (insn_entry [uid_1].original_index_reg
> +   != insn_entry [uid_2].original_index_reg)
> +    return false;
> +
> +  /* Counts are the same.  Make sure elements match.  */
> +  /* The following comparison code is quadratic in counts.  Improve to
> +     n*log(n) by sorting the four arrays and comparing eleents pairwise.
> +     Improve to linear by remembering the sorted contents of each of
> +     the four arrays from when the hash values were first computed.
> +     Since the typical sizes of the count variables are fairly small,
> +     leave as is unless performance measurements justify increased
> +     complexity.  */
> +  while (insn1_base_defs != NULL)
> +    {
> +      if (!in_use_list (insn2_base_defs, insn1_base_defs))
> + return false;
> +      insn1_base_defs = insn1_base_defs->next;
> +    }
> +  /* base patterns match, but stil need to consider index matches.  */
> +  while (insn1_index_defs != NULL)
> +    {
> +      if (!in_use_list (insn2_index_defs, insn1_index_defs))
> + return false;
> +      insn1_index_defs = insn1_index_defs->next;
> +    }
> +
> +  return true;
> +}
> +
> +/* Return true iff instruction E2 dominates instruction E1.  Note
> +   that insn_dominated_by_p defined in ira.c is declared static and
> +   requires initialization of auxilary data not computed in this
> +   context. */
> +static bool
> +insn_dominated_by_p (indexing_web_entry *e1, indexing_web_entry *e2)
> +{
> +  basic_block bb1 = e1->bb;
> +  basic_block bb2 = e2->bb;
> +
> +  if (bb1 == bb2)
> +    return e2->luid <= e1->luid;
> +  else
> +    return dominated_by_p (CDI_DOMINATORS, bb1, bb2);
> +}
> +
> +/* Confirm that every insn in the equivalence class that is
> +   represented by the EQUIVALENCE_HASH [INDEX] linked list can be
> +   represented using d-form memory addresses.  For any insn that
> +   cannot be so represented, decrement the value of the
> +   REPLACEMENT_COUNT argument.  Return the adjusted value of
> +   REPLACEMENT_COUNT.
> +
> +   Complexity is linear in the size of the equivalence class.  */
> +static unsigned int
> +confirm_dform_insns (indexing_web_entry *insn_entry,
> +     unsigned int *equivalence_hash, unsigned int index,
> +     unsigned int the_dominator, unsigned replacement_count)
> +{
> +  unsigned int uid;
> +  long long int dominator_delta = (insn_entry [the_dominator].base_delta
> +   + insn_entry [the_dominator].index_delta);
> +  for (uid = equivalence_hash [index]; uid != UINT_MAX;
> +       uid = insn_entry [uid].equivalent_sibling)
> +    {
> +      if (uid != the_dominator)
> + {
> +  long long int dominated_delta = (insn_entry [uid].base_delta
> +   + insn_entry [uid].index_delta);
> +  dominated_delta -= dominator_delta;
> +
> +  rtx_insn *insn = insn_entry [uid].insn;
> +  rtx body = PATTERN (insn);
> +  rtx mem;
> +
> +  gcc_assert (GET_CODE (body) == SET);
> +
> +  if (MEM_P (SET_SRC (body))) /* load */
> +    {
> +      mem = SET_SRC (body);
> +      if (!rs6000_target_supports_dform_offset_p
> +  (GET_MODE (mem), dominated_delta))
> + replacement_count--;
> +    }
> +  else
> +    {
> +      mem = SET_DEST (body); /* store */
> +      if (!rs6000_target_supports_dform_offset_p
> +  (GET_MODE (mem), dominated_delta))
> + replacement_count--;
> +    }
> + }
> +    }
> +  return replacement_count;
> +}
> +
> +/* Replace all xform insns in equivalence class with dform insns.
> +
> +   Complexity is linear in the size of the equivalence class.  */
> +void replace_xform_insns (indexing_web_entry *insn_entry,
> +  unsigned int *equivalence_hash,
> +  unsigned int index,
> +  unsigned int the_dominator)
> +{
> +  /* First, fix up the_dominator instruction.  */
> +  rtx derived_ptr_reg = gen_reg_rtx (Pmode);
> +  rtx_insn *insn = insn_entry [the_dominator].insn;
> +  rtx body = PATTERN (insn);
> +  rtx base_reg, index_reg;
> +  rtx addr, mem;
> +  rtx new_init_expr;
> +
> +  if (dump_file)
> +    {
> +      fprintf (dump_file,
> +       "Endeavoring to replace originating insn %d: ", the_dominator);
> +      print_inline_rtx (dump_file, insn, 2);
> +      fprintf (dump_file, "\n");
> +    }
> +
> +  gcc_assert (GET_CODE (body) == SET);
> +  if (MEM_P (SET_SRC (body)))
> +    {
> +      /* originating instruction is a load */
> +      mem = SET_SRC (body);
> +      addr = XEXP (SET_SRC (body), 0);
> +    }
> +  else
> +    { /* originating instruction is a store */
> +      gcc_assert (MEM_P (SET_DEST (body)));
> +      mem = SET_DEST (body);
> +      addr = XEXP (SET_DEST (body), 0);
> +    }
> +
> +  enum rtx_code code = GET_CODE (addr);
> +  gcc_assert ((code == PLUS) || (code == MINUS));
> +  base_reg = XEXP (addr, 0);
> +  index_reg = XEXP (addr, 1);
> +
> +  if (code == PLUS)
> +    new_init_expr = gen_rtx_PLUS (Pmode, base_reg, index_reg);
> +  else
> +    new_init_expr = gen_rtx_MINUS (Pmode, base_reg, index_reg);
> +  new_init_expr = gen_rtx_SET (derived_ptr_reg, new_init_expr);
> +
> +  rtx_insn *new_insn = emit_insn_before (new_init_expr, insn);
> +  set_block_for_insn (new_insn, BLOCK_FOR_INSN (insn));
> +  INSN_CODE (new_insn) = -1; /* force re-recogniition. */
> +  df_insn_rescan (new_insn);
> +
> +  if (dump_file)
> +    {
> +      fprintf (dump_file, "with insn %d: ", INSN_UID (new_insn));
> +      print_inline_rtx (dump_file, new_insn, 2);
> +      fprintf (dump_file, "\n");
> +    }
> +
> +  /* If dominator_delta != 0, we need to make adjustments for dominator_delta
> +     in the D-form constant offsets associated with the propagating
> +     instructions.  */
> +
> +  rtx new_mem = gen_rtx_MEM (GET_MODE (mem), derived_ptr_reg);
> +  MEM_COPY_ATTRIBUTES (new_mem, mem);
> +  rtx new_expr;
> +  if (insn_entry [the_dominator].is_load)
> +    new_expr = gen_rtx_SET (SET_DEST (body), new_mem);
> +  else
> +    new_expr = gen_rtx_SET (new_mem, SET_SRC (body));
> +
> +  if (!validate_change (insn, &PATTERN(insn), new_expr, false))
> +    { /* proposed change was rejected */
> +      if (dump_file)
> + {
> +  fprintf (dump_file,
> +   "Dform optimization rejected by validate_change\n");
> +  print_inline_rtx (dump_file, new_insn, 2);
> +  fprintf (dump_file, "\n");
> + }
> +    }
> +  else if (dump_file)
> +    {
> +      fprintf (dump_file, "and with insn %d: ", INSN_UID (insn));
> +      print_inline_rtx (dump_file, insn, 2);
> +      fprintf (dump_file, "\n");
> +    }
> +
> +  for (unsigned int uid = equivalence_hash [index]; uid != UINT_MAX;
> +       uid = insn_entry [uid].equivalent_sibling)
> +    {
> +      if (uid != the_dominator)
> + {
> +  long long int dominated_delta = (insn_entry [uid].base_delta
> +   + insn_entry [uid].index_delta);
> +  long long int dominator_delta
> +    = (insn_entry [the_dominator].base_delta
> +       + insn_entry [the_dominator].index_delta);
> +  dominated_delta -= dominator_delta;
> +
> +  rtx_insn *insn = insn_entry [uid].insn;
> +  rtx body = PATTERN (insn);
> +  rtx mem;
> +
> +  if (dump_file)
> +    {
> +      fprintf (dump_file,
> +       "Endeavoring to replace propagating insn %d: ", uid);
> +      print_inline_rtx (dump_file, insn, 2);
> +      fprintf (dump_file, "\n");
> +    }
> +
> +  gcc_assert (GET_CODE (body) == SET);
> +  if (MEM_P (SET_SRC (body))) /* load */
> +    mem = SET_SRC (body);
> +  else
> +    mem = SET_DEST (body); /* store */
> +
> +  rtx ci = gen_rtx_raw_CONST_INT (Pmode, dominated_delta);
> +  rtx addr_expr = gen_rtx_PLUS (Pmode,
> + derived_ptr_reg, ci);
> +  rtx new_mem = gen_rtx_MEM (GET_MODE (mem), addr_expr);
> +  MEM_COPY_ATTRIBUTES (new_mem, mem);
> +
> +  rtx new_expr;
> +  if (insn_entry [uid].is_load)
> +    new_expr = gen_rtx_SET (SET_DEST (body), new_mem);
> +  else
> +    new_expr = gen_rtx_SET (new_mem, SET_SRC (body));
> +
> +  if (!validate_change (insn, &PATTERN(insn), new_expr, false))
> +    { /* proposed change was rejected */
> +      if (dump_file)
> + {
> +  fprintf (dump_file,
> +   "Dform optimization rejected by validate_change\n");
> +  print_inline_rtx (dump_file, new_expr, 2);
> +  fprintf (dump_file, "\n");
> + }
> +    }
> +  else if (dump_file)
> +    {
> +      fprintf (dump_file, "with insn %d: ", INSN_UID (insn));
> +      print_inline_rtx (dump_file, insn, 2);
> +      fprintf (dump_file, "\n");
> +    }
> + }
> +    }
> +}
> +
> +/* Organize all "relevant" instructions into equivalence classes.
> +   Relevant instructions are instructions that load or store memory
> +   where the memory address is represented by a sum or difference of a
> +   base address register and an integer offset register.
> +
> +   An equivalence class holds all of the load and store operations
> +   that refer to the same computed base and/or index variables plus or
> +   minus some constant value.
> +
> +   All expressions in each equivalence class are replaced with d-form
> +   instructions in the emitted code on P9 or above.
> +
> +   Calculation of hash functions is linear in the number of
> +   definitions that potentially contribute to the computation of a
> +   particular variable's value.
> +
> +   Assume hash table insertion and lookup is constant time (i.e. we
> +   normally do not experience collision on hash values).
> +
> +   Complexity of of this function is O(m*n) where m is number of
> +   equivalence classes and n is maximum number of entries in the
> +   equivalence class.  Since each insn can be a member of only one
> +   equivalence class, this is linear in the number of insns within
> +   the function.  */
> +static void
> +build_and_fixup_equivalence_classes (indexing_web_entry *insn_entry)
> +{
> +  unsigned int i;
> +  /* There can be no more equivalence classes than the total number of
> +     instructions in the analyzed function.  Usually, there are far
> +     fewer instructions because many of the instructions are not load
> +     or store instructions, and some of those that are load and store
> +     instructions may end up in the same equivalence class.  */
> +  unsigned int *equivalence_hash =
> +    (unsigned int *) alloca (max_uid_at_start * sizeof (unsigned int));
> +
> +  /* Initialize the equivalence_hash array.  */
> +  for (i = 0; i < max_uid_at_start; i++)
> +    equivalence_hash [i] = UINT_MAX;
> +
> +  /* Place each relevant instruction into an equivalence class, either
> +     a class consisting only of itself, or a class that includes other
> +     relevant instructions.
> +
> +     Complexity of this loop is O(m*n^2) where m is the number of relevant
> +     instructions and n is number of definitions of the registers that
> +     hold the base or index components of the memory operation's
> +     address.  The number of iterations of the slow path through the
> +     loop is less than m * (BOUND_MAX_USE_DEFS * BOUND_MAX_USE_DEFS).  */
> +  for (unsigned int uid = 0; uid < max_uid_at_start; uid++)
> +    {
> +      if (insn_entry [uid].is_relevant)
> + {
> +  unsigned int hash = ((insn_entry [uid].base_equivalence_hash
> + + insn_entry [uid].index_equivalence_hash)
> +       % max_uid_at_start);
> +
> +  if (equivalence_hash [hash] == UINT_MAX)
> +    { /* first mention of this class */
> +      equivalence_hash [hash] = uid;
> +      insn_entry [uid].equivalent_sibling = UINT_MAX;
> +    }
> +  else
> +    {
> +      while ((equivalence_hash [hash] != UINT_MAX)
> +     && !equivalent_p (insn_entry, uid,
> +       equivalence_hash [hash]))
> + hash = (hash + 1) % max_uid_at_start;
> +
> +      if (equivalence_hash [hash] != UINT_MAX)
> + {
> +  /* Found an equivalent instruction. */
> +  insn_entry [uid].equivalent_sibling =
> +    equivalence_hash [hash];
> +  equivalence_hash [hash] = uid;
> + }
> +      else
> + {
> +  /* Equivalence class doesn't yet exist.  */
> +  equivalence_hash [hash] = uid;
> +  insn_entry [uid].equivalent_sibling = UINT_MAX;
> + }
> +    }
> + }
> +    }
> +
> +  /* Scrutinize each equivalence class.  For any entries in the
> +     equivalence class that are on conditional control flows (such
> +     that they do not dominate the other entries), remove these
> +     entries from the equivalence class.  */
> +  for (unsigned int i = 0; i < max_uid_at_start; i++)
> +    {
> +      while (equivalence_hash [i] != UINT_MAX)
> + {
> +  unsigned int the_dominator = equivalence_hash [i];
> +  unsigned int uid;
> +
> +  if (dump_file && (dump_flags & TDF_DETAILS))
> +    fprintf (dump_file, "Equivalence class consists of\n");
> +
> +  /* Find the dominator for this equivalence class.
> +
> +     Complexity of following loop body is O(m*n) where m is number
> +     of equivalence classes and n is maximum number of entries
> +     in the equivalence class.  */
> +  for (uid = the_dominator; uid != UINT_MAX;
> +       uid = insn_entry [uid].equivalent_sibling)
> +    {
> +      if (insn_dominated_by_p (&insn_entry [the_dominator],
> +       &insn_entry [uid]))
> + the_dominator = uid;
> +      if (dump_file && (dump_flags & TDF_DETAILS))
> + fprintf (dump_file, "  member: %d\n", uid);
> +    }
> +
> +  unsigned int size_of_equivalence = 0;
> +  unsigned int removed_partition = UINT_MAX;
> +  unsigned int preceding_uid = UINT_MAX;
> +  unsigned int next_uid;
> +
> +  /* Having found a dominator, remove from this equivalence
> +     class any element that is not dominated by the_dominator.
> +
> +     Though this loop appears to be O(m*n), for m being number
> +     of equivalence classes and n being maximum number of
> +     entries in an equivalence class, it can also be described
> +     as O(i), where i is the number of instructions in the
> +     functions implementation.  Each iteration of the loop
> +     deals with a distinct instruction.  */
> +  for (uid = equivalence_hash [i]; uid != UINT_MAX; uid = next_uid)
> +    {
> +      next_uid = insn_entry [uid].equivalent_sibling;
> +      if (!insn_dominated_by_p (&insn_entry [uid],
> + &insn_entry [the_dominator]))
> + {
> +  /* insn uid thinks its in this equivalence class, but
> +     it's not dominated by the_dominator, so remove it.  */
> +  insn_entry [uid].equivalent_sibling = removed_partition;
> +  removed_partition = uid;
> +  if (preceding_uid == UINT_MAX)
> +    equivalence_hash [i] = next_uid;
> +  else
> +    insn_entry [preceding_uid].equivalent_sibling = next_uid;
> + }
> +      else
> + {
> +  size_of_equivalence++;
> +  preceding_uid = uid;
> + }
> +    }
> +
> +  /* By same argument as above, confirm_dform_insns and
> +     replace_xform_insns are also O(i), where i is the number
> +     of instructions in the function.  */
> +  if (confirm_dform_insns (insn_entry, equivalence_hash, i,
> +   the_dominator, size_of_equivalence) > 1)
> +    replace_xform_insns (insn_entry, equivalence_hash,
> + i, the_dominator);
> +  else if (dump_file)
> +    fprintf (dump_file,
> +     "Abandoning dform optimization: too few dform insns\n");
> +
> +  /* if (removed_partition != UINT_MAX), need to reprocess the
> +     contents of the removed_partition.  There may be
> +     additional opportunity to optimize within the set of
> +     insns that were not dominated by the selected dominator.
> +
> +     Each time through this loop, at least one dominator and
> +     any instructions it dominates are "processed".  Anything
> +     not dominated by the selected dominator remains in the
> +     "removed partition".  The "removed partition" gets
> +     smaller on each iteration, assuring eventual termination.  */
> +  equivalence_hash [i] = removed_partition;
> + }
> +    }
> +}
> +
> +/* Assess whether the instruction represented by uid is relevant,
> +   setting *is_relevant to true if so, and setting *is_originating to
> +   true if this use is an originating definition.
> +
> +   If the insn is determined to be relevant and is_base is true, overwrite
> +   the base_delta, original_base_reg, original_base_use, and
> +   base_equivalence_hash fields of insn_entry[uid].
> +
> +   Otherwise, if the insn is determined to be relevant and is_base is
> +   not true, overwrite the index_delta, original_index_reg,
> +   original_index_use, and index_equivalence_hash fields of
> +   insn_entry[uid].
> +
> +   In case the insn is not determined to be relevant, certain fields
> +   fields of insn_entry[uid] may be overwritten with scratch values
> +   that have no significance as these fields are not consulted
> +   subequently in the case that the insn is not relevant.
> +
> +   Complexity: linear in the length of the number of entries on the
> +   use-definition chain, as represented by argument USE.
> +*/
> +static void
> +assess_use_relevance (bool is_base, rtx reg, bool *is_relevant,
> +      bool *is_originating, df_ref use,
> +      unsigned int uid, indexing_web_entry *insn_entry)
> +{
> +  if (dump_file && (dump_flags & TDF_DETAILS))
> +    {
> +      fprintf (dump_file, "Found use corresponding to %s\n",
> +       is_base? "base": "index");
> +      df_ref_debug (use, dump_file);
> +    }
> +  struct df_link *def_link = DF_REF_CHAIN (use);
> +
> +  /* If there is no definition, or the definition is artificial, or there
> +     are multiple definitions, this is originating use.  */
> +  if (!def_link || !def_link->ref
> +      || DF_REF_IS_ARTIFICIAL (def_link->ref) || def_link->next)
> +    {
> +      if (dump_file && (dump_flags & TDF_DETAILS))
> + fprintf (dump_file, "Use is originating!\n");
> +      *is_relevant = true;
> +      *is_originating = true;
> +      unsigned int hash = equivalence_hash (def_link);
> +
> +      if (is_base)
> + {
> +  insn_entry[uid].base_delta = 0;
> +  insn_entry[uid].original_base_reg = REGNO (reg);
> +  insn_entry[uid].original_base_use = uid;
> +  insn_entry[uid].base_equivalence_hash = hash;
> + }
> +      else
> + {
> +  insn_entry[uid].index_delta = 0;
> +  insn_entry[uid].original_index_reg = REGNO (reg);
> +  insn_entry[uid].original_index_use = uid;
> +  insn_entry[uid].index_equivalence_hash = hash;
> + }
> +    }
> +  else
> +    {
> +      /* Only one definition.  Dig deeper.  */
> +      long long int delta = 0;
> +      rtx insn2 = find_true_originator (def_link, &delta);
> +      if (insn2)
> + {
> +  unsigned uid2 = INSN_UID (insn2);
> +  df_ref use2;
> +
> +  if (dump_file  && (dump_flags & TDF_DETAILS))
> +    fprintf (dump_file, "Use may propagate from %d\n", uid2);
> +
> +  struct df_insn_info *insn_info2 = DF_INSN_INFO_GET (insn2);
> +
> +  if (insn_info2)
> +    use2 = DF_INSN_INFO_USES (insn_info2);
> +  else
> +    use2 = NULL;
> +
> +  if (!use2 || !DF_REF_NEXT_LOC (use2))
> +    {
> +      *is_originating = false;
> +
> +      rtx body = PATTERN (insn2);
> +      gcc_assert (GET_CODE (body) == SET);
> +      gcc_assert (REG_P (SET_DEST (body)));
> +
> +      if (is_base)
> + {
> +  insn_entry[uid].original_base_reg = REGNO (SET_DEST (body));
> +  insn_entry[uid].original_base_use  = uid2;
> +  insn_entry[uid].base_delta = delta;
> + }
> +      else /* !is_base means is_index.  */
> + {
> +  insn_entry[uid].original_index_reg = REGNO (SET_DEST(body));
> +  insn_entry[uid].original_index_use = uid2;
> +  insn_entry[uid].index_delta = delta;
> + }
> +
> +      if (use2)
> + {
> +  struct df_link *def_link = DF_REF_CHAIN (use2);
> +  unsigned int hash = equivalence_hash (def_link);
> +  *is_relevant = true;
> +  if (is_base)
> +    insn_entry[uid].base_equivalence_hash = hash;
> +  else
> +    insn_entry[uid].index_equivalence_hash = hash;
> + }
> +      /* else, use is not relevant.  */
> +
> +      if (dump_file && (dump_flags & TDF_DETAILS))
> + fprintf (dump_file,
> + " propagates from originating insn %d"
> + " with delta: %lld\n", uid2, delta);
> +    }
> +  else if (dump_file && (dump_flags & TDF_DETAILS))
> +    fprintf (dump_file, " Dependencies are too"
> +     " complicated for this optimization\n");
> + }
> +    }
> +}
> +
> +/* Given that INSN represents a memory store or load operation, that
> +   MEM is the expression that computes the address to or from which
> +   memory is transferred and INSN_ENTRY holds the array representing
> +   all of the indexing_web_entry structures associated with the
> +   instructions of this function, set the is_relevant field of
> +   INSN_ENTRY [UID] if the form of the memory address expression is
> +   compatible with dform optimization.
> +
> +   If the insn is considered relevant, this function also initializes
> +   the following fields of the corresponding insn_entry:
> +
> + is_originating
> +
> +   If is_relevant and is_originating, we set:
> +
> + original_base_reg
> + original_base_use
> + base_delta
> + base_equivalence_hash
> +
> + original_index_reg = REGNO (index_reg);
> + original_index_use = uid;
> + index_delta = 0;
> + index_equivalence_hash
> +
> +   If is_relevant and !is_originating, we set:
> +
> + original_index_reg
> + original_index_use
> + index_delta = (non-zero)
> + index_equivalence_hash (only if use2 != NULL)
> +
> +   Complexity is linear in the number of variables "used" by this
> +   instruction, multiplied by the number of definitions of each
> +   variable.  Loop iterations are no greater than
> +   BOUND_RELEVANT_USES * BOUND_MAX_USE_DEFS.  */
> +static void
> +assess_relevance (rtx mem, rtx_insn *insn, indexing_web_entry *insn_entry)
> +{
> +  unsigned int uid = INSN_UID (insn);
> +  rtx base_reg = XEXP (mem,0);
> +  rtx index_reg = XEXP (mem, 1);
> +
> +  if (dump_file && (dump_flags & TDF_DETAILS))
> +    {
> +      fprintf (dump_file, " memory is base +/- index, ");
> +      fprintf (dump_file, "base: ");
> +      print_inline_rtx (dump_file, base_reg, 2);
> +      fprintf (dump_file, "\n index: ");
> +      print_inline_rtx (dump_file, index_reg, 2);
> +      fprintf (dump_file, "\n");
> +    }
> +
> +  if (REG_P (base_reg) && REG_P (index_reg))
> +    {
> +      struct df_insn_info *insn_info = DF_INSN_INFO_GET (insn);
> +      /* Since insn is known to represent a sum or difference, this
> + insn is likely to use at least two input variables.  */
> +
> +      int num_base_defs = 0;
> +      int num_index_defs = 0;
> +      bool base_is_relevant = false;
> +      bool index_is_relevant = false;
> +      bool base_is_originating = false;
> +      bool index_is_originating = false;
> +      df_ref use;
> +      int use_count = 0;
> +
> +      /* Iterate over the number of definitions used by this
> + instruction to find the definitions that correspond to the
> + base register and the index register.  */
> +      FOR_EACH_INSN_INFO_USE (use, insn_info)
> + {
> +  if (use_count++ >= BOUND_RELEVANT_USES)
> +    {
> +      /* Relevant insns generally have no more than 3 uses.  */
> +      num_base_defs = 0;
> +      num_index_defs = 0;
> +      break;
> +    }
> +  else if (rtx_equal_p (DF_REF_REG (use), base_reg))
> +    {
> +      assess_use_relevance (true, base_reg, &base_is_relevant,
> +   &base_is_originating,
> +   use, uid, insn_entry);
> +      num_base_defs++;
> +    }
> +  else if (rtx_equal_p (DF_REF_REG (use), index_reg))
> +    {
> +      assess_use_relevance (false, index_reg, &index_is_relevant,
> +   &index_is_originating,
> +   use, uid, insn_entry);
> +      num_index_defs++;
> +    }
> + }
> +
> +      /* This insn is only relevant if there is exactly one definition of
> + base and one definition of index and they are both considered to
> + be relevant.  */
> +      if ((num_base_defs == 1) && (num_index_defs == 1)
> +  && base_is_relevant && index_is_relevant
> +  && (num_relevant_insns++ < MAX_RELEVANT_INSNS)
> +  && (max_use_defs <= BOUND_MAX_USE_DEFS))
> + {
> +  insn_entry[uid].is_relevant = true;
> +  insn_entry[uid].is_originating =
> +    (base_is_originating && index_is_originating);
> + }
> +      else if (dump_file)
> + {
> +  int rejection_count = 0;
> +  fprintf (dump_file,
> +   "Rejecting dform optimization of insn %d\n", uid);
> +  if (num_base_defs != 1)
> +    {
> +      fprintf (dump_file, "Too %s (%d) base definitions\n",
> +       (num_base_defs > 1)? "many": "few", num_base_defs);
> +      rejection_count++;
> +    }
> +  if (num_index_defs != 1)
> +    {
> +      fprintf (dump_file, "Too %s (%d) index definitions\n",
> +       (num_index_defs > 1)? "many": "few", num_index_defs);
> +      rejection_count++;
> +    }
> +  if (!base_is_relevant)
> +    {
> +      fprintf (dump_file,
> +       "The available base definition is not relevant\n");
> +      rejection_count++;
> +    }
> +  if (!index_is_relevant)
> +    {
> +      fprintf (dump_file,
> +       "The available index definition is not relevant\n");
> +      rejection_count++;
> +    }
> +  if ((rejection_count == 0)
> +      && (num_relevant_insns >= MAX_RELEVANT_INSNS))
> +    {
> +      fprintf (dump_file, "Too many relevant insns\n");
> +      rejected_otherwise_relevant++;
> +    }
> +  else if (max_use_defs >= BOUND_MAX_USE_DEFS)
> +    {
> +      fprintf (dump_file, "Too many relevant use definitions\n");
> +      rejected_otherwise_relevant++;
> +    }
> + }
> +      else
> + {
> +  if ((num_base_defs == 1) && (num_index_defs == 1)
> +      && base_is_relevant && index_is_relevant)
> +    rejected_otherwise_relevant++;
> + }
> +    }
> +  else if (dump_file && (dump_flags & TDF_DETAILS))
> +    fprintf (dump_file, " punting because base or index not registers\n");
> +}
> +
> +
> +/* Main entry point for this pass.
> +
> +   Complexity is linear in the number of instructions in the function
> +   plus the complexity of build_and_fixup_equivalence_classes, which
> +   is also linear in the number of instruction (since each instruction
> +   is in no more than one equivalence class).  */
> +unsigned int
> +rs6000_insert_dform (function *fun)
> +{
> +  basic_block bb;
> +  rtx_insn *insn, *curr_insn = 0;
> +  indexing_web_entry *insn_entry;
> +  unsigned int luid = 0;
> +
> +  num_relevant_insns = 0;
> +  max_use_defs = 0;
> +  rejected_otherwise_relevant = 0;
> +
> +  calculate_dominance_info (CDI_DOMINATORS);
> +
> +  /* Dataflow analysis for use-def chains.  */
> +  df_set_flags (DF_RD_PRUNE_DEAD_DEFS);
> +  df_chain_add_problem (DF_DU_CHAIN | DF_UD_CHAIN);
> +  df_analyze ();
> +
> +  /* Since this pass creates new instructions, get_max_uid () may
> +     return different values at different times during this pass.  The
> +     insn_entry array represents only the instructions that were
> +     present in this function's representation at the start of this
> +     pass.  */
> +  max_uid_at_start = get_max_uid ();
> +  insn_entry = XCNEWVEC (indexing_web_entry, max_uid_at_start);
> +
> +  if (dump_file)
> +    fprintf (dump_file, "Creating insn_entry array with %d entries\n",
> +     max_uid_at_start);
> +
> +  /* The general approach is to:
> +
> +       1. Look for multiple array indexing expressions that refer to
> +          the same array base address such as are represented by the
> +  rtl excerpts below..
> +
> +       2. Group these into subsets for which the indexing expression
> +          derives from the same initial_value + some accumulation of
> +          constant values added thereto.
> +
> +      (cinsn 2 (set (reg/v/f:DI <27> [ x ])
> +                    (reg:DI 3 [ x ])) "ddot-c.c":12
> +       (expr_list:REG_DEAD (reg:DI 3 [ x ])))
> +
> +      ...
> +
> +      (cinsn 31 (set (reg:V2DF <35> [ vect__3.7 ])
> +                     (mem:V2DF (plus:DI (reg/v/f:DI <27> [ x ])
> +                                        (reg:DI <9> [ ivtmp.18 ]))
> +                      [1 MEM[base: x_20(D), index: ivtmp.18_35,
> +                       offset: 0B]+0 S16 A64])) "ddot-c.c":18)
> +
> +      ...
> +
> +      (cinsn 304 (set (reg:DI <70>)
> +                      (plus:DI (reg:DI <9> [ ivtmp.18 ])
> +                               (const_int 16)))
> +       (expr_list:REG_DEAD (reg:DI <9> [ ivtmp.18 ])))
> +      (cinsn 34 (set (reg:DI <9> [ ivtmp.18 ])
> +                     (reg:DI <70>)))
> +   */
> +
> +  /* Walk the insns to gather basic data: complexity of inner-nested
> +     loop body is linear in total number of insns within function.  */
> +  FOR_ALL_BB_FN (bb, fun)
> +    {
> +      if (dump_file && (dump_flags & TDF_DETAILS))
> + fprintf (dump_file, "Scrutinizing bb %d\n", bb->index);
> +
> +      FOR_BB_INSNS_SAFE (bb, insn, curr_insn)
> + {
> +  unsigned int uid = INSN_UID (insn);
> +
> +  insn_entry[uid].insn = insn;
> +  insn_entry[uid].bb = BLOCK_FOR_INSN (insn);
> +  insn_entry[uid].luid = luid++;
> +  insn_entry[uid].is_relevant = false;
> +
> +  if (dump_file && (dump_flags & TDF_DETAILS))
> +    {
> +      fprintf (dump_file, "\nLooking at insn: %d\n", uid);
> +      df_dump_insn_top (insn, dump_file);
> +      dump_insn_slim (dump_file, insn);
> +      df_dump_insn_bottom (insn, dump_file);
> +    }
> +
> +  /* First, look for all memory[base + index] expressions.
> +   * Then group these by base.
> +   * Then for all instructions in each group, scrutinize the index
> +   * definition. Partition this group according to the origin
> +   * variable upon which the the definitions of i are based.
> +   *
> +   * How do we define "origin variable"?
> +   *
> +   *  If i has multiple definitions, it is its own origin
> +   *  variable.  Likewise, if i has a single definition and the
> +   *  definition is NOT the sum or difference of a constant value
> +   *  and some other variable, then i is its own origin variable.
> +   *
> +   *  Otherwise, i has the same origin variable as the expression
> +   *  that represents its definition.
> +   *
> +   * After we've created these partitions, for each partition
> +   * whose size is greater than 1:
> +   *
> +   *  1. introduce derived_ptr = base + origin_variable
> +   *     immediately following the instruction that defines
> +   *     origin_variable.
> +   *
> +   *  2. for each member of the partition, replace the expression
> +   *     memory [base + index] with derived_ptr [constant], where
> +   *     constant is the sum of all constant values added to the
> +   *     origin variable to represent this particular value of i.  */
> +  if (NONDEBUG_INSN_P (insn))
> +    {
> +      rtx body = PATTERN (insn);
> +      rtx mem;
> +      if ((GET_CODE (body) == SET) && MEM_P (SET_SRC (body)))
> + {
> +  mem = XEXP (SET_SRC (body), 0);
> +  insn_entry[uid].is_load = true;
> +  insn_entry[uid].is_store = false;
> +  if (dump_file && (dump_flags & TDF_DETAILS))
> +    {
> +      fprintf (dump_file,
> +       " this insn is fetching data from memory: ");
> +      print_inline_rtx (dump_file, mem, 2);
> +      fprintf (dump_file, "\n");
> +    }
> + }
> +      else if ((GET_CODE (body) == SET) && MEM_P (SET_DEST (body)))
> + {
> +  mem = XEXP (SET_DEST (body), 0);
> +  insn_entry[uid].is_load = false;
> +  insn_entry[uid].is_store = true;
> +  if (dump_file && (dump_flags & TDF_DETAILS))
> +    {
> +      fprintf (dump_file,
> +       " this insn is storing data to memory: ");
> +      print_inline_rtx (dump_file, mem, 2);
> +      fprintf (dump_file, "\n");
> +    }
> + }
> +      else
> + {
> +  if (dump_file && (dump_flags & TDF_DETAILS))
> +    fprintf (dump_file,
> +     " this insn is neither load nor store\n");
> +  continue; /* Not a load or store */
> + }
> +
> +      enum rtx_code code = GET_CODE (mem);
> +      if ((code == PLUS) || (code == MINUS))
> + assess_relevance (mem, insn, insn_entry);
> +      else if (dump_file && (dump_flags & TDF_DETAILS))
> + fprintf (dump_file,
> + " address not sum or difference of values\n");
> +    }
> +  /* else, this is a DEBUG_INSN_P (insn) so ignore it.  */
> + }
> +
> + if (dump_file && (dump_flags & TDF_DETAILS))
> +  fprintf (dump_file, "\n");
> +    }
> +
> +  build_and_fixup_equivalence_classes (insn_entry);
> +  free_dominance_info (CDI_DOMINATORS);
> +
> +  if (dump_file)
> +    {
> +      fprintf (dump_file, "\n");
> +      fprintf (dump_file,
> +       "Total number of potentially relevant instructions: %d\n",
> +       num_relevant_insns);
> +      fprintf (dump_file,
> +       "Maximum number of definitions used by potentially "
> +       "relevant instructions: %d\n", max_use_defs);
> +      fprintf (dump_file,
> +       "Total number of potentially relevant but rejected "
> +       "instructions: %d\n\n", rejected_otherwise_relevant);
> +    }
> +
> +  return 0;
> +}  // anon namespace
> +
> +
> +const pass_data pass_data_insert_dform =
> +{
> +  RTL_PASS, /* type */
> +  "dform", /* name */
> +  OPTGROUP_NONE, /* optinfo_flags, or could use OPTGROUP_LOOP */
> +  TV_NONE, /* tv_id, or could use TV_LOOP_UNROLL */
> +  0, /* properties_required */
> +  0, /* properties_provided */
> +  0, /* properties_destroyed */
> +  0, /* todo_flags_start */
> +  TODO_df_finish, /* todo_flags_finish */
> +};
> +
> +class pass_insert_dform: public rtl_opt_pass
> +{
> +public:
> +  pass_insert_dform(gcc::context *ctxt)
> +    : rtl_opt_pass(pass_data_insert_dform, ctxt)
> +  {}
> +
> +  /* opt_pass methods: */
> +  virtual bool gate (function *)
> +    {
> +      // This is most relevant to P9 and subsequent targets since P9
> +      // introduces new D-form instructions, but this may pay off on
> +      // other architectures as well.  Additional experimentation with
> +      // other targets may be worthwhile.
> +      return (optimize > 0 && !BYTES_BIG_ENDIAN && TARGET_VSX
> +      && TARGET_P9_VECTOR);
> +    }
> +
> +  virtual unsigned int execute (function *fun)
> +    {
> +      return rs6000_insert_dform (fun);
> +    }
> +
> +  opt_pass *clone ()
> +    {
> +      return new pass_insert_dform (m_ctxt);
> +    }
> +
> +}; // class pass_insert_dform
> +
> +rtl_opt_pass *make_pass_insert_dform (gcc::context *ctxt)
> +{
> +  return new pass_insert_dform (ctxt);
> +}
> Index: gcc/config/rs6000/rs6000-passes.def
> ===================================================================
> --- gcc/config/rs6000/rs6000-passes.def (revision 275051)
> +++ gcc/config/rs6000/rs6000-passes.def (working copy)
> @@ -22,6 +22,8 @@ along with GCC; see the file COPYING3.  If not see
>     INSERT_PASS_AFTER (PASS, INSTANCE, TGT_PASS)
>     INSERT_PASS_BEFORE (PASS, INSTANCE, TGT_PASS)
>     REPLACE_PASS (PASS, INSTANCE, TGT_PASS)
> +   Be advised: gawk program does not parse C comments if inserted below.
>   */
>
>    INSERT_PASS_BEFORE (pass_cse, 1, pass_analyze_swaps);
> +  INSERT_PASS_AFTER (pass_loop2, 1, pass_insert_dform);
> Index: gcc/config/rs6000/rs6000-protos.h
> ===================================================================
> --- gcc/config/rs6000/rs6000-protos.h (revision 275051)
> +++ gcc/config/rs6000/rs6000-protos.h (working copy)
> @@ -47,6 +47,8 @@ extern bool legitimate_indirect_address_p (rtx, in
>  extern bool legitimate_indexed_address_p (rtx, int);
>  extern bool avoiding_indexed_address_p (machine_mode);
>  extern rtx rs6000_force_indexed_or_indirect_mem (rtx x);
> +extern bool rs6000_target_supports_dform_offset_p (machine_mode,
> +   HOST_WIDE_INT);
>
>  extern rtx rs6000_got_register (rtx);
>  extern rtx find_addr_reg (rtx);
> @@ -246,6 +248,8 @@ namespace gcc { class context; }
>  class rtl_opt_pass;
>
>  extern rtl_opt_pass *make_pass_analyze_swaps (gcc::context *);
> +extern rtl_opt_pass *make_pass_insert_dform (gcc::context *);
> +
>  extern bool rs6000_sum_of_two_registers_p (const_rtx expr);
>  extern bool rs6000_quadword_masked_address_p (const_rtx exp);
>  extern rtx rs6000_gen_lvx (enum machine_mode, rtx, rtx);
> Index: gcc/config/rs6000/rs6000.c
> ===================================================================
> --- gcc/config/rs6000/rs6000.c (revision 275051)
> +++ gcc/config/rs6000/rs6000.c (working copy)
> @@ -8725,6 +8725,152 @@ rs6000_debug_legitimate_address_p (machine_mode mo
>    return ret;
>  }
>
> +/* This function provides an approximation of which d-form addressing
> +   expressions are valid on any given target configuration.  This
> +   approximation guides optimization choices.  Secondary validation
> +   of the addressing mode is performed before code generation.
> +
> +   Return true iff target has instructions to perform a memory
> +   operation at the specified BYTE_OFFSET from an address held
> +   in a general purpose register.  */
> +bool
> +rs6000_target_supports_dform_offset_p (machine_mode mode,
> +       HOST_WIDE_INT byte_offset)
> +{
> +  const HOST_WIDE_INT max_16bit_signed = (0x7fff);
> +  const HOST_WIDE_INT min_16bit_signed = -1 - max_16bit_signed;
> +
> +  /* Available d-form instructions with P1 (the original Power architecture):
> +
> +     lbz RT,D(RA) - load byte and zero d-form
> +     lhz RT,D(RA) - load half word and zero d-form
> +     lha RT,D(RA) - load half word algebraic d-form
> +     lwz RT,D(RA) - load word and zero d-form
> +     lfs FRT,D(RA) - load floating-point single d-form
> +     lfd FRT,D(RA) - load floating-point double d-form
> +
> +     stb RS,D(RA) - store byte d-form
> +     sth RS,D(RA) - store half word d-form
> +     stfs FRS,D(RA) - store floating point single d-form
> +     stfd FRS,D(RA) - store floating point double d-form  */
> +
> +  /* Available d-form instructions with PPC (prior to v2.00):
> +     (option mpowerpc "existed in the past" but is now "always on"
> +
> +     lwa RT,DS(RA) - load word algebraic ds-form (2 bottom bits zero)
> +     ld RT,DS(RA) - load double word ds-form (2 bottom bits zero)
> +
> +     std RS,DS(RA) - store double word ds-form (2 bottom bits zero)
> +
> +     Consider lwa redundant with insn available in prior processors.  */
> +  switch (mode)
> +    {
> +    case E_QImode:
> +    case E_HImode:
> +    case E_SImode:
> +    case E_SFmode:
> +    case E_DFmode:
> +      if (IN_RANGE (byte_offset, min_16bit_signed, max_16bit_signed))
> + return true;
> +      break;
> +
> +    case E_DImode:
> +      if (IN_RANGE (byte_offset, min_16bit_signed, max_16bit_signed)
> +  && ((byte_offset & 0x03) == 0))
> + return true;
> +      break;
> +
> +    default:
> +      ;   /* Fall through to see if other instructions will work.  */
> +
> +    }
> +
> +  /* Available d-form instructions with v2.03:
> +
> +     lq RTp,DQ(RA) - load quadword dq-form (4 bottom bits zero)
> +
> +     stq RSp,DS(RA) - store quadword ds-form (2 bottom bits zero)
> +
> +     These instructions are not recommended for general use as they
> +     are expected to be very inefficient.  Their design was apparently
> +     motivated by a need to support atomic quad-word access, which is
> +     difficult to implement even in hardware on some architectures.
> +     Furthermore, the design of these instructions apparently does the
> +     "wrong" thing with regards to swapping of double words on load
> +     and store for little-endian targets.
> +
> +     Therefore, this routine assumes v2.03 does NOT support quadword
> +     d-form addressing.  */
> +
> +  /* Available d-form instructions with v2.05
> +
> +     (There are some floating-point load and store double-pair
> +      instructions.  Consider them "not available".  There are
> +      described as phasing out, which means they are expected
> +      to have poor performance.)  */
> +
> +  /* Available d-form instructions with 3.0
> +
> +     lxsd VRT,DS(RA) - Load VSX scalar double word ds-form (2 bottom bits zero)
> +                       (redundant with lfd from P1)
> +     lxssp VRT,DS(RA) - Load VSX scalar single precision ds-form
> +                        (bottom 2 bits zero)
> +                        (redundant with lfs from P1)
> +     lxv XT,DQ(RA) - Load VSX Vector dq-form (4 bottom bits zero)
> +                     (Works on little endian for any element type, but
> +      does not preserve lanes.)
> +
> +     stxsd VRS,DS(RA) - Store VSX scalar double-word DS form
> +                        (bottom 2 bits zero)
> +                        (redundant with stfd from P1)
> +     stxssp VRS,DS(RA) - Store VSX scalar single precision DS-form
> +                         (bottom 2 bits zero)
> +                         (redundant with stfs from P1)
> +     stxv XS,DQ(RA) - Store VSX vector dq-form (4 bottom bits zero)
> +                      (Works on little endian for any element type,
> +       but does not preserve lanes.)
> +
> +     lxv and stxv load/store to/from any VSX register, including
> +     registers that overlay with floating point and altivec register
> +     sets.  */
> +
> +  if (rs6000_isa_flags & OPTION_MASK_MODULO) /* ISA 3.0 */
> +    {
> +      switch (mode)
> + {
> + case E_V16QImode:
> + case E_V8HImode:
> + case E_V4SFmode:
> + case E_V4SImode:
> + case E_V2DFmode:
> + case E_V2DImode:
> + case E_V1TImode:
> + case E_TImode:
> +
> + case E_KFmode: /* ieee 754 128-bit floating point */
> + case E_IFmode: /* IBM extended 128-bit double */
> + case E_TFmode: /* 128-bit double (form depends on
> +   gcc command line, which may be
> +   either -mabi=ieeelongdouble (KF)
> +   or -mabi=ibmlongdouble (IF). */
> +  /* All 128-bit loads and stores are handled by lxv and stxv.  */
> +  if (IN_RANGE (byte_offset, min_16bit_signed, max_16bit_signed)
> +      && ((byte_offset & 0x0f) == 0))
> +    return true;
> +  break;
> +
> + default:
> +  ; /* fall through to see if other instructions will work.  */
> + }
> +    }
> +
> +  /* Todo: add support for any new instructions provided by future
> +     archictures when support for those future architectures is
> +     enabled.  */
> +
> +  return false;
> +}
> +
>  /* Implement TARGET_MODE_DEPENDENT_ADDRESS_P.  */
>
>  static bool
> Index: gcc/config/rs6000/t-rs6000
> ===================================================================
> --- gcc/config/rs6000/t-rs6000 (revision 275051)
> +++ gcc/config/rs6000/t-rs6000 (working copy)
> @@ -47,6 +47,10 @@ rs6000-call.o: $(srcdir)/config/rs6000/rs6000-call
>   $(COMPILE) $<
>   $(POSTCOMPILE)
>
> +rs6000-p9dform.o: $(srcdir)/config/rs6000/rs6000-p9dform.c
> + $(COMPILE) $<
> + $(POSTCOMPILE)
> +
>  $(srcdir)/config/rs6000/rs6000-tables.opt: $(srcdir)/config/rs6000/genopt.sh \
>    $(srcdir)/config/rs6000/rs6000-cpus.def
>   $(SHELL) $(srcdir)/config/rs6000/genopt.sh $(srcdir)/config/rs6000 > \
> Index: gcc/config.gcc
> ===================================================================
> --- gcc/config.gcc (revision 275051)
> +++ gcc/config.gcc (working copy)
> @@ -499,7 +499,7 @@ or1k*-*-*)
>   ;;
>  powerpc*-*-*)
>   cpu_type=rs6000
> - extra_objs="rs6000-string.o rs6000-p8swap.o rs6000-logue.o rs6000-call.o"
> + extra_objs="rs6000-string.o rs6000-p8swap.o rs6000-p9dform.o rs6000-logue.o rs6000-call.o"
>   extra_headers="ppc-asm.h altivec.h htmintrin.h htmxlintrin.h"
>   extra_headers="${extra_headers} bmi2intrin.h bmiintrin.h"
>   extra_headers="${extra_headers} xmmintrin.h mm_malloc.h emmintrin.h"
> Index: gcc/testsuite/gcc.target/powerpc/p9-dform-0.c
> ===================================================================
> --- gcc/testsuite/gcc.target/powerpc/p9-dform-0.c (nonexistent)
> +++ gcc/testsuite/gcc.target/powerpc/p9-dform-0.c (working copy)
> @@ -0,0 +1,44 @@
> +/* { dg-do compile } */
> +/* { dg-require-effective-target powerpc_p9vector_ok } */
> +/* { dg-skip-if "" { powerpc*-*-aix* } } */
> +/* { dg-options "-O3 -mdejagnu-cpu=power9 -funroll-loops" } */
> +
> +/* This test confirms that the dform instructions are selected in the
> +   translation of this main program.  */
> +
> +extern void first_dummy ();
> +extern void dummy (double sacc, int n);
> +extern void other_dummy ();
> +
> +extern float opt_value;
> +extern char *opt_desc;
> +
> +#define M 128
> +#define N 512
> +
> +double x [N];
> +double y [N];
> +
> +int main (int argc, char *argv []) {
> +  double sacc;
> +
> +  first_dummy ();
> +  for (int j = 0; j < M; j++) {
> +
> +    sacc = 0.00;
> +    for (unsigned long long int i = 0; i < N; i++) {
> +      sacc += x[i] * y[i];
> +    }
> +    dummy (sacc, N);
> +  }
> +  opt_value = ((float) N) * 2 * ((float) M);
> +  opt_desc = "flops";
> +  other_dummy ();
> +}
> +
> +/* At time the dform optimization pass was merged with trunk, 12
> +   lxv instructions were emitted in place of the same number of lxvx
> +   instructions.  No need to require exactly this number, as it may
> +   change when other optimization passes evolve.  */
> +
> +/* { dg-final { scan-assembler {\mlxv\M} } } */
> Index: gcc/testsuite/gcc.target/powerpc/p9-dform-1.c
> ===================================================================
> --- gcc/testsuite/gcc.target/powerpc/p9-dform-1.c (nonexistent)
> +++ gcc/testsuite/gcc.target/powerpc/p9-dform-1.c (working copy)
> @@ -0,0 +1,56 @@
> +/* { dg-do compile } */
> +/* { dg-require-effective-target powerpc_p9vector_ok } */
> +/* { dg-skip-if "" { powerpc*-*-aix* } } */
> +/* { dg-options "-O3 -mdejagnu-cpu=power9 -funroll-loops" } */
> +
> +/* This test confirms that the dform instructions are selected in the
> +   translation of this main program.  */
> +
> +extern void first_dummy ();
> +extern void dummy (double sacc, int n);
> +extern void other_dummy ();
> +
> +extern float opt_value;
> +extern char *opt_desc;
> +
> +#define M 128
> +#define N 512
> +
> +double x [N];
> +double y [N];
> +double z [N];
> +
> +int main (int argc, char *argv []) {
> +  double sacc;
> +
> +  first_dummy ();
> +  for (int j = 0; j < M; j++) {
> +
> +    sacc = 0.00;
> +    for (unsigned long long int i = 0; i < N; i++) {
> +      z[i] = x[i] * y[i];
> +      sacc += z[i];
> +    }
> +    dummy (sacc, N);
> +  }
> +  opt_value = ((float) N) * 2 * ((float) M);
> +  opt_desc = "flops";
> +  other_dummy ();
> +}
> +
> +
> +
> +/* At time the dform optimization pass was merged with trunk, 12
> +   lxv instructions were emitted in place of the same number of lxvx
> +   instructions.  No need to require exactly this number, as it may
> +   change when other optimization passes evolve.  */
> +
> +/* { dg-final { scan-assembler {\mlxv\M} } } */
> +
> +/* At time the dform optimization pass was merged with trunk, 6
> +   stxv instructions were emitted in place of the same number of stxvx
> +   instructions.  No need to require exactly this number, as it may
> +   change when other optimization passes evolve.  */
> +
> +/* { dg-final { scan-assembler {\mstxv\M} } } */
> +
> Index: gcc/testsuite/gcc.target/powerpc/p9-dform-10.c
> ===================================================================
> --- gcc/testsuite/gcc.target/powerpc/p9-dform-10.c (nonexistent)
> +++ gcc/testsuite/gcc.target/powerpc/p9-dform-10.c (working copy)
> @@ -0,0 +1,13 @@
> +/* { dg-do compile } */
> +/* { dg-require-effective-target powerpc_p9vector_ok } */
> +/* { dg-skip-if "" { powerpc*-*-aix* } } */
> +/* { dg-options "-O3 -mdejagnu-cpu=power9 -funroll-loops -Wall -no-pie" } */
> +
> +#define TYPE signed int
> +#include "p9-dform-generic.h"
> +
> +/* The precise number of lxv and stxv instructions may be impacted by
> +   complex interactions between optimization passes, but we expect at
> +   least one of each.  */
> +/* { dg-final { scan-assembler {\mlxv\M} } } */
> +/* { dg-final { scan-assembler {\mstxv\M} } } */
> Index: gcc/testsuite/gcc.target/powerpc/p9-dform-11.c
> ===================================================================
> --- gcc/testsuite/gcc.target/powerpc/p9-dform-11.c (nonexistent)
> +++ gcc/testsuite/gcc.target/powerpc/p9-dform-11.c (working copy)
> @@ -0,0 +1,13 @@
> +/* { dg-do compile } */
> +/* { dg-require-effective-target powerpc_p9vector_ok } */
> +/* { dg-skip-if "" { powerpc*-*-aix* } } */
> +/* { dg-options "-O3 -mdejagnu-cpu=power9 -funroll-loops -Wall -no-pie" } */
> +
> +#define TYPE unsigned long long
> +#include "p9-dform-generic.h"
> +
> +/* The precise number of lxv and stxv instructions may be impacted by
> +   complex interactions between optimization passes, but we expect at
> +   least one of each.  */
> +/* { dg-final { scan-assembler {\mld\M} } } */
> +/* { dg-final { scan-assembler {\mstd\M} } } */
> Index: gcc/testsuite/gcc.target/powerpc/p9-dform-12.c
> ===================================================================
> --- gcc/testsuite/gcc.target/powerpc/p9-dform-12.c (nonexistent)
> +++ gcc/testsuite/gcc.target/powerpc/p9-dform-12.c (working copy)
> @@ -0,0 +1,13 @@
> +/* { dg-do compile } */
> +/* { dg-require-effective-target powerpc_p9vector_ok } */
> +/* { dg-skip-if "" { powerpc*-*-aix* } } */
> +/* { dg-options "-O3 -mdejagnu-cpu=power9 -funroll-loops -Wall -no-pie" } */
> +
> +#define TYPE signed long long
> +#include "p9-dform-generic.h"
> +
> +/* The precise number of lxv and stxv instructions may be impacted by
> +   complex interactions between optimization passes, but we expect at
> +   least one of each.  */
> +/* { dg-final { scan-assembler {\mld\M} } } */
> +/* { dg-final { scan-assembler {\mstd\M} } } */
> Index: gcc/testsuite/gcc.target/powerpc/p9-dform-13.c
> ===================================================================
> --- gcc/testsuite/gcc.target/powerpc/p9-dform-13.c (nonexistent)
> +++ gcc/testsuite/gcc.target/powerpc/p9-dform-13.c (working copy)
> @@ -0,0 +1,13 @@
> +/* { dg-do compile } */
> +/* { dg-require-effective-target powerpc_p9vector_ok } */
> +/* { dg-skip-if "" { powerpc*-*-aix* } } */
> +/* { dg-options "-O3 -mdejagnu-cpu=power9 -funroll-loops -Wall -no-pie" } */
> +
> +#define TYPE unsigned __int128
> +#include "p9-dform-generic.h"
> +
> +/* The precise number of lxv and stxv instructions may be impacted by
> +   complex interactions between optimization passes, but we expect at
> +   least one of each.  */
> +/* { dg-final { scan-assembler {\mld\M} } } */
> +/* { dg-final { scan-assembler {\mstd\M} } } */
> Index: gcc/testsuite/gcc.target/powerpc/p9-dform-14.c
> ===================================================================
> --- gcc/testsuite/gcc.target/powerpc/p9-dform-14.c (nonexistent)
> +++ gcc/testsuite/gcc.target/powerpc/p9-dform-14.c (working copy)
> @@ -0,0 +1,13 @@
> +/* { dg-do compile } */
> +/* { dg-require-effective-target powerpc_p9vector_ok } */
> +/* { dg-skip-if "" { powerpc*-*-aix* } } */
> +/* { dg-options "-O3 -mdejagnu-cpu=power9 -funroll-loops -Wall -no-pie" } */
> +
> +#define TYPE signed __int128
> +#include "p9-dform-generic.h"
> +
> +/* The precise number of lxv and stxv instructions may be impacted by
> +   complex interactions between optimization passes, but we expect at
> +   least one of each.  */
> +/* { dg-final { scan-assembler {\mld\M} } } */
> +/* { dg-final { scan-assembler {\mstd\M} } } */
> Index: gcc/testsuite/gcc.target/powerpc/p9-dform-15.c
> ===================================================================
> --- gcc/testsuite/gcc.target/powerpc/p9-dform-15.c (nonexistent)
> +++ gcc/testsuite/gcc.target/powerpc/p9-dform-15.c (working copy)
> @@ -0,0 +1,13 @@
> +/* { dg-do compile } */
> +/* { dg-require-effective-target powerpc_p9vector_ok } */
> +/* { dg-skip-if "" { powerpc*-*-aix* } } */
> +/* { dg-options "-O3 -mdejagnu-cpu=power9 -funroll-loops -Wall -no-pie -mfloat128" } */
> +
> +#define TYPE __float128
> +#include "p9-dform-generic.h"
> +
> +/* The precise number of lxv and stxv instructions may be impacted by
> +   complex interactions between optimization passes, but we expect at
> +   least one of each.  */
> +/* { dg-final { scan-assembler {\mlxv\M} } } */
> +/* { dg-final { scan-assembler {\mstxv\M} } } */
> Index: gcc/testsuite/gcc.target/powerpc/p9-dform-2.c
> ===================================================================
> --- gcc/testsuite/gcc.target/powerpc/p9-dform-2.c (nonexistent)
> +++ gcc/testsuite/gcc.target/powerpc/p9-dform-2.c (working copy)
> @@ -0,0 +1,14 @@
> +/* { dg-do compile } */
> +/* { dg-require-effective-target powerpc_p9vector_ok } */
> +/* { dg-skip-if "" { powerpc*-*-aix* } } */
> +/* { dg-options "-O3 -mdejagnu-cpu=power9 -funroll-loops -Wall -no-pie" } */
> +
> +
> +#define TYPE float
> +#include "p9-dform-generic.h"
> +
> +/* The precise number of lxv and stxv instructions may be impacted by
> +   complex interactions between optimization passes, but we expect at
> +   least one of each.  */
> +/* { dg-final { scan-assembler {\mlxv\M} } } */
> +/* { dg-final { scan-assembler {\mstxv\M} } } */
> Index: gcc/testsuite/gcc.target/powerpc/p9-dform-3.c
> ===================================================================
> --- gcc/testsuite/gcc.target/powerpc/p9-dform-3.c (nonexistent)
> +++ gcc/testsuite/gcc.target/powerpc/p9-dform-3.c (working copy)
> @@ -0,0 +1,16 @@
> +/* { dg-do compile } */
> +/* { dg-require-effective-target powerpc_p9vector_ok } */
> +/* { dg-skip-if "" { powerpc*-*-aix* } } */
> +/* { dg-options "-O3 -mdejagnu-cpu=power9 -funroll-loops -Wall -no-pie" } */
> +
> +#define TYPE double
> +#include "p9-dform-generic.h"
> +
> +/* At time the dform optimization pass was merged with trunk, 6
> +   lxv instructions were emitted in place of the same number of lxvx
> +   instructions and 8 stxv instructions replace the same number of
> +   stxvx instructions.  No need to require exactly this number, as it
> +   may change when other optimization passes evolve.  */
> +
> +/* { dg-final { scan-assembler {\mlxv\M} } } */
> +/* { dg-final { scan-assembler {\mstxv\M} } } */
> Index: gcc/testsuite/gcc.target/powerpc/p9-dform-4.c
> ===================================================================
> --- gcc/testsuite/gcc.target/powerpc/p9-dform-4.c (nonexistent)
> +++ gcc/testsuite/gcc.target/powerpc/p9-dform-4.c (working copy)
> @@ -0,0 +1,13 @@
> +/* { dg-do compile { target { powerpc*-*-* } } } */
> +/* { dg-require-effective-target powerpc_p9vector_ok } */
> +/* { dg-skip-if "" { powerpc*-*-aix* } } */
> +/* { dg-options "-O3 -mdejagnu-cpu=power9 -funroll-loops -Wall -no-pie" } */
> +
> +#define TYPE long double
> +#include "p9-dform-generic.h"
> +
> +/* The precise number of lxv and stxv instructions may be impacted by
> +   complex interactions between optimization passes, but we expect at
> +   least one of each.  */
> +/* { dg-final { scan-assembler {\mlfd\M} } } */
> +/* { dg-final { scan-assembler {\mstfd\M} } } */
> Index: gcc/testsuite/gcc.target/powerpc/p9-dform-5.c
> ===================================================================
> --- gcc/testsuite/gcc.target/powerpc/p9-dform-5.c (nonexistent)
> +++ gcc/testsuite/gcc.target/powerpc/p9-dform-5.c (working copy)
> @@ -0,0 +1,13 @@
> +/* { dg-do compile } */
> +/* { dg-require-effective-target powerpc_p9vector_ok } */
> +/* { dg-skip-if "" { powerpc*-*-aix* } } */
> +/* { dg-options "-O3 -mdejagnu-cpu=power9 -funroll-loops -Wall -no-pie" } */
> +
> +#define TYPE unsigned char
> +#include "p9-dform-generic.h"
> +
> +/* The precise number of lxv and stxv instructions may be impacted by
> +   complex interactions between optimization passes, but we expect at
> +   least one of each.  */
> +/* { dg-final { scan-assembler {\mlxv\M} } } */
> +/* { dg-final { scan-assembler {\mstxv\M} } } */
> Index: gcc/testsuite/gcc.target/powerpc/p9-dform-6.c
> ===================================================================
> --- gcc/testsuite/gcc.target/powerpc/p9-dform-6.c (nonexistent)
> +++ gcc/testsuite/gcc.target/powerpc/p9-dform-6.c (working copy)
> @@ -0,0 +1,13 @@
> +/* { dg-do compile } */
> +/* { dg-require-effective-target powerpc_p9vector_ok } */
> +/* { dg-skip-if "" { powerpc*-*-aix* } } */
> +/* { dg-options "-O3 -mdejagnu-cpu=power9 -funroll-loops -Wall -no-pie" } */
> +
> +#define TYPE signed char
> +#include "p9-dform-generic.h"
> +
> +/* The precise number of lxv and stxv instructions may be impacted by
> +   complex interactions between optimization passes, but we expect at
> +   least one of each.  */
> +/* { dg-final { scan-assembler {\mlxv\M} } } */
> +/* { dg-final { scan-assembler {\mstxv\M} } } */
> Index: gcc/testsuite/gcc.target/powerpc/p9-dform-7.c
> ===================================================================
> --- gcc/testsuite/gcc.target/powerpc/p9-dform-7.c (nonexistent)
> +++ gcc/testsuite/gcc.target/powerpc/p9-dform-7.c (working copy)
> @@ -0,0 +1,13 @@
> +/* { dg-do compile } */
> +/* { dg-require-effective-target powerpc_p9vector_ok } */
> +/* { dg-skip-if "" { powerpc*-*-aix* } } */
> +/* { dg-options "-O3 -mdejagnu-cpu=power9 -funroll-loops -Wall -no-pie" } */
> +
> +#define TYPE unsigned short
> +#include "p9-dform-generic.h"
> +
> +/* The precise number of lxv and stxv instructions may be impacted by
> +   complex interactions between optimization passes, but we expect at
> +   least one of each.  */
> +/* { dg-final { scan-assembler {\mlxv\M} } } */
> +/* { dg-final { scan-assembler {\mstxv\M} } } */
> Index: gcc/testsuite/gcc.target/powerpc/p9-dform-8.c
> ===================================================================
> --- gcc/testsuite/gcc.target/powerpc/p9-dform-8.c (nonexistent)
> +++ gcc/testsuite/gcc.target/powerpc/p9-dform-8.c (working copy)
> @@ -0,0 +1,13 @@
> +/* { dg-do compile } */
> +/* { dg-require-effective-target powerpc_p9vector_ok } */
> +/* { dg-skip-if "" { powerpc*-*-aix* } } */
> +/* { dg-options "-O3 -mdejagnu-cpu=power9 -funroll-loops -Wall -no-pie" } */
> +
> +#define TYPE signed short
> +#include "p9-dform-generic.h"
> +
> +/* The precise number of lxv and stxv instructions may be impacted by
> +   complex interactions between optimization passes, but we expect at
> +   least one of each.  */
> +/* { dg-final { scan-assembler {\mlxv\M} } } */
> +/* { dg-final { scan-assembler {\mstxv\M} } } */
> Index: gcc/testsuite/gcc.target/powerpc/p9-dform-9.c
> ===================================================================
> --- gcc/testsuite/gcc.target/powerpc/p9-dform-9.c (nonexistent)
> +++ gcc/testsuite/gcc.target/powerpc/p9-dform-9.c (working copy)
> @@ -0,0 +1,13 @@
> +/* { dg-do compile } */
> +/* { dg-require-effective-target powerpc_p9vector_ok } */
> +/* { dg-skip-if "" { powerpc*-*-aix* } } */
> +/* { dg-options "-O3 -mdejagnu-cpu=power9 -funroll-loops -Wall -no-pie" } */
> +
> +#define TYPE unsigned int
> +#include "p9-dform-generic.h"
> +
> +/* The precise number of lxv and stxv instructions may be impacted by
> +   complex interactions between optimization passes, but we expect at
> +   least one of each.  */
> +/* { dg-final { scan-assembler {\mlxv\M} } } */
> +/* { dg-final { scan-assembler {\mstxv\M} } } */
> Index: gcc/testsuite/gcc.target/powerpc/p9-dform-generic.h
> ===================================================================
> --- gcc/testsuite/gcc.target/powerpc/p9-dform-generic.h (nonexistent)
> +++ gcc/testsuite/gcc.target/powerpc/p9-dform-generic.h (working copy)
> @@ -0,0 +1,34 @@
> +
> +#define ITERATIONS 1000000
> +
> +#define SIZE (16384/sizeof(TYPE))
> +
> +static TYPE x[SIZE] __attribute__ ((aligned (16)));
> +static TYPE y[SIZE] __attribute__ ((aligned (16)));
> +static TYPE a;
> +
> +void obfuscate(void *a, ...);
> +
> +static void __attribute__((noinline)) do_one(void)
> +{
> +  unsigned long i;
> +
> +  obfuscate(x, y, &a);
> +
> +  for (i = 0; i < SIZE; i++)
> +    y[i] = a * x[i];
> +
> +  obfuscate(x, y, &a);
> +
> +}
> +
> +int main(void)
> +{
> +  unsigned long i;
> +
> +  for (i = 0; i < ITERATIONS; i++)
> +    do_one();
> +
> +  return 0;
> +
> +}
>