[Bug c/81127] New: Complex division misses vectorisation opportunity



jakub at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81127

            Bug ID: 81127
           Summary: Complex division misses vectorisation opportunity
           Product: gcc
           Version: 7.1.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c
          Assignee: unassigned at gcc dot gnu.org
          Reporter: drraph at gmail dot com
  Target Milestone: ---

This report has two parts. The first is about complex float division and the
second about complex double division.

--- Part 1 ---

Consider:

#include <complex.h>
complex float f(complex float x, complex float y) {
  return x/y;
}

In gcc 7.1 with -O3 -march=core-avx2 -ffast-math you get:

f:
        vmovq   QWORD PTR [rsp-16], xmm1
        vmovss  xmm5, DWORD PTR [rsp-12]
        vmovss  xmm4, DWORD PTR [rsp-16]
        vmovq   QWORD PTR [rsp-8], xmm0
        vmovss  xmm0, DWORD PTR [rsp-4]
        vmovss  xmm3, DWORD PTR [rsp-8]
        vmulss  xmm2, xmm5, xmm5
        vmulss  xmm1, xmm0, xmm5
        vfmadd231ss     xmm2, xmm4, xmm4
        vfmadd231ss     xmm1, xmm3, xmm4
        vmulss  xmm3, xmm3, xmm5
        vdivss  xmm1, xmm1, xmm2
        vfmsub132ss     xmm0, xmm3, xmm4
        vdivss  xmm0, xmm0, xmm2
        vmovss  DWORD PTR [rsp-24], xmm1
        vmovss  DWORD PTR [rsp-20], xmm0
        vmovq   xmm0, QWORD PTR [rsp-24]
        ret

Note the three vmulss instructions and the two vdivss instructions.
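For reference, the scalar sequence above corresponds to the textbook formula (a+bi)/(c+di) = ((ac+bd) + (bc-ad)i) / (c*c+d*d). The following is a minimal sketch of that lowering in C; the function name cdiv_simple is hypothetical and this is not GCC's internal code, only an illustration of the arithmetic the assembly performs.

```c
#include <complex.h>

/* Hypothetical helper (not GCC source): textbook complex division
   without any overflow/underflow protection, as used under -ffast-math. */
float complex cdiv_simple(float complex x, float complex y)
{
    float a = crealf(x), b = cimagf(x);
    float c = crealf(y), d = cimagf(y);

    float den = c * c + d * d;          /* shared denominator */
    float re  = (a * c + b * d) / den;  /* first scalar division */
    float im  = (b * c - a * d) / den;  /* second scalar division */

    return CMPLXF(re, im);
}
```

Each of the three expressions needs one multiply plus one fused multiply-add (or multiply-subtract), and the real and imaginary parts are divided separately, which matches the instruction counts noted above.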

ICC on the other hand gives:

f:
        vcvtps2pd xmm2, xmm1                                    #3.12
        vcvtps2pd xmm4, xmm0                                    #3.12
        vmulpd    xmm8, xmm2, xmm2                              #3.12
        vunpckhpd xmm3, xmm2, xmm2                              #3.12
        vmulpd    xmm6, xmm3, xmm4                              #3.12
        vmovddup  xmm7, xmm2                                    #3.12
        vshufpd   xmm5, xmm4, xmm4, 1                           #3.12
        vshufpd   xmm9, xmm8, xmm8, 1                           #3.12
        vfmaddsub213pd xmm7, xmm5, xmm6                         #3.12
        vaddpd    xmm11, xmm8, xmm9                             #3.12
        vshufpd   xmm10, xmm7, xmm7, 1                          #3.12
        vdivpd    xmm12, xmm10, xmm11                           #3.12
        vcvtpd2ps xmm0, xmm12                                   #3.12
        ret  

Note the two vmulpd instructions and the single vdivpd.

Just for interest, if you relax the floating-point model further (using
-fp-model fast=2), ICC also offers this alternative:

f:
        vmovlhps  xmm2, xmm1, xmm1                              #3.12
        vmulps    xmm8, xmm2, xmm2                              #3.12
        vshufps   xmm9, xmm8, xmm8, 177                         #3.12
        vmovlhps  xmm4, xmm0, xmm0                              #3.12
        vaddps    xmm10, xmm8, xmm9                             #3.12
        vrcpps    xmm11, xmm10                                  #3.12
        vmovshdup xmm3, xmm2                                    #3.12
        vaddps    xmm12, xmm11, xmm11                           #3.12
        vmulps    xmm6, xmm4, xmm3                              #3.12
        vmulps    xmm14, xmm11, xmm10                           #3.12
        vmovsldup xmm7, xmm2                                    #3.12
        vshufps   xmm5, xmm4, xmm4, 177                         #3.12
        vfmaddsub213ps xmm7, xmm5, xmm6                         #3.12
        vfnmadd213ps xmm14, xmm11, xmm12                        #3.12
        vshufps   xmm13, xmm7, xmm7, 177                        #3.12
        vmulps    xmm0, xmm13, xmm14                            #3.12
        ret  

Note one vrcpps, four vmulps, and no division instruction (vdivps) at all.
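The vaddps/vmulps/vfnmadd213ps group after vrcpps looks like a single Newton-Raphson step that refines the approximate reciprocal: r1 = 2*r0 - (r0*d)*r0, which equals r0*(2 - d*r0). A minimal scalar sketch of that step follows; refine_recip is a hypothetical name, and the initial estimate r0 is passed in by the caller rather than obtained from the hardware instruction.

```c
/* Hypothetical helper: one Newton-Raphson step for 1/d.
   r0 is an approximate reciprocal of d (in the ICC listing it would
   come from vrcpps); the result has roughly the square of r0's
   relative error. */
float refine_recip(float d, float r0)
{
    /* Same operation order as the listing: 2*r0 - (r0*d)*r0 */
    return (r0 + r0) - (r0 * d) * r0;
}
```

For example, starting from r0 = 0.333f for d = 3.0f (relative error around 1e-3), one step gives a value that agrees with 1/3 to about six decimal digits. The final quotient is then the numerator multiplied by this refined reciprocal, which is why no division instruction is needed.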

--- Part 2 ---

Consider:

#include <complex.h>
complex double f(complex double x, complex double y) {
  return x/y;
}

In gcc 7.1 with -O3 -march=core-avx2 -ffast-math you get:

f:
        vmulsd  xmm4, xmm1, xmm3
        vmovapd xmm6, xmm0
        vmulsd  xmm5, xmm3, xmm3
        vmulsd  xmm6, xmm6, xmm3
        vfmadd231sd     xmm4, xmm0, xmm2
        vfmadd231sd     xmm5, xmm2, xmm2
        vfmsub132sd     xmm1, xmm6, xmm2
        vdivsd  xmm0, xmm4, xmm5
        vdivsd  xmm1, xmm1, xmm5
        ret

With ICC and -fp-model fast=2 you get:

f:
        vunpcklpd xmm4, xmm2, xmm3                              #2.54
        vunpcklpd xmm6, xmm0, xmm1                              #2.54
        vunpckhpd xmm5, xmm4, xmm4                              #3.12
        vmulpd    xmm10, xmm4, xmm4                             #3.12
        vmulpd    xmm8, xmm5, xmm6                              #3.12
        vmovddup  xmm9, xmm4                                    #3.12
        vshufpd   xmm7, xmm6, xmm6, 1                           #3.12
        vshufpd   xmm11, xmm10, xmm10, 1                        #3.12
        vfmaddsub213pd xmm9, xmm7, xmm8                         #3.12
        vaddpd    xmm13, xmm10, xmm11                           #3.12
        vshufpd   xmm12, xmm9, xmm9, 1                          #3.12
        vdivpd    xmm0, xmm12, xmm13                            #3.12
        vunpckhpd xmm1, xmm0, xmm0                              #3.12
        ret  

This reduces the number of multiplications to two and the number of divisions
to one again.  
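To illustrate the packed formulation, here is a rough sketch using the GCC/Clang generic vector extension. The type v2df and the function cdiv_packed are hypothetical names chosen for this example; it is not what ICC or GCC emit, only one way to arrange the arithmetic so that a single vector division produces both the real and the imaginary part.

```c
/* Two doubles in one 128-bit vector (GCC/Clang extension). */
typedef double v2df __attribute__((vector_size(16)));

/* Hypothetical sketch: x = {a, b}, y = {c, d}; returns {re, im}
   of (a+bi)/(c+di) using the simple, unprotected formula. */
v2df cdiv_packed(v2df x, v2df y)
{
    v2df y_re  = { y[0], y[0] };          /* {c, c}   */
    v2df y_im  = { y[1], y[1] };          /* {d, d}   */
    v2df x_sw  = { x[1], -x[0] };         /* {b, -a}  */

    v2df sq    = y * y;                   /* {c*c, d*d} */
    v2df sq_sw = { sq[1], sq[0] };        /* {d*d, c*c} */
    v2df den   = sq + sq_sw;              /* {c*c+d*d, c*c+d*d} */

    v2df num   = x * y_re + x_sw * y_im;  /* {a*c+b*d, b*c-a*d} */

    return num / den;                     /* one packed division */
}
```

Whether a compiler turns this into a sequence as short as the ICC one depends on how well it handles the lane shuffles and on FMA contraction, so this would need to be checked on the target of interest.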

It would be great to have benchmarks for all of this, but I don't have a copy
of ICC to test.

[Bug tree-optimization/81127] Complex division misses BB vectorisation opportunity

jakub at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81127

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2017-06-19
                 CC|                            |rguenth at gcc dot gnu.org
          Component|c                           |tree-optimization
             Blocks|                            |53947
            Summary|Complex division misses     |Complex division misses BB
                   |vectorisation opportunity   |vectorisation opportunity
     Ever confirmed|0                           |1

--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
Confirmed.

Note we lower complex division with -ffast-math early.  Note ICC seems to use
the trick of widening the FP ops (float->double, double->long double) to apply
the simple lowering even with the standard-conforming complex evaluation method.
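As a rough illustration of the widening trick for complex float: if the intermediate arithmetic is done in double, c*c + d*d cannot overflow or underflow for any finite float inputs, so the simple formula can be used without the scaling steps a conforming float-only implementation needs. The sketch below assumes this reading of the trick; cdivf_widened is a hypothetical name, not an ICC or libgcc function, and NaN/infinity handling is left out.

```c
#include <complex.h>

/* Hypothetical helper: complex float division evaluated in double.
   The largest finite float squared is about 1.2e77 and the smallest
   subnormal squared is about 2e-90, both well inside double's range. */
float complex cdivf_widened(float complex x, float complex y)
{
    double a = crealf(x), b = cimagf(x);
    double c = crealf(y), d = cimagf(y);

    double den = c * c + d * d;

    return CMPLXF((float)((a * c + b * d) / den),
                  (float)((b * c - a * d) / den));
}
```

For example, dividing 1e30+1e30i by itself yields 1+0i here, whereas the same formula evaluated purely in float would overflow when squaring 1e30.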

Note that more target control of the lowering process, eventually lowering to
vector GIMPLE, would run into the loop vectorizer not handling vector code in
case this happens inside a loop.

Note that the libgcc implementation could benefit from the above widening trick
and vectorization as well (just use generic vectors?).

Note for complex float the ABI and the middle-end arg passing code result in
the awkward stack pushes (generating CONCAT and initializing that from piecewise
DImode via assign_parm_remove_parallels, gen_reg_rtx creating the CONCAT and
emit_group_store initializing that by pushing it to the stack explicitly).

There might be several related bugs already.


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
[Bug 53947] [meta-bug] vectorizer missed-optimizations