[Bug c/88271] New: Omit test instruction after add

classic Classic list List threaded Threaded
11 messages Options
Reply | Threaded
Open this post in threaded view
|

[Bug c/88271] New: Omit test instruction after add

mpolacek at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88271

            Bug ID: 88271
           Summary: Omit test instruction after add
           Product: gcc
           Version: 8.2.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c
          Assignee: unassigned at gcc dot gnu.org
          Reporter: [hidden email]
  Target Milestone: ---

[code]
int data[8];

void test(int k)
{
    int level = 0;
    int val = 1;
    while (1)
    {
        if (val)
        {
            val = data[level] << 1;
            ++level;
        }
        else
        {
            --level;
            val = data[level];
        }
    }
}
[/code]

This code compiled using gcc 8.2 with options -O3 -march=skylake-avx512
produces this:

[asm]
test(int):
        mov     edx, 1
        xor     eax, eax
.L2:
        test    edx, edx
        je      .L3
.L6:
        movsx   rdx, eax
        mov     edx, DWORD PTR data[0+rdx*4]
        inc     eax
        add     edx, edx
        test    edx, edx
        jne     .L6
.L3:
        dec     eax
        movsx   rdx, eax
        mov     edx, DWORD PTR data[0+rdx*4]
        jmp     .L2
data:
        .zero   32
[/asm]

I checked that add instruction updates CPU flags, so test instruction before
"jne .L6" could be omitted.
Reply | Threaded
Open this post in threaded view
|

[Bug target/88271] Omit test instruction after add

mpolacek at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88271

--- Comment #1 from Daniel Fruzynski <[hidden email]> ---
I checked that in simple case when bit shift is used in "if", it is optimized:

[code]
void f();
void g();

void test(int n)
{
    if (n << 1)
        f();
    else
        g();
}
[/code]

[asm]
test(int):
        add     edi, edi
        je      .L2
        jmp     f()
.L2:
        jmp     g()
[/asm]
Reply | Threaded
Open this post in threaded view
|

[Bug target/88271] Omit test instruction after add

mpolacek at gcc dot gnu.org
In reply to this post by mpolacek at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88271

Jakub Jelinek <jakub at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jakub at gcc dot gnu.org

--- Comment #2 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
GCC can handle these in the combiner, but for that needs the two instructions
(add, etc. or in this case left shift by 1) be in the same basic block as the
comparison.  That is not the case here, during combine the shift by 1 is in one
bb, which jumps to another bb in which is the comparison, but that other bb is
preceded also by the bb with just val = data[level];  We only duplicate the
comparison late and we really can't use the combiner late after RA.
So, I'm afraid it is too hard to do anything here and the benefit would be
small.
Reply | Threaded
Open this post in threaded view
|

[Bug target/88271] Omit test instruction after add

mpolacek at gcc dot gnu.org
In reply to this post by mpolacek at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88271

--- Comment #3 from Daniel Fruzynski <[hidden email]> ---
What about adding new pass at the end? It would look for various possible
optimizations, which were missed earlier because they are cross-basic block.

In my case this example code is part of tight loop. From previous experiences
with it I expect that this optimization could improve speed by something like
0.5%-1%. If you want to look on real code, is it at link below. CPU spends
about 60% of time in this one function. This app runs on BOINC platform, so
such microoptimization would be worthwhile there.

https://github.com/sirzooro/RakeSearch/blob/optimizations2/RakeDiagSearch/RakeDiagSearch/MovePairSearch.cpp#L583
Reply | Threaded
Open this post in threaded view
|

[Bug target/88271] Omit test instruction after add

mpolacek at gcc dot gnu.org
In reply to this post by mpolacek at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88271

--- Comment #4 from UroŇ° Bizjak <ubizjak at gmail dot com> ---
(In reply to Daniel Fruzynski from comment #3)
> What about adding new pass at the end? It would look for various possible
> optimizations, which were missed earlier because they are cross-basic block.

We do have post-reload compare elimination that works cross-BB. However, it
runs before BB-reordering pass (which duplicates the compare), so this is what
compare elimination pass sees:

    1: NOTE_INSN_DELETED
    6: NOTE_INSN_BASIC_BLOCK 2
    3: NOTE_INSN_FUNCTION_BEG
    4: dx:SI=0x1
      REG_EQUAL 0x1
    5: ax:SI=0
      REG_EQUAL 0
   11: L11:
   12: NOTE_INSN_BASIC_BLOCK 3
   13: flags:CCZ=cmp(dx:SI,0)         <--- here is the compare
   14: pc={(flags:CCZ==0)?L24:pc}
      REG_BR_PROB 536870916
   15: NOTE_INSN_BASIC_BLOCK 4
   17: dx:DI=sign_extend(ax:SI)
   18: dx:SI=[dx:DI*0x4+`data']
   19: {dx:SI=dx:SI<<0x1;clobber flags:CC;}  <--- here is the "add"
   20: {ax:SI=ax:SI+0x1;clobber flags:CC;}
   35: pc=L11
   36: barrier
   24: L24:
   25: NOTE_INSN_BASIC_BLOCK 5
   26: {ax:SI=ax:SI-0x1;clobber flags:CC;}
   28: dx:DI=sign_extend(ax:SI)
   29: dx:SI=[dx:DI*0x4+`data']
   37: pc=L11
   38: barrier
   39: NOTE_INSN_DELETED

The pass can't do anything in this case, several edges are going into BB3.

FYI, BB-reorder pass creates:

    6: NOTE_INSN_BASIC_BLOCK 2
   41: NOTE_INSN_PROLOGUE_END
    3: NOTE_INSN_FUNCTION_BEG
    4: dx:SI=0x1
      REG_EQUAL 0x1
   46: {ax:DI=0;clobber flags:CC;}

   11: L11:
   12: NOTE_INSN_BASIC_BLOCK 3
   13: flags:CCZ=cmp(dx:SI,0)
      REG_DEAD dx:SI
   14: pc={(flags:CCZ==0)?L24:pc}
      REG_DEAD flags:CCZ
      REG_BR_PROB 536870916

   15: NOTE_INSN_BASIC_BLOCK 4
   17: dx:DI=sign_extend(ax:SI)
   18: dx:SI=[dx:DI*0x4+`data']
   19: {dx:SI=dx:SI<<0x1;clobber flags:CC;}
      REG_UNUSED flags:CC
   20: {ax:SI=ax:SI+0x1;clobber flags:CC;}
      REG_UNUSED flags:CC

   50: NOTE_INSN_BASIC_BLOCK 5
   48: flags:CCZ=cmp(dx:SI,0)
      REG_DEAD dx:SI
   49: pc={(flags:CCZ==0)?L24:pc}
      REG_DEAD flags:CCZ
      REG_BR_PROB 536870916

and then scheduler reorders insn stream to:

   ...
   15: NOTE_INSN_BASIC_BLOCK 4
   17: dx:DI=sign_extend(ax:SI)
   20: {ax:SI=ax:SI+0x1;clobber flags:CC;}
      REG_UNUSED flags:CC
   18: dx:SI=[dx:DI*0x4+`data']
   19: {dx:SI=dx:SI<<0x1;clobber flags:CC;}
      REG_UNUSED flags:CC
   48: flags:CCZ=cmp(dx:SI,0)
      REG_DEAD dx:SI
   49: pc={(flags:CCZ!=0)?L51:pc}
      REG_DEAD flags:CCZ
      REG_BR_PROB 536870916
   ...

So, this is indeed an exceptional situation, it is pure coincidence that the
optimization possibility is created. We *can* rerun compare elimination pass
after the scheduler, but I suspect the above case will be the one and only it
will catch.

I'm curious if removing this extra instruction really improves the runtime
above the noise level. Can you profile the loop w/ and w/o compare insn using
perf?
Reply | Threaded
Open this post in threaded view
|

[Bug target/88271] Omit test instruction after add

mpolacek at gcc dot gnu.org
In reply to this post by mpolacek at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88271

--- Comment #5 from Daniel Fruzynski <[hidden email]> ---
How to use perf? I did not have change to use it yet, I usually use time
command or callgrind.

I have run my app compiled with AVX2 instructions on Xeon E5-2683 v3, CentOS
7.6, on idle CPU. I run it 3 times for both versions (w/ and w/o test
instructions). Here are results:

With test: Average 246,3 ms, StdDev 0,198
W/o test: Average 244,013ms, StdDev 0,043

Version with test is 0,94% slower - this is result which I expected.
Reply | Threaded
Open this post in threaded view
|

[Bug target/88271] Omit test instruction after add

mpolacek at gcc dot gnu.org
In reply to this post by mpolacek at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88271

--- Comment #6 from Daniel Fruzynski <[hidden email]> ---
Average for version with test is 246.313ms, I deleted too many digits.
Reply | Threaded
Open this post in threaded view
|

[Bug target/88271] Omit test instruction after add

mpolacek at gcc dot gnu.org
In reply to this post by mpolacek at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88271

--- Comment #7 from Daniel Fruzynski <[hidden email]> ---
One more note: this particular function creates matrices with all possible
permutations of row order of original matrix, which satisfies some additional
criteria. So this optimization may be applicable to other algorithms which
generates permutations.
Reply | Threaded
Open this post in threaded view
|

[Bug target/88271] Omit test instruction after add

mpolacek at gcc dot gnu.org
In reply to this post by mpolacek at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88271

--- Comment #8 from Daniel Fruzynski <[hidden email]> ---
I have results from Callgrind. Cycle estimation for MoveRows function (without
children) is 58.29%. This is for app without test instruction. So in synthetic
benchmark for this function only speed change would be about 2%.
Reply | Threaded
Open this post in threaded view
|

[Bug target/88271] Omit test instruction after add

mpolacek at gcc dot gnu.org
In reply to this post by mpolacek at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88271

--- Comment #9 from Daniel Fruzynski <[hidden email]> ---
I have idea about alternate approach to this. gcc could try to look for
relations between loop control statement, and other statements which modify
variables used in that control statement. With such knowledge it could try to
reorganize code to better optimize it. This approach would eliminate randomness
here.
Reply | Threaded
Open this post in threaded view
|

[Bug target/88271] Omit test instruction after add

mpolacek at gcc dot gnu.org
In reply to this post by mpolacek at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88271

--- Comment #10 from Daniel Fruzynski <[hidden email]> ---
Here is possible code transformation to equivalent form, where this
optimization can be simply applied. This change also has a bit surprising side
effect, second nested while loop is unrolled.

[code]
void test2()
{
    int level = 0;
    int val = 1;
    while (1)
    {
        while(1)
        {
            val = data[level] << 1;
            ++level;
            if (val)
                continue;
            else
                break;
        }

        while(1)
        {
            --level;
            val = data[level];
            if (!val)
                continue;
            else
                break;

        }
    }
}
[/code]