[Bug tree-optimization/91776] New: `-fsplit-paths` generates slower code on arm

marxin at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91776

            Bug ID: 91776
           Summary: `-fsplit-paths` generates slower code on arm
           Product: gcc
           Version: 8.3.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: yhr-_-yhr at qq dot com
  Target Milestone: ---

I'm doing this test on a Raspberry Pi Model 3B+. The CPU is BCM2835 ARMv7.
I wrote a silly program that calculates the cycle length of the Fibonacci
sequence modulo n.

version: gcc (Raspbian 8.3.0-6+rpi1) 8.3.0

#include <stdio.h>
#include <time.h>
typedef unsigned int uint;
typedef unsigned long long ullong;
int main(){
        uint m;
        ullong cyc=0,lastcyc=0;
        clock_t lastclock=0;
        for(m=2;;m++){
                uint
                        a=0,
                        b=1,
                        n=0;
                do{
                        b+=a;
                        a=b-a;
                        n++;
                        if(b>=m)
                                b-=m;
                }while(
                        a!=0||
                        b!=1
                );
                cyc+=n;
                //if(n>=4*m)
                //      printf("%u: %u %.2f\n",m,n,(double)n/m);
                if(cyc-lastcyc>100000000){
                        clock_t now=clock();
                        printf("~ %.0f loop/s\n",(double)(cyc-lastcyc)/(now-lastclock)*CLOCKS_PER_SEC);
                        lastclock=now;
                        lastcyc=cyc;
                }
        }
}
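
For checking results independently of the timing harness, the inner loop above can be wrapped in a function. This is only a sketch: the name `pisano_period` and the bounds in the `assert` are assumptions added here, not part of the report. The loop body itself is unchanged.

```c
#include <assert.h>

/* Cycle length of the Fibonacci sequence modulo m (the Pisano period).
   Same inner loop as in the report, wrapped in a hypothetical helper. */
unsigned pisano_period(unsigned m)
{
    /* m == 1 never returns to the state (0, 1), and for m above 2^31
       the sum a + b could wrap around, so both cases are rejected. */
    assert(m >= 2 && m <= 0x80000000u);

    unsigned a = 0, b = 1, n = 0;
    do {
        b += a;        /* next term, not yet reduced */
        a = b - a;     /* previous value of b */
        n++;
        if (b >= m)    /* one conditional subtract replaces the modulo */
            b -= m;
    } while (a != 0 || b != 1);
    return n;
}
```

For example, `pisano_period(10)` returns 60, the well-known period of the last decimal digit of the Fibonacci numbers.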

(1)
pi@rpi:~/Desktop $ gcc -Wall -march=native -mtune=native -o fibmod -O2 fibmod.c
pi@rpi:~/Desktop $ ./fibmod
~ 240755135 loop/s
~ 277965738 loop/s
~ 276675919 loop/s
~ 277244469 loop/s
~ 277207289 loop/s
~ 277303633 loop/s
^C

(2)
pi@rpi:~/Desktop $ gcc -Wall -march=native -mtune=native -o fibmod -O2 -fsplit-paths fibmod.c
pi@rpi:~/Desktop $ ./fibmod
~ 137691044 loop/s
~ 144593838 loop/s
~ 144397428 loop/s
~ 144519131 loop/s
~ 144392500 loop/s
^C

Also tested with `-Ofast -fno-split-paths`; the measured speed is almost the
same as (1).

On other hardware with the x86_64 architecture, this option doesn't seem to make
an observable difference in running time.

btw, clang without `-march=native -mtune=native` also produces the same speed
as (1), but with these two options, the speed is even higher.

(3)
pi@rpi:~/Desktop $ clang -Wall -march=native -mtune=native -o fibmodclang -Ofast fibmod.c
pi@rpi:~/Desktop $ ./fibmodclang
~ 291343047 loop/s
~ 347350967 loop/s
~ 349217005 loop/s
~ 349320149 loop/s
~ 349367926 loop/s
~ 349372536 loop/s
^C

[Bug tree-optimization/91776] `-fsplit-paths` generates slower code on arm

marxin at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91776

Wilco <wilco at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |wilco at gcc dot gnu.org

--- Comment #1 from Wilco <wilco at gcc dot gnu.org> ---
(In reply to yhr-_-yhr from comment #0)
> I'm doing this test on a Raspberry Pi Model 3B+. The CPU is BCM2835 ARMv7.

I think it's BCM2837, i.e. Cortex-A53. Or did you mean a different Pi?

> pi@rpi:~/Desktop $ gcc -Wall -march=native -mtune=native -o fibmod -O2 fibmod.c
> pi@rpi:~/Desktop $ ./fibmod
> ~ 240755135 loop/s
> ~ 277965738 loop/s
> ~ 276675919 loop/s
> ~ 277244469 loop/s
> ~ 277207289 loop/s
> ~ 277303633 loop/s
> ^C
>
> (2)
> pi@rpi:~/Desktop $ gcc -Wall -march=native -mtune=native -o fibmod -O2 -fsplit-paths fibmod.c
> pi@rpi:~/Desktop $ ./fibmod
> ~ 137691044 loop/s
> ~ 144593838 loop/s
> ~ 144397428 loop/s
> ~ 144519131 loop/s
> ~ 144392500 loop/s
> ^C

Can you list the assembly code for both inner loops please? This doesn't seem
like -fsplit-paths, but more likely related to -mstrict-it in Armv8. I can
reproduce a 2x slowdown with this loop if the subtract is not conditionally
executed. This happens if the register allocator uses a high register:

fast case:
        cmp     r4, r3
        it      ls
        subls   r3, r3, r4

slow case:
        cmp     r10, r3
        bhi     .L2
        sub     r3, r3, r10
.L2:

Can you try using -mno-strict-it on your examples and see whether that helps?
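
At the source level, the two shapes above correspond roughly to the following pair of functions. This is a sketch only: the names `reduce_branch` and `reduce_mask` are made up for illustration, and nothing here guarantees which machine code GCC emits for either one. A compiler is free to turn both into the same conditional subtract, or both into a branch.

```c
/* Reduce s into [0, m), assuming 0 <= s < 2*m, written with an explicit
   branch. This is the form used in the test program. */
unsigned reduce_branch(unsigned s, unsigned m)
{
    if (s >= m)
        s -= m;
    return s;
}

/* The same reduction without a branch in the source: mask is all ones
   when s >= m and zero otherwise, so m is subtracted only when needed. */
unsigned reduce_mask(unsigned s, unsigned m)
{
    unsigned mask = 0u - (unsigned)(s >= m);
    return s - (m & mask);
}
```

Both functions return the same value for every input, so either can stand in for the `if(b>=m) b-=m;` step when comparing the generated assembly.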

[Bug tree-optimization/91776] `-fsplit-paths` generates slower code on arm

marxin at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91776

--- Comment #2 from yhr-_-yhr at qq dot com ---
> I think it's BCM2837, i.e. Cortex-A53. Or did you mean a different Pi?
Oops, you're right; a friend pointed this out when I showed them this post.
I had just copied it from `cat /proc/cpuinfo`.

> Can you try using -mno-strict-it on your examples and see whether that helps?
Did you mean -mno-restrict-it? I followed gcc's correction info.

(4)
pi@rpi:~/Desktop $ gcc -v -save-temps -Wall -march=native -mtune=native -mno-restrict-it -o fibmod -O2 -fsplit-paths fibmod.c
[...]
pi@rpi:~/Desktop $ ./fibmod
~ 129358055 loop/s
~ 144338387 loop/s
~ 143361058 loop/s
~ 143191701 loop/s
~ 143414626 loop/s
~ 143312006 loop/s
^C
[fibmod.S]
.L7:
        mov     r1, #0
        mov     r2, #1
        mov     r0, r1
        b       .L5
.L13:
        sub     r3, r3, r10
        cmp     r2, #0
        cmpeq   r3, #1
        beq     .L4
.L3:
        mov     r0, r2
        mov     r2, r3
.L5:
        add     r3, r0, r2
        add     r1, r1, #1
        cmp     r10, r3
        bls     .L13
        cmp     r3, #1
        cmpeq   r2, #0
        bne     .L3
.L4:
        adds    r4, r4, r1
        adc     r5, r5, #0
        subs    r6, r4, ip
        sbc     r7, r5, lr
        cmp     r7, r9
        cmpeq   r6, r8
        bls     .L6
        bl      clock
        mov     r1, r7
        str     r0, [sp]
        mov     r0, r6
        bl      __aeabi_ul2d
        ldr     r3, [sp]
        vmov    d6, r0, r1
        ldr     r0, [sp, #4]
        sub     r2, r3, fp
        vmov    s14, r2 @ int
        mov     fp, r3
        vcvt.f64.s32    d7, s14
        vdiv.f64        d6, d6, d7
        vmul.f64        d7, d6, d8
        vmov    r2, r3, d7
        bl      printf
        mov     ip, r4
        mov     lr, r5
.L6:
        add     r10, r10, #1
        b       .L7

(5)
pi@rpi:~/Desktop $ gcc -v -save-temps -Wall -march=native -mtune=native -mno-restrict-it -o fibmod -O2 fibmod.c
[...]
pi@rpi:~/Desktop $ ./fibmod
~ 277312518 loop/s
~ 279153709 loop/s
~ 278075227 loop/s
~ 277919398 loop/s
~ 277167351 loop/s
~ 278028104 loop/s
~ 278017452 loop/s
^C
[fibmod.S]
.L5:
        mov     r1, #0
        mov     r2, #1
        mov     r0, r1
.L3:
        add     r3, r0, r2
        add     r1, r1, #1
        cmp     r10, r3
        mov     r0, r2
        subls   r3, r3, r10
        cmp     r3, #1
        cmpeq   r2, #0
        mov     r2, r3
        bne     .L3
        adds    r4, r4, r1
        adc     r5, r5, #0
        subs    r6, r4, ip
        sbc     r7, r5, lr
        cmp     r7, r9
        cmpeq   r6, r8
        bls     .L4
        bl      clock
        mov     r1, r7
        str     r0, [sp]
        mov     r0, r6
        bl      __aeabi_ul2d
        ldr     r3, [sp]
        vmov    d6, r0, r1
        ldr     r0, [sp, #4]
        sub     r2, r3, fp
        vmov    s14, r2 @ int
        mov     fp, r3
        vcvt.f64.s32    d7, s14
        vdiv.f64        d6, d6, d7
        vmul.f64        d7, d6, d8
        vmov    r2, r3, d7
        bl      printf
        mov     ip, r4
        mov     lr, r5
.L4:
        add     r10, r10, #1
        b       .L5

I also checked the two fibmod.S files generated without `-mno-restrict-it`, but
there seems to be no difference.

Oh, but I found another change that actually makes a small (~7%) difference:
dropping `-march=native -mtune=native`.

(6)
pi@rpi:~/Desktop $ gcc -v -save-temps -Wall -mno-restrict-it -o fibmod -O2 -fsplit-paths fibmod.c
[...]
pi@rpi:~/Desktop $ ./fibmod
~ 140006573 loop/s
~ 153067683 loop/s
~ 153172437 loop/s
~ 152992126 loop/s
~ 153133548 loop/s
^C
[fibmod.S]
.L7:
        mov     r1, #0
        mov     r0, r1          @ here
        mov     r2, #1          @ here
        b       .L5
.L13:
        sub     r3, r3, r10
        cmp     r2, #0
        cmpeq   r3, #1
        beq     .L4
.L3:
        mov     r0, r2
        mov     r2, r3
.L5:
        add     r3, r0, r2
        cmp     r10, r3         @ here
        add     r1, r1, #1      @ here
        bls     .L13
        cmp     r3, #1
        cmpeq   r2, #0
        bne     .L3
.L4:
        adds    r4, r4, r1
        adc     r5, r5, #0
        subs    r6, r4, ip
        sbc     r7, r5, lr
        cmp     r7, r9
        cmpeq   r6, r8
        bls     .L6
        bl      clock
        mov     r1, r7
        str     r0, [sp, #4]
        mov     r0, r6
        bl      __aeabi_ul2d
        ldr     r3, [sp, #4]
        sub     r2, r3, fp
        mov     fp, r3
        vmov    s14, r2 @ int
        vcvt.f64.s32    d7, s14
        vmov    d6, r0, r1
        ldr     r0, .L14+16
        vdiv.f64        d6, d6, d7
        vmul.f64        d7, d6, d8
        vmov    r2, r3, d7
        bl      printf
        mov     ip, r4
        mov     lr, r5
.L6:
        add     r10, r10, #1
        b       .L7

With neither `-fsplit-paths` nor `-march=native -mtune=native`, the speed is
identical to (5).

[Bug tree-optimization/91776] `-fsplit-paths` generates slower code on arm

marxin at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91776

--- Comment #3 from Richard Earnshaw <rearnsha at gcc dot gnu.org> ---
(In reply to Wilco from comment #1)
> (In reply to yhr-_-yhr from comment #0)
> > I'm doing this test on a Raspberry Pi Model 3B+. The CPU is BCM2835 ARMv7.
>
> I think it's BCM2837, i.e. Cortex-A53. Or did you mean a different Pi?

BCM2835 is the Linux driver name for the BCM2[78]xx series.  You get the
same on a Pi4 as well, even though it uses a BCM2711.

[Bug tree-optimization/91776] `-fsplit-paths` generates slower code on arm

marxin at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91776

Wilco <wilco at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2019-09-18
     Ever confirmed|0                           |1

--- Comment #4 from Wilco <wilco at gcc dot gnu.org> ---
(In reply to yhr-_-yhr from comment #2)
> > I think it's BCM2837, i.e. Cortex-A53. Or did you mean a different Pi?
> Oops, you're right; a friend pointed this out when I showed them this post.
> I had just copied it from `cat /proc/cpuinfo`.
>
> > Can you try using -mno-strict-it on your examples and see whether that helps?
> Did you mean -mno-restrict-it? I followed gcc's correction info.

Yes - but it looks like your compiler defaults to Arm (which is strange), so it
has no effect.

With GCC8 I can reproduce this for Arm, but not on newer compilers. On Thumb-2
it is still an issue due to -mrestrict-it (comment 1). Basically it shows how
important conditional execution is for performance even on modern CPUs.
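
The difference between listings (4) and (5) in comment 2 can be pictured at the source level. In (5) the loop has a single exit test preceded by a conditional subtract (`subls`); in (4) the exit test appears twice, once on the path that subtracts (`.L13`) and once on the path that does not. The sketch below imitates those two shapes by hand. The function names are hypothetical, and this is not output of the pass itself, only an illustration of the control flow it produced here.

```c
/* Loop as written in the report: one conditional subtract, one exit test. */
unsigned period_joined(unsigned m)
{
    unsigned a = 0, b = 1, n = 0;
    do {
        b += a;
        a = b - a;
        n++;
        if (b >= m)
            b -= m;
    } while (a != 0 || b != 1);
    return n;
}

/* Hand-written imitation of the split shape in listing (4): the exit
   test is duplicated into both arms, so the subtract sits on its own
   branch instead of being a single predicated instruction. */
unsigned period_split(unsigned m)
{
    unsigned a = 0, b = 1, n = 0;
    for (;;) {
        b += a;
        a = b - a;
        n++;
        if (b >= m) {
            b -= m;
            if (a == 0 && b == 1)
                break;
        } else {
            if (a == 0 && b == 1)
                break;
        }
    }
    return n;
}
```

Both functions compute the same value for any modulus of at least 2 (and small enough that `a + b` does not wrap); only the branch structure differs, which is what the measurements in this report are sensitive to.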