Manual unrolling is fast; funroll-loops doesn't unroll, produces >3x assembly and runs >10x slower

classic Classic list List threaded Threaded
3 messages Options
Reply | Threaded
Open this post in threaded view
|

Manual unrolling is fast; funroll-loops doesn't unroll, produces >3x assembly and runs >10x slower

Chris Elrod
Here is code:
https://github.com/chriselrod/JuliaToFortran.jl/blob/master/fortran/kernels.f90
for a 16x32 * 32x14 matrix multiplication kernel (meant for for avx-512
processors)

Compiling with:

gfortran -Ofast -march=skylake-avx512 -mprefer-vector-width=512
-funroll-loops -S -shared -fPIC kernels.f90 -o kernels.s

results in this assmebly:
https://github.com/chriselrod/JuliaToFortran.jl/blob/master/fortran/kernels.s

where
$ gfortran --version
GNU Fortran (GCC) 8.1.1 20180712 (Red Hat 8.1.1-5)
Copyright (C) 2018 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.


The manually unrolled version runs in about 135 ns, while the for loop
takes just over 1.4 microseconds on my computer.

Looking at the assembly of the manually unrolled version:

There are 13 total vmovapd instructions, 8 of them are moving from one zmm
register to another, while 5 move from a zmm register to %rsp, eg:

vmovapd    %zmm20, 136(%rsp)

Is there a good reason for this? The source of the move is then almost
immediately overwritten by another instruction (and there's no reason to
have 2 copies anyway). So I'd have thought the optimal code would have 0
such instructions.
The only reason I can see is if there's a restriction on where fma
instructions can store their result. For example, if they can't always
store them in the register of the number being summed (ie, if they're not
capable of always doing z = x * y + z, but need to overwrite x or y instead
sometimes for some reason -- like an architectural restriction? )
Assuming it's better not to have them, any way to try and diagnose why
they're generated, and avoid it?

Otherwise, the assembly looks great: repeated blocks containing
2x vmovupd
7x vbroadcastsd
14x vfmadd231pd

(For comparison, ifort produces much slower code (220 ns), with their
assembly annotation noting lots of register spills.)


The unrolled code, on the other hand, has massive piles of instructions
just moving data around between registers, between registers and memory,
etc.

The looped version's assembly is actually over 3 times longer than the
entirely (manually) unrolled version!

I would have hoped that `-funroll-loops` or `-funroll-all-loops` would have
been able to save the effort of doing it manually, or also that the plain
loop also generates clean code.

But instead, I'd need a preprocessor or some other means of generating
kernels if I wanted to experiment with optimization.

Is this a bug, expected behaviour that loops manage memory like that, or
something that can easily be worked around another way?

Thanks,
Chris
Reply | Threaded
Open this post in threaded view
|

Re: Manual unrolling is fast; funroll-loops doesn't unroll, produces >3x assembly and runs >10x slower

jerry DeLisle-3
This is the gfortran list but these optimizations are handled by the gcc
optimizers and not the compiler front-end. Probably need to post to
bugzilla here:

https://gcc.gnu.org/bugzilla/

Jerry


On 07/21/2018 05:44 AM, Chris Elrod wrote:

> Here is code:
> https://github.com/chriselrod/JuliaToFortran.jl/blob/master/fortran/kernels.f90
> for a 16x32 * 32x14 matrix multiplication kernel (meant for for avx-512
> processors)
>
> Compiling with:
>
> gfortran -Ofast -march=skylake-avx512 -mprefer-vector-width=512
> -funroll-loops -S -shared -fPIC kernels.f90 -o kernels.s
>
> results in this assmebly:
> https://github.com/chriselrod/JuliaToFortran.jl/blob/master/fortran/kernels.s
>
> where
> $ gfortran --version
> GNU Fortran (GCC) 8.1.1 20180712 (Red Hat 8.1.1-5)
> Copyright (C) 2018 Free Software Foundation, Inc.
> This is free software; see the source for copying conditions.  There is NO
> warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
>
>
> The manually unrolled version runs in about 135 ns, while the for loop
> takes just over 1.4 microseconds on my computer.
>
> Looking at the assembly of the manually unrolled version:
>
> There are 13 total vmovapd instructions, 8 of them are moving from one zmm
> register to another, while 5 move from a zmm register to %rsp, eg:
>
> vmovapd    %zmm20, 136(%rsp)
>
> Is there a good reason for this? The source of the move is then almost
> immediately overwritten by another instruction (and there's no reason to
> have 2 copies anyway). So I'd have thought the optimal code would have 0
> such instructions.
> The only reason I can see is if there's a restriction on where fma
> instructions can store their result. For example, if they can't always
> store them in the register of the number being summed (ie, if they're not
> capable of always doing z = x * y + z, but need to overwrite x or y instead
> sometimes for some reason -- like an architectural restriction? )
> Assuming it's better not to have them, any way to try and diagnose why
> they're generated, and avoid it?
>
> Otherwise, the assembly looks great: repeated blocks containing
> 2x vmovupd
> 7x vbroadcastsd
> 14x vfmadd231pd
>
> (For comparison, ifort produces much slower code (220 ns), with their
> assembly annotation noting lots of register spills.)
>
>
> The unrolled code, on the other hand, has massive piles of instructions
> just moving data around between registers, between registers and memory,
> etc.
>
> The looped version's assembly is actually over 3 times longer than the
> entirely (manually) unrolled version!
>
> I would have hoped that `-funroll-loops` or `-funroll-all-loops` would have
> been able to save the effort of doing it manually, or also that the plain
> loop also generates clean code.
>
> But instead, I'd need a preprocessor or some other means of generating
> kernels if I wanted to experiment with optimization.
>
> Is this a bug, expected behaviour that loops manage memory like that, or
> something that can easily be worked around another way?
>
> Thanks,
> Chris
>

Reply | Threaded
Open this post in threaded view
|

Re: Manual unrolling is fast; funroll-loops doesn't unroll, produces >3x assembly and runs >10x slower

gcc - fortran mailing list
If the application requires a specific unroll factor,  max unroll times should be set.  It does look like a bug if default unroll is so aggressive as to force spills.


Sent via the Samsung Galaxy S8 Active, an AT&T 4G LTE smartphone
-------- Original message --------From: Jerry DeLisle <[hidden email]> Date: 7/21/18  3:14 PM  (GMT-05:00) To: Chris Elrod <[hidden email]>, [hidden email] Subject: Re: Manual unrolling is fast; funroll-loops doesn't unroll, produces >3x assembly and runs >10x slower
This is the gfortran list but these optimizations are handled by the gcc
optimizers and not the compiler front-end. Probably need to post to
bugzilla here:

https://gcc.gnu.org/bugzilla/

Jerry


On 07/21/2018 05:44 AM, Chris Elrod wrote:

> Here is code:
> https://github.com/chriselrod/JuliaToFortran.jl/blob/master/fortran/kernels.f90
> for a 16x32 * 32x14 matrix multiplication kernel (meant for for avx-512
> processors)
>
> Compiling with:
>
> gfortran -Ofast -march=skylake-avx512 -mprefer-vector-width=512
> -funroll-loops -S -shared -fPIC kernels.f90 -o kernels.s
>
> results in this assmebly:
> https://github.com/chriselrod/JuliaToFortran.jl/blob/master/fortran/kernels.s
>
> where
> $ gfortran --version
> GNU Fortran (GCC) 8.1.1 20180712 (Red Hat 8.1.1-5)
> Copyright (C) 2018 Free Software Foundation, Inc.
> This is free software; see the source for copying conditions.  There is NO
> warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
>
>
> The manually unrolled version runs in about 135 ns, while the for loop
> takes just over 1.4 microseconds on my computer.
>
> Looking at the assembly of the manually unrolled version:
>
> There are 13 total vmovapd instructions, 8 of them are moving from one zmm
> register to another, while 5 move from a zmm register to %rsp, eg:
>
> vmovapd    %zmm20, 136(%rsp)
>
> Is there a good reason for this? The source of the move is then almost
> immediately overwritten by another instruction (and there's no reason to
> have 2 copies anyway). So I'd have thought the optimal code would have 0
> such instructions.
> The only reason I can see is if there's a restriction on where fma
> instructions can store their result. For example, if they can't always
> store them in the register of the number being summed (ie, if they're not
> capable of always doing z = x * y + z, but need to overwrite x or y instead
> sometimes for some reason -- like an architectural restriction? )
> Assuming it's better not to have them, any way to try and diagnose why
> they're generated, and avoid it?
>
> Otherwise, the assembly looks great: repeated blocks containing
> 2x vmovupd
> 7x vbroadcastsd
> 14x vfmadd231pd
>
> (For comparison, ifort produces much slower code (220 ns), with their
> assembly annotation noting lots of register spills.)
>
>
> The unrolled code, on the other hand, has massive piles of instructions
> just moving data around between registers, between registers and memory,
> etc.
>
> The looped version's assembly is actually over 3 times longer than the
> entirely (manually) unrolled version!
>
> I would have hoped that `-funroll-loops` or `-funroll-all-loops` would have
> been able to save the effort of doing it manually, or also that the plain
> loop also generates clean code.
>
> But instead, I'd need a preprocessor or some other means of generating
> kernels if I wanted to experiment with optimization.
>
> Is this a bug, expected behaviour that loops manage memory like that, or
> something that can easily be worked around another way?
>
> Thanks,
> Chris
>