[AArch64 ELF ABI] Vector calls and lazy binding on AArch64

Szabolcs Nagy-2
The lazy binding code of aarch64 currently only preserves q0-q7 of the
fp registers, but for an SVE call [AAPCS64+SVE] it should preserve p0-p3
and z0-z23, and for an AdvSIMD vector call [VABI64] it should preserve
q0-q23. (Vector calls are extensions of the base PCS [AAPCS64].)

A possible fix is to save and restore the additional register state in
the lazy binding entry code. This was discussed in

  https://sourceware.org/ml/libc-alpha/2018-08/msg00017.html

where the main objections were

(1) Linux may optimize the kernel entry code for processes that don't
    use SVE, so lazy binding should avoid accessing SVE registers.

(2) If this is fixed in the dynamic linker, vector calls will not be
    backward compatible with old glibc.

(3) The saved SVE register state can be large (> 8K), so binaries that
    work today may run out of stack space on an SVE system during lazy
    binding (which can e.g. happen in a signal handler on a tiny stack).

and the proposed solution was to force bind-now semantics for vector
functions, e.g. by not calling them via the PLT. This turned out to be
harder than I expected. I no longer think (1) and (2) are critically
important, but (3) is a correctness issue which is hard to argue away
(it would require larger stack allocations to accommodate the worst-case
stack size increase, but the stack allocation is not always under the
control of glibc, so glibc cannot provide strict guarantees).
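
To put rough numbers on objection (3), here is a small worked
calculation. It assumes the architectural maximum SVE vector length of
2048 bits; sve_spill_bytes is a hypothetical helper, and which registers
count towards the "> 8K" figure above is an assumption (the full z0-z31
and p0-p15 set is used in the usage note below).

```c
#include <assert.h>
#include <stddef.h>

/* Bytes needed to spill nz Z registers and np P registers at a given
   SVE vector length in bits.  A Z register is VL bits wide; a P
   register holds one bit per Z register byte, so it is VL/8 bits.  */
static size_t sve_spill_bytes(unsigned vl_bits, unsigned nz, unsigned np)
{
    /* SVE vector lengths are multiples of 128 bits, up to 2048.  */
    assert(vl_bits >= 128 && vl_bits <= 2048 && vl_bits % 128 == 0);

    size_t z_bytes = vl_bits / 8;   /* one Z register */
    size_t p_bytes = vl_bits / 64;  /* one P register */
    return (size_t)nz * z_bytes + (size_t)np * p_bytes;
}
```

With these assumptions, sve_spill_bytes(2048, 24, 4) gives 6272 bytes
for z0-z23 plus p0-p3, and sve_spill_bytes(2048, 32, 16) gives 8704
bytes for all Z and P registers, compared with 8 * 16 = 128 bytes for
the q0-q7 that are saved today.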

Some approaches to make symbols "bind now" were discussed at

  https://groups.google.com/forum/#!topic/generic-abi/Bfb2CwX-u4M

The ABI change draft is below the notes. It requires marking the symbols
in the ELF symbol table that follow the vector PCS (or other variant
PCS conventions). This is most relevant to dynamic linkers with lazy
binding support and to ELF linkers targeting AArch64, but assemblers
will need to be updated too.

Note 1: the dynamic linker may have to run user code during lazy binding
because of ifunc resolvers, so it cannot avoid clobbering fp regs.

Note 2: the tlsdesc entry is also affected by (3), so either the
initial DTV setup should avoid clobbering fp regs or the SVE register
state should not be callee-preserved by the tlsdesc call ABI (the latter
was chosen, which is backward compatible with old dynamic linkers, but
tls access from SVE code is as expensive as an extern call now: the
caller has to spill).

Note 3: signal frame and SVE register spills in code using SVE can also
lead to variable stack usage (AT_MINSIGSTKSZ was introduced to address
the former issue on linux) so it is a valid approach to just increase
min stack size limits on aarch64 compared to other targets (this is less
invasive, but does not fix old binaries).

Note 4: the proposal requires marking symbols in asm and elf objects, so
it is not compatible with existing tooling (old as or ld cannot create
valid vector function symbol references or definitions) and it is only
effective with a new dynamic linker.

Note 5: -fno-plt style code generation for vector function calls might
have worked too, but on aarch64 it requires compiler and linker changes
to avoid PLT in position dependent code when that is emitted for the
sake of pointer equality. It also requires tightening the ABI to ensure
the static linker does not introduce PLT when processing certain static
relocations. This approach would generate suboptimal statically linked
code (the no-plt code is hard to relax into direct calls on aarch64),
and it would be fragile (easy to accidentally introduce a PLT) and hard
to diagnose.

Note 6: the proposed solution applies to both SVE calls and AdvSIMD
vector calls, even though some issues only apply to SVE.

Note 7: a separate dynamic linker entry point for variant PCS calls
may be introduced (requires further ELF changes for a PLT0 like stub)
or the dynamic linker may decide to always preserve all registers or
decide to always bind symbols at load time.


AAELF64: in the Symbol Table section add

 st_other Values
     The  st_other  member  of  a symbol table entry specifies the symbol's
     visibility in the lowest 2 bits.  The top 6 bits  are  unused  in  the
     generic  ELF ABI [SCO-ELF], and while there are no values reserved for
     processor-specific semantics, many other architectures have used these
     bits.

     The  defined  processor-specific  st_other  flag  values are listed in
     Table 4-5-1.

 Table 4-5-1, Processor specific st_other flags
             +------------------------+------+---------------------+
             |Name                    | Mask | Comment             |
             +------------------------+------+---------------------+
             |STO_AARCH64_VARIANT_PCS | 0x80 | The        function |
             |                        |      | associated with the |
             |                        |      | symbol may follow a |
             |                        |      | variant   procedure |
             |                        |      | call  standard with |
             |                        |      | different  register |
             |                        |      | usage convention.   |
             +------------------------+------+---------------------+

     A  symbol  table entry that is marked with the STO_AARCH64_VARIANT_PCS
     flag set in its st_other field may be associated with a function  that
     follows  a  variant  procedure  call  standard with different register
     usage convention from the one  defined  in  the  base  procedure  call
     standard  for  the  list  of  argument,  caller-saved and callee-saved
     registers [AAPCS64].  The rules  in  the  Call  and  Jump  relocations
     section  still  apply to such functions, and if a subroutine is called
     via a symbol reference that  is  marked  with  STO_AARCH64_VARIANT_PCS
     then  code that runs between the calling routine and called subroutine
     must preserve the contents of all registers except IP0,  IP1  and  the
     condition code flags [AAPCS64].

     Static  linkers  must  preserve  the  marking  and propagate it to the
     dynamic symbol table if any reference or definition of the  symbol  is
     marked  with STO_AARCH64_VARIANT_PCS, and add a DT_AARCH64_VARIANT_PCS
     dynamic tag if required by the Dynamic Section section.

     NOTE:
        In particular, when a call is made via the PLT entry  of  a  symbol
        marked with STO_AARCH64_VARIANT_PCS, a dynamic linker cannot assume
        that the call follows the register usage  convention  of  the  base
        procedure call standard.

        An  example  of  a  function  that follows a variant procedure call
        standard with different register usage convention is one that takes
        parameters in scalable vector or predicate registers.
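
     A minimal sketch of how a tool might test the proposed flag,
     assuming the ELF64 symbol layout from <elf.h>. The helper names
     are hypothetical, and the fallback define uses the mask value
     from Table 4-5-1 for headers that predate this proposal.

```c
#include <assert.h>
#include <elf.h>
#include <stdbool.h>
#include <stddef.h>

#ifndef STO_AARCH64_VARIANT_PCS
# define STO_AARCH64_VARIANT_PCS 0x80  /* mask from Table 4-5-1 */
#endif

/* True if the symbol is marked as possibly following a variant PCS.  */
static bool sym_is_variant_pcs(const Elf64_Sym *sym)
{
    assert(sym != NULL);
    return (sym->st_other & STO_AARCH64_VARIANT_PCS) != 0;
}

/* The visibility still lives in the lowest 2 bits of st_other, so the
   flag and the visibility can be read independently.  */
static unsigned sym_visibility(const Elf64_Sym *sym)
{
    assert(sym != NULL);
    return ELF64_ST_VISIBILITY(sym->st_other);
}
```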


AAELF64: in the Dynamic Section section add

 Table 5-4, AArch64 specific dynamic array tags
   +-----------------------+------------+-------+------------+---------------+
   |Name                   | Value      | d_un  | Executable | Shared Object |
   +-----------------------+------------+-------+------------+---------------+
   |DT_AARCH64_VARIANT_PCS | 0x70000005 | d_val | Platform   | Platform      |
   |                       |            |       | specific   | Specific      |
   +-----------------------+------------+-------+------------+---------------+

     DT_AARCH64_VARIANT_PCS must be present if there are  R_<CLS>_JUMP_SLOT
     relocations     that     reference    symbols    marked    with    the
     STO_AARCH64_VARIANT_PCS flag set in their st_other field.
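
     A minimal sketch of how a dynamic linker or inspection tool might
     look for the proposed tag, assuming an ELF64 dynamic array that is
     terminated by DT_NULL. has_variant_pcs_tag is a hypothetical
     helper, and the fallback define uses the value from Table 5-4.

```c
#include <assert.h>
#include <elf.h>
#include <stdbool.h>
#include <stddef.h>

#ifndef DT_AARCH64_VARIANT_PCS
# define DT_AARCH64_VARIANT_PCS 0x70000005  /* value from Table 5-4 */
#endif

/* Walk the dynamic array until DT_NULL and report whether the
   DT_AARCH64_VARIANT_PCS tag is present.  */
static bool has_variant_pcs_tag(const Elf64_Dyn *dyn)
{
    assert(dyn != NULL);
    for (; dyn->d_tag != DT_NULL; dyn++)
        if (dyn->d_tag == DT_AARCH64_VARIANT_PCS)
            return true;
    return false;
}
```

     The point of the tag is that an object without it needs no
     per-symbol st_other checks when its JUMP_SLOT relocations are
     processed.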


VABI64: after the Vector Procedure Call Standard section add

 Dynamic linking for AAVPCS
     On ELF platforms with dynamic linking support, symbol definitions  and
     references must be marked with the STO_AARCH64_VARIANT_PCS flag set in
     their st_other field if the following holds:

     1. the symbol is visible outside of its defining component (executable
        file or shared object), and

     2. the  symbol  is  associated  with  a  function following the AAVPCS
        convention.

     For more information on STO_AARCH64_VARIANT_PCS, see AAELF64.

     NOTE:
        Marking all function symbol definitions and references is  a  valid
        way of implementing this requirement.


[AAELF64]: ELF for the Arm 64-bit Architecture (AArch64)
           https://developer.arm.com/docs/ihi0056/latest
[VABI64]:  Vector Function ABI Specification for AArch64
           https://developer.arm.com/tools-and-software/server-and-hpc/arm-architecture-tools/arm-compiler-for-hpc/vector-function-abi
[AAPCS64]: Procedure Call Standard for the Arm 64-bit Architecture (AArch64)
           https://developer.arm.com/docs/ihi0055/latest
[AAPCS64+SVE]: Procedure Call Standard for the ARM 64-bit Architecture
           (AArch64) with SVE support
           https://developer.arm.com/docs/100986/latest
[SCO-ELF]: System V Application Binary Interface
           http://www.sco.com/developers/gabi/

Re: [AArch64 ELF ABI] Vector calls and lazy binding on AArch64

Florian Weimer-5
* Szabolcs Nagy:

> AAELF64: in the Symbol Table section add
>
>  st_other Values
>      The  st_other  member  of  a symbol table entry specifies the symbol's
>      visibility in the lowest 2 bits.  The top 6 bits  are  unused  in  the
>      generic  ELF ABI [SCO-ELF], and while there are no values reserved for
>      processor-specific semantics, many other architectures have used these
>      bits.
>
>      The  defined  processor-specific  st_other  flag  values are listed in
>      Table 4-5-1.
>
>  Table 4-5-1, Processor specific st_other flags
>              +------------------------+------+---------------------+
>              |Name                    | Mask | Comment             |
>              +------------------------+------+---------------------+
>              |STO_AARCH64_VARIANT_PCS | 0x80 | The        function |
>              |                        |      | associated with the |
>              |                        |      | symbol may follow a |
>              |                        |      | variant   procedure |
>              |                        |      | call  standard with |
>              |                        |      | different  register |
>              |                        |      | usage convention.   |
>              +------------------------+------+---------------------+
>
>      A  symbol  table entry that is marked with the STO_AARCH64_VARIANT_PCS
>      flag set in its st_other field may be associated with a function  that
>      follows  a  variant  procedure  call  standard with different register
>      usage convention from the one  defined  in  the  base  procedure  call
>      standard  for  the  list  of  argument,  caller-saved and callee-saved
>      registers [AAPCS64].  The rules  in  the  Call  and  Jump  relocations
>      section  still  apply to such functions, and if a subroutine is called
>      via a symbol reference that  is  marked  with  STO_AARCH64_VARIANT_PCS
>      then  code that runs between the calling routine and called subroutine
>      must preserve the contents of all registers except IP0,  IP1  and  the
>      condition code flags [AAPCS64].

Can you clarify if there has to be a valid stack at this point which can
be used during the call transfer?  What about the stack alignment
requirement?

Thanks,
Florian

Re: [AArch64 ELF ABI] Vector calls and lazy binding on AArch64

Szabolcs Nagy-2
On 22/05/2019 16:06, Florian Weimer wrote:

> * Szabolcs Nagy:
>
>> AAELF64: in the Symbol Table section add
>>
>>  st_other Values
>>      The  st_other  member  of  a symbol table entry specifies the symbol's
>>      visibility in the lowest 2 bits.  The top 6 bits  are  unused  in  the
>>      generic  ELF ABI [SCO-ELF], and while there are no values reserved for
>>      processor-specific semantics, many other architectures have used these
>>      bits.
>>
>>      The  defined  processor-specific  st_other  flag  values are listed in
>>      Table 4-5-1.
>>
>>  Table 4-5-1, Processor specific st_other flags
>>              +------------------------+------+---------------------+
>>              |Name                    | Mask | Comment             |
>>              +------------------------+------+---------------------+
>>              |STO_AARCH64_VARIANT_PCS | 0x80 | The        function |
>>              |                        |      | associated with the |
>>              |                        |      | symbol may follow a |
>>              |                        |      | variant   procedure |
>>              |                        |      | call  standard with |
>>              |                        |      | different  register |
>>              |                        |      | usage convention.   |
>>              +------------------------+------+---------------------+
>>
>>      A  symbol  table entry that is marked with the STO_AARCH64_VARIANT_PCS
>>      flag set in its st_other field may be associated with a function  that
>>      follows  a  variant  procedure  call  standard with different register
>>      usage convention from the one  defined  in  the  base  procedure  call
>>      standard  for  the  list  of  argument,  caller-saved and callee-saved
>>      registers [AAPCS64].  The rules  in  the  Call  and  Jump  relocations
>>      section  still  apply to such functions, and if a subroutine is called
>>      via a symbol reference that  is  marked  with  STO_AARCH64_VARIANT_PCS
>>      then  code that runs between the calling routine and called subroutine
>>      must preserve the contents of all registers except IP0,  IP1  and  the
>>      condition code flags [AAPCS64].
>
> Can you clarify if there has to be a valid stack at this point which can
> be used during the call transfer?  What about the stack alignment
> requirement?

the intention is to only allow 'register usage convention' to be
relaxed compared to the base PCS (which has rules for stack etc),
and even the register usage convention has to be compatible with
the 'Call and Jump relocations section' which essentially says that
veneers inserted by the linker between calls can clobber IP0, IP1
and the condition flags.

i.e. a variant pcs function follows the same rules as base pcs, but
it may use different caller-/callee-saved/argument registers.

when SVE pcs is merged into the current AAPCS document, then i hope
the 'variant pcs' term used here will be properly specified so the
ELF ABI will just refer back to that.


Re: [AArch64 ELF ABI] Vector calls and lazy binding on AArch64

Florian Weimer-5
* Szabolcs Nagy:

> On 22/05/2019 16:06, Florian Weimer wrote:
>> * Szabolcs Nagy:
>>
>>> AAELF64: in the Symbol Table section add
>>>
>>>  st_other Values
>>>      The  st_other  member  of  a symbol table entry specifies the symbol's
>>>      visibility in the lowest 2 bits.  The top 6 bits  are  unused  in  the
>>>      generic  ELF ABI [SCO-ELF], and while there are no values reserved for
>>>      processor-specific semantics, many other architectures have used these
>>>      bits.
>>>
>>>      The  defined  processor-specific  st_other  flag  values are listed in
>>>      Table 4-5-1.
>>>
>>>  Table 4-5-1, Processor specific st_other flags
>>>              +------------------------+------+---------------------+
>>>              |Name                    | Mask | Comment             |
>>>              +------------------------+------+---------------------+
>>>              |STO_AARCH64_VARIANT_PCS | 0x80 | The        function |
>>>              |                        |      | associated with the |
>>>              |                        |      | symbol may follow a |
>>>              |                        |      | variant   procedure |
>>>              |                        |      | call  standard with |
>>>              |                        |      | different  register |
>>>              |                        |      | usage convention.   |
>>>              +------------------------+------+---------------------+
>>>
>>>      A  symbol  table entry that is marked with the STO_AARCH64_VARIANT_PCS
>>>      flag set in its st_other field may be associated with a function  that
>>>      follows  a  variant  procedure  call  standard with different register
>>>      usage convention from the one  defined  in  the  base  procedure  call
>>>      standard  for  the  list  of  argument,  caller-saved and callee-saved
>>>      registers [AAPCS64].  The rules  in  the  Call  and  Jump  relocations
>>>      section  still  apply to such functions, and if a subroutine is called
>>>      via a symbol reference that  is  marked  with  STO_AARCH64_VARIANT_PCS
>>>      then  code that runs between the calling routine and called subroutine
>>>      must preserve the contents of all registers except IP0,  IP1  and  the
>>>      condition code flags [AAPCS64].
>>
>> Can you clarify if there has to be a valid stack at this point which can
>> be used during the call transfer?  What about the stack alignment
>> requirement?
>
> the intention is to only allow 'register usage convention' to be
> relaxed compared to the base PCS (which has rules for stack etc),
> and even the register usage convention has to be compatible with
> the 'Call and Jump relocations section' which essentially says that
> veneers inserted by the linker between calls can clobber IP0, IP1
> and the condition flags.
>
> i.e. a variant pcs function follows the same rules as base pcs, but
> it may use different caller-/callee-saved/argument registers.
>
> when SVE pcs is merged into the current AAPCS document, then i hope
> the 'variant pcs' term used here will be properly specified so the
> ELF ABI will just refer back to that.

My concern is that with the current language, it's not clear whether
it's possible to use the stack as a scratch area during the call
transition, or rely on a valid TCB.  I think this is rather
underspecified.

Thanks,
Florian

Re: [AArch64 ELF ABI] Vector calls and lazy binding on AArch64

Szabolcs Nagy-2
On 22/05/2019 16:34, Florian Weimer wrote:

> * Szabolcs Nagy:
>
>> On 22/05/2019 16:06, Florian Weimer wrote:
>>> * Szabolcs Nagy:
>>>
>>>> AAELF64: in the Symbol Table section add
>>>>
>>>>  st_other Values
>>>>      The  st_other  member  of  a symbol table entry specifies the symbol's
>>>>      visibility in the lowest 2 bits.  The top 6 bits  are  unused  in  the
>>>>      generic  ELF ABI [SCO-ELF], and while there are no values reserved for
>>>>      processor-specific semantics, many other architectures have used these
>>>>      bits.
>>>>
>>>>      The  defined  processor-specific  st_other  flag  values are listed in
>>>>      Table 4-5-1.
>>>>
>>>>  Table 4-5-1, Processor specific st_other flags
>>>>              +------------------------+------+---------------------+
>>>>              |Name                    | Mask | Comment             |
>>>>              +------------------------+------+---------------------+
>>>>              |STO_AARCH64_VARIANT_PCS | 0x80 | The        function |
>>>>              |                        |      | associated with the |
>>>>              |                        |      | symbol may follow a |
>>>>              |                        |      | variant   procedure |
>>>>              |                        |      | call  standard with |
>>>>              |                        |      | different  register |
>>>>              |                        |      | usage convention.   |
>>>>              +------------------------+------+---------------------+
>>>>
>>>>      A  symbol  table entry that is marked with the STO_AARCH64_VARIANT_PCS
>>>>      flag set in its st_other field may be associated with a function  that
>>>>      follows  a  variant  procedure  call  standard with different register
>>>>      usage convention from the one  defined  in  the  base  procedure  call
>>>>      standard  for  the  list  of  argument,  caller-saved and callee-saved
>>>>      registers [AAPCS64].  The rules  in  the  Call  and  Jump  relocations
>>>>      section  still  apply to such functions, and if a subroutine is called
>>>>      via a symbol reference that  is  marked  with  STO_AARCH64_VARIANT_PCS
>>>>      then  code that runs between the calling routine and called subroutine
>>>>      must preserve the contents of all registers except IP0,  IP1  and  the
>>>>      condition code flags [AAPCS64].
>>>
>>> Can you clarify if there has to be a valid stack at this point which can
>>> be used during the call transfer?  What about the stack alignment
>>> requirement?
>>
>> the intention is to only allow 'register usage convention' to be
>> relaxed compared to the base PCS (which has rules for stack etc),
>> and even the register usage convention has to be compatible with
>> the 'Call and Jump relocations section' which essentially says that
>> veneers inserted by the linker between calls can clobber IP0, IP1
>> and the condition flags.
>>
>> i.e. a variant pcs function follows the same rules as base pcs, but
>> it may use different caller-/callee-saved/argument registers.
>>
>> when SVE pcs is merged into the current AAPCS document, then i hope
>> the 'variant pcs' term used here will be properly specified so the
>> ELF ABI will just refer back to that.
>
> My concern is that with the current language, it's not clear whether
> it's possible to use the stack as a scratch area during the call
> transition, or rely on a valid TCB.  I think this is rather
> underspecified.

i think that's underspecified in general for normal calls too,
currently the glibc dynamic linker assumes it can use some stack
space and do various async signal safe operations (some of which
may even fail), variant pcs does not change any of this.

it only provides a per symbol escape hatch for functions with a
bit special call convention, and i plan to use the symbol marking
in glibc as 'force bind now for these symbols', because other
behaviour may not be forward compatible if the architecture
changes again (if lazy binding turns out to be very important
for these symbols i'd prefer introducing a second entry point
for them instead of checking the elf flags from the entry asm).
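
a minimal sketch of the 'force bind now for these symbols' decision,
for illustration only: it is not the glibc patch, must_bind_now and its
parameters are hypothetical names, and it assumes ELF64 RELA relocations
with fallback defines for headers that lack the new constants.

```c
#include <assert.h>
#include <elf.h>
#include <stdbool.h>
#include <stddef.h>

#ifndef STO_AARCH64_VARIANT_PCS
# define STO_AARCH64_VARIANT_PCS 0x80
#endif
#ifndef R_AARCH64_JUMP_SLOT
# define R_AARCH64_JUMP_SLOT 1026
#endif

/* Decide whether a PLT relocation has to be resolved at load time
   instead of lazily.  obj_has_tag says whether the object carries
   DT_AARCH64_VARIANT_PCS, so unmarked objects skip the symbol lookup.  */
static bool must_bind_now(bool obj_has_tag, const Elf64_Rela *rela,
                          const Elf64_Sym *symtab)
{
    assert(rela != NULL && symtab != NULL);

    if (!obj_has_tag)
        return false;
    if (ELF64_R_TYPE(rela->r_info) != R_AARCH64_JUMP_SLOT)
        return false;

    /* The relocation names its symbol by index into the symbol table.  */
    const Elf64_Sym *sym = &symtab[ELF64_R_SYM(rela->r_info)];
    return (sym->st_other & STO_AARCH64_VARIANT_PCS) != 0;
}
```

with this shape the lazy binding entry asm stays unchanged: marked
symbols never reach it, and everything else is still bound lazily.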

i'll try to post patches implementing this abi soon.

Re: [AArch64 ELF ABI] Vector calls and lazy binding on AArch64

Szabolcs Nagy-2
In reply to this post by Szabolcs Nagy-2
On 22/05/2019 15:42, Szabolcs Nagy wrote:
> [AAELF64]: ELF for the Arm 64-bit Architecture (AArch64)
>            https://developer.arm.com/docs/ihi0056/latest
> [VABI64]:  Vector Function ABI Specification for AArch64
>            https://developer.arm.com/tools-and-software/server-and-hpc/arm-architecture-tools/arm-compiler-for-hpc/vector-function-abi

the new ABI has been published with minor wording changes
compared to the draft version.

the ABI is implemented in gcc, binutils and glibc in a
series of patches listed below.


gcc:

commit 779640c76d37b32f4d8a7b97637ed9e345d750b4
Commit:     nsz <nsz@138bc75d-0d04-0410-961f-82ee72b054a4>
CommitDate: 2019-06-03 13:50:53 +0000

    aarch64: emit .variant_pcs for aarch64_vector_pcs symbol references
    git-svn-id: svn+ssh://gcc.gnu.org/svn/gcc/trunk@271869 138bc75d-0d04-0410-961f-82ee72b054a4

commit d403a7711c2cf9a7a4892d76b875a1c99a690f89
Commit:     nsz <nsz@138bc75d-0d04-0410-961f-82ee72b054a4>
CommitDate: 2019-06-04 16:16:52 +0000

    aarch64: fix asm visibility for extern symbols
    git-svn-id: svn+ssh://gcc.gnu.org/svn/gcc/trunk@271913 138bc75d-0d04-0410-961f-82ee72b054a4

commit 042371f341a956de8c76557df700ebdc1af9ab4f
Commit:     nsz <nsz@138bc75d-0d04-0410-961f-82ee72b054a4>
CommitDate: 2019-06-18 11:11:07 +0000

    aarch64: fix gcc.target/aarch64/pcs_attribute-2.c on non-gnu targets
    git-svn-id: svn+ssh://gcc.gnu.org/svn/gcc/trunk@272414 138bc75d-0d04-0410-961f-82ee72b054a4


binutils:

commit 2301ed1c9af1316b4bad3747d2b03f7d44940f87
Commit:     Szabolcs Nagy <[hidden email]>
CommitDate: 2019-05-24 15:05:57 +0100

    aarch64: add STO_AARCH64_VARIANT_PCS and DT_AARCH64_VARIANT_PCS

commit f166ae0188dcb89c5ae925034260a708a254ab2f
Commit:     Szabolcs Nagy <[hidden email]>
CommitDate: 2019-05-24 15:07:42 +0100

    aarch64: handle .variant_pcs directive in gas

commit 0b4eac57c44ec4c9e13f5201b40936c3b3e6c639
Commit:     Szabolcs Nagy <[hidden email]>
CommitDate: 2019-05-24 15:09:06 +0100

    aarch64: override default elf .set handling in gas

commit 823710d5856996d1f54f04ecb2f7647aeae99b5b
Commit:     Szabolcs Nagy <[hidden email]>
CommitDate: 2019-05-24 15:11:00 +0100

    aarch64: handle STO_AARCH64_VARIANT_PCS in bfd

commit 65f381e729bedb933f3e1376e7f53f0ff63ac9a8
Commit:     Szabolcs Nagy <[hidden email]>
CommitDate: 2019-05-28 12:03:51 +0100

    aarch64: fix variant_pcs ld tests


glibc:

commit 55f82d328d2dd1c7c13c1992f4b9bf9c95b57551
Commit:     Szabolcs Nagy <[hidden email]>
CommitDate: 2019-06-13 09:44:44 +0100

    aarch64: add STO_AARCH64_VARIANT_PCS and DT_AARCH64_VARIANT_PCS

commit 82bc69c012838a381c4167c156a06f4598f34227
Commit:     Szabolcs Nagy <[hidden email]>
CommitDate: 2019-06-13 09:45:00 +0100

    aarch64: handle STO_AARCH64_VARIANT_PCS