Cortex M0 Floating Point Library


Daniel Engel
Hi,

Over the past couple of years, I have hand-assembled a new floating point library for the ARM Cortex M0 architecture.  I know the M0 is not generally regarded as a number-crunching machine, but I felt it deserved at least some of the attention that has previously been bestowed on the AVR architecture.  As this work has been incidental to my employer's line of business, they have tentatively agreed to assign the copyright and facilitate a release of this library as open source.  

I have efficient implementations of all of the integer and single-precision AEABI functions:

*  clzsi2, clzdi2, umulsidi3, mulsidi3, muldi3 (aeabi_lmul)
*  ashldi3 (aeabi_llsl), lshrdi3 (aeabi_llsr), ashrdi3 (aeabi_lasr)
*  aeabi_lcmp, aeabi_ulcmp
*  udivsi3 (aeabi_uidivmod), divsi3 (aeabi_idivmod), udivdi3 (aeabi_uldivmod), divdi3 (aeabi_ldivmod)
*  addsf3 (aeabi_fadd), subsf3 (aeabi_fsub, aeabi_frsub), mulsf3 (aeabi_fmul), divsf3 (aeabi_fdiv), fdimf
*  cmpsf2 (aeabi_fcmpun), eqsf2 (aeabi_fcmpeq), nesf2 (aeabi_fcmpne), gesf2 (aeabi_fcmpge), gtsf2, unordsf2
*  floatundisf (aeabi_ul2f), floatunsisf (aeabi_ui2f), floatdisf (aeabi_l2f), floatsisf (aeabi_i2f)
*  fixsfdi (aeabi_f2lz), fixunssfdi (aeabi_f2ulz), fixsfsi (aeabi_f2iz), fixunssfsi (aeabi_f2uiz)
*  aeabi_f2d, aeabi_d2f, aeabi_h2f, aeabi_f2h

I also have efficient implementations of several of the simpler libm functions:

*  frexpf, ldexpf, scalbnf
*  fmaxf, fminf
*  rintf, lrintf, ulrintf, llrintf, ullrintf, roundf, lroundf, ulroundf, llroundf, ullroundf
*  truncf, ceilf, floorf
*  fpclassifyf, isnormalf, isnanf, isinff, isfinitef, isposf, isnegf
*  ilogbf, logbf, modff
*  sqrtf, cbrtf
*  log2f, logf, log10f, log1p2f, log1pf, log1p10f, logXf, log1pXf
*  sinf, cosf, sincosf, sinpif, cospif, sincospif
*  tanf, cotf, tanpif, cotpif

Presently, the library comprises about 40 files with about 8000 lines of asm (unified syntax).  The test vectors weigh significantly more.  All of the floating point functions are IEEE754 compliant.  I can provide more complete performance statistics on request, but here are a few highlights:

* Small: Less than 3 KB for everything above.  Only 450 bytes for basic addsf3, subsf3, mulsf3, divsf3, and cmpsf2.
* Fast: addsf3 = 75 instruction cycles, subsf3 = 80, mulsf3 = 95, divsf3 = 260 to 360, cmpsf2 = 35.
* Correct: Simultaneous calculation of sincosf() in less than 500 instruction cycles, accurate within +/- 1 ulp, including arbitrarily large values of 'x'.
* Bonus: round10iff(x, n) (a non-standard function) correctly rounds floating point values 'x' to an integer power of 10 'n'; this function simulates conversion to a decimal string, truncation, and conversion back to binary32 without any string-handling overhead.
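
The string round trip that round10iff() is described as simulating can be written out literally as a reference model.  This is only a sketch under stated assumptions: round10_via_string() is a hypothetical name, it is not the library routine, it covers only n <= 0 (it takes digits = -n, the number of decimal places to keep), and it relies on a C library whose printf() and strtof() round correctly.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical reference model, NOT the library's round10iff():
 * round x to `digits` decimal places by going through a decimal string.
 * Only non-negative `digits` are handled (powers of ten 10^-digits <= 1). */
float round10_via_string(float x, int digits)
{
    char buf[512];   /* ample room: at most ~40 integer digits + 60 decimals */

    assert(digits >= 0 && digits <= 60);

    /* "%.*f" prints the binary value rounded to `digits` decimal places. */
    snprintf(buf, sizeof buf, "%.*f", digits, (double)x);

    /* strtof() converts the decimal string back to the nearest binary32. */
    return strtof(buf, NULL);
}
```

For example, round10_via_string(123.456f, 2) yields the binary32 value nearest to 123.46.  A model like this is useful mainly for generating test vectors on a PC; it carries exactly the string-handling overhead that the assembly routine avoids.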

To date, I have only built this library as part of a user space embedded application.  I have not attempted to build or patch the GCC toolchain itself.  If accepted, I suspect there will be at least a little work to restructure it for inclusion with libgcc.  But, before proceeding with that work, I need to have some idea of direction and goal.  

The first question, then, is what might the best home for this library be?  Many of the lower level functions (e.g. clzsi2, addsf3) replace the generic implementations of libgcc.  However, the higher level functions (e.g. ldexpf, sincosf) traditionally link from libm, which I don't believe is typically distributed with gcc.  The compact nature of this library of course follows from a tight integration between higher and lower level functions.  I have considered a few strategies:

* Add everything into the base libgcc,
* Add everything into libm (newlib?) and rely on link order to supersede libgcc,
* Split the implementation with some magic to ensure that libm functions only link in the presence of the correct libgcc,
* Establish an independent library specific to the Cortex M0 architecture, or
* Something else entirely...

If there is any interest in incorporating this work into GCC, please advise.  

Thanks,
Daniel Engel

Re: Cortex M0 Floating Point Library

Joel Sherrill <joel.sherrill@OARcorp.com>
On Tue, Nov 6, 2018, 10:32 PM Daniel Engel wrote:

> Hi,
>
> Over the past couple of years, I have hand-assembled a new floating point
> library for the ARM Cortex M0 architecture.  I know the M0 is not generally
> regarded as a number-crunching machine, but I felt it deserved at least
> some of the attention that has previously been bestowed on the AVR
> architecture.  As this work has been incidental to my employer's line of
> business, they have tentatively agreed to assign the copyright and
> facilitate a release of this library as open source.
>
> I have efficient implementations of all of the integer and
> single-precision AEABI functions:
>
> *  clzsi2, clzdi2, umulsidi3, mulsidi3, muldi3 (aeabi_lmul)
> *  ashldi3 (aeabi_llsl), lshrdi3 (aeabi_llsr), ashrdi3 (aeabi_lasr)
> *  aeabi_lcmp, aeabi_ulcmp
> *  udivsi3 (aeabi_uidivmod), divsi3 (aeabi_idivmod), udivdi3
> (aeabi_uldivmod), divdi3 (aeabi_ldivmod)
> *  addsf3 (aeabi_fadd), subsf3 (aeabi_fsub, aeabi_frsub), mulsf3
> (aeabi_fmul), divsf3 (aeabi_fdiv), fdimf
> *  cmpsf2 (aeabi_fcmpun), eqsf2 (aeabi_fcmpeq), nesf2 (aeabi_fcmpne),
> gesf2 (aeabi_fcmpge), gtsf2, unordsf2
> *  floatundisf (aeabi_ul2f), floatunsisf (aeabi_ui2f), floatdisf
> (aeabi_l2f), floatsisf (aeabi_i2f)
> *  fixsfdi (aeabi_f2lz), fixunssfdi (aeabi_f2ulz), fixsfsi (aeabi_f2iz),
> fixunssfsi (aeabi_f2uiz)
> *  aeabi_f2d, aeabi_d2f, aeabi_h2f, aeabi_f2h
>
> I also have efficient implementations of several of the simpler libm
> functions:
>
> *  frexpf, ldexpf, scalbnf
> *  fmaxf, fminf
> *  rintf, lrintf, ulrintf, llrintf, ullrintf, roundf, lroundf, ulroundf,
> llroundf, ullroundf
> *  truncf, ceilf, floorf
> *  fpclassifyf, isnormalf, isnanf, isinff, isfinitef, isposf, isnegf
> *  ilogbf, logbf, modff
> *  sqrtf, cbrtf
> *  log2f, logf, log10f, log1p2f, log1pf, log1p10f, logXf, log1pXf
> *  sinf, cosf, sincosf, sinpif, cospif, sincospif
> *  tanf, cotf, tanpif, cotpif
>
> Presently, the library comprises about 40 files with about 8000 lines of
> asm (unified syntax).  The test vectors weigh significantly more.  All of
> the floating point functions are IEEE754 compliant.  I can provide more
> complete performance statistics on request, but here are a few highlights:
>
> * Small: Less than 3kb for everything above.  Only 450 bytes for basic
> addsf3, subsf3, mulsf3, divsf3, and cmpsf2.
> * Fast: addsf3 = 75 instruction cycles, subsf3 = 80, mulsf3 = 95, divsf3 =
> 260 to 360, cmpsf2 = 35.
> * Correct: Simultaneous calculation of sincosf() in less than 500
> instruction cycles, accurate within +/- 1 ulp, including arbitrarily large
> values of 'x'.
> * Bonus: round10iff(x, n) (a non-standard function) correctly rounds
> floating point values 'x' to an integer power of 10 'n'; this function
> simulates conversion to a decimal string, truncation, and conversion back
> to binary32 without any string-handling overhead.
>

This sounds like a nice body of work. Congratulations.

Does paranoia pass?

>
> To date, I have only built this library as part of a user space embedded
> application.  I have not attempted to build or patch the GCC toolchain
> itself.  If accepted, I suspect there will be at least a little work to
> restructure it for inclusion with libgcc.  But, before proceeding with that
> work, I need to have some idea of direction and goal.
>
> The first question, then, is what might the best home for this library
> be?  Many of the lower level functions (e.g. clzsi2, addsf3) replace the
> generic implementations of libgcc.  However, the higher level functions
> (e.g. ldexpf, sincosf) traditionally link from libm, which I don't believe
> is typically distributed with gcc.  The compact nature of this library of
> course follows from a tight integration between higher and lower level
> functions.  I have considered a few strategies:
>
> * Add everything into the base libgcc,
> * Add everything into libm (newlib?) and rely on link order to supersede
> libgcc,
>

This will almost certainly break at some point, for someone, and be hard to
even figure out it happened because the code will work but just be bigger
or slower.

> * Split the implementation with some magic to ensure that libm functions
> only link in the presence of the correct libgcc,
>

I think this is the proper solution. It just puts better implementations in
the place the infrastructure already supports having a target specific
option.

> * Establish an independent library specific to the Cortex M0 architecture,
> or
>

This is likely to get you the smallest number of users.  People have to
find it and then integrate it on their own. Don't make it hard for folks to
find and use your work.


> * Something else entirely...
>
> If there is any interest in incorporating this work into GCC, please
> advise.
>

I think so but I am just one voice from the RTEMS community. But I think
any M0 user would be pleased.

--joel

>
> Thanks,
> Daniel Engel
>

Re: Cortex M0 Floating Point Library

Daniel Engel
On Tue, Nov 6, 2018, at 9:28 PM, Joel Sherrill wrote:
>
> On Tue, Nov 6, 2018, 10:32 PM Daniel Engel wrote:
>> Hi,
>>  
>>  Over the past couple of years, I have hand-assembled a new floating point library for the ARM Cortex M0 architecture.  I know the M0 is not generally regarded as a number-crunching machine, but I felt it deserved at least some of the attention that has previously been bestowed on the AVR architecture.  As this work has been incidental to my employer's line of business, they have tentatively agreed to assign the copyright and facilitate a release of this library as open source. 
>
> This sounds like a nice body of work. Congratulations.
>
> Does paranoia pass? 

I haven't run paranoia, as it doesn't claim to be a comprehensive test suite.  

Per the allowance of the AEABI, my library only supports round-to-nearest, ties to even.  For the basic operations, I tested the applicable cases of the UCB and ieeeCC754 test suites, plus extensive random testing with the Berkeley TestFloat/SoftFloat implementation.  All of these tests passed in an STM32F0 target environment.

    <http://www.netlib.org/fp/>
    <http://www.jhauser.us/arithmetic/TestFloat.html>
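
For readers unfamiliar with the rounding mode named above, here is a minimal sketch of round-to-nearest, ties-to-even applied to an integer significand that must lose its low bits.  The function name rne_shift_right() is hypothetical and the code is a plain C illustration of the rule, not an excerpt from the library.

```c
#include <assert.h>
#include <stdint.h>

/* Shift `value` right by `shift` bits (0..31), rounding the discarded bits
 * to nearest and breaking exact ties toward an even result. */
uint32_t rne_shift_right(uint32_t value, unsigned shift)
{
    assert(shift < 32);
    if (shift == 0)
        return value;                              /* nothing is discarded */

    uint32_t kept      = value >> shift;           /* bits that survive */
    uint32_t remainder = value & ((1u << shift) - 1u);
    uint32_t half      = 1u << (shift - 1);        /* exactly one half ulp */

    /* Round up when above the halfway point, or exactly on it with an
     * odd kept value (so that the result becomes even). */
    if (remainder > half || (remainder == half && (kept & 1u)))
        kept++;

    return kept;
}
```

For instance, 11 shifted right by one bit (5.5) rounds to 6, while 9 shifted right by one bit (4.5) rounds to 4.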

For other operations not covered by UCB or ieeeCC754, I developed my own cases using the C# floating point library for the reference operations.  Typically, I tested 50 to 500 cases per function, covering both general and special case arguments.  All functions have complete, tested support for INF, NAN, +/-0, and subnormals.  On-target testing was typically limited to about 64 KB per group of test cases (the flash memory size of the STM32F0).
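
As an illustration of what the special-case arguments look like at the bit level, the sketch below classifies binary32 bit patterns and checks a small table of them.  The table contents and all names (f32_classify_bits, f32_special_cases, f32_check_special_cases) are assumptions chosen for this example; they are not the actual test vectors.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum f32_class { F32_ZERO, F32_SUBNORMAL, F32_NORMAL, F32_INF, F32_NAN };

/* Classify a binary32 value from its raw bit pattern (the sign is ignored). */
enum f32_class f32_classify_bits(uint32_t bits)
{
    uint32_t exponent = (bits >> 23) & 0xFFu;
    uint32_t fraction = bits & 0x007FFFFFu;

    if (exponent == 0)
        return fraction ? F32_SUBNORMAL : F32_ZERO;
    if (exponent == 0xFFu)
        return fraction ? F32_NAN : F32_INF;
    return F32_NORMAL;
}

/* Hypothetical special-case table: a bit pattern and its expected class. */
struct f32_case { uint32_t bits; enum f32_class expected; };

const struct f32_case f32_special_cases[] = {
    { 0x00000000u, F32_ZERO },       /* +0 */
    { 0x80000000u, F32_ZERO },       /* -0 */
    { 0x00000001u, F32_SUBNORMAL },  /* smallest positive subnormal */
    { 0x807FFFFFu, F32_SUBNORMAL },  /* largest-magnitude negative subnormal */
    { 0x00800000u, F32_NORMAL },     /* smallest positive normal */
    { 0x7F7FFFFFu, F32_NORMAL },     /* largest finite value */
    { 0x7F800000u, F32_INF },        /* +INF */
    { 0xFF800000u, F32_INF },        /* -INF */
    { 0x7FC00000u, F32_NAN },        /* quiet NaN */
    { 0x7F800001u, F32_NAN },        /* signaling NaN */
};

/* Assert that every table row classifies as expected. */
void f32_check_special_cases(void)
{
    size_t count = sizeof f32_special_cases / sizeof f32_special_cases[0];
    for (size_t i = 0; i < count; i++)
        assert(f32_classify_bits(f32_special_cases[i].bits)
               == f32_special_cases[i].expected);
}
```

Working on raw bit patterns keeps the table independent of how the host compiler or C library treats NaN payloads and signed zeros.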

Additionally, while an assembly language implementation is typically difficult enough by itself, the logf() and sincosf() functions embody somewhat novel algorithms (as far as I can tell).  For these, I proved correctness with an equivalent C implementation and exhaustive simulation on a PC.   The simulation compared the result for each argument with an equivalent double precision calculation using the standard C library where possible, and the ttmath library otherwise.  
 
    <https://www.ttmath.org/>
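
The kind of exhaustive comparison described above can be sketched in C as a sweep over binary32 bit patterns, measuring the distance in ulps between a function under test and a double precision reference rounded to float.  This is an assumed harness, not the one actually used: ulp_distance(), max_ulp_error(), and the demo_half_* stand-ins are hypothetical names, and the reference is taken to be accurate enough that its rounded value is the correct result.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

float f32_from_bits(uint32_t bits) { float f; memcpy(&f, &bits, sizeof f); return f; }
static uint32_t f32_to_bits(float f) { uint32_t b; memcpy(&b, &f, sizeof b); return b; }

static int f32_is_nan(float f) { return (f32_to_bits(f) & 0x7FFFFFFFu) > 0x7F800000u; }

/* Map a float to an integer that increases monotonically with its value;
 * +0 and -0 both map to 0. */
static int64_t f32_ordered_key(float f)
{
    uint32_t bits = f32_to_bits(f);
    int64_t magnitude = (int64_t)(bits & 0x7FFFFFFFu);
    return (bits & 0x80000000u) ? -magnitude : magnitude;
}

/* Number of representable floats between a and b (0 if both are NaN). */
int64_t ulp_distance(float a, float b)
{
    if (f32_is_nan(a) || f32_is_nan(b))
        return (f32_is_nan(a) && f32_is_nan(b)) ? 0 : INT64_MAX;
    int64_t d = f32_ordered_key(a) - f32_ordered_key(b);
    return d < 0 ? -d : d;
}

/* Sweep the bit patterns first, first+step, ... up to last and return the
 * worst error.  step = 1 over 0..0xFFFFFFFF is the exhaustive case. */
int64_t max_ulp_error(float (*under_test)(float), double (*reference)(double),
                      uint32_t first, uint32_t last, uint32_t step)
{
    int64_t worst = 0;
    assert(step != 0 && first <= last);
    for (uint64_t bits = first; bits <= last; bits += step) {
        float x = f32_from_bits((uint32_t)bits);
        float expected = (float)reference((double)x);
        int64_t err = ulp_distance(under_test(x), expected);
        if (err > worst)
            worst = err;
    }
    return worst;
}

/* Trivial stand-ins for a function under test and its reference. */
float  demo_half_f(float x)  { return 0.5f * x; }
double demo_half_d(double x) { return 0.5 * x; }
```

A call such as max_ulp_error(demo_half_f, demo_half_d, 0, 0xFFFFFFFFu, 1) visits all 2^32 arguments; a larger step gives a quick smoke test.  In a real run the stand-ins would be replaced by the C model of the routine and a higher-precision reference.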

>>  * Add everything into the base libgcc,
>>  * Add everything into libm (newlib?) and rely on link order to supersede libgcc,
>
> This will almost certainly break at some point, for someone, and be hard to even figure out it happened because the code will work but just be bigger or slower.
>
>> * Split the implementation with some magic to ensure that libm functions only link in the presence of the correct libgcc,
>
> I think this is the proper solution. It just puts better implementations in the place the infrastructure already supports having a target specific option.

There would be some difficult cases in splitting the library, and I haven't yet quantified all the costs.  One problem point might be tanf(), which relies on a routine shared with divsf3() to calculate the sin/cos ratio with >24 bits of precision.  Splitting the library would require exposing such internal routines, which don't naturally conform to any procedure call conventions.  Also, loss of control of linking order would require all short branches in the libm section to be replaced with long branches.  This particularly impacts the exception handling in almost every function.  
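
To make the branch-range concern concrete, here is a small C sketch of the reach of the 16-bit Thumb branch encodings available on ARMv6-M, assuming that "short branches" refers to the B and B<cond> instructions.  The helper names are hypothetical; the ranges follow from the 11-bit and 8-bit halfword immediates of those encodings.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Branch offsets are relative to the PC value seen by the instruction,
 * which is the address of the branch plus 4. */
int32_t thumb1_branch_offset(uint32_t branch_addr, uint32_t target_addr)
{
    return (int32_t)(target_addr - (branch_addr + 4u));
}

/* Unconditional B: 11-bit immediate, -2048 to +2046 bytes, even offsets. */
bool thumb1_b_in_range(int32_t offset)
{
    return (offset % 2 == 0) && offset >= -2048 && offset <= 2046;
}

/* Conditional B<cond>: 8-bit immediate, -256 to +254 bytes, even offsets. */
bool thumb1_bcond_in_range(int32_t offset)
{
    return (offset % 2 == 0) && offset >= -256 && offset <= 254;
}
```

A conditional branch from 0x08000100 back to 0x08000000 has an offset of -260 bytes, which is already out of reach for B<cond> but still within reach of an unconditional B.  Once the target can land anywhere in the image, neither 16-bit form is guaranteed to reach.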

>
>> * Establish an independent library specific to the Cortex M0 architecture, or
>
> This is likely to get you the smallest number of users.  People have to find it and then integrate it on their own. Don't make it hard for folks to find and use your work.

Agreed.  Plus, I don't have the resources or experience to be a long-term library maintainer.  It's not as if basic math functions require constant maintenance and updating.  The original Cortex M3 library by Nicolas Pitre has only seen a small handful of changes in the past decade.

>
>> * Something else entirely...
>>  
>>  If there is any interest in incorporating this work into GCC, please advise. 
>
> I think so but I am just one voice from the RTEMS community. But I think any M0 user would be pleased.
>
> --joel
>>
>> Thanks,
>>  Daniel Engel

Re: Cortex M0 Floating Point Library

Richard Henderson
On 11/7/18 6:10 PM, Daniel Engel wrote:
> Also, loss of control of linking order would require all short branches in the libm section to be replaced with long branches.  This particularly impacts the exception handling in almost every function.

You could partially remedy this by placing all the code into a unique section,
e.g. ".text.m0fp".  The default linker script would place all instances of this
section together.  Additional tricks can be played if we're willing to modify
the linker scripts further.
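
As a rough illustration of the named-section idea, here is how code can be assigned to ".text.m0fp" from C using the GCC/Clang section attribute on an ELF target.  The macro and function names are hypothetical; in assembly source the equivalent would be a .section directive, and whether a given linker script keeps such sections contiguous depends on that script.

```c
#include <assert.h>

/* Place a function in the ".text.m0fp" input section (GCC/Clang, ELF). */
#define M0FP_TEXT __attribute__((section(".text.m0fp")))

/* Two placeholder routines that would need to stay within branch range
 * of each other. */
M0FP_TEXT int m0fp_demo_add(int a, int b)
{
    return a + b;
}

M0FP_TEXT int m0fp_demo_twice(int a)
{
    return m0fp_demo_add(a, a);
}
```

Inspecting the object file with objdump -h should show a ".text.m0fp" section containing both functions.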


r~

Re: Cortex M0 Floating Point Library

Daniel Engel
Hi Richard,

I've only used custom linker scripts with my embedded work, so I don't know much about the GCC default.  

Presently, every library function is already in its own section to facilitate --gc-sections optimization.  However, I #include every file together to ensure that all of the sections are still built into a single object file.  So far the linker has done what I've expected (i.e., placed all of the used functions/sections in a contiguous block of memory and discarded the unused functions).  

Is the linker aware of section hierarchy, such that using a common section prefix (e.g. ".text.m0fp.*") would gather the appropriate sections together from multiple object files?  I didn't think it was so aware; rather, that such prefixes were just a convention to help organize custom linker scripts.  Adding such rules to the default linker script wouldn't be ideal, as everyone using a custom script might then have library breakage unless they knew to add equivalent rules.  

If the consensus is to split the library, it might help to add a set of intermediate branches (trampolines?) in the libm portion.  This would add execution cycles, but not require as many extra bytes.  

Regards,
Daniel


On Thu, Nov 8, 2018, at 11:19 PM, Richard Henderson wrote:

> On 11/7/18 6:10 PM, Daniel Engel wrote:
> > Also, loss of control of linking order would require all short branches in the libm section to be replaced with long branches.  This particularly impacts the exception handling in almost every function.
>
> You could partially remedy this by placing all the code into a unique section,
> e.g. ".text.m0fp".  The default linker script would place all instances of this
> section together.  Additional tricks can be played if we're willing to modify
> the linker scripts further.
>
>
> r~

Re: Cortex M0 Floating Point Library

Richard Henderson
On 11/9/18 9:58 PM, Daniel Engel wrote:
> Is the linker aware of section hierarchy, such that using a common section prefix (e.g. ".text.m0fp.*") would gather the appropriate sections together from multiple object files?

The linker script is not written like that.  But we could reasonably replace
the current ".text.*" with "SORT(.text.*)" with no ill effects (since the
current ordering is not guaranteed, no one should be depending on it).

> Adding such rules to the default linker script wouldn't be ideal, as everyone using a custom script might then have library breakage unless they knew to add equivalent rules.

*shrug* But what other solution?  At least the failure isn't silent -- the
branches will be out of range and the link will fail.  Anyone using their own
linker script must be willing to adjust to compiler changes over time, and in
this case the fix is trivial.

> If the consensus is to split the library, it might help to add a set of intermediate branches (trampolines?) in the libm portion.  This would add execution cycles, but not require as many extra bytes.

I don't think it's a good idea to put sin() into libgcc.  That really does
belong over in newlib.  Trampolines do sound like a reasonable solution.


r~