Re: [PATCH] Optimize to_chars

classic Classic list List threaded Threaded
4 messages Options
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH] Optimize to_chars

Jonathan Wakely-3
On 30/08/19 17:27 +0300, Antony Polukhin wrote:

>Bunch of micro optimizations for std::to_chars:
>* For base == 8 replacing the lookup in __digits table with arithmetic
>computations leads to a same CPU cycles for a loop (exchanges two
>movzx with 3 bit ops https://godbolt.org/z/RTui7m ). However this
>saves 129 bytes of data and totally avoids a chance of cache misses on
>__digits.
>* For base == 16 replacing the lookup in __digits table with
>arithmetic computations leads to a few additional instructions, but
>totally avoids a chance of cache misses on __digits (- ~9 cache misses
>for worst case) and saves 513 bytes of const data.
>* Replacing __first[pos] and __first[pos - 1] with __first[1] and
>__first[0] on final iterations saves ~2% of code size.
>* Removing trailing '\0' from arrays of digits allows the linker to
>merge the symbols (so that "0123456789abcdefghijklmnopqrstuvwxyz" and
>"0123456789abcdef" could share the same address). This improves data
>locality and reduces binary sizes.
>* Using __detail::__to_chars_len_2 instead of a generic
>__detail::__to_chars_len makes the operation O(1) instead of O(N). It
>also makes the code two times shorter ( https://godbolt.org/z/Peq_PG)
>.
>
>In sum: this significantly reduces the size of a binary (for about
>4KBs only for base-8 conversion https://godbolt.org/z/WPKijS ), deals
>with latency (CPU cache misses) without changing the iterations count
>and without adding costly instructions into the loops.

This is great, thanks.

Have you tried comparing the improved code to libc++'s implementation?
I believe they use precomputed arrays of digits, but they use larger
arrays that allow 4 bytes to be written at once, which is considerably
faster (and those precomputed arrays libe in libc++.so not in the
header). Would we be better off keeping the precomputed arrays and
expanding them to do 4-byte writes?

Since we don't have a patch to do that, I think I'll commit yours. We
can always go back to precomputed arrays later if somebody does that
work.

My only comments are on the changelog:

>Changelog:
>    * include/std/charconv (__detail::__to_chars_8,
>    __detail::__to_chars_16): Replace array of precomputed digits

When the list of changed functions is split across lines it should be
like this:

    * include/std/charconv (__detail::__to_chars_8)
    (__detail::__to_chars_16): Replace array of precomputed digits

i.e close the parentheses before the line break, and reopen on the
next line.

>    with arithmetic operations to avoid CPU cache misses. Remove
>    zero termination from array of digits to allow symbol merge with
>    generic implementation of __detail::__to_chars. Replace final
>    offsets with constants. Use __detail::__to_chars_len_2 instead
>    of a generic __detail::__to_chars_len.
>    * include/std/charconv (__detail::__to_chars): Remove

Don't repeat the asterisk and filename for later changes in the same
file, i.e.

    (__detail::__to_chars): Remove zero termination from array of digits.
    (__detail::__to_chars_2): Leading digit is always '1'.

There's no changelog entry for the changes to __to_chars_len_8 and
__to_chars_len_16.

Also, please don't include the ChangeLog diff in the patch, because it
just makes it hard to apply the patch (the ChangeLog part will almost
always fail to apply because somebody else will have modified the
ChangeLog file since you created the patch, and so that hunk won't
apply. The ChangeLog text should be sent as a separate (plain text)
attachment, or just in the email body (as you did above).

I'll test this and commit it, thanks!


Reply | Threaded
Open this post in threaded view
|

Re: [PATCH] Optimize to_chars

Jonathan Wakely-3
On 30/08/19 17:01 +0100, Jonathan Wakely wrote:

>On 30/08/19 17:27 +0300, Antony Polukhin wrote:
>>Bunch of micro optimizations for std::to_chars:
>>* For base == 8 replacing the lookup in __digits table with arithmetic
>>computations leads to a same CPU cycles for a loop (exchanges two
>>movzx with 3 bit ops https://godbolt.org/z/RTui7m ). However this
>>saves 129 bytes of data and totally avoids a chance of cache misses on
>>__digits.
>>* For base == 16 replacing the lookup in __digits table with
>>arithmetic computations leads to a few additional instructions, but
>>totally avoids a chance of cache misses on __digits (- ~9 cache misses
>>for worst case) and saves 513 bytes of const data.
>>* Replacing __first[pos] and __first[pos - 1] with __first[1] and
>>__first[0] on final iterations saves ~2% of code size.
>>* Removing trailing '\0' from arrays of digits allows the linker to
>>merge the symbols (so that "0123456789abcdefghijklmnopqrstuvwxyz" and
>>"0123456789abcdef" could share the same address). This improves data
>>locality and reduces binary sizes.
>>* Using __detail::__to_chars_len_2 instead of a generic
>>__detail::__to_chars_len makes the operation O(1) instead of O(N). It
>>also makes the code two times shorter ( https://godbolt.org/z/Peq_PG)
>>.
>>
>>In sum: this significantly reduces the size of a binary (for about
>>4KBs only for base-8 conversion https://godbolt.org/z/WPKijS ), deals
>>with latency (CPU cache misses) without changing the iterations count
>>and without adding costly instructions into the loops.
>
>This is great, thanks.
>
>Have you tried comparing the improved code to libc++'s implementation?
>I believe they use precomputed arrays of digits, but they use larger
>arrays that allow 4 bytes to be written at once, which is considerably
>faster (and those precomputed arrays libe in libc++.so not in the
>header). Would we be better off keeping the precomputed arrays and
>expanding them to do 4-byte writes?
>
>Since we don't have a patch to do that, I think I'll commit yours. We
>can always go back to precomputed arrays later if somebody does that
>work.
>
>My only comments are on the changelog:
>
>>Changelog:
>>   * include/std/charconv (__detail::__to_chars_8,
>>   __detail::__to_chars_16): Replace array of precomputed digits
>
>When the list of changed functions is split across lines it should be
>like this:
>
>   * include/std/charconv (__detail::__to_chars_8)
>   (__detail::__to_chars_16): Replace array of precomputed digits
>
>i.e close the parentheses before the line break, and reopen on the
>next line.
>
>>   with arithmetic operations to avoid CPU cache misses. Remove
>>   zero termination from array of digits to allow symbol merge with
>>   generic implementation of __detail::__to_chars. Replace final
>>   offsets with constants. Use __detail::__to_chars_len_2 instead
>>   of a generic __detail::__to_chars_len.
>>   * include/std/charconv (__detail::__to_chars): Remove
>
>Don't repeat the asterisk and filename for later changes in the same
>file, i.e.
>
>   (__detail::__to_chars): Remove zero termination from array of digits.
>   (__detail::__to_chars_2): Leading digit is always '1'.
>
>There's no changelog entry for the changes to __to_chars_len_8 and
>__to_chars_len_16.

Oh, there's no separate __to_chars_len_16 function anyway, and you did
mention it as "Use __detail::__to_chars_len_2 instead ..." - sorry!

I think we might as well inline __to_chars_len_8 into __to_chars_8,
there's not much benefit to having a separate function. I'll do that
as a follow up patch.



Reply | Threaded
Open this post in threaded view
|

Re: [PATCH] Optimize to_chars

Jonathan Wakely-3
On 30/08/19 17:08 +0100, Jonathan Wakely wrote:

>On 30/08/19 17:01 +0100, Jonathan Wakely wrote:
>>On 30/08/19 17:27 +0300, Antony Polukhin wrote:
>>>Bunch of micro optimizations for std::to_chars:
>>>* For base == 8 replacing the lookup in __digits table with arithmetic
>>>computations leads to a same CPU cycles for a loop (exchanges two
>>>movzx with 3 bit ops https://godbolt.org/z/RTui7m ). However this
>>>saves 129 bytes of data and totally avoids a chance of cache misses on
>>>__digits.
>>>* For base == 16 replacing the lookup in __digits table with
>>>arithmetic computations leads to a few additional instructions, but
>>>totally avoids a chance of cache misses on __digits (- ~9 cache misses
>>>for worst case) and saves 513 bytes of const data.
>>>* Replacing __first[pos] and __first[pos - 1] with __first[1] and
>>>__first[0] on final iterations saves ~2% of code size.
>>>* Removing trailing '\0' from arrays of digits allows the linker to
>>>merge the symbols (so that "0123456789abcdefghijklmnopqrstuvwxyz" and
>>>"0123456789abcdef" could share the same address). This improves data
>>>locality and reduces binary sizes.
>>>* Using __detail::__to_chars_len_2 instead of a generic
>>>__detail::__to_chars_len makes the operation O(1) instead of O(N). It
>>>also makes the code two times shorter ( https://godbolt.org/z/Peq_PG)
>>>.
>>>
>>>In sum: this significantly reduces the size of a binary (for about
>>>4KBs only for base-8 conversion https://godbolt.org/z/WPKijS ), deals
>>>with latency (CPU cache misses) without changing the iterations count
>>>and without adding costly instructions into the loops.
>>
>>This is great, thanks.
>>
>>Have you tried comparing the improved code to libc++'s implementation?
>>I believe they use precomputed arrays of digits, but they use larger
>>arrays that allow 4 bytes to be written at once, which is considerably
>>faster (and those precomputed arrays libe in libc++.so not in the
>>header). Would we be better off keeping the precomputed arrays and
>>expanding them to do 4-byte writes?
>>
>>Since we don't have a patch to do that, I think I'll commit yours. We
>>can always go back to precomputed arrays later if somebody does that
>>work.
>>
>>My only comments are on the changelog:
>>
>>>Changelog:
>>>  * include/std/charconv (__detail::__to_chars_8,
>>>  __detail::__to_chars_16): Replace array of precomputed digits
>>
>>When the list of changed functions is split across lines it should be
>>like this:
>>
>>  * include/std/charconv (__detail::__to_chars_8)
>>  (__detail::__to_chars_16): Replace array of precomputed digits
>>
>>i.e close the parentheses before the line break, and reopen on the
>>next line.
>>
>>>  with arithmetic operations to avoid CPU cache misses. Remove
>>>  zero termination from array of digits to allow symbol merge with
>>>  generic implementation of __detail::__to_chars. Replace final
>>>  offsets with constants. Use __detail::__to_chars_len_2 instead
>>>  of a generic __detail::__to_chars_len.
>>>  * include/std/charconv (__detail::__to_chars): Remove
>>
>>Don't repeat the asterisk and filename for later changes in the same
>>file, i.e.
>>
>>  (__detail::__to_chars): Remove zero termination from array of digits.
>>  (__detail::__to_chars_2): Leading digit is always '1'.
>>
>>There's no changelog entry for the changes to __to_chars_len_8 and
>>__to_chars_len_16.
>
>Oh, there's no separate __to_chars_len_16 function anyway, and you did
>mention it as "Use __detail::__to_chars_len_2 instead ..." - sorry!
>
>I think we might as well inline __to_chars_len_8 into __to_chars_8,
>there's not much benefit to having a separate function. I'll do that
>as a follow up patch.
I've committed this patch to simplify the code a bit.

Tested x86_64-linux, committed to trunk.


patch.txt (3K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH] Optimize to_chars

Jonathan Wakely-3
On 08/09/19 16:44 +0300, Antony Polukhin wrote:

>We've already beaten this topic to death, so let's put a final nail in
>the coffin:
>
>
>__to_chars_10_impl is quite fast. According to the IACA the main loop
>takes only 6.0 cycles, the whole function with one iteration takes
>10.0 cycles. Replacing the __first[pos] and __first[pos - 1] with
>__first[0] and __first[1] drops the function time to 7.53 cycles.
>
>Changelog:
>
>2019-09-08  Antony Polukhin  <[hidden email]>
>
>    * include/bits/charconv.h (__detail::__to_chars_10_impl): Replace
>    final offsets with constants.

Excellent, thanks for the patch and all the benchmarking!

I've committed this to trunk now.