Re: [C++ PATCH] Speed up inplace_merge algorithm & fix inefficient logic(PR c++/83938)

classic Classic list List threaded Threaded
5 messages Options
Reply | Threaded
Open this post in threaded view
|

Re: [C++ PATCH] Speed up inplace_merge algorithm & fix inefficient logic(PR c++/83938)

Jason Merrill
This is a libstdc++ bug and patch, not the C++ front end.  So I'm
adding the libstdc++ list to CC.

On Fri, Jan 19, 2018 at 3:02 AM, chang jc <[hidden email]> wrote:

> Current std::inplace_merge() suffers from performance issue by inefficient
> logic under limited memory,
>
> It leads to performance downgrade.
>
> Please help to review it.
>
> Index: include/bits/stl_algo.h
> ===================================================================
> --- include/bits/stl_algo.h     (revision 256871)
> +++ include/bits/stl_algo.h     (working copy)
> @@ -2437,7 +2437,7 @@
>           _BidirectionalIterator __second_cut = __middle;
>           _Distance __len11 = 0;
>           _Distance __len22 = 0;
> -         if (__len1 > __len2)
> +         if (__len1 < __len2)
>             {
>               __len11 = __len1 / 2;
>               std::advance(__first_cut, __len11);
> @@ -2539,9 +2539,15 @@
>        const _DistanceType __len1 = std::distance(__first, __middle);
>        const _DistanceType __len2 = std::distance(__middle, __last);
>
> +
>        typedef _Temporary_buffer<_BidirectionalIterator, _ValueType> _TmpBuf;
> -      _TmpBuf __buf(__first, __last);
> -
> +      _BidirectionalIterator __start, __end;
> +      if (__len1 < __len2) {
> +       __start = __first; __end = __middle;
> +      } else {
> +       __start = __middle; __end = __last;
> +      }
> +      _TmpBuf __buf(__start, ___end);
>        if (__buf.begin() == 0)
>         std::__merge_without_buffer
>           (__first, __middle, __last, __len1, __len2, __comp);
> Index: include/bits/stl_tempbuf.h
> ===================================================================
> --- include/bits/stl_tempbuf.h  (revision 256871)
> +++ include/bits/stl_tempbuf.h  (working copy)
> @@ -95,7 +95,7 @@
>                                                         std::nothrow));
>           if (__tmp != 0)
>             return std::pair<_Tp*, ptrdiff_t>(__tmp, __len);
> -         __len /= 2;
> +         __len = (__len + 1) / 2;
>         }
>        return std::pair<_Tp*, ptrdiff_t>(static_cast<_Tp*>(0), 0);
>      }
>
>
>
>
> Thanks
Reply | Threaded
Open this post in threaded view
|

Re: [C++ PATCH] Speed up inplace_merge algorithm & fix inefficient logic(PR c++/83938)

chang jc
Hi:

1. The __len = (__len + 1) / 2; is as you suggested, need to modify as
__len = (__len == 1) ? 0 : ((__len + 1) / 2);

2. The coding gain has been shown  PR c++/83938; I re-post here




  21
  20
  19
  18
  17
  16


  0.471136
  0.625695
  0.767262
  0.907461
  1.04838
  1.19508


  0.340845
  0.48651
  0.639139
  0.770133
  0.898454
  1.04632

it means when Merge [0, 4325376, 16777216); A is a sorted integer with
4325376 & B with 12451840 elements, total with 16M integers

The proposed method has the speed up under given buffer size =, ex
2^16, 2^17, ... 2^21 in unit of sizeof(int), for example, 2^16 means
given sizof(int)*64K bytes.

3. As your suggestion, _TmpBuf __buf should be rewrite.

4. It represents a fact that the intuitive idea to split from larger
part is wrong.

For example, if you have an input sorted array A & B, A has 8 integers
& B has 24 integers. Given tmp buffer whose capacity = 4 integers.

Current it tries to split from B, right?

Then we have:

A1 | A2  B1 | B2

B1 & B2 has 12 integers each, right?

Current algorithm selects pivot as 13th integer from B, right?

If the corresponding upper bound of A is 6th integer.

Then it split in

A1 = 5 | A2 = 3 | B1 = 12 | B2 = 12

After rotate, we have two arrays to merge

[A1 = 5 | B1 = 12]  & [A2 = 3 | B2 = 12]

Great, [A2 = 3 | B2 = 12] can use tmp buffer to merge.

Sadly, [A1 = 5 | B1 = 12] CANNOT.

So we do rotate again, split & merge the two split arrays from [A1 = 5
| B1 = 12] again.


But wait, if we always split from the smaller one instead of larger one.

After rotate, it promises two split arrays both contain ceiling[small/2].

Since tmp buffer also allocate by starting from sizeof(small) &
recursively downgrade by ceiling[small/2^(# of fail allocate)].

It means the allocated tmp buffer promises to be sufficient at the
level of (# of fail allocate).

Instead, you can see if split from large at level (# of fail allocate)
several split array still CANNOT use  tmp buffer to do buffered merge.


As you know, buffered merge is far speed then (split, rotate, and
merge two sub-arrays) (PR c++/83938 gives the profiling figures),

the way should provide speedup.


Thanks.










On 24/01/2018 18:23, François Dumont wrote:

Hi


    It sounds like a very sensitive change to make but nothing worth figures.
Do you have any bench showing the importance of the gain ?

    At least the memory usage optimization is obvious.

On 19/01/2018 10:43, chang jc wrote:

Current std::inplace_merge() suffers from performance issue by inefficient

logic under limited memory,

It leads to performance downgrade.

Please help to review it.

Index: include/bits/stl_algo.h
===================================================================
--- include/bits/stl_algo.h    (revision 256871)
+++ include/bits/stl_algo.h    (working copy)
@@ -2437,7 +2437,7 @@
        _BidirectionalIterator __second_cut = __middle;
        _Distance __len11 = 0;
        _Distance __len22 = 0;
-      if (__len1 > __len2)
+      if (__len1 < __len2)
          {
            __len11 = __len1 / 2;
            std::advance(__first_cut, __len11);
@@ -2539,9 +2539,15 @@
        const _DistanceType __len1 = std::distance(__first, __middle);
        const _DistanceType __len2 = std::distance(__middle, __last);

+

        typedef _Temporary_buffer<_BidirectionalIterator, _ValueType>
_TmpBuf;

-      _TmpBuf __buf(__first, __last);
-
+      _BidirectionalIterator __start, __end;
+      if (__len1 < __len2) {
+    __start = __first; __end = __middle;
+      } else {
+    __start = __middle; __end = __last;
+      }
+      _TmpBuf __buf(__start, ___end);

Note another optimization, we could introduce a _Temporary_buffer<> constructor
in order to write:

_TmpBuf __buf(std::min(__len1, __len2), __first);

So that std::distance do not need to be called again.

        if (__buf.begin() == 0)
      std::__merge_without_buffer
        (__first, __middle, __last, __len1, __len2, __comp);
Index: include/bits/stl_tempbuf.h
===================================================================
--- include/bits/stl_tempbuf.h    (revision 256871)
+++ include/bits/stl_tempbuf.h    (working copy)
@@ -95,7 +95,7 @@
                              std::nothrow));
        if (__tmp != 0)
          return std::pair<_Tp*, ptrdiff_t>(__tmp, __len);
-      __len /= 2;
+      __len = (__len + 1) / 2;

This part is problematic, if __len is 1 and allocation fails it will loop
forever.

It doesn't seem really necessary for your patch.


2018-01-20 4:05 GMT+08:00 Jason Merrill <[hidden email]>:

> This is a libstdc++ bug and patch, not the C++ front end.  So I'm
> adding the libstdc++ list to CC.
>
> On Fri, Jan 19, 2018 at 3:02 AM, chang jc <[hidden email]> wrote:
> > Current std::inplace_merge() suffers from performance issue by
> inefficient
> > logic under limited memory,
> >
> > It leads to performance downgrade.
> >
> > Please help to review it.
> >
> > Index: include/bits/stl_algo.h
> > ===================================================================
> > --- include/bits/stl_algo.h     (revision 256871)
> > +++ include/bits/stl_algo.h     (working copy)
> > @@ -2437,7 +2437,7 @@
> >           _BidirectionalIterator __second_cut = __middle;
> >           _Distance __len11 = 0;
> >           _Distance __len22 = 0;
> > -         if (__len1 > __len2)
> > +         if (__len1 < __len2)
> >             {
> >               __len11 = __len1 / 2;
> >               std::advance(__first_cut, __len11);
> > @@ -2539,9 +2539,15 @@
> >        const _DistanceType __len1 = std::distance(__first, __middle);
> >        const _DistanceType __len2 = std::distance(__middle, __last);
> >
> > +
> >        typedef _Temporary_buffer<_BidirectionalIterator, _ValueType>
> _TmpBuf;
> > -      _TmpBuf __buf(__first, __last);
> > -
> > +      _BidirectionalIterator __start, __end;
> > +      if (__len1 < __len2) {
> > +       __start = __first; __end = __middle;
> > +      } else {
> > +       __start = __middle; __end = __last;
> > +      }
> > +      _TmpBuf __buf(__start, ___end);
> >        if (__buf.begin() == 0)
> >         std::__merge_without_buffer
> >           (__first, __middle, __last, __len1, __len2, __comp);
> > Index: include/bits/stl_tempbuf.h
> > ===================================================================
> > --- include/bits/stl_tempbuf.h  (revision 256871)
> > +++ include/bits/stl_tempbuf.h  (working copy)
> > @@ -95,7 +95,7 @@
> >                                                         std::nothrow));
> >           if (__tmp != 0)
> >             return std::pair<_Tp*, ptrdiff_t>(__tmp, __len);
> > -         __len /= 2;
> > +         __len = (__len + 1) / 2;
> >         }
> >        return std::pair<_Tp*, ptrdiff_t>(static_cast<_Tp*>(0), 0);
> >      }
> >
> >
> >
> >
> > Thanks
>
Reply | Threaded
Open this post in threaded view
|

Re: [C++ PATCH] Speed up inplace_merge algorithm & fix inefficient logic(PR c++/83938)

François Dumont-2
Hi

     If I read you well the main advantage of your approach is that it
limits the number of calls to __rotate_adaptive. So did you try to use
the __buffer_size as the __len11 size when buffer is too small rather
than __len1 / 2 ? That is to say:

       if (__len1 < __len2)
         {
-          __len11 = __len1 / 2;
+          __len11 = __buffer_size;

     I wonder what your bench is showing with this approach.

      Maybe you can post your bench on bugzilla too, we could use it to
create a performance test.

     I also wonder why when heap allocation fails we don't use the stack
? Is that forbidden in our algo implementations ?

François


On 25/01/2018 23:37, chang jc wrote:

> Hi:
>
> 1. The __len = (__len + 1) / 2; is as you suggested, need to modify as
> __len = (__len == 1) ? 0 : ((__len + 1) / 2);
>
> 2. The coding gain has been shown  PR c++/83938; I re-post here
>
>
>
>
>    21
>    20
>    19
>    18
>    17
>    16
>
>
>    0.471136
>    0.625695
>    0.767262
>    0.907461
>    1.04838
>    1.19508
>
>
>    0.340845
>    0.48651
>    0.639139
>    0.770133
>    0.898454
>    1.04632
>
> it means when Merge [0, 4325376, 16777216); A is a sorted integer with
> 4325376 & B with 12451840 elements, total with 16M integers
>
> The proposed method has the speed up under given buffer size =, ex
> 2^16, 2^17, ... 2^21 in unit of sizeof(int), for example, 2^16 means
> given sizof(int)*64K bytes.
>
> 3. As your suggestion, _TmpBuf __buf should be rewrite.
>
> 4. It represents a fact that the intuitive idea to split from larger
> part is wrong.
>
> For example, if you have an input sorted array A & B, A has 8 integers
> & B has 24 integers. Given tmp buffer whose capacity = 4 integers.
>
> Current it tries to split from B, right?
>
> Then we have:
>
> A1 | A2  B1 | B2
>
> B1 & B2 has 12 integers each, right?
>
> Current algorithm selects pivot as 13th integer from B, right?
>
> If the corresponding upper bound of A is 6th integer.
>
> Then it split in
>
> A1 = 5 | A2 = 3 | B1 = 12 | B2 = 12
>
> After rotate, we have two arrays to merge
>
> [A1 = 5 | B1 = 12]  & [A2 = 3 | B2 = 12]
>
> Great, [A2 = 3 | B2 = 12] can use tmp buffer to merge.
>
> Sadly, [A1 = 5 | B1 = 12] CANNOT.
>
> So we do rotate again, split & merge the two split arrays from [A1 = 5
> | B1 = 12] again.
>
>
> But wait, if we always split from the smaller one instead of larger one.
>
> After rotate, it promises two split arrays both contain ceiling[small/2].
>
> Since tmp buffer also allocate by starting from sizeof(small) &
> recursively downgrade by ceiling[small/2^(# of fail allocate)].
>
> It means the allocated tmp buffer promises to be sufficient at the
> level of (# of fail allocate).
>
> Instead, you can see if split from large at level (# of fail allocate)
> several split array still CANNOT use  tmp buffer to do buffered merge.
>
>
> As you know, buffered merge is far speed then (split, rotate, and
> merge two sub-arrays) (PR c++/83938 gives the profiling figures),
>
> the way should provide speedup.
>
>
> Thanks.
>
>
>
>
>
>
>
>
>
>
> On 24/01/2018 18:23, François Dumont wrote:
>
> Hi
>
>
>      It sounds like a very sensitive change to make but nothing worth figures.
> Do you have any bench showing the importance of the gain ?
>
>      At least the memory usage optimization is obvious.
>
> On 19/01/2018 10:43, chang jc wrote:
>
> Current std::inplace_merge() suffers from performance issue by inefficient
>
> logic under limited memory,
>
> It leads to performance downgrade.
>
> Please help to review it.
>
> Index: include/bits/stl_algo.h
> ===================================================================
> --- include/bits/stl_algo.h    (revision 256871)
> +++ include/bits/stl_algo.h    (working copy)
> @@ -2437,7 +2437,7 @@
>          _BidirectionalIterator __second_cut = __middle;
>          _Distance __len11 = 0;
>          _Distance __len22 = 0;
> -      if (__len1 > __len2)
> +      if (__len1 < __len2)
>            {
>              __len11 = __len1 / 2;
>              std::advance(__first_cut, __len11);
> @@ -2539,9 +2539,15 @@
>          const _DistanceType __len1 = std::distance(__first, __middle);
>          const _DistanceType __len2 = std::distance(__middle, __last);
>
> +
>
>          typedef _Temporary_buffer<_BidirectionalIterator, _ValueType>
> _TmpBuf;
>
> -      _TmpBuf __buf(__first, __last);
> -
> +      _BidirectionalIterator __start, __end;
> +      if (__len1 < __len2) {
> +    __start = __first; __end = __middle;
> +      } else {
> +    __start = __middle; __end = __last;
> +      }
> +      _TmpBuf __buf(__start, ___end);
>
> Note another optimization, we could introduce a _Temporary_buffer<> constructor
> in order to write:
>
> _TmpBuf __buf(std::min(__len1, __len2), __first);
>
> So that std::distance do not need to be called again.
>
>          if (__buf.begin() == 0)
>        std::__merge_without_buffer
>          (__first, __middle, __last, __len1, __len2, __comp);
> Index: include/bits/stl_tempbuf.h
> ===================================================================
> --- include/bits/stl_tempbuf.h    (revision 256871)
> +++ include/bits/stl_tempbuf.h    (working copy)
> @@ -95,7 +95,7 @@
>                                std::nothrow));
>          if (__tmp != 0)
>            return std::pair<_Tp*, ptrdiff_t>(__tmp, __len);
> -      __len /= 2;
> +      __len = (__len + 1) / 2;
>
> This part is problematic, if __len is 1 and allocation fails it will loop
> forever.
>
> It doesn't seem really necessary for your patch.
>
>
> 2018-01-20 4:05 GMT+08:00 Jason Merrill <[hidden email]>:
>
>> This is a libstdc++ bug and patch, not the C++ front end.  So I'm
>> adding the libstdc++ list to CC.
>>
>> On Fri, Jan 19, 2018 at 3:02 AM, chang jc <[hidden email]> wrote:
>>> Current std::inplace_merge() suffers from performance issue by
>> inefficient
>>> logic under limited memory,
>>>
>>> It leads to performance downgrade.
>>>
>>> Please help to review it.
>>>
>>> Index: include/bits/stl_algo.h
>>> ===================================================================
>>> --- include/bits/stl_algo.h     (revision 256871)
>>> +++ include/bits/stl_algo.h     (working copy)
>>> @@ -2437,7 +2437,7 @@
>>>            _BidirectionalIterator __second_cut = __middle;
>>>            _Distance __len11 = 0;
>>>            _Distance __len22 = 0;
>>> -         if (__len1 > __len2)
>>> +         if (__len1 < __len2)
>>>              {
>>>                __len11 = __len1 / 2;
>>>                std::advance(__first_cut, __len11);
>>> @@ -2539,9 +2539,15 @@
>>>         const _DistanceType __len1 = std::distance(__first, __middle);
>>>         const _DistanceType __len2 = std::distance(__middle, __last);
>>>
>>> +
>>>         typedef _Temporary_buffer<_BidirectionalIterator, _ValueType>
>> _TmpBuf;
>>> -      _TmpBuf __buf(__first, __last);
>>> -
>>> +      _BidirectionalIterator __start, __end;
>>> +      if (__len1 < __len2) {
>>> +       __start = __first; __end = __middle;
>>> +      } else {
>>> +       __start = __middle; __end = __last;
>>> +      }
>>> +      _TmpBuf __buf(__start, ___end);
>>>         if (__buf.begin() == 0)
>>>          std::__merge_without_buffer
>>>            (__first, __middle, __last, __len1, __len2, __comp);
>>> Index: include/bits/stl_tempbuf.h
>>> ===================================================================
>>> --- include/bits/stl_tempbuf.h  (revision 256871)
>>> +++ include/bits/stl_tempbuf.h  (working copy)
>>> @@ -95,7 +95,7 @@
>>>                                                          std::nothrow));
>>>            if (__tmp != 0)
>>>              return std::pair<_Tp*, ptrdiff_t>(__tmp, __len);
>>> -         __len /= 2;
>>> +         __len = (__len + 1) / 2;
>>>          }
>>>         return std::pair<_Tp*, ptrdiff_t>(static_cast<_Tp*>(0), 0);
>>>       }
>>>
>>>
>>>
>>>
>>> Thanks


Reply | Threaded
Open this post in threaded view
|

Re: [C++ PATCH] Speed up inplace_merge algorithm & fix inefficient logic(PR libstdc++/83938)

François Dumont-2
In reply to this post by chang jc
I missed a test that was failing because of this patch. So I revert a
small part of it and here is the new proposal.

Tested under Linux x86_64, ok to commit ?

François


On 24/07/2018 12:22, François Dumont wrote:

> Hi
>
>     Any chance to review this patch ?
>
> François
>
>
> On 06/06/2018 18:39, François Dumont wrote:
>> Hi
>>
>>     I review and rework this proposal. I noticed that the same idea
>> to limit buffer size within inplace_merge also apply to stable_sort.
>>
>>     I also change the decision when buffer is too small to consider
>> the buffer size rather than going through successive cuts of the
>> original ranges. This way we won't cut the range more than necessary.
>> Note that if you prefer this second part could be committed seperatly.
>>
>>     PR libstdc++/83938
>>     * include/bits/stl_algo.h:
>>     (__stable_partition_adaptive): When buffer to small cut range using
>>     buffer size.
>>     (__inplace_merge): Take temporary buffer length from smallest range.
>>     (__merge_adaptive): When buffer too small consider smallest range
>> first
>>     and cut based on buffer size.
>>     (__stable_sort_adaptive): When buffer too small cut based on buffer
>>     size.
>>     (__stable_sort): Limit temporary buffer length.
>>     * include/bits/stl_tempbuf.h (get_temporary_buffer): Change function
>>     to reduce requested buffer length on allocation failure.
>>
>> Tested under Linux x86_64.
>>
>> Ok to commit ?
>>
>> François
>>
>>
>> On 25/01/2018 23:37, chang jc wrote:
>>> Hi:
>>>
>>> 1. The __len = (__len + 1) / 2; is as you suggested, need to modify as
>>> __len = (__len == 1) ? 0 : ((__len + 1) / 2);
>>>
>>> 2. The coding gain has been shown  PR c++/83938; I re-post here
>>>
>>>
>>>
>>>
>>>    21
>>>    20
>>>    19
>>>    18
>>>    17
>>>    16
>>>
>>>
>>>    0.471136
>>>    0.625695
>>>    0.767262
>>>    0.907461
>>>    1.04838
>>>    1.19508
>>>
>>>
>>>    0.340845
>>>    0.48651
>>>    0.639139
>>>    0.770133
>>>    0.898454
>>>    1.04632
>>>
>>> it means when Merge [0, 4325376, 16777216); A is a sorted integer with
>>> 4325376 & B with 12451840 elements, total with 16M integers
>>>
>>> The proposed method has the speed up under given buffer size =, ex
>>> 2^16, 2^17, ... 2^21 in unit of sizeof(int), for example, 2^16 means
>>> given sizof(int)*64K bytes.
>>>
>>> 3. As your suggestion, _TmpBuf __buf should be rewrite.
>>>
>>> 4. It represents a fact that the intuitive idea to split from larger
>>> part is wrong.
>>>
>>> For example, if you have an input sorted array A & B, A has 8 integers
>>> & B has 24 integers. Given tmp buffer whose capacity = 4 integers.
>>>
>>> Current it tries to split from B, right?
>>>
>>> Then we have:
>>>
>>> A1 | A2  B1 | B2
>>>
>>> B1 & B2 has 12 integers each, right?
>>>
>>> Current algorithm selects pivot as 13th integer from B, right?
>>>
>>> If the corresponding upper bound of A is 6th integer.
>>>
>>> Then it split in
>>>
>>> A1 = 5 | A2 = 3 | B1 = 12 | B2 = 12
>>>
>>> After rotate, we have two arrays to merge
>>>
>>> [A1 = 5 | B1 = 12]  & [A2 = 3 | B2 = 12]
>>>
>>> Great, [A2 = 3 | B2 = 12] can use tmp buffer to merge.
>>>
>>> Sadly, [A1 = 5 | B1 = 12] CANNOT.
>>>
>>> So we do rotate again, split & merge the two split arrays from [A1 = 5
>>> | B1 = 12] again.
>>>
>>>
>>> But wait, if we always split from the smaller one instead of larger
>>> one.
>>>
>>> After rotate, it promises two split arrays both contain
>>> ceiling[small/2].
>>>
>>> Since tmp buffer also allocate by starting from sizeof(small) &
>>> recursively downgrade by ceiling[small/2^(# of fail allocate)].
>>>
>>> It means the allocated tmp buffer promises to be sufficient at the
>>> level of (# of fail allocate).
>>>
>>> Instead, you can see if split from large at level (# of fail allocate)
>>> several split array still CANNOT use  tmp buffer to do buffered merge.
>>>
>>>
>>> As you know, buffered merge is far speed then (split, rotate, and
>>> merge two sub-arrays) (PR c++/83938 gives the profiling figures),
>>>
>>> the way should provide speedup.
>>>
>>>
>>> Thanks.
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> On 24/01/2018 18:23, François Dumont wrote:
>>>
>>> Hi
>>>
>>>
>>>      It sounds like a very sensitive change to make but nothing
>>> worth figures.
>>> Do you have any bench showing the importance of the gain ?
>>>
>>>      At least the memory usage optimization is obvious.
>>>
>>> On 19/01/2018 10:43, chang jc wrote:
>>>
>>> Current std::inplace_merge() suffers from performance issue by
>>> inefficient
>>>
>>> logic under limited memory,
>>>
>>> It leads to performance downgrade.
>>>
>>> Please help to review it.
>>>
>>> Index: include/bits/stl_algo.h
>>> ===================================================================
>>> --- include/bits/stl_algo.h    (revision 256871)
>>> +++ include/bits/stl_algo.h    (working copy)
>>> @@ -2437,7 +2437,7 @@
>>>          _BidirectionalIterator __second_cut = __middle;
>>>          _Distance __len11 = 0;
>>>          _Distance __len22 = 0;
>>> -      if (__len1 > __len2)
>>> +      if (__len1 < __len2)
>>>            {
>>>              __len11 = __len1 / 2;
>>>              std::advance(__first_cut, __len11);
>>> @@ -2539,9 +2539,15 @@
>>>          const _DistanceType __len1 = std::distance(__first, __middle);
>>>          const _DistanceType __len2 = std::distance(__middle, __last);
>>>
>>> +
>>>
>>>          typedef _Temporary_buffer<_BidirectionalIterator, _ValueType>
>>> _TmpBuf;
>>>
>>> -      _TmpBuf __buf(__first, __last);
>>> -
>>> +      _BidirectionalIterator __start, __end;
>>> +      if (__len1 < __len2) {
>>> +    __start = __first; __end = __middle;
>>> +      } else {
>>> +    __start = __middle; __end = __last;
>>> +      }
>>> +      _TmpBuf __buf(__start, ___end);
>>>
>>> Note another optimization, we could introduce a _Temporary_buffer<>
>>> constructor
>>> in order to write:
>>>
>>> _TmpBuf __buf(std::min(__len1, __len2), __first);
>>>
>>> So that std::distance do not need to be called again.
>>>
>>>          if (__buf.begin() == 0)
>>>        std::__merge_without_buffer
>>>          (__first, __middle, __last, __len1, __len2, __comp);
>>> Index: include/bits/stl_tempbuf.h
>>> ===================================================================
>>> --- include/bits/stl_tempbuf.h    (revision 256871)
>>> +++ include/bits/stl_tempbuf.h    (working copy)
>>> @@ -95,7 +95,7 @@
>>>                                std::nothrow));
>>>          if (__tmp != 0)
>>>            return std::pair<_Tp*, ptrdiff_t>(__tmp, __len);
>>> -      __len /= 2;
>>> +      __len = (__len + 1) / 2;
>>>
>>> This part is problematic, if __len is 1 and allocation fails it will
>>> loop
>>> forever.
>>>
>>> It doesn't seem really necessary for your patch.
>>>
>>>
>>> 2018-01-20 4:05 GMT+08:00 Jason Merrill<[hidden email]>:
>>>
>>>> This is a libstdc++ bug and patch, not the C++ front end.  So I'm
>>>> adding the libstdc++ list to CC.
>>>>
>>>> On Fri, Jan 19, 2018 at 3:02 AM, chang jc<[hidden email]>  wrote:
>>>>> Current std::inplace_merge() suffers from performance issue by
>>>> inefficient
>>>>> logic under limited memory,
>>>>>
>>>>> It leads to performance downgrade.
>>>>>
>>>>> Please help to review it.
>>>>>
>>>>> Index: include/bits/stl_algo.h
>>>>> ===================================================================
>>>>> --- include/bits/stl_algo.h     (revision 256871)
>>>>> +++ include/bits/stl_algo.h     (working copy)
>>>>> @@ -2437,7 +2437,7 @@
>>>>>            _BidirectionalIterator __second_cut = __middle;
>>>>>            _Distance __len11 = 0;
>>>>>            _Distance __len22 = 0;
>>>>> -         if (__len1 > __len2)
>>>>> +         if (__len1 < __len2)
>>>>>              {
>>>>>                __len11 = __len1 / 2;
>>>>>                std::advance(__first_cut, __len11);
>>>>> @@ -2539,9 +2539,15 @@
>>>>>         const _DistanceType __len1 = std::distance(__first,
>>>>> __middle);
>>>>>         const _DistanceType __len2 = std::distance(__middle, __last);
>>>>>
>>>>> +
>>>>>         typedef _Temporary_buffer<_BidirectionalIterator, _ValueType>
>>>> _TmpBuf;
>>>>> -      _TmpBuf __buf(__first, __last);
>>>>> -
>>>>> +      _BidirectionalIterator __start, __end;
>>>>> +      if (__len1 < __len2) {
>>>>> +       __start = __first; __end = __middle;
>>>>> +      } else {
>>>>> +       __start = __middle; __end = __last;
>>>>> +      }
>>>>> +      _TmpBuf __buf(__start, ___end);
>>>>>         if (__buf.begin() == 0)
>>>>>          std::__merge_without_buffer
>>>>>            (__first, __middle, __last, __len1, __len2, __comp);
>>>>> Index: include/bits/stl_tempbuf.h
>>>>> ===================================================================
>>>>> --- include/bits/stl_tempbuf.h  (revision 256871)
>>>>> +++ include/bits/stl_tempbuf.h  (working copy)
>>>>> @@ -95,7 +95,7 @@
>>>>> std::nothrow));
>>>>>            if (__tmp != 0)
>>>>>              return std::pair<_Tp*, ptrdiff_t>(__tmp, __len);
>>>>> -         __len /= 2;
>>>>> +         __len = (__len + 1) / 2;
>>>>>          }
>>>>>         return std::pair<_Tp*, ptrdiff_t>(static_cast<_Tp*>(0), 0);
>>>>>       }
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Thanks
>>
>>
>


pr83938.patch (3K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: [C++ PATCH] Speed up inplace_merge algorithm & fix inefficient logic(PR libstdc++/83938)

François Dumont-2
Hi

     Some feedback regarding this patch ?

Thanks,
François

On 8/21/18 10:34 PM, François Dumont wrote:

> I missed a test that was failing because of this patch. So I revert a
> small part of it and here is the new proposal.
>
> Tested under Linux x86_64, ok to commit ?
>
> François
>
>
> On 24/07/2018 12:22, François Dumont wrote:
>> Hi
>>
>>     Any chance to review this patch ?
>>
>> François
>>
>>
>> On 06/06/2018 18:39, François Dumont wrote:
>>> Hi
>>>
>>>     I review and rework this proposal. I noticed that the same idea
>>> to limit buffer size within inplace_merge also apply to stable_sort.
>>>
>>>     I also change the decision when buffer is too small to consider
>>> the buffer size rather than going through successive cuts of the
>>> original ranges. This way we won't cut the range more than
>>> necessary. Note that if you prefer this second part could be
>>> committed seperatly.
>>>
>>>     PR libstdc++/83938
>>>     * include/bits/stl_algo.h:
>>>     (__stable_partition_adaptive): When buffer to small cut range using
>>>     buffer size.
>>>     (__inplace_merge): Take temporary buffer length from smallest
>>> range.
>>>     (__merge_adaptive): When buffer too small consider smallest
>>> range first
>>>     and cut based on buffer size.
>>>     (__stable_sort_adaptive): When buffer too small cut based on buffer
>>>     size.
>>>     (__stable_sort): Limit temporary buffer length.
>>>     * include/bits/stl_tempbuf.h (get_temporary_buffer): Change
>>> function
>>>     to reduce requested buffer length on allocation failure.
>>>
>>> Tested under Linux x86_64.
>>>
>>> Ok to commit ?
>>>
>>> François
>>>
>>>
>>> On 25/01/2018 23:37, chang jc wrote:
>>>> Hi:
>>>>
>>>> 1. The __len = (__len + 1) / 2; is as you suggested, need to modify as
>>>> __len = (__len == 1) ? 0 : ((__len + 1) / 2);
>>>>
>>>> 2. The coding gain has been shown  PR c++/83938; I re-post here
>>>>
>>>>
>>>>
>>>>
>>>>    21
>>>>    20
>>>>    19
>>>>    18
>>>>    17
>>>>    16
>>>>
>>>>
>>>>    0.471136
>>>>    0.625695
>>>>    0.767262
>>>>    0.907461
>>>>    1.04838
>>>>    1.19508
>>>>
>>>>
>>>>    0.340845
>>>>    0.48651
>>>>    0.639139
>>>>    0.770133
>>>>    0.898454
>>>>    1.04632
>>>>
>>>> it means when Merge [0, 4325376, 16777216); A is a sorted integer with
>>>> 4325376 & B with 12451840 elements, total with 16M integers
>>>>
>>>> The proposed method has the speed up under given buffer size =, ex
>>>> 2^16, 2^17, ... 2^21 in unit of sizeof(int), for example, 2^16 means
>>>> given sizof(int)*64K bytes.
>>>>
>>>> 3. As your suggestion, _TmpBuf __buf should be rewrite.
>>>>
>>>> 4. It represents a fact that the intuitive idea to split from larger
>>>> part is wrong.
>>>>
>>>> For example, if you have an input sorted array A & B, A has 8 integers
>>>> & B has 24 integers. Given tmp buffer whose capacity = 4 integers.
>>>>
>>>> Current it tries to split from B, right?
>>>>
>>>> Then we have:
>>>>
>>>> A1 | A2  B1 | B2
>>>>
>>>> B1 & B2 has 12 integers each, right?
>>>>
>>>> Current algorithm selects pivot as 13th integer from B, right?
>>>>
>>>> If the corresponding upper bound of A is 6th integer.
>>>>
>>>> Then it split in
>>>>
>>>> A1 = 5 | A2 = 3 | B1 = 12 | B2 = 12
>>>>
>>>> After rotate, we have two arrays to merge
>>>>
>>>> [A1 = 5 | B1 = 12]  & [A2 = 3 | B2 = 12]
>>>>
>>>> Great, [A2 = 3 | B2 = 12] can use tmp buffer to merge.
>>>>
>>>> Sadly, [A1 = 5 | B1 = 12] CANNOT.
>>>>
>>>> So we do rotate again, split & merge the two split arrays from [A1 = 5
>>>> | B1 = 12] again.
>>>>
>>>>
>>>> But wait, if we always split from the smaller one instead of larger
>>>> one.
>>>>
>>>> After rotate, it promises two split arrays both contain
>>>> ceiling[small/2].
>>>>
>>>> Since tmp buffer also allocate by starting from sizeof(small) &
>>>> recursively downgrade by ceiling[small/2^(# of fail allocate)].
>>>>
>>>> It means the allocated tmp buffer promises to be sufficient at the
>>>> level of (# of fail allocate).
>>>>
>>>> Instead, you can see if split from large at level (# of fail allocate)
>>>> several split array still CANNOT use  tmp buffer to do buffered merge.
>>>>
>>>>
>>>> As you know, buffered merge is far speed then (split, rotate, and
>>>> merge two sub-arrays) (PR c++/83938 gives the profiling figures),
>>>>
>>>> the way should provide speedup.
>>>>
>>>>
>>>> Thanks.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On 24/01/2018 18:23, François Dumont wrote:
>>>>
>>>> Hi
>>>>
>>>>
>>>>      It sounds like a very sensitive change to make but nothing
>>>> worth figures.
>>>> Do you have any bench showing the importance of the gain ?
>>>>
>>>>      At least the memory usage optimization is obvious.
>>>>
>>>> On 19/01/2018 10:43, chang jc wrote:
>>>>
>>>> Current std::inplace_merge() suffers from performance issue by
>>>> inefficient
>>>>
>>>> logic under limited memory,
>>>>
>>>> It leads to performance downgrade.
>>>>
>>>> Please help to review it.
>>>>
>>>> Index: include/bits/stl_algo.h
>>>> ===================================================================
>>>> --- include/bits/stl_algo.h    (revision 256871)
>>>> +++ include/bits/stl_algo.h    (working copy)
>>>> @@ -2437,7 +2437,7 @@
>>>>          _BidirectionalIterator __second_cut = __middle;
>>>>          _Distance __len11 = 0;
>>>>          _Distance __len22 = 0;
>>>> -      if (__len1 > __len2)
>>>> +      if (__len1 < __len2)
>>>>            {
>>>>              __len11 = __len1 / 2;
>>>>              std::advance(__first_cut, __len11);
>>>> @@ -2539,9 +2539,15 @@
>>>>          const _DistanceType __len1 = std::distance(__first,
>>>> __middle);
>>>>          const _DistanceType __len2 = std::distance(__middle, __last);
>>>>
>>>> +
>>>>
>>>>          typedef _Temporary_buffer<_BidirectionalIterator, _ValueType>
>>>> _TmpBuf;
>>>>
>>>> -      _TmpBuf __buf(__first, __last);
>>>> -
>>>> +      _BidirectionalIterator __start, __end;
>>>> +      if (__len1 < __len2) {
>>>> +    __start = __first; __end = __middle;
>>>> +      } else {
>>>> +    __start = __middle; __end = __last;
>>>> +      }
>>>> +      _TmpBuf __buf(__start, ___end);
>>>>
>>>> Note another optimization, we could introduce a _Temporary_buffer<>
>>>> constructor
>>>> in order to write:
>>>>
>>>> _TmpBuf __buf(std::min(__len1, __len2), __first);
>>>>
>>>> So that std::distance do not need to be called again.
>>>>
>>>>          if (__buf.begin() == 0)
>>>>        std::__merge_without_buffer
>>>>          (__first, __middle, __last, __len1, __len2, __comp);
>>>> Index: include/bits/stl_tempbuf.h
>>>> ===================================================================
>>>> --- include/bits/stl_tempbuf.h    (revision 256871)
>>>> +++ include/bits/stl_tempbuf.h    (working copy)
>>>> @@ -95,7 +95,7 @@
>>>>                                std::nothrow));
>>>>          if (__tmp != 0)
>>>>            return std::pair<_Tp*, ptrdiff_t>(__tmp, __len);
>>>> -      __len /= 2;
>>>> +      __len = (__len + 1) / 2;
>>>>
>>>> This part is problematic, if __len is 1 and allocation fails it
>>>> will loop
>>>> forever.
>>>>
>>>> It doesn't seem really necessary for your patch.
>>>>
>>>>
>>>> 2018-01-20 4:05 GMT+08:00 Jason Merrill<[hidden email]>:
>>>>
>>>>> This is a libstdc++ bug and patch, not the C++ front end.  So I'm
>>>>> adding the libstdc++ list to CC.
>>>>>
>>>>> On Fri, Jan 19, 2018 at 3:02 AM, chang jc<[hidden email]
>>>>> wrote:
>>>>>> Current std::inplace_merge() suffers from performance issue by
>>>>> inefficient
>>>>>> logic under limited memory,
>>>>>>
>>>>>> It leads to performance downgrade.
>>>>>>
>>>>>> Please help to review it.
>>>>>>
>>>>>> Index: include/bits/stl_algo.h
>>>>>> ===================================================================
>>>>>> --- include/bits/stl_algo.h     (revision 256871)
>>>>>> +++ include/bits/stl_algo.h     (working copy)
>>>>>> @@ -2437,7 +2437,7 @@
>>>>>>            _BidirectionalIterator __second_cut = __middle;
>>>>>>            _Distance __len11 = 0;
>>>>>>            _Distance __len22 = 0;
>>>>>> -         if (__len1 > __len2)
>>>>>> +         if (__len1 < __len2)
>>>>>>              {
>>>>>>                __len11 = __len1 / 2;
>>>>>>                std::advance(__first_cut, __len11);
>>>>>> @@ -2539,9 +2539,15 @@
>>>>>>         const _DistanceType __len1 = std::distance(__first,
>>>>>> __middle);
>>>>>>         const _DistanceType __len2 = std::distance(__middle,
>>>>>> __last);
>>>>>>
>>>>>> +
>>>>>>         typedef _Temporary_buffer<_BidirectionalIterator,
>>>>>> _ValueType>
>>>>> _TmpBuf;
>>>>>> -      _TmpBuf __buf(__first, __last);
>>>>>> -
>>>>>> +      _BidirectionalIterator __start, __end;
>>>>>> +      if (__len1 < __len2) {
>>>>>> +       __start = __first; __end = __middle;
>>>>>> +      } else {
>>>>>> +       __start = __middle; __end = __last;
>>>>>> +      }
>>>>>> +      _TmpBuf __buf(__start, ___end);
>>>>>>         if (__buf.begin() == 0)
>>>>>>          std::__merge_without_buffer
>>>>>>            (__first, __middle, __last, __len1, __len2, __comp);
>>>>>> Index: include/bits/stl_tempbuf.h
>>>>>> ===================================================================
>>>>>> --- include/bits/stl_tempbuf.h  (revision 256871)
>>>>>> +++ include/bits/stl_tempbuf.h  (working copy)
>>>>>> @@ -95,7 +95,7 @@
>>>>>> std::nothrow));
>>>>>>            if (__tmp != 0)
>>>>>>              return std::pair<_Tp*, ptrdiff_t>(__tmp, __len);
>>>>>> -         __len /= 2;
>>>>>> +         __len = (__len + 1) / 2;
>>>>>>          }
>>>>>>         return std::pair<_Tp*, ptrdiff_t>(static_cast<_Tp*>(0), 0);
>>>>>>       }
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> Thanks
>>>
>>>
>>
>