[RFC][GCC][rs6000] Remaining work for inline expansion of strncmp/strcmp/memcmp for powerpc

[RFC][GCC][rs6000] Remaining work for inline expansion of strncmp/strcmp/memcmp for powerpc

Aaron Sawdey
I've previously posted a patch to add vector/vsx inline expansion of
strcmp/strncmp for the power8/power9 processors. Here are some of the
other items I have in the pipeline that I hope to get into gcc9:

* vector/vsx support for inline expansion of memcmp to non-loop code.
  This improves performance of small memcmp (a rough C sketch of the
  idea follows this list).
* vector/vsx support for inline expansion of memcmp to loop code. This
  will close the performance gap for lengths of about 128-512 bytes
  by making the loop code closer to the performance of the library
  memcmp.
* generate inline expansion to a loop for strcmp/strncmp. This closes
  another performance gap: the vector/vsx code currently generated for
  strcmp/strncmp is much faster than the library call, but we only
  compare the first 64 bytes inline to avoid exploding code size.
  Similar code in a loop would be compact and would allow comparing
  perhaps the first 512 bytes inline before falling back to the
  library function.
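
To make the first item concrete, here is a rough C model of what a
non-loop expansion of a small, constant-length memcmp amounts to. This
is only an illustration of the idea, and the helper and function names
are made up for the sketch; the backend emits target loads and
compares (and, with the planned work, VSX vector compares), not
anything like this C:

  #include <stdint.h>

  /* Load 8 bytes as a big-endian integer so that an unsigned compare
     of the loaded words gives the same ordering memcmp would.  */
  static uint64_t load_be64 (const unsigned char *p)
  {
    uint64_t v = 0;
    for (int i = 0; i < 8; i++)
      v = (v << 8) | p[i];
    return v;
  }

  /* Rough model of a non-loop expansion of memcmp (a, b, 16): two
     wide loads per operand and two compares instead of a call.  */
  static int memcmp16_model (const unsigned char *a, const unsigned char *b)
  {
    uint64_t a0 = load_be64 (a), b0 = load_be64 (b);
    if (a0 != b0)
      return a0 < b0 ? -1 : 1;
    uint64_t a1 = load_be64 (a + 8), b1 = load_be64 (b + 8);
    if (a1 != b1)
      return a1 < b1 ? -1 : 1;
    return 0;
  }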

If anyone has any other input on the inline expansion work I've been
doing for the rs6000 target, please let me know.

Thanks!
    Aaron


--
Aaron Sawdey, Ph.D.  [hidden email]
050-2/C113  (507) 253-7520 home: 507/263-0782
IBM Linux Technology Center - PPC Toolchain


Re: [RFC][GCC][rs6000] Remaining work for inline expansion of strncmp/strcmp/memcmp for powerpc

Florian Weimer
* Aaron Sawdey:

> I've previously posted a patch to add vector/vsx inline expansion of
> strcmp/strncmp for the power8/power9 processors. Here are some of the
> other items I have in the pipeline that I hope to get into gcc9:
>
> * vector/vsx support for inline expansion of memcmp to non-loop code.
>   This improves performance of small memcmp.
> * vector/vsx support for inline expansion of memcmp to loop code. This
>   will close the performance gap for lengths of about 128-512 bytes
>   by making the loop code closer to the performance of the library
>   memcmp.
> * generate inline expansion to a loop for strcmp/strncmp. This closes
>   another performance gap: the vector/vsx code currently generated for
>   strcmp/strncmp is much faster than the library call, but we only
>   compare the first 64 bytes inline to avoid exploding code size.
>   Similar code in a loop would be compact and would allow comparing
>   perhaps the first 512 bytes inline before falling back to the
>   library function.
>
> If anyone has any other input on the inline expansion work I've been
> doing for the rs6000 target, please let me know.

The inline expansion of strcmp is problematic for valgrind:

  <https://bugs.kde.org/show_bug.cgi?id=386945>

We currently see around 0.5 KiB of instructions for each call to
strcmp.  I find it hard to believe that this improves general system
performance except in micro-benchmarks.

Re: [RFC][GCC][rs6000] Remaining work for inline expansion of strncmp/strcmp/memcmp for powerpc

Aaron Sawdey
On 10/17/18 4:03 PM, Florian Weimer wrote:

> * Aaron Sawdey:
>
>> I've previously posted a patch to add vector/vsx inline expansion of
>> strcmp/strncmp for the power8/power9 processors. Here are some of the
>> other items I have in the pipeline that I hope to get into gcc9:
>>
>> * vector/vsx support for inline expansion of memcmp to non-loop code.
>>   This improves performance of small memcmp.
>> * vector/vsx support for inline expansion of memcmp to loop code. This
>>   will close the performance gap for lengths of about 128-512 bytes
>>   by making the loop code closer to the performance of the library
>>   memcmp.
>> * generate inline expansion to a loop for strcmp/strncmp. This closes
>>   another performance gap: the vector/vsx code currently generated for
>>   strcmp/strncmp is much faster than the library call, but we only
>>   compare the first 64 bytes inline to avoid exploding code size.
>>   Similar code in a loop would be compact and would allow comparing
>>   perhaps the first 512 bytes inline before falling back to the
>>   library function.
>>
>> If anyone has any other input on the inline expansion work I've been
>> doing for the rs6000 target, please let me know.
>
> The inline expansion of strcmp is problematic for valgrind:
>
>   <https://bugs.kde.org/show_bug.cgi?id=386945>

I'm aware of this. One thing that will help is that I believe the vsx
expansion for strcmp/strncmp does not have this problem, so with
current gcc9 trunk the problem should only be seen if one of the
strings is known at compile time to be less than 16 bytes, if
-mcpu=power7 is used, or if vector/vsx is disabled. My position is
that it is valgrind's problem if it doesn't understand correct code,
but I also want valgrind to be a useful tool, so I'm going to see
whether I can find a gpr sequence that is equally fast and that
valgrind can understand.
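
As I understand the valgrind report, the underlying issue is the
word-at-a-time style of the expansion: it loads and tests whole
doublewords, which can touch bytes past the terminating NUL even
though the access stays inside an aligned block and cannot fault.
A minimal C sketch of the classic zero-byte test (illustration only;
the real gpr sequence uses target compare instructions, and these
function names are made up) looks like:

  #include <stdint.h>
  #include <string.h>

  /* Nonzero iff any byte of v is zero (the classic bit trick).  */
  static int has_zero_byte (uint64_t v)
  {
    return ((v - 0x0101010101010101ULL) & ~v & 0x8080808080808080ULL) != 0;
  }

  /* Test a whole doubleword for a NUL in a couple of ALU operations.
     The memcpy models an 8-byte load; in the generated code that load
     is what valgrind sees reading past the end of the string.  */
  static int doubleword_has_nul (const char *p)
  {
    uint64_t w;
    memcpy (&w, p, sizeof w);
    return has_zero_byte (w);
  }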

> We currently see around 0.5 KiB of instructions for each call to
> strcmp.  I find it hard to believe that this improves general system
> performance except in micro-benchmarks.

The expansion of strcmp where both arguments are strings of unknown
length at compile time will compare 64 bytes then call strcmp on the
remainder if no difference is found. If the gpr sequence is used (p7
or vec/vsx disabled) then the overhead is 91 instructions. If the
p8 vsx sequence is used, the overhead is 59 instructions. If the p9
vsx sequence is used, then the overhead is 41 instructions.
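
Put differently, the generated code has roughly this shape. This is a
conceptual C model only; the real expansion is unrolled and works on
doublewords or vectors rather than single bytes, and the function name
is just for the sketch:

  #include <string.h>

  /* Compare up to 64 bytes inline; fall back to the library strcmp
     for the remainder only if no difference and no NUL was found.  */
  static int strcmp_expansion_shape (const char *a, const char *b)
  {
    for (int i = 0; i < 64; i++)   /* unrolled in the real expansion */
      {
        unsigned char ca = a[i], cb = b[i];
        if (ca != cb)
          return ca - cb;
        if (ca == '\0')
          return 0;
      }
    return strcmp (a + 64, b + 64);
  }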

Yes, this will increase the instruction footprint. However, the
processors this targets (p7, p8, p9) all have aggressive instruction
prefetch. Doing some of the comparison inline makes the common cases
(strings that are completely different, or identical and <= 64 bytes
long) very much faster. Avoiding the plt call also means less pressure
on the count cache and better branch prediction elsewhere.

If you are aware of any real world code that is faster when built
with -fno-builtin-strcmp and/or -fno-builtin-strncmp, please let me know
so I can look at avoiding those situations.

  Aaron

--
Aaron Sawdey, Ph.D.  [hidden email]
050-2/C113  (507) 253-7520 home: 507/263-0782
IBM Linux Technology Center - PPC Toolchain


Re: [RFC][GCC][rs6000] Remaining work for inline expansion of strncmp/strcmp/memcmp for powerpc

Segher Boessenkool
On Thu, Oct 18, 2018 at 09:48:22AM -0500, Aaron Sawdey wrote:
> On 10/17/18 4:03 PM, Florian Weimer wrote:
> I'm aware of this. One thing that will help is that I believe the vsx
> expansion for strcmp/strncmp does not have this problem, so with
> current gcc9 trunk the problem should only be seen if one of the
> strings is known at compile time to be less than 16 bytes, if
> -mcpu=power7 is used, or if vector/vsx is disabled. My position is
> that it is valgrind's problem if it doesn't understand correct code,
> but I also want valgrind to be a useful tool, so I'm going to see
> whether I can find a gpr sequence that is equally fast and that
> valgrind can understand.

If we can do that without losing performance, that is nice of course :-)

> > We currently see around 0.5 KiB of instructions for each call to
> > strcmp.  I find it hard to believe that this improves general system
> > performance except in micro-benchmarks.
>
> The expansion of strcmp where both arguments are strings of unknown
> length at compile time will compare 64 bytes then call strcmp on the
> remainder if no difference is found. If the gpr sequence is used (p7
> or vec/vsx disabled) then the overhead is 91 instructions. If the
> p8 vsx sequence is used, the overhead is 59 instructions. If the p9
> vsx sequence is used, then the overhead is 41 instructions.

At 4 bytes per instruction, that is 0.355 KiB, 0.230 KiB, and 0.160 KiB
respectively.

> Yes, this will increase the instruction footprint. However, the
> processors this targets (p7, p8, p9) all have aggressive instruction
> prefetch. Doing some of the comparison inline makes the common cases
> (strings that are completely different, or identical and <= 64 bytes
> long) very much faster. Avoiding the plt call also means less
> pressure on the count cache and better branch prediction elsewhere.
>
> If you are aware of any real world code that is faster when built
> with -fno-builtin-strcmp and/or -fno-builtin-strncmp, please let me know
> so I can look at avoiding those situations.

+1

Thanks Aaron!  Both for all the original work, and for looking at it
once again.


Segher

Re: [RFC][GCC][rs6000] Remaining work for inline expansion of strncmp/strcmp/memcmp for powerpc

Florian Weimer
In reply to this post by Aaron Sawdey
* Aaron Sawdey:

> If you are aware of any real world code that is faster when built
> with -fno-builtin-strcmp and/or -fno-builtin-strncmp, please let me know
> so I can look at avoiding those situations.

Sorry, I have not tried to benchmark this.

One more question: There's a hardware erratum on POWER9 DD2.1 related to
VSX load instructions causing memory corruption when accessing
cache-inhibited memory.  It may not be very likely that strncmp is used
on such memory, but memcpy and memmove definitely need to take that into
account, and perhaps memset and memcmp as well.

In the past, I did not receive positive feedback for my suggestion that
we should have a separate family of string functions for device memory.
(This is a general problem that is not specific to POWER.)  So we still
have the problem that at least some of the string functions in glibc
need to be compatible with device memory.

My concern here is that the GCC inline expansion could essentially
disable the workaround we have in glibc memcpy and memmove for the
hardware erratum.

Thanks,
Florian

Re: [RFC][GCC][rs6000] Remaining work for inline expansion of strncmp/strcmp/memcmp for powerpc

Bill Schmidt
On 12/3/18 8:34 AM, Florian Weimer wrote:

> * Aaron Sawdey:
>
>> If you are aware of any real world code that is faster when built
>> with -fno-builtin-strcmp and/or -fno-builtin-strncmp, please let me know
>> so I can look at avoiding those situations.
> Sorry, I have not tried to benchmark this.
>
> One more question: There's a hardware erratum on POWER9 DD2.1 related to
> VSX load instructions causing memory corruption when accessing
> cache-inhibited memory.  It may not be very likely that strncmp is used
> on such memory, but memcpy and memmove definitely need to take that into
> account, and perhaps memset and memcmp as well.
>
> In the past, I did not receive positive feedback for my suggestion that
> we should have a separate family of string functions for device memory.
> (This is a general problem that is not specific to POWER.)  So we still
> have the problem that at least some of the string functions in glibc
> need to be compatible with device memory.
>
> My concern here is that the GCC inline expansion could essentially
> disable the workaround we have in glibc memcpy and memmove for the
> hardware erratum.

I don't think we have a real concern here.  DD2.1 is used in a particular
situation where GCC 4.8.5 is the supported compiler, and not used elsewhere.
So I'd prefer not to cripple the compiler for this specific use case.  If
the customer with DD2.1 hardware chooses to use GCC 8 or later, and runs
into this problem, they can use -fno-builtin-mem{set,cmp} as a workaround.
Do you feel that's satisfactory?

We can also have a private discussion if you feel that's warranted.

Thanks,
Bill

>
> Thanks,
> Florian
>


Re: [RFC][GCC][rs6000] Remaining work for inline expansion of strncmp/strcmp/memcmp for powerpc

Florian Weimer
* Bill Schmidt:

> I don't think we have a real concern here.  DD2.1 is used in a particular
> situation where GCC 4.8.5 is the supported compiler, and not used elsewhere.
> So I'd prefer not to cripple the compiler for this specific use case.  If
> the customer with DD2.1 hardware chooses to use GCC 8 or later, and runs
> into this problem, they can use -fno-builtin-mem{set,cmp} as a workaround.
> Do you feel that's satisfactory?
>
> We can also have a private discussion if you feel that's warranted.

Okay, let's move this off list.

Thanks,
Florian