libgcc unwinder caching on Android

gcc - Dev mailing list

I recently experimented with changes to libgcc's DWARF unwinder, aimed
at improving performance on Android, and I'm wondering if upstream
would be interested in a patch.

The most interesting change modified the cache in unwind-dw2-fde-dip.c
to help even when the dlpi_adds and dlpi_subs fields are missing.
Normally, the unwinder uses these fields to invalidate the cache (e.g.
because a module was unloaded), so if the fields are missing, the
cache is disabled. Typically, the fields are present (e.g. they've
been in glibc since 2.4, and FreeBSD, NetBSD, and illumos also have
them).
Unfortunately, while I've added the two fields for the next Android
release, existing releases lack the fields and will be around for a
few years. While I can't update the dynamic linker in old versions of
Android, Android apps bundle their own unwinders, which can be
upgraded relatively quickly.

With my modified libgcc, when the two fields are missing, I use the
cache to provide a "load base hint". I then do the ordinary module
scan, but I exit early if dl_phdr_info::dlpi_addr doesn't match the
hint. Unless a module has been unloaded, this first pass should find a
matching module. Otherwise, I do a second pass that doesn't use the
hint.
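
The two-pass scan can be sketched roughly as follows. This is an
illustration under stated assumptions, not the patch itself: struct
module, module_contains, and find_module are hypothetical, and an array
of modules stands in for the sequence of dl_iterate_phdr callbacks. The
full_checks counter exists only to show how many modules get the
expensive inspection.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for what the callback sees for each module. */
struct module {
    uintptr_t load_base;  /* plays the role of dl_phdr_info::dlpi_addr */
    uintptr_t start;      /* first address covered by the module */
    uintptr_t end;        /* one past the last covered address */
};

/* Stands in for the expensive per-module work (walking the phdr table). */
static int module_contains(const struct module *m, uintptr_t pc)
{
    return pc >= m->start && pc < m->end;
}

/* Finds the module containing pc. If hint is non-NULL, a first pass skips
   every module whose load base differs from *hint; if that pass fails
   (e.g. the hinted module was unloaded), a second pass checks everything. */
static const struct module *find_module(const struct module *mods, size_t n,
                                        uintptr_t pc, const uintptr_t *hint,
                                        unsigned *full_checks)
{
    assert(full_checks != NULL && (mods != NULL || n == 0));

    if (hint != NULL) {
        for (size_t i = 0; i < n; ++i) {
            if (mods[i].load_base != *hint)
                continue;               /* cheap early exit */
            ++*full_checks;
            if (module_contains(&mods[i], pc))
                return &mods[i];
        }
    }

    /* Fallback: ordinary scan that ignores the hint. */
    for (size_t i = 0; i < n; ++i) {
        ++*full_checks;
        if (module_contains(&mods[i], pc))
            return &mods[i];
    }
    return NULL;
}
```

With a correct hint only one module is inspected in full; with a stale
hint the cost falls back to that of the uncached scan.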

This limited cache use appears to give most of the benefit of the
cache. An NDK user provided an "extest" benchmark at [1], which loads
100 DSOs and throws 10,000 exceptions. I see run-time numbers like:

    old libgcc, old rtld: ~6600ms (no caching)
    new libgcc, old rtld: ~1500ms (uses load_base hint)
    new libgcc, new rtld: ~900ms (uses dlpi_adds / dlpi_subs)

The patch refactors much of unwind-dw2-fde-dip.c, and the changes are
mostly useful only for targets that lack the dlpi_adds and dlpi_subs
fields, so I'm not sure how interested upstream would be in the patch.
Still, I'd appreciate feedback if anyone wants to share any. I can
send a patch to gcc-patches, but for now, the changes are on Android's
Gerrit tracker[3][4].

Various details:

On a cache miss, I think _Unwind_IteratePhdrCallback is rescanning the
cache for every DSO. I think it needs to set prev_cache_entry and
next_cache_entry, but that could still be done once per
dl_iterate_phdr call, if we assume that dl_iterate_phdr callbacks
aren't interleaved (or overlapped). I think that's the case with
glibc, FreeBSD, and Bionic.

AFAICT, musl's dl_iterate_phdr allows concurrent callbacks, so maybe
the musl+libgcc combination is racy.

_Unwind_IteratePhdrCallback scans a phdr table in one pass looking for
all the segments it cares about (usually PT_LOAD, PT_GNU_EH_FRAME, and
PT_DYNAMIC). I think it makes sense to have an initial pass that only
looks for PT_LOAD, because most modules won't match the PC.
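
A rough sketch of such a PT_LOAD-only first pass, assuming a platform
whose <link.h> provides ElfW() and the PT_* constants (glibc does). The
function name pc_in_load_segment is hypothetical; a caller would look
for PT_GNU_EH_FRAME and PT_DYNAMIC only when this returns true.

```c
#include <assert.h>
#include <link.h>     /* ElfW(); pulls in <elf.h> for PT_LOAD */
#include <stdbool.h>
#include <stddef.h>

/* Returns true if pc falls inside any PT_LOAD segment of a module loaded
   at load_base. All other segment types are ignored in this pass. */
static bool pc_in_load_segment(const ElfW(Phdr) *phdr, size_t phnum,
                               ElfW(Addr) load_base, ElfW(Addr) pc)
{
    assert(phdr != NULL || phnum == 0);

    for (size_t i = 0; i < phnum; ++i) {
        if (phdr[i].p_type != PT_LOAD)
            continue;

        ElfW(Addr) start = load_base + phdr[i].p_vaddr;
        /* Compare the offset into the segment against its size in memory. */
        if (pc >= start && pc - start < phdr[i].p_memsz)
            return true;
    }
    return false;
}
```

Most modules fail this check, so they would never reach the second,
more detailed walk of the table.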

If the PC is less than the load base (dlpi_addr), then I think the
unwinder can ignore the phdr entries. This optimization is present in
libunwind_llvm. It helps with the extest benchmark but introduces a
lot of variance[2], because ASLR then affects exception handling
performance.

This work was motivated by an NDK user's complaint that arm64
exceptions (libgcc) were slower than arm32 exceptions[1]. For arm32,
Android uses LLVM's libunwind instead, which uses
dl_unwind_find_exidx/__gnu_Unwind_Find_exidx. Bionic's linker has
optimized versions of these functions.

I'm wondering why libgcc checks both dlpi_adds and dlpi_subs. Isn't
dlpi_subs sufficient?

I'm wondering if there are any tests or benchmarks that would be
useful here. I haven't tried running the gcc testsuite yet.