[Bug 856072] Re: Bionic domain name functions are not thread-safe on pre-3.0 Android

Bug Watch Updater 856072 at bugs.launchpad.net
Thu Sep 22 06:43:34 UTC 2011


Launchpad has imported 10 comments from the remote bug at
https://bugzilla.mozilla.org/show_bug.cgi?id=687367.

If you reply to an imported comment from within Launchpad, your comment
will be sent to the remote bug automatically. Read more about
Launchpad's inter-bugtracker facilities at
https://help.launchpad.net/InterBugTracking.

------------------------------------------------------------------------
On 2011-09-18T19:34:32+00:00 Jimnchen+bmo wrote:

Offshoot from Bug 662936, which was the same bug occurring on dual-core
Tegras running talos.

In bionic (ugh), domain name functions (getaddrinfo, gethostbyname_r, et
al) are not thread-safe because stdio is not thread-safe... These
functions rely on stdio for reading from the /etc/hosts file.

AFAIK in Gecko, we launch worker threads which call these domain
functions, and we crash inside bionic when we run into such a race
condition in stdio. This is the #2 top crasher in 7.0 Beta and #1 in
6.0.

I haven't seen single-core devices running into these crashes, but on
dual-core devices they are a lot more severe. The ugly fix would be
using locks around getaddrinfo calls. Another workaround would be
setting the CPU affinity on worker threads; doesn't fix the issue, but
should make it a lot less common. Also, might be possible to not use
getaddrinfo, but that seems to have implications in IPv6 support (Bug
626866); I don't know enough to say.

Reply at: https://bugs.launchpad.net/firefox/+bug/856072/comments/0

------------------------------------------------------------------------
On 2011-09-19T03:53:28+00:00 Doug-turner wrote:

lets add locking around the domain resolution and ifdef it for #android.
This should only be done for pre-honeycomb (i think).  jim, want to
throw up a patch?

Reply at: https://bugs.launchpad.net/firefox/+bug/856072/comments/1

------------------------------------------------------------------------
On 2011-09-19T14:28:09+00:00 Patrick McManus wrote:

(In reply to Doug Turner (:dougt) from comment #1)
> lets add locking around the domain resolution and ifdef it for #android. 
> This should only be done for pre-honeycomb (i think).  jim, want to throw up
> a patch?

wow. That could have such a huge negative impact. We need parallelism on
high latency operations.

Do you have a pointer to the android getaddrinfo() implementation in
question? I can't find it because android.git.kernel.org is still
offline. Is the io conditional on anything - a lazy init perhaps?

Maybe we can clone that code and remove the stdio (or just lock it?)?

If we really are forced to go down this path I would prefer
MAX_RESOLVER_THREADS_FOR_ANY_PRIORITY, and
MAX_RESOLVER_THREADS_FOR_HIGH_PRIORITY were changed into prefs that
could be set to {1,0} for this case rather than the lock. But I really
think that's a big step back for phones as modern as gingerbread!

Reply at: https://bugs.launchpad.net/firefox/+bug/856072/comments/2

------------------------------------------------------------------------
On 2011-09-19T14:33:31+00:00 Doug-turner wrote:

slow is better than crashy.  You're right, we also could just move the
bionic code into nspr (or necko), but I am not sure how much that code
fans out.  That would be better than locking for sure.

Reply at: https://bugs.launchpad.net/firefox/+bug/856072/comments/3

------------------------------------------------------------------------
On 2011-09-19T15:32:38+00:00 Jimnchen+bmo wrote:

(In reply to Patrick McManus from comment #2)
> (In reply to Doug Turner (:dougt) from comment #1)
> > lets add locking around the domain resolution and ifdef it for #android. 
> > This should only be done for pre-honeycomb (i think).  jim, want to throw up
> > a patch?
> 
> wow. That could have such a huge negative impact. We need parallelism on
> high latency operations.
> 
> Do you have a pointer to the android getaddrinfo() implementation in
> question? I can't find it because android.git.kernel.org is still offline.
> Is the io conditional on anything - a lazy init perhaps?
> 
Here's a mirror (code is forked from netbsd): https://github.com/android/platform_bionic/blob/master/libc/netbsd/net/getaddrinfo.c

fopen() is called inside _sethtent() and fclose() in _endhtent(). Both
are called by _files_getaddrinfo() which is responsible for reading
/etc/hosts. I don't see a way to skip that (unless you count the
workaround for our testing infra which is to delete /etc/hosts and make
fopen() fail; but obviously a dirty hack used internally)

(In reply to Doug Turner (:dougt) from comment #3)
> slow is better than crashy.  You're right, we also could just move the
> bionic code into nspr (or necko), but I am not sure how much that code fans
> out.  That would be better than locking for sure.

Yeah. Also setting affinity will keep parallelism while reducing chance
of race condition to pre-dual-core levels, which wasn't really an issue
back then (I don't think).

Reply at: https://bugs.launchpad.net/firefox/+bug/856072/comments/4

------------------------------------------------------------------------
On 2011-09-19T15:57:55+00:00 Patrick McManus wrote:


> Here's a mirror (code is forked from netbsd):
> https://github.com/android/platform_bionic/blob/master/libc/netbsd/net/
> getaddrinfo.c
> 
> fopen() is called inside _sethtent() and fclose() in _endhtent(). Both are
> called by _files_getaddrinfo() which is responsible for reading /etc/hosts.

Thanks Jim!

Obviously this code can crash on single cpu devices too, it just doesn't
happen often. right? It seems worth trying to fix that. I don't really
think process affinity is a replacement for a data safety model.

> I don't see a way to skip that

One thing we could do would be to cache the contents of the file in ram
in a read-only threadsafe buffer. Update it under lock for the same
conditions we call res_ninit() right now.

A slightly uglier approach would be to just put a lock around
_files_get_addrinfo. If STDIO is really not threadsafe (independent of
file) couldn't either of these still have races against the content
process saving a download, for instance? Or is the issue that all the
threads are reading /etc/hosts?

Either way means pretty much importing the c file you referenced and
calling that version of getaddrinfo on android < honeycomb, but that
actually looks relatively easy to do to me. (almost everything in there
is static, so just drag it along and make the necessary changes).

Reply at: https://bugs.launchpad.net/firefox/+bug/856072/comments/5

------------------------------------------------------------------------
On 2011-09-19T18:32:57+00:00 Jimnchen+bmo wrote:

(In reply to Patrick McManus from comment #5)
> > I don't see a way to skip that 
> 
> One thing we could do would be to cache the contents of the file in ram in a
> read-only threadsafe buffer. Update it under lock for the same conditions we
> call res_ninit() right now. 
> 
> A slightly uglier approach would be to just put a lock around
> _files_get_addrinfo. If STDIO is really not threadsafe (independent of file)
> couldn't either of these still have races against the content process saving
> a download, for instance? Or is the issue that all the threads are reading
> /etc/hosts?

Hm, good point; actually stdio is not thread-safe in general :(

There might be other race conditions, but the one I found happens when allocating FILE handles: (the netbsd code had locks but they were removed in the android fork; boo)
https://github.com/android/platform_bionic/blob/master/libc/stdio/findfp.c#L95

So in theory, even if we lock getaddrinfo, we could race against other
stdio usages.

> Either way means pretty much importing the c file you referenced and calling
> that version of getaddrinfo on android < honeycomb, but that actually looks
> relatively easy to do to me. (almost everything in there is static, so just
> drag it along and make the necessary changes).

Thanks for the ideas!

Reply at: https://bugs.launchpad.net/firefox/+bug/856072/comments/6

------------------------------------------------------------------------
On 2011-09-20T21:43:36+00:00 Azakai wrote:

Sorry for the long list of questions here.

Do we know why other apps on Android running in parallel do not hit this
crash? Do they do all their IO through Java? Or does it have to do with
each Android app being a separate unix user somehow?

Also, do we know why we are not hitting this with our two-process model?
I guess we simply don't do much stdio stuff in the child? Surely we do
plenty of it indirectly though, through various libraries (font loading,
for example)?

How sure are we that the problem is limited to getaddrinfo and
gethostbyname? Do we run a risk of this crash with every file opened,
for example? Or do we feel it is limited to those two functions?

If we pull this code and fix it in our codebase, wouldn't we run into
potential problems with different versions of bionic on different
phones? (I mean, bionic could be patched for some issue on some device,
and our single fixed implementation would not have that stuff.)

Finally, how does the actual threading issue happen: On one side we are
calling getaddrinfo or gethostbyname, what is the other thread doing
that causes the problem, do we know?

Reply at: https://bugs.launchpad.net/firefox/+bug/856072/comments/7

------------------------------------------------------------------------
On 2011-09-20T22:04:27+00:00 Patrick McManus wrote:

(In reply to Alon Zakai (:azakai) from comment #7)

> Finally, how does the actual threading issue happen: On one side we are
> calling getaddrinfo or gethostbyname, what is the other thread doing that
> causes the problem, do we know?

I would assume, from the way this is described, the other thread is also
commonly doing getaddrinfo(). There are (up to) 8 different dedicated
DNS threads because getaddrinfo() is a blocking network operation that
we need to perform in parallel. The system libraries do not provide an
async API for it, so we do it in parallel.

It would also be common for those parallel lookups to start almost
simultaneously, maximizing the chances to be executing the same stdio
code simultaneously, because they are discovered in the same HTML parse
(perhaps as sharded domain names for images or perhaps as DNS pefetches
discovered by parsing links).

Reply at: https://bugs.launchpad.net/firefox/+bug/856072/comments/8

------------------------------------------------------------------------
On 2011-09-21T00:30:19+00:00 Michael K. Edwards wrote:

My plan is to pull in the gingerbread bionic implementation of
getaddrinfo() (in libc/netbsd/net/getaddrinfo.c) and reimplement
_gethtent() without stdio calls.  Note that freeaddrinfo() and
gai_strerror() will also be provided, to ensure that they match the
getaddrinfo() implementation.

I intend to add getaddrinfo.o to libmozutils, inside /other-
licenses/android, and to rename the public symbols to
__wrap_getaddrinfo(), etc. so they will shadow the libc symbols at link
time.

Reply at: https://bugs.launchpad.net/firefox/+bug/856072/comments/9


** Changed in: firefox
       Status: Unknown => In Progress

** Changed in: firefox
   Importance: Unknown => High

-- 
You received this bug notification because you are a member of Mozilla
Bugs, which is subscribed to Mozilla.
https://bugs.launchpad.net/bugs/856072

Title:
  Bionic domain name functions are not thread-safe on pre-3.0 Android

To manage notifications about this bug go to:
https://bugs.launchpad.net/firefox/+bug/856072/+subscriptions




More information about the Ubuntu-mozillateam-bugs mailing list