I have a multi-threaded C++ application, used as an end component for one of the integration channels. For the last 2-3 weeks, I have been facing a strange issue: the application hangs and stops responding.
Debugging started with `strace', and I found that the application was hanging in futex:
futex(0x5ac9df, FUTEX_WAIT, ....
I thought it had something to do with the pthread_mutex that I was using to share a queue across the application. I checked that pthread_mutex_lock and pthread_mutex_unlock were being called correctly, since problem solving always starts with the assumption that the problem is in your own code.
Well, I did that and also put more debug statements around the usage of the `mutex'. Unfortunately, that did not help; the problem persisted.
I was on the verge of restructuring the entire application when, on googling `futex_wait hangs', I stumbled upon a link with a discussion of the same issue I was facing. Some text from the link:
Unfortunately, ctime() is not defined on this list. So, glibc does not guarantee the sane behavior when one uses ctime() in signal handler. BTW, I'm surprised that sysklogd calls some functions in signal handler.
Unfortunately, my application was doing the same thing, i.e., calling a function, `localtime' (which calls __libc_lock_lock() in glibc), in a signal handler. I couldn't believe it. Though the purpose of the signal handler was to clean up resources and exit the application, I was also logging some data, and the logging function was calling `localtime'.
Sadly, it has not yet been fixed. It seems that this is a problem with glibc on the 2.6 kernel, and now application programmers are at the mercy of glibc or kernel developers to fix it.
Hello,
I call this a hard won experience... AFAICS, it's not a problem with glibc, but with your code :(
You can only call a certain class of functions in a signal handler: the functions that are async-signal-safe. The Single Unix Specification defines about 100 such functions (see Single Unix Specification, §2.4.3 XSH IEEE Std 1003.1-2008) that are guaranteed to be async-signal-safe (roughly 5% of all available APIs). Your system may support more, but beware if you are concerned about portability!
Why are localtime(), ctime() etc. not async-signal-safe? This is simple: to make these functions thread-safe in an efficient manner, glibc uses a mutex. Now suppose that your thread is happily logging some message, calling localtime(). The function localtime enters, takes the mutex, and is about to compute the local time (which requires access to the time zone) ... when suddenly, a signal is received, and the handler executes. If the handler also calls localtime(), it causes localtime to take the mutex again. Wait! The mutex is already locked by the same thread. Uuuupppsss... Deadlock.
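To make this concrete, here is a minimal sketch of a handler that stays within the async-signal-safe set: it only stores to a flag and calls write(), and leaves any logging that needs localtime() to normal code that polls the flag. The names on_terminate, install_terminate_handler and shutdown_requested, and the choice of SIGTERM, are my own assumptions for illustration; they are not taken from your application.

```cpp
#include <cassert>
#include <cerrno>
#include <csignal>
#include <cstring>
#include <unistd.h>

// Set by the handler, polled by normal code. volatile sig_atomic_t is the
// type that may be written safely from inside a signal handler.
static volatile std::sig_atomic_t g_shutdown_requested = 0;

// Hypothetical handler: it performs only async-signal-safe operations.
extern "C" void on_terminate(int) {
    const int saved_errno = errno;  // write() may change errno; restore it below
    static const char msg[] = "signal received, shutting down\n";
    // write() is on the POSIX async-signal-safe list; localtime() and
    // printf() are not, so no timestamp is formatted here.
    ssize_t ignored = write(STDERR_FILENO, msg, sizeof(msg) - 1);
    (void)ignored;
    g_shutdown_requested = 1;
    errno = saved_errno;
}

// Installs the handler for SIGTERM (an assumed signal); returns true on success.
bool install_terminate_handler() {
    struct sigaction sa;
    std::memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_terminate;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGTERM, &sa, nullptr) == 0;
}

// Polled from ordinary code, where calling localtime() and logging is fine.
bool shutdown_requested() {
    return g_shutdown_requested != 0;
}
```

The main loop would check shutdown_requested() between work items and do the timestamped logging and cleanup there, outside the handler.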
Since you are using Pthreads, you have a better option available. Block the signal(s) of interest, and dedicate a special thread that waits synchronously for them. Do whatever you need to do upon reception. In this case, you're not restricted to async-signal-safe functions. See Butenhof's book, Programming with POSIX Threads, §6.6.4.
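A rough sketch of that dedicated-thread idea, using pthread_sigmask() and sigwait(), could look like the following. The function names, the global variables, and the choice of SIGTERM and SIGINT are assumptions of mine for illustration only.

```cpp
#include <cassert>
#include <csignal>
#include <cstdio>
#include <pthread.h>
#include <time.h>

// Signals routed to the dedicated thread (SIGTERM and SIGINT are assumed).
static sigset_t g_handled_signals;
// Written by the signal thread; read by others only after joining it.
static int g_last_signal = 0;

// Call this before creating any other thread: new threads inherit the mask,
// so the blocked signals are never delivered asynchronously anywhere.
bool block_handled_signals() {
    sigemptyset(&g_handled_signals);
    sigaddset(&g_handled_signals, SIGTERM);
    sigaddset(&g_handled_signals, SIGINT);
    return pthread_sigmask(SIG_BLOCK, &g_handled_signals, nullptr) == 0;
}

// Hypothetical thread body: waits synchronously for one of the signals.
// This is ordinary thread context, not a signal handler, so functions that
// take locks (localtime_r, fprintf) are allowed here.
extern "C" void* signal_thread(void*) {
    int signo = 0;
    if (sigwait(&g_handled_signals, &signo) != 0) {
        return nullptr;
    }
    time_t now = time(nullptr);
    struct tm local;
    char stamp[32] = "unknown time";
    if (localtime_r(&now, &local) != nullptr) {
        strftime(stamp, sizeof(stamp), "%Y-%m-%d %H:%M:%S", &local);
    }
    std::fprintf(stderr, "[%s] received signal %d, cleaning up\n", stamp, signo);
    g_last_signal = signo;
    return nullptr;
}
```

In main(), you would call block_handled_signals() first, then start signal_thread with pthread_create() before the worker threads, and run the cleanup once that thread returns.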
Cheers,
Loïc.
Posted by: Loïc Domaigné | August 18, 2009 at 04:11 AM