GNOME Bugzilla – Bug 684526
Overflow error in source_remove()
Last modified: 2012-10-24 04:05:27 UTC
I ran into a problem where calling glib.source_remove(...) with the number that was returned from glib.timeout_add(...) causes the following error: "OverflowError: signed integer is greater than maximum".

I did some testing, and it seems that each time I call glib.timeout_add(...) I get a larger number; calling glib.source_remove(...) does not free the source numbers up for re-use. This means that either the number is failing to wrap around at the correct time, or the numbers are failing to be re-used. If the numbers are never re-used, I see this as a very serious bug, because it effectively limits how long an application can run before the maximum value the number can hold is exceeded. If the number does wrap around, then add and remove likely have a different idea of where the wrap occurs, and that would be a bug for sure.

Python 2.7.3
libglib-2.0.so.0.3200.3
glib.pyglib_version (2, 28, 6)
Wow, that's a great number of sources you added. I wrote a little test program which iterates GLib.idle_add() and GLib.source_remove() (a rough sketch follows at the end of this comment), and it would take several hours before it processes 2 billion iterations.

GLib never resets the current ID, it just keeps incrementing it. See g_source_attach_unlocked() in glib/gmain.c, currently here: http://git.gnome.org/browse/glib/tree/glib/gmain.c#n1030

I checked the data types in gi/_glib/pygsource.c and gi/_glib/glibmodule.c, and they are "guint" for the IDs, so it's not a gint vs. guint confusion. This OverflowError is raised by Python, not by g-i or glib, but also not by PyGObject directly. Also, it's very unlikely that you really encounter an overflow of the IDs there.

Do you have some example code which triggers this, or some more information about when this happens?
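A minimal sketch of such a stress test, assuming the GObject-Introspection bindings (this is an illustration, not the exact program referenced above):

    # Stress test sketch: add and remove sources in a loop and watch the IDs grow.
    from gi.repository import GLib

    def noop():
        return False                     # one-shot callback; never dispatched here

    last_id = 0
    iterations = 0
    while iterations < 2 * 10**9:        # roughly one full round of source IDs
        source_id = GLib.idle_add(noop)  # attach a new idle source
        assert source_id > last_id       # the ID only ever increases, never resets
        last_id = source_id
        GLib.source_remove(source_id)    # removing the source does not recycle its ID
        iterations += 1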
Yes, it is a very large number. The program that I am having trouble with is a daemon that generates JSON-RPC messages. It carries on long-term communications with many other process handler machines (usually between 200 and 1200), with about three messages per second to each handler.

The code is very simple: when I send a message, I create a timer that calls the same function that is called to handle a response from the remote machine, but with an argument indicating a timeout. If an actual reply is received, I call source_remove() before the timer expires. If the timer expires, I deal with the error and return False. Under normal operation, as many as 3600 timers get set every second.

I understand that this is an extreme case, but at the same time, a library as fundamental as glib should not impose a limit on how long a program can operate. That is, without any question, a bug. I also have several timers and file watchers that exist for the entire duration of the program, so in the case of a wrap-around, care has to be taken to skip over the still-active numbers, otherwise the long-term sources would be in conflict.

-Neil-
Yes, I'm not denying that it is a bug; it's just exceptionally hard to reproduce (or I am taking the wrong approach), and I cannot see an obvious error in the code.
Sorry about the delay. Please excuse the wording of my previous message; I was not attempting to contest whether it was a bug, I was attempting to describe why I end up overflowing the storage so fast.

Here is a code snippet. I removed many details for clarity; if the exact code is required, I can post it.

class Connection(object):
    ...
    # This sends an RPC request to the remote host.
    def call(self, method, params=(), reply_handler=None, error_handler=None):
        self.sequence += 1
        if self.sequence > 65535:
            self.sequence = 0
        self.timeouts[self.sequence] = glib.timeout_add(
            self.timeout, self._call_response,
            {'id': self.sequence, 'timeout': True,
             'error': {'message': 'RPC timeout calling %s()' % method,
                       'code': 560}})
        ...
        # Here a message is sent to the remote host.
        ...

    # Handle the response from the remote host.
    def _call_response(self, data):
        rpc_id = data.get('id')
        if rpc_id not in self.reply_handlers:
            return False
        glib.source_remove(self.timeouts[rpc_id])
        del self.timeouts[rpc_id]
        ...
        # Handle the response...

Notes: _call_response() is called with the decoded message when a message arrives from the remote end. If no message arrives before the timeout, glib.timeout_add() calls the function with an error message instead. Since a new timeout is created for each message, it is just a matter of time until there is an overflow; in my case, on a busy server, this is about 12 days.

When the error shows up, I get a traceback from Python that points at the glib.source_remove() line in _call_response() with the message, "OverflowError: signed integer is greater than maximum".

-Neil-
(In reply to comment #4)
> Please excuse the wording of my previous message

No excuse necessary :)

> When the error shows up, I get a trace from python that points at the
> glib.source_remove() line in _call_response() with the message,
> "OverflowError: signed integer is greater than maximum".

Ah, that helps. I had a closer look at the function, and finally spotted the error. Committed with a test case:

http://git.gnome.org/browse/pygobject/commit/?id=126a10f765af3d3a6f08ce5db7ed9f3ef647848f

This fix will be in 3.4.2, too.
Thank you for addressing this issue; this will help greatly. However, the source number wrapping around becomes the next problem.

The code sample that I posted should now work properly for an unlimited time, except that there are other long-term glib sources in existence in the program. For example, I am using glib.io_add_watch() to watch the socket that the messages are sent through. Let's assume that glib.io_add_watch() returns a source ID of 1. What happens when the timers use up all of the source numbers and wrap back around, so that glib.timeout_add() returns 1 again, and I then call glib.source_remove(1)? (A rough sketch of this scenario follows below.)

Should we file this as a new bug against glib?

-Neil-
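To make the conflict concrete, here is a hypothetical sketch using the same static glib bindings as the code above; the source IDs mentioned in the comments are assumed for illustration, not observed values:

    # Hypothetical wrap-around conflict between a long-lived watch and a timer.
    import socket
    import glib

    sock = socket.socket()

    # Long-lived watch created at startup; suppose it is given source ID 1.
    watch_id = glib.io_add_watch(sock.fileno(), glib.IO_IN, lambda fd, cond: True)

    # ... roughly 2**32 timeout_add()/source_remove() cycles later ...

    # If the internal counter has wrapped, this could also return 1.
    timer_id = glib.timeout_add(1000, lambda: False)

    # Intended to cancel the timer, but if timer_id == watch_id this would
    # collide with the still-active I/O watch instead.
    glib.source_remove(timer_id)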
Right, that would be a bug in glib. It indeed doesn't seem to make any effort to check whether the ID assigned in

  result = source->source_id = context->next_id++;

is still taken, which would happen if you have done one complete round of allocating 2 billion sources and still have some early ones around.