GNOME Bugzilla – Bug 684526
Overflow error in source_remove()
Last modified: 2012-10-24 04:05:27 UTC
I ran into a problem where calling glib.source_remove(...) with the number that was returned from glib.timeout_add(...) causes the following error: "OverflowError: signed integer is greater than maximum".

I did some testing, and it seems that each time I call glib.timeout_add(...) I get a larger number; calling glib.source_remove(...) does not free the source numbers up for re-use. This means that either the number is failing to wrap around at the correct time, or the numbers are failing to be re-used. If the numbers are never re-used, I see this as a very serious bug, because it effectively limits how long an application can run before the maximum value the number can hold is exceeded. If the number does wrap around, then add and remove likely have a different idea of where the wrap occurs, and that would be a bug for sure.

Python 2.7.3
libglib-2.0.so.0.3200.3
glib.pyglib_version (2, 28, 6)
Wow, that's a great number of sources you added. I wrote a little test program which iterates GLib.idle_add() and GLib.source_remove() (a rough sketch follows at the end of this comment), and it would take several hours before it processes 2 billion iterations.

GLib never resets the current ID, it just keeps incrementing it. See g_source_attach_unlocked() in glib/gmain.c, currently here: http://git.gnome.org/browse/glib/tree/glib/gmain.c#n1030

I checked the data types in gi/_glib/pygsource.c and gi/_glib/glibmodule.c, and they are "guint" for the IDs, so it's not a gint vs. guint confusion. This OverflowError is raised by Python, not by g-i or glib, but also not by PyGObject directly. Also, it's very unlikely that you really encounter an overflow of the IDs there.

Do you have some example code which triggers this, or some more information about when this happens?
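A minimal sketch of such a stress test, assuming the GObject-Introspection bindings (this is an illustration, not the exact program referenced above):

    # Stress test sketch: add and remove sources in a loop and watch the IDs grow.
    from gi.repository import GLib

    def noop():
        return False                     # one-shot callback; never dispatched here

    last_id = 0
    iterations = 0
    while iterations < 2 * 10**9:        # roughly one full round of source IDs
        source_id = GLib.idle_add(noop)  # attach a new idle source
        assert source_id > last_id       # the ID only ever increases, never resets
        last_id = source_id
        GLib.source_remove(source_id)    # removing the source does not recycle its ID
        iterations += 1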
Yes, it is a very large number. The program that I am having trouble with is a daemon that generates JSON-RPC messages. It carries on long-term communications with many other process handler machines (usually between 200 and 1200), with about three messages per second to each handler.

The code is very simple: when I send a message, I create a timer that calls the same function that is called to handle a response from the remote machine, but with an argument indicating a timeout. If an actual reply is received, I call source_remove() before the timer expires. If the timer expires, I deal with the error and return False. Under normal operation, as many as 3600 timers get set every second.

I understand that this is an extreme case, but at the same time, a library as fundamental as glib should not impose a limit on how long a program can operate. That is, without any question, a bug. I also have several timers and file watchers that exist for the entire duration of the program, so in the case of a wrap-around, care has to be taken to skip over the still-active numbers, otherwise the long-term sources would be in conflict.

-Neil-
Yes, I'm not denying that it is a bug; it's just exceptionally hard to reproduce (or I am taking the wrong approach), and I cannot see an obvious error in the code.
Sorry about the delay. Please excuse the wording of my previous message; I was not attempting to contest whether it was a bug, I was attempting to describe why I end up overflowing the storage so fast.

Here is a code snippet. I removed many details for clarity; if the exact code is required, I can post it.

class Connection(object):
    ...
    # This sends an RPC request to the remote host.
    def call(self, method, params=(), reply_handler=None, error_handler=None):
        self.sequence += 1
        if self.sequence > 65535:
            self.sequence = 0
        self.timeouts[self.sequence] = glib.timeout_add(
            self.timeout, self._call_response,
            {'id': self.sequence, 'timeout': True,
             'error': {'message': 'RPC timeout calling %s()' % method,
                       'code': 560}})
        ...
        # Here a message is sent to the remote host.
        ...

    # Handle the response from the remote host.
    def _call_response(self, data):
        rpc_id = data.get('id')
        if rpc_id not in self.reply_handlers:
            return False
        glib.source_remove(self.timeouts[rpc_id])
        del self.timeouts[rpc_id]
        ...
        # Handle the response...

Notes: _call_response() is called with the decoded message when a message arrives from the remote end. If no message arrives before the timeout, glib.timeout_add() calls the function with an error message instead. Since a new timeout is created for each message, it is just a matter of time until there is an overflow; in my case, on a busy server, this is about 12 days.

When the error shows up, I get a traceback from Python that points at the glib.source_remove() line in _call_response() with the message, "OverflowError: signed integer is greater than maximum".

-Neil-
(In reply to comment #4)
> Please excuse the wording of my previous message

No excuse necessary :)

> When the error shows up, I get a trace from python that points at the
> glib.source_remove() line in _call_response() with the message,
> "OverflowError: signed integer is greater than maximum".

Ah, that helps. I had a closer look at the function, and finally spotted the error. Committed with a test case:

http://git.gnome.org/browse/pygobject/commit/?id=126a10f765af3d3a6f08ce5db7ed9f3ef647848f

This fix will be in 3.4.2, too.
Thank you for addressing this issue; this will help greatly. However, the source number wrapping around becomes the next problem.

The code sample that I posted should now work properly for an unlimited time, except that there are other long-term glib sources in existence in the program. For example, I am using glib.io_add_watch() to watch the socket that the messages are sent through. Let's assume that glib.io_add_watch() returns a source ID of 1. What happens when the timers use up all of the source numbers and wrap back around, so that glib.timeout_add() returns 1 again, and I then call glib.source_remove(1)? (A rough sketch of this scenario follows below.)

Should we file this as a new bug against glib?

-Neil-
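To make the conflict concrete, here is a hypothetical sketch using the same static glib bindings as the code above; the source IDs mentioned in the comments are assumed for illustration, not observed values:

    # Hypothetical wrap-around conflict between a long-lived watch and a timer.
    import socket
    import glib

    sock = socket.socket()

    # Long-lived watch created at startup; suppose it is given source ID 1.
    watch_id = glib.io_add_watch(sock.fileno(), glib.IO_IN, lambda fd, cond: True)

    # ... roughly 2**32 timeout_add()/source_remove() cycles later ...

    # If the internal counter has wrapped, this could also return 1.
    timer_id = glib.timeout_add(1000, lambda: False)

    # Intended to cancel the timer, but if timer_id == watch_id this would
    # collide with the still-active I/O watch instead.
    glib.source_remove(timer_id)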
Right, that would be a bug in glib. It indeed doesn't seem to make any effort to check whether the ID assigned in

  result = source->source_id = context->next_id++;

is still taken, which would happen if you have done one complete round of allocating 2 billion sources and still have some early ones around.