GNOME Bugzilla – Bug 581526
XID table corruption from reuse of XIDs, resulting in leak, incorrect window destroyed status ("unexpectedly destroyed"), and crash
Last modified: 2018-04-15 00:31:32 UTC
Steps to reproduce:
1) Use gdk_window_new to create a GdkWindow, which is a record of an X window.
The XID for the X window is recorded on the GdkWindow. In order that X
events can be traced from the X window XID to the GdkWindow (and other
reasons), the GdkWindow is recorded in a hash table keyed by XID (and
another reference is added to the GdkWindow ref_count).
2) Use gdk_window_destroy on the GdkWindow. This calls XDestroyWindow with
the associated XID, and so the server sends a DestroyNotify event and
deletes the X window. (The GdkWindow is not yet deleted. One reference to
the GdkWindow will not be removed until the DestroyNotify event is
3) Perform some operation that uses up the XIDs in the X Display's currently
available range. (Often there is only one XID available.) Xlib will
request a new range of XIDs from the server.
4) Use gdk_window_new to create a second GdkWindow. gdk_window_new uses
XCreateWindow, which (sometimes) returns an XID equal to the XID associated
with the previous window. At this point two GdkWindow objects each have
the same XID. This new GdkWindow is recorded against the XID in the table,
removing the entry pointing to the first GdkWindow (without removing a
reference to the first GdkWindow).
5) Return to the event loop. GDK processes X events, including the
DestroyNotify event for the first window. The XID in the event is
translated to a GdkWindow using the table, but this now returns the second
GdkWindow. This GdkWindow was not expecting a DestroyNotify event and so a
warning "GdkWindow %#lx unexpectedly destroyed" is emitted. A reference is
removed from this second GdkWindow, and it is marked as destroyed (even
though the X window has not been destroyed).
So the first GdkWindow is leaked, and the second GdkWindow is incorrectly
marked as destroyed, inhibiting receipt of events on this window.
When the second GdkWindow (marked destroyed) gets used as a parent for another
gdk_window_new, a SIGSEGV results (Bug 563592).
In the situation that I saw, the DestroyNotify event was sitting in the queue
after XCreateWindow was called at step 4. (I hope it was in the queue before
the XCreateWindow call.)
(This is the most common crash that I'm seeing in Firefox.
I think this is probably the cause of bug 418199 (though I'm not seeing parentless windows as described in that bug, possibly due to different gtk+ versions.)
After some discussion with X people, I've filed https://bugs.freedesktop.org/show_bug.cgi?id=21583
I think I may also add a g_warning to _gdk_xid_table_insert when there is a collision. That'll make it easier to trace myserious crashes to this problem.
:-) Thanks for the detailed explanation in comment #0, it explains why most of the X client crashes which I was getting (not only in Firefox, but also in gvim with GTK2/Gnome2 GUI) disappear miraculously when the program is invoked with the --sync command-line switch: indeed, when using synchronous interaction between the application and the X server, is is not possible to do a lot of stuff (such as creating another Xwindow) between destroying one Xwindow and getting the notification that the X server has indeed destroyed it.
I'd like to add "[Workaround: use --sync]" in the Whiteboard, but apparently I don't have enough privileges for that.
I think that comments 24-26 in https://bugzilla.mozilla.org/show_bug.cgi?id=467744 show that this problem is still around.
I've been running "firefox --sync" and still encounter the problem. I take it that this should not be the case. Comment 26 describes my observations more completely.
In reply to comment #4
It had got so bad for me that Firefox crashed every time with a BadIdChoice message at the end of its log regardless of --sync. Then I've installed the bugfixed xorg-x11-libxcb package from the "test" repository for openSUSE 11.1, and I haven't seen that crash since then. (I have another crash, logging a different apparently non-X message, but only with Firefox builds from 2009-07-02 or later.)
The libxcb (xcb_generate_id) patch is in Fedora 11, and I'm still getting problems described in #4
I wouldn't expect --sync to avoid this bug.
--sync affects interaction between Xlib and the Xserver.
This bug is related to how GDK processes events in Xlib's queue.
An arbitrary number of windows can be destroyed and created before associated
Xlib events are processed.
BTW, the BadIdChoice error is a different bug (now fixed in xcb), although the
circumstances that lead to that bug (short XID range response) are the kind of
situations that make this bug more likely to show.
One way to avoid processing events intended for destroyed Windows on newly
created GDK windows might be to record the request id at the time of GdkWindow
creation (including foreign window creation). Only Xevents with suitable
serial numbers would be sent to this GdkWindow. This would also deal with
events on Windows that have been created and destroyed directly through Xlib
(not GDK) in the same app (and Display).
GDK_DESTROY events are sent to destroyed windows, so an ordered list of
destroyed windows may be necessary. When processing DestroyNotify events (and
optionally other events too), the Window XID would be looked up in the
destroyed list before checking the XID->GdkWindow hash table. Windows could
be added to this list either during gdk_window_destroy or (perhaps more
efficiently) only when the entry in the hash table gets knocked out by a new
entry with colliding XID.
Foreign windows are more complicated. gdk_window_foreign_new_for_display
resuses entries in the XID->GdkWindow table. An existing entry must only be
reused if it has not been destroyed. If we assume that windows created by GDK
should be destroyed by GDK, it is only existing foreign window entries that
are difficult. For foreign windows with StructureNotifyMask selected, whether
it has been destroyed can be determined from DestroyNotify events in the queue
(and unqueued responses) with XCheckIfEvent (interpreting events appropriately
according to the destroyed window list).
If we want to continue to have foreign windows without StructureNotifyMask
selected, it may be necessary to rely on the client app to dispose the
GdkWindow before or when the Window is destroyed. (It actually looks like
these windows are currently leaked because of the reference from the XID
An extra list for destroyed GdkWindows whose XIDs have already been reused might be a workable idea.
The other alternative I had in mind was to use XCheckWindowEvent at the time of the collision and process the pending DestroyNotify out-of-order, but that might have other issues...
Which part of this bug is "unconfirmed"...?
peter, you must be new here. The status "UNCONFIRMED" vs. "NEW" don't have much meaning in bugzilla.gnome.org.
yes I'm new here, but not in bugzilla in general. it's just a problem because it's not visible whether this bug is worked on, debated, mysterious or plainly out of developer interest.
(obviously I'm in the gang of "metoo"s since I enjoy these hangs/crashes as well but I'm not really into GUI coding and I consider all people who's brave enough to code such crazy things as mad semigods. I just wondered what happens here, now. but then I don't worry, things are just fine. ;-))
(Btw someone referenced a mozilla bug about "overzealously aggressive XID caching" but I am not familiar with this part of the world. It's just that this bug started to come up really often recently.)
Created attachment 144074 [details] [review]
an ugly temporary patch that seems to avoid crashs
The bug became too frequent for me to be tolerated (firefox crashed, and re-crashed before its restart was stabilized...).
So I made a temporary patch using the idea of Matthias Clasen with XCheckWindowEvent on Comment 9.
It's really ugly since I just remove the XEvents from the event list, and don't handle them (I don't know how to do that), but it worked for me : no more "GdkWindow 0x******** unexpectedly destroyed", and no crash during last week.
I use it on gtk 2.16.6, but it should also work on git head, and probably other versions.
I repeat, this is a very ugly patch and not a final one. I publish it here only because I think it could help people like me that were too much annoyed by this bug.
Comment on attachment 144074 [details] [review]
an ugly temporary patch that seems to avoid crashs
>+ //FIXME: we want ony DestroyNotify events, or do we?
You may want to view the following video here: http://www.youtube.com/watch?v=fwIwZazMTgM
I created this video to clearly demonstrate at least one trigger for the XID Collision message. I believe there are at least two triggers and that both triggers are adobe flash 10 related.
You can see from the video that you should have re-createable real life test cases for this problem.
I run a Gentoo installation.
For those familiar with Gentoo, at the end of the video, I run:
emerge -epv mozilla-firefox | less
I have saved the output of these to text files if anyone is interested. Just contact me.
The reason is that the emerge -epv mozilla-firefox command will display every package and depencies required for mozilla-firefox. For the record, prior to creating the video, I actually did re-compile every package in this list (emerge -e mozilla-firefox) in order to ensure a clean run.
In the video, the left part of the screen is a konsole terminal window. The right part of the screen is firefox. I start firefox with the command "firefox -sync' in the terminal window.
I have FF set up to start with a number of tabs. As I change focus from tab to tab, watch the terminal window. There are two tabs where changing focus causes XID Collision messages to appear. It is particularly obvious that the error messages are generated during flash activity. Note especially the generation of messages as the flash window controls autohide and then re-appear. It's not clear to me in the second tab (The Daily Show) what kind of flash control is causing the messages. However, that site never seems to stop loading flash objects. Or rather, my patience runs out before the flash downloads can complete.
My reading of other people's problems suggest that x86 (i386) based systems don't have this problem but please regard this as an unconfirmed data point.
In this thread in the Gentoo forums, I am 'dufeu': http://forums.gentoo.org/viewtopic-t-788609-highlight-.html
The video best viewed in HD on a screen 1384x768 or larger.
Thank you all for your time and patience!
The Flash Player issue is bug 590690.
Thank you Karl. I sincerely appreciate both the quick and accurate response.
In my research, I missed reading this one. I really thought I had read all related zillion comments, posts and bug entries. Since bug 590690 is the correct place, I have a few more short notes I will add there.
And for the record this problem is observed on x86 (i686).
The most straight-forward fix for this would seem, to me at least, to immediately zero out the XID field in the data structure, or otherwise flag the XID field as being invalid, as soon as there is a request to destroy the GTK window object. (Probably want to remove the entry from the reverse hash list as well, right?)
In fact, if I were designing a system like this, I'd want to make the structure externally invalid altogether since we don't want it to be usable by various callers anymore anyway. Internally, of course, we can keep the structure around just to complete the notify event protocol and any memory releasing and accounting, but then we'd destroy the GTK data structure completely.
Moreover, if we are using objects (C++ or some structured programming method), internal fields like the one where XID is stored could be made inaccessible except by way of an "accessor" or a method to get the XID, if it needed it directly at all. This would ensure other internal library routines do not accidentally depend on what would otherwise be "friendly" data fields.
I am not familiar with the GTK data structures, but this is the general approach we used "back in the day," and it seems like it should work today as well for this problem. Is there some issue about GTK architecture that one has to know that would eliminate my simple solution as a possibility? Thanks.
The problem is not that events are sent to a GdkWindow that is no longer valid,
but that events are sent to a GdkWindow that is (in the time-frame of the event queue) not yet valid (because they were for a different X window with the same XID).
Is there any progress on this bug? So far, more then two years since the bug was reported.
I've updated my Debian to Squeeze couple of days ago and have Firefox segfaults 5-10 times a day. So, it is still a problem.
May I ask a dumb question? How Qt overcomes this problem with non-unique XIDs? May be it should be done the same way in GTK?
Just updating bug report noting that Firefox has been changed upstream and now avoids this issue.
Created attachment 316266 [details] [review]
Bug fix, singly-linked list used for XID storage
We wrote a patch of this bug (credits to Ludovic Brenta and Philippe Waroquiers). The idea is that, instead of storing one window per XID, we store a linked list, with the oldest window last, and use this to make sure the DestroyNotify event is always sent to the oldest window stored.
This patch was created against the 3.18.5 release.
Created attachment 324077 [details] [review]
Bug fix, singly-linked list used for XID storage
We corrected previous patch to change order of list from oldest to newest windows and use oldest window for Xlib events from queue.
This patch was created against the 3.18.9 release.
We're moving to gitlab! As part of this move, we are moving bugs to NEEDINFO if they haven't seen activity in more than a year. If this issue is still important to you and still relevant with GTK+ 3.22 or master, please reopen it and we will migrate it to gitlab.
As announced a while ago, we are migrating to gitlab, and bugs that haven't seen activity in the last year or so will be not be migrated, but closed out in bugzilla.
If this bug is still relevant to you, you can open a new issue describing the symptoms and how to reproduce it with gtk 3.22.x or master in gitlab: