After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.
Bug 341327 - Memory corruption inside glib
Memory corruption inside glib
Status: RESOLVED FIXED
Product: gtk+
Classification: Platform
Component: Backend: X11
2.8.x
Other All
: Normal critical
: ---
Assigned To: gtkdev
gtkdev
: 340830 (view as bug list)
Depends on:
Blocks:
 
 
Reported: 2006-05-10 20:29 UTC by Sampo Savolainen
Modified: 2011-02-04 16:20 UTC
See Also:
GNOME target: ---
GNOME version: ---


Attachments
Valgrind/memcheck report of ardour 2.0 (cvs version) (64.12 KB, application/x-gzip)
2006-05-10 20:35 UTC, Sampo Savolainen
Details

Description Sampo Savolainen 2006-05-10 20:29:49 UTC
Steps to reproduce:
The bug is very fickle, and is quite difficult to duplicate.

Stack trace:
The actual stack trace of the crash happens is only a byproduct of the actual
bug. Software using glib and gtk will eventually crash because of memory
corruption caused by this bug. Thus, a stack trace is irrelevant. I will attach
a valgrind report of an application running where memcheck should clue you in
what's happening.

Other information:
This bug is currently halting development of some software (for example, ardour
2.0 http://www.ardour.org) because the program will crash at random places due
to this issue.
Comment 1 Sampo Savolainen 2006-05-10 20:35:07 UTC
Created attachment 65194 [details]
Valgrind/memcheck report of ardour 2.0 (cvs version)

This report was run using debug builds of glib 2.8.6, gtk+ 2.8.17, pango-1.10.4 and atk-1.10.3. See line 3539. The line numbers from these libraries might not match with original sources, as some debug output and slight modifications have been made to the libraries. These modifications are only debug variables & such, nothing which could actually create the bug.
Comment 2 Sampo Savolainen 2006-05-11 07:08:25 UTC
I forgot to mention that Ardour is not the only project suffering from similar crashes with 2.8. It seems beast/bse (http://beast.gtk.org/) is also getting crashes very similar to the one in Ardour.
Comment 3 Tim Janik 2006-05-11 14:08:40 UTC
the crashes in beast are not neccessarily related, and they so far occour on amd64 only. we are still looking into it though.

looking at your report, line 3539 has:
==5507== Invalid read of size 4
==5507==    at 0x4C43599: gdk_xid_equal (gdkxid.c:126)
==5507==    by 0x479C4C0: g_hash_table_insert (ghash.c:249)
==5507==    by 0x4C433C2: _gdk_xid_table_insert (gdkxid.c:55)
==5507==    by 0x4C3F76E: gdk_window_new (gdkwindow-x11.c:893)
==5507==    by 0x497AAAD: gtk_drawing_area_realize (gtkdrawingarea.c:127)

which is highly unlikely to just be an ordinary bug in the window or xid handling  in gdk. (the xid accessed by gdk_xid_equal() is actually part of the GdkWindow implementation struct).
instead, this could more likely be caused by sometihng messing up the hash table structures. in glib-2.10, allocation of hashtable structures changed slightly by using the new GSlice allocator. so if you were using glib-2.10, you could try to debug your crashes with the hints outlined here:
  http://blogs.gnome.org/view/timj/2006/1/25/0
but you filed this against glib-2.8.

are you sure you're not just using gtk+-2.8 on top of glib-2.10.x?
Comment 4 Sampo Savolainen 2006-05-11 19:27:08 UTC
I'm using gtk+ 2.8.17 and glib 2.8.6. I'm sure as i've created the debug environment myself. I've compiled the libs with debug symbols and installed them with prefix=/opt/debug and I'm running ardour with LD_LIBRARY_PATH=/opt/debug/lib:..

This issue began as with a backtrace of a crash caused by this issue. The crash happens in gdk_event_translate() (x11 module) at line:

 if (window && !GDK_IS_WINDOW (window))

Previously. this line of code is ran:

 window = gdk_window_lookup_for_display (display, xwindow);

For this event type (110, whatever that event is?), xwindow is 0. This function just calls gdk_xid_table_lookup_for_display(), which does a hashtable lookup for ((GdkDisplayX11 *)display)->xid_ht for the key xwindow.

I've traced what keys are stored in xid_ht, and i know for a fact that a key pointing to 0 has never been stored in there. But, when gdk_xid_table_lookup_for_display() is called, one key points to 0. 

I found that I can reliably select the node before the data key points to gets overwritten. Thus i can add a watchpoint to the memory when the key is still intact.

This then lead me to insert_rc_property() where g_array_insert_val() ( a macro pointing to g_array_insert_vals() ) is called. Inside this function the code properly resizes the array (in this case, original length 0, memory pointing to 0) with g_array_maybe_expand(). The resized array points to a point in memory 28 bytes before the set watchpoint. After g_memmove(), memcpy() is called and the XID key gets overwritten.

I know this is not the real issue, this is only a full report of a case caused by the memory corruption.
Comment 5 Sampo Savolainen 2006-05-11 19:36:01 UTC
One additional detail. As you can see, there is no crash in the valgrind report. The operations I did with ardour while running in valgrind are operations which cause ardour to crash without valgrind. 
Comment 6 Paul Davis 2006-05-11 19:52:39 UTC
as an additional data point, dmalloc also confirmed no "visible" memory overrun issues just as valgrind did. this doesn't rule out application code, but whatever it is, its a very deep and subtle problem when neither valgrind nor dmalloc can catch the app overrunning a malloc block or using free-ed memory.

btw, the valgrind trace shows some clear errors in glib/gdk/gtk code in terms of memory read access (off by 1 and off by 2 errors).
Comment 7 Tim Janik 2006-05-11 22:27:54 UTC
this issue may be related to bug 340830.
Comment 8 Tim Janik 2006-05-11 22:38:54 UTC
Sampo, can you give more detailed versioning information?
in particular glib and gtk versions where your operations run stable,
and versions where you encounter the crash (and try to up/downgrade them
to get as close as possible).

i.e. if this really is related to bug 340830, can you confirm that just downgrading your gtk+ to 2.8.12, .11, .10 or .9 fixes the crashes?
and then, what's the first version where you run into it reliably?
Comment 9 Sampo Savolainen 2006-05-12 06:12:11 UTC
From comment #1: This report was run using debug builds of glib 2.8.6, gtk+ 2.8.17, pango-1.10.4 and atk-1.10.3.

I'll try this with an earlier version of gtk and get back to you.
Comment 10 Sampo Savolainen 2006-05-12 16:43:02 UTC
I tried with gtk+ 2.6.11 and i got the same crash (=same backtrace).
Comment 11 Sampo Savolainen 2006-05-12 16:58:50 UTC
Sorry, 2.8.11 not 2.6.xx
Comment 12 Sampo Savolainen 2006-05-17 21:10:22 UTC
I made a wrapper library to catch calls to free(). I used the wrapper function to completely disable the free()ing of memory. That fixed the crash in ardour.

Next i disabled the free() function in glib, by making  glib_mem_vtable.free point to a function which doesn't do anything. With this in place, the problem was also "fixed". This points the blame at glib/gtk.
Comment 13 Sampo Savolainen 2006-05-17 21:46:20 UTC
Furthermore, it seems localized to glib. I made glib call g_fake_free() instead of g_free() and g_fake_realloc() instaed of g_realloc(), both of which don't actually free memory at all. With these changes, gtk still frees memory, as it uses g_free(), but glib functions don't free.

With these changes, ardour is still stable.
Comment 14 Eero Tamminen 2006-05-25 11:50:08 UTC
If the program works when you've "disabled" freeing, it would indicate
that it's doing a double free/unref somewhere and/or your code is using
memory that is already freed.  Could this happen e.g. if different alloc
and free functions are mismatched?

Also, if the program's stomping over stack, Valgrind (nor other free tools)
cannot detect this very well.  :-/
Comment 15 Sampo Savolainen 2006-05-25 12:01:32 UTC
glib seems to offer memory profiling tools to detect double free()'s. How can I enable them? I've googled and asked on #gtk+ but no answer.

Comment 16 Paul Davis 2006-05-25 12:41:13 UTC
Eero - please note that Sampo disabled free in glib *only* as a way to "fix" the issue. The rest of the program continues to malloc/free normally. This tends to suggest (though it does not prove) that it is code inside glib/gdk/gtk and not our application that has a problem. We are still investigating. 

Your point about valgrind is well taken, but stomping on the stack will not corrupt malloc headers and thus lead to the problems we are seeing. At least, it won't do that with any particular ease or predictability.
Comment 17 Matthias Clasen 2006-05-25 13:12:49 UTC
If this is a stack smearing problem, compiling with gcc 4.1 and -fstack-protector
may help tracking it down.
Comment 18 Eero Tamminen 2006-05-29 06:36:19 UTC
I'm not saying that this could happen easily, but it is possible.

> please note that Sampo disabled free in glib *only* as a way to "fix"
> the issue. The rest of the program continues to malloc/free normally.
> This tends to suggest (though it does not prove) that it is code inside
> glib/gdk/gtk and not our application that has a problem.

You might e.g. be unreffing a pointer in one place and freeing
it in another or vice verse.

> Your point about valgrind is well taken, but stomping on the stack will not
> corrupt malloc headers and thus lead to the problems we are seeing. At least,
> it won't do that with any particular ease or predictability.

You could have a pointer and char array in stack beside each other.
If you overwrite (or underwrite) the array bounds, you modify the pointer
and then strange behaviour happens.  Also, stack overwrite could be
modifying an array index so that you write over other areas (and their
headers) you've allocated. This could happen also without stack stomping,
Valgrind doesn't detect all writes over the allocation boundaries even
when they are in heap (see Valgrind mailing list).


> btw, the valgrind trace shows some clear errors in glib/gdk/gtk code in terms
> of memory read access (off by 1 and off by 2 errors).

See: http://bugs.kde.org/show_bug.cgi?id=127980

Use non-stripped dynamic linker if you don't want to see Valgrind
warnings from the ld-*.so.


Btw. I noticed following Ardour errors in your Valgrind log:
------------
new style 15: (null), style->rc_properties->len = 0

(ardour.bin:5507): Gtk-CRITICAL **: gtk_widget_grab_default: assertion `GTK_WIDGET_CAN_DEFAULT (widget)' failed
loading system configuration file ../ardour_system.rc
------------

And a lot of "Conditional jump or move depends on uninitialized value(s)"
warnings, starting from this:
------------
==5507== Conditional jump or move depends on uninitialised value(s)
==5507==    at 0x8482F85: Editor::set_mouse_mode(Editing::MouseMode, bool) (editor_mouse.cc:171)
------------

They could well be the bug you're looking for, they happen before
the gdk_xid_equal() invalid read.
Comment 19 Tim Janik 2006-05-30 11:58:48 UTC
Sampo, afaiu, you still don't have a reliable and reproducable way to trigger the bug?
is there at least any chance to nail the free/realloc issue a bit more specific using your fake* replacements and reducing the number of places they are used in?
Comment 20 Sampo Savolainen 2006-05-30 12:19:38 UTC
I've had a pseude-reproducable way to trigger the bug for a long time. The problem is that due to the nature of the issue, it vanishes when using valgrind.

Just yesterday i got this very interesting stack trace from valgrind which I haven't had time to look at yet (btw. glib 2.10.2 also crashes):

==27331== Invalid read of size 4
==27331==    at 0x4C606E9: gdk_xid_equal (gdkxid.c:133)
==27331==    by 0x47B49A2: g_hash_table_insert (ghash.c:253)
==27331==    by 0x4C60512: _gdk_xid_table_insert (gdkxid.c:57)
==27331==    by 0x4C5C81E: gdk_window_new (gdkwindow-x11.c:893)
==27331==    by 0x49397CA: gtk_button_realize (gtkbutton.c:916)
==27331==    by 0x469FD58: Gtk::Widget_Class::realize_callback(_GtkWidget*) (widget.cc:3239)
==27331==    by 0x483A3AA: g_cclosure_marshal_VOID__VOID (gmarshal.c:77)
==27331==    by 0x482BD68: g_type_class_meta_marshal (gclosure.c:567)
==27331==    by 0x482C428: g_closure_invoke (gclosure.c:490)
==27331==    by 0x483DAF9: signal_emit_unlocked_R (gsignal.c:2368)
==27331==    by 0x483F31A: g_signal_emit_valist (gsignal.c:2197)
==27331==    by 0x483F668: g_signal_emit (gsignal.c:2241)
==27331==    by 0x4B31CD9: gtk_widget_realize (gtkwidget.c:2339)
==27331==    by 0x4B31F55: gtk_widget_map (gtkwidget.c:2256)
==27331==    by 0x497E523: gtk_container_map_child (gtkcontainer.c:2390)
==27331==    by 0x4937950: gtk_box_forall (gtkbox.c:703)
==27331==    by 0x46CBE70: Gtk::Container_Class::forall_vfunc_callback(_GtkContainer*, int, void (*)(_GtkWidget*, void*), void*) (container.cc:396)
==27331==    by 0x497C67B: gtk_container_forall (gtkcontainer.c:1265)
==27331==    by 0x497E59F: gtk_container_map (gtkcontainer.c:2398)
==27331==    by 0x469FB70: Gtk::Widget_Class::map_callback(_GtkWidget*) (widget.cc:3171)
==27331==    by 0x483A3AA: g_cclosure_marshal_VOID__VOID (gmarshal.c:77)
==27331==    by 0x482BD68: g_type_class_meta_marshal (gclosure.c:567)
==27331==    by 0x482C428: g_closure_invoke (gclosure.c:490)
==27331==    by 0x483DAF9: signal_emit_unlocked_R (gsignal.c:2368)
==27331==  Address 0xC4D13F4 is 108 bytes inside a block of size 192 free'd
==27331==    at 0x401D048: free (vg_replace_malloc.c:235)
==27331==    by 0x47C81F0: g_free (gmem.c:187)
==27331==    by 0x4A64CB7: gtk_rc_get_style (gtkrc.c:1676)
==27331==    by 0x4B31269: gtk_widget_reset_rc_style (gtkwidget.c:4639)
==27331==    by 0x4A7948C: do_size_request (gtksizegroup.c:590)
==27331==    by 0x4A798B6: _gtk_size_group_compute_requisition (gtksizegroup.c:788)
==27331==    by 0x4B29725: gtk_widget_size_request (gtkwidget.c:2709)
==27331==    by 0x4B246A0: gtk_vbox_size_request (gtkvbox.c:122)
==27331==    by 0x469FF4E: Gtk::Widget_Class::size_request_callback(_GtkWidget*, _GtkRequisition*) (widget.cc:3308)
==27331==    by 0x483AE24: g_cclosure_marshal_VOID__BOXED (gmarshal.c:566)
==27331==    by 0x482BD68: g_type_class_meta_marshal (gclosure.c:567)
==27331==    by 0x482C428: g_closure_invoke (gclosure.c:490)
==27331==    by 0x483DAF9: signal_emit_unlocked_R (gsignal.c:2368)
==27331==    by 0x483F31A: g_signal_emit_valist (gsignal.c:2197)
==27331==    by 0x484295F: g_signal_emit_by_name (gsignal.c:2265)
==27331==    by 0x4A794BE: do_size_request (gtksizegroup.c:592)
==27331==    by 0x4A798B6: _gtk_size_group_compute_requisition (gtksizegroup.c:788)
==27331==    by 0x4B29725: gtk_widget_size_request (gtkwidget.c:2709)
==27331==    by 0x4B246A0: gtk_vbox_size_request (gtkvbox.c:122)
==27331==    by 0x469FF4E: Gtk::Widget_Class::size_request_callback(_GtkWidget*, _GtkRequisition*) (widget.cc:3308)
==27331==    by 0x483AE24: g_cclosure_marshal_VOID__BOXED (gmarshal.c:566)
==27331==    by 0x482BD68: g_type_class_meta_marshal (gclosure.c:567)
==27331==    by 0x482C428: g_closure_invoke (gclosure.c:490)
==27331==    by 0x483DAF9: signal_emit_unlocked_R (gsignal.c:2368)
Comment 21 Sampo Savolainen 2006-06-03 18:03:06 UTC
The gdk xid hash_table seems to get corrupted because of gtk_style_get_font(). 

The function inserted an XID key to the hash table which got unallocated / removed from stack, but the pointer to the XID remained in the hash table.

As the function is deprecated and not really necessary in ardour, I just removed the function call and did not attempt to find a fix for the function.

Comment 22 Tim Janik 2006-06-04 01:01:45 UTC
There(In reply to comment #21)
> The gdk xid hash_table seems to get corrupted because of gtk_style_get_font(). 

There has been even more commenting from Sampo on #gtk+, i'll paste the
relevant statements:

<sampo_v2> can you quickly look at gdk/x11/gdkfont-x11.c ?
<sampo_v2> _gdk_font_destroy() is using a suspcious key for
           _gdk_xid_table_remove(): ((XFontStruct *) private->xfont)->fid
<sampo_v2> using that key for the hash table remove fails when removing fonts
           from the xid hash table.
<sampo_v2> if the key is changed to: xfont->xid, it works fine and no removes
           which end up removing no hash nodes come up

and yes, looking at the code in gdk/x11/gdkfont-x11.c, we do:

gdk_font_load_for_display():
  XFontStruct *xfont;
  GdkFontPrivateX *private;
  private->xfont = xfont;
  private->xid = xfont->fid | XID_FONT_BIT;
  _gdk_xid_table_insert (display, &private->xid, font);

_gdk_font_destroy():
  GdkFontPrivateX *private;
  _gdk_xid_table_remove (private->display,
                         ((XFontStruct *) private->xfont)->fid);

so yes, i think we should either remove (((XFontStruct *) private->xfont)->fid | XID_FONT_BIT) or simply private->xid.
Comment 23 Sampo Savolainen 2006-06-04 09:53:35 UTC
To summarize:

The XID hash table contain GdkFont's. The XID's for fonts are (XFontStruct *)->fid | XID_FONT_BIT. The equal function the for xid hash table tests whether ((*a) & ~XID_FONT_BIT) == ((*b) & ~XID_FONT_BIT).

But, the xid hash table  hash function uses (*key). What happens when removing the GdkFont from the hash table is that the hash function selects a different hash node than where the XID really is.

So, when a GkdFont is unreffed, the font isn't removed from the xid hash table. The key is free()'d, so the hash table will have an illegal pointer. The trouble begins when the memory is allocated again.

What happens with Ardour, in my test case, is that the memory is reallocated so that the four bytes which make up the key is set to zero. Then when an XEvent with now XID (xid == 0) is dispatched, the gdk window lookup will find the already unreffed GdkFont (which now contains absolute foo), and ardour will crash.
Comment 24 Enrico Tröger 2006-06-04 16:59:28 UTC
I don't know if it will help you, but my problems with Geany mentioned in bug 340830, seems to be solved. The problem was mixing X core fonts with Pango fonts. After switching to Pango fonts, the crashes stopped.

HTH,
Enrico
Comment 25 Tim Janik 2006-06-04 17:22:21 UTC
*** Bug 340830 has been marked as a duplicate of this bug. ***
Comment 26 Matthias Clasen 2006-06-05 01:20:14 UTC
Thanks for tracking this down. 

2006-06-04  Matthias Clasen  <mclasen@redhat.com>

        * gdk/x11/gdkfont-x11.c (_gdk_font_destroy): Remove the right
        XID from the xid table.  (#34327, Sampo Savolainen, Tim Janik)

Comment 27 Sampo Savolainen 2006-06-05 06:58:15 UTC
Correction: #341327 (not #34327 as mentioned above)