GNOME Bugzilla – Bug 747422
Crash in _gtk_css_value_ref due to NULL value
Last modified: 2016-01-10 16:35:07 UTC
I got this with epiphany using 3-16 branch Program received signal SIGSEGV, Segmentation fault. _gtk_css_value_ref (value=0x0) at gtkcssvalue.c:51 51 g_atomic_int_add (&value->ref_count, 1); (gdb) bt
+ Trace 234943
Sometimes it doesn't crash in _gtk_css_value_ref() but in _gtk_css_value_equal(). This is the backtrace I got this time: Program received signal SIGSEGV, Segmentation fault. _gtk_css_value_equal (value1=0x7fb499375600 <border_style_values>, value2=value2@entry=0x0) at gtkcssvalue.c:118 118 if (value1->class != value2->class) (gdb) bt
+ Trace 234967
This bug is a pain. I've been trying to catch it in valgrind for quite some time, but it never seems to happen when running in valgrind. At first I realized I was using the wrong environment variables in combination with valgrind (G_DEBUG=gc-friendly makes memory errors disappear), but I've been running it without any extra environment variables at all for several hours over multiple days, and have never hit the crash. But when running without valgrind, it's hard to go 10-15 minutes without a crash. So I guess it's a race that doesn't happen when the program gets slowed down. :( Without catching it in valgrind, I think the only thing we can do is squint really hard at GtkStyleContext, GtkCssValue, and GtkCssStyle. Note: the crash was introduced sometime between GTK+ 3.12 and 3.14.
I've checked that the value being null is border-top-style, just in case it helps.
(In reply to Michael Catanzaro from comment #2) > Note: the crash was introduced sometime between GTK+ 3.12 and 3.14. Never happened to me with 3.14, it started to happen here recently with 3.15.x
(In reply to Carlos Garcia Campos from comment #4) > Never happened to me with 3.14, it started to happen here recently with > 3.15.x You got lucky then :) Here is one problem report that covers F21: https://retrace.fedoraproject.org/faf/problems/759405/
A couple observations: * Company deleted the code where the first crash occurs in commit e95985da26c8fed22b9e05aa3ff9b4a5a53e3ad4. Might as well backport that, although it won't fix the root cause. * Both of these crashes are guarded by gtk_internal_return_val_if_fail. Without fixing the bug, we could work around the crash simply by changing them to g_return_val_if_fail.
Eh, that workaround just makes it crash later on. Had to try, though.
Created attachment 304843 [details] [review] [PATCH] css: Fix multiple crashes in parsing This is most likely the same bug I've observed. It is caused by the absence of NULL pointer checks in many places in CSS code. A patch is attached.
One more crash related to the same problem:
+ Trace 235138
Several more crashes related to the same problem: https://retrace.fedoraproject.org/faf/problems/968254/
The crash seems to be caused by the possibility of the GtkCssValue* being NULL if the corresponding value does not exist in the CSS. The most obvious crash was the one in the _gtk_css_border family of functions and is fixed by the above patch. Thare are less obvious ones, probably due to NULL values returned from one of the _gtk_css_***_get functions. A possible solution is adding proper NULL checking in every single function using GtkCssValue.
Thanks very much for working on this :) I will note that g_return_val_if_fail is used to document preconditions. It can only "fix" the crashes by turning them into critical warnings. That's 2345634x better than the crash and it would make our web browser usable again, but it's not a real fix: either the value should never be NULL (in which case g_return_val_if_fail is appropriate), or it's OK for it to be NULL (in which case you should return manually instead of using g_return_val_if_fail, but I suspect it's generally not OK). Note, this family of crashes shows up in many separate FAF reports; it's not just 1600 reports, more like 5000 at minimum: https://retrace.fedoraproject.org/faf/problems/960737/ https://retrace.fedoraproject.org/faf/problems/871220/ https://retrace.fedoraproject.org/faf/problems/887731/ https://retrace.fedoraproject.org/faf/problems/866571/
That value should never be NULL. I suspect it's caused by some invalid memory writes in close proximity or a free-before-use or some other messup. But I've never been able to trigger it. And there doesn't seem to be a way to reproduce it. So there's unfortunately been nothing I could do.
What happens if the value is not defined in the CSS of the actual GTK+ theme? I suspect this happens only with certain CSS configurations.
A concrete example: gtk_render_frame() calls gtk_css_style_render_border(). gtk_css_style_render_border() makes many calls to gtk_style_get_value(), and one of the crashes observed by me is caused by a NULL returned from this function. However, this NULL value seems to be legitimate: gtkcssstaticstyle.c 43 static GtkCssValue * 44 gtk_css_static_style_get_value (GtkCssStyle *style, 45 guint id) 46 { 47 GtkCssStaticStyle *sstyle = GTK_CSS_STATIC_STYLE (style); 48 49 if (sstyle->values == NULL || 50 id >= sstyle->values->len) 51 return NULL; 52 53 return g_ptr_array_index (sstyle->values, id); 54 } What happens then if a widget does not have border?
That's actually code that survived through some refactorings from back when GtkCssLookup didn't exist yet. It should probably be changed to an assert. The creation of a GtkCssStaticStyle ensures that the values array holds all values (and in the case where it's unspecified in the CSS uses the default value for that property).
I'm not quite sure, but it looks like that sstyle->values contains NULL at the position 20. If I understand the coredump correctly.
I found a possible cause. The function gtk_css_static_style_new_update() leaves holes in the style->values array. The function gtk_css_static_style_set_value() assumes that all values are set but does not check that. If id is too large, the value is resized, but new elements may be left unitialized. If, for some reason, ids are being skipped, the array will contain NULLs. The function gtk_css_static_style_new_update() creates a new style. This style is then initialized by going through ids one by one (hm, reallocating memory?), and the elements not in bitmask ARE LEFT UNINITIALIZED. An attempt to access these elements causes segmentation fault somewhere else.
Aren't ids not in bitmask filled in the loop above in https://git.gnome.org/browse/gtk+/tree/gtk/gtkcssstaticstyle.c?h=3.16.2#n274 ?
This loop is exactly where the problem is. This loop does not match the one in _gtk_css_lookup_resolve(). I don't see what the difference is, but my debugger shows exactly that - some values are never assigned. I can make additional tests of course, but I'm pretty sure it is so. I think it is better to make _gtk_css_lookup_resolve() first, since it already has the correct bitmask. Then just fill the NULLs by copying corresponding values from "style" to "result". Or it were more safe to do all assignments in a single place.
The bitmask is used to track which values are still missing. Code works in new_update() like this: 1) Fill in all values not in the bitmask by copying. 2) Pass the bitmask with bits for the remaining values set to the CSS lookup. 3) The CSS lookup looks in the css for all missing values that are defined there. Whenever we find a value, we set that value in lookup->values and unset the corresponding bit in lookup->missing. 4) in lookup_resolve() we iterate over all values and whenever we have a value, we set that one and whenever a value is still missing, we use the default value. Now every property will have been set: - Either it's been set in the loop in (1) by copying the value. - Or it's been found in the CSS and then copied by lookup_resolve() in (3) and (4). - Or if neither of the other 2 ways were used, the default value will have been used in (4). So now the value cannot be NULL. At least that's the theory.
Something that will maybe help with testing: a few downstream bug reports say the crash occurred when changing GTK+ themes. So I guess that means the user is messing around in Tweak Tool, then some other application dies due to the theme change. "While changing GTK+ themes with a small interval between each one" reports a user on a corebird crash. "Was viewing different settings in tweak tool when I changed GTK+ CurvLooks when it crashed" reports a user on a gnome-shell crash. Another report: "1. Launch GCalculator 2. Change style in xfce (Apperance - Style) 3. Click on GCalculator's window"
Another variant (or possibly a different issue?) in rbz#1225230: Truncated backtrace: Thread no. 1 (10 frames) #1 g_hash_table_lookup_node at ghash.c:368 #2 g_hash_table_lookup at ghash.c:1092 #3 style_values_lookup at gtkstylecontext.c:800 #4 _gtk_style_context_validate at gtkstylecontext.c:3060 #12 gtk_container_idle_sizer at gtkcontainer.c:1738 #13 _g_closure_invoke_va at gclosure.c:831 #15 g_signal_emit_by_name at gsignal.c:3405 #16 gdk_frame_clock_paint_idle at gdkframeclockidle.c:408 #22 g_main_context_iteration at gmain.c:3842 #23 g_application_run at gapplication.c:2282
Whoops, I meant 1211572: https://bugzilla.redhat.com/show_bug.cgi?id=1211572
Please, see this comment: https://bugzilla.redhat.com/show_bug.cgi?id=1156124#c41
Please summarize relevant content here - more likely people read it in their notification mail than clicking a link and then reading it. Thanks! :)
Sorry. That link points to a micro-report generated with a specific GTK3 Fedora update. But Michael Catanzaro already said that it doesn't showed any new info. Thanks.
OK, in https://bugzilla.redhat.com/show_bug.cgi?id=1234634 someone hit one of the asserts we added in Fedora to check for holes in style->values. So that is independent confirmation of Alexey's theory. Unfortunately, only one single reporter has hit this assertion so far, but there are dozens of new reports of the original issue with that patch applied, so it's safe to conclude that while this is a real problem, it's not the primary issue.
I added a lot of extra debug printout and use it every day. Unfortunately I haven't got a crash yet. I observed holes in the array today (no crash, just warning), but my assert in gtk_css_static_style_new_update was NOT triggered. On my machine the hole occurs always at values[20] regardless of the gtk+ version and compile flags, so this does not look like memory corruption.
(In reply to Michael Catanzaro from comment #28) > OK, in https://bugzilla.redhat.com/show_bug.cgi?id=1234634 someone hit one > of the asserts we added in Fedora to check for holes in style->values. So > that is independent confirmation of Alexey's theory. Unfortunately, only one > single reporter has hit this assertion so far, but there are dozens of new > reports of the original issue with that patch applied, so it's safe to > conclude that while this is a real problem, it's not the primary issue. So this was wrong: it turns out that FAF was merging new crashes in check_correctness() with the original crash reports, causing me to incorrectly think that we were not getting crashes in check_correctness(). It's impossible to say how often the crashes occur in check_correctness(), since FAF is treating them as the same crash (due to similar backtraces).
So, forgive me the ignorance, that means if I put the "original contents" of a crash here I will help?
(In reply to Michael Catanzaro from comment #30) > It's impossible to say how often the crashes occur in check_correctness(), > since FAF is treating them as the same crash (due to similar backtraces). Michael, it is possible to say how often the crashes occur, but it is a bit tricky. You can search for the function on FAF's Problem page: https://retrace.fedoraproject.org/faf/problems/?component_names=&associate=__None&daterange=&bug_filter=None&function_names=check_correctness&binary_names=&source_file_names=&since_version=&since_release=&to_version=&to_release=# and you will get the problems (groups of similar backtraces) that contain at least one report (backtrace) having the searched function on stack. My search results says that we have 72 occurrences in 6 reports (backtraces) merged into 2 problems. https://retrace.fedoraproject.org/faf/problems/bthash/?bth=5b73ca480f7c5fb02869723ac6047aedbb6a8851&bth=73e0811a885f5bc8e8121add7aace40432af280f&bth=af029730da2b951e34004925360161a537d33c6f&bth=ca5f26391abfb13f65d598c33769651f5a3e2885&bth=f2295fb0dd25b6c7254f22b0aa9693a76617ebcd https://retrace.fedoraproject.org/faf/problems/bthash/?bth=380ca8eadc1883ae89eea2127893301ff4a9af88
Thanks Jakub. Anyway, we found a relatively reliable reproducer. Open Epiphany in gdb, then click the New Tab button. Close the new tab, then click the New Tab button again. Repeat until Epiphany crashes. I reproduced this four times in a row with under 10 attempts; it usually crashes in about three attempts, but it is random. It will happen either when opening or closing the tab. I can't reproduce it reliably like this except when using gdb (nor can I get it under valgrind ever, see above).
I hit this bug rather often, today I got motivated to chase it a while. I found out that ephy valgrind spews the following warning from time to time: ==15839== Invalid write of size 8 ==15839== at 0x6523F65: g_nullify_pointer (gutils.c:2028) ==15839== by 0x626890E: weak_refs_notify (gobject.c:2629) ==15839== by 0x6269A94: g_object_unref (gobject.c:3137) ==15839== by 0x5FABCA9: do_call (gdbusnamewatching.c:216) ==15839== by 0x5FAC05A: call_vanished_handler (gdbusnamewatching.c:320) ==15839== by 0x5FAC05A: on_name_owner_changed (gdbusnamewatching.c:307) ==15839== by 0x5F9CF23: emit_signal_instance_in_idle_cb (gdbusconnection.c:3657) ==15839== by 0x64F0C89: g_main_dispatch (gmain.c:3122) ==15839== by 0x64F0C89: g_main_context_dispatch (gmain.c:3737) ==15839== by 0x64F1007: g_main_context_iterate.isra.29 (gmain.c:3808) ==15839== by 0x64F10AB: g_main_context_iteration (gmain.c:3869) ==15839== by 0x5F7A5FB: g_application_run (gapplication.c:2311) ==15839== by 0x42EDD5: main (in /usr/bin/epiphany) ==15839== Address 0xf8cb030 is 176 bytes inside a block of size 1,128 free'd ==15839== at 0x4A07D2A: free (in /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so) ==15839== by 0x6287CC6: g_type_free_instance (gtype.c:1941) ==15839== by 0x5F558E1: g_task_finalize (gtask.c:626) ==15839== by 0x6269B09: g_object_unref (gobject.c:3174) ==15839== by 0x5F55528: g_task_return_now (gtask.c:1104) ==15839== by 0x5F55D3D: g_task_return (gtask.c:1162) ==15839== by 0x469369: ??? (in /usr/bin/epiphany) ==15839== by 0x5F55528: g_task_return_now (gtask.c:1104) ==15839== by 0x5F55D3D: g_task_return (gtask.c:1162) ==15839== by 0x5FAD822: reply_cb (gdbusproxy.c:2579) ==15839== by 0x5F55528: g_task_return_now (gtask.c:1104) ==15839== by 0x5F55D3D: g_task_return (gtask.c:1162) ==15839== by 0x5FA2D01: g_dbus_connection_call_done (gdbusconnection.c:5407) ==15839== by 0x5F55528: g_task_return_now (gtask.c:1104) ==15839== by 0x5F55568: complete_in_idle_cb (gtask.c:1118) ==15839== by 0x64F0C89: g_main_dispatch (gmain.c:3122) ==15839== by 0x64F0C89: g_main_context_dispatch (gmain.c:3737) ==15839== by 0x64F1007: g_main_context_iterate.isra.29 (gmain.c:3808) ==15839== by 0x64F10AB: g_main_context_iteration (gmain.c:3869) ==15839== by 0x5F7A5FB: g_application_run (gapplication.c:2311) ==15839== by 0x42EDD5: main (in /usr/bin/epiphany) The address it pointed to seems quite unrelated to this bug, but I found that hacking glib so the vanished callback doesn't trigger would make ephy no longer crash. After a look around the ephy code, I indeed see how we set a vanished callback on ephy-web-extension-proxy.c, which unrefs the extension proxy, which triggers a weak ref on ephy-web-view.c, which nullifies priv->web_extension. This should work fine as long as priv->web_extension is valid memory, and the only case it should not be is if the EphyWebView is already gone, I indeed see no weak ref removal on EphyWebView destruction paths, so that seems to be it. I'm attaching a patch that detaches the EphyWebView from its extension proxy on ::dispose()
Created attachment 306624 [details] [review] ephy-web-view: Clean up after outliving web extension proxies If the EphyWebView is destroyed before the EphyWebExtensionProxy it's attached to does, we'll leave a dangling weak pointer, which will nullify random memory at the time the web extension proxy is actually destroyed. So, prepare for undoing the effects of page_created_cb() in case we ::dispose() when we still have a web extension attached.
Moving back to ephy. NB: I found other suspicious weak refs that might fall into the same mistake, didn't chase these too hard though.
Created attachment 306643 [details] [review] ephy-web-view: Clean up after outliving web extension proxies If the EphyWebView is destroyed before the EphyWebExtensionProxy it's attached to does, we'll leave a dangling weak pointer, which will nullify random memory at the time the web extension proxy is actually destroyed. So, prepare for undoing the effects of page_created_cb() in case we ::dispose() when we still have a web extension attached.
Review of attachment 306643 [details] [review]: How can this cause the CSS value to be null? I'm confused. ::: embed/ephy-web-view.c @@ +751,3 @@ + if (priv->web_extension) + { + g_object_remove_weak_pointer (G_OBJECT (priv->web_extension), (gpointer *)&priv->web_extension); Good catch! @@ +752,3 @@ + { + g_object_remove_weak_pointer (G_OBJECT (priv->web_extension), (gpointer *)&priv->web_extension); + g_signal_handlers_disconnect_by_data (ephy_embed_shell_get_default (), object); Do we really need this? We are using g_signal_connect_object for all EphyEmbedShell signals we connect to.
(In reply to Carlos Garcia Campos from comment #38) > Review of attachment 306643 [details] [review] [review]: > > How can this cause the CSS value to be null? I'm confused. As far as I gathered, there's plenty of css stuff going on around opening/closing tabs (eg. button sensitivities triggering animations...), so the memory from the EphyWebView is reused for mid-animation style calculations. somewhat later the extension proxy is freed, and NULLifies that memory. I think that's why it was so hard to spot on valgrind, because the nullified pointer was actually valid memory, just not the one we expected. The warning in my comment above is from one of those times when that memory was still invalid though. > > ::: embed/ephy-web-view.c > @@ +751,3 @@ > + if (priv->web_extension) > + { > + g_object_remove_weak_pointer (G_OBJECT (priv->web_extension), > (gpointer *)&priv->web_extension); > > Good catch! Thanks :), took a while of head scratching. > > @@ +752,3 @@ > + { > + g_object_remove_weak_pointer (G_OBJECT (priv->web_extension), > (gpointer *)&priv->web_extension); > + g_signal_handlers_disconnect_by_data (ephy_embed_shell_get_default > (), object); > > Do we really need this? We are using g_signal_connect_object for all > EphyEmbedShell signals we connect to. Right, I was a bit overeager there :). Let me update the patch again.
Created attachment 306647 [details] [review] ephy-web-view: Clean up after outliving web extension proxies If the EphyWebView is destroyed before the EphyWebExtensionProxy it's attached to does, we'll leave a dangling weak pointer, which will nullify random memory at the time the web extension proxy is actually destroyed. So, prepare for undoing the effects of page_created_cb() in case we ::dispose() when we still have a web extension attached.
Review of attachment 306647 [details] [review]: Please, push it to both branches. I owe you a few beers :-)
Yay! now on master/gnome-3-16 Attachment 306647 [details] pushed as 3596b8e - ephy-web-view: Clean up after outliving web extension proxies
A lot of people owe you a lot of beers. I saw those warnings from valgrind but ignored them as probably unrelated. Lesson learned.
*** Bug 746383 has been marked as a duplicate of this bug. ***