Bug 747422 – Crash in _gtk_css_value_ref due to NULL value

After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.

Bug 747422 - Crash in _gtk_css_value_ref due to NULL value


Summary:	Crash in _gtk_css_value_ref due to NULL value


Status:	RESOLVED FIXED

Product:	epiphany
Classification:	Core
Component:	General
Version:	3.16.x (obsolete)
Hardware:	Other Linux

Importance:	Normal normal
Target Milestone:	---
Assigned To:	Epiphany Maintainers
QA Contact:	Epiphany Maintainers

URL:
Whiteboard:

Duplicates:	746383 (view as bug list)
Depends on:
Blocks:

Reported:	2015-04-06 17:07 UTC by Carlos Garcia Campos
Modified:	2016-01-10 16:35 UTC

See Also:	https://bugzilla.redhat.com/show_bug.cgi?id=1151771 https://bugzilla.redhat.com/show_bug.cgi?id=1176688 https://bugzilla.redhat.com/show_bug.cgi?id=1176716 https://bugzilla.redhat.com/show_bug.cgi?id=1179725 https://bugzilla.redhat.com/show_bug.cgi?id=1206735 https://bugzilla.redhat.com/show_bug.cgi?id=1206763 https://bugzilla.redhat.com/show_bug.cgi?id=1207097 https://bugzilla.redhat.com/show_bug.cgi?id=1207907 https://bugzilla.redhat.com/show_bug.cgi?id=1208932 https://bugzilla.redhat.com/show_bug.cgi?id=1156124 https://bugzilla.redhat.com/show_bug.cgi?id=1214479 https://bugzilla.redhat.com/show_bug.cgi?id=1229698 https://bugzilla.redhat.com/show_bug.cgi?id=1229668 https://bugzilla.redhat.com/show_bug.cgi?id=1227912 https://bugzilla.redhat.com/show_bug.cgi?id=1229108 https://bugzilla.redhat.com/show_bug.cgi?id=1225230 https://bugzilla.redhat.com/show_bug.cgi?id=1211572 https://bugzilla.redhat.com/show_bug.cgi?id=1229106 https://bugzilla.redhat.com/show_bug.cgi?id=1199848 https://bugzilla.redhat.com/show_bug.cgi?id=1199022 https://bugs.centos.org/view.php?id=8629 https://bugzilla.redhat.com/show_bug.cgi?id=1234634
GNOME target:	---
GNOME version:	---

Attachments
[PATCH] css: Fix multiple crashes in parsing (11.83 KB, patch) 2015-06-09 11:12 UTC, Alexey Galakhov	none	Details \| Review
ephy-web-view: Clean up after outliving web extension proxies (1.38 KB, patch) 2015-07-02 14:16 UTC, Carlos Garnacho	none	Details \| Review
ephy-web-view: Clean up after outliving web extension proxies (1.42 KB, patch) 2015-07-02 15:32 UTC, Carlos Garnacho	accepted-commit_now	Details \| Review
ephy-web-view: Clean up after outliving web extension proxies (1.33 KB, patch) 2015-07-02 15:51 UTC, Carlos Garnacho	accepted-commit_now	Details \| Review

Description Carlos Garcia Campos 2015-04-06 17:07:30 UTC

I got this with epiphany using 3-16 branch

Program received signal SIGSEGV, Segmentation fault.
_gtk_css_value_ref (value=0x0) at gtkcssvalue.c:51
51	  g_atomic_int_add (&value->ref_count, 1);
(gdb) bt

+ Trace 234943

#0 _gtk_css_value_ref
at gtkcssvalue.c line 51
#1 gtk_css_static_style_set_value
at gtkcssstaticstyle.c line 165
#2 gtk_css_static_style_new_update
at gtkcssstaticstyle.c line 279
#3 update_properties
at gtkstylecontext.c line 824
#4 _gtk_style_context_validate
at gtkstylecontext.c line 3050
#5 _gtk_style_context_validate
at gtkstylecontext.c line 3088
#6 _gtk_style_context_validate
at gtkstylecontext.c line 3088
#7 _gtk_style_context_validate
at gtkstylecontext.c line 3088
#8 _gtk_style_context_validate
at gtkstylecontext.c line 3088
#9 _gtk_style_context_validate
at gtkstylecontext.c line 3088
#10 _gtk_style_context_validate
at gtkstylecontext.c line 3088
#11 gtk_container_idle_sizer
at gtkcontainer.c line 1856
#12 _g_closure_invoke_va
at gclosure.c line 831
#13 g_signal_emit_valist
at gsignal.c line 3214
#14 g_signal_emit_by_name
at gsignal.c line 3401
#15 gdk_frame_clock_paint_idle
at gdkframeclockidle.c line 408
#16 gdk_threads_dispatch
at gdk.c line 717
#17 g_timeout_dispatch
at gmain.c line 4545
#18 g_main_dispatch
at gmain.c line 3122
#19 g_main_context_dispatch
at gmain.c line 3737
#20 g_main_context_iterate
at gmain.c line 3808
#21 g_main_context_iteration
at gmain.c line 3869
#22 g_application_run
at gapplication.c line 2308
#23 main
at ephy-main.c line 488

Comment 1 Carlos Garcia Campos 2015-04-13 17:00:11 UTC

Sometimes it doesn't crash in _gtk_css_value_ref() but in _gtk_css_value_equal(). This is the backtrace I got this time:

Program received signal SIGSEGV, Segmentation fault.
_gtk_css_value_equal (value1=0x7fb499375600 <border_style_values>, value2=value2@entry=0x0) at gtkcssvalue.c:118
118	  if (value1->class != value2->class)
(gdb) bt

+ Trace 234967

#0 _gtk_css_value_equal
at gtkcssvalue.c line 118
#1 gtk_css_style_get_difference
at gtkcssstyle.c line 89
#2 _gtk_style_context_validate
at gtkstylecontext.c line 3076
#3 _gtk_style_context_validate
at gtkstylecontext.c line 3088
#4 _gtk_style_context_validate
at gtkstylecontext.c line 3088
#5 _gtk_style_context_validate
at gtkstylecontext.c line 3088
#6 _gtk_style_context_validate
at gtkstylecontext.c line 3088
#7 _gtk_style_context_validate
at gtkstylecontext.c line 3088
#8 _gtk_style_context_validate
at gtkstylecontext.c line 3088
#9 gtk_container_idle_sizer
at gtkcontainer.c line 1856
#10 _g_closure_invoke_va
at gclosure.c line 831
#11 g_signal_emit_valist
at gsignal.c line 3214
#12 g_signal_emit_by_name
at gsignal.c line 3401
#13 gdk_frame_clock_paint_idle
at gdkframeclockidle.c line 408
#14 gdk_threads_dispatch
at gdk.c line 717
#15 g_timeout_dispatch
at gmain.c line 4545
#16 g_main_dispatch
at gmain.c line 3122
#17 g_main_context_dispatch
at gmain.c line 3737
#18 g_main_context_iterate
at gmain.c line 3808
#19 g_main_context_iteration
at gmain.c line 3869
#20 g_application_run
at gapplication.c line 2308
#21 main
at ephy-main.c line 488

Comment 2 Michael Catanzaro 2015-04-13 17:47:04 UTC

This bug is a pain. I've been trying to catch it in valgrind for quite some time, but it never seems to happen when running in valgrind. At first I realized I was using the wrong environment variables in combination with valgrind (G_DEBUG=gc-friendly makes memory errors disappear), but I've been running it without any extra environment variables at all for several hours over multiple days, and have never hit the crash. But when running without valgrind, it's hard to go 10-15 minutes without a crash. So I guess it's a race that doesn't happen when the program gets slowed down. :(

Without catching it in valgrind, I think the only thing we can do is squint really hard at GtkStyleContext, GtkCssValue, and GtkCssStyle.

Note: the crash was introduced sometime between GTK+ 3.12 and 3.14.

Comment 3 Carlos Garcia Campos 2015-04-13 18:00:48 UTC

I've checked that the value being null is border-top-style, just in case it helps.

Comment 4 Carlos Garcia Campos 2015-04-13 18:01:37 UTC

(In reply to Michael Catanzaro from comment #2)
 
> Note: the crash was introduced sometime between GTK+ 3.12 and 3.14.

Never happened to me with 3.14, it started to happen here recently with 3.15.x

Comment 5 Michael Catanzaro 2015-04-13 18:42:17 UTC

(In reply to Carlos Garcia Campos from comment #4)
> Never happened to me with 3.14, it started to happen here recently with
> 3.15.x

You got lucky then :) Here is one problem report that covers F21: https://retrace.fedoraproject.org/faf/problems/759405/

Comment 6 Michael Catanzaro 2015-04-24 22:51:50 UTC

A couple observations:

* Company deleted the code where the first crash occurs in commit e95985da26c8fed22b9e05aa3ff9b4a5a53e3ad4. Might as well backport that, although it won't fix the root cause.
* Both of these crashes are guarded by gtk_internal_return_val_if_fail. Without fixing the bug, we could work around the crash simply by changing them to g_return_val_if_fail.

Comment 7 Michael Catanzaro 2015-04-25 04:01:46 UTC

Eh, that workaround just makes it crash later on. Had to try, though.

Comment 8 Alexey Galakhov 2015-06-09 11:12:43 UTC

Created attachment 304843 [details] [review]
[PATCH] css: Fix multiple crashes in parsing

This is most likely the same bug I've observed. It is caused by the absence of NULL pointer checks in many places in CSS code. A patch is attached.

Comment 9 Alexey Galakhov 2015-06-09 12:44:56 UTC

One more crash related to the same problem:

+ Trace 235138

#0 _gtk_css_value_equal
from /usr/lib/libgtk-3.so.0
#1 gtk_css_style_get_difference
from /usr/lib/libgtk-3.so.0
#2 _gtk_style_context_validate
from /usr/lib/libgtk-3.so.0
#3 _gtk_style_context_validate
from /usr/lib/libgtk-3.so.0
#4 _gtk_style_context_validate
from /usr/lib/libgtk-3.so.0
#5 _gtk_style_context_validate
from /usr/lib/libgtk-3.so.0
#6 _gtk_style_context_validate
from /usr/lib/libgtk-3.so.0
#7 _gtk_style_context_validate
from /usr/lib/libgtk-3.so.0
#8 _gtk_style_context_validate
from /usr/lib/libgtk-3.so.0
#9 _gtk_style_context_validate
from /usr/lib/libgtk-3.so.0
#10 _gtk_style_context_validate
from /usr/lib/libgtk-3.so.0
#11 _gtk_style_context_validate
from /usr/lib/libgtk-3.so.0
#12 _gtk_style_context_validate
from /usr/lib/libgtk-3.so.0
#13 gtk_container_idle_sizer
from /usr/lib/libgtk-3.so.0

Comment 10 Jakub Filak 2015-06-09 13:19:13 UTC

Several more crashes related to the same problem:

https://retrace.fedoraproject.org/faf/problems/968254/

Comment 11 Alexey Galakhov 2015-06-09 13:28:49 UTC

The crash seems to be caused by the possibility of the GtkCssValue* being NULL if the corresponding value does not exist in the CSS. The most obvious crash was the one in the _gtk_css_border family of functions and is fixed by the above patch. Thare are less obvious ones, probably due to NULL values returned from one of the _gtk_css_***_get functions. A possible solution is adding proper NULL checking in every single function using GtkCssValue.

Comment 12 Michael Catanzaro 2015-06-09 13:39:50 UTC

Thanks very much for working on this :)

I will note that g_return_val_if_fail is used to document preconditions. It can only "fix" the crashes by turning them into critical warnings. That's 2345634x better than the crash and it would make our web browser usable again, but it's not a real fix: either the value should never be NULL (in which case g_return_val_if_fail is appropriate), or it's OK for it to be NULL (in which case you should return manually instead of using g_return_val_if_fail, but I suspect it's generally not OK).

Note, this family of crashes shows up in many separate FAF reports; it's not just 1600 reports, more like 5000 at minimum:

https://retrace.fedoraproject.org/faf/problems/960737/
https://retrace.fedoraproject.org/faf/problems/871220/
https://retrace.fedoraproject.org/faf/problems/887731/
https://retrace.fedoraproject.org/faf/problems/866571/

Comment 13 Benjamin Otte (Company) 2015-06-09 14:11:45 UTC

That value should never be NULL.

I suspect it's caused by some invalid memory writes in close proximity or a free-before-use or some other messup.

But I've never been able to trigger it. And there doesn't seem to be a way to reproduce it. So there's unfortunately been nothing I could do.

Comment 14 Alexey Galakhov 2015-06-09 14:22:52 UTC

What happens if the value is not defined in the CSS of the actual GTK+ theme?

I suspect this happens only with certain CSS configurations.

Comment 15 Alexey Galakhov 2015-06-09 14:41:26 UTC

A concrete example:

gtk_render_frame() calls gtk_css_style_render_border(). gtk_css_style_render_border() makes many calls to gtk_style_get_value(), and one of the crashes observed by me is caused by a NULL returned from this function.

However, this NULL value seems to be legitimate:

gtkcssstaticstyle.c
 43 static GtkCssValue *
 44 gtk_css_static_style_get_value (GtkCssStyle *style,
 45                                 guint        id)
 46 {
 47   GtkCssStaticStyle *sstyle = GTK_CSS_STATIC_STYLE (style);
 48 
 49   if (sstyle->values == NULL ||
 50       id >= sstyle->values->len)
 51     return NULL;
 52 
 53   return g_ptr_array_index (sstyle->values, id);
 54 }

What happens then if a widget does not have border?

Comment 16 Benjamin Otte (Company) 2015-06-09 15:32:00 UTC

That's actually code that survived through some refactorings from back when GtkCssLookup didn't exist yet. It should probably be changed to an assert.

The creation of a GtkCssStaticStyle ensures that the values array holds all values (and in the case where it's unspecified in the CSS uses the default value for that property).

Comment 17 Alexey Galakhov 2015-06-11 15:43:07 UTC

I'm not quite sure, but it looks like that sstyle->values contains NULL at the position 20. If I understand the coredump correctly.

Comment 18 Alexey Galakhov 2015-06-12 08:14:21 UTC

I found a possible cause. The function gtk_css_static_style_new_update() leaves holes in the style->values array.

The function gtk_css_static_style_set_value() assumes that all values are set but does not check that. If id is too large, the value is resized, but new elements may be left unitialized. If, for some reason, ids are being skipped, the array will contain NULLs.

The function gtk_css_static_style_new_update() creates a new style. This style is then initialized by going through ids one by one (hm, reallocating memory?), and the elements not in bitmask ARE LEFT UNINITIALIZED. An attempt to access these elements causes segmentation fault somewhere else.

Comment 19 Benjamin Otte (Company) 2015-06-12 12:59:17 UTC

Aren't ids not in bitmask filled in the loop above in https://git.gnome.org/browse/gtk+/tree/gtk/gtkcssstaticstyle.c?h=3.16.2#n274 ?

Comment 20 Alexey Galakhov 2015-06-12 14:05:21 UTC

This loop is exactly where the problem is. This loop does not match the one in _gtk_css_lookup_resolve(). I don't see what the difference is, but my debugger shows exactly that - some values are never assigned. I can make additional tests of course, but I'm pretty sure it is so.

I think it is better to make _gtk_css_lookup_resolve() first, since it already has the correct bitmask. Then just fill the NULLs by copying corresponding values from "style" to "result". Or it were more safe to do all assignments in a single place.

Comment 21 Benjamin Otte (Company) 2015-06-13 00:20:51 UTC

The bitmask is used to track which values are still missing. Code works in new_update() like this:

1) Fill in all values not in the bitmask by copying.

2) Pass the bitmask with bits for the remaining values set to the CSS lookup.

3) The CSS lookup looks in the css for all missing values that are defined there. Whenever we find a value, we set that value in lookup->values and unset the corresponding bit in lookup->missing.

4) in lookup_resolve() we iterate over all values and whenever we have a value, we set that one and whenever a value is still missing, we use the default value.


Now every property will have been set:

- Either it's been set in the loop in (1) by copying the value.

- Or it's been found in the CSS and then copied by lookup_resolve() in (3) and (4).

- Or if neither of the other 2 ways were used, the default value will have been used in (4).

So now the value cannot be NULL.

At least that's the theory.

Comment 22 Michael Catanzaro 2015-06-16 14:50:41 UTC

Something that will maybe help with testing: a few downstream bug reports say the crash occurred when changing GTK+ themes. So I guess that means the user is messing around in Tweak Tool, then some other application dies due to the theme change.

"While changing GTK+ themes with a small interval between each one" reports a user on a corebird crash.

"Was viewing different settings in tweak tool when I changed GTK+ CurvLooks when it crashed" reports a user on a gnome-shell crash.

Another report:

"1. Launch GCalculator
2. Change style in xfce (Apperance - Style)
3. Click on GCalculator's window"

Comment 23 Michael Catanzaro 2015-06-17 18:31:11 UTC

Another variant (or possibly a different issue?) in rbz#1225230:

Truncated backtrace:
Thread no. 1 (10 frames)
 #1 g_hash_table_lookup_node at ghash.c:368
 #2 g_hash_table_lookup at ghash.c:1092
 #3 style_values_lookup at gtkstylecontext.c:800
 #4 _gtk_style_context_validate at gtkstylecontext.c:3060
 #12 gtk_container_idle_sizer at gtkcontainer.c:1738
 #13 _g_closure_invoke_va at gclosure.c:831
 #15 g_signal_emit_by_name at gsignal.c:3405
 #16 gdk_frame_clock_paint_idle at gdkframeclockidle.c:408
 #22 g_main_context_iteration at gmain.c:3842
 #23 g_application_run at gapplication.c:2282

Comment 24 Michael Catanzaro 2015-06-17 18:32:15 UTC

Whoops, I meant 1211572: https://bugzilla.redhat.com/show_bug.cgi?id=1211572

Comment 25 Diogo Campos 2015-06-18 16:54:37 UTC

Please, see this comment: https://bugzilla.redhat.com/show_bug.cgi?id=1156124#c41

Comment 26 André Klapper 2015-06-18 17:45:58 UTC

Please summarize relevant content here - more likely people read it in their notification mail than clicking a link and then reading it. Thanks! :)

Comment 27 Diogo Campos 2015-06-18 17:57:07 UTC

Sorry.

That link points to a micro-report generated with a specific GTK3 Fedora update.
But Michael Catanzaro already said that it doesn't showed any new info.

Thanks.

Comment 28 Michael Catanzaro 2015-06-22 22:15:40 UTC

OK, in https://bugzilla.redhat.com/show_bug.cgi?id=1234634 someone hit one of the asserts we added in Fedora to check for holes in style->values. So that is independent confirmation of Alexey's theory. Unfortunately, only one single reporter has hit this assertion so far, but there are dozens of new reports of the original issue with that patch applied, so it's safe to conclude that while this is a real problem, it's not the primary issue.

Comment 29 Alexey Galakhov 2015-06-22 22:30:13 UTC

I added a lot of extra debug printout and use it every day. Unfortunately I haven't got a crash yet. I observed holes in the array today (no crash, just warning), but my assert in gtk_css_static_style_new_update was NOT triggered. On my machine the hole occurs always at values[20] regardless of the gtk+ version and compile flags, so this does not look like memory corruption.

Comment 30 Michael Catanzaro 2015-06-28 22:39:51 UTC

(In reply to Michael Catanzaro from comment #28)
> OK, in https://bugzilla.redhat.com/show_bug.cgi?id=1234634 someone hit one
> of the asserts we added in Fedora to check for holes in style->values. So
> that is independent confirmation of Alexey's theory. Unfortunately, only one
> single reporter has hit this assertion so far, but there are dozens of new
> reports of the original issue with that patch applied, so it's safe to
> conclude that while this is a real problem, it's not the primary issue.

So this was wrong: it turns out that FAF was merging new crashes in check_correctness() with the original crash reports, causing me to incorrectly think that we were not getting crashes in check_correctness(). It's impossible to say how often the crashes occur in check_correctness(), since FAF is treating them as the same crash (due to similar backtraces).

Comment 31 Diogo Campos 2015-06-28 22:50:38 UTC

So, forgive me the ignorance, that means if I put the "original contents" of a crash here I will help?

Comment 32 Jakub Filak 2015-06-29 06:35:10 UTC

(In reply to Michael Catanzaro from comment #30)
> It's impossible to say how often the crashes occur in check_correctness(),
> since FAF is treating them as the same crash (due to similar backtraces).

Michael, it is possible to say how often the crashes occur, but it is a bit tricky. You can search for the function on FAF's Problem page:

https://retrace.fedoraproject.org/faf/problems/?component_names=&associate=__None&daterange=&bug_filter=None&function_names=check_correctness&binary_names=&source_file_names=&since_version=&since_release=&to_version=&to_release=#

and you will get the problems (groups of similar backtraces) that contain at least one report (backtrace) having the searched function on stack.

My search results says that we have 72 occurrences in 6 reports (backtraces) merged into 2 problems.

https://retrace.fedoraproject.org/faf/problems/bthash/?bth=5b73ca480f7c5fb02869723ac6047aedbb6a8851&bth=73e0811a885f5bc8e8121add7aace40432af280f&bth=af029730da2b951e34004925360161a537d33c6f&bth=ca5f26391abfb13f65d598c33769651f5a3e2885&bth=f2295fb0dd25b6c7254f22b0aa9693a76617ebcd

https://retrace.fedoraproject.org/faf/problems/bthash/?bth=380ca8eadc1883ae89eea2127893301ff4a9af88

Comment 33 Michael Catanzaro 2015-07-02 01:38:41 UTC

Thanks Jakub.

Anyway, we found a relatively reliable reproducer. Open Epiphany in gdb, then click the New Tab button. Close the new tab, then click the New Tab button again. Repeat until Epiphany crashes. I reproduced this four times in a row with under 10 attempts; it usually crashes in about three attempts, but it is random. It will happen either when opening or closing the tab.

I can't reproduce it reliably like this except when using gdb (nor can I get it under valgrind ever, see above).

Comment 34 Carlos Garnacho 2015-07-02 14:16:16 UTC

I hit this bug rather often, today I got motivated to chase it a while. I found out that ephy valgrind spews the following warning from time to time:

==15839== Invalid write of size 8
==15839==    at 0x6523F65: g_nullify_pointer (gutils.c:2028)
==15839==    by 0x626890E: weak_refs_notify (gobject.c:2629)
==15839==    by 0x6269A94: g_object_unref (gobject.c:3137)
==15839==    by 0x5FABCA9: do_call (gdbusnamewatching.c:216)
==15839==    by 0x5FAC05A: call_vanished_handler (gdbusnamewatching.c:320)
==15839==    by 0x5FAC05A: on_name_owner_changed (gdbusnamewatching.c:307)
==15839==    by 0x5F9CF23: emit_signal_instance_in_idle_cb (gdbusconnection.c:3657)
==15839==    by 0x64F0C89: g_main_dispatch (gmain.c:3122)
==15839==    by 0x64F0C89: g_main_context_dispatch (gmain.c:3737)
==15839==    by 0x64F1007: g_main_context_iterate.isra.29 (gmain.c:3808)
==15839==    by 0x64F10AB: g_main_context_iteration (gmain.c:3869)
==15839==    by 0x5F7A5FB: g_application_run (gapplication.c:2311)
==15839==    by 0x42EDD5: main (in /usr/bin/epiphany)
==15839==  Address 0xf8cb030 is 176 bytes inside a block of size 1,128 free'd
==15839==    at 0x4A07D2A: free (in /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so)
==15839==    by 0x6287CC6: g_type_free_instance (gtype.c:1941)
==15839==    by 0x5F558E1: g_task_finalize (gtask.c:626)
==15839==    by 0x6269B09: g_object_unref (gobject.c:3174)
==15839==    by 0x5F55528: g_task_return_now (gtask.c:1104)
==15839==    by 0x5F55D3D: g_task_return (gtask.c:1162)
==15839==    by 0x469369: ??? (in /usr/bin/epiphany)
==15839==    by 0x5F55528: g_task_return_now (gtask.c:1104)
==15839==    by 0x5F55D3D: g_task_return (gtask.c:1162)
==15839==    by 0x5FAD822: reply_cb (gdbusproxy.c:2579)
==15839==    by 0x5F55528: g_task_return_now (gtask.c:1104)
==15839==    by 0x5F55D3D: g_task_return (gtask.c:1162)
==15839==    by 0x5FA2D01: g_dbus_connection_call_done (gdbusconnection.c:5407)
==15839==    by 0x5F55528: g_task_return_now (gtask.c:1104)
==15839==    by 0x5F55568: complete_in_idle_cb (gtask.c:1118)
==15839==    by 0x64F0C89: g_main_dispatch (gmain.c:3122)
==15839==    by 0x64F0C89: g_main_context_dispatch (gmain.c:3737)
==15839==    by 0x64F1007: g_main_context_iterate.isra.29 (gmain.c:3808)
==15839==    by 0x64F10AB: g_main_context_iteration (gmain.c:3869)
==15839==    by 0x5F7A5FB: g_application_run (gapplication.c:2311)
==15839==    by 0x42EDD5: main (in /usr/bin/epiphany)

The address it pointed to seems quite unrelated to this bug, but I found that hacking glib so the vanished callback doesn't trigger would make ephy no longer crash.

After a look around the ephy code, I indeed see how we set a vanished callback on ephy-web-extension-proxy.c, which unrefs the extension proxy, which triggers a weak ref on ephy-web-view.c, which nullifies priv->web_extension. 

This should work fine as long as priv->web_extension is valid memory, and the only case it should not be is if the EphyWebView is already gone, I indeed see no weak ref removal on EphyWebView destruction paths, so that seems to be it.

I'm attaching a patch that detaches the EphyWebView from its extension proxy on ::dispose()

Comment 35 Carlos Garnacho 2015-07-02 14:16:43 UTC

Created attachment 306624 [details] [review]
ephy-web-view: Clean up after outliving web extension proxies

If the EphyWebView is destroyed before the EphyWebExtensionProxy
it's attached to does, we'll leave a dangling weak pointer, which
will nullify random memory at the time the web extension proxy
is actually destroyed.

So, prepare for undoing the effects of page_created_cb() in case
we ::dispose() when we still have a web extension attached.

Comment 36 Carlos Garnacho 2015-07-02 14:19:57 UTC

Moving back to ephy.

NB: I found other suspicious weak refs that might fall into the same mistake, didn't chase these too hard though.

Comment 37 Carlos Garnacho 2015-07-02 15:32:02 UTC

Created attachment 306643 [details] [review]
ephy-web-view: Clean up after outliving web extension proxies

If the EphyWebView is destroyed before the EphyWebExtensionProxy
it's attached to does, we'll leave a dangling weak pointer, which
will nullify random memory at the time the web extension proxy
is actually destroyed.

So, prepare for undoing the effects of page_created_cb() in case
we ::dispose() when we still have a web extension attached.

Comment 38 Carlos Garcia Campos 2015-07-02 15:39:54 UTC

Review of attachment 306643 [details] [review]:

How can this cause the CSS value to be null? I'm confused.

::: embed/ephy-web-view.c
@@ +751,3 @@
+  if (priv->web_extension)
+    {
+      g_object_remove_weak_pointer (G_OBJECT (priv->web_extension), (gpointer *)&priv->web_extension);

Good catch!

@@ +752,3 @@
+    {
+      g_object_remove_weak_pointer (G_OBJECT (priv->web_extension), (gpointer *)&priv->web_extension);
+      g_signal_handlers_disconnect_by_data (ephy_embed_shell_get_default (), object);

Do we really need this? We are using g_signal_connect_object for all EphyEmbedShell signals we connect to.

Comment 39 Carlos Garnacho 2015-07-02 15:50:22 UTC

(In reply to Carlos Garcia Campos from comment #38)
> Review of attachment 306643 [details] [review] [review]:
> 
> How can this cause the CSS value to be null? I'm confused.

As far as I gathered, there's plenty of css stuff going on around opening/closing tabs (eg. button sensitivities triggering animations...), so the memory from the EphyWebView is reused for mid-animation style calculations. somewhat later the extension proxy is freed, and NULLifies that memory.

I think that's why it was so hard to spot on valgrind, because the nullified pointer was actually valid memory, just not the one we expected. The warning in my comment above is from one of those times when that memory was still invalid though.

> 
> ::: embed/ephy-web-view.c
> @@ +751,3 @@
> +  if (priv->web_extension)
> +    {
> +      g_object_remove_weak_pointer (G_OBJECT (priv->web_extension),
> (gpointer *)&priv->web_extension);
> 
> Good catch!

Thanks :), took a while of head scratching.

> 
> @@ +752,3 @@
> +    {
> +      g_object_remove_weak_pointer (G_OBJECT (priv->web_extension),
> (gpointer *)&priv->web_extension);
> +      g_signal_handlers_disconnect_by_data (ephy_embed_shell_get_default
> (), object);
> 
> Do we really need this? We are using g_signal_connect_object for all
> EphyEmbedShell signals we connect to.

Right, I was a bit overeager there :). Let me update the patch again.

Comment 40 Carlos Garnacho 2015-07-02 15:51:03 UTC

Created attachment 306647 [details] [review]
ephy-web-view: Clean up after outliving web extension proxies

If the EphyWebView is destroyed before the EphyWebExtensionProxy
it's attached to does, we'll leave a dangling weak pointer, which
will nullify random memory at the time the web extension proxy
is actually destroyed.

So, prepare for undoing the effects of page_created_cb() in case
we ::dispose() when we still have a web extension attached.

Comment 41 Carlos Garcia Campos 2015-07-02 15:56:46 UTC

Review of attachment 306647 [details] [review]:

Please, push it to both branches. I owe you a few beers :-)

Comment 42 Carlos Garnacho 2015-07-02 16:07:43 UTC

Yay! now on master/gnome-3-16

Attachment 306647 [details] pushed as 3596b8e - ephy-web-view: Clean up after outliving web extension proxies

Comment 43 Michael Catanzaro 2015-07-02 16:12:59 UTC

A lot of people owe you a lot of beers.

I saw those warnings from valgrind but ignored them as probably unrelated. Lesson learned.

Comment 44 Michael Catanzaro 2016-01-10 16:35:07 UTC

*** Bug 746383 has been marked as a duplicate of this bug. ***