After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.
Bug 797136 - NetworkManager 1.12.2 dumps core due to heap corruption
NetworkManager 1.12.2 dumps core due to heap corruption
Status: RESOLVED FIXED
Product: NetworkManager
Classification: Platform
Component: general
unspecified
Other Linux
: Normal critical
: ---
Assigned To: NetworkManager maintainer(s)
NetworkManager maintainer(s)
Depends on:
Blocks:
 
 
Reported: 2018-09-13 08:46 UTC by Martin Wilck
Modified: 2018-09-26 20:50 UTC
See Also:
GNOME target: ---
GNOME version: ---



Description Martin Wilck 2018-09-13 08:46:52 UTC
I'm forwarding this from

https://bugzilla.opensuse.org/show_bug.cgi?id=1107197

Several users have seen sporadic core dumps since NetworkManager was updated to 1.12.2 in openSUSE Factory.

Crashes occur typically in malloc(), often in arguably harmless code paths like curl_easy_init():

  • #0 __GI_raise
    at ../sysdeps/unix/sysv/linux/raise.c line 50
  • #1 __GI_abort
    at abort.c line 79
  • #2 __libc_message
    at ../sysdeps/posix/libc_fatal.c line 181
  • #3 malloc_printerr
    at malloc.c line 5350
  • #4 _int_malloc
    at malloc.c line 3926
  • #5 __libc_calloc
    at malloc.c line 3436
  • #6 ??
    from /usr/lib64/libcurl.so.4
  • #7 curl_easy_init
    from /usr/lib64/libcurl.so.4
  • #8 nm_connectivity_check_start
    at src/nm-connectivity.c line 510
  • #9 concheck_start
    at src/devices/nm-device.c line 2772
  • #10 concheck_periodic_schedule_set
    at src/devices/nm-device.c line 2439
  • #11 concheck_periodic_timeout_cb
    at src/devices/nm-device.c line 2265
  • #12 g_timeout_dispatch
    at gmain.c line 4649
  • #13 g_main_dispatch
    at gmain.c line 3176
  • #14 g_main_context_dispatch
    at gmain.c line 3829
  • #15 g_main_context_iterate
    at gmain.c line 3902
  • #16 g_main_loop_run
    at gmain.c line 4098
  • #17 main
    at src/main.c line 438

This indicates some sort of heap corruption. As the error was frequently reported in the context of connectivity checking, a possible suspect would be
commit d8a31794c8b9 "connectivity: rework async connectivity check requests". But that's just a guess so far.

One user has run NM under valgrind (shortened here, full log at https://bugzilla.opensuse.org/show_bug.cgi?id=1107197#c13)

Invalid write of size 8
   at 0x52267BE: curl_multi_remove_handle (in /usr/lib64/libcurl.so.4.5.0)
   by 0x283072: cb_data_free (nm-connectivity.c:208)
   by 0x2832EA: _con_curl_check_connectivity (nm-connectivity.c:270)
 Address 0x8d10420 is 0 bytes inside a block of size 21,336 free'd
   at 0x40319AB: free (in /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so)
   by 0x5210900: ??? (in /usr/lib64/libcurl.so.4.5.0)
   by 0x521ECAC: curl_easy_cleanup (in /usr/lib64/libcurl.so.4.5.0)
   by 0x28307B: cb_data_free (nm-connectivity.c:209)
   by 0x283C71: nm_connectivity_check_cancel (nm-connectivity.c:574)

Invalid read of size 4
   by 0x52287C4: curl_multi_socket_action (in /usr/lib64/libcurl.so.4.5.0)
   by 0x283110: _con_curl_check_connectivity (nm-connectivity.c:249)
 Address 0x937f610 is 21,328 bytes inside a block of size 21,336 free'd
   at 0x40319AB: free (in /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so)
   by 0x5210900: ??? (in /usr/lib64/libcurl.so.4.5.0)
   by 0x521ECAC: curl_easy_cleanup (in /usr/lib64/libcurl.so.4.5.0)
   by 0x28307B: cb_data_free (nm-connectivity.c:209)
   by 0x283C71: nm_connectivity_check_cancel (nm-connectivity.c:574)

Invalid write of size 8
   at 0x5225E76: curl_multi_add_handle (in /usr/lib64/libcurl.so.4.5.0)
   by 0x283B4A: nm_connectivity_check_start (nm-connectivity.c:541)
   by 0x2190EC: concheck_start (nm-device.c:2772)
   by 0x219222: concheck_periodic_schedule_set (nm-device.c:2439)
   by 0x21937D: concheck_periodic_timeout_cb (nm-device.c:2265)
 Address 0x8d10420 is 0 bytes inside a block of size 21,336 free'd
   at 0x40319AB: free (in /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so)
   by 0x5210900: ??? (in /usr/lib64/libcurl.so.4.5.0)
   by 0x521ECAC: curl_easy_cleanup (in /usr/lib64/libcurl.so.4.5.0)
   by 0x28307B: cb_data_free (nm-connectivity.c:209)
   by 0x283C71: nm_connectivity_check_cancel (nm-connectivity.c:574)

Invalid write of size 8
   at 0x52267BE: curl_multi_remove_handle (in /usr/lib64/libcurl.so.4.5.0)
   by 0x283072: cb_data_free (nm-connectivity.c:208)
   by 0x2835C6: _timeout_cb (nm-connectivity.c:473)
 Address 0x8d10420 is 0 bytes inside a block of size 21,336 free'd
   at 0x40319AB: free (in /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so)
   by 0x5210900: ??? (in /usr/lib64/libcurl.so.4.5.0)
   by 0x521ECAC: curl_easy_cleanup (in /usr/lib64/libcurl.so.4.5.0)
   by 0x28307B: cb_data_free (nm-connectivity.c:209)
   by 0x283C71: nm_connectivity_check_cancel (nm-connectivity.c:574)
   by 0x214AE8: concheck_handle_complete (nm-device.c:2601)
   by 0x222157: concheck_cb (nm-device.c:2725)

Invalid write of size 8
   at 0x52267BE: curl_multi_remove_handle (in /usr/lib64/libcurl.so.4.5.0)
   by 0x283072: cb_data_free (nm-connectivity.c:208)
   by 0x283284: _con_curl_check_connectivity (nm-connectivity.c:275)
   by 0x28339C: _con_curl_socketevent_cb (nm-connectivity.c:359)
 Address 0xd1ab620 is 21,600 bytes inside a block of size 32,768 free'd
   at 0x40319AB: free (in /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so)
   by 0x18A932: _nm_auto_free_impl (nm-macros-internal.h:76)
   by 0x18A932: event_handler_recvmsgs (nm-linux-platform.c:6579)
   by 0x18E07C: event_handler_read_netlink (nm-linux-platform.c:6790)
   by 0x18F09C: delayed_action_handle_WAIT_FOR_NL_RESPONSE (nm-linux-platform.c:3594)
   by 0x18F09C: delayed_action_handle_one (nm-linux-platform.c:3669)
   by 0x18F09C: delayed_action_handle_all (nm-linux-platform.c:3687)
   by 0x191D78: do_add_addrroute (nm-linux-platform.c:4577)
   by 0x192169: ip6_address_add (nm-linux-platform.c:6273)
   by 0x1A5097: nm_platform_ip6_address_add (nm-platform.c:3181)
   by 0x1A91D1: nm_platform_ip6_address_sync (nm-platform.c:3798)
   by 0x18183B: nm_ip6_config_commit (nm-ip6-config.c:549)
   by 0x21FC77: nm_device_set_ip_config (nm-device.c:11397)
   by 0x221318: ip_config_merge_and_apply (nm-device.c:6785)
   by 0x2345D4: activate_stage5_ip6_config_commit (nm-device.c:9964)


All but the last trace shows libcurl4-internal callbacks accessing data freed in 
nm_connectivity_check_cancel(). The last one also occurs in a libcurl callback related to connectivity checking, but accesses a block freed elswhere.
Comment 1 Thomas Haller 2018-09-13 10:05:30 UTC
Some notes:

Yes, the issue seems to be in src/nm-connectivity.c.
Which could be (partially) verified by disabling connectivity check:

 $ busctl set-property org.freedesktop.NetworkManager /org/freedesktop/NetworkManager org.freedesktop.NetworkManager ConnectivityCheckEnabled "b" 0

and restarting NM (in case memory is already corrupted).

I wouldn't call this a workaround, but probably it can be used to avoid the crash until this is fixed.


Note that current upstream branches "master", "nm-1-12", and release "1.12.2" are very similar for src/nm-connectivity.c. It seems the issue would apply to them all.

Between 1.12.0 and 1.12.2 a notable fix for connectivity checking was https://cgit.freedesktop.org/NetworkManager/NetworkManager/commit/?id=660e308dd28fb502e2f9e58044684c2452cf6f6e . Not sure that is related.
Comment 2 Thomas Haller 2018-09-13 11:40:49 UTC
The valgrind output is helpful. Thanks!!!


First, nm_connectivity_check_cancel() destroys the pending connectivity check, including calling curl_multi_remove_handle() [1].

Later, the file descriptor of libcurl has something to read, and we iterate over the curl handles that are affected [2]. Seems at that point, the already removed handle is returned again, leading to a crash.


[1] https://cgit.freedesktop.org/NetworkManager/NetworkManager/tree/src/nm-connectivity.c?id=4a1ff8ad8c543a76b2430a79fb8ffee52e8d3f56#n191
[2] https://cgit.freedesktop.org/NetworkManager/NetworkManager/tree/src/nm-connectivity.c?id=4a1ff8ad8c543a76b2430a79fb8ffee52e8d3f56#n255



What exact libcurl version is this? Does it have the patch

https://github.com/curl/curl/commit/963d4583eff7a22f74e319746bec52f7ddb3dddc

? If not, any chance of trying that?
Comment 3 Martin Wilck 2018-09-13 12:05:23 UTC
(In reply to Thomas Haller from comment #2)

> What exact libcurl version is this? Does it have the patch
> 
> https://github.com/curl/curl/commit/963d4583eff7a22f74e319746bec52f7ddb3dddc
> 

No. openSUSE factory is currently at 7.60.0.

>  ? If not, any chance of trying that?

Probably, yes.
Comment 4 Martin Wilck 2018-09-13 12:14:35 UTC
The commit you reference seems to be out of date. I've found
this:

https://github.com/curl/curl/commit/4c901638b4d7bab5e5a847f6970425d9c1f2ea2d

This is in 7.61_0. openSUSE has 7.61.1 as experimental package under 

https://download.opensuse.org/repositories/devel:/libraries:/c_c++/openSUSE_Factory/devel:libraries:c_c++.repo

I've asked the reporter to try that.
Comment 5 Martin Wilck 2018-09-13 12:23:46 UTC
(In reply to Martin Wilck from comment #3)
> (In reply to Thomas Haller from comment #2)
> 
> > What exact libcurl version is this? Does it have the patch
> > 
> > https://github.com/curl/curl/commit/963d4583eff7a22f74e319746bec52f7ddb3dddc
> > 
> 
> No. openSUSE factory is currently at 7.60.0.
> 

WRONG information. Factory has 7.61.1 already, and thus should include the above fix.
Comment 6 Thomas Haller 2018-09-13 16:15:30 UTC
I am unable to reproduce this crash, so far.
Comment 7 Martin Wilck 2018-09-13 20:58:00 UTC
Yeah, I'm told it's sporadic. Haven't seen it myself either.
Comment 8 Thomas Haller 2018-09-14 05:41:13 UTC
Can somebody confirm, that disabling connectivity check avoids the crash? (comment 1).

Very likely the issue is in connectivity checking, but it would be good to be sure (with memory corruption you never know).
Comment 9 Yifan J 2018-09-15 02:19:26 UTC
(In reply to Thomas Haller from comment #8)
> Can somebody confirm, that disabling connectivity check avoids the crash?
> (comment 1).
> 
> Very likely the issue is in connectivity checking, but it would be good to
> be sure (with memory corruption you never know).

I have updated the ConnectivityCheckEnabled property to false since yesterday, no coredump appeared after 16 hours. In the past days, it usually crashes 7-8 times every evening at home, especially when I connect to VPN using third-party tools.
Comment 10 Thomas Haller 2018-09-15 08:14:04 UTC
Can somebody provide a link to the SRPM of NetworkManager + libcurl which are affected? I got lost searching for it in the Openbuild web frontent (and the exact version numbers were not clear to me to begin with). Thanks.
Comment 11 Beniamino Galvani 2018-09-15 12:09:37 UTC
I think the problem is the following: here cb_data gets freed from the
easy_header callback function:

==24572==    at 0x40319AB: free (in /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so)
==24572==    by 0x52DDAE5: Curl_close (url.c:392)
==24572==    by 0x52EC02C: curl_easy_cleanup (easy.c:825)
==24572==    by 0x5FDCD2: cb_data_free (nm-connectivity.c:215)
==24572==    by 0x5FF6DE: nm_connectivity_check_cancel (nm-connectivity.c:585)
==24572==    by 0x55F7F9: concheck_handle_complete (nm-device.c:2601)
==24572==    by 0x574C12: concheck_cb (nm-device.c:2725)
==24572==    by 0x5FD887: cb_data_invoke_callback (nm-connectivity.c:167)
==24572==    by 0x5FD959: easy_header_cb (nm-connectivity.c:435)
==24572==    by 0x52D73CB: chop_write (sendf.c:612)
==24572==    by 0x52D73CB: Curl_client_write (sendf.c:668)
==24572==    by 0x52D54ED: Curl_http_readwrite_headers (http.c:3904)
==24572==    by 0x52E9EA7: readwrite_data (transfer.c:548)
==24572==    by 0x52E9EA7: Curl_readwrite (transfer.c:1161)
==24572==    by 0x52F4193: multi_runsingle (multi.c:1915)
==24572==    by 0x52F5531: multi_socket (multi.c:2607)
==24572==    by 0x52F5804: curl_multi_socket_action (multi.c:2771)

Since we are in a callback, we can't call curl API functions and the
curl_multi_remove_handle() fails.  However, curl_easy_cleanup()
succeeds. This leaves a stale entry in the internal list of easy
handles.

Then, when trying to remove another easy handle, the stale entry is
accessed causing a segfault:

==24572== Invalid write of size 8
==24572==    at 0x52F5A6E: curl_multi_remove_handle (multi.c:791)
==24572==    by 0x5FDC6D: cb_data_free (nm-connectivity.c:210)
==24572==    by 0x5FDE74: _con_curl_check_connectivity (nm-connectivity.c:277)
==24572==    by 0x5FE3C3: _con_curl_socketevent_cb (nm-connectivity.c:368)
==24572==    by 0x4F05EA7: g_main_context_dispatch (in /usr/lib64/libglib-2.0.so.0.5600.2)
==24572==    by 0x4F06297: ??? (in /usr/lib64/libglib-2.0.so.0.5600.2)
==24572==    by 0x4F06591: g_main_loop_run (in /usr/lib64/libglib-2.0.so.0.5600.2)
==24572==    by 0x41DD83: main (main.c:438)

I'll prepare a fix for this.
Comment 12 Thomas Haller 2018-09-15 16:54:36 UTC
==24572==    by 0x5FF6DE: nm_connectivity_check_cancel (nm-connectivity.c:585)
==24572==    by 0x55F7F9: concheck_handle_complete (nm-device.c:2601)

A handle can obviously only be canceled zero or one time, and must not be used *after* the callback (cb_data_invoke_callback) or nm_connectivity_check_cancel() returns. Because at that point, the handle is (possibly) already freed.

However, we can tighten that requirement up: a handle can only be canceled *before* the callback is invoked (but you cannot cancel it from within the callback). With that API, the bug would be in nm-device, which tries to cancel the handle that is already about to to complete. With this second interpretation, the implementation in NMConnectivity becomes simpler, while possibly moving complexity to the caller (NMDevice).
Comment 13 Thomas Haller 2018-09-15 19:53:30 UTC
==24572==    by 0x52DDAE5: Curl_close (url.c:392)
==24572==    by 0x52EC02C: curl_easy_cleanup (easy.c:825)
==24572==    by 0x5FDCD2: cb_data_free (nm-connectivity.c:215)
==24572==    by 0x5FF6DE: nm_connectivity_check_cancel (nm-connectivity.c:585)
==24572==    by 0x55F7F9: concheck_handle_complete (nm-device.c:2601)
==24572==    by 0x574C12: concheck_cb (nm-device.c:2725)
==24572==    by 0x5FD887: cb_data_invoke_callback (nm-connectivity.c:167)
==24572==    by 0x5FD959: easy_header_cb (nm-connectivity.c:435)
==24572==    by 0x52D73CB: chop_write (sendf.c:612)

rethinking. Now that is odd.

cb_data_invoke_callback() clears cb_data->callback before invoking the callback,
and nm_connectivity_check_cancel() asserts that the callback is still set.

That goes in line with what I said before, how it's intended to be: you cannot cancel the check from within the callback. But I don't see how above trace can happen, unless the memory corruption already happened before that.
Comment 14 Thomas Haller 2018-09-15 20:41:45 UTC
Ah ok. It is canceling another handle :)

NMDevice should be allowed to do that...
Comment 16 Thomas Haller 2018-09-17 09:10:10 UTC
On review: https://github.com/NetworkManager/NetworkManager/pull/207
Comment 18 Martin Wilck 2018-09-26 20:50:24 UTC
Thanks a lot for fixing this!