GNOME Bugzilla – Bug 797136
NetworkManager 1.12.2 dumps core due to heap corruption
Last modified: 2018-09-26 20:50:24 UTC
I'm forwarding this from https://bugzilla.opensuse.org/show_bug.cgi?id=1107197 Several users have seen sporadic core dumps since NetworkManager was updated to 1.12.2 in openSUSE Factory. Crashes occur typically in malloc(), often in arguably harmless code paths like curl_easy_init():
+ Trace 238689
This indicates some sort of heap corruption. As the error was frequently reported in the context of connectivity checking, a possible suspect would be commit d8a31794c8b9 "connectivity: rework async connectivity check requests". But that's just a guess so far. One user has run NM under valgrind (shortened here, full log at https://bugzilla.opensuse.org/show_bug.cgi?id=1107197#c13) Invalid write of size 8 at 0x52267BE: curl_multi_remove_handle (in /usr/lib64/libcurl.so.4.5.0) by 0x283072: cb_data_free (nm-connectivity.c:208) by 0x2832EA: _con_curl_check_connectivity (nm-connectivity.c:270) Address 0x8d10420 is 0 bytes inside a block of size 21,336 free'd at 0x40319AB: free (in /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so) by 0x5210900: ??? (in /usr/lib64/libcurl.so.4.5.0) by 0x521ECAC: curl_easy_cleanup (in /usr/lib64/libcurl.so.4.5.0) by 0x28307B: cb_data_free (nm-connectivity.c:209) by 0x283C71: nm_connectivity_check_cancel (nm-connectivity.c:574) Invalid read of size 4 by 0x52287C4: curl_multi_socket_action (in /usr/lib64/libcurl.so.4.5.0) by 0x283110: _con_curl_check_connectivity (nm-connectivity.c:249) Address 0x937f610 is 21,328 bytes inside a block of size 21,336 free'd at 0x40319AB: free (in /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so) by 0x5210900: ??? (in /usr/lib64/libcurl.so.4.5.0) by 0x521ECAC: curl_easy_cleanup (in /usr/lib64/libcurl.so.4.5.0) by 0x28307B: cb_data_free (nm-connectivity.c:209) by 0x283C71: nm_connectivity_check_cancel (nm-connectivity.c:574) Invalid write of size 8 at 0x5225E76: curl_multi_add_handle (in /usr/lib64/libcurl.so.4.5.0) by 0x283B4A: nm_connectivity_check_start (nm-connectivity.c:541) by 0x2190EC: concheck_start (nm-device.c:2772) by 0x219222: concheck_periodic_schedule_set (nm-device.c:2439) by 0x21937D: concheck_periodic_timeout_cb (nm-device.c:2265) Address 0x8d10420 is 0 bytes inside a block of size 21,336 free'd at 0x40319AB: free (in /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so) by 0x5210900: ??? (in /usr/lib64/libcurl.so.4.5.0) by 0x521ECAC: curl_easy_cleanup (in /usr/lib64/libcurl.so.4.5.0) by 0x28307B: cb_data_free (nm-connectivity.c:209) by 0x283C71: nm_connectivity_check_cancel (nm-connectivity.c:574) Invalid write of size 8 at 0x52267BE: curl_multi_remove_handle (in /usr/lib64/libcurl.so.4.5.0) by 0x283072: cb_data_free (nm-connectivity.c:208) by 0x2835C6: _timeout_cb (nm-connectivity.c:473) Address 0x8d10420 is 0 bytes inside a block of size 21,336 free'd at 0x40319AB: free (in /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so) by 0x5210900: ??? (in /usr/lib64/libcurl.so.4.5.0) by 0x521ECAC: curl_easy_cleanup (in /usr/lib64/libcurl.so.4.5.0) by 0x28307B: cb_data_free (nm-connectivity.c:209) by 0x283C71: nm_connectivity_check_cancel (nm-connectivity.c:574) by 0x214AE8: concheck_handle_complete (nm-device.c:2601) by 0x222157: concheck_cb (nm-device.c:2725) Invalid write of size 8 at 0x52267BE: curl_multi_remove_handle (in /usr/lib64/libcurl.so.4.5.0) by 0x283072: cb_data_free (nm-connectivity.c:208) by 0x283284: _con_curl_check_connectivity (nm-connectivity.c:275) by 0x28339C: _con_curl_socketevent_cb (nm-connectivity.c:359) Address 0xd1ab620 is 21,600 bytes inside a block of size 32,768 free'd at 0x40319AB: free (in /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so) by 0x18A932: _nm_auto_free_impl (nm-macros-internal.h:76) by 0x18A932: event_handler_recvmsgs (nm-linux-platform.c:6579) by 0x18E07C: event_handler_read_netlink (nm-linux-platform.c:6790) by 0x18F09C: delayed_action_handle_WAIT_FOR_NL_RESPONSE (nm-linux-platform.c:3594) by 0x18F09C: delayed_action_handle_one (nm-linux-platform.c:3669) by 0x18F09C: delayed_action_handle_all (nm-linux-platform.c:3687) by 0x191D78: do_add_addrroute (nm-linux-platform.c:4577) by 0x192169: ip6_address_add (nm-linux-platform.c:6273) by 0x1A5097: nm_platform_ip6_address_add (nm-platform.c:3181) by 0x1A91D1: nm_platform_ip6_address_sync (nm-platform.c:3798) by 0x18183B: nm_ip6_config_commit (nm-ip6-config.c:549) by 0x21FC77: nm_device_set_ip_config (nm-device.c:11397) by 0x221318: ip_config_merge_and_apply (nm-device.c:6785) by 0x2345D4: activate_stage5_ip6_config_commit (nm-device.c:9964) All but the last trace shows libcurl4-internal callbacks accessing data freed in nm_connectivity_check_cancel(). The last one also occurs in a libcurl callback related to connectivity checking, but accesses a block freed elswhere.
Some notes: Yes, the issue seems to be in src/nm-connectivity.c. Which could be (partially) verified by disabling connectivity check: $ busctl set-property org.freedesktop.NetworkManager /org/freedesktop/NetworkManager org.freedesktop.NetworkManager ConnectivityCheckEnabled "b" 0 and restarting NM (in case memory is already corrupted). I wouldn't call this a workaround, but probably it can be used to avoid the crash until this is fixed. Note that current upstream branches "master", "nm-1-12", and release "1.12.2" are very similar for src/nm-connectivity.c. It seems the issue would apply to them all. Between 1.12.0 and 1.12.2 a notable fix for connectivity checking was https://cgit.freedesktop.org/NetworkManager/NetworkManager/commit/?id=660e308dd28fb502e2f9e58044684c2452cf6f6e . Not sure that is related.
The valgrind output is helpful. Thanks!!! First, nm_connectivity_check_cancel() destroys the pending connectivity check, including calling curl_multi_remove_handle() [1]. Later, the file descriptor of libcurl has something to read, and we iterate over the curl handles that are affected [2]. Seems at that point, the already removed handle is returned again, leading to a crash. [1] https://cgit.freedesktop.org/NetworkManager/NetworkManager/tree/src/nm-connectivity.c?id=4a1ff8ad8c543a76b2430a79fb8ffee52e8d3f56#n191 [2] https://cgit.freedesktop.org/NetworkManager/NetworkManager/tree/src/nm-connectivity.c?id=4a1ff8ad8c543a76b2430a79fb8ffee52e8d3f56#n255 What exact libcurl version is this? Does it have the patch https://github.com/curl/curl/commit/963d4583eff7a22f74e319746bec52f7ddb3dddc ? If not, any chance of trying that?
(In reply to Thomas Haller from comment #2) > What exact libcurl version is this? Does it have the patch > > https://github.com/curl/curl/commit/963d4583eff7a22f74e319746bec52f7ddb3dddc > No. openSUSE factory is currently at 7.60.0. > ? If not, any chance of trying that? Probably, yes.
The commit you reference seems to be out of date. I've found this: https://github.com/curl/curl/commit/4c901638b4d7bab5e5a847f6970425d9c1f2ea2d This is in 7.61_0. openSUSE has 7.61.1 as experimental package under https://download.opensuse.org/repositories/devel:/libraries:/c_c++/openSUSE_Factory/devel:libraries:c_c++.repo I've asked the reporter to try that.
(In reply to Martin Wilck from comment #3) > (In reply to Thomas Haller from comment #2) > > > What exact libcurl version is this? Does it have the patch > > > > https://github.com/curl/curl/commit/963d4583eff7a22f74e319746bec52f7ddb3dddc > > > > No. openSUSE factory is currently at 7.60.0. > WRONG information. Factory has 7.61.1 already, and thus should include the above fix.
I am unable to reproduce this crash, so far.
Yeah, I'm told it's sporadic. Haven't seen it myself either.
Can somebody confirm, that disabling connectivity check avoids the crash? (comment 1). Very likely the issue is in connectivity checking, but it would be good to be sure (with memory corruption you never know).
(In reply to Thomas Haller from comment #8) > Can somebody confirm, that disabling connectivity check avoids the crash? > (comment 1). > > Very likely the issue is in connectivity checking, but it would be good to > be sure (with memory corruption you never know). I have updated the ConnectivityCheckEnabled property to false since yesterday, no coredump appeared after 16 hours. In the past days, it usually crashes 7-8 times every evening at home, especially when I connect to VPN using third-party tools.
Can somebody provide a link to the SRPM of NetworkManager + libcurl which are affected? I got lost searching for it in the Openbuild web frontent (and the exact version numbers were not clear to me to begin with). Thanks.
I think the problem is the following: here cb_data gets freed from the easy_header callback function: ==24572== at 0x40319AB: free (in /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so) ==24572== by 0x52DDAE5: Curl_close (url.c:392) ==24572== by 0x52EC02C: curl_easy_cleanup (easy.c:825) ==24572== by 0x5FDCD2: cb_data_free (nm-connectivity.c:215) ==24572== by 0x5FF6DE: nm_connectivity_check_cancel (nm-connectivity.c:585) ==24572== by 0x55F7F9: concheck_handle_complete (nm-device.c:2601) ==24572== by 0x574C12: concheck_cb (nm-device.c:2725) ==24572== by 0x5FD887: cb_data_invoke_callback (nm-connectivity.c:167) ==24572== by 0x5FD959: easy_header_cb (nm-connectivity.c:435) ==24572== by 0x52D73CB: chop_write (sendf.c:612) ==24572== by 0x52D73CB: Curl_client_write (sendf.c:668) ==24572== by 0x52D54ED: Curl_http_readwrite_headers (http.c:3904) ==24572== by 0x52E9EA7: readwrite_data (transfer.c:548) ==24572== by 0x52E9EA7: Curl_readwrite (transfer.c:1161) ==24572== by 0x52F4193: multi_runsingle (multi.c:1915) ==24572== by 0x52F5531: multi_socket (multi.c:2607) ==24572== by 0x52F5804: curl_multi_socket_action (multi.c:2771) Since we are in a callback, we can't call curl API functions and the curl_multi_remove_handle() fails. However, curl_easy_cleanup() succeeds. This leaves a stale entry in the internal list of easy handles. Then, when trying to remove another easy handle, the stale entry is accessed causing a segfault: ==24572== Invalid write of size 8 ==24572== at 0x52F5A6E: curl_multi_remove_handle (multi.c:791) ==24572== by 0x5FDC6D: cb_data_free (nm-connectivity.c:210) ==24572== by 0x5FDE74: _con_curl_check_connectivity (nm-connectivity.c:277) ==24572== by 0x5FE3C3: _con_curl_socketevent_cb (nm-connectivity.c:368) ==24572== by 0x4F05EA7: g_main_context_dispatch (in /usr/lib64/libglib-2.0.so.0.5600.2) ==24572== by 0x4F06297: ??? (in /usr/lib64/libglib-2.0.so.0.5600.2) ==24572== by 0x4F06591: g_main_loop_run (in /usr/lib64/libglib-2.0.so.0.5600.2) ==24572== by 0x41DD83: main (main.c:438) I'll prepare a fix for this.
==24572== by 0x5FF6DE: nm_connectivity_check_cancel (nm-connectivity.c:585) ==24572== by 0x55F7F9: concheck_handle_complete (nm-device.c:2601) A handle can obviously only be canceled zero or one time, and must not be used *after* the callback (cb_data_invoke_callback) or nm_connectivity_check_cancel() returns. Because at that point, the handle is (possibly) already freed. However, we can tighten that requirement up: a handle can only be canceled *before* the callback is invoked (but you cannot cancel it from within the callback). With that API, the bug would be in nm-device, which tries to cancel the handle that is already about to to complete. With this second interpretation, the implementation in NMConnectivity becomes simpler, while possibly moving complexity to the caller (NMDevice).
==24572== by 0x52DDAE5: Curl_close (url.c:392) ==24572== by 0x52EC02C: curl_easy_cleanup (easy.c:825) ==24572== by 0x5FDCD2: cb_data_free (nm-connectivity.c:215) ==24572== by 0x5FF6DE: nm_connectivity_check_cancel (nm-connectivity.c:585) ==24572== by 0x55F7F9: concheck_handle_complete (nm-device.c:2601) ==24572== by 0x574C12: concheck_cb (nm-device.c:2725) ==24572== by 0x5FD887: cb_data_invoke_callback (nm-connectivity.c:167) ==24572== by 0x5FD959: easy_header_cb (nm-connectivity.c:435) ==24572== by 0x52D73CB: chop_write (sendf.c:612) rethinking. Now that is odd. cb_data_invoke_callback() clears cb_data->callback before invoking the callback, and nm_connectivity_check_cancel() asserts that the callback is still set. That goes in line with what I said before, how it's intended to be: you cannot cancel the check from within the callback. But I don't see how above trace can happen, unless the memory corruption already happened before that.
Ah ok. It is canceling another handle :) NMDevice should be allowed to do that...
See also: https://bugs.launchpad.net/ubuntu/+source/network-manager/+bug/1792745
On review: https://github.com/NetworkManager/NetworkManager/pull/207
Should be fixed now: master: https://cgit.freedesktop.org/NetworkManager/NetworkManager/commit/?id=fa40fc6d765248194d692a090b3abfbea6fe4e8f nm-1-14: https://cgit.freedesktop.org/NetworkManager/NetworkManager/commit/?id=7f05debf99ab30227180acd7b24a363013c957f2 nm-1-12: https://cgit.freedesktop.org/NetworkManager/NetworkManager/commit/?id=3dbd264418f235dd11e7802ae6f3970734653ab8 thanks for the detailed bug report!!
Thanks a lot for fixing this!