After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.
Bug 735325 - NM repeatedly updates default route, causing GNetworkMonitor using apps to go crazy
NM repeatedly updates default route, causing GNetworkMonitor using apps to go...
Status: RESOLVED FIXED
Product: NetworkManager
Classification: Platform
Component: IP and DNS config
0.9.x
Other Linux
: Normal normal
: ---
Assigned To: Dan Winship
NetworkManager maintainer(s)
Depends on:
Blocks:
 
 
Reported: 2014-08-24 13:37 UTC by Elad Alfassa
Modified: 2014-11-07 15:23 UTC
See Also:
GNOME target: 3.14
GNOME version: ---


Attachments
strace (1.20 MB, text/plain)
2014-08-24 13:46 UTC, Elad Alfassa
Details

Description Elad Alfassa 2014-08-24 13:37:17 UTC
I don't use Evolution. I have my calendar synced because I set my google account in GOA, but it's a small calendar with not much data.

Recently, when not doing ANYTHING that's related to calendars or evolution in any way, I noticed my computer is overheating (99°C !).

According to `top`, evolution-calendar-factory is using 100% CPU during that time for very long periods of time. This wastes energy, overheats laptops, and drains batteries.

I attached gdb to the process to see what the hell it's doing, and got this backtrace:

  • #0 g_list_sort_real
  • #1 g_list_sort_real
  • #2 g_list_sort_real
  • #3 g_queue_sort
  • #4 g_async_queue_sort_unlocked
  • #5 g_thread_pool_set_sort_function
  • #6 task_thread_cancelled
  • #7 _g_closure_invoke_va
  • #8 g_signal_emit_valist
  • #9 g_signal_emit
  • #10 g_cancellable_cancel
  • #11 backend_update_online_state_idle_cb
  • #12 g_main_context_dispatch
  • #13 g_main_context_iterate.isra
  • #14 g_main_loop_run
  • #15 dbus_server_run_server
  • #16 ffi_call_unix64
  • #17 ffi_call
  • #18 g_cclosure_marshal_generic_va
  • #19 _g_closure_invoke_va
  • #20 g_signal_emit_valist
  • #21 g_signal_emit
  • #22 e_dbus_server_run
  • #23 main


This is extremely annoying and should be fixed soon, preferably before my laptop melts down :)
Comment 1 Elad Alfassa 2014-08-24 13:46:40 UTC
Created attachment 284342 [details]
strace

Attached strace dump of few seconds of that process. It keeps coming back after I kill it, that's really not good!
Comment 2 Matthew Barnes 2014-08-24 14:36:12 UTC
Looks like GNetworkMonitor (or something below it) is stuck in a loop.

strace shows something stat()'ing "/etc/resolv.conf", followed by some I/O, followed by a read() call returning EAGAIN, then it all repeats.

E-D-S is apparently being spammed with GNetworkMonitor::network-changed signals, and is simply reacting to them.

Dan Winship can probably make more sense of the strace.
Comment 3 Matthew Barnes 2014-08-24 14:41:54 UTC
You might be able to suppress the problem for now by restarting the calendar factory from the command-line like so:

GIO_USE_NETWORK_MONITOR=base /usr/libexec/evolution-calendar-factory -r
Comment 4 Elad Alfassa 2014-08-28 12:54:09 UTC
It appears that the GNetworkMonitor::network-changed is spamming other applications as well, I can see it in Software too.
Comment 5 Dan Winship 2014-08-30 17:56:49 UTC
(In reply to comment #2)
> E-D-S is apparently being spammed with GNetworkMonitor::network-changed
> signals, and is simply reacting to them.

Which is because GNetworkMonitor is apparently being spammed with routing change notifications from the kernel, and is simply reacting to them...

The strace output is truncated so I can't fully decode the messages, but the headers at least are there:

recvmsg(9, {msg_name(0)=NULL, msg_iov(1)=[{"l\0\0\0\30\0\0\0\216F[T\243\2\300=\n\0\0\0\376\4\0\1\0\0\0\0\10\0\17\0"..., 108}],

which is:
  6c 00 00 00   [ length 108 ]
  18 00 00 00   [ type RTM_NEWROUTE, flags 0 ]
  8e 46 5b 54   [ serial # ]
  a3 02 c0 3d   [ pid 2734866493 ?? ]

  0a 00 00 00   [ AF_INET6, src len 0, dst len 0, TOS 0 ]
  fe 04 00 01   [ RT_TABLE_MAIN, RTPROT_STATIC, RT_SCOPE_UNIVERSE, RTN_UNICAST ]
  00 00 00 00   [ flags 0 ]
  08 00 0f 00
  ...

(The "pid" field should be 0 according to the docs, but maybe those docs are out of date?)

RTPROT_STATIC indicates that this is a route added by a userland tool, not created by the kernel automatically. src len 0 / dst len 0 means that it's the default route. So, something appears to be constantly changing the IPv6 default route. (Or else, the kernel is confused and is repeatedly sending out notifications about the default route changing even though it isn't.)


The netlink messages aren't all identical; there seems to be a repeating cycle of four messages: The second one has serial 0, pid 0, and advertises that a cloned route to a single IP address has been created. The one after that is identical to the first message, except with a higher serial number. The last one is an RTM_DELROUTE for the route added by the second message. Then the cycle repeats.
Comment 6 Elad Alfassa 2014-08-30 18:06:15 UTC
Now when I look at it more closely, it happens only when I'm connected to ethernet at work, where I have IPv4 assigned by DHCP and staticly configured IPv6... I have no idea why the default route would change like that.

Would a more complete strace help diagnose this problem?
Comment 7 Dan Winship 2014-09-02 15:34:11 UTC
(In reply to comment #6)
> Would a more complete strace help diagnose this problem?

Possibly. You could pass "-s 256" to get the full netlink packet contents, and "-f" to get all threads.

It might also be useful to run NetworkManager with debug logging (to see what it's doing/noticing about the routing). Add this to /etc/NetworkManager/NetworkManager.conf:

  [logging]
  level=DEBUG
  domains=DEFAULT

and restart NM.
Comment 8 Elad Alfassa 2014-09-08 07:56:54 UTC
I can't reproduce this issue anymore. Can anyone else? If not I'll close this bug.
Comment 9 Matthias Clasen 2014-09-11 21:49:07 UTC
seems like a no
Comment 10 Dan Winship 2014-10-21 15:21:23 UTC
It turns out this is NM's fault; it's constantly requesting updates from the router and then constantly updating things to reflect that.
Comment 11 Dan Winship 2014-10-21 15:34:01 UTC
pushed branch danw/rdisc-spam for review

(see also https://bugzilla.redhat.com/show_bug.cgi?id=1151665)
Comment 12 Dan Williams 2014-10-21 18:11:04 UTC
Looks good to me.  Tag the commit with the Gnome and RH bug numbers when you merge it?
Comment 13 Dan Winship 2014-10-21 19:04:45 UTC
OK. I built a test package for the Fedora reporters... waiting to hear back on that.
Comment 14 Jiri Klimes 2014-10-22 08:20:06 UTC
The patch looks right to me.
Comment 15 Thomas Haller 2014-10-22 10:56:03 UTC
LGTM too
Comment 16 Dan Winship 2014-10-27 18:48:18 UTC
The previous fix did not fix the bug for the reporter. After more investigation, I found another potential problem, and fixing that did make the bug go away.

Branch is now at danw/routing-spam-bgo735325.

Note that this means that we are somehow messing up and ending up with a default route in an NMIP6Config somehow. There are more logs in the Fedora bug, but nothing jumps out. (It's not because the router is pushing us a default route.) The patch doesn't fix the default-route-in-NMIP6Config bug, it just makes us not get stuck in a loop when it does happen.
Comment 17 Thomas Haller 2014-10-27 19:11:53 UTC
(In reply to comment #16)
> The previous fix did not fix the bug for the reporter. After more
> investigation, I found another potential problem, and fixing that did make the
> bug go away.
> 
> Branch is now at danw/routing-spam-bgo735325.
> 
> Note that this means that we are somehow messing up and ending up with a
> default route in an NMIP6Config somehow. There are more logs in the Fedora bug,
> but nothing jumps out. (It's not because the router is pushing us a default
> route.) The patch doesn't fix the default-route-in-NMIP6Config bug, it just
> makes us not get stuck in a loop when it does happen.

The second commit looks good to me too. I actually do the same on th/bgo735512_route_metric
Comment 18 Dan Williams 2014-10-28 22:23:19 UTC
Second commit also looks good to me.  Ready to merge?
Comment 19 Dan Winship 2014-10-29 13:15:15 UTC
pushed to master and nm-0-9-10