Bug 747985 – [performance] refactor NetworkManager platform netlink connections (libnl)

After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.

Bug 747985 - [performance] refactor NetworkManager platform netlink connections (libnl)


Summary:	[performance] refactor NetworkManager platform netlink connections (libnl)


Status:	RESOLVED FIXED

Product:	NetworkManager
Classification:	Platform
Component:	general
Version:	git master
Hardware:	Other Linux

Importance:	Normal normal
Target Milestone:	---
Assigned To:	NetworkManager maintainer(s)
QA Contact:	NetworkManager maintainer(s)

URL:
Whiteboard:

Depends on:	747981
Blocks:	735445

Reported:	2015-04-16 11:20 UTC by Thomas Haller
Modified:	2015-06-17 09:58 UTC

See Also:
GNOME target:	---
GNOME version:	---

Description Thomas Haller 2015-04-16 11:20:03 UTC

Currently, platform has two netlink sockets.

One (nlh_event) is used to get events when a link/address/route changes (event_notification()).

The other (nlh) is used for command (add link, delete route) and to fetch configuration synchronously. Only data fetched synchronously is put into the cache.

Note that the actual data received from the event socket is not used. The event is only used to retrigger a synchronous fetch and update the cached objects.
This is correct, because the asynchronous events are not in sync with our cache, so we cannot cache them directly. But the problem is that this causes a performance/scalability bottleneck.


This that especially problematic for addresses and routes because we cannot reload a single object, instead get_kernel_object() always gets *all* the routes/addresses for an ifindex, searches the one that it is interested in, and updates the cache with the object that was to be refreshed.
Actually, a refresh_object() of an address/route, is not much more efficient then to refresh all objects for the cache.


There are improvements possible, for example, we could combine events from the event socket to reduce the number of synchronous refreshs.


But I think the real solution is to actually use the objects from the asynchronous events. Then we don't need to refetch anything at all.

I think that should be done by using only one netlink socket for asynchronous events and for synchronous actions/refetch.
The only difficulty is, that whenever we do anything synchronously, we must anticipate that there are already events in the socket that must be processed first.

So, for example add a new link, we must do:

  (1) process all current (async) events in the socket.
  (2) send our RTM_NEWLINK command
  (3) (possibly) process further unrelated asynch events from the socket that 
    were received between (1) and (2).
  (4) process the reply to comment (2).


Step (1) is only needed to reduce the possibility that the netlink socket exhausts the buffer space.



Let's first refactor platform caching before doing this (bug 747981)

Comment 1 Thomas Haller 2015-04-16 11:52:43 UTC

pyroute2 does something similar:
https://github.com/svinota/pyroute2/blob/4c0d478bf43667c2a46fc14edb3a6dc865856f60/pyroute2/netlink/nlsocket.py#L561

Comment 2 Dan Winship 2015-04-16 12:08:27 UTC

(In reply to Thomas Haller from comment #0)
> But I think the real solution is to actually use the objects from the
> asynchronous events. Then we don't need to refetch anything at all.

That's not quite true. The kernel doesn't guarantee that we will always receive all notifications. And in particular, at least in the past, certain notifications were *never* emitted. (like IPv6 default route changes IIRC)

Comment 3 Thomas Haller 2015-04-16 12:15:44 UTC

(In reply to Dan Winship from comment #2)
> (In reply to Thomas Haller from comment #0)
> > But I think the real solution is to actually use the objects from the
> > asynchronous events. Then we don't need to refetch anything at all.
> 
> That's not quite true. The kernel doesn't guarantee that we will always
> receive all notifications. And in particular, at least in the past, certain
> notifications were *never* emitted. (like IPv6 default route changes IIRC)

if you don't get a notification, you also wouldn't get it currently. A missing notification is a kernel bug, and the only workaround is a full resync.

Ok, you are right. I said: "don't need to refetch anything at all". But having only one socket, doesn't exclude the possibility of still doing full resyncs at special moments.

Comment 4 Thomas Haller 2015-05-11 16:02:25 UTC

Partly implemented in bug 747981, th/platform_refact_caching-bgo747981

Comment 5 Thomas Haller 2015-06-17 09:58:16 UTC

merged bug 747981 to master:

http://cgit.freedesktop.org/NetworkManager/NetworkManager/commit/?id=35dcd8ac33d631c54532aa3bd0b3d2e026ec6407


especially patch http://cgit.freedesktop.org/NetworkManager/NetworkManager/commit/?id=051cf8bbde9b73cdb3718be94e78237758b451ff fixes this bug for the most parts.


We still have two netlink sockes, 'nlh' and 'nlh_event', but now we also fetch the data via the 'nlh_event' socket. 'nlh' is only used for actions like add/remove/change address/link/route.

This already significantly reduces the number of fetches.


While we might one day merge 'nlh' and 'nlh_event', there is no strong reason to do that now. So, I close this bug as fixed.