Bug 747981 – [performance] refactor NetworkManager platform caching for libnl objects

After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.

Bug 747981 - [performance] refactor NetworkManager platform caching for libnl objects


Summary:	[performance] refactor NetworkManager platform caching for libnl objects


Status:	RESOLVED FIXED

Product:	NetworkManager
Classification:	Platform
Component:	general
Version:	git master
Hardware:	Other Linux

Importance:	Normal normal
Target Milestone:	---
Assigned To:	NetworkManager maintainer(s)
QA Contact:	NetworkManager maintainer(s)

URL:
Whiteboard:

Depends on:	748131
Blocks:	735445 nm-1-2 747985 747989

Reported:	2015-04-16 10:59 UTC by Thomas Haller
Modified:	2015-06-17 09:54 UTC

See Also:
GNOME target:	---
GNOME version:	---

Description Thomas Haller 2015-04-16 10:59:40 UTC

Currently, nm-linux-platform directly caches libnl objects.

That has several disadvantages:


- whenever we access the cache (ip4_route_get_all()), we must convert the
  libnl object to our native NMPlatformObject types.
  This is a potential performance/scalability issue.

- the libnl objects are not extensible
  
  - we cannot add our own data, so we need for example 
    _rtnl_addr_hack_lifetimes_rel_to_abs() to store the exact lifetime.
    Also hack_empty_master_iff_lower_up() seems ugly.
    We can also not associate the GUdevDevice instance, that's why we have
    a cache for libnl object (priv->link_cache) and priv->udev_devices.

  - we cannot change the equality functions, so we need to clear the address
    family for bridges (_nl_link_family_unset(), nm_nl_cache_search()), and
    hack the diff function (nm_nl_object_diff()).

- there is no index implemented for routes and addresses. To access them 
  (ip4_route_get_all()) we iterate over all objects.
  -- yes, we could implement an index also with libnl caching, but that would
  add more hacks.

- simple lookup (e.g. for a link object, which we do frequently) requires to 
  construct a libnl object as needle.



Instead,
 - we should cache our own native objects
 - add possibly indexes for common types of lookups
 - common types of lookup should not require copying data (ip4_route_get_all()).

Comment 1 Thomas Haller 2015-04-16 11:03:25 UTC

WIP at th/platform_refact_caching-bgo747981

Comment 2 Thomas Haller 2015-04-16 11:40:49 UTC

(In reply to Thomas Haller from comment #1)
> WIP at th/platform_refact_caching-bgo747981

Note: the NMPObjects are refcounted. Still, updating an object in the cache modifies the object inplace.

It might be and interesting idea to make objects in the cache immutable and expose the ref-counting to the users of NMPlatform. If the users could make use of immutable, ref-counted objects, we could make that happen later.

Comment 3 Dan Williams 2015-04-30 02:49:07 UTC

Generally looks OK to me, my issues are mostly about style.

I'm not a huge fan of the _pf_ prefix.  These things are all linux-platform private functions so why that prefix?  Why not just do_refresh_object or do_refresh_all?

Also, let's shorten function names like _pf_coerce_master_connected_needs_toggle_find_master().  It doesn't need to be quite that long; the function itself should have enough commenting to figure out what it does.

Also, on dispose(), do we really need to handle all the actions that are outstanding?  Why can't we just clean them up and drop them on the floor?  None of the _pf_delayed_action_handle_() do anything when cleanup_only=TRUE, which tells me we don't need to call these functions at all on cleanup.

Comment 4 Thomas Haller 2015-04-30 14:37:07 UTC

(In reply to Dan Williams from comment #3)
> Generally looks OK to me, my issues are mostly about style.
> 
> I'm not a huge fan of the _pf_ prefix.  These things are all linux-platform
> private functions so why that prefix?  Why not just do_refresh_object or
> do_refresh_all?

Yes, all "These things" are NMPlatform *object* related. It was there to distinguish them from what is not: ethtool_*(), libnl-utilities, udev-utilities.

Anyway, I dropped "_pf_".


> Also, let's shorten function names like
> _pf_coerce_master_connected_needs_toggle_find_master().  It doesn't need to
> be quite that long; the function itself should have enough commenting to
> figure out what it does.

I renamed this one... any other?


> Also, on dispose(), do we really need to handle all the actions that are
> outstanding?  Why can't we just clean them up and drop them on the floor? 
> None of the _pf_delayed_action_handle_() do anything when cleanup_only=TRUE,
> which tells me we don't need to call these functions at all on cleanup.

Originally I had a delayed action that took a @user_data that must be released when the action was handled. Now all delayed-actions pass either NULL or GINT_TO_POINTER(ifindex). I dropped this.


repushed

Comment 5 Thomas Haller 2015-05-04 09:28:32 UTC

repushed. So fare, looks good.

Didn't test it yet beyond basics.

Comment 6 Thomas Haller 2015-05-04 09:48:26 UTC

(set wrongly to RESOLVED. Undo)

Comment 7 Thomas Haller 2015-05-05 12:09:45 UTC

Repushed.

Tests are fixed and pass for me.

Running NM with new code worked too (seemingly).

Comment 8 Thomas Haller 2015-05-07 17:14:47 UTC

Repushed...

The first 43 commits are ready for review :)

Comment 9 Thomas Haller 2015-05-11 16:06:36 UTC

Added another commit

While the first parts
   platform: remove obsolete functions after refactoring platform cache
   platform: use new platform caching
   platform: add new platform caching to nm-linux-platform.c
rework the caching, they don't change the way we reload objects (e.g. having nlh_event vs. nlh sockets).

The last commit:
   platform: fetch objects via the event socket
changes that, and undoes some of the work done previously.


nm-linux-platform.h changed so much, maybe you only want to review the result -- and skip the intermediate commits.

Comment 10 Thomas Haller 2015-05-19 15:28:07 UTC

rebased on master.


The first few patches are part of lr/root-s-not-dead that allow running platform-tests as root. Let's handle lr/root-s-not-dead as part of this branch...

Comment 11 Dan Williams 2015-05-22 22:32:05 UTC

> platform: add flags parameter to NMPlatformLink

It doesn't look like 'up' and 'arp' are used in that many places, so perhaps those members should just be removed and only 'flags' should be used?


> platform: add vlan_id parameter to NMPlatformLink

A VLAN ID can never be > 4095, so I don't think we need [64] here.  Maybe just [10]?

> platform: add addr parameter to NMPlatformLink

Any reason to not just #include "nm-utils.h" in nm-platform.h to get NM_UTILS_HWADDR_LEN_MAX?

Comment 12 Thomas Haller 2015-05-23 15:22:52 UTC

(In reply to Dan Williams from comment #11)
> > platform: add flags parameter to NMPlatformLink
> 
> It doesn't look like 'up' and 'arp' are used in that many places, so perhaps
> those members should just be removed and only 'flags' should be used?

Indeed. That's what the FIXME comment there indicates. But the branch already has 50+ commits. I planed to do that afterwards.


> > platform: add vlan_id parameter to NMPlatformLink
> 
> A VLAN ID can never be > 4095, so I don't think we need [64] here.  Maybe
> just [10]?

I changed it to [16]... so that we never cut off the guint16 value
changed. Note that this is only a stack allocated array. Shouldn't matter at all (performance-wise).


> > platform: add addr parameter to NMPlatformLink
> 
> Any reason to not just #include "nm-utils.h" in nm-platform.h to get
> NM_UTILS_HWADDR_LEN_MAX?

No strong reason. To keep the number of includes low. How about adding a static assert to ensure the values match?


Pushed 2 fixup commits.

Comment 13 Thomas Haller 2015-05-25 09:46:22 UTC

(In reply to Dan Williams from comment #11)
> > platform: add flags parameter to NMPlatformLink
> 
> It doesn't look like 'up' and 'arp' are used in that many places, so perhaps
> those members should just be removed and only 'flags' should be used?

dropped them now.

3 new commits:

* 73c62f0 platform: remove redundant NMPlatformLink fields "arp" and "up"
* 1c5e3df utils: add NM_FLAGS_SET(), NM_FLAGS_UNSET() and NM_FLAGS_ASSIGN() macros
* 432a43b utils: move NM_FLAGS_*() macros to header file "include/nm-utils-internal.h"

Comment 14 Dan Williams 2015-05-25 15:27:56 UTC

> core: add NMRefString

When allocating an NMRefString, aren't we leaving 1 byte unused?  The string pointer itself is [1] but then when allocating, we're adding strlen (str) + 1 to the size of the struct.  Couldn't we just leave off the +1 in the length and get to the correct size?

> core: add NMMultiIndex

MultiIndex is really just an array hash, right?  eg a multi-index is just a GHashTable(GPtrArray) that happens to also include a second-level hash for fast lookups of the array items.  Using "MultiIndex" confused me at first because I wasn't really sure what it was supposed to mean, but thinking about it as a hash table of arrays is clearer for me.  

I would add a slightly better commit message and also code documentation in nm-multi-index.c that better describes the class and what it's trying to achieve.  The NMMultiIndex class itself doesn't care about "multiple indexes" or anything like that, it's just a basic Array Hash into which the user can put anything it wants, because the user controls the hash IDs.  It's NMPObject itself that actually makes NMMultiIndex into a multi-index, because it dynamically generates hash IDs (eg, NMPCacheIDs) based on the object type and its attributes. 

I'd also personally rename it "NMArrayHash" too :)

> platform: implement NMPObject and NMPCache

I'd almost rather that NMPCache handled the nm_multi_index_id_iter_init() bits so that you didn't have to have nmp_cache_get_multi_index().  The only user is one place in nm-linux-platform.c, and that seems like an encapsulation violation to expose the cache's MultiIndex directly.

Also, perhaps since _vt_cmd_obj_init_cache_id_*() all have a case for NMP_CACHE_ID_TYPE_OBJECT_TYPE that could be handled generically by a helper function that all of them call?  Since the object type is stored in the NMPObject's vtable that should be easily accessible by an entry point function (eg, like a superclass method) that checks for the object type and then calls klass->cmd_obj_init_cache_id.  Should save code and make the callers of cmd_obj_init_cache_id a bit cleaner I think.

> platform: fetch objects via the event socket

Why can't cache_prune_candidates_record_all() use NMP_CACHE_ID_STATIC?  Every other use of nmp_cache_id_init_*() in nm-linux-platform.c does, so why not this one?  I don't think there would be any side-effects since the function doesn't call anything else that would use a cache ID.  Also, if the comments above about having the multi_index_id_iter_init() seem reasonable the maybe this doesn't have to be called from here...

Comment 15 Thomas Haller 2015-05-25 20:06:30 UTC

(In reply to Dan Williams from comment #14)
> > core: add NMRefString
> 
> When allocating an NMRefString, aren't we leaving 1 byte unused?  The string
> pointer itself is [1] but then when allocating, we're adding strlen (str) +
> 1 to the size of the struct.  Couldn't we just leave off the +1 in the
> length and get to the correct size?

I think it's correct as is, because

   s = g_malloc (G_STRUCT_OFFSET (_NMString, str) + len);

is different from

   s = g_malloc (sizeof (_NMString) + len);




> > core: add NMMultiIndex
> 
> MultiIndex is really just an array hash, right?  eg a multi-index is just a
> GHashTable(GPtrArray) that happens to also include a second-level hash for
> fast lookups of the array items.  Using "MultiIndex" confused me at first
> because I wasn't really sure what it was supposed to mean, but thinking
> about it as a hash table of arrays is clearer for me.  
> 
> I would add a slightly better commit message and also code documentation in
> nm-multi-index.c that better describes the class and what it's trying to
> achieve.  The NMMultiIndex class itself doesn't care about "multiple
> indexes" or anything like that, it's just a basic Array Hash into which the
> user can put anything it wants, because the user controls the hash IDs. 
> It's NMPObject itself that actually makes NMMultiIndex into a multi-index,
> because it dynamically generates hash IDs (eg, NMPCacheIDs) based on the
> object type and its attributes. 

> I'd also personally rename it "NMArrayHash" too :)

I don't agree with this interpretation.

Implementation wise it's a hash-of-hashes, where the outer keys are opaque NMMultiIndexId and the inner values are gpointer.

IMO the array aspect is not very important. It's just that nm_multi_index_lookup() returns a list of the pointers. It does so only because a NULL-terminated array is convenient to handle for the caller -- contrary to exposing some iterator.

It's also not an array in the sense that the order is fixed. The returned elements are in random order.

IMO it's not a "hash for arrays" (NMArrayHash). GHashTable(GPtrArray) sounds like values of the hash are GPtrArray. That is not the case. The pointers are the values.


I reworded the commit message. Is that better? IMO seeing them as "buckets" is more intuitive.


One pointer can be associated with multiple/different keys. Hence the "multi".
"index" stands for "Lookup table".



> > platform: implement NMPObject and NMPCache
> 
> I'd almost rather that NMPCache handled the nm_multi_index_id_iter_init()
> bits so that you didn't have to have nmp_cache_get_multi_index().  The only
> user is one place in nm-linux-platform.c, and that seems like an
> encapsulation violation to expose the cache's MultiIndex directly.

fixup.


> Also, perhaps since _vt_cmd_obj_init_cache_id_*() all have a case for
> NMP_CACHE_ID_TYPE_OBJECT_TYPE that could be handled generically by a helper
> function that all of them call?  Since the object type is stored in the
> NMPObject's vtable that should be easily accessible by an entry point
> function (eg, like a superclass method) that checks for the object type and
> then calls klass->cmd_obj_init_cache_id.  Should save code and make the
> callers of cmd_obj_init_cache_id a bit cleaner I think.

Pushed a fixup.


> > platform: fetch objects via the event socket
> 
> Why can't cache_prune_candidates_record_all() use NMP_CACHE_ID_STATIC? 
> Every other use of nmp_cache_id_init_*() in nm-linux-platform.c does, so why
> not this one?  I don't think there would be any side-effects since the
> function doesn't call anything else that would use a cache ID.  Also, if the
> comments above about having the multi_index_id_iter_init() seem reasonable
> the maybe this doesn't have to be called from here...

Changed, with the fixup above.


Repushed

Comment 16 Thomas Haller 2015-06-05 15:16:48 UTC

merged another chunk:
http://cgit.freedesktop.org/NetworkManager/NetworkManager/commit/?id=d38c3851f803c4e2f6faa1f88dfe55f952148891


and repushed.

Comment 17 Lubomir Rintel 2015-06-09 14:02:56 UTC

> platform: add scope parameter to NMPlatformIP4Route

I don't get the inversion dance at all. Is the solemn purpose that scope_inv of 0 is RT_SCOPE_NOWHERE instead of RT_SCOPE_UNIVERSE?

> glib-compat: add g_ptr_array_insert()

Why? Is this actually used anywhere?

> core: add NMRefString

It's difficult to me to understand why is this a good idea. Why not just g_strdup() the udis? (is it used anywhere else?). Is it likely that the memory saved will pay for additional complexity and even the increase in code  (800B).

There's an line break at the end of b/src/NetworkManagerUtils.c

> platform: add nmp-object.h file

Trailing line break here too, in src/platform/tests/.gitignore

> platform: add new platform caching to nm-linux-platform.c

Here as well: src/platform/nm-linux-platform.c

> platform: fetch objects via the event socket
...
> request, we must make sure that the result manifistated itself

"manifistatied" sounds overly violent. "manifestated" would be sufficient :)

Also, please just drop the valgrind.suppressions chunk.

Comment 18 Thomas Haller 2015-06-10 16:08:33 UTC

(In reply to Lubomir Rintel from comment #17)
> > platform: add scope parameter to NMPlatformIP4Route
> 
> I don't get the inversion dance at all. Is the solemn purpose that scope_inv
> of 0 is RT_SCOPE_NOWHERE instead of RT_SCOPE_UNIVERSE?

Correct, we want to get a default NMPlatformIP4Route instance by doing memset(0). But for scope the default is 255, not 0.
So instead store the inverse.

> 
> > glib-compat: add g_ptr_array_insert()
> 
> Why? Is this actually used anywhere?

Used at first, dropped later. I'd keep it because it's a useful function, that is good to have in your ~utils~ toolbox.


> > core: add NMRefString
> 
> It's difficult to me to understand why is this a good idea. Why not just
> g_strdup() the udis? (is it used anywhere else?). Is it likely that the
> memory saved will pay for additional complexity and even the increase in
> code  (800B).

If you have a useful class, it can gain more widespread use over time... (provided it *is* useful, which is a valid point...).

Where do you see additional complexity?
Note that in C, you must always know:
  - who owns some data (and what is it's lifetime)
  - and if you own it, how to release it when you no longer need it.
    Usually by calling one of g_free, g_slice_free, g_object_unref, 
    g_hash_table_unref, etc.pp.

All you need to know is that you must free the udi field with nm_ref_string_unref(), instead of g_free().


> There's an line break at the end of b/src/NetworkManagerUtils.c

what's wrong with empty lines at the end of a file? Indeed git-diff highlights it...


> > platform: add nmp-object.h file
> 
> Trailing line break here too, in src/platform/tests/.gitignore
>
> > platform: add new platform caching to nm-linux-platform.c
> 
> Here as well: src/platform/nm-linux-platform.c

yes :)


> 
> > platform: fetch objects via the event socket
> ...
> > request, we must make sure that the result manifistated itself
> 
> "manifistatied" sounds overly violent. "manifestated" would be sufficient :)

:)



> Also, please just drop the valgrind.suppressions chunk.

done.

Comment 19 Lubomir Rintel 2015-06-15 14:58:14 UTC

Looks good to me

Comment 20 Dan Williams 2015-06-16 15:03:15 UTC

> platform: register singleton instance early with NM_PLATFORM_DO_SETUP_SINGLETON

To me using "setup" for this operation is somewhat confusing.  How about NM_PLATFORM_REGISTER_SINGLETON, since right now that's all this (and nm_platform_setup()) controls?


> platform: no longer expose udi field in NMPlatformLink

I like this solution to the NMRefString/udi debate.


> platform: invoke platform signals with clone of object

Can you copy the code comment into the commit log too?


> platform: return NMPlatformError from link-add functions

Can you take a look at dcbw/devices-for-all-1?  Nevermind... went and looked at your review there and noticed you said you'd handle the conflict.  Thanks!

But also, the nm-linux-platform.c code in nm_platform_link_add(), nm_platform_vlan_add(), and nm_platform_infiniband_partition_add() for checking for an existing link looks really similar, is there any way to consolidate that to a helper so we don't have to duplicate it 3 times?


> device: don't check for NM_PLATFORM_ERROR_NOT_FOUND in set_nm_ipv6ll()

The check was there for either/both of two things: (1) we didn't used to check ifindex > 0, and/or (2) the ifindex for WWAN/BT/PPPoE/etc devices may not be updated quickly enough in some cases so it could get called when a WWAN device crashes or sometihng?  I'm not opposed to the change at all, but we might need to add this back later.


> platform: drop nm_platform_get_error()

Yay!

Comment 21 Thomas Haller 2015-06-16 15:56:59 UTC

(In reply to Dan Williams from comment #20)

fixed all of the above, and reworded some commit messages.

Repushed.

Comment 22 Dan Williams 2015-06-16 16:04:11 UTC

LGTM now

Comment 23 Thomas Haller 2015-06-17 09:54:15 UTC

merged all to master:

http://cgit.freedesktop.org/NetworkManager/NetworkManager/commit/?id=35dcd8ac33d631c54532aa3bd0b3d2e026ec6407