After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.
Bug 699772 - [review] pavlix/rdisc: implement IPv6 automatic configuration in userspace, using libndp
[review] pavlix/rdisc: implement IPv6 automatic configuration in userspace, u...
Status: RESOLVED FIXED
Product: NetworkManager
Classification: Platform
Component: general
unspecified
Other Linux
: Normal normal
: ---
Assigned To: Pavel Simerda
NetworkManager maintainer(s)
Depends on:
Blocks: 680891 692279
 
 
Reported: 2013-05-06 18:13 UTC by Pavel Simerda
Modified: 2013-08-20 15:37 UTC
See Also:
GNOME target: ---
GNOME version: ---



Description Pavel Simerda 2013-05-06 18:13:17 UTC
Jiří Pírko created a library for neighbor discovery protocol used for IPv6 automatic configuration, optionally with DHCPv6. Currently, NetworkManager uses kernel's internal NDP implementation which is far from ideal and doesn't meet NetworkManager's requirements.

The proposed solution is:

1) Abstract class for IPv6 autoconf (NMIP6Autoconf?), like DHCP abstract interface or NMPlatform
2) Implementation using libndp (NMLNDBIP6Autoconf?)
3) Port of the current implementation using kernel autoconf for distros without libndp package (NMLegacyIP6Autoconf?)
4) Fake implementation similar (NMFakeIP6Autoconf?)
5) Testsuite
6) NMIP6Manager modified to only use NMPlatform and NMIP6Autoconf, not NMNetlinkMonitor
Comment 1 Pavel Simerda 2013-05-06 18:13:52 UTC
If there is any reasonable way to avoid doing #3, I would be more than happy.
Comment 2 Dan Winship 2013-05-06 19:25:11 UTC
(In reply to comment #1)
> If there is any reasonable way to avoid doing #3, I would be more than happy.

libndp is small; we could include it in the tree (either directly or via some sort of git magic), but also provide a configure option to use an installed version instead, so distros can switch to that as soon as they start packaging libndp.
Comment 3 Pavel Simerda 2013-05-17 23:41:20 UTC
https://github.com/jpirko/libndp
Comment 4 Pavel Simerda 2013-05-31 14:12:53 UTC
It is more of a concept pre-review with a fake implementation. After review, this can be pushed to master with no consequences as the code is not yet being used by nm-device.

I already pushed the libndp git submodule and an update to ./autogen.sh to better handle submodules. It will be used by the linux implementation that is not yet ready.

The basic idea is that with the real implementation each router advertisement (and only router advertisment) will trigger the config-changed signal and that the public members will be already updated according to the router advertisement.
Comment 5 Dan Winship 2013-05-31 16:03:31 UTC
(In reply to comment #4)
> It is more of a concept pre-review with a fake implementation. After review,
> this can be pushed to master with no consequences as the code is not yet being
> used by nm-device.

But is there any reason to push it now?

> The basic idea is that with the real implementation each router advertisement
> (and only router advertisment) will trigger the config-changed signal and that
> the public members will be already updated according to the router
> advertisement.

1) Needs better naming. With "IP6" in it. NMIP6Monitor? NMIP6AdvertisedGateway, NMIP6AdvertisedRoute, etc?

2) GArray is an ugly API and is mostly intended to be used as a "builder" API, with ordinary C arrays appearing in the public parts of the API. Eg, use GArray internally, but then once you've parsed everything, do:

  rdisc->num_gateways = gateways->len;
  rdisc->gateways = (NMRDiscGateway *) g_array_free (gateways, FALSE);

etc.

3) Shouldn't this be part of NMPlatform rather than its own class?
Comment 6 Pavel Simerda 2013-06-06 06:30:48 UTC
(In reply to comment #5)
> (In reply to comment #4)
> > It is more of a concept pre-review with a fake implementation. After review,
> > this can be pushed to master with no consequences as the code is not yet being
> > used by nm-device.
> 
> But is there any reason to push it now?

Not except the 'release early' rule of thumb to give it out for experimentation. It can simply wait for the real implementation.

> > The basic idea is that with the real implementation each router advertisement
> > (and only router advertisment) will trigger the config-changed signal and that
> > the public members will be already updated according to the router
> > advertisement.
> 
> 1) Needs better naming. With "IP6" in it.

I'm not sure. Although router discovery is commonly used with IPv6 and the API is IPv6-only, the concepts are protocol independent. If anyone (I doubt it) wanted to use it with IPv4, the API and implementation could be adapted for that.

> NMIP6Monitor?

Not a monitor, rather something like a DHCP client but for a different protocol with a different workflow.

If you do insist on an IPv6-specific name, let's use NMIP6Discovery. In my opinion the IPv6-style configuration negotiation brings more problems than it solves, so I don't think anyone will want to bring it back to IPv4 :D.

> NMIP6AdvertisedGateway,
> NMIP6AdvertisedRoute, etc?

For the public structures? Yes, they probably need IP6 in their names. Your names are good, or we could use s/Advertised/Discovered/ or whatever.

> 2) GArray is an ugly API

Not much more ugly than the rest of Glib API, though :).

> and is mostly intended to be used as a "builder" API,
> with ordinary C arrays appearing in the public parts of the API. Eg, use GArray
> internally, but then once you've parsed everything, do:
> 
>   rdisc->num_gateways = gateways->len;
>   rdisc->gateways = (NMRDiscGateway *) g_array_free (gateways, FALSE);
> 
> etc.

Would you prefer a GList or similar, in this case? We will not only be building the lists at once, but also manipulate them adding/removing items upon each recieved RA.

> 3) Shouldn't this be part of NMPlatform rather than its own class?

Definitely not. The NMPlatform is here to communicate with the operating system, more specifically the Linux kernel. NMRDisc is more like a DHCP client, just using a different protocol and workflow, kernel is only used as a transport layer.

In theory, if NetworkManager is ported to another platform supported by libndp, and dhclient, then NMPlatform needs a new implementation, while both router discovery code and DHCP code are ready to use. This could also allow for interesting testing scenarios where you could use fake router discovery with real platform or fake platform with real discovery.
Comment 7 Dan Winship 2013-06-07 15:15:09 UTC
(In reply to comment #6)
> > 1) Needs better naming. With "IP6" in it.
> 
> I'm not sure. Although router discovery is commonly used with IPv6 and the API
> is IPv6-only, the concepts are protocol independent. If anyone (I doubt it)
> wanted to use it with IPv4, the API and implementation could be adapted for
> that.

There is no *actual* reason to think it might someday be protocol-independent though. Lots of things *might* happen. We don't want to plan ahead in the API for all of them.

> >   rdisc->num_gateways = gateways->len;
> >   rdisc->gateways = (NMRDiscGateway *) g_array_free (gateways, FALSE);
> > 
> > etc.
> 
> Would you prefer a GList or similar, in this case? We will not only be building
> the lists at once, but also manipulate them adding/removing items upon each
> recieved RA.

Oh, hm, I was thinking there that RDisc was a temporary structure created just for the signal emission and then destroyed afterward.

I have less of an opinion of how it should be in that case.
Comment 8 Pavel Simerda 2013-06-10 08:23:06 UTC
(In reply to comment #7)
> (In reply to comment #6)
> > > 1) Needs better naming. With "IP6" in it.
> > 
> > I'm not sure. Although router discovery is commonly used with IPv6 and the API
> > is IPv6-only, the concepts are protocol independent. If anyone (I doubt it)
> > wanted to use it with IPv4, the API and implementation could be adapted for
> > that.
> 
> There is no *actual* reason to think it might someday be protocol-independent
> though. Lots of things *might* happen. We don't want to plan ahead in the API
> for all of them.

I wasn't suggesting it will happen. I just used it to show the concept of router discovery is similar to that of DHCP and different to that of NMPlatform.

> > Would you prefer a GList or similar, in this case? We will not only be building
> > the lists at once, but also manipulate them adding/removing items upon each
> > recieved RA.
> 
> Oh, hm, I was thinking there that RDisc was a temporary structure created just
> for the signal emission and then destroyed afterward.

See bug 680891 and bug 692279 and the stub implementation of libndp-based implementation which is already in pavlix/rdisc. This will give you a better picture of what can happen.

Currently I'm working on merging/obsoleting all the lists of addresses/routes/gateways/nameservers/domains.

> I have less of an opinion of how it should be in that case.

Any advice is welcome. The lists will be kept sorted according to custom critera and records will be often added (anywhere), removed or moved. The lists will be usually very small.

In future, the maintainance of rdisc will be related to various RFC updates or new RFCs supported. Or it may happen that an external library (libndp most probably) will be used for all of the standards stuff. I need to talk to jpirko about it but this will not be done any soon AFAIK.
Comment 9 Dan Winship 2013-06-11 15:33:43 UTC
mentioned on IRC: we're currently installing libndp, but we should not be; libndp is not a feature of NetworkManager, it's an implementation detail. Especially, we don't want NetworkManager packages to conflict with future libndp packages.

The easiest fix (assuming Jiři doesn't want to add an --uninstalled option or something to libndp upstream) is probably to remove libndp from SUBDIRS, and just have a rule that does "(cd libndp; $(MAKE))" at the right time

Of course we also need the configure check for an already-install libndp. Rawhide apparently already has packages.
Comment 10 Dan Winship 2013-06-11 16:22:37 UTC
(In reply to comment #8)
> Any advice is welcome. The lists will be kept sorted according to custom
> critera and records will be often added (anywhere), removed or moved. The lists
> will be usually very small.

And everyone looking at the lists is going to want to walk the list in order (either to the end of the list, or up to the Nth element) rather than accessing the elements randomly, right? So maybe a GSList is best.

> In future, the maintainance of rdisc will be related to various RFC updates or
> new RFCs supported. Or it may happen that an external library (libndp most
> probably) will be used for all of the standards stuff. I need to talk to jpirko
> about it but this will not be done any soon AFAIK.

I don't think we want to rely on libndp updates for parsing new RA data; it takes too long for new features to get implemented upstream, released in a stable release, and then propagated down to distro packages. (We are having this problem with libnl now.) The libndp API provides access to the raw RA data, so if there ends up being something new we want to support, we can just parse that data out ourselves.
Comment 11 Dan Winship 2013-06-13 13:27:21 UTC
pushed danw/ndpfix with patches to (a) not install libndp, and (b) allow building against system libndp; and danw/rdisc-ndpfix, which is just pavlix/rdisc, rebased over danw/ndpfix, plus a tiny patch to use system libndp if that's what was configured
Comment 12 Dan Winship 2013-06-13 16:36:55 UTC
(In reply to comment #11)
> pushed danw/ndpfix with patches to (a) not install libndp, and (b) allow
> building against system libndp

pushed to master after discussions on irc and email
Comment 13 Pavel Simerda 2013-06-14 07:49:24 UTC
(In reply to comment #12)
> (In reply to comment #11)
> > pushed danw/ndpfix with patches to (a) not install libndp, and (b) allow
> > building against system libndp
> 
> pushed to master after discussions on irc and email

Thanks!
Comment 14 Pavel Simerda 2013-06-14 22:47:48 UTC
Ready for another review round. An implementation based on libndp is now available and a patch for nm-device to use the new router discovery framework is there. I didn't find time for much testing, though.

For now I'm keeping the old names. Dan Winship pointed out that IPv6 should be explicit in the name. I was thinking about NMIP6Discovery, NMFakeIP6Discovery, NMLNDPIP6Discovery, NMIP6DiscoveredAddress, etc... please comment on the choice of names, I would like to switch the names *once* if possible.

(In reply to comment #10)
> And everyone looking at the lists is going to want to walk the list in order
> (either to the end of the list, or up to the Nth element) rather than accessing
> the elements randomly, right? So maybe a GSList is best.

I looked at the GList and GSList APIs and realized they don't provide some of the traditional list actions like O(1) insertion after a known item, etc. The GList/GSList APIs are mostly index-based, which sort of spoils the advantage over GArray so I'm keeping GArray for now.

> I don't think we want to rely on libndp updates for parsing new RA data;

Actually, I don't think there will be new RA data types every now and then, so this shouldn't be a problem. But this discussion is not necessary for this particular task.

> it takes too long for new features to get implemented upstream, released in a
> stable release, and then propagated down to distro packages. (We are having
> this problem with libnl now.)

On the other hand, external libraries have been the basis of the success of the whole linux ecosystem. Many things could be then fixed without touching NetworkManager. The relation between NM and libndp is different from that of libnl and even the libnl problems are temporary and often can be solved without bumping the dependency.

> The libndp API provides access to the raw RA
> data, so if there ends up being something new we want to support, we can just
> parse that data out ourselves.

True.

Please review. If you have time for testing, even better.
Comment 15 Dan Williams 2013-06-28 21:35:35 UTC
> rdisc: abstract class and fake implementation

Any reason dhcp_level is 'int' rather than 'guint32'?  We've already got NM_RDISC_DHCP_LEVEL_UNKNOWN which should cover anything we want for unset/unknown, so I think it can be an unsigned instead.

Is there any reason a subclass would *not* have a start() implementation?  Maybe we should g_assert()  here or g_return_if_fail().

Should the fields for NMRDiscXXX be 'unsigned int' instead of 'int'?  Is -1 being used for an 'unset' value or something?  At least for timestamp it seems like a timestamp of 0 could be used to indicate "not set" and then we don't have to deal with potential integer wrapping bugs since we're handling network-sourced values.

> rdisc: libndp implementation

recieve_ra() is misspelled, should be receive_ra() (i and e flipped, yeah English sucks...)

In recieve_ra() if you change dhcp_level to an unsigned int, need to update this local variable too.

Need to fix the commented out clean_servers()/clean_domains() calls in recieve_ra().

next_event doesn't seem to be used anywhere yet?

Also, what happens if a new router advertisement comes in that specified DHCP but the previous DHCP triggered by the previous RA hasn't completed yet?  I think we should cancel the previous DHCP run if the DHCP flags are different.  That would involve comparing priv->dhcp_level to the new level and calling dhcp6_cleanup() if they are different, then kicking off a new request with dhcp6_start().


> device: use internal router discovery implementation

Need to clean up the commented out nm_ip6_manager_prepare_interface() bits in addrconf6_start().

Need to clean up the tried_ipv6 stuff in nm_device_deactivate().  I don't think it's used anymore, so just delete it.
Comment 16 Pavel Simerda 2013-07-01 10:49:53 UTC
(In reply to comment #15)
> > rdisc: abstract class and fake implementation
> 
> Any reason dhcp_level is 'int' rather than 'guint32'?  We've already got
> NM_RDISC_DHCP_LEVEL_UNKNOWN which should cover anything we want for
> unset/unknown, so I think it can be an unsigned instead.

Switching from int to NMRDiscDHCPLevel.

> Is there any reason a subclass would *not* have a start() implementation? 
> Maybe we should g_assert()  here or g_return_if_fail().

Done.

> Should the fields for NMRDiscXXX be 'unsigned int' instead of 'int'?  Is -1
> being used for an 'unset' value or something?  At least for timestamp it seems
> like a timestamp of 0 could be used to indicate "not set" and then we don't
> have to deal with potential integer wrapping bugs since we're handling
> network-sourced values.

Turned network-sourced fields to guint32. Kept int plen, if you don't have objections. Switched timestamp from int to time_t.

> > rdisc: libndp implementation
> 
> recieve_ra() is misspelled, should be receive_ra() (i and e flipped, yeah
> English sucks...)

My favorite one.

> In recieve_ra() if you change dhcp_level to an unsigned int, need to update
> this local variable too.

I tend to prefer the enum type.
 
> Need to fix the commented out clean_servers()/clean_domains() calls in
> recieve_ra().

Done.

> next_event doesn't seem to be used anywhere yet?

Fixed.

> Also, what happens if a new router advertisement comes in that specified DHCP
> but the previous DHCP triggered by the previous RA hasn't completed yet?

Good question.

> I
> think we should cancel the previous DHCP run if the DHCP flags are different.
> That would involve comparing priv->dhcp_level to the new level

That's handled by the 'changed' map argument and NM_RDISC_CONFIG_DHCP_LEVEL.

> and calling
> dhcp6_cleanup() if they are different, then kicking off a new request with
> dhcp6_start().

What about this?

        }
 
        if (changed & NM_RDISC_CONFIG_DHCP_LEVEL) {
+               dhcp6_cleanup (self, TRUE, TRUE);
+
                priv->dhcp6_mode = rdisc->dhcp_level;
 
                switch (priv->dhcp6_mode) {
                case NM_RDISC_DHCP_LEVEL_NONE:
-                       g_clear_object (&priv->dhcp6_ip6_config);
                        break;
                default:
                        nm_log_info (LOGD_DEVICE | LOGD_DHCP6,

> > device: use internal router discovery implementation
> 
> Need to clean up the commented out nm_ip6_manager_prepare_interface() bits in
> addrconf6_start().

Done.

> Need to clean up the tried_ipv6 stuff in nm_device_deactivate().  I don't think
> it's used anymore, so just delete it.

Done.

Waiting for additional comments, especially those from Dan Winship and also for instructions/ideas regarding the class/module renaming.
Comment 17 Dan Winship 2013-07-02 17:53:28 UTC
>    rdisc: abstract class and fake implementation

You don't need src/rdisc/Makefile.am; just add "rdisc/tests" directly to src/Makefile.am's SUBDIRS.


>    rdisc: libndp implementation

>+	ndpe = ndp_msg_new (&msg, NDP_MSG_RS);
>+	g_assert (!ndpe);

This confused me... it looks like you're asserting that ndp_msg_new() failed and returned NULL. I think "error" or "err" would be a better name for ndpe. (Likewise elsewhere.)

Also, I think send_rs() leaks msg on success?

The "/* Skip more preferable routes */" code in add_gateway() and add_route() is a no-op. (The loop immediately proceeds to the next element either way.)

build_address() is exceedingly magical, and needs comments. Perhaps it belongs in libndp as a helper function? (Also, the fact that it contains both a variable "ll" and a magic number "11" is not great.)

>+ * Subsequent router advertisements can represent new default gateways
>+ * on the network. We should present all of them in router preference
>+ * order.

The code does not actually sort rdisc->gateways in preference order (but the final nm-device.c patch assumes that it does).

nm_lndp_rdisc_finalize() needs to g_source_remove() send_rs_id (if it's still set) and event_id, and free event_channel.

I think there needs to be code to send an RS if data is about to expire? (Like NMIP6Manager currently does, to deal with routers that only advertise based on the route lifetime, without taking into account the DNS data lifetime.)


>    ip6: add nm_ip6_config_reset_addresses()
>    
>    Copy-paste of nm_ip4_config_reset_routes().

I assume s/routes/addresses/ ?


>    device: clean up ip6 configuration flow

You remove the set of PENDING_IP6_CONFIG in nm_device_activate_schedule_ip6_config_result(), but not the reading and clearing of it in nm_device_activate_ip6_config_commit() (which ought to mean it will crash always since the code asserts that PENDING_IP6_CONFIG was set?)


>    device: use internal router discovery implementation

You don't want to add nm-fake-rdisc.[ch] to libNetworkManager_la_SOURCES.

If the code is no longer using NMIP6Manager, then you should remove it; we can look in the git history if we need to refer back to it.
Comment 18 Pavel Simerda 2013-07-09 11:22:48 UTC
(In reply to comment #17)
> >    rdisc: abstract class and fake implementation
> 
> You don't need src/rdisc/Makefile.am; just add "rdisc/tests" directly to
> src/Makefile.am's SUBDIRS.

This is intentional. Just as with src/platform, a developer may want to work in the respective subdirectory, edit the files and build the tests.

> >    rdisc: libndp implementation
> 
> >+	ndpe = ndp_msg_new (&msg, NDP_MSG_RS);
> >+	g_assert (!ndpe);
> 
> This confused me... it looks like you're asserting that ndp_msg_new() failed
> and returned NULL. I think "error" or "err" would be a better name for ndpe.
> (Likewise elsewhere.)

Granted.

> Also, I think send_rs() leaks msg on success?

Fixed.

> The "/* Skip more preferable routes */" code in add_gateway() and add_route()
> is a no-op. (The loop immediately proceeds to the next element either way.)

Turned into break.

> build_address() is exceedingly magical, and needs comments.

Renamed, simplified and commented.

> Perhaps it belongs in libndp as a helper function?

Yes. Perhaps most of nm-lndp-rdisc belongs in libndp and I'm all for trying to move the code there.

> >+ * Subsequent router advertisements can represent new default gateways
> >+ * on the network. We should present all of them in router preference
> >+ * order.
> 
> The code does not actually sort rdisc->gateways in preference order (but the
> final nm-device.c patch assumes that it does).

Modified the preference logic.

> nm_lndp_rdisc_finalize() needs to g_source_remove() send_rs_id (if it's still
> set) and event_id, and free event_channel.

Done.

> I think there needs to be code to send an RS if data is about to expire?

Yes, this one should be ported. TODO.

And the workarounds for too short lifetimes as well. TODO.

> >    ip6: add nm_ip6_config_reset_addresses()
> >    
> >    Copy-paste of nm_ip4_config_reset_routes().
> 
> I assume s/routes/addresses/ ?

Simplified the message. I'll merge this patch into the pavlix/runtime ip?-config reorg patch anyway.

> >    device: clean up ip6 configuration flow
> 
> You remove the set of PENDING_IP6_CONFIG in
> nm_device_activate_schedule_ip6_config_result(), but not the reading and
> clearing of it in nm_device_activate_ip6_config_commit() (which ought to mean
> it will crash always since the code asserts that PENDING_IP6_CONFIG was set?)

Could you please look at the fixup patch I added just after the original one?

> >    device: use internal router discovery implementation
> 
> You don't want to add nm-fake-rdisc.[ch] to libNetworkManager_la_SOURCES.

I'm not sure about it yet, nm-fake-platform is also there. Those will be needed for fake NetworkManager runs. I think we can fix this properly when enabling the fake runs, as it doesn't cause any harm.

> If the code is no longer using NMIP6Manager, then you should remove it; we can
> look in the git history if we need to refer back to it.

Sure.
Comment 19 Pavel Simerda 2013-07-10 21:44:50 UTC
(In reply to comment #18)
> > I think there needs to be code to send an RS if data is about to expire?
> 
> Yes, this one should be ported. TODO.

Done.

> And the workarounds for too short lifetimes as well. TODO.

Done.
Comment 20 Dan Williams 2013-07-12 15:01:14 UTC
time_t is like size_t apparently, and can't fit in a %d on 64-bit arches.

diff --git a/src/rdisc/nm-lndp-rdisc.c b/src/rdisc/nm-lndp-rdisc.c
index 61cc8c2..c48aff9 100644
--- a/src/rdisc/nm-lndp-rdisc.c
+++ b/src/rdisc/nm-lndp-rdisc.c
@@ -330,7 +330,7 @@ check_timestamps (NMRDisc *rdisc, time_t timestamp, NMRDiscConfigMap changed)
                g_signal_emit_by_name (rdisc, NM_RDISC_CONFIG_CHANGED, changed);
 
        if (nextevent != UINT_MAX) {
-               debug ("Scheduling next timestamp/lifetime check: %d seconds", nextevent);
+               debug ("Scheduling next timestamp/lifetime check: %zd seconds", nextevent);
                priv->timeout_id = g_timeout_add_seconds (nextevent, timeout_cb, rdisc);
        }
 }
Comment 21 Dan Williams 2013-07-12 16:07:26 UTC
--------------------
NetworkManager[7138]: <debug> [1373644860.209405] [rdisc/nm-rdisc.c:54] nm_rdisc_start(): Starting router discovery: 5
NetworkManager[7138]: <debug> [1373644860.209472] [rdisc/nm-lndp-rdisc.c:550] process_events(): Processing libndp events.
NetworkManager[7138]: <debug> [1373644860.209510] [rdisc/nm-lndp-rdisc.c:199] solicit(): Scheduling router solicitation.
NetworkManager[7138]: <debug> [1373644860.209674] [NetworkManagerUtils.c:442] nm_utils_do_sysctl(): sysctl: setting '/proc/sys/net/ipv6/conf/wlan0/use_tempaddr' to '0'
NetworkManager[7138]: <debug> [1373644860.330096] [rdisc/nm-lndp-rdisc.c:179] send_rs(): Sending router solicitation: 5
NetworkManager[7138]: <error> [1373644860.330964] [rdisc/nm-lndp-rdisc.c:183] send_rs(): Cannot send router solicitation.
NetworkManager[7138]: <debug> [1373644860.331803] [rdisc/nm-lndp-rdisc.c:187] send_rs(): Scheduling router solicitation retry in 10 seconds.

no idea what that's about, but the debug code in send_ra() should probably print out the error returned from libndp so that we have a clue what's going on.  It turns out the error is -99, which is probably EADDRNOTAVAIL.  Any idea what that's about?

---------------------

(NetworkManager:3831): GLib-CRITICAL **: g_bytes_get_data: assertion `bytes != NULL' failed

which comes from receive_ra().  It appears that nothing in nm-device.c calls nm_rdisc_set_lladdr(), so yeah rdisc->lladdr is NULL and that makes things unhappy.  Do interfaces like PPP even have an lladdr?  In any case, we should either assert that an lladdr has been set on the NMRDisc, or the code should handle not having one.

----------------------

NetworkManager[9845]: <info> Activation (wlan0) Stage 4 of 5 (IPv6 Configure Timeout) scheduled...
NetworkManager[9845]: <info> Activation (wlan0) Stage 4 of 5 (IPv6 Configure Timeout) started...
NetworkManager[9845]: <info> Activation (wlan0) Stage 4 of 5 (IPv6 Configure Timeout) complete.
NetworkManager[9845]: <debug> [1373645074.736818] [rdisc/nm-lndp-rdisc.c:546] process_events(): Processing libndp events.
NetworkManager[9845]: <debug> [1373645074.736890] [rdisc/nm-lndp-rdisc.c:405] receive_ra(): Recieved router advertisement: 5 at 9921
NetworkManager[9845]: <debug> [1373645074.736931] [rdisc/nm-lndp-rdisc.c:333] check_timestamps(): Scheduling next timestamp/lifetime check: 3600 seco

Here we're using DHCP in Managed mode, but the IP6 config timeout (because DHCP6 didn't reply) doesn't terminate NDP at all.  I feel like a failure of DHCP managed mode should terminate NDP, because the router explicitly requests that addressing be obtained via DHCP, and this has failed.

----------------------

Debugging: can we get the rdisc code debug/info statements to print out what interface they are sending stuff to?  Otherwise there's no way to know whether the requests are going out eth0 or wlan0 or whatever.  Yes, this requires an interface name lookup but that's OK.  We can cache that in the NMRDisc struct to make it easier.
Comment 22 Dan Williams 2013-07-12 16:14:24 UTC
I pushed some hacks to dcbw/rdisc-merge, which is rebased on top of master.  See if you find any of it (or the merge itself) useful.
Comment 23 Pavel Simerda 2013-07-12 17:09:31 UTC
(In reply to comment #20)
> time_t is like size_t apparently, and can't fit in a %d on 64-bit arches.

I'm now turning it into int for logging purposes. Looks like time_t is *always* a 64-bit signed integer on GNU, feel free to offer a better solution. I'm trying to handle it more carefully outside of logging context.

Refactored the whole time_t code, please check whether it looks sane to you. There were bugs in the code logic, mixing lifetimes with timestamps, etc, it should be better now.

(In reply to comment #21)
> --------------------
> NetworkManager[7138]: <debug> [1373644860.209405] [rdisc/nm-rdisc.c:54]
> nm_rdisc_start(): Starting router discovery: 5
> NetworkManager[7138]: <debug> [1373644860.209472] [rdisc/nm-lndp-rdisc.c:550]
> process_events(): Processing libndp events.
> NetworkManager[7138]: <debug> [1373644860.209510] [rdisc/nm-lndp-rdisc.c:199]
> solicit(): Scheduling router solicitation.
> NetworkManager[7138]: <debug> [1373644860.209674] [NetworkManagerUtils.c:442]
> nm_utils_do_sysctl(): sysctl: setting
> '/proc/sys/net/ipv6/conf/wlan0/use_tempaddr' to '0'
> NetworkManager[7138]: <debug> [1373644860.330096] [rdisc/nm-lndp-rdisc.c:179]
> send_rs(): Sending router solicitation: 5
> NetworkManager[7138]: <error> [1373644860.330964] [rdisc/nm-lndp-rdisc.c:183]
> send_rs(): Cannot send router solicitation.
> NetworkManager[7138]: <debug> [1373644860.331803] [rdisc/nm-lndp-rdisc.c:187]
> send_rs(): Scheduling router solicitation retry in 10 seconds.
> 
> no idea what that's about, but the debug code in send_ra() should probably
> print out the error returned from libndp so that we have a clue what's going
> on.  It turns out the error is -99, which is probably EADDRNOTAVAIL.  Any idea
> what that's about?

Seen that already with nm-ip6-manager long time ago. I suspect this may happen when the link-local address is not yet available for communication. A workaround would be to use a shorter interval when that occurs. A proper solution would be to make nm-device *wait* for the duplicate address detection of the link-local address that's performed by the kernel, which is currently not supported by nm-platform.

If you see this error with the very first solicitation, the issue is not fatal and only causes a 10s delay.

> (NetworkManager:3831): GLib-CRITICAL **: g_bytes_get_data: assertion `bytes !=
> NULL' failed
> 
> which comes from receive_ra().  It appears that nothing in nm-device.c calls
> nm_rdisc_set_lladdr(), so yeah rdisc->lladdr is NULL and that makes things
> unhappy.  Do interfaces like PPP even have an lladdr?

I will need to look at the RFC for neighbor discovery over PPP. I suspect it has its specific and we only currently handle ethernet-like interfaces. TODO.

> In any case, we should
> either assert that an lladdr has been set on the NMRDisc, or the code should
> handle not having one.

Yep.

> NetworkManager[9845]: <info> Activation (wlan0) Stage 4 of 5 (IPv6 Configure
> Timeout) scheduled...
> NetworkManager[9845]: <info> Activation (wlan0) Stage 4 of 5 (IPv6 Configure
> Timeout) started...
> NetworkManager[9845]: <info> Activation (wlan0) Stage 4 of 5 (IPv6 Configure
> Timeout) complete.
> NetworkManager[9845]: <debug> [1373645074.736818] [rdisc/nm-lndp-rdisc.c:546]
> process_events(): Processing libndp events.
> NetworkManager[9845]: <debug> [1373645074.736890] [rdisc/nm-lndp-rdisc.c:405]
> receive_ra(): Recieved router advertisement: 5 at 9921
> NetworkManager[9845]: <debug> [1373645074.736931] [rdisc/nm-lndp-rdisc.c:333]
> check_timestamps(): Scheduling next timestamp/lifetime check: 3600 seco
> 
> Here we're using DHCP in Managed mode, but the IP6 config timeout (because
> DHCP6 didn't reply) doesn't terminate NDP at all.

Actually, nm-device should destroy the rdisc object to disable the processing. TODO.

> I feel like a failure of
> DHCP managed mode should terminate NDP, because the router explicitly requests
> that addressing be obtained via DHCP, and this has failed.

Device configuration failure should always terminate router discovery.

(In reply to comment #22)
> I pushed some hacks to dcbw/rdisc-merge, which is rebased on top of master. 
> See if you find any of it (or the merge itself) useful.

Squashed all of your improvements to pavlix/rdisc plus added my own fixes.

Thanks for your time. Ready for testing, except for the known issues above.
Comment 24 Dan Williams 2013-07-12 17:57:35 UTC
Also, if the network doesn't have IPv6 enabled, but IPv6 method is "auto", the code will keep sending out solicitations every 10 seconds forever.  I think we've got two options, if we don't get an RA during the initial connection attempt:

1) just stop soliciting and wait for an RA to appear, then restart the solicitation stuff

2) (exponentially?) back off the solicitation interval to 2 or 4 minutes or something?

At some point we're just not going to get a response to our solicitation, and we should be a good citizen and stop flooding the network with solicitations so often when we're pretty sure there's no router.

Future enhancement: like with DHCP, I think it's reasonable to restart the solicitation timer at 10 seconds (and do DHCP v4/v6 renew) when the carrier changes.
Comment 25 Pavel Simerda 2013-07-12 18:12:29 UTC
(In reply to comment #24)
> Also, if the network doesn't have IPv6 enabled, but IPv6 method is "auto", the
> code will keep sending out solicitations every 10 seconds forever.  I think
> we've got two options, if we don't get an RA during the initial connection
> attempt:
> 
> 1) just stop soliciting and wait for an RA to appear, then restart the
> solicitation stuff
> 
> 2) (exponentially?) back off the solicitation interval to 2 or 4 minutes or
> something?
> 
> At some point we're just not going to get a response to our solicitation, and
> we should be a good citizen and stop flooding the network with solicitations so
> often when we're pretty sure there's no router.
> 
> Future enhancement: like with DHCP, I think it's reasonable to restart the
> solicitation timer at 10 seconds (and do DHCP v4/v6 renew) when the carrier
> changes.

Yep, sounds reasonable. TODO.
Comment 26 Dan Williams 2013-07-12 18:31:03 UTC
One last fix, which is my fault:

diff --git a/src/devices/nm-device.c b/src/devices/nm-device.c
index ece77bc..d744fbd 100644
--- a/src/devices/nm-device.c
+++ b/src/devices/nm-device.c
@@ -2885,7 +2885,7 @@ addrconf6_start (NMDevice *self)
                priv->ac_ip6_config = NULL;
        }
 
-       priv->rdisc = nm_lndp_rdisc_new (nm_device_get_ip_ifindex (self), priv->ip_iface);
+       priv->rdisc = nm_lndp_rdisc_new (nm_device_get_ip_ifindex (self), nm_device_get_ip_iface (self));
        nm_platform_sysctl_set (priv->ip6_accept_ra_path, "0");
 
        if (!priv->rdisc) {
Comment 27 Pavel Simerda 2013-07-15 13:27:48 UTC
(In reply to comment #26)
> One last fix, which is my fault:
> 
> diff --git a/src/devices/nm-device.c b/src/devices/nm-device.c
> index ece77bc..d744fbd 100644
> --- a/src/devices/nm-device.c
> +++ b/src/devices/nm-device.c
> @@ -2885,7 +2885,7 @@ addrconf6_start (NMDevice *self)
>                 priv->ac_ip6_config = NULL;
>         }
> 
> -       priv->rdisc = nm_lndp_rdisc_new (nm_device_get_ip_ifindex (self),
> priv->ip_iface);
> +       priv->rdisc = nm_lndp_rdisc_new (nm_device_get_ip_ifindex (self),
> nm_device_get_ip_iface (self));
>         nm_platform_sysctl_set (priv->ip6_accept_ra_path, "0");
> 
>         if (!priv->rdisc) {

Fixed.
Comment 28 Dan Williams 2013-07-16 18:13:19 UTC
There are some routing and address differences between git master and pavlix/rdisc; I'm not sure if they are intentional or not?  Taken from the same network with RA + DHCPv6 managed:

pavlix/rdisc:
---------------
4: wlan0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
    link/ether 00:23:15:08:7e:ac brd ff:ff:ff:ff:ff:ff
    inet 192.168.0.100/24 scope global wlan0
       valid_lft forever preferred_lft forever
    inet6 3ffe:b80:17e2::254/128 scope global 
       valid_lft forever preferred_lft forever
    inet6 3ffe:b80:17e2:0:223:15ff:fe08:7eac/128 scope global 
       valid_lft forever preferred_lft forever
    inet6 fe80::223:15ff:fe08:7eac/64 scope link 
       valid_lft forever preferred_lft forever

2001:470:20::2 via fe80::201:2eff:fe47:3b0f dev wlan0  metric 0 
    cache 
3ffe:b80:17e2::254 dev wlan0  proto kernel  metric 256 
3ffe:b80:17e2:0:223:15ff:fe08:7eac dev wlan0  proto kernel  metric 256 
3ffe:b80:17e2::/64 dev wlan0  proto static  metric 9 
fe80::/64 dev wlan0  proto kernel  metric 256 
default via fe80::201:2eff:fe47:3b0f dev wlan0  proto static  metric 1024 


git master:
-------------
4: wlan0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
    link/ether 00:23:15:08:7e:ac brd ff:ff:ff:ff:ff:ff
    inet 192.168.0.100/24 scope global wlan0
       valid_lft forever preferred_lft forever
    inet6 3ffe:b80:17e2::254/128 scope global 
       valid_lft forever preferred_lft forever
    inet6 3ffe:b80:17e2:0:223:15ff:fe08:7eac/64 scope global dynamic 
       valid_lft 86394sec preferred_lft 14394sec
    inet6 fe80::223:15ff:fe08:7eac/64 scope link 
       valid_lft forever preferred_lft forever

3ffe:b80:17e2::1 dev wlan0  metric 0 
    cache  expires -4309062sec
3ffe:b80:17e2::254 dev wlan0  proto kernel  metric 256 
3ffe:b80:17e2::/64 dev wlan0  proto kernel  metric 256  expires 86389sec
fe80::/64 dev wlan0  proto kernel  metric 256 
default via fe80::201:2eff:fe47:3b0f dev wlan0  proto ra  metric 1024  expires 49sec


Summary:

1) 3ffe:b80:17e2:0:223:15ff:fe08:7eac (the RA-provided route) has a prefix of /128 in pavlix/rdisc, but /64 in git master

2) pavlix/rdisc adds a route to a local address?? 3ffe:b80:17e2:0:223:15ff:fe08:7eac

3) pavlix/rdisc adds a static route to the DHCP-provided DNS server (2001:470:20::2)

4) git master adds a static route to the RA-provided DNS server (3ffe:b80:17e2::1)

5) the default route added by the kernel on git master sets an expiry time; should we do this in pavlix/rdisc?  In fact, the kernel sets expiry times for a lot of the routes generated by the RA, should we do that in pavix/rdisc too?
Comment 29 Dan Williams 2013-07-16 18:18:12 UTC
My radvd.conf is:

	prefix 3ffe:0b80:17e2::/64
	{
		AdvOnLink on;
		AdvAutonomous on;
	};

	RDNSS 3ffe:0b80:17e2::1 {
		AdvRDNSSLifetime 30;
	};

	DNSSL foobar6.com {
		AdvDNSSLLifetime 30;
	};

dhcp6d.conf is:

authoritative;
option dhcp6.name-servers 2001:470:20::2;

default-lease-time 300;
max-lease-time 300;

subnet6 3ffe:b80:17e2::/64 {
	range6 3ffe:b80:17e2::128 3ffe:b80:17e2::254;
}
Comment 30 Pavel Simerda 2013-07-16 19:30:43 UTC
(In reply to comment #28)
> 1) 3ffe:b80:17e2:0:223:15ff:fe08:7eac (the RA-provided route) has a prefix of
> /128 in pavlix/rdisc, but /64 in git master

This is intentional. The prefix length is AFAIK only significant for routing and I believe that is better handled in the routing table directly. When AdvOnLink is on, the route is installed by NetworkManager. When AdvOnLink is off, the route should *not* be installed.

Also, as a pleasant side effect, this means that all dynamic IPv6 addresses are /128, thus removing the difference between RA-provided address and DHCP-provided address.

> 2) pavlix/rdisc adds a route to a local address??
> 3ffe:b80:17e2:0:223:15ff:fe08:7eac

It's automatically installed by the kernel and is AFAIK harmless and overridden by in the 'local' table anyway.

> 3) pavlix/rdisc adds a static route to the DHCP-provided DNS server
> (2001:470:20::2)
> 4) git master adds a static route to the RA-provided DNS server
> (3ffe:b80:17e2::1)

In my opinion, it is a bug to install routes to DNS servers. DNS servers live at a different layer. But the difference between 'master' and 'rdisc' suggests the order of nameservers is changed.

> 5) the default route added by the kernel on git master sets an expiry time;
> should we do this in pavlix/rdisc?  In fact, the kernel sets expiry times for a
> lot of the routes generated by the RA, should we do that in pavix/rdisc too?

It is a good idea to set the expiry times for all addresses and routes, but it shouldn't be a blocker. Compare to lifetimes coming from DHCP. It would be a feature request for pavlix/runtime.
Comment 31 Pavel Simerda 2013-07-16 19:44:02 UTC
(In reply to comment #30)
> (In reply to comment #28)
> > 3) pavlix/rdisc adds a static route to the DHCP-provided DNS server
> > (2001:470:20::2)
> > 4) git master adds a static route to the RA-provided DNS server
> > (3ffe:b80:17e2::1)
> 
> In my opinion, it is a bug to install routes to DNS servers. DNS servers live
> at a different layer. But the difference between 'master' and 'rdisc' suggests
> the order of nameservers is changed.

Taking back. The routes are added by the kernel. We generally need to ignore all 'cache' routes and it is a bug that your tool (e.g. iproute) displays them by default.
Comment 32 Pavel Simerda 2013-07-16 20:19:09 UTC
Known issue: Temporary addresses (aka privacy extensions) are not supported.

We agreed on IRC that pavlix/rdisc can be merged without this feature provided that we will add the feature later, when pavlix/runtime bits regarding address lifetimes are added. Also we agreed that pavlix/runtime should be amended to support address/route lifetimes for router discovery as well as DHCP (DHCP addresses are already there).
Comment 33 Dan Williams 2013-07-17 16:42:45 UTC
I'm not getting the RDNSS or DNSSL information from the RA put into resolv.conf, also I don't see it in the debug dump of the RA info:

NetworkManager[2226]: <debug> [1374079219.222859] [rdisc/nm-rdisc.c:103] config_changed(): (wlan0): router discovery configuration changed [RSD]:
NetworkManager[2226]: <debug> [1374079219.222920] [rdisc/nm-rdisc.c:104] config_changed():   dhcp-level managed
NetworkManager[2226]: <debug> [1374079219.222967] [rdisc/nm-rdisc.c:109] config_changed():   gateway fe80::201:2eff:fe47:3b0f pref 2 exp 17171
NetworkManager[2226]: <debug> [1374079219.223016] [rdisc/nm-rdisc.c:115] config_changed():   address 3ffe:b80:17e2:0:223:15ff:fe08:7eac exp 86400
NetworkManager[2226]: <debug> [1374079219.223107] [rdisc/nm-rdisc.c:121] config_changed():   route 3ffe:b80:17e2:: pref 0 exp 86400

I've verified that information *is* sent in the RA using wireshark though, so somehow either libndp isn't parsing it, or it's not getting read correctly out of libndp.
Comment 34 Dan Williams 2013-07-17 16:45:07 UTC
We should also start separate bugs for the enhancements/fixes we need to do after this branch is merged, we currently have:

1) privacy extensions
2) possibly reduce frequency of log messages
3) back off RS interval if no RA has been received so that non-IPv6 networks don't keep getting flooded with RSs (and restart interval when an RA is received)
4) send address lifetimes to the kernel
Comment 35 Pavel Simerda 2013-07-17 20:14:03 UTC
(In reply to comment #33)
> I'm not getting the RDNSS or DNSSL information from the RA put into
> resolv.conf, also I don't see it in the debug dump of the RA info:
> 
> NetworkManager[2226]: <debug> [1374079219.222859] [rdisc/nm-rdisc.c:103]
> config_changed(): (wlan0): router discovery configuration changed [RSD]:
> NetworkManager[2226]: <debug> [1374079219.222920] [rdisc/nm-rdisc.c:104]
> config_changed():   dhcp-level managed
> NetworkManager[2226]: <debug> [1374079219.222967] [rdisc/nm-rdisc.c:109]
> config_changed():   gateway fe80::201:2eff:fe47:3b0f pref 2 exp 17171
> NetworkManager[2226]: <debug> [1374079219.223016] [rdisc/nm-rdisc.c:115]
> config_changed():   address 3ffe:b80:17e2:0:223:15ff:fe08:7eac exp 86400
> NetworkManager[2226]: <debug> [1374079219.223107] [rdisc/nm-rdisc.c:121]
> config_changed():   route 3ffe:b80:17e2:: pref 0 exp 86400
> 
> I've verified that information *is* sent in the RA using wireshark though, so
> somehow either libndp isn't parsing it, or it's not getting read correctly out
> of libndp.

Fixed. Items were removed in cleanup_* functions because their timestamps were left zero.
Comment 36 Dan Williams 2013-07-17 22:52:00 UTC
Looks better, though now I get:

NetworkManager[17666]: <debug> [1374101372.258341] [rdisc/nm-rdisc.c:127] config_changed():   dns_server 3ffe:b80:17e2::1

and in resolv.conf:

nameserver 3ffe:b80:17e2::

is that expected?  Or is my radvd.conf just wrong?  I thought that only zeros could be omitted from addresses, and thus there should be a "1" at the end in resolv.conf.
Comment 37 Pavel Simerda 2013-07-18 10:07:50 UTC
(In reply to comment #36)
> Looks better, though now I get:
> 
> NetworkManager[17666]: <debug> [1374101372.258341] [rdisc/nm-rdisc.c:127]
> config_changed():   dns_server 3ffe:b80:17e2::1
> 
> and in resolv.conf:
> 
> nameserver 3ffe:b80:17e2::

That means rdisc has the correct value but it disappears on the way to resolv.conf. I don't see any obvious bug. I would suspect a off-by-one or INET_ADDRSTRLEN instead of INET6_ADDRSTRLEN.
Comment 38 Dan Winship 2013-07-18 12:00:52 UTC
in nm-device.c:rdisc_config_changed():

	NMRDiscDNSServer *discovered_server = &g_array_index (rdisc->routes, NMRDiscDNSServer, i);

rdisc->routes should be rdisc->dns_servers of course.
Comment 39 Pavel Simerda 2013-07-18 14:19:39 UTC
Just a reminder, we should not merge before any renaming [as suggested by Dan Winship].
Comment 40 Pavel Simerda 2013-07-18 14:24:13 UTC
(In reply to comment #38)
> in nm-device.c:rdisc_config_changed():
> 
>     NMRDiscDNSServer *discovered_server = &g_array_index (rdisc->routes,
> NMRDiscDNSServer, i);
> 
> rdisc->routes should be rdisc->dns_servers of course.

Looks like you found the problem mentioned above. Thanks!

Branch updated.
Comment 41 Pavel Simerda 2013-07-18 14:25:11 UTC
When building and running on Gentoo, I get:

# NetworkManager --debug
NetworkManager: error while loading shared libraries: libndp.so.0: cannot open shared object file: No such file or directory
Comment 42 Dan Williams 2013-07-18 14:36:49 UTC
(In reply to comment #41)
> When building and running on Gentoo, I get:
> 
> # NetworkManager --debug
> NetworkManager: error while loading shared libraries: libndp.so.0: cannot open
> shared object file: No such file or directory

Yeah, that's because we're adding libndp.so.0 but not telling the link where to find the library at runtime, and so it of course assumes we mean /usr/lib64.  We have three choices when using private libndp:

1) build libndp as a static library (pass --enable-static at configure time) and link it into NM; this is pretty evil but it works for distributors too

2) use rpath to tell the link where our private copy of libndp is inside the NM tree so it can actually use it; this only works for developers who run NM from the build tree

3) always use system libndp

A perfect world is #3 (and we have it in Fedora but jpirko needs to update it).  Until then, #1 might be a cleaner solution, except that we'd need more Makefile magic because system libndp is a shared object while the static library would require different Makefile flags.
Comment 43 Dan Williams 2013-07-19 21:50:53 UTC
danw: would it be possible to merge the branch before doing a rename of the various rdisc stuff?
Comment 44 Dan Winship 2013-07-20 12:44:44 UTC
Sure, just make sure there's a bug filed for it...
Comment 45 Pavel Simerda 2013-07-20 15:11:47 UTC
Pushed pavlix/rdisc together with a couple of [hopefully] safe patches from pavlix/runtime to avoid the conflict. The libndp library from git submodule is now linked statically.

Keeping the bug open until we process various suggestions and missing parts.
Comment 46 Pavel Simerda 2013-07-22 11:08:25 UTC
Pushed two new patches to master.
Comment 47 Pavel Simerda 2013-07-30 20:33:04 UTC
Filed bug 705170 (temporary addresses) and bug 705171 (router solicitation timing).

(In reply to comment #44)
> Sure, just make sure there's a bug filed for it...

danw: Could you please start a bug report for the renaming, you have the best insight in what you would like to change.

Unless there's anything else, this bug can be closed as RESOLVED/FIXED.
Comment 48 Dan Williams 2013-08-13 14:27:22 UTC
I think this one is done for now...
Comment 49 Pavel Simerda 2013-08-20 15:37:54 UTC
Adding the original idea link for reference:

https://fedoraproject.org/wiki/Networking/Ideas/AutomaticConfiguration