After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.
Bug 699751 - [review] pavlix/platform-use: switch the whole codebase to nm-platform
[review] pavlix/platform-use: switch the whole codebase to nm-platform
Status: RESOLVED FIXED
Product: NetworkManager
Classification: Platform
Component: general
unspecified
Other Linux
: Normal normal
: ---
Assigned To: Pavel Simerda
NetworkManager maintainer(s)
Depends on: 702647
Blocks: 682872 690212
 
 
Reported: 2013-05-06 11:58 UTC by Pavel Simerda
Modified: 2013-07-07 20:57 UTC
See Also:
GNOME target: ---
GNOME version: ---


Attachments
a large patch for switching link management to nm-platform (32.31 KB, patch)
2013-05-24 10:54 UTC, Pavel Simerda
none Details | Review
a large patch for switching link management to nm-platform (32.31 KB, patch)
2013-05-24 11:37 UTC, Pavel Simerda
none Details | Review
a large patch for switching link management to nm-platform (32.31 KB, patch)
2013-05-24 11:40 UTC, Pavel Simerda
none Details | Review
a large patch for switching link management to nm-platform (30.77 KB, patch)
2013-05-24 11:42 UTC, Pavel Simerda
none Details | Review
a large patch for switching link management to nm-platform (30.77 KB, patch)
2013-05-24 11:43 UTC, Pavel Simerda
reviewed Details | Review
a large patch for switching link management to nm-platform (30.30 KB, patch)
2013-05-24 17:16 UTC, Pavel Simerda
committed Details | Review

Description Pavel Simerda 2013-05-06 11:58:22 UTC
We are finalizing the nm-platform module in the master branch. The rest of the code doesn't use it yet. Patches to switch to nm-platform are in 'pavlix/platform-use' and still a work in progress.

Those changes, as well as other modifications using nm-platform (e.g. to support temporary connections as in https://bugzilla.gnome.org/show_bug.cgi?id=682872), can and will affect the code stability and introduce bugs that will have to be solved.
Comment 1 Pavel Simerda 2013-05-24 10:54:34 UTC
Created attachment 245223 [details] [review]
a large patch for switching link management to nm-platform
Comment 2 Pavel Simerda 2013-05-24 11:37:29 UTC
Created attachment 245230 [details] [review]
 a large patch for switching link management to nm-platform
Comment 3 Pavel Simerda 2013-05-24 11:40:42 UTC
Created attachment 245231 [details] [review]
a large patch for switching link management to nm-platform
Comment 4 Pavel Simerda 2013-05-24 11:42:15 UTC
Created attachment 245233 [details] [review]
a large patch for switching link management to nm-platform
Comment 5 Pavel Simerda 2013-05-24 11:43:39 UTC
Created attachment 245234 [details] [review]
a large patch for switching link management to nm-platform
Comment 6 Dan Winship 2013-05-24 14:09:48 UTC
Comment on attachment 245234 [details] [review]
a large patch for switching link management to nm-platform

>-	success = nm_system_bond_enslave (nm_device_get_ip_ifindex (device),
>-	                                  iface,
>-	                                  nm_device_get_ip_ifindex (slave),
>-	                                  slave_iface);
>+	success = nm_platform_link_enslave (nm_device_get_ip_ifindex (device),
>+	                                    nm_device_get_ip_ifindex (slave));

optionally you could get rid of the iface and slave_iface variables now, like the corresponding code in release_slave(). (Likewise in bridge.)

> 		// FIXME: Convert this into a no-export property so type can be specified
> 		//        when the device is created.
>-		itype = nm_system_get_iface_type (nm_device_get_ifindex (self), nm_device_get_iface (self));
>-		g_assert (itype == NM_IFACE_TYPE_UNSPEC);
>+		g_assert (nm_platform_link_get_type (ifindex) == NM_LINK_TYPE_ETHERNET);

The comment is out of date; we will have different device subtypes for different NMPlatformLinkType values.

> 		if (!device_has_capability (dev, NM_DEVICE_CAP_NONSTANDARD_CARRIER)) {
>+			NMPlatform *platform = nm_platform_get ();
> 
>+			priv->link_changed_id = g_signal_connect (platform, "link-changed",
>+			                                          G_CALLBACK (link_changed_cb), dev);
> 		}

Possibly replace this with the code from http://cgit.freedesktop.org/NetworkManager/NetworkManager/diff/src/devices/nm-device.c?h=danw/moredevs&id=e0565e4f, so we don't end up just changing it twice? (The new method is used so that the devices can emit proper properties-changed signals in response to a link-changed event.

(If you merge that in, remember to add back the check for NONSTANDARD_CARRIER in link_changed())

> 	int ifindex = nm_device_get_ip_ifindex (device);
> 
>-	return ifindex > 0 ? nm_system_iface_is_up (ifindex) : TRUE;
>+	return ifindex ? nm_platform_link_is_up (nm_device_get_ip_ifindex (device)) : TRUE;

You don't need to call nm_device_get_ip_ifindex() again...

> 	/* Store carrier immediately. */
> 	if (result && device_has_capability (device, NM_DEVICE_CAP_CARRIER_DETECT))
>-		check_carrier (device);
>+		nm_device_set_carrier (device, nm_platform_link_is_connected (nm_device_get_ip_ifindex (device)));

This loses the NONSTANDARD_CARRIER check (that was in check_carrier()), meaning it will overwrite NMDeviceAdsl's carrier with garbage.

>+		result = !nm_platform_bridge_add (iface);
>+		if (!result && nm_platform_get_error () != NM_PLATFORM_ERROR_EXISTS) {

The "!" in the first line is wrong I think? (It's very confusing if not...)

>@@ -2196,7 +2213,7 @@ udev_device_added_cb (NMUdevManager *udev_mgr,
> 		device = find_device_by_ifindex (self, ifindex);
> 		if (device) {
> 			/* If it's a virtual device we may need to update its UDI */
>-			if (nm_system_get_iface_type (ifindex, iface) != NM_IFACE_TYPE_UNSPEC)
>+			if (nm_platform_link_get_type (ifindex) != NM_LINK_TYPE_UNKNOWN)
> 				g_object_set (G_OBJECT (device), NM_DEVICE_UDI, sysfs_path, NULL);

This is wrong... same as the one I noticed in nm-device last night. NM_IFACE_TYPE_UNSPEC means "not bond, bridge, or vlan".

Given that you removed the check in nm-device, we can probably just re-set the property unconditionally here?
Comment 7 Pavel Simerda 2013-05-24 17:16:03 UTC
Created attachment 245256 [details] [review]
a large patch for switching link management to nm-platform

Leaving out any of the optional stuff for later fixes.
Comment 8 Pavel Simerda 2013-05-24 17:45:14 UTC
Review of attachment 245256 [details] [review]:

Pushed to master.
Comment 9 Pavel Simerda 2013-05-24 17:47:47 UTC
Leaving the bug open for continuous work on nm-platform integration, namely address and route management.
Comment 10 Pavel Simerda 2013-05-27 19:10:32 UTC
Opening the rest of the patches (pavlix/platform-use) for review. This one might be a bit tough, so as long as you are happy with *any* single patch, please let me know. You can safely ignore the cleanup patches (I'll probably move them at the and and partially squash them anyway).
Comment 11 Dan Winship 2013-05-28 16:43:16 UTC
>    platform: address utility functions
>
>+		if (!memcmp (&g_array_index (addresses, NMPlatformIP4Address, i), address, sizeof (*address)))

g_array_index() is kinda ugly. A lot of times I find it nicer to do something like:

    NMPlatformIP4Address *addrs = (NMPlatformIP4Address *)addresses->data;

    for (i = 0; i < addresses->len; i++) {
        if (!memcmp (&addrs[i], address, sizeof (*address)))
            ...

YMMV.

The patch as a whole makes me really feel like we should only have a singe NMPlatformIPAddress though, since the ip4 and ip6 versions are exact duplicates of each other. That can happen later though.

>+	if (!nm_platform_ip4_address_delete (ifindex, address->address, address->plen))
>+		return FALSE;

Note that nm-system would just log an error and keep going in this case rather than bailing out.

>+nm_platform_ip6_address_sync (int ifindex, const GArray *known_addresses)

nm-system ignores IPv6 link-local addresses when syncing... I suppose we don't actually have to do that, since nm_ip6_manager_get_ip6_config() includes the link-local addresses too... but this is something to watch out that might break.

>+nm_platform_address_flush (int ifindex)

I know you just copied the name from nm-system, but "flush" to me suggests pushing out queued changes or something. "clear"?

>+	return nm_platform_ip4_address_sync (ifindex, NULL)

nm_platform_ip[46]_address_sync() will crash if you pass NULL



>    core: use nm-platform for address management

Hm... we always call nm_platform_address_flush() and nm_platform_route_flush() together, and it's not clear there's any reason you'd ever want to call them separately, so maybe they should be merged together? And in all but one of the places they're called, we call nm_platform_link_set_down() too, and there's no situation where we ever call nm_platform_link_set_down() without also flushing addresses and routes, so we should just have link_set_down() call the flushing function(s) itself, to simplify the nm-modem and nm-vpn-connection call sites. (nm-device still wants to be able to flush without downing though, so we still need a separate flushing func too.)



>    policy: use nm-platform for routing

>+	/* Add direct route to external gateway and set default route through
>+	 * internal gateway.
>+	 */

That's not quite what the code does; it tries to set the default route first, and only if that fails does it add a direct route to the external gateway. (Although in the VPN case, it seems like maybe it *should* be the way you commented it...)



>    vpn: use nm-platform for routing

>+	NMPlatformIP4Route *route = g_new0 (NMPlatformIP4Route, 1);

it's generally considered bad style to directly g_new()/g_free() somebody else's datatype. If you need to do this here there should be nm_platform_ip4_route_new() and nm_platform_ip4_route_free().

Actually, it's also pretty weird that you're using NMPlatform's data type to store the data, but never actually passing that object to NMPlatform...

Also, you leak the route in the "if (!parent_gw) return NULL;" case.(Contrariwise, in the ip6 version you never allocate route.)

The nm-vpn-connection code has multiple leaks.



>    platform: add nm_platform_route_set_metric()

it's inconsistent that you don't have separate ip4 and ip6 versions of this...

Also, you don't implement this in either NMFakePlatform or NMLinuxPlatform...

Hm, but you don't need this anyway; you should just be setting the correct metric in the route structs before passing them to sync().


>    core: use route convenience functions

>-	family = tried_ipv6 ? AF_UNSPEC : AF_INET;
> 	if (ifindex > 0) {
>-		nm_system_iface_flush_routes (ifindex, family);
>+		nm_platform_route_flush (ifindex);

This partially breaks NM_SETTING_IP6_CONFIG_METHOD_IGNORE. I'm not sure we care at this point given that we've decided it's doomed.
Comment 12 Dan Williams 2013-06-04 14:47:18 UTC
> core: use nm-platform for address management

We seem to be completely losing the Point-to-Point address stuff here (rtnl_addr_set_peer).  That will break PPP handling; if the address is meant for a PPP interface, then we get a local address and a remote address, and we need to tell the kernel to use the remote address as a the peer via rtnl_addr_set_peer().  Obviously there can only be one PTP address for an interface, which is what the "did_gw" stuff in add_ip4_addresses() used to do that's now removed.

For IPv6 we don't really have PTP addresses, I don't think, because on PPP interfaces we still get a prefix and router advertisements, so everything gets directed to the router.  Not 100% sure on the exact kernel setup though.
Comment 13 Dan Williams 2013-06-04 14:52:37 UTC
> vpn: use nm-platform for routing

whitespace issues

ip4_gw_route and ip6_gw_route should get freed in dispose() like the old gw_route was.
Comment 14 Dan Williams 2013-06-04 14:54:34 UTC
> platform: add nm_platform_route_set_metric()

No implementation for this yet?  If it gets called then we'll return_val_if_fail().  Should only get added when we have an implementation right?
Comment 15 Pavel Simerda 2013-06-06 07:28:00 UTC
(In reply to comment #11)
> >    platform: address utility functions
> >
> >+		if (!memcmp (&g_array_index (addresses, NMPlatformIP4Address, i), address, sizeof (*address)))
> 
> g_array_index() is kinda ugly. A lot of times I find it nicer to do something
> like:
> 
>     NMPlatformIP4Address *addrs = (NMPlatformIP4Address *)addresses->data;
> 
>     for (i = 0; i < addresses->len; i++) {
>         if (!memcmp (&addrs[i], address, sizeof (*address)))
>             ...
> 
> YMMV.

I think we can affort to call it a cosmetic issue and fix it up later...

> The patch as a whole makes me really feel like we should only have a singe
> NMPlatformIPAddress though, since the ip4 and ip6 versions are exact duplicates
> of each other. That can happen later though.

Those structures are very simple and are of different size. The decision was made to avoid polymorphy in the API. While I don't mind switching to a polymorphic model later, I don't think it will give us any real value.

I feel slightly against this change unless we have a good reason. Even then that would be done later, feel free to track it in a separate ticket then.

> >+	if (!nm_platform_ip4_address_delete (ifindex, address->address, address->plen))
> >+		return FALSE;
> 
> Note that nm-system would just log an error and keep going in this case rather
> than bailing out.

Switched to the old behavior.
 
> >+nm_platform_ip6_address_sync (int ifindex, const GArray *known_addresses)
> 
> nm-system ignores IPv6 link-local addresses when syncing...

My impression is that we should ignore link-local addresses entirely and leave them up to the kernel. Likewise we should avoid tinkering with routes which belong to them.

TODO I will look into it.

> I suppose we don't
> actually have to do that, since nm_ip6_manager_get_ip6_config() includes the
> link-local addresses too... but this is something to watch out that might
> break.

Better safe than sorry. We might want to avoid them in all NetworkManager code except maybe published reporting/debugging information. I'll change the code so that we avoid adding/deleting link local addresses and their corresponding device routes.

> >+nm_platform_address_flush (int ifindex)
> 
> I know you just copied the name from nm-system, but "flush" to me suggests
> pushing out queued changes or something. "clear"?

Could be. Flush also corresponds to iproute2 naming, but that doesn't mean much. Could also be _delete_all or something along that line.

> >+	return nm_platform_ip4_address_sync (ifindex, NULL)
> 
> nm_platform_ip[46]_address_sync() will crash if you pass NULL

Fixed.

> >    core: use nm-platform for address management
> 
> Hm... we always call nm_platform_address_flush() and nm_platform_route_flush()
> together, and it's not clear there's any reason you'd ever want to call them
> separately, so maybe they should be merged together?

I'm not sure.

> And in all but one of the
> places they're called, we call nm_platform_link_set_down() too, and there's no
> situation where we ever call nm_platform_link_set_down() without also flushing
> addresses and routes, so we should just have link_set_down() call the flushing
> function(s) itself, to simplify the nm-modem and nm-vpn-connection call sites.

> (nm-device still wants to be able to flush without downing though, so we still
> need a separate flushing func too.)

Actually, link_set_down is a low-level interface and you never know when you need it. But we could possibly have a new function that would flush addresses and routes and optionally (via a boolean argument) also set interface down. That would be a convenience function in nm-platform.c.

And because it's not strictly bound to a link, it could just be called nm_platform_cleanup(). Alternatively it could be nm_platform_link_cleanup() and would break the unwritten rule that _link_ functions don't do anything about addresses and routes, which we might need to break anyway for some reason or other.


> >    policy: use nm-platform for routing
> 
> >+	/* Add direct route to external gateway and set default route through
> >+	 * internal gateway.
> >+	 */
> 
> That's not quite what the code does; it tries to set the default route first,
> and only if that fails does it add a direct route to the external gateway.

Ah, it's a left-over. Removing the comments, the code is self-descriptive.

> (Although in the VPN case, it seems like maybe it *should* be the way you
> commented it...)

Maybe?

> >    vpn: use nm-platform for routing
> 
> >+	NMPlatformIP4Route *route = g_new0 (NMPlatformIP4Route, 1);
> 
> it's generally considered bad style to directly g_new()/g_free() somebody
> else's datatype.

In this case we're just borrowing a type to avoid creating our own. 

> If you need to do this here there should be
> nm_platform_ip4_route_new() and nm_platform_ip4_route_free().

From the nm-platform's point of view, the struct is actually intended for read-only use from signals and _get_all() functions.

> Actually, it's also pretty weird that you're using NMPlatform's data type to
> store the data, but never actually passing that object to NMPlatform...

It naturally follows from the facts stated above.

> Also, you leak the route in the "if (!parent_gw) return NULL;"
> case.(Contrariwise, in the ip6 version you never allocate route.)

Fixed.

> The nm-vpn-connection code has multiple leaks.

TODO Will look into it.

> >    platform: add nm_platform_route_set_metric()
> 
> it's inconsistent that you don't have separate ip4 and ip6 versions of this...

address_flush/route_flush are also protocol agnostic. I think this is more about usefulness than about having all functions protocol specific.

> Also, you don't implement this in either NMFakePlatform or NMLinuxPlatform...
> 
> Hm, but you don't need this anyway;

I'll see whether I can remove it.

> you should just be setting the correct
> metric in the route structs before passing them to sync().

That's true. I'll either remove it or make it a convenience function that doesn't need separate implementations.

> >    core: use route convenience functions
> 
> >-	family = tried_ipv6 ? AF_UNSPEC : AF_INET;
> > 	if (ifindex > 0) {
> >-		nm_system_iface_flush_routes (ifindex, family);
> >+		nm_platform_route_flush (ifindex);
> 
> This partially breaks NM_SETTING_IP6_CONFIG_METHOD_IGNORE. I'm not sure we care
> at this point given that we've decided it's doomed.

I'm just curious whether IGNORE was broken already by flushing all routes instead of only IPv4 routes. *If* IGNORE was still necessary, we'd probably need to either add back the family argument, or use nm_platform_ip4_route_sync(NULL) instead.
Comment 16 Pavel Simerda 2013-06-06 07:43:36 UTC
(In reply to comment #12)
> > core: use nm-platform for address management
> 
> We seem to be completely losing the Point-to-Point address stuff here
> (rtnl_addr_set_peer).  That will break PPP handling; if the address is meant
> for a PPP interface, then we get a local address and a remote address, and we
> need to tell the kernel to use the remote address as a the peer via
> rtnl_addr_set_peer().  Obviously there can only be one PTP address for an
> interface, which is what the "did_gw" stuff in add_ip4_addresses() used to do
> that's now removed.

True. We could certainly add a peer address and keep it zero when ordinary prefix is used. I would just like to know whether we care at all. The peer is afaik only used to set up a direct route which we can set up directly and which we see if another tool sets the IP address. Anything I missed?

> For IPv6 we don't really have PTP addresses, I don't think, because on PPP
> interfaces we still get a prefix and router advertisements, so everything gets
> directed to the router.  Not 100% sure on the exact kernel setup though.

Ah, router advertisements used with PPP? I see another example of blatant overengineering after IPv6 autoconfiguration of Ethernet.

(In reply to comment #14)
> > platform: add nm_platform_route_set_metric()
> 
> No implementation for this yet?  If it gets called then we'll
> return_val_if_fail().  Should only get added when we have an implementation
> right?

Yep, that's missing, we shouldn't push this nor dependant commits before it's done.
Comment 17 Dan Winship 2013-06-06 11:33:15 UTC
(In reply to comment #12)
> For IPv6 we don't really have PTP addresses, I don't think, because on PPP
> interfaces we still get a prefix and router advertisements

[citation required]
Comment 18 Dan Williams 2013-06-13 15:26:31 UTC
(In reply to comment #17)
> (In reply to comment #12)
> > For IPv6 we don't really have PTP addresses, I don't think, because on PPP
> > interfaces we still get a prefix and router advertisements
> 
> [citation required]

http://www.ietf.org/rfc/rfc2472.txt

IPV6CP (the PPP IPv6 address negotiation protocol) *only* negotiates an Interface Identifier, nothing else; no peer address, no gateway, whatever.  You then listen for Router Advertisements on the PPP interface to get the rest of the configuration options, like prefix, domains, DNS, etc.
Comment 19 Pavel Simerda 2013-06-14 22:33:31 UTC
Ready for next round of review. Tried to fix all outstanding issues. Ignored cosmetic issues due to lack of time, those can be handled afterwards. What's still unsolved is the IPv4 peer address. As stated above, I'm not sure we really need that as IMO it can be easily implemented using a device route, comments welcome.
Comment 20 Dan Williams 2013-06-18 18:39:39 UTC
(In reply to comment #19)
> Ready for next round of review. Tried to fix all outstanding issues. Ignored
> cosmetic issues due to lack of time, those can be handled afterwards. What's
> still unsolved is the IPv4 peer address. As stated above, I'm not sure we
> really need that as IMO it can be easily implemented using a device route,
> comments welcome.

I don't really know whether the peer address stuff can be changed or not, but I'm happy to test out some kind of fix here.  That would still mean that we need to specially handle these devices somehow.
Comment 21 Dan Williams 2013-06-18 20:09:48 UTC
Ok, testing reveals that nm_platform_ip[4|6]_address_sync() need two changes:

First, GArrays need to be freed with g_array_free(addresses, TRUE);

Second, there's a C&P error in the "Add missing addresses" block:

-known_address = &g_array_index (addresses, NMPlatformIP4Address, i);
+known_address = &g_array_index (known_addresses, NMPlatformIP4Address, i);

which results in a use-after-free since 'addresses' was already freed.  Clearly C&P.
Comment 22 Dan Williams 2013-06-18 20:27:23 UTC
Next, here's the PPP output; there are some bugs we should fix here. However, the routing table and ifconfig output look correct for some reason:

CHAP authentication succeeded
CHAP authentication succeeded
** Message: nm-ppp-plugin: (nm_phasechange): status 8 / phase 'network'
Could not determine remote IP address: defaulting to 10.64.64.64
local  IP address 100.252.128.55
remote IP address 10.64.64.64
primary   DNS address 10.177.0.34
secondary DNS address 10.168.191.116
** Message: nm-ppp-plugin: (nm_phasechange): status 9 / phase 'running'
** Message: nm-ppp-plugin: (nm_ip_up): ip-up event
** Message: nm-ppp-plugin: (nm_ip_up): sending Ip4Config to NetworkManager...
NetworkManager[11420]: <info> PPP manager(IP Config Get) reply received.
NetworkManager[11420]: <info> Activation (ttyACM0) Stage 5 of 5 (IPv4 Configure Commit) scheduled...
NetworkManager[11420]: <info> Activation (ttyACM0) Stage 5 of 5 (IPv4 Commit) started...
NetworkManager[11420]: <info> (ttyACM0): device state change: ip-config -> secondaries (reason 'none') [70 90 0]
NetworkManager[11420]: <info> Activation (ttyACM0) Stage 5 of 5 (IPv4 Commit) complete.
NetworkManager[11420]: <info> (ttyACM0): device state change: secondaries -> activated (reason 'none') [90 100 0]
NetworkManager[11420]: refresh_object: assertion `kernel_object' failed
NetworkManager[11420]: refresh_object: assertion `kernel_object' failed
NetworkManager[11420]: <error> [1371586385.370965] [nm-policy.c:666] update_ip4_routing(): Failed to set default route.
NetworkManager[11420]: <info> Policy set 'T-Mobile Internet' (ppp0) as default for IPv4 routing and DNS.
NetworkManager[11420]: <info> Activation (ttyACM0) successful, device activated.
NetworkManager[11420]: <warn> Dispatcher script failed: Script '/etc/NetworkManager/dispatcher.d/04-iscsi' exited with error status 1.
NetworkManager[11420]: <warn> Dispatcher script failed: Script '/etc/NetworkManager/dispatcher.d/12-dhcpd' exited with error status 1.
NetworkManager[11420]: refresh_object: assertion `kernel_object' failed
NetworkManager[11420]: refresh_object: assertion `kernel_object' failed
NetworkManager[11420]: <error> [1371586387.121417] [nm-policy.c:666] update_ip4_routing(): Failed to set default route.
NetworkManager[11420]: <info> Policy set 'T-Mobile Internet' (ppp0) as default for IPv4 routing and DNS.

default via 10.64.64.64 dev ppp0  proto static 
10.64.64.64 dev ppp0  proto static  scope link 
10.64.64.64 dev ppp0  proto kernel  scope link  src 100.252.128.55 

22: ppp0: <POINTOPOINT,MULTICAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UNKNOWN qlen 3
    link/ppp 
    inet 100.252.128.55 peer 10.64.64.64/32 scope global ppp0
       valid_lft forever preferred_lft forever

but this is *despite* NetworkManager, because pppd actually sets the ppp0 IP address and default route behind NetworkManager.  We can't rely on it doing that (and we'd rather it didn't, but that requires changing pppd) so we should fix this in the platform.

I hacked up the platform to delete all addresses on the interface and manually add the PPP address, which appeared to work with a device route and no peer address:

24: ppp0: <POINTOPOINT,MULTICAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UNKNOWN qlen 3
    link/ppp 
    inet 100.252.128.55/32 scope global ppp0
       valid_lft forever preferred_lft forever

default dev ppp0  scope link 

this configuration does allow pings to go through.  So perhaps as long as we fix up creating the default route and the two other issues above we might be good for PPP.
Comment 23 Dan Williams 2013-06-18 20:28:29 UTC
Oh, same g_array_free() and c&p problems in nm_platform_ip[4|6]_route_sync() too.
Comment 24 Dan Williams 2013-06-18 22:25:02 UTC
Also, the return value of add_kernel_object() should be 'int' because that's what all the libnl stuff in that function returns.

Now, the default route setting error.  This is happening because nl_cache_search() matches more properties than we're setting on the defualt route we just added, but the kernel fills some stuff in for us, so the comparisons in get_kernel_object() fail:

NetworkManager[6374]: get_kernel_object: ******* needle
inet 0.0.0.0/0 table main 
    priority 0 
    nexthop via 192.168.1.1 dev 5

NetworkManager[6374]: get_kernel_object: ------- haystack
inet default table main type unicast 
    scope global priority 0 protocol static 
    nexthop via 192.168.1.1 dev 5 <>

build_rtnl_route() needs two changes:

1) set the rest of the route attributes:

+       rtnl_route_set_scope (rtnlroute, RT_SCOPE_UNIVERSE);
+       rtnl_route_set_type (rtnlroute, RTN_UNICAST);
+       rtnl_route_set_protocol (rtnlroute, RTPROT_STATIC);

2) for default routes, either (a) don't set 'dst' at all, or (b) set the addresslen of 'dst' to zero; libnl decides a route is default if:

	if (!(r->ce_mask & ROUTE_ATTR_DST) ||
	    nl_addr_get_len(r->rt_dst) == 0)
		nl_dump(p, "default ");

and clearly the route we added does not pass that test, hence it was returned 0.0.0.0/0.  But of course the kernel knows what we mean, and converts that into a default route.  This should really get fixed in libnl to mean the same thing the kernel means, but we also need to workaround that for now.

Unfortunately build_rtnl_route() is generic and 'network' is a void* so we can't de-reference it to find out whether it's 0 or not, but that would defeat the purpose of making it generic too.  So maybe the callers could push down a "valid network" boolean which if FALSE tells build_rtnl_route() to set dst's address length to zero.

Doing these two things makes the needle match the haystack route:

NetworkManager[25793]: get_kernel_object: ******* needle
inet default table main type unicast 
    scope global priority 0 protocol static 
    nexthop via 192.168.1.1 dev 5

NetworkManager[25793]: get_kernel_object: ------- haystack
inet default table main type unicast 
    scope global priority 0 protocol static 
    nexthop via 192.168.1.1 dev 5 <>
Comment 25 Dan Williams 2013-06-18 22:30:55 UTC
I pushed fixup patches for the issues I've mentioned, the last patch is a hack-up fix for the default route setting to demonstrate how to fix it for default routes, but I have no idea whether the added attributes are appropriate for non-default routes or not, you know more about that than I.
Comment 26 Pavel Simerda 2013-06-19 13:04:04 UTC
(In reply to comment #25)
> I pushed fixup patches for the issues I've mentioned

It's better to push under a separate branch name (in your namespace) to avoid clashes.

I just filed another bug report that prevents me from effectively testing the default route:

https://bugzilla.gnome.org/show_bug.cgi?id=702647

I created a patch to test the default route, tested it on test-route-fake and published in 'pavlix/platform' branch.

I'm curious why you want to pass valid_network argument which is apparently redundant as it's derived from the network argument. Also, I'm not sure what you are trying to solve.

A route with destination zero plen is a default route (and all address bytes are by definition zeroes), while a route with non-zero plen is a non-default route. We shouldn't look at the address bytes to distinguish a default route and we don't need to carry additional arguments as plen itself distinguishes default and non-default routes.

What about the following patch?

        struct rtnl_route *rtnlroute = rtnl_route_alloc ();
        struct rtnl_nexthop *nexthop = rtnl_route_nh_alloc ();
        int addrlen = (family == AF_INET) ? sizeof (in_addr_t) : sizeof (struct in6_addr);
-       auto_nl_addr struct nl_addr *dst = nl_addr_build (family, network, addrlen);
+       auto_nl_addr struct nl_addr *dst = plen ? nl_addr_build (family, network, addrlen) : NULL;
        auto_nl_addr struct nl_addr *gw = gateway ? nl_addr_build (family, gateway, addrlen) : NULL;
 
-       g_assert (rtnlroute && dst && nexthop);
-
-       nl_addr_set_prefixlen (dst, plen);
+       g_assert (rtnlroute && nexthop);
 
        rtnl_route_set_table (rtnlroute, RT_TABLE_MAIN);
        rtnl_route_set_tos (rtnlroute, 0);
-       rtnl_route_set_dst (rtnlroute, dst);
        rtnl_route_set_priority (rtnlroute, metric);
 
+       if (dst) {
+               nl_addr_set_prefixlen (dst, plen);
+               rtnl_route_set_dst (rtnlroute, dst);
+       }
+
        rtnl_route_nh_set_ifindex (nexthop, ifindex);
        if (gw && !nl_addr_iszero (gw))
                rtnl_route_nh_set_gateway (nexthop, gw);
Comment 27 Pavel Simerda 2013-06-19 13:07:10 UTC
(In reply to comment #24)
>     if (!(r->ce_mask & ROUTE_ATTR_DST) ||
>         nl_addr_get_len(r->rt_dst) == 0)
>         nl_dump(p, "default ");

This looks like a pretty trivial bug in libnl.
Comment 28 Dan Williams 2013-06-19 13:25:48 UTC
(In reply to comment #26)
> (In reply to comment #25)
> > I pushed fixup patches for the issues I've mentioned
> 
> It's better to push under a separate branch name (in your namespace) to avoid
> clashes.

True.  The proliferation of branches continues.

> I'm curious why you want to pass valid_network argument which is apparently
> redundant as it's derived from the network argument. Also, I'm not sure what
> you are trying to solve.

Partly because I wasn't 100% sure that we'd never see 0.0.0.0 with a prefix, but if you say that plen=0 is all we care about for default route, great.


> A route with destination zero plen is a default route (and all address bytes
> are by definition zeroes), while a route with non-zero plen is a non-default
> route. We shouldn't look at the address bytes to distinguish a default route
> and we don't need to carry additional arguments as plen itself distinguishes
> default and non-default routes.

Ok, excellent.  Skip my patch then.

> What about the following patch?
> 
>         struct rtnl_route *rtnlroute = rtnl_route_alloc ();
>         struct rtnl_nexthop *nexthop = rtnl_route_nh_alloc ();
>         int addrlen = (family == AF_INET) ? sizeof (in_addr_t) : sizeof (struct
> in6_addr);
> -       auto_nl_addr struct nl_addr *dst = nl_addr_build (family, network,
> addrlen);
> +       auto_nl_addr struct nl_addr *dst = plen ? nl_addr_build (family,
> network, addrlen) : NULL;
>         auto_nl_addr struct nl_addr *gw = gateway ? nl_addr_build (family,
> gateway, addrlen) : NULL;
> 
> -       g_assert (rtnlroute && dst && nexthop);
> -
> -       nl_addr_set_prefixlen (dst, plen);
> +       g_assert (rtnlroute && nexthop);
> 
>         rtnl_route_set_table (rtnlroute, RT_TABLE_MAIN);
>         rtnl_route_set_tos (rtnlroute, 0);
> -       rtnl_route_set_dst (rtnlroute, dst);
>         rtnl_route_set_priority (rtnlroute, metric);
> 
> +       if (dst) {
> +               nl_addr_set_prefixlen (dst, plen);
> +               rtnl_route_set_dst (rtnlroute, dst);
> +       }
> +
>         rtnl_route_nh_set_ifindex (nexthop, ifindex);
>         if (gw && !nl_addr_iszero (gw))
>                 rtnl_route_nh_set_gateway (nexthop, gw);

Yeah, that should avoid the matching problem in libnl, but I can check if you like just to  make sure.
Comment 29 Pavel Simerda 2013-06-19 13:55:03 UTC
Updated the branch with fixes for all of the above stuff except the peer addresses. I would like to amend the testsuite before pushing the commits marked as UNTESTED.

(In reply to comment #28)
> Partly because I wasn't 100% sure that we'd never see 0.0.0.0 with a prefix,
> but if you say that plen=0 is all we care about for default route, great.
> > What about the following patch?
> > 
> >         struct rtnl_route *rtnlroute = rtnl_route_alloc ();
> >         struct rtnl_nexthop *nexthop = rtnl_route_nh_alloc ();
> >         int addrlen = (family == AF_INET) ? sizeof (in_addr_t) : sizeof (struct
> > in6_addr);
> > -       auto_nl_addr struct nl_addr *dst = nl_addr_build (family, network,
> > addrlen);
> > +       auto_nl_addr struct nl_addr *dst = plen ? nl_addr_build (family,
> > network, addrlen) : NULL;
> >         auto_nl_addr struct nl_addr *gw = gateway ? nl_addr_build (family,
> > gateway, addrlen) : NULL;
> > 
> > -       g_assert (rtnlroute && dst && nexthop);
> > -
> > -       nl_addr_set_prefixlen (dst, plen);
> > +       g_assert (rtnlroute && nexthop);
> > 
> >         rtnl_route_set_table (rtnlroute, RT_TABLE_MAIN);
> >         rtnl_route_set_tos (rtnlroute, 0);
> > -       rtnl_route_set_dst (rtnlroute, dst);
> >         rtnl_route_set_priority (rtnlroute, metric);
> > 
> > +       if (dst) {
> > +               nl_addr_set_prefixlen (dst, plen);
> > +               rtnl_route_set_dst (rtnlroute, dst);
> > +       }
> > +
> >         rtnl_route_nh_set_ifindex (nexthop, ifindex);
> >         if (gw && !nl_addr_iszero (gw))
> >                 rtnl_route_nh_set_gateway (nexthop, gw);
> 
> Yeah, that should avoid the matching problem in libnl, but I can check if you
> like just to  make sure.

Thomas indicated that this wouldn't work because of auto-created nl_addr. I've posted an alternative patch:

        struct rtnl_route *rtnlroute = rtnl_route_alloc ();
        struct rtnl_nexthop *nexthop = rtnl_route_nh_alloc ();
        int addrlen = (family == AF_INET) ? sizeof (in_addr_t) : sizeof (struct in6_addr);
-       auto_nl_addr struct nl_addr *dst = nl_addr_build (family, network, addrlen);
+       /* Workaround a libnl bug by using zero destination address length for default routes */
+       auto_nl_addr struct nl_addr *dst = nl_addr_build (family, network, plen ? addrlen : 0);
        auto_nl_addr struct nl_addr *gw = gateway ? nl_addr_build (family, gateway, addrlen) : NULL;
 
        g_assert (rtnlroute && dst && nexthop);
Comment 30 Dan Williams 2013-06-25 03:20:50 UTC
The PPP bits look fine now.  I also don't get default route failures, *except* in one case, where the VPN doesn't connect:

NetworkManager[27180]: <error> [1372128572.546574] [vpn-manager/nm-vpn-connection.c:1387] get_secrets_cb(): Failed to request VPN secrets #2: (6) No agents were available for this request.
NetworkManager[27180]: <error> [1372128572.547393] [nm-policy.c:666] update_ip4_routing(): Failed to set default route.
NetworkManager[27180]: <info> Policy set 'Auto el chiefo grande' (wlan0) as default for IPv4 routing and DNS.

and also when the VPN succeeds:

NetworkManager[27180]: <info> VPN connection 'rh-openvpn' (IP Config Get) complete.
NetworkManager[27180]: <error> [1372128712.5281] [nm-policy.c:666] update_ip4_routing(): Failed to set default route.
NetworkManager[27180]: <info> Policy set 'Auto el chiefo grande' (wlan0) as default for IPv4 routing and DNS.
NetworkManager[27180]: <info> VPN plugin state changed: started (4)
NetworkManager[27180]: <error> [1372128712.14011] [nm-policy.c:666] update_ip4_routing(): Failed to set default route.
NetworkManager[27180]: <info> Policy set 'Auto el chiefo grande' (wlan0) as default for IPv4 routing and DNS.

But this isn't a big issue since the default route is still set, and the code is (I *think*) complaining that it can't replace the default route with an identical default route.  We should fix this, but it's not a block for the merge.

--------------
What is a merge blocker is fixed by this:

--- a/src/vpn-manager/nm-vpn-connection.c
+++ b/src/vpn-manager/nm-vpn-connection.c
@@ -168,7 +168,7 @@ vpn_cleanup (NMVPNConnection *connection)
                nm_platform_address_flush (priv->ip_ifindex);
        }
 
-       if (priv->ip4_gw_route->ifindex) {
+       if (priv->ip4_gw_route) {
                nm_platform_ip4_route_delete (
                                priv->ip4_gw_route->ifindex,
                                priv->ip4_gw_route->network,

otherwise we get a segfault on the vpn-fails-to-connect case.

----------
Everything else seems to work OK, Wi-Fi  + DHCP, WWAN PPP, WWAN static, and VPN.  So once the above patch is added I think we're good to merge.
Comment 31 Dan Williams 2013-06-25 03:47:25 UTC
Also, rebase one more time before you push since ab8ca2dbe269952fcecef05c17b43b6bf038dc55 might conflict, but it'll be easy to resolve.
Comment 32 Pavel Simerda 2013-06-25 08:12:57 UTC
> Everything else seems to work OK, Wi-Fi  + DHCP, WWAN PPP, WWAN static, and
> VPN.  So once the above patch is added I think we're good to merge.

Thanks for the great review work. Fixed the above issue. Merged pavlix/platform-use. Keeping this ticket open for any non-blockers.
Comment 33 Pavel Simerda 2013-07-07 20:57:22 UTC
A couple of bugs resulting from pavlix/platform-use changes have been fixed. New issues should be filed as separate bug reports.