Bug 727382 – [review] th/bgo727382_platform_fix_addr_lifetime - fix invalid lifetime of cached platform addresses

After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.

Bug 727382 - [review] th/bgo727382_platform_fix_addr_lifetime - fix invalid lifetime of cached platform addresses


Summary:	[review] th/bgo727382_platform_fix_addr_lifetime - fix invalid lifetime of ca...


Status:	RESOLVED FIXED

Product:	NetworkManager
Classification:	Platform
Component:	general
Version:	git master
Hardware:	Other Linux

Importance:	Normal normal
Target Milestone:	---
Assigned To:	NetworkManager maintainer(s)
QA Contact:	NetworkManager maintainer(s)

URL:
Whiteboard:

Depends on:
Blocks:

Reported:	2014-03-31 11:38 UTC by Thomas Haller
Modified:	2014-06-24 19:06 UTC

See Also:
GNOME target:	---
GNOME version:	---

Attachments
no-timestamp.patch (9.01 KB, patch) 2014-05-01 22:17 UTC, Dan Williams	none	Details \| Review
[PATCH] do calculations in seconds, instead nsec (6.15 KB, patch) 2014-05-19 11:00 UTC, Thomas Haller	none	Details \| Review
const-timestamp.patch (4.48 KB, patch) 2014-06-05 18:29 UTC, Dan Williams	none	Details \| Review

Description Thomas Haller 2014-03-31 11:38:59 UTC

Please review branch th/bgoXYZ_platform_fix_addr_lifetime


Quoting the commit message:

    platform: fix wrong lifetime of addresses when accessing platform cache
    
    This is a hack to fix invalid lifetime (valid and prefered) of
    addresses. Platform currently caches libnl objects (struct rtnl_addr),
    which natively have relative lifetimes without start timestamp.
    
    As workaround, convert the relative timestamps to absolute ones before
    caching, and convert it back in init_address_ip4() and
    init_address_ip6().
    
    The real solution will be to refactor platform caching not to cache
    libnl objects, but NMPlatform objects instead. Then we have more control
    over the fields.

Comment 1 Thomas Haller 2014-04-02 15:29:35 UTC

Branch repushed after major rework.

Comment 2 Thomas Haller 2014-04-04 15:44:57 UTC

note that with commit 
> platform: fix missing changed signals for address lifetime update

we might get platform signals now, that we did not get before -- only because the lifetime changes.

I think this is correct, but also potentially dangerous that we don't end up in a loop to configure the same addresses over and over again.

Comment 3 Dan Williams 2014-04-05 03:21:24 UTC

> platform: fix preferred and valid lifetimes in platform cache

/* timestamp wraps every 497 days. Try to compensate for that.*/

That seems pretty low.  Shouldn't we just use less precision than nano-seconds?  With the lifetimes we're working with, a second here or there doesn't really mean much, so I'm wondering why we need so much precision here.  If it doesn't really matter, then having so much precision makes the code more complicated for little gain.

Comment 4 Thomas Haller 2014-04-16 10:43:03 UTC

(In reply to comment #3)
> > platform: fix preferred and valid lifetimes in platform cache
> 
> /* timestamp wraps every 497 days. Try to compensate for that.*/
> 
> That seems pretty low.  Shouldn't we just use less precision than nano-seconds?
>  With the lifetimes we're working with, a second here or there doesn't really
> mean much, so I'm wondering why we need so much precision here.  If it doesn't
> really matter, then having so much precision makes the code more complicated
> for little gain.

The timestamp as it comes from the kernel wraps every 497 days and the compensation is for that. An uptime of 16 months is not so unthinkable and without compensating all addresses will have invalid timestamps after 497 days (which can only be recovered with a reboot).

The timestamps come in at different scales and they have to be brought to the same scale. Using nsec/gint64 makes sense because it makes it easy to calculate them together (without worrying about overflow).
Converting the nsec back to sec happens at exactly places:

+    shift_s = (now - a_timestamp) / NM_UTILS_NS_PER_SECOND;

+    a_timestamp = (a_timestamp + (ts->monotonic_timestamp_ns - ts->monotonic_ns)) / NM_UTILS_NS_PER_SECOND;



Also note how the code tries to figure out the offset between CLOCK_MONOTONIC and get_monotonic_timestamp().

+         ts->monotonic_ns = _clock_gettime_ns ();
+         ts->monotonic_timestamp_ns = nm_utils_get_monotonic_timestamp_ns ();

This is shaky anyway (e.g. if the process sleeps between the two commands). By converting earlier to seconds you get easily errors of +-1 second. Which is huge, I mean, 1 second!!!



I rebased the branch on master and squashed all the fixups. If you care about the previous version, look at th/bgo727382_platform_fix_addr_lifetime-1
The remaining fixup! commit is new

Comment 5 Dan Winship 2014-04-17 16:38:15 UTC

> core: add nm_utils_get_monotonic_timestamp_ns() function

The _ns, _us, _ms, and _s functions all return the time from the *same* offset, right? (ie, get_monotonic_timestamp_us() == get_monotonic_timestamp_ns() % 1000, always?) Might be good to document that.



> platform: change address_to_string functions to show remaining lifetime/preferred times

>+		if (!(n = *now))
>+			*now = n = nm_utils_get_monotonic_timestamp_s ();

just set it from nm_platform_ip4_address_to_string()/nm_platform_ip6_address_to_string(). The functions are only used for debugging. It won't hurt anything if we make a syscall and don't end up using the result, and it makes the code cleaner.



> platform: fix preferred and valid lifetimes in platform cache

Why do we need the _normalize_timestamps() step? Doesn't the last_update_time field get cached with the rest of the nl_object, allowing us to figure out the absolute expiration time later on without doing anything at caching time?

Comment 6 Thomas Haller 2014-04-17 20:41:59 UTC

(In reply to comment #5)
> > core: add nm_utils_get_monotonic_timestamp_ns() function
> 
> The _ns, _us, _ms, and _s functions all return the time from the *same* offset,
> right? (ie, get_monotonic_timestamp_us() == get_monotonic_timestamp_ns() %
> 1000, always?) Might be good to document that.

yes, they do (as the name implies). I pushed a fixup! to mention this in the source code documentation.



> > platform: change address_to_string functions to show remaining lifetime/preferred times
> 
> >+		if (!(n = *now))
> >+			*now = n = nm_utils_get_monotonic_timestamp_s ();
> 
> just set it from
> nm_platform_ip4_address_to_string()/nm_platform_ip6_address_to_string(). The
> functions are only used for debugging. It won't hurt anything if we make a
> syscall and don't end up using the result, and it makes the code cleaner.

Yes, possible. The idea is indeed to save the syscall for the not uncommon case of permanent addresses. ... needless to say, I would do it this way because I think the complexity is "acceptable". Strong opinions?



> > platform: fix preferred and valid lifetimes in platform cache
> 
> Why do we need the _normalize_timestamps() step? Doesn't the last_update_time
> field get cached with the rest of the nl_object, allowing us to figure out the
> absolute expiration time later on without doing anything at caching time?

example:

a new address gets added, we get libnl-event:
  last-updated=200s,
  valid=50s,
  now-CLOCK_MONOTONIC=200s,
  now-monotonic-ts=100s,
  expires@monot-ts=150s (OK)


refetch the address with get_kernel_object(), 10 seconds later:
  last-updated=200s,
  valid=40s,
  now-CLOCK_MONOTONIC=210s,
  now-monotonic-ts=110s,
  expires@monot-ts=150s

At any later moment when we look into the cache, we don't have now-CLOCK_MONOTONIC anymore (because it does not get saved anywere). So we can no longer calculate the original time difference between CLOCK_MONOTONIC and monotonic-ts when the object was received.
_normalize_timestamps() shifts valid back to 50s. So that valid means "valid since last-updated" instead of "valid since now-CLOCK_MONOTONIC".



Actually, I noticed another bug now :(
the timestamp in last-updated is CLOCK_MONOTONIC, while monotonic_timestamp is CLOCK_BOOTTIME (+ some offset). If we send the computer to sleep, the offset between those two timestamps will no longer be the same as when the address was received and the timeout will be computed wrongly. A proper fix would be to save get_monotonic_timestamp() when we received the address. But where (without major changes)??

It could also be fixed by setting last-updated to get_monotonic_timestamp(), but that would change the meaning of last-updated (because it is now in a different time scale). This seems error prone.

This could be easily fixed, if we would not cache libnl objects but NMPlatformIPXAddresss objects. I have another branch where I want to change that, so I guess this current fix is good enough (also because during sleep the interface gets disconnected anyway and all the addresses become invalid).

Comment 7 Thomas Haller 2014-04-18 09:12:20 UTC

again rebased to master and repushed.

New commit is now:
  "core: add code comment to nm_utils_get_monotonic_timestamp_*s() functions"
and the fixup commits from before (slightly modified).

Comment 8 Dan Williams 2014-05-01 22:11:18 UTC

> platform: fix preferred and valid lifetimes in platform cache

'struct timestamp' in ip6_address_get_all() and ip4_address_get_all() isn't actually used anywhere in the function.  All other callers of init_ip4_address()/init_ip6_address() pass NULL, so I think we can just completely get rid of the 'struct timestamp' argument to all three of init_ip4_address(), init_ip6_address(), and _rtnladdr_get_timestamp() and simplify the code a little.  _rtnladdr_get_timestamp() would only need to use ts_local then.

I tested that out, and it saves " 1 file changed, 17 insertions(+), 29 deletions(-)" and reduces the diff quite a bit, especially if you collapse 'struct timestamp' into _rtnladdr_get_timestamp().

The rest of the patches look OK to me...

------

What I meant about precision earlier was that we're only storing our route expiry in seconds, so ultimately, the timestamps actually don't care about nanoseconds or microseconds or milliseconds.  And the reasons we're using timestamps, like emitting router solicitations or expiring addresses, also don't care about great precision at all.

So the main question is, what if we we just used milliseconds everywhere for the timestamp calculations in nm-linux-platform.c, like the kernel uses for the cstamp and tstamp?  We wouldn't need to care about wrapping at all if we used gint64 since no IP address will be valid for 100,000+ years.  Would this simplify the code?

Comment 9 Dan Williams 2014-05-01 22:17:35 UTC

Created attachment 275583 [details] [review]
no-timestamp.patch

Collapse 'struct timestamp' and localize it.

Comment 10 Thomas Haller 2014-05-02 11:31:51 UTC

(In reply to comment #8)
> > platform: fix preferred and valid lifetimes in platform cache
> 
> 'struct timestamp' in ip6_address_get_all() and ip4_address_get_all() isn't
> actually used anywhere in the function.  All other callers of
> init_ip4_address()/init_ip6_address() pass NULL, so I think we can just
> completely get rid of the 'struct timestamp' argument to all three of
> init_ip4_address(), init_ip6_address(), and _rtnladdr_get_timestamp() and
> simplify the code a little.  _rtnladdr_get_timestamp() would only need to use
> ts_local then.

The struct timestamp is to cache and reuse the timestamps.

We try to calculate the offset between CLOCK_MONOTONIC and get_monotonic_timestamp() by fetching both timestamps in short succession.

    ts->monotonic_ns = _clock_gettime_ns ();
    ts->monotonic_timestamp_ns = nm_utils_get_monotonic_timestamp_ns ();

This cannot be exact because between the two operations an arbitrary long time could have passed.

As a small mitigation, I pushed a fixup to average the timestamps [1]:
    ts->monotonic_ns = _clock_gettime_ns ();
    ts->monotonic_timestamp_ns = nm_utils_get_monotonic_timestamp_ns ();
    ts->monotonic_ns = (_clock_gettime_ns () + ts->monotonic_ns) / 2;


When getting all addresses, I would like to get the timestamps *once* so that we normalize all addresses using the identical time offset -- which might be slightly off, but at least all addresses have the same error.


> I tested that out, and it saves " 1 file changed, 17 insertions(+), 29
> deletions(-)" and reduces the diff quite a bit, especially if you collapse
> 'struct timestamp' into _rtnladdr_get_timestamp().

I pushed some fixups, first applying [1] on the old version, then your attached patch, then again the change [1].

Arguably, the call to clock_gettime() normally takes only 1/10,000,000 of a second (on my computer). So it doesn't matter that much. And probably the error will be small in general.

So, do you prefer your patch?


> What I meant about precision earlier was that we're only storing our route
> expiry in seconds, so ultimately, the timestamps actually don't care about
> nanoseconds or microseconds or milliseconds.  And the reasons we're using
> timestamps, like emitting router solicitations or expiring addresses, also
> don't care about great precision at all.
> 
> So the main question is, what if we we just used milliseconds everywhere for
> the timestamp calculations in nm-linux-platform.c, like the kernel uses for the
> cstamp and tstamp?  We wouldn't need to care about wrapping at all if we used
> gint64 since no IP address will be valid for 100,000+ years.  Would this
> simplify the code?

cstamp and tstamp is not even in milliseconds but 1/100th of a second. So, yes, we could convert all to milliseconds first, do the calculation and in the end convert to seconds.

Instead I convert first to nanoseconds, do the calculation, and convert back to seconds. I don't see how this is any more complicated then doing the calculation in milliseconds.

The code already does not care at all about wrapping of gint64 nanoseconds (294 years). In fact, the code everywhere assumes that no overflow will happen!!
But it must care about wrapping of tstamp (guint32 of 1/100th of a second, 497 days), otherwise after an uptime of 497 all timestamps will be completely off.



Also, with a previous fixup, I convert already earlier to seconds. It does so, in two places immediately after _timestamp_nl_to_ns().

Doing it even earlier (i.e. _clock_gettime_s() instead of _clock_gettime_ns()) would loose precision, because you could easily get +/-1 second errors.






Actually, it quite bothers me that _normalize_timestamps() looses already precision of up to one second. The problem is, that there is no rtnl_addr_set_last_update_time to normalize tstamp (1/100th sec precision), instead it normalizes lifetime/precision (sec precision).

The alternative now is only to store the times somewhere outside of the rtnl_addr object. But for now I would not do this. Instead, at a later point, cache the NMPlatform objects instead (so we have full control over the timestamps).

Comment 11 Dan Winship 2014-05-06 19:59:15 UTC

The underlying data (how long the DHCP server expects the address to last) only has second precision (and, in real-life usage, the administrator almost certainly only cares about minute precision, and usually only hour precision). We don't have to worry about being off by a few nanoseconds.

Comment 12 Thomas Haller 2014-05-07 09:08:50 UTC

(In reply to comment #11)
> The underlying data (how long the DHCP server expects the address to last) only
> has second precision (and, in real-life usage, the administrator almost
> certainly only cares about minute precision, and usually only hour precision).
> We don't have to worry about being off by a few nanoseconds.

Actually, with this patch it's not off by merely nanoseconds, but up to 1 second. This is not only about administrators requirements for lifetime but also about what NM and the kernel consider the expiry of an address. It has potential for subtle bugs if their understanding differs.

The code already looses precision at places where it's hard to avoid. At other places it's easy to make it exact and not risk any additional uncertainties.


Ack to the branch? :)

Comment 13 Dan Williams 2014-05-16 22:31:15 UTC

Ok, probably I'm just dense...

But we have to compensate for two things, correct?

a) the time lag between when the kernel sets 'tstamp' and when we receive and parse the netlink message.  If we don't do this, then the lifetime we assign will be too long.  But it's never going to be *that* long, and much much less than 1 second.  So we can compensate for this lag by subtracting one second from the valid/preferred lifetimes

b) the difference between the units the kernel uses for tstamp (CLOCK_MONOTONIC/jiffies) and the units NM uses for monotonic_timestamp_get() (which are based on CLOCK_BOOTTIME and then adjusted).  Same thing here, we don't need great precision, one second less lifetime doesn't matter at all if it makes the code simpler.

---

So we need to convert between the netlink units and the NM units when reading netlink messages.  Which these patches attempt to do.  But shouldn't setting the NMPlatformIPAddress timestamp/lifetime/preferred be as simple as:

1) read CLOCK_MONOTONIC

2) read our internal monotonic clock (monotonic_timestamp_get)

3) nm_clock_diff_ns = internal monotonic clock - CLOCK_MONOTONIC
     (for reason (b))

4) netlink_lag_ns = CLOCK_MONOTONIC - tstamp

5) addr->timestamp = internal monotonic clock (ie, "now") rounded down to the last full second

6) addr->lifetime = (valid_lft - (netlink_lag_ns + nm_clock_diff_ns / CENTISECONDS_PER_NS)) / 100

The integer math already rounds down for us, so we don't over-estimate the time.  Plus, the small values (netlink_lag_ns and nm_clock_diff_ns) won't ever overflow either.  Maximum precision we loose is about a second, right?  I don't really see why we need to convert the netlink timestamps to nano-seconds and deal with any kind of wrapping, because if we keep them in centiseconds they won't ever wrap.

Is there something I'm missing?  I guess I just don't think the complexity of converting the netlink timestamps from centiseconds -> nanoseconds, then doing the conversion, is worth it.  I think we'd be fine with somewhat simpler code that just converted CLOCK_MONOTONIC and monotonic_timestamp_get() into centiseconds instead, since at most we'd loose 1/100th of a second precision in that calculation, right?

Comment 14 Dan Williams 2014-05-16 22:33:23 UTC

(but for the record, I can't find anything wrong with the current state of the branch.  The problem I have is that it takes a lot of time when reviewing to figure out what units each timestamp is in, when they need to be converted between, why rounding happens, why wrapping compensation happens, etc...)

Comment 15 Thomas Haller 2014-05-19 11:00:50 UTC

Created attachment 276761 [details] [review]
[PATCH] do calculations in seconds, instead nsec

(In reply to comment #13)
> Ok, probably I'm just dense...
> 
> But we have to compensate for two things, correct?

I don't think so.


> a) the time lag between when the kernel sets 'tstamp' and when we receive and
> parse the netlink message.  If we don't do this, then the lifetime we assign
> will be too long.  But it's never going to be *that* long, and much much less
> than 1 second.  So we can compensate for this lag by subtracting one second
> from the valid/preferred lifetimes

Not really. 

A) See again comment 6. The timestamp 'tstamp' is not set by the kernel to *now*, but when the address was last modified. On the other hand, the lifetime counts starting from *now*, but this anchor is nowhere stored inside the netlink message. We only know, that *now* was CLOCK_MONOTONIC when the kernel created the message and "when the kernel sets 'tstamp'" ("t0").

This is fixed by _normalize_timestamps(), which extends lifetime, to be anchored at last_update_time(), instead of "t0". Here we make an error of up to +/- 1 second. This error could be avoided, if we would not cache libnl objects, but NMPlatformObjects instead (because then we can just store it exactly, and do all the normalizations A,B,C in one step early on).

What you called a) was the difference between "t0" and "t1" (i.e. the moment, "when we receive and parse the netlink message"). In general we would assume that the difference is small, and we can ignore the error. But just to see again, how important it is not to block the main loop. E.g. when we kill the dhclient child process, we block the main loop for up to 2 seconds. So, we might have significant bigger errors here (meaning, NM thinks the lifetime is 2 seconds *longer* then kernel). [Note to self: avoid blocking the main loop]


B) we must compensate for int32 overflow of tstamp. Done by _timestamp_nl_to_ns(). Without this, after an uptime of 497 days, all timestamps would be completely off. There is no additional error here, unless we guess wrong during the compensation (which I think cannot happen(?)).
 

> b) the difference between the units the kernel uses for tstamp
> (CLOCK_MONOTONIC/jiffies) and the units NM uses for monotonic_timestamp_get()
> (which are based on CLOCK_BOOTTIME and then adjusted).  Same thing here, we
> don't need great precision, one second less lifetime doesn't matter at all if
> it makes the code simpler.

C) Right. Done by _rtnladdr_get_timestamp(). There is no error here, unless you insist on doing the calculation in second (see attached patch).



> So we need to convert between the netlink units and the NM units when reading
> netlink messages.  Which these patches attempt to do.

What is the moment "when reading netlink messages"?

There is the moment when we receive the message via netlink and put it in the cache (get_kernel_object()). Here we have to account for A) and B).

Then there is the moment, when we access the netlink object from the cache and convert it to NMPlatformIPAddress (init_ip4_address). Here we have to account for B) and C).


> But shouldn't setting
> the NMPlatformIPAddress timestamp/lifetime/preferred be as simple as:

You seem to only account for B). Which is done in _rtnladdr_get_timestamp()


> 1) read CLOCK_MONOTONIC
> 
> 2) read our internal monotonic clock (monotonic_timestamp_get)

See src/platform/nm-linux-platform.c:976


> 3) nm_clock_diff_ns = internal monotonic clock - CLOCK_MONOTONIC
>      (for reason (b))

internal monotonic clock needs compensate for reason B).
Done in src/platform/nm-linux-platform.c:981


> 4) netlink_lag_ns = CLOCK_MONOTONIC - tstamp
> 
> 5) addr->timestamp = internal monotonic clock (ie, "now") rounded down to the
> last full second
> 
> 6) addr->lifetime = (valid_lft - (netlink_lag_ns + nm_clock_diff_ns /
> CENTISECONDS_PER_NS)) / 100

you suggest to adjust the lifetime instead of the timestamp. Possible, instead the code adjusts timestamp instead:

  addr->timestamp = convert tstamp to get_monotonic_timestamp_s().

This has the advantage to yield identical NMPlatformAddress instances when re-reading the same address later. (modulo the unlikely error of +- 1 second in line src/platform/nm-linux-platform.c:978)

Also, because addr->timestamp should always be positive, we might have to adjust the lifetimes too. Which is done in line 986 to 996).


> The integer math already rounds down for us, so we don't over-estimate the
> time.  Plus, the small values (netlink_lag_ns and nm_clock_diff_ns) won't ever
> overflow either.  Maximum precision we loose is about a second, right?  I don't
> really see why we need to convert the netlink timestamps to nano-seconds and
> deal with any kind of wrapping, because if we keep them in centiseconds they
> won't ever wrap.

We don't deal at all with the wrapping of int64-nanoseconds. We must deal with B) -- the wrapping of uint32-centiseconds. 


> Is there something I'm missing?  I guess I just don't think the complexity of
> converting the netlink timestamps from centiseconds -> nanoseconds, then doing
> the conversion, is worth it.  I think we'd be fine with somewhat simpler code
> that just converted CLOCK_MONOTONIC and monotonic_timestamp_get() into
> centiseconds instead, since at most we'd loose 1/100th of a second precision in
> that calculation, right?

Do you suggest doing the calculations in $COMMONSCALE=centiseconds? What would that simplify? It is not simpler to convert get_monotonic_timestamp_ns() and clock_gettime to centiseconds, instead of converting last_update_timestamp to nsec. The only thing that is in centi-seconds is last_update_timestamp.


Or do you suggest to have $COMMONSCALE=second [[see attached patch]]?

It does not save a single line (the patch has actually one line less, only because it removes the mitigation for line 978, which we could also do for nsec if we choose too -- but why would we?).

The only thing that is simpler is that two places after _clock_gettime_ns() do everything in seconds (no nsec) -- I claim, that having _clock_gettime_s() instead of _clock_gettime_ns() is not simpler, because both functions entirely operate in one scale. IMO nsec is simpler, because we have so much precision/range that rounding/wrapping errors are off the table.

However, $COMMONSCALE=seconds adds an additional rounding error of +/- 1 second by converting too early from centiseconds to seconds.

Comment 16 Thomas Haller 2014-05-19 11:08:16 UTC

(In reply to comment #15)

> You seem to only account for B). Which is done in _rtnladdr_get_timestamp()

I meant "C)".
... You seem to only account for C). Which is done in _rtnladdr_get_timestamp()
 

> Done in src/platform/nm-linux-platform.c:981

patch and line numbers are based on commit 7ad78ebfaf3d25769929c78d67abbd01016a229a


> This has the advantage to yield identical NMPlatformAddress instances when
> re-reading the same address later. (modulo the unlikely error of +- 1 second in
> line src/platform/nm-linux-platform.c:978)

and a potential error during suspend, where CLOCK_MONOTONIC drifts from get_monotonic_timestamp().

Comment 17 Thomas Haller 2014-05-19 11:13:00 UTC

(In reply to comment #9)
> Created an attachment (id=275583) [details] [review]
> no-timestamp.patch
> 
> Collapse 'struct timestamp' and localize it.

As I said in comment 10, the timestamp is cached, so that all addresses use exactly the same offset in line src/platform/nm-linux-platform.c:978

I still would drop the last two fixups
9e278c343a9281e2f4a3cb30a0105f129eadba88
7ad78ebfaf3d25769929c78d67abbd01016a229a
?

Comment 18 Thomas Haller 2014-05-23 08:52:24 UTC

(In reply to comment #15)
> What you called a) was the difference between "t0" and "t1" (i.e. the moment,
> "when we receive and parse the netlink message"). In general we would assume
> that the difference is small, and we can ignore the error. But just to see
> again, how important it is not to block the main loop. E.g. when we kill the
> dhclient child process, we block the main loop for up to 2 seconds. So, we
> might have significant bigger errors here (meaning, NM thinks the lifetime is 2
> seconds *longer* then kernel). [Note to self: avoid blocking the main loop]

Ah no. We only get netlink objects via get_kernel_object() which requests the object synchronously via netlink. During that, NM cannot block on something else, so no additional delay.

The objects from event_notification() might have this delay when we block the main thread on something else. But these objects are not cached.

So, this unknown difference (that we cannot compensate for) should really be very small.

Comment 19 Jiri Klimes 2014-06-04 11:08:35 UTC

> platform: change address_to_string functions to show remaining lifetime/preferred times
Ternary operators are too long. Consider to break them into more lines and use parenthesis to make it more readable.

Also, it is quite hard to go through the bunch of fixups. It may be better to squash the commits instantly.

There are some conflicts with master branch, so be careful when merging.

Comment 20 Thomas Haller 2014-06-04 12:42:47 UTC

(In reply to comment #19)
> > platform: change address_to_string functions to show remaining lifetime/preferred times
> Ternary operators are too long. Consider to break them into more lines and use
> parenthesis to make it more readable.

done


> Also, it is quite hard to go through the bunch of fixups. It may be better to
> squash the commits instantly.
> 
> There are some conflicts with master branch, so be careful when merging.

Rebased on master, and sorted the fixup-commits directly after their destination.


I still would drop the fixup! commits that get rid of "struct timestamp" because I dislike the added uncertainty of ip4_address_get_all().


Also, note that
  "platform: fix missing changed signals for address lifetime update"
might be wrong. Could it lead to a update-reconfigure-cycle, where NM thinks some addresses changes and it syncs anew? My feeling is, that the commit does the right thing though. Opinions?

Comment 21 Dan Williams 2014-06-05 18:29:24 UTC

Created attachment 277968 [details] [review]
const-timestamp.patch

I agree with you now that having the 'struct timestamp' is worthwhile, to get a consistent timestamp for calls like *_get_all().  I believe the original source of my confusion, and the thing I don't like, is that passing the 'struct timestamp' up to _rtnladdr_get_timestamp() has two behaviors: (1) if it's  not already set, then set it from the local structure, and (2) if it's already set, then use it.

I think that's somewhat confusing, since if you're lower down in the callstack it's not easily apparent what's happening.  I'd much rather that _rtnladdr_set_timestamp() didn't touch the passed in 'ts', but if was NULL, then create a local timestamp.

How about this patch, which should apply on top of the branch, if you drop my fixup! commit and your fixup to the fixup?  It's +5LoC but I think it's worth it because it's straightforward what's happening with 'ts' and what happens when 'ts' is *not* given.

Comment 22 Thomas Haller 2014-06-06 13:57:59 UTC

Rebases new version:

>> platform: fix preferred and valid lifetimes in platform cache

Reverted this commit (it's still on the branch, but will be dropped on final rebas).

Replaced by:
>>  platform: fix preferred and valid lifetimes for addresses from netlink/kernel

Indeed this version is much more concise.


Major changes are also:
>> platform: raise address changed signals for lifetime update

Comment 23 Dan Winship 2014-06-06 15:27:52 UTC

> platform: fix preferred and valid lifetimes for addresses from netlink/kernel

>+ * Immediately at the time of modification, #NOW and @last_update_time is equal, so

"modification"? I think this is supposed to say "at the time the kernel sends the message" or something?

(Also, if you're going to edit this commit anyway:

>+	return MIN (t, NM_PLATFORM_LIFETIME_PERMANENT -1);

space after the "-")

>+	 * Remember: timestamp has no meaning, beyond anchoring the relative lifetimes
>+	 * at some point in time. We choose this point to be "1 second".

Probably _lifetime_summary_to_string() should special-case timestamp==1 then?

And maybe we should rename it to "timebase" or something?

> 	guint32 timestamp;  /* nm_utils_get_monotonic_timestamp_s() */ \
>-	guint32 lifetime;   /* seconds */ \
>-	guint32 preferred;  /* seconds */ \
>+	guint32 lifetime;   /* seconds since timestamp */ \
>+	guint32 preferred;  /* seconds since timestamp */ \

the comment on timestamp is now wrong



> platform: fix handling of preferred address time

as discussed on irc, this needs to explain the user-visible impact of the bug. (That addresses could be removed from the interface too soon or too late.)

>+_readjust_relative_time (guint32 timestamp, guint32 now, gboolean has_lifetime, guint32 reltime)

needs a comment explaining what it does / when you would need to use it



> platform: refactor setting the source of platform objects to NM_PLATFORM_SOURCE_KERNEL

the init_ip4_route() and init_ip6_route() parts of this are now no-ops after the merge of danw/ipv6-route-match-alt; you set the source to KERNEL, but then they get re-set from the rtnl_route a few lines further down.

However, the corresponding changes to ip4_route_get_all() and ip6_route_get_all() clearly should have been made as part of ipv6-route-match-alt... ugh. That's going to require some testing. It probably doesn't belong in this commit now though.

Comment 24 Dan Williams 2014-06-06 16:38:56 UTC

> platform: fix handling of preferred address time

Let's add some stuff to the commit message (and a bit of code doc to nm_platform_ip[4|6]_address_sync()) about the problem the original code fixed..

Perhaps a variation on "This tries to ensure that the address preferred/valid lifetime is updated in the kernel if it changed due to an RA." and in the commit message state why it was wrong before or what prompted this cleanup.  That wasn't 100% clear to me just reading the code/commit log.  The original code should have had some comments about this, but better late than never.

Comment 25 Dan Williams 2014-06-06 16:40:05 UTC

(In reply to comment #23)
> > platform: refactor setting the source of platform objects to NM_PLATFORM_SOURCE_KERNEL
> 
> the init_ip4_route() and init_ip6_route() parts of this are now no-ops after
> the merge of danw/ipv6-route-match-alt; you set the source to KERNEL, but then
> they get re-set from the rtnl_route a few lines further down.
> 
> However, the corresponding changes to ip4_route_get_all() and
> ip6_route_get_all() clearly should have been made as part of
> ipv6-route-match-alt... ugh. That's going to require some testing. It probably
> doesn't belong in this commit now though.

To clarify what danw means...

danw: drop the route-related parts of that patch, but keep the address-related parts
danw: and then later, we can fix the fact that route_get_all() is claiming SOURCE_KERNEL for every route

Comment 26 Thomas Haller 2014-06-06 17:26:13 UTC

(In reply to comment #23)
> > platform: fix preferred and valid lifetimes for addresses from netlink/kernel
> 
> >+	 * Remember: timestamp has no meaning, beyond anchoring the relative lifetimes
> >+	 * at some point in time. We choose this point to be "1 second".
> 
> Probably _lifetime_summary_to_string() should special-case timestamp==1 then?

How? The change to address_to_string() already prints the remaining valid/preferred times from now. This is IMO more useful in most cases then the previous version (because you would look at the logfile and see timestamp, but not knowing what *now* at that moment was).

Additionally it prints now all the 4 raw values "now-timestamp[preferred,valid]". This is not nice to look at, but to show you what is really inside the struct -- for debugging.

e.g.
192.168.2.1/24 lft 2000sec pref 1000sec lifetime 3600-1[4601,5601] dev nm-test-device src kernel

which format would you prefer instead?



> And maybe we should rename it to "timebase" or something?
> 
> > 	guint32 timestamp;  /* nm_utils_get_monotonic_timestamp_s() */ \
> >-	guint32 lifetime;   /* seconds */ \
> >-	guint32 preferred;  /* seconds */ \
> >+	guint32 lifetime;   /* seconds since timestamp */ \
> >+	guint32 preferred;  /* seconds since timestamp */ \
> 
> the comment on timestamp is now wrong

Comment extended.


> > platform: fix handling of preferred address time
> 
> as discussed on irc, this needs to explain the user-visible impact of the bug.
> (That addresses could be removed from the interface too soon or too late.)
> 
> >+_readjust_relative_time (guint32 timestamp, guint32 now, gboolean has_lifetime, guint32 reltime)
> 
> needs a comment explaining what it does / when you would need to use it

I split this commit, in one that fixes the actual bug. The remaining parts are now completely rewritten. See news commit:

>> platform: fix setting preferred time for address
>> platform: refactor calculating valid/preferred times when adding address



 
> > platform: refactor setting the source of platform objects to NM_PLATFORM_SOURCE_KERNEL
> 
> the init_ip4_route() and init_ip6_route() parts of this are now no-ops after
> the merge of danw/ipv6-route-match-alt; you set the source to KERNEL, but then
> they get re-set from the rtnl_route a few lines further down.
> 
> However, the corresponding changes to ip4_route_get_all() and
> ip6_route_get_all() clearly should have been made as part of
> ipv6-route-match-alt... ugh. That's going to require some testing. It probably
> doesn't belong in this commit now though.



Repushed. Previous fixup!s all squashed (there is a new one though)

Comment 27 Thomas Haller 2014-06-06 17:28:53 UTC

(In reply to comment #24)
> > platform: fix handling of preferred address time
> 
> Let's add some stuff to the commit message (and a bit of code doc to
> nm_platform_ip[4|6]_address_sync()) about the problem the original code fixed..
> 
> Perhaps a variation on "This tries to ensure that the address preferred/valid
> lifetime is updated in the kernel if it changed due to an RA." and in the
> commit message state why it was wrong before or what prompted this cleanup. 
> That wasn't 100% clear to me just reading the code/commit log.  The original
> code should have had some comments about this, but better late than never.

I split this part.

I also removed the previous check for expired addresses in array_contains_ip6_address(), because in the worst case we delete an address that are already expired -- in the best case, we *think* the address is expired, but it actually isn't.

That left only the real bug fix and a second commit to refactor the calculation of the durations.

Comment 28 Dan Winship 2014-06-06 17:35:14 UTC

(In reply to comment #26)
> Additionally it prints now all the 4 raw values
> "now-timestamp[preferred,valid]". This is not nice to look at, but to show you
> what is really inside the struct -- for debugging.
> 
> e.g.
> 192.168.2.1/24 lft 2000sec pref 1000sec lifetime 3600-1[4601,5601] dev
> nm-test-device src kernel
> 
> which format would you prefer instead?

The "1" is essentially just noise. It could omit it. It's not a necessity, but it seems like it would be nice.

> I split this commit, in one that fixes the actual bug. The remaining parts are
> now completely rewritten. See news commit:
> 
> >> platform: fix setting preferred time for address
> >> platform: refactor calculating valid/preferred times when adding address

Contents of the commits are good, although the commit message still doesn't explain what the actual effect of the bug is.

Comment 29 Dan Williams 2014-06-06 17:50:10 UTC

I'm fine with the branch, except for some clarification in the commit messages like danw requests.  After that, squash and merge it!

Comment 30 Thomas Haller 2014-06-06 18:07:11 UTC

finally, I moved the change to the to_string() functions later and reused _rebase_relative_time_on_now().

The advantage being, shared code and using exactly the same function to convert the time duration as done in nm_platform_ip[46]_address_sync()



Merged to master as:
http://cgit.freedesktop.org/NetworkManager/NetworkManager/commit/?id=fcc34ef77bc98239392a74112376ca620cf3e445

Comment 31 Thomas Haller 2014-06-09 09:36:15 UTC

Reopening.

There are more fixes on branch:

th/bgo727382_platform_fix_addr_lifetime

Comment 32 Dan Williams 2014-06-13 21:08:33 UTC

> platform: fix off-by-two error converting lifetimes in _init_ip_address_lifetime()

The comment in _get_remaining_time() refers to @lifetime which is not an argument in this function...  Should that piece of the comment to into _init_ip_address_lifetime() ?


> platform: set timestamp in platform addresses to last_update_time()

In _rtnl_addr_last_update_time_to_nm() there's a variable "now_rt"; could we change that to "now_nl" to be consistent with other stuff that references netlink like _timestamp_nl_to_ms()?


Still thinking about a few of the patches...

Comment 33 Dan Williams 2014-06-18 16:07:00 UTC

I have no problem with the current fixes.

But I'm also going to do a quick & dirty patch to see if all this could be radically simpler if we stored preferred/valid in NMPlatformIpAddress as absolute values that have no relation to the timestamp.  I'm finding it hard to keep it straight when the timestamp is used and when it's not, since sometimes it's factored into decisions (expiry_cmp() and the DHCP code) and other times it's not.  I wonder if it would a lot simpler to just keep timestamp opaque with no relation whatsoever to valid/preferred.

Comment 34 Thomas Haller 2014-06-18 17:44:19 UTC

(In reply to comment #33)
> I have no problem with the current fixes.
> 
> But I'm also going to do a quick & dirty patch to see if all this could be
> radically simpler if we stored preferred/valid in NMPlatformIpAddress as
> absolute values that have no relation to the timestamp.  I'm finding it hard to
> keep it straight when the timestamp is used and when it's not, since sometimes
> it's factored into decisions (expiry_cmp() and the DHCP code) and other times
> it's not.  I wonder if it would a lot simpler to just keep timestamp opaque
> with no relation whatsoever to valid/preferred.


Pushed two fixup commits. One with a better explanation what @timestamp means.

I don't think it's a good idea dropping the timestamp. For example a DHCP lease has this notion of "valid from timestamp when received". This is expressed by the data structure and with this branch, you can see the original lease duration in the logfile.

Anyway, the current branch fixes (to my knowledge) all issues with the timestamps. I think it should be merged. If it is decided to refactor the timestamps, we can do that in another branch, because it seems like a bigger work.

Comment 35 Dan Williams 2014-06-19 19:36:25 UTC

Hmm, I can't see those fixup commits anywhere?  Forget to push maybe?

Comment 36 Dan Williams 2014-06-19 20:12:49 UTC

(In reply to comment #34)
> (In reply to comment #33)
> > I have no problem with the current fixes.
> > 
> > But I'm also going to do a quick & dirty patch to see if all this could be
> > radically simpler if we stored preferred/valid in NMPlatformIpAddress as
> > absolute values that have no relation to the timestamp.  I'm finding it hard to
> > keep it straight when the timestamp is used and when it's not, since sometimes
> > it's factored into decisions (expiry_cmp() and the DHCP code) and other times
> > it's not.  I wonder if it would a lot simpler to just keep timestamp opaque
> > with no relation whatsoever to valid/preferred.
> 
> 
> Pushed two fixup commits. One with a better explanation what @timestamp means.
> 
> I don't think it's a good idea dropping the timestamp. For example a DHCP lease
> has this notion of "valid from timestamp when received". This is expressed by
> the data structure and with this branch, you can see the original lease
> duration in the logfile.

I tried prototyping some stuff yesterday and my approach was to make 'lifetime' (renamed to valid in my branch) and 'preferred' absolute based on internal NM monotonic timestamps.  Since the absolute times are in seconds we don't have a problem with wrapping at all.

This does appear to remove a bunch of code.  For the DHCP case this isn't a problem because the timestamp is set to nm_utils_get_monotonic_timestamp_s() and the valid/preferred are set to nm_utils_get_monotonic_timestamp_s() + lifetime from the DHCP lease.

Comment 37 Dan Williams 2014-06-19 20:13:03 UTC

I'm happy with the fixups of the comments.

Comment 38 Thomas Haller 2014-06-19 21:22:45 UTC

Branch merged as http://cgit.freedesktop.org/NetworkManager/NetworkManager/commit/?id=7475d8c28d43bc85231f89d0537fafe6f8c3dcaf