Bug 731570 – [review] two fixes for master/slave typed connections (th/bgo731570_master_slave)

After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.

Bug 731570 - [review] two fixes for master/slave typed connections (th/bgo731570_master_slave)


Summary:	[review] two fixes for master/slave typed connections (th/bgo731570_master_sl...


Status:	RESOLVED FIXED

Product:	NetworkManager
Classification:	Platform
Component:	general
Version:	git master
Hardware:	Other Linux

Importance:	Normal normal
Target Milestone:	---
Assigned To:	NetworkManager maintainer(s)
QA Contact:	NetworkManager maintainer(s)

URL:
Whiteboard:

Depends on:
Blocks:

Reported:	2014-06-12 12:35 UTC by Thomas Haller
Modified:	2014-06-18 15:28 UTC

See Also:
GNOME target:	---
GNOME version:	---

Description Thomas Haller 2014-06-12 12:35:09 UTC

The branch for review fixes two bugs and does some refactoring in nm-device.

(More details below)

Comment 1 Thomas Haller 2014-06-12 12:44:08 UTC

Please review branch th/bgo731570_master_slave.

Fixes two bugs:



>> core: fix activation of slave when master is not active, but device exists

See commit message. I think that this was broken for a long time.



It might have some unexpected behaviour:

- Suppose NM is not running.
  Create the master device outside of nm.
  Have (only) a master and slave connection, with the slave being autoconnect

When you start NM, it will autoconnect the slave on the device, which brings up the master as well -- effectively modifying the upped master. Does that disturb the existing device?. Before that would have failed (with an assertion).


>> core: preserve reason on device deactivation (fix tearing down slave when deactivating master)

See commit message. But maybe we should also revise the nm_device_slave_notify_release(), which checks 
  "reason != NM_DEVICE_STATE_REASON_NONE"
Why this check, is it correct? Or could we add a comment explaining why?



>> fixup! device: refactor by combining dispatcher callback functions
>> device: refactor by combining dispatcher callback functions
>> core/trivial: move function up

some optional refactoring. I would do this. I would also apply the last fixup!.
Opinions?

Comment 2 Dan Winship 2014-06-12 14:19:37 UTC

> dispatcher: improve logging

>+		else if (!strncmp (script, NMD_SCRIPT_DIR "/", STRLEN (NMD_SCRIPT_DIR "/")))
>+			script += STRLEN (NMD_SCRIPT_DIR "/"),

This should just use strrchr(), since now the script might actually be in a subdirectory of NMD_SCRIPT_DIR, so stripping off NMD_SCRIPT_DIR+"/" isn't always enough.


> device: refactor by combining dispatcher callback functions

>+	NMDeviceState   dispatcher_aspired_state;

How about "post_dispatcher_state"? (Or if not that, then "desired" rather than "aspired". "Aspired" sounds weird.)


(In reply to comment #1)
> See commit message. But maybe we should also revise the
> nm_device_slave_notify_release(), which checks 
>   "reason != NM_DEVICE_STATE_REASON_NONE"
> Why this check, is it correct? Or could we add a comment explaining why?

It gets called with REASON_NONE in the case where NM observed someone else releasing the slave, and we just need to update our own internal bookkeeping, rather than actually modifying things. Maybe we should add a new explicit NMDeviceStateReason for that instead...

Comment 3 Dan Williams 2014-06-12 21:09:06 UTC

> core: fix activation of slave when master is not active, but device exists

All other callsites that return NULL also set 'error', but the new g_return_val_if_fail() won't set error.  Should it do that?


> core: preserve reason on device deactivation (fix tearing down slave when deactivating master)

Instead of a couple private data members what about a small struct that gets passed to the callback?  That way we don't have to worry about clearing the values if the callback gets cancelled.  The values are pretty local to the dispatcher stuff anyway, so I think that makes more sense than private data members.  (yeah, would be nice if C supported tuples, but oh well...)

(In reply to comment #2)
> > See commit message. But maybe we should also revise the
> > nm_device_slave_notify_release(), which checks 
> >   "reason != NM_DEVICE_STATE_REASON_NONE"
> > Why this check, is it correct? Or could we add a comment explaining why?
> 
> It gets called with REASON_NONE in the case where NM observed someone else
> releasing the slave, and we just need to update our own internal bookkeeping,
> rather than actually modifying things. Maybe we should add a new explicit
> NMDeviceStateReason for that instead...

Yeah, a new reason for that seems cleaner to me.

Comment 4 Thomas Haller 2014-06-13 11:36:13 UTC

(In reply to comment #2)
> > dispatcher: improve logging
> 
> >+		else if (!strncmp (script, NMD_SCRIPT_DIR "/", STRLEN (NMD_SCRIPT_DIR "/")))
> >+			script += STRLEN (NMD_SCRIPT_DIR "/"),
> 
> This should just use strrchr(), since now the script might actually be in a
> subdirectory of NMD_SCRIPT_DIR, so stripping off NMD_SCRIPT_DIR+"/" isn't
> always enough.

I reworked the last commits. Now it will strip all leading directories (leaving only the script name). ... except(!) if the directories are unexpected. This adds some lines of code, but it helps to validate the returned parameters.

I also added commit
  "dispatcher: save callouts for empty script directory based on dispatcher action type"



> > device: refactor by combining dispatcher callback functions
> 
> >+	NMDeviceState   dispatcher_aspired_state;
> 
> How about "post_dispatcher_state"? (Or if not that, then "desired" rather than
> "aspired". "Aspired" sounds weird.)

Now: dispatcher_post_state/dispatcher.post_state.


> (In reply to comment #1)
> > See commit message. But maybe we should also revise the
> > nm_device_slave_notify_release(), which checks 
> >   "reason != NM_DEVICE_STATE_REASON_NONE"
> > Why this check, is it correct? Or could we add a comment explaining why?
> 
> It gets called with REASON_NONE in the case where NM observed someone else
> releasing the slave, and we just need to update our own internal bookkeeping,
> rather than actually modifying things. Maybe we should add a new explicit
> NMDeviceStateReason for that instead...

I did not introduce another reason. Instead I added a fixup to g_warn in such a case, and force the reason to not NONE. I think this is useful, because it would have helped to catch the previous bug.



(In reply to comment #3)
> > core: fix activation of slave when master is not active, but device exists
> 
> All other callsites that return NULL also set 'error', but the new
> g_return_val_if_fail() won't set error.  Should it do that?

Maybe. But note, that the first use actually replaces an "assert"... so previously the program even core dumped. You are right about the second, new g_return_val_if_fail().


Our policies for handling g_return* are not consistent anyway:
  - should the program behave similar with G_DISABLE_CHECKS or do we allow to 
    crash then?
  - when g_return_*() hits, should we restore a valid state? (what you are 
    suggesting here with setting @error).

I think it is Ok to be inconsistent, depending on the effort and likelihood of assertion-failure.

The only requirement I have is that a failed g_return/g_warn is *really* a bug and should not ever be anticipated by the programmer. 

So, in general, I say if a g_return/g_warn ever fails, all bets are off... but we could be more graceful in this case (which we often try to be...). Should we, in this case??







> > core: preserve reason on device deactivation (fix tearing down slave when deactivating master)
> 
> Instead of a couple private data members what about a small struct that gets
> passed to the callback?  That way we don't have to worry about clearing the
> values if the callback gets cancelled.  The values are pretty local to the
> dispatcher stuff anyway, so I think that makes more sense than private data
> members.  (yeah, would be nice if C supported tuples, but oh well...)

Hm. Possible. But note that we also cancel the dispatcher callbacks. This makes it more difficult to cleanup the user_data: we would need a reference to the user_data to free it ourselves.


Btw. do you like the fixup! to create a "struct {} dispatcher;" member? I think it neatly binds the parameters together, so it's already relatively clear(??).



> (In reply to comment #2)
> > > See commit message. But maybe we should also revise the
> > > nm_device_slave_notify_release(), which checks 
> > >   "reason != NM_DEVICE_STATE_REASON_NONE"
> > > Why this check, is it correct? Or could we add a comment explaining why?
> > 
> > It gets called with REASON_NONE in the case where NM observed someone else
> > releasing the slave, and we just need to update our own internal bookkeeping,
> > rather than actually modifying things. Maybe we should add a new explicit
> > NMDeviceStateReason for that instead...
> 
> Yeah, a new reason for that seems cleaner to me.

I solved it differently. See new fixup.

Comment 5 Dan Winship 2014-06-13 14:48:27 UTC

(In reply to comment #4)
> > All other callsites that return NULL also set 'error', but the new
> > g_return_val_if_fail() won't set error.  Should it do that?
> 
> Maybe. But note, that the first use actually replaces an "assert"... so
> previously the program even core dumped. You are right about the second, new
> g_return_val_if_fail().
> 
> 
> Our policies for handling g_return* are not consistent anyway:
>   - should the program behave similar with G_DISABLE_CHECKS or do we allow to 
>     crash then?

If a g_return_* triggers, it is always supposed to mean a bug in the program, so if you compile with G_DISABLE_CHECKS, anything could happen.

>   - when g_return_*() hits, should we restore a valid state? (what you are 
>     suggesting here with setting @error).

The standard rule is that you don't set GErrors on return-if-fail. (And also, for async functions, you don't bother triggering the async callback on return-if-fail.) The goal is "try to let the program limp along a little bit longer before crashing (so as to hopefully not lose user data, etc)", not "try to make things work perfectly anyway despite the bug".

Comment 6 Dan Williams 2014-06-13 14:56:53 UTC

(In reply to comment #4)
> (In reply to comment #3)
> > > core: fix activation of slave when master is not active, but device exists
> > 
> > All other callsites that return NULL also set 'error', but the new
> > g_return_val_if_fail() won't set error.  Should it do that?
> 
> Maybe. But note, that the first use actually replaces an "assert"... so
> previously the program even core dumped. You are right about the second, new
> g_return_val_if_fail().
> 
> 
> Our policies for handling g_return* are not consistent anyway:
>   - should the program behave similar with G_DISABLE_CHECKS or do we allow to 
>     crash then?
>   - when g_return_*() hits, should we restore a valid state? (what you are 
>     suggesting here with setting @error).
> 
> I think it is Ok to be inconsistent, depending on the effort and likelihood of
> assertion-failure.
> 
> The only requirement I have is that a failed g_return/g_warn is *really* a bug
> and should not ever be anticipated by the programmer. 
> 
> So, in general, I say if a g_return/g_warn ever fails, all bets are off... but
> we could be more graceful in this case (which we often try to be...). Should
> we, in this case??

Note that a lot of the code that calls ensure_master_active_connection() also does this:

			if (!master_ac) {
				if (error)
					g_assert (*error);
				return FALSE;
			}

so if we hit that g_return_val_if_fail () (which you're right, should log) we'll just crash anyway in the callers of that function.  I'm not sure that's the right behavior here though, since there might be cases where legitimately cannot find the master.  I think you're right, the first one is fine.

But I think the second should actually be a real g_assert(), because if the device is in ACTIVATING or ACTIVATED states, we really really expect it to have a connection, and if it doesn't, the world is really really wrong.  I'm not sure pushing the assert() to the caller of ensure_master_active_connection() is the best thing.  Maybe we don't even need the return_if_fail/assert here at all?

> > > core: preserve reason on device deactivation (fix tearing down slave when deactivating master)
> > 
> > Instead of a couple private data members what about a small struct that gets
> > passed to the callback?  That way we don't have to worry about clearing the
> > values if the callback gets cancelled.  The values are pretty local to the
> > dispatcher stuff anyway, so I think that makes more sense than private data
> > members.  (yeah, would be nice if C supported tuples, but oh well...)
> 
> Hm. Possible. But note that we also cancel the dispatcher callbacks. This makes
> it more difficult to cleanup the user_data: we would need a reference to the
> user_data to free it ourselves.

Yeah, canceling doesn't call the callback, so it would be hard to clean stuff up with the existing code.  We could add a 'cancelled' argument to the callback prototype later, but I'm fine with what you have now.  But one request: let's zero out the 'post_state' and 'post_state_reason' in dispatcher_cleanup() too.

> Btw. do you like the fixup! to create a "struct {} dispatcher;" member? I think
> it neatly binds the parameters together, so it's already relatively clear(??).

Yes, I agree!

> > (In reply to comment #2)
> > > > See commit message. But maybe we should also revise the
> > > > nm_device_slave_notify_release(), which checks 
> > > >   "reason != NM_DEVICE_STATE_REASON_NONE"
> > > > Why this check, is it correct? Or could we add a comment explaining why?
> > > 
> > > It gets called with REASON_NONE in the case where NM observed someone else
> > > releasing the slave, and we just need to update our own internal bookkeeping,
> > > rather than actually modifying things. Maybe we should add a new explicit
> > > NMDeviceStateReason for that instead...
> > 
> > Yeah, a new reason for that seems cleaner to me.
> 
> I solved it differently. See new fixup.

I think that seems fine?  danw should give his +1 (or not) since he knows more about the master/slave code in this specific case.


----

> dispatcher: save callouts for empty script directory based on dispatcher action type

 typedef struct {
-	const char *dir;
+	const char * const dir;
 	GFileMonitor *monitor;

No space after the "*".

_get_monitor_by_action(): I was never a huge fan of compact case statements like this, I don't mind having the 'return xxx' on a line below and adding 3LoC.  I think that's a bit clearer?

> dispatcher: for debug logging, truncate the (expected) script name when showing dispatcher results

At the risk of being OCD, micro-optimization suggestion: add a "dir_len" member to 'monitors' and then initialize with sizeof(NMD_SCRIPT_DIR), then we don't need a strlen() every time we run dispatcher_results_process().  I'm not sure the compiler is smart enough to know that 'monitors' is static and the strings are static; might as well make the complier to the strlen() work for us at build time instead of runtime?

Comment 7 Dan Williams 2014-06-13 20:18:24 UTC

Current state of the branch looks good to me.

Comment 8 Dan Williams 2014-06-16 17:10:31 UTC

Updated patches review:

> dispatcher: better detection for dispatcher scripts

Instead of while (TRUE), how about:

		item->has_scripts = FALSE;
		while ((name = g_dir_read_name (dir)) && !item->has_scripts) {
			char *full_name;

			full_name = g_build_filename (item->dir, name, NULL);
			if (g_file_test (full_name, G_FILE_TEST_IS_EXECUTABLE))
				item->has_scripts = TRUE;
			g_free (full_name);
			errno = 0;
		}
		errsv = errno;
		g_dir_close (dir);

and also, about:

	if (error->domain == G_FILE_ERROR && error->code == G_FILE_ERROR_NOENT) {

how about:

	if (g_error_matches (error, G_FILE_ERROR, G_FILE_ERROR_NOENT))

Comment 9 Dan Williams 2014-06-16 17:11:34 UTC

Or even:

item->has_scripts = g_file_test (full_name, G_FILE_TEST_IS_EXECUTABLE);

Comment 10 Dan Williams 2014-06-16 17:12:57 UTC

Yeah, your "fixup! dispatcher: better detection for dispatcher scripts" looks fine for the while() bits.

Comment 11 Thomas Haller 2014-06-16 17:23:44 UTC

Merged as:

http://cgit.freedesktop.org/NetworkManager/NetworkManager/commit/?id=d71ec13123a0acc80bce9ed92bf48aad93c527ee