GNOME Bugzilla – Bug 791306
factory: race condition leading to: e_client_set_bus_name: assertion 'bus_name != NULL && *bus_name != '\0'' failed
Last modified: 2021-05-19 11:03:25 UTC
Created attachment 365105 [details] a copy of the test failure with 3.18, includes D-Bus traffic dump This occurred when running SyncEvolution tests against 3.18 (on Ubuntu Xenial) and against 3.26 (on Debian Testing). The first problem is that the factory calls GetNameOwner before the subprocess has actually connected to D-Bus, which fails: method call time=1512481369.064102 sender=:1.11 -> destination=org.freedesktop.DBus serial=774 path=/org/freedesktop/DBus; interface=org.freedesktop.DBus; member=AddMatch string "type='signal',sender='org.freedesktop.DBus',interface='org.freedesktop.DBus',member='NameOwnerChanged',path='/org/freedesktop/DBus',arg0='org.gnome.evolution.dataserver.Subprocess.Backend.Calendarx88x46'" method return time=1512481369.064121 sender=org.freedesktop.DBus -> destination=:1.11 serial=551 reply_serial=774 method call time=1512481369.064127 sender=:1.11 -> destination=org.freedesktop.DBus serial=775 path=/org/freedesktop/DBus; interface=org.freedesktop.DBus; member=GetNameOwner string "org.gnome.evolution.dataserver.Subprocess.Backend.Calendarx88x46" error time=1512481369.064136 sender=org.freedesktop.DBus -> destination=:1.11 error_name=org.freedesktop.DBus.Error.NameHasNoOwner reply_serial=775 string "Could not get owner of name 'org.gnome.evolution.dataserver.Subprocess.Backend.Calendarx88x46': no such name" method call time=1512481369.285505 sender=:1.291 -> destination=org.freedesktop.DBus serial=1 path=/org/freedesktop/DBus; interface=org.freedesktop.DBus; member=Hello method return time=1512481369.285525 sender=org.freedesktop.DBus -> destination=:1.291 serial=1 reply_serial=1 string ":1.291" ... method call time=1512481369.304730 sender=:1.291 -> destination=org.freedesktop.DBus serial=16 path=/org/freedesktop/DBus; interface=org.freedesktop.DBus; member=RequestName string "org.gnome.evolution.dataserver.Subprocess.Backend.Calendarx88x46" uint32 0 ... Not sure what the factory does in this case and whether it causes the actual error. The actual error that breaks the testing later on is that 1.291 disconnects right before the client tries to contact it: signal time=1512481369.983462 sender=org.freedesktop.DBus -> destination=(null destination) serial=13 path=/org/freedesktop/DBus; interface=org.freedesktop.DBus; member=NameOwnerChanged string ":1.291" string ":1.291" string "" error time=1512481369.983484 sender=org.freedesktop.DBus -> destination=:1.11 error_name=org.freedesktop.DBus.Error.NoReply reply_serial=789 string "Message recipient disconnected from message bus without replying" method call time=1512481369.986571 sender=:1.11 -> destination=org.freedesktop.DBus serial=790 path=/org/freedesktop/DBus; interface=org.freedesktop.DBus; member=RemoveMatch string "type='signal',sender='org.gnome.evolution.dataserver.Subprocess.Backend.Calendarx88x46',interface='org.freedesktop.DBus.Properties',member='PropertiesChanged',path='/org/gnome/evolution/dataserver/Subprocess/Backend/Calendar/88/46',arg0='org.gnome.evolution.dataserver.Subprocess.Backend'" method return time=1512481369.986587 sender=org.freedesktop.DBus -> destination=:1.11 serial=561 reply_serial=790 method call time=1512481369.986604 sender=:1.11 -> destination=org.freedesktop.DBus serial=791 path=/org/freedesktop/DBus; interface=org.freedesktop.DBus; member=RemoveMatch string "type='signal',sender='org.gnome.evolution.dataserver.Subprocess.Backend.Calendarx88x46',interface='org.gnome.evolution.dataserver.Subprocess.Backend',path='/org/gnome/evolution/dataserver/Subprocess/Backend/Calendar/88/46'" method return time=1512481369.986614 sender=org.freedesktop.DBus -> destination=:1.11 serial=562 reply_serial=791 method call time=1512481369.986636 sender=:1.11 -> destination=org.freedesktop.DBus serial=792 path=/org/freedesktop/DBus; interface=org.freedesktop.DBus; member=RemoveMatch string "type='signal',sender='org.freedesktop.DBus',interface='org.freedesktop.DBus',member='NameOwnerChanged',path='/org/freedesktop/DBus',arg0='org.gnome.evolution.dataserver.Subprocess.Backend.Calendarx88x46'" method return time=1512481369.986646 sender=org.freedesktop.DBus -> destination=:1.11 serial=563 reply_serial=792 method return time=1512481369.986651 sender=:1.11 -> destination=:1.289 serial=793 reply_serial=100 string "/org/gnome/evolution/dataserver/Subprocess/9544/3" string "" method call time=1512481370.020996 sender=:1.289 -> destination=org.freedesktop.DBus serial=101 path=/org/freedesktop/DBus; interface=org.freedesktop.DBus; member=RemoveMatch string "type='signal',sender='org.freedesktop.DBus',interface='org.freedesktop.DBus',member='NameOwnerChanged',path='/org/freedesktop/DBus',arg0='org.gnome.evolution.dataserver.Calendar7'" method return time=1512481370.021024 sender=org.freedesktop.DBus -> destination=:1.289 serial=72 reply_serial=101 method call time=1512481370.021030 sender=:1.289 -> destination=org.freedesktop.DBus serial=102 path=/org/freedesktop/DBus; interface=org.freedesktop.DBus; member=RemoveMatch string "type='signal',sender='org.gnome.evolution.dataserver.Calendar7',interface='org.freedesktop.DBus.Properties',member='PropertiesChanged',path='/org/gnome/evolution/dataserver/CalendarFactory',arg0='org.gnome.evolution.dataserver.CalendarFactory'" method return time=1512481370.021039 sender=org.freedesktop.DBus -> destination=:1.289 serial=73 reply_serial=102 method call time=1512481370.022245 sender=:1.289 -> destination=org.freedesktop.DBus serial=103 path=/org/freedesktop/DBus; interface=org.freedesktop.DBus; member=RemoveMatch string "type='signal',sender='org.gnome.evolution.dataserver.Calendar7',interface='org.gnome.evolution.dataserver.CalendarFactory',path='/org/gnome/evolution/dataserver/CalendarFactory'" method return time=1512481370.022261 sender=org.freedesktop.DBus -> destination=:1.289 serial=74 reply_serial=103 method call time=1512481370.024676 sender=:1.289 -> destination=org.freedesktop.DBus serial=104 path=/org/freedesktop/DBus; interface=org.freedesktop.DBus; member=AddMatch string "type='signal',sender='org.freedesktop.DBus',interface='org.freedesktop.DBus',member='NameOwnerChanged',path='/org/freedesktop/DBus',arg0=''" method return time=1512481370.024693 sender=org.freedesktop.DBus -> destination=:1.289 serial=75 reply_serial=104 method call time=1512481370.024713 sender=:1.289 -> destination=org.freedesktop.DBus serial=105 path=/org/freedesktop/DBus; interface=org.freedesktop.DBus; member=GetNameOwner string "" error time=1512481370.024723 sender=org.freedesktop.DBus -> destination=:1.289 error_name=org.freedesktop.DBus.Error.NameHasNoOwner reply_serial=105 string "Could not get owner of name '': no such name" signal time=1512481370.036026 sender=:1.286 -> destination=(null destination) serial=146 path=/org/syncevolution/Server; interface=org.syncevolution.Server; member=LogOutput object path "/org/syncevolution/Session/10232410671512481356" string "ERROR" string "e-data-server: e_client_set_bus_name: assertion 'bus_name != NULL && *bus_name != '\0'' failed" string "" Perhaps the error already is in the factory where it replies with an empty string: method call time=1512481369.902700 sender=:1.289 -> destination=:1.11 serial=100 path=/org/gnome/evolution/dataserver/CalendarFactory; interface=org.gnome.evolution.dataserver.CalendarFactory; member=OpenMemoList string "system-memo-list" ... method return time=1512481369.986651 sender=:1.11 -> destination=:1.289 serial=793 reply_serial=100 string "/org/gnome/evolution/dataserver/Subprocess/9544/3" string ""
Thanks for a bug report. I am not sure how to reproduce this, though it probably requires proper thread/process interleaving or anything like that. I do not know D-Bus internals, neither I'm sure how to decipher the log, but let's focus on the first thing, as the rest can be just due to the first failure. The code itself works as this: a) an EClient descendant calls to open some ESource on the client side b) this request is received by evolution-calendar-factory or evolution-addressbook-factory, where either a bus name of an existing -subprocess is returned or a new -subprocess is run c) the bus name to use is passed from the factory to the -subprocess and the factory waits for it to appear, while the -subprocess calls to own the name. d) once the name appear, it's returned to the client which works with it in some way That's what your log also reflects. There's only not shown which process tried which call. I'd expect that once any process received "NameOwnerChanged", then any other process, definitely this one, can call "GetNameOwner" on the passed-in name and receive proper information, but it's not the case here. Would it be a problem in D-Bus protocol itself? Or improper thread/process interleaving? I'm really not sure. I'd need a reliable reproducer to debug this further, which would help me to understand what is going on in the background and eventually come with something usable.
(In reply to Milan Crha from comment #1) > Thanks for a bug report. I am not sure how to reproduce this, though it > probably requires proper thread/process interleaving or anything like that. > I do not know D-Bus internals, neither I'm sure how to decipher the log, but > let's focus on the first thing, as the rest can be just due to the first > failure. > > The code itself works as this: > a) an EClient descendant calls to open some ESource on the client side > b) this request is received by evolution-calendar-factory or > evolution-addressbook-factory, where either a bus name of an existing > -subprocess is returned or a new -subprocess is run > c) the bus name to use is passed from the factory to the -subprocess and > the factory waits for it to appear, while the -subprocess calls to own > the name. Agreed, it looks like the failed GetNameOwner method call can be ignored, because the factory then waits for the signal that it set a watch on. > d) once the name appear, it's returned to the client which works with it in > some way > > That's what your log also reflects. Except that it goes wrong at some point. > There's only not shown which process > tried which call. I'd expect that once any process received > "NameOwnerChanged", then any other process, definitely this one, can call > "GetNameOwner" on the passed-in name and receive proper information, but > it's not the case here. 1.11 is the factory. 1.291 is the sub process. 1.289 is the client. Here's the part of the D-Bus log where it goes wrong: method return time=1512481369.986651 sender=:1.11 -> destination=:1.289 serial=793 reply_serial=100 string "/org/gnome/evolution/dataserver/Subprocess/9544/3" string "" method call time=1512481370.020996 sender=:1.289 -> destination=org.freedesktop.DBus serial=101 path=/org/freedesktop/DBus; interface=org.freedesktop.DBus; member=RemoveMatch string "type='signal',sender='org.freedesktop.DBus',interface='org.freedesktop.DBus',member='NameOwnerChanged',path='/org/freedesktop/DBus',arg0='org.gnome.evolution.dataserver.Calendar7'" method return time=1512481370.021024 sender=org.freedesktop.DBus -> destination=:1.289 serial=72 reply_serial=101 method call time=1512481370.021030 sender=:1.289 -> destination=org.freedesktop.DBus serial=102 path=/org/freedesktop/DBus; interface=org.freedesktop.DBus; member=RemoveMatch string "type='signal',sender='org.gnome.evolution.dataserver.Calendar7',interface='org.freedesktop.DBus.Properties',member='PropertiesChanged',path='/org/gnome/evolution/dataserver/CalendarFactory',arg0='org.gnome.evolution.dataserver.CalendarFactory'" method return time=1512481370.021039 sender=org.freedesktop.DBus -> destination=:1.289 serial=73 reply_serial=102 method call time=1512481370.022245 sender=:1.289 -> destination=org.freedesktop.DBus serial=103 path=/org/freedesktop/DBus; interface=org.freedesktop.DBus; member=RemoveMatch string "type='signal',sender='org.gnome.evolution.dataserver.Calendar7',interface='org.gnome.evolution.dataserver.CalendarFactory',path='/org/gnome/evolution/dataserver/CalendarFactory'" method return time=1512481370.022261 sender=org.freedesktop.DBus -> destination=:1.289 serial=74 reply_serial=103 method call time=1512481370.024676 sender=:1.289 -> destination=org.freedesktop.DBus serial=104 path=/org/freedesktop/DBus; interface=org.freedesktop.DBus; member=AddMatch string "type='signal',sender='org.freedesktop.DBus',interface='org.freedesktop.DBus',member='NameOwnerChanged',path='/org/freedesktop/DBus',arg0=''" method return time=1512481370.024693 sender=org.freedesktop.DBus -> destination=:1.289 serial=75 reply_serial=104 method call time=1512481370.024713 sender=:1.289 -> destination=org.freedesktop.DBus serial=105 path=/org/freedesktop/DBus; interface=org.freedesktop.DBus; member=GetNameOwner string "" error time=1512481370.024723 sender=org.freedesktop.DBus -> destination=:1.289 error_name=org.freedesktop.DBus.Error.NameHasNoOwner reply_serial=105 string "Could not get owner of name '': no such name" The factory is telling the client to use a sub-process with empty name, which then of course fails. The reason probably is that the sub-process quit shortly before that: signal time=1512481369.983462 sender=org.freedesktop.DBus -> destination=(null destination) serial=13 path=/org/freedesktop/DBus; interface=org.freedesktop.DBus; member=NameOwnerChanged string ":1.291" string ":1.291" string "" error time=1512481369.983484 sender=org.freedesktop.DBus -> destination=:1.11 error_name=org.freedesktop.DBus.Error.NoReply reply_serial=789 string "Message recipient disconnected from message bus without replying" I don't know whether the sub-process crashed or just decided that it is time to stop. Either way, this is something that needs to be handled by the factory, for example by restarting the sub-process instead of returning an empty response to the client. But I'm not convinced that this is a full solution. How is it guaranteed that the sub-process keeps running until the client had a chance to connect to it? > Would it be a problem in D-Bus protocol itself? Or improper thread/process > interleaving? I'm really not sure. I'd need a reliable reproducer to debug > this further, which would help me to understand what is going on in the > background and eventually come with something usable. I'm not sure how to reproduce this easily. Perhaps you can enforce this particular execution flow by inserting sleeps into factory, helper and/or client?
(In reply to Patrick Ohly from comment #2) > I don't know whether the sub-process crashed or just decided that it is time > to stop. Either way, this is something that needs to be handled by the > factory, for example by restarting the sub-process instead of returning an > empty response to the client. I do not like an idea of automatic restarts. When there's a real issue with the process start then the restart just means a busy loop with constant crashes in the background. That's nothing one should do. > How is it guaranteed that the sub-process keeps running until the client had > a chance to connect to it? The subprocess runs a main loop and it's kept running until something explicitly doesn't ask to close the subprocess or the factory process is closed. > I'm not sure how to reproduce this easily. Perhaps you can enforce this > particular execution flow by inserting sleeps into factory, helper and/or > client? Yes, I've been thinking of "cheating" in the code, to imitate certain execution order, but I'm still lost what to try to achieve. I believe the error "Message recipient disconnected from message bus without replying" is the main clue.
(In reply to Milan Crha from comment #3) > (In reply to Patrick Ohly from comment #2) > > I don't know whether the sub-process crashed or just decided that it is time > > to stop. Either way, this is something that needs to be handled by the > > factory, for example by restarting the sub-process instead of returning an > > empty response to the client. > > I do not like an idea of automatic restarts. When there's a real issue with > the process start then the restart just means a busy loop with constant > crashes in the background. That's nothing one should do. True, it should be limited to the case were the subprocess stopped normally. The factory can track that based on the exit code. > > How is it guaranteed that the sub-process keeps running until the client had > > a chance to connect to it? > > The subprocess runs a main loop and it's kept running until something > explicitly doesn't ask to close the subprocess or the factory process is > closed. What about the case where a subprocess gets reused? That seems to have happened here, i.e. 1.291 was used, then quits (normally, as far as I can tell) right at the time when the factory wants to use it again. > > I'm not sure how to reproduce this easily. Perhaps you can enforce this > > particular execution flow by inserting sleeps into factory, helper and/or > > client? > > Yes, I've been thinking of "cheating" in the code, to imitate certain > execution order, but I'm still lost what to try to achieve. I believe the > error "Message recipient disconnected from message bus without replying" is > the main clue. I'm not so sure about that, because that seems to happen several times without affecting the factory. First example: method call time=1512481368.981935 sender=:1.9 -> destination=:1.290 serial=314 path=/org/gnome/evolution/dataserver/Subprocess/Backend/AddressBook/70/20; interface=org.gnome.evolution.dataserver.Subprocess.Backend; member=Close ... signal time=1512481369.025259 sender=org.freedesktop.DBus -> destination=(null destination) serial=11 path=/org/freedesktop/DBus; interface=org.freedesktop.DBus; member=NameOwnerChanged string ":1.290" string ":1.290" string "" error time=1512481369.025284 sender=org.freedesktop.DBus -> destination=:1.9 error_name=org.freedesktop.DBus.Error.NoReply reply_serial=314 string "Message recipient disconnected from message bus without replying" Yes, it looks like the subprocess doesn't reply properly or lacks flushing of its pending D-Bus reply message, but the actual error is further down with a different subprocess (1.291). Is this perhaps the relevant race? It looks like the subprocess is asked to close while there's a "Create" method call pending: method call time=1512481369.902700 sender=:1.289 -> destination=:1.11 serial=100 path=/org/gnome/evolution/dataserver/CalendarFactory; interface=org.gnome.evolution.dataserver.CalendarFactory; member=OpenMemoList string "system-memo-list" method call time=1512481369.902986 sender=:1.11 -> destination=:1.291 serial=788 path=/org/gnome/evolution/dataserver/Subprocess/Backend/Calendar/88/46; interface=org.gnome.evolution.dataserver.Subprocess.Backend; member=Create string "system-memo-list" string "ECalBackendFileJournalFactory" string "/usr/lib/evolution-data-server/calendar-backends/libecalbackendfile.so" signal time=1512481369.906493 sender=:1.291 -> destination=(null destination) serial=39 path=/org/gnome/evolution/dataserver/Subprocess/Backend/Calendar/88/46; interface=org.gnome.evolution.dataserver.Subprocess.Backend; member=BackendClosed string ":1.289" method call time=1512481369.906520 sender=:1.291 -> destination=:1.6 serial=40 path=/org/gnome/evolution/dataserver/SourceManager/Source_34; interface=org.gnome.evolution.dataserver.Source; member=UnsetLastCredentialsRequiredArguments method call time=1512481369.906926 sender=:1.11 -> destination=:1.291 serial=789 path=/org/gnome/evolution/dataserver/Subprocess/Backend/Calendar/88/46; interface=org.gnome.evolution.dataserver.Subprocess.Backend; member=Close method return time=1512481369.908076 sender=:1.6 -> destination=:1.291 serial=555 reply_serial=40 method return time=1512481369.935984 sender=:1.291 -> destination=:1.11 serial=41 reply_serial=788 string "/org/gnome/evolution/dataserver/Subprocess/9544/3" method call time=1512481369.941889 sender=:1.291 -> destination=org.freedesktop.DBus serial=42 path=/org/freedesktop/DBus; interface=org.freedesktop.DBus; member=ReleaseName string "org.gnome.evolution.dataserver.Subprocess.Backend.Calendarx88x46" signal time=1512481369.941908 sender=org.freedesktop.DBus -> destination=:1.291 serial=23 path=/org/freedesktop/DBus; interface=org.freedesktop.DBus; member=NameLost string "org.gnome.evolution.dataserver.Subprocess.Backend.Calendarx88x46" signal time=1512481369.941917 sender=org.freedesktop.DBus -> destination=(null destination) serial=559 path=/org/freedesktop/DBus; interface=org.freedesktop.DBus; member=NameOwnerChanged string "org.gnome.evolution.dataserver.Subprocess.Backend.Calendarx88x46" string ":1.291" string "" It replies to the "Create" method call, but then proceeds with shutting down.
I tried to reproduce here the simple thing, quit the factory-subprocess main loop, but wait few seconds to unown the old D-Bus name (the long one, not the numbered), and it didn't trigger the error for me. I verified that the D-Bus name (the login one) changes every time the backend is closed (and its factory-subprocess quits), thus it's should not be a problem that D-Bus assigns the same "numbered bus name (source & destination) as the previous process. It's up to the D-Bus to quarantee that they won't clash. I managed to compile syncevolution with unit tests and then ran: $ ./client-test Client::Source::eds_event \ Client::Source::eds_task Client::Source::eds_memo with the test sources being On This Computer calendar/memo/task list. None of them triggered the issue, each test succeeded. I thought I'll try also with Client::Sync:: counter parts, but that failed with an error about the On This Computer sources cannot handle it. I didn't try more, but it would be interesting to know what types your sources in eds are, maybe it has its influence on it.
Have you looked at my analysis of the race condition in comment #4? Do you agree that the subprocess shouldn't quit because it just replied to a Create method call? Or is it the factory which shouldn't call Close because it just sent a Create? I don't know how to enforce this particular situation. Perhaps a stress test which repeatedly closes and immediately opens an ESource triggers it. The Client::Source tests do that indirectly, but not intentionally, i.e. they just happen to open and close a lot. There's no dedicated test that calls open and close in a loop. I run the tests on a heavily loaded system, one which tests as many unrelated things at once. So that may be a contributing factor.
(In reply to Patrick Ohly from comment #6) > Have you looked at my analysis of the race condition in comment #4? I did, but I'm rather confused with the D-Bus call logs, thus I'd also want the reproducer, with which I can play a bit more. > I don't know how to enforce this particular situation. Perhaps a stress test > which repeatedly closes and immediately opens an ESource triggers it. The > Client::Source tests do that indirectly, but not intentionally, i.e. they > just happen to open and close a lot. There's no dedicated test that calls > open and close in a loop. I see, I thought eds has its own test for it, but I cannot find it. It might be a good opportunity to add it. > I run the tests on a heavily loaded system, one which tests as many > unrelated things at once. So that may be a contributing factor. Right, I know that many things work differently when the system is heavy loaded. I'll make a note about it.
GNOME is going to shut down bugzilla.gnome.org in favor of gitlab.gnome.org. As part of that, we are mass-closing older open tickets in bugzilla.gnome.org which have not seen updates for a longer time (resources are unfortunately quite limited so not every ticket can get handled). If you can still reproduce the situation described in this ticket in a recent and supported software version, then please follow https://wiki.gnome.org/Community/GettingInTouch/BugReportingGuidelines and create a new ticket at https://gitlab.gnome.org/GNOME/evolution-data-server/-/issues/ Thank you for your understanding and your help.