GNOME Bugzilla – Bug 790531
Changing the background leaks a timerfd
Last modified: 2021-07-05 14:18:16 UTC
When a background object is destroyed, its associated clock object is also destroyed, but the clock object leaves a timerfd behind. If backgrounds are being created and removed in a regular cycle, this causes the system to eventually run out of file descriptors. I am attaching a patch to fix this by creating one clock object that's associated with the cache and reusing it for every new background.
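To illustrate the approach taken in the attached patch, here is a rough sketch of the idea only (not the patch itself); the getClock() accessor name is a hypothetical placeholder:

// in BackgroundCache._init() (sketch): the cache owns the one and only wall clock
this._clock = new GnomeDesktop.WallClock();

// hypothetical accessor on the cache, for illustration only:
getClock: function() {
    return this._clock;
},

// in Background._init() (sketch), instead of creating a private clock:
this._clock = this._cache.getClock();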
Created attachment 363968 [details] [review] Fix timerfd leak - V1
hey thanks for looking into this!

(In reply to Jonathan Dieter from comment #0)
> When a background object is destroyed, its associated clock object is also
> destroyed

It's not, actually, so I think fixing that instead is a better way to go than attachment 363968 [details] [review]. I have that fix as attachment 365590 [details] [review] on bug 791655, so I'm going to mark this bug as a duplicate of that one.

*** This bug has been marked as a duplicate of bug 791655 ***
I've just checked to make sure this actually fixes the problem and I'm still getting leaked timerfds. Is the GnomeWallClock object not stopping its timers when it's destroyed? I'm working on an extension that changes the background at regular intervals, and gnome-shell runs out of file descriptors overnight. If I completely comment out the clock and timezone code from the Background class, the number of timers stays constant.
maybe you're leaking background objects?
I don't think that's the case. My current code replaces GnomeDesktop.WallClock with a dummy class, and I'm no longer leaking timerfds, so I'm pretty sure this is a bug in GnomeWallClock.
well sure, but that would be true if you were leaking background objects too, right? Can you ssh in from another machine, gdb attach to gnome-shell, and then do something like:

(gdb) break gnome_wall_clock_init
Breakpoint 1 at 0xc00lc0de
(gdb) commands 1
Type commands for breakpoint(s) 1, one per line. End with a line saying just "end".
print self
continue
end
(gdb) break gnome_wall_clock_finalize
Breakpoint 2 at 0xbadc0de
(gdb) commands 2
Type commands for breakpoint(s) 2, one per line. End with a line saying just "end".
print object
continue
end
(gdb) continue

and see if you get a finalize for every init? Also, adding a log call to the background.js _init and destroy methods would be useful.
Created attachment 367321 [details]
Simple test case that demonstrates the bug

Sorry, I think I misunderstood you. I thought you were suggesting that I was leaking instances of the Background class in background.js, but, if I understand you correctly, you're suggesting that I may be leaking the GnomeWallClock object that lies under the GnomeDesktop.WallClock class? I believe you're correct. The output from gdb shows that there's no finalize, just inits.

I've put together a small test file that demonstrates the bug. It has an infinite loop that creates a new instance of GnomeDesktop.WallClock(), waits a second, and then sets it to null (which should release the references and allow it to be garbage collected, correct?). On my system (up-to-date Fedora 27), this creates a new timerfd every second.

GDB output:

(gdb) break gnome_wall_clock_init
Breakpoint 1 at 0x7fcc7d5ccc90
(gdb) commands 1
Type commands for breakpoint(s) 1, one per line. End with a line saying just "end".
>print self
>continue
>end
(gdb) break gnome_wall_clock_finalize
Breakpoint 2 at 0x7fcc7d5cc6a0
(gdb) commands 2
Type commands for breakpoint(s) 2, one per line. End with a line saying just "end".
>print object
>continue
>end
(gdb) continue
Continuing.

Thread 1 "gjs" hit Breakpoint 1, 0x00007fcc7d5ccc90 in gnome_wall_clock_init () from /lib64/libgnome-desktop-3.so.12
No symbol table is loaded. Use the "file" command.
(gdb) continue
Continuing.

Thread 1 "gjs" hit Breakpoint 1, 0x00007fcc7d5ccc90 in gnome_wall_clock_init () from /lib64/libgnome-desktop-3.so.12
No symbol table is loaded. Use the "file" command.
(gdb) continue
Continuing.

Thread 1 "gjs" hit Breakpoint 1, 0x00007fcc7d5ccc90 in gnome_wall_clock_init () from /lib64/libgnome-desktop-3.so.12
No symbol table is loaded. Use the "file" command.
(gdb) quit
A debugging session is active.

    Inferior 1 [process 4778] will be detached.
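For readers without access to the attachment, an approximate reconstruction of the kind of test case described above (a guess at its contents, not the attached file itself) would be something like the following, run with gjs:

// Create a WallClock, wait a second, drop the only reference, repeat.
// Watch /proc/<pid>/fd accumulate anonymous inodes (timerfds).
const GnomeDesktop = imports.gi.GnomeDesktop;
const GLib = imports.gi.GLib;

while (true) {
    let clock = new GnomeDesktop.WallClock();
    GLib.usleep(1000000);   // wait one second
    clock = null;           // drop the reference; the timerfd stays open until GC
}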
well, either, really; I'm trying to pinpoint the problem with you. Two thoughts:

1) can you add the log calls to background.js that I mentioned in comment 5, to see if the background objects are getting cleaned up?

2) can you change your test case to
- use GLib.timeout_add instead of sleep
- call System.gc()
and report if that changes anything.
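A minimal sketch of the variant described in point 2 above (my reading of the suggestion, not code taken from this bug): replace the blocking sleep with a main-loop timeout and force a collection on each cycle.

const GnomeDesktop = imports.gi.GnomeDesktop;
const GLib = imports.gi.GLib;
const System = imports.system;

let loop = GLib.MainLoop.new(null, false);

GLib.timeout_add(GLib.PRIORITY_DEFAULT, 1000, () => {
    let clock = new GnomeDesktop.WallClock();
    clock = null;     // drop the reference...
    System.gc();      // ...and ask the JS engine to collect right away
    return GLib.SOURCE_CONTINUE;
});

loop.run();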
I won't be able to get on the system to do the logging until tomorrow, so I'll try (1) then. As for changing my test case, doing so has a very different effect: now there is one gnome_wall_clock_finalize() called for every gnome_wall_clock_init(), which seems to indicate the bug *is* in my code somewhere and I'm not cleaning up my background objects like I thought I was. Tomorrow I'll see if I can track this down. I wonder if the problem is in _swapBackgroundActor, where we don't seem to actually destroy the background itself. Perhaps there should be an oldBackgroundActor.background._delegate.destroy() call before destroying the oldBackgroundActor itself? Thanks so much for pointing me in the right direction.
Created attachment 367340 [details]
Error logged by gnome-shell on background change

Perhaps I can add to this, as I'm seeing the same leak of timerfd file handles if I change the background through gsettings with:

/usr/bin/gsettings set org.gnome.desktop.background picture-uri "file://path/to/image"

The attached file shows the error logged by gnome-shell along with a stack trace - which is hopefully of some help. The addition of a timerfd handle is clear in an lsof of the gnome-shell process.

Seeing this on the latest build of 3.26.2 on ArchLinux (3.26.2+14+g64c857e3f-1).
It could be that the GnomeWallClock object is kept in the GC heap indefinitely. Leaving it there will not close the timerfd, thus "leaking" it (until the GC happens to sweep it away). If I read the code correctly, GnomeWallClock always has one timerfd active, and it won't clean it up until dispose, and dispose won't be called until GC. If we call run_dispose() on the wall clock before unsetting it, we should avoid the timerfd leak.
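In background.js terms, that suggestion would look roughly like this (a sketch under the assumption that this._clock holds the Background object's GnomeDesktop.WallClock; not a tested patch):

// in Background.destroy(), before dropping the reference:
this._clock.run_dispose();   // close the timerfd now instead of waiting for GC
this._clock = null;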
(In reply to Ray Strode [halfline] from comment #8)
> 1) can you add the log calls to background.js that I mentioned in comment 5,
> to see if the background objects are getting cleaned up?

I've done this, and the background objects are not getting cleaned up. Background._init() is run, but Background.destroy() is not. I can now verify that the problem is that BackgroundManager._swapBackgroundActor() doesn't actually destroy the background. Adding oldBackgroundActor.background._delegate.destroy() before we run oldBackgroundActor.background.run_dispose() fixes this problem. But:

(In reply to Jonas Ådahl from comment #11)
> It could be that the GnomeWallClock object is kept in the GC heap
> indefinitely. Leaving it there will not close the timerfd, thus "leaking" it
> (until the GC happens to sweep it away).
>
> If I read the code correctly, GnomeWallClock always has one timerfd active,
> and it won't clean it up until dispose, and dispose won't be called until GC.
> If we call run_dispose() on the wall clock before unsetting it, we should
> avoid the timerfd leak.

This is completely correct. I ran a test where I changed the background every 0.1 seconds (after removing the Tweener in _swapBackgroundActor and cleaning up oldBackgroundActor directly), and we rapidly accumulate almost 1000 timerfds before the garbage collector kicks in and cleans them up. The good news is that the garbage collector works as expected and we never run out of file descriptors, but the bad news is that I tested running this._clock.run_dispose() before setting this._clock = null, and it *doesn't* dispose of the timerfds.

I don't know that I care how many timerfds are open as long as we're not running out of file descriptors, but that's probably because I don't know if there's any real cost to leaving them there. I will submit a patch that makes sure we destroy the background in _swapBackgroundActor().
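For reference, the change being described lands in the onComplete handler of _swapBackgroundActor(); a rough sketch (surrounding Tweener code elided, based on the pre-3.30 code paths discussed here):

onComplete: function() {
    oldBackgroundActor.background._delegate.destroy();   // destroy the JS Background wrapper
    oldBackgroundActor.background.run_dispose();          // free the Meta.Background
    oldBackgroundActor.destroy();
},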
Created attachment 367360 [details] [review] Fix timerfd leak - V2
Review of attachment 367360 [details] [review]:

hey thanks for the debugging and the patch! I think this is probably okay, though I wish we had a more generic technique in the codebase for handling the _delegate pattern that we could crib off of.

Another (better?) approach might be to ditch the .background._delegate complexity and add "Extends: Meta.Background," to the background class. It would be a bigger patch, but would simplify the code... is that something you'd be up for doing? (no worries if you don't want to)

::: js/ui/background.js
@@ +720,3 @@
                 transition: 'easeOutQuad',
                 onComplete: function() {
+                    oldBackgroundActor.background._delegate.destroy();

so (assuming you don't want to do the Extends: thing), I'd prefer if you did:

let background = oldBackgroundActor.background._delegate;
background.destroy();

then did this.background.run_dispose(); from the destroy function instead of here. just reads better, imo.
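Putting the two halves of that suggestion together, the shape being asked for would be roughly the following (a sketch of the review comment, not the committed code):

// in _swapBackgroundActor()'s onComplete:
let background = oldBackgroundActor.background._delegate;
background.destroy();
oldBackgroundActor.destroy();

// and in Background.destroy(), as the final step:
destroy: function() {
    // ...disconnect signals, destroy the clock, etc....
    this.background.run_dispose();
},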
Let me see what I can do. I haven't done much JavaScript programming, so you'll want to review my code carefully. If I add Extends: Meta.Background to the Background class, do I need to run this.run_dispose() in destroy()?
hmm, on the surface running this.run_dispose() sounds like it could cause crashes. To be honest, I'm not sure why it's calling run_dispose() at the moment anyway. It should get disposed automatically when the garbage collector runs (unless it's to hide a reference count leak?). Can you do a little code spelunking in the git history and see if there's a rationale for why that's there? Assuming there's no good rationale we can probably just drop the call.
Looks like the run_dispose was added to resolve bug 766353. From the commit:

"The memory which MetaBackground resides is freed automatically, when its wrapper object Background is freed by garbage collection. But because the memory for MetaBackground is huge, changing the background or changing the background mode repeatedly makes a lot of memory to be in use until garbage collection runs. This patch frees the MetaBackground object explicitly when its MetaBackgroundActor is destroyed which is the moment that the memory in the object is not used anymore."

https://github.com/GNOME/gnome-shell/commit/e72d388a674d4f7c42f5b532ffd9c43aa29bcc27
I did also spot that there is an existing call to destroy the Background object, fired by the 'changed' event in this block on line 597:

background._changedId = background.connect('changed', Lang.bind(this, function() {
    background.disconnect(background._changedId);
    background.destroy();
    delete this._backgrounds[monitorIndex];
}));

Could it be that the error I'm seeing in attachment 367340 [details] is aborting the processing of the 'changed' event, preventing that from being executed?
hmm, i don't think so. we can see the code you reference in comment 18 is connected in:

top of _createBackgroundActor→getBackground→'changed'

and the code leading to the spew is connected later in the function:

middle of _createBackgroundActor→'changed'→_updateBackgroundActor→_swapBackgroundActor→onComplete

So the code you mention should be running first! If anything, it might be the other way around: the code you mention above might lead to the spew getting generated later? but if that were the case i wouldn't expect attachment 367360 [details] [review] to change anything, and it fixes the problem for you, right?

Anyway, if you add some log calls you should be able to get a pretty good picture of the order of things getting run, and whether anything is failing to get run at all because of an exception getting thrown.

one idea... try rooting the oldBackgroundActor to the background manager object. in _swapBackgroundActor:

1) change "let oldBackgroundActor = ..." to "this._oldBackgroundActor"
2) change all instances of oldBackgroundActor to this._oldBackgroundActor
3) change "onComplete: function() {" to "onComplete: () => {" (to add the this object to its scope)
4) set this._oldBackgroundActor to null at the bottom of onComplete
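Applied to the _swapBackgroundActor() discussed in this bug, those four steps would look roughly like this (a sketch of the idea, not a tested patch; the Tweener call is abbreviated):

_swapBackgroundActor: function() {
    this._oldBackgroundActor = this.backgroundActor;   // step 1: root it on the manager
    // ...create and attach the new background actor as before...
    Tweener.addTween(this._oldBackgroundActor, {
        opacity: 0,
        time: FADE_ANIMATION_TIME,
        transition: 'easeOutQuad',
        onComplete: () => {                            // step 3: arrow function keeps `this`
            this._oldBackgroundActor.background.run_dispose();
            this._oldBackgroundActor.destroy();
            this._oldBackgroundActor = null;           // step 4: drop the root
        }
    });
},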
I've just tried a build with attachment 367360 [details] [review] applied and unfortunately it doesn't appear to fix the timerfd "leak". Interestingly, it does change the error pushed. I'm now getting this:

JS ERROR: Error calling onComplete: TypeError: can't convert null to object
Background<.destroy@resource:///org/gnome/shell/ui/background.js:282:20
wrapper@resource:///org/gnome/gjs/modules/_legacy.js:82:22
BackgroundManager<._swapBackgroundActor/<.onComplete@resource:///org/gnome/shell/ui/background.js:722:32
_addHandler/params[name]@resource:///org/gnome/shell/ui/tweener.js:91:13
_callOnFunction@resource:///org/gnome/gjs/modules/tweener/tweener.js:203:13
_updateTweenByIndex@resource:///org/gnome/gjs/modules/tweener/tweener.js:332:9
_updateTweens@resource:///org/gnome/gjs/modules/tweener/tweener.js:345:18
_onEnterFrame@resource:///org/gnome/gjs/modules/tweener/tweener.js:360:10
_emit@resource:///org/gnome/gjs/modules/signals.js:126:27
ClutterFrameTicker<._onNewFrame@resource:///org/gnome/shell/ui/tweener.js:208:9
wrapper@resource:///org/gnome/gjs/modules/_legacy.js:82:22
ClutterFrameTicker<._init/<@resource:///org/gnome/shell/ui/tweener.js:183:17

From my reading of the code it would seem the only way that error occurs is if destroy has already been called. That would suggest the code I referenced in comment 18 is working and the additional call to destroy is redundant.
thanks for checking. John, did the patch work for you?
(In reply to Ray Strode [halfline] from comment #21)
> thanks for checking. John, did the patch work for you?

Yes. I have it applied to four multiseat systems in our test room, which ran overnight with images changing every 30 seconds, and each gnome-shell session never exceeded 600 timerfds.

Please note, though, that I've written an extension that automatically changes the background every 30 seconds if picture-uri is a directory, and I'm using it to change the screen shield background (I've patched gnome-shell to allow extensions in the shield), not the actual background.

Mike, can you verify that you're actually running out of file descriptors, as opposed to accumulating timerfds that will eventually be garbage collected?
So it does look like the GC has swept up the timerfds overnight; unlocking this morning I had only 5 timerfds. This has now grown to ~70 over the past hour. Jon - how often are you seeing a GC sweep?

To give some background, I have a shell script (started via .config/autostart) that changes the desktop background to a randomly selected image from a directory every 60 seconds. This I had been using for years without problem. However, after updating from 3.22->3.24 in the summer I've been getting regular crashes from gnome-shell. When I looked at it then, it seemed to be a pretty non-specific segfault, and the situation improved after updating extensions, but I was still seeing it crash on unlock every 24-48 hours. Looking at it again this week (and with the move to 3.26) I discovered the "too many files" errors and the tweener stack trace, which led me here to Jon's bug report!

I haven't done an overnight test with Ray's original patch attachment 365590 [details] [review] (this only made it into the Arch build this week). I'll try this over the weekend and confirm whether or not attachment 367360 [details] [review] is helping.

Jon - I'm interested whether you're seeing the same errors I am? These show up in the systemd journal, which I'm retrieving by running 'journalctl -n 100' as the user running gnome-shell.
To follow up with my findings from the weekend...

With or without attachment 367360 [details] [review], the timerfds are always cleaned up by a GC sweep. However, it does sometimes take many hours before the GC kicks in. I left my system idle overnight last night with the script switching the background every minute, and it took over 15 hours before the GC ran, by which point there were 951 timerfds! The timerfds can be reset back to 6 at any point by running imports.system.gc() through Looking Glass.

Is there any way we could encourage the GC to run more often, because 951 timerfd handles is an awful lot?

I've traced this error:

JS ERROR: Error calling onComplete: TypeError: oldBackgroundActor.background is null

It originates with the call to oldBackgroundActor.background.run_dispose(), which it would appear now does nothing. I've removed this call in another build and the error disappears. As this implies that the background property of oldBackgroundActor is null when onComplete is called, I don't see how attachment 367360 [details] [review] can be doing anything effective.

Jon - I'm wondering if the reason you're seeing different behavior to me is down to how your extension changes the background? Maybe it's not causing the 'changed' event to be fired on the Background object, which means it's not being destroyed by the handler on line 572?

Unfortunately it looks like there is also a slow memory leak from background changes (likely the return of bug 766353, as run_dispose isn't working). I'll raise a separate bug report with details of this.
Created attachment 367687 [details] [review]
Fix timerfd leak - V3

This version of the patch Extends: the Meta.Background class, so it should be more elegant to work with. I've only tested it for a few hours, but I have set it up to change backgrounds as quickly as it possibly can, and, once again, we get up to ~1000 timerfds before garbage collection kicks in. I'm not seeing a memory leak, but that might be because all of my backgrounds are in the cache.
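For anyone unfamiliar with the legacy GJS class system, "Extends: Meta.Background" refers to the Lang.Class pattern gnome-shell used at the time; the skeleton below is only an illustration of that pattern, and the property names and cleanup details are assumptions rather than the contents of the V3 patch:

const Lang = imports.lang;
const Meta = imports.gi.Meta;
const GnomeDesktop = imports.gi.GnomeDesktop;

var Background = new Lang.Class({
    Name: 'Background',
    Extends: Meta.Background,   // the Background *is* a Meta.Background now

    _init: function(params) {
        this.parent(params);    // pass construct-only GObject properties through
        this._clock = new GnomeDesktop.WallClock();
        // ...connect clock/settings signals as before...
    },

    destroy: function() {
        // ...disconnect signals...
        this._clock.run_dispose();
        this._clock = null;
        this.run_dispose();     // no separate this.background to dispose anymore
    },
});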
(In reply to Ray Strode [halfline] from comment #16)
> hmm, on the surface running this.run_dispose() sounds like it could cause
> crashes.
>
> To be honest, I'm not sure why it's calling run_dispose() at the moment
> anyway. It should get disposed automatically when the garbage collector
> runs (unless it's to hide a reference count leak?).
>
> Can you do a little code spelunking in the git history and see if there's a
> rationale for why that's there?
>
> Assuming there's no good rationale we can probably just drop the call.

Because run_dispose() was in there for a reason, I am running it in destroy(), and I haven't hit any crashes. I don't know if that's luck or if it's working as expected.
(In reply to Mike Salway from comment #24)
> To follow up with my findings from the weekend...
>
> With or without attachment 367360 [details] [review], the timerfds are
> always cleaned up by a GC sweep. However, it does sometimes take many hours
> before the GC kicks in. I left my system idle overnight last night with the
> script switching the background every minute, and it took over 15 hours
> before the GC ran, by which point there were 951 timerfds!

Yep, I'm up to 990 at the moment.

> The timerfds can be reset back to 6 at any point by running
> imports.system.gc() through Looking Glass.
>
> Is there any way we could encourage the GC to run more often, because 951
> timerfd handles is an awful lot?

Ray, what would you say to manually running system.gc() after every ten picture changes? Is this a bit crazy?

> I've traced this error:
>
> JS ERROR: Error calling onComplete: TypeError: oldBackgroundActor.background
> is null
>
> It originates with the call to oldBackgroundActor.background.run_dispose(),
> which it would appear now does nothing. I've removed this call in another
> build and the error disappears. As this implies that the background property
> of oldBackgroundActor is null when onComplete is called, I don't see how
> attachment 367360 [details] [review] can be doing anything effective.
>
> Jon - I'm wondering if the reason you're seeing different behavior to me is
> down to how your extension changes the background? Maybe it's not causing
> the 'changed' event to be fired on the Background object, which means it's
> not being destroyed by the handler on line 572?

I'm only using getBackground on the initial load, and then switching Backgrounds internally, so background.destroy() was only run once. With V3 of the patch, we remove the attempt to destroy the background in line 572 and let _swapBackgroundActor (which emits the 'changed' event) deal with it.

> Unfortunately it looks like there is also a slow memory leak from background
> changes (likely the return of bug 766353, as run_dispose isn't working).
> I'll raise a separate bug report with details of this.

I'm not seeing this. How are you checking for a memory leak?
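To make that suggestion concrete, forcing a collection every N background changes would only take a few lines; this is a hypothetical illustration of the idea (the counter, threshold, and function name are made up, and this is not one of the attached patches):

const System = imports.system;

const GC_EVERY_N_CHANGES = 10;   // arbitrary threshold for illustration
let changesSinceGc = 0;

// call this after each background swap
function maybeCollect() {
    if (++changesSinceGc >= GC_EVERY_N_CHANGES) {
        System.gc();             // sweeps the unreferenced WallClocks and their timerfds
        changesSinceGc = 0;
    }
}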
Ray, have you had a chance to check my patch yet?
sorry, not yet! I appreciate the work you put into this and want to look soon. As an aside, if you're getting up to 990 timerfds, maybe we should revisit attachment 363968 [details] [review] after all.
Created attachment 368482 [details] [review]
Fix timerfd leak - V4

You know, even though I wrote that original patch, it didn't even occur to me that it might be a more elegant fix than running system.gc(). Yes, I think it might be time to resurrect it. I do like what I've done in V3 to make sure we're Extending the Meta.Background class, and it fixes a bug where Background.destroy() wasn't always being run, so I've combined the two patches into V4.
I've tested V4 on my systems, and we never exceed three timerfds.
I tried the last patch. It fixed the recurring "too many open files" crash I occasionally get. However instead of the crash, gnome-shell memory usage suddenly goes up from 290 to 560MB (the moment it previously crashed).
(In reply to Hussam Al-Tayeb from comment #32)
> I tried the last patch. It fixed the recurring "too many open files" crash I
> occasionally get. However instead of the crash, gnome-shell memory usage
> suddenly goes up from 290 to 560MB (the moment it previously crashed).

Is there anything in your gnome-shell logs? If so, do you mind posting them here?
Created attachment 368762 [details] [review]
Fix timerfd leak - V5

Ok, there was a bug in V4 that had us attempting to free the background twice. Fixed in V5.
It won't apply cleanly to the latest revision in the master branch.
Created attachment 368764 [details] [review]
Fix timerfd leak - V6

It seems there's been a cleanup effort in master that's changed how functions are defined. This should apply cleanly.
(In reply to Jonathan Dieter from comment #33)
> (In reply to Hussam Al-Tayeb from comment #32)
> > I tried the last patch. It fixed the recurring "too many open files" crash I
> > occasionally get. However instead of the crash, gnome-shell memory usage
> > suddenly goes up from 290 to 560MB (the moment it previously crashed).
>
> Is there anything in your gnome-shell logs? If so, do you mind posting them
> here?

https://pastebin.com/raw/ZRftwpht

Previously gnome-shell would crash with 'too many open files' and gnome-session(?) would restart it. Now it just jumps to 550MB memory usage from the mid 200s.
Created attachment 369003 [details] [review]
Run garbage collection after destroying a background

(In reply to Hussam Al-Tayeb from comment #37)
> https://pastebin.com/raw/ZRftwpht
> Previously gnome-shell would crash with 'too many open files' and
> gnome-session(?) would restart it. Now it just jumps to 550MB memory usage
> from the mid 200s.

This is an incredibly ugly patch that runs garbage collection after we destroy a background. I didn't put it in Background.destroy(), because I'm assuming it won't free up all the memory while we're still in a function belonging to the object, so I've sprinkled it throughout background.js. This patch should be applied on top of "Fix timerfd leak - V6".

Could you please test whether you're still seeing a huge memory usage jump?
Ok, I applied both patches in that order. I'll let you know when I hit the issue again. Thank you.
No "too many open files" gnome-shell errors since applying both patches. Now I hope I did not jinx it but it looks good so far.
GNOME is going to shut down bugzilla.gnome.org in favor of gitlab.gnome.org. As part of that, we are mass-closing older open tickets in bugzilla.gnome.org which have not seen updates for a longer time (resources are unfortunately quite limited so not every ticket can get handled). If you can still reproduce the situation described in this ticket in a recent and supported software version, then please follow https://wiki.gnome.org/GettingInTouch/BugReportingGuidelines and create a new ticket at https://gitlab.gnome.org/GNOME/gnome-shell/-/issues/ Thank you for your understanding and your help.