After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.
Bug 737507 - Switching from ArrayList to ConcurrentList causes 100% CPU usage
Switching from ArrayList to ConcurrentList causes 100% CPU usage
Status: RESOLVED INCOMPLETE
Product: libgee
Classification: Platform
Component: general
0.16.x
Other Linux
: High normal
: ---
Assigned To: libgee-maint
libgee-maint
Depends on:
Blocks:
 
 
Reported: 2014-09-28 04:13 UTC by Michael Catanzaro
Modified: 2014-10-28 22:38 UTC
See Also:
GNOME target: ---
GNOME version: ---


Attachments
Workaround 100% CPU usage issue (999 bytes, patch)
2014-10-04 21:53 UTC, Michael Catanzaro
committed Details | Review
gprof report (7.06 KB, application/octet-stream)
2014-10-22 01:03 UTC, Michael Catanzaro
  Details

Description Michael Catanzaro 2014-09-28 04:13:03 UTC
After 31f5096f5f21e97081642542a150b747d1702532 we now have a thread running forever using 100% CPU after using print multiple sudokus... I guess we should revert that commit, but it looks like a bug in libgee to me.
Comment 1 Michael Catanzaro 2014-10-04 19:06:37 UTC
This also happens when generating a single Sudoku, so on every new game. Drat.

The thread is successfully joined. Something else is going wrong.
Comment 2 Michael Catanzaro 2014-10-04 21:53:43 UTC
Created attachment 287728 [details] [review]
Workaround 100% CPU usage issue

No clue what's going wrong here....
Comment 3 Michael Catanzaro 2014-10-04 21:53:59 UTC
Comment on attachment 287728 [details] [review]
Workaround 100% CPU usage issue

Attachment 287728 [details] pushed as f89fb33 - Workaround 100% CPU usage issue
Comment 4 Michael Catanzaro 2014-10-04 22:36:13 UTC
Switching the list "threads" (a DIFFERENT list used only in the UI thread, two declarations down from boards_list, the one that caused this problem) from an ArrayList to a ConcurrentList, while leaving boards_list as an ArrayList, is sufficient to reproduce this bug. Replacing all the ArrayLists in SudokuBoard.vala with ConcurrentLists is also sufficient to introduce this bug, but replacing the ArrayLists in SudokuGame.vala is not.

I notice we block the UI thread waiting for each thread to finish unless they all finish in the order they started in, so this code needs rewritten anyway, but I'm going to move this bug over to libgee for debugging assistance.
Comment 5 Maciej (Matthew) Piechotka 2014-10-05 09:51:56 UTC
I'll try to look at this today but as I'm currently travelling I might have trouble to find time - it might happened that I'll look on it tomorrow or even next weekend.

To speed things up which version of sudoku should I use and are there any dependencies beyond GNOME 3.12?
Comment 6 Michael Catanzaro 2014-10-05 15:56:18 UTC
Thanks for looking into this (eventually :). Sudoku has been completely rewritten for 3.14.  You can debug with 3.14.0, or the tip of the gnome-3-14 branch, or master, though it might be easiest not to use master as we've added a clock and that adds a bunch of timeouts to the main loop that aren't relevant to this bug.
Comment 7 Michael Catanzaro 2014-10-05 18:13:48 UTC
(In reply to comment #6)
> Thanks for looking into this (eventually :). Sudoku has been completely
> rewritten for 3.14.  You can debug with 3.14.0, or the tip of the gnome-3-14
> branch, or master

I just rewrote this code on both gnome-3-14 and master, but there is still an ArrayList in lib/sudoku-generator.vala and you can still reproduce CPU usage issue when using Print Multiple Sudokus by switching it to a ConcurrentList.

(In reply to comment #4)
> Replacing all the ArrayLists in
> SudokuBoard.vala with ConcurrentLists is also sufficient to introduce this bug,
> but replacing the ArrayLists in SudokuGame.vala is not.

If you replace the ArrayLists in src/sudoku-game.vala with ConcurrentLists, use Print Multiple Sudokus (I test this with 100 sudokus), press Print and then press Print Preview, you'll trigger 100% CPU usage.

At this point I'm now convinced it's a Gee bug. :)
Comment 8 Maciej (Matthew) Piechotka 2014-10-12 06:32:30 UTC
(In reply to comment #7)
> (In reply to comment #6)
> > Thanks for looking into this (eventually :).

I just luxuries such as sleeping more then 2h during 48h period ;)
Sorry - the email caught me during transfer between flights and even if I did managed to look at it I wouldn't be very helpful.

>> Sudoku has been completely
> > rewritten for 3.14.  You can debug with 3.14.0, or the tip of the gnome-3-14
> > branch, or master
> 
> I just rewrote this code on both gnome-3-14 and master, but there is still an
> ArrayList in lib/sudoku-generator.vala and you can still reproduce CPU usage
> issue when using Print Multiple Sudokus by switching it to a ConcurrentList.
> 

I run into problems with building dependencies ATM so I mostly rely on guesswork for now. I have run into problems with Gnome Pi support on my distro - sorry for further delays.

> (In reply to comment #4)
> > Replacing all the ArrayLists in
> > SudokuBoard.vala with ConcurrentLists is also sufficient to introduce this bug,
> > but replacing the ArrayLists in SudokuGame.vala is not.
> 
> If you replace the ArrayLists in src/sudoku-game.vala with ConcurrentLists, use
> Print Multiple Sudokus (I test this with 100 sudokus), press Print and then
> press Print Preview, you'll trigger 100% CPU usage.
> 
> At this point I'm now convinced it's a Gee bug. :)

Hmm. There are two ways it might happen:
 - The GC thread is busy. I thought that there is a smarter policy in place but unless I missing something, there isn't.
 - Something in ConcurrentList is busy

I'd go with the first one. There are a few things which can be done on gnome-sudoku side to workaround the issue - which are worth to do even if the libgee issue will be fixed:
 - Use ReleasePolicy.MAIN_LOOP making the GC being performed in main loop
 - Use contexts in smart way. Otherwise the data is pushed to GC queue on end of each operation instead of pushing at the end of context lifetime (note that contexts are thread-specific).

For now I made the thread sleep 100 ms, though it's far from being ideal. 

commit 14ada432f82a5f1d4b65918e2a2de67a28e7e9db
Author: Maciej Piechotka <uzytkownik2@gmail.com>
Date:   Sat Oct 11 23:23:28 2014 -0700

    Sleep 100 ms in GC thread when there is nothing to do
    
    Temporal fix for bug 737507 until something more sane would come up
    while still being lock free.
Comment 9 Michael Catanzaro 2014-10-18 20:00:18 UTC
Thanks for looking into this! I tried out your fix today, but unfortunately it doesn't make any difference.

> I'd go with the first one. There are a few things which can be done on
> gnome-sudoku side to workaround the issue - which are worth to do even if the
> libgee issue will be fixed:
>  - Use ReleasePolicy.MAIN_LOOP making the GC being performed in main loop

This didn't help either. :(

>  - Use contexts in smart way. Otherwise the data is pushed to GC queue on end
> of each operation instead of pushing at the end of context lifetime (note that
> contexts are thread-specific).

Eh, that's looking a little complicated for a simple list; we'll probably stick to mutexes I guess.
Comment 10 Maciej (Matthew) Piechotka 2014-10-20 04:26:09 UTC
(In reply to comment #9)
> Thanks for looking into this! I tried out your fix today, but unfortunately it
> doesn't make any difference.
> 

Hmm. I don't have access to qqwing yet in my distro and I failed to compile. I could not reproduce it with dead simple usage:

int main() {
    Gee.ConcurrentList<Gee.ArrayList<int>?> list = new Gee.ConcurrentList<Gee.ArrayList<int>>();
    list.add(new Gee.ArrayList<int>());
    Idle.add(() => {
        list.remove_at (0);
        return false;
    });
    GLib.MainLoop loop = new GLib.MainLoop();
    loop.run();
    return 0;
}

Would it be possible to get a minimal repro example? Also what's profiling is indicating is causing the 100% CPU time?

> > I'd go with the first one. There are a few things which can be done on
> > gnome-sudoku side to workaround the issue - which are worth to do even if the
> > libgee issue will be fixed:
> >  - Use ReleasePolicy.MAIN_LOOP making the GC being performed in main loop
> 
> This didn't help either. :(
> 
> >  - Use contexts in smart way. Otherwise the data is pushed to GC queue on end
> > of each operation instead of pushing at the end of context lifetime (note that
> > contexts are thread-specific).
> 
> Eh, that's looking a little complicated for a simple list; we'll probably stick
> to mutexes I guess.

Ok. One way or another I'm interested in fixing it as both should be improvements not methods of avoiding 100% CPU boost. Contexts are complicated, true, but this was necessary choice as GObject does not have a true GC.
Comment 11 Michael Catanzaro 2014-10-21 02:11:04 UTC
Your example does not cause the problem for me either, though I should note that I get the following warning about a bajillion times when I run it:

** (process:28136): CRITICAL **: gee_hazard_pointer_try_free: assertion 'to_free != NULL' failed

Here is a small reproducer, based on yours, but using GApplication. Switching from ConcurrentList to ArrayList reduces my CPU usage from 100% of one CPU to 6% of one CPU:

// valac test.vala --pkg gio-2.0 --pkg gee-0.8
public class Test : Application {

    public Test() {
        Object (application_id: "org.gnome.bug737507");
    }

    protected override void activate() {
        var list = new Gee.ConcurrentList<int>();
        list.add(42);
        Idle.add(() => {
            list.remove_at (0);
            return false;   // Source.CONTINUE
        });

        this.hold();
    }
}

 int main() {
     return new Test().run();
 }

> Would it be possible to get a minimal repro example? Also what's profiling is
> indicating is causing the 100% CPU time?

I've just been opening System Monitor.

> Ok. One way or another I'm interested in fixing it as both should be
> improvements not methods of avoiding 100% CPU boost. Contexts are complicated,
> true, but this was necessary choice as GObject does not have a true GC.

I actually am not really sure what the contexts are. The only documentation I could find is http://valadoc.org/#!api=gee-0.10/Gee.HazardPointer.Context
Comment 12 Maciej (Matthew) Piechotka 2014-10-21 03:34:29 UTC
(In reply to comment #11)
> Your example does not cause the problem for me either, though I should note
> that I get the following warning about a bajillion times when I run it:
> 
> ** (process:28136): CRITICAL **: gee_hazard_pointer_try_free: assertion
> 'to_free != NULL' failed
> 

Could you post a stacktrace (to a different bug to track it separatly?) - I could not reproduce the issue (yay multithreading).

> Here is a small reproducer, based on yours, but using GApplication. Switching
> from ConcurrentList to ArrayList reduces my CPU usage from 100% of one CPU to
> 6% of one CPU:
> 
> // valac test.vala --pkg gio-2.0 --pkg gee-0.8
> public class Test : Application {
> 
>     public Test() {
>         Object (application_id: "org.gnome.bug737507");
>     }
> 
>     protected override void activate() {
>         var list = new Gee.ConcurrentList<int>();
>         list.add(42);
>         Idle.add(() => {
>             list.remove_at (0);
>             return false;   // Source.CONTINUE
>         });
> 
>         this.hold();
>     }
> }
> 
>  int main() {
>      return new Test().run();
>  }
> 

I couldn't still reproduce the issue.

> > Would it be possible to get a minimal repro example? Also what's profiling is
> > indicating is causing the 100% CPU time?
> 
> I've just been opening System Monitor.
> 

Sorry - I meant could you run application under gprof or something similar which should indicate a method causing it.

> > Ok. One way or another I'm interested in fixing it as both should be
> > improvements not methods of avoiding 100% CPU boost. Contexts are complicated,
> > true, but this was necessary choice as GObject does not have a true GC.
> 
> I actually am not really sure what the contexts are. The only documentation I
> could find is http://valadoc.org/#!api=gee-0.10/Gee.HazardPointer.Context

Possibly more in-depth explanation in non-reference format: http://blog.piechotka.com.pl/2014/03/01/lock-free-collection-in-libgee-hazard-pointer/
Comment 13 Michael Catanzaro 2014-10-22 01:03:10 UTC
(In reply to comment #12) 
> Could you post a stacktrace (to a different bug to track it separatly?) - I
> could not reproduce the issue (yay multithreading).

I can't reproduce it today... weird.

> Sorry - I meant could you run application under gprof or something similar
> which should indicate a method causing it.

I think I'm doing something wrong. Here are the commands I ran:

valac -C test.vala --pkg gio-2.0 --pkg gee-0.8 --pkg posix
cc -pg `pkg-config --cflags --libs gio-2.0 gee-0.8` test.c
./a.out
gprof ./a.out gmon.out

Here's a modified test.vala that terminates after five seconds, so that gmon.out is produced:

public class Test : Application {

    public Test() {
        Object (application_id: "org.gnome.bug737507");
    }

    protected override void activate() {
        var list = new Gee.ConcurrentList<int>();
        list.add(42);
        Idle.add(() => {
            list.remove_at (0);
            return false;   // Source.CONTINUE
        });

	Timeout.add_seconds(5, () => {
            Posix.exit(0);
            assert_not_reached ();
        });

        this.hold();
    }
}

 int main() {
     return new Test().run();
 }

My results are attached, but they are surely not right, since it says no time elapsed....

> Possibly more in-depth explanation in non-reference format:
> http://blog.piechotka.com.pl/2014/03/01/lock-free-collection-in-libgee-hazard-pointer/

Thanks.
Comment 14 Michael Catanzaro 2014-10-22 01:03:39 UTC
Created attachment 289098 [details]
gprof report
Comment 15 Maciej (Matthew) Piechotka 2014-10-22 03:32:54 UTC
(In reply to comment #14)
> Created an attachment (id=289098) [details]
> gprof report

Has libgee been compiled with profiling support? I don't see any results for it.
Comment 16 Michael Catanzaro 2014-10-22 15:14:34 UTC
No, it wasn't, good catch, but that doesn't seem to help either. I cleaned my libgee checkout with 'jhbuild cleanone libgee', added CFLAGS='-pg' and LDFLAGS='-pg' in my jhbuildrc, then ran 'jhbuild buildone -naf libgee' to reconfigure and rebuild, but the result was identical....
Comment 17 Maciej (Matthew) Piechotka 2014-10-27 08:38:15 UTC
(In reply to comment #16)
> No, it wasn't, good catch, but that doesn't seem to help either. I cleaned my
> libgee checkout with 'jhbuild cleanone libgee', added CFLAGS='-pg' and
> LDFLAGS='-pg' in my jhbuildrc, then ran 'jhbuild buildone -naf libgee' to
> reconfigure and rebuild, but the result was identical....

Are you sure you're loading correct version (you can check by LD_DEBUG=libs)? Otherwise it would mean I need to solve a bug which I cannot reproduce and causes 100% CPU utilization by code which is not even called ;)

I'll try to reproduce the original problem as qqwing was just packaged for Gentoo but if I fail I'll need to mark it as NEEDINFO.
Comment 18 Maciej (Matthew) Piechotka 2014-10-28 06:46:41 UTC
I failed to reproduced by following steps:

 1. Checkout commit fa4a06 (i.e. the one before '100% CPU' issue)
 2. Compile
 3. Open ./src/gnome-sudoku
 4. Open menu and print -> multiple sudokus
 5. Generate 100 easy sudokus
 6. Save to file

I get short spike of 100% CPU on one core just before printing but I presume it's just the generation itself.
Comment 19 Michael Catanzaro 2014-10-28 22:38:27 UTC
Hm, that should have been sufficient to trigger the bug... simply starting a game with that commit should have triggered the bug. For me, the CPU usage remains elevated until the application is closed.

OK, so today I checked out that commit and couldn't reproduce the problem either. Then I checked out 3.13.92 (the version where the bug was introduced because I added a use of ConcurrentList) and also could not reproduce the issue.  My first thought was "maybe that gee commit fixed things after all, and I somehow tested it incorrectly," but reverting that commit doesn't help either.  Something must have changed elsewhere in the stack.

So... not sure what changed, but the problem seems to be fixed... somehow. Thanks for working with me on it.