GNOME Bugzilla – Bug 552329
Crash on at-spi while calling to desktop_get_child_at_index
Last modified: 2011-06-02 18:25:27 UTC
Please describe the problem:
In certain environments, if you close an application and immediately try to list all the active applications (which can be common in an automated testing script), desktop_get_child_at_index can cause a crash. To make it easy to reproduce, I will attach a dogtail test script. Note that I was only able to reproduce this bug on an N810 device, while running automated tests. Setting up that environment requires several packages that are not installed by default. A general installation description: http://hildon-test-aut.garage.maemo.org/installation.html (Note: this is not an easy configuration.)

Steps to reproduce:
1. Configure a testing environment for an N810 device.
2. Run the script crash-atspi.py: "$python crash-atspi.py"
3. Start a gtk-gnome application
4. Close the application

Actual results: A crash in at-spi
Expected results: at-spi keeps running properly

Does this happen every time? It crashes with most GTK applications, but I found some exceptions, such as the Mah-Jong game.

Other information: I will attach a testing script and a provisional patch.
Created attachment 118730 [details]
A testing dogtail script

This script basically asks at-spi for all the applications on desktop 0; that is, it continuously calls desktop.getChildAtIndex.
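For readers without access to the attachment, here is a minimal sketch of what such a polling loop might look like. It is not the attached script: FakeDesktop is a hypothetical stand-in so the example runs without an AT-SPI registry, and the childCount attribute name is an assumption (getChildAtIndex is the operation named in this report).

```python
class FakeDesktop:
    """Hypothetical stand-in for the AT-SPI desktop object (not a real API)."""

    def __init__(self, names):
        self._names = list(names)

    @property
    def childCount(self):
        # Assumed attribute name for the number of applications on the desktop.
        return len(self._names)

    def getChildAtIndex(self, index):
        return self._names[index]


def list_applications(desktop):
    """Ask the desktop for every child, one getChildAtIndex call per index."""
    return [desktop.getChildAtIndex(i) for i in range(desktop.childCount)]


def poll_applications(desktop, rounds):
    """Repeat the listing `rounds` times, as a tight polling loop would."""
    snapshots = []
    for _ in range(rounds):
        snapshots.append(list_applications(desktop))
    return snapshots


desktop = FakeDesktop(["gedit", "gcalctool"])
snapshots = poll_applications(desktop, rounds=3)
print(snapshots[-1])
```

Against a real registry, the same loop would keep issuing getChildAtIndex requests while applications come and go, which is the situation this bug is about.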
Created attachment 118731 [details] [review]
Fixes the bug

This patch solves the problem in my specific environment. IMHO this is a reentrancy problem, although I could be wrong, as I don't have much experience with this kind of problem.

Basically, when you close an application, the atexit callback is called. Among other things, it calls remove_application in desktop.c. Meanwhile, the script is continuously calling get_child_at_index. Judging by the behaviour, both calls overlap, so at-spi tries to get the child while it is removing the child.

I solved that with a solution similar to one I found in other at-spi code, such as the registry: a queue. I add a safeguard in get_child_at_index; while it is set, remove_application postpones the removal and adds the application to a list. After getting the child, get_child_at_index checks whether any applications need to be removed. If the list has any elements, it flushes them.

Last note: I could upload a backtrace too, but it is very uninformative, as it only shows that the crash happens when getChildAtIndex is called. I can upload it if anyone is interested.
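The deferred-removal idea described above can be modelled roughly as follows. This is a Python sketch of the logic only, not the C patch itself: the class, the during_lookup hook (which simulates a removal arriving while a lookup is in progress) and the pending_removals name are hypothetical; the is_queueing_remove flag name is the one mentioned later in this thread.

```python
class Desktop:
    """Toy model of the desktop's application list with deferred removal."""

    def __init__(self):
        self.applications = []
        self.is_queueing_remove = False  # safeguard set while a lookup runs
        self.pending_removals = []       # removals postponed by the safeguard

    def add_application(self, app):
        self.applications.append(app)

    def remove_application(self, app):
        if self.is_queueing_remove:
            # A lookup is in progress: postpone instead of removing now.
            self.pending_removals.append(app)
            return
        self._remove_now(app)

    def _remove_now(self, app):
        if app in self.applications:
            self.applications.remove(app)

    def get_child_at_index(self, index, during_lookup=None):
        self.is_queueing_remove = True
        try:
            in_range = 0 <= index < len(self.applications)
            child = self.applications[index] if in_range else None
            if during_lookup is not None:
                during_lookup()  # hypothetical hook: reentrant call arrives here
        finally:
            self.is_queueing_remove = False
        self._flush_pending()
        return child

    def _flush_pending(self):
        # Swap the list out first so the loop always terminates.
        pending, self.pending_removals = self.pending_removals, []
        for app in pending:
            self._remove_now(app)


desktop = Desktop()
desktop.add_application("gedit")
desktop.add_application("gcalctool")
child = desktop.get_child_at_index(
    0, during_lookup=lambda: desktop.remove_application("gedit"))
print(child, desktop.applications)
```

In this model the lookup still returns the child it found, and the removal only takes effect once the lookup has finished.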
I am curious why the crash is in get_child_at_index. Could you build at-spi with debug info (CFLAGS="-g -O0") and paste the trace? Maybe just adding a check here and returning NULL would be enough.
Gdb backtrace of the crash:

(gdb) bt
#0 0x41050e74 in raise () from /lib/libc.so.6
#1 0x41052450 in abort () from /lib/libc.so.6
#2 0x40156a20 in IA__g_logv (log_domain=0x0, log_level=1091791928, format=0x400aadf4 "Attempted to marshal a bogus / dead object %p type", args1=0xbedbe17c) at gmessages.c:502
#3 0x40156a60 in IA__g_log (log_domain=0x0, log_level=1767, format=0x400aadf4 "Attempted to marshal a bogus / dead object %p type") at gmessages.c:522
#4 0x4008a6ec in ORBit_marshal_object (buf=0x3bfe0, obj=0x40b08) at corba-object.c:564
#5 0x4008fc48 in ORBit_marshal_value (buf=0x3bfe0, val=0xbedbe1ec, tc=0x40052604) at corba-any.c:152
#6 0x4008fda8 in ORBit_marshal_arg (buf=0x0, val=0xbedbe230, tc=0x400aadf4) at corba-any.c:364
#7 0x4008843c in ORBit_small_invoke_adaptor (adaptor_obj=0x38c30, recv_buffer=0x3f910, m_data=0x40058860, data=0xbedbe2e0, ev=0xbedbe378) at orbit-small.c:907
#8 0x40096e48 in ORBit_POAObject_handle_request (pobj=0x38c30, opname=0x42b3c "getChildAtIndex", ret=0x0, args=0x0, ctx=0x0, recv_buffer=0x3f910, v=0xbedbe378) at poa.c:1354
#9 0x40097394 in ORBit_POAObject_invoke_incoming_request (pobj=0x38c30, recv_buffer=0x3f910, opt_ev=0xbedbe378) at poa.c:1422
#10 0x40097624 in ORBit_POA_handle_request (poa=0x20e70, recv_buffer=0x3f910, objkey=0x38c30) at poa.c:1644
#11 0x4009bd44 in ORBit_handle_request (orb=0x20dd0, recv_buffer=0x3f910) at orbit-adaptor.c:296
#12 0x40084eb0 in giop_connection_handle_input (lcnx=0x0) at giop-recv-buffer.c:1282
#13 0x400a2c9c in link_connection_io_handler (gioc=0x0, condition=G_IO_IN, data=0x400aadf4) at linc-connection.c:1367
#14 0x400a4aa4 in link_source_dispatch (source=0x410a0, callback=0x400a2bb4 <link_connection_io_handler>, user_data=0x32d98) at linc-source.c:159
#15 0x4014defc in IA__g_main_context_dispatch (context=0x1fdb8) at gmain.c:2045
#16 0x4014fda8 in g_main_context_iterate (context=0x1fdb8, block=1, dispatch=1, self=0x40b08) at gmain.c:2677
#17 0x4015016c in IA__g_main_loop_run (loop=0x3bf30) at gmain.c:2881
#18 0x429ee77c in bonobo_main () from /usr/lib/libbonobo-2.so.0
#19 0x0000efb8 in main (argc=1, argv=0x6e7) at registry-main.c:83
As I said in comment 2, I have no problem pasting a backtrace, but as far as I can see it is really uninformative: it only points to where the crash appears and gives no further relevant information.

I also think it is really hard to fix this with a check in get_child_at_index alone. My first thought was to validate the index that impl_desktop_get_child_at_index receives, but the list still contains the item at that moment, so "in theory" the index is correct.

Looking at the interaction between the three programs (at-spi, the gtk program, and the AT program, i.e. the test), I think it happens as explained in comment 2. In detail:

1. _get_child_at_index is called (from the AT program)
2. _remove_application is called (from the gtk application, when it closes)
2.1 The item is freed in _remove_application
3. _get_child_at_index tries to interact with the deleted item
3.1 Crash

Anyway, as I said previously, I have little experience with reentrancy problems, so I could be wrong.
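The sequence above can be replayed deterministically in a small model. This is only an illustration of the suspected ordering, not at-spi code: the Application class, the alive flag and the marshal function are hypothetical stand-ins, with marshal playing the role of the ORBit marshalling step that rejects a dead object in the backtrace.

```python
class Application:
    """Hypothetical stand-in for an application object held by the desktop."""

    def __init__(self, name):
        self.name = name
        self.alive = True


class DeadObjectError(Exception):
    """Raised by the model where the real process would abort."""


def marshal(app):
    # Stand-in for marshalling the reply: a dead object cannot be sent.
    if not app.alive:
        raise DeadObjectError("attempted to marshal a dead object")
    return app.name


def run_sequence():
    applications = [Application("gedit")]

    # 1. get_child_at_index: the entry is still in the list, so the index is valid.
    child = applications[0]

    # 2. remove_application runs before the reply has been marshalled.
    removed = applications.pop(0)
    removed.alive = False  # 2.1 the item is freed

    # 3. get_child_at_index now tries to return the deleted item.
    try:
        return marshal(child)
    except DeadObjectError as exc:
        return "crash: " + str(exc)  # 3.1


result = run_sequence()
print(result)
```

The model shows why validating the index is not enough: the index is valid at step 1, and the object only becomes invalid afterwards.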
Setting a breakpoint at spi_desktop_remove_application, to see what get_child_at_index is doing at that point, may be useful. We can probably add a check in impl_desktop_get_child_at_index to avoid returning a dead object. Sorry to ask for the trace again; I cannot reproduce the bug.
(In reply to comment #6)
> Set breakpoint at spi_desktop_remove_application, to see what is
> get_child_at_index doing may be useful. Probably we can add a check in
> impl_desktop_get_child_at_index to avoid return a dead object.

Here you are: new backtraces from reproducing the bug again today, with a breakpoint at spi_desktop_remove_application.

(gdb) bt
+ Trace 210177
> Sorry to ask the trace again because I cannot reproduce the bug.

No problem, the environment is really specific. I tried to reproduce it myself on the desktop, but I was not able to. I filed the bug because this could be a problem everywhere, although right now it only happens in the specific environment I described in the report.
Created attachment 126247 [details]
log from valgrind

When launching and exiting the app more quickly, at-spi crashed with: do_unref: assertion failed: (robj->refs < ORBIT_REFCOUNT_MAX && robj->refs > 0)
The patch above does not fix this.
I have put a workaround there: a flag to avoid re-entering impl_registry_notify_event.
Created attachment 149590 [details] [review]
Update previous patch

The previous patch had a silly but important error: if the queue has two or more elements, at-spi enters an infinite loop, so a11y support becomes inoperative. This update also adds a g_list_free (avoiding a memory leak).
Just a quick comment to complement the original description, which said that this bug was only detected on an N810 device. Update: it has also been detected on a Nokia N900.
Thanks for the patch. There could be a problem if an AT only calls get_child_at_index in response to signals like APPLICATION_REMOVED. Since the removal is postponed until the next get_child_at_index call, the AT might never learn that the application has been removed.
(In reply to comment #13)
> Thanks for the patch. There could be a problem if an AT only calls
> get_child_at_index when there are signals like APPLICATION_REMOVED. The AT
> could never know the application has been removed.

Yes, you are right, I didn't take that into account. This patch only works if get_child_at_index is called continuously, so that the queue gets flushed. It works in the environment where I detected the bug because that is the case there. So get_child_at_index is in fact a bad trigger for flushing the applications.

Another option would be to add an idle handler when the queue is created, check the is_queueing_remove safeguard there, and call the flush. In that case, removing the application would just be postponed to an idle state, instead of being postponed to the next get_child_at_index call. If you think this would be a good option, I could work on a new patch.
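A rough model of the idle-based alternative, assuming a main loop whose idle callbacks are re-queued when they return True (the convention GLib idle sources follow). The MainLoop and Desktop classes and their method names are hypothetical, and the sketch does not show whether the idle handler itself is safe from reentrancy.

```python
from collections import deque


class MainLoop:
    """Toy main loop that only knows about idle callbacks."""

    def __init__(self):
        self._idle = deque()

    def idle_add(self, callback):
        self._idle.append(callback)

    def iterate(self):
        """Run each currently queued idle callback once."""
        for _ in range(len(self._idle)):
            callback = self._idle.popleft()
            if callback():  # True means "call me again later"
                self._idle.append(callback)


class Desktop:
    def __init__(self, loop):
        self.loop = loop
        self.applications = []
        self.pending_removals = []
        self.is_queueing_remove = False

    def remove_application(self, app):
        if not self.is_queueing_remove:
            self._remove_now(app)
            return
        if not self.pending_removals:
            # The queue is being created: schedule one idle flush for it.
            self.loop.idle_add(self._flush_in_idle)
        self.pending_removals.append(app)

    def _remove_now(self, app):
        if app in self.applications:
            self.applications.remove(app)

    def _flush_in_idle(self):
        if self.is_queueing_remove:
            return True  # a lookup is still running; retry on a later idle
        pending, self.pending_removals = self.pending_removals, []
        for app in pending:
            self._remove_now(app)
        return False  # done; do not reschedule


loop = MainLoop()
desktop = Desktop(loop)
desktop.applications = ["gedit", "gcalctool"]
desktop.is_queueing_remove = True     # a lookup is in progress
desktop.remove_application("gedit")   # deferred to the idle handler
desktop.is_queueing_remove = False    # the lookup has finished
before = list(desktop.applications)
loop.iterate()                        # no further get_child_at_index needed
after = list(desktop.applications)
print(before, after)
```

Here the removal completes on the next idle iteration even if the AT never calls get_child_at_index again.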
We could hit the same crash if we do it in an idle function. The idle function could also be interrupted by the continuous get_child_at_index calls. But I am not very sure.
These days most people are not using at-spi with the N800 or N900 at all, and two years have passed since then, so I don't think it is worth keeping this bug open. I will close it as WONTFIX. We can reopen it in the future if it is detected on the desktop, although honestly the desktop should be using at-spi2 these days.