Bug 552329 – Crash on at-spi while calling to desktop_get_child_at_index



Summary:	Crash on at-spi while calling to desktop_get_child_at_index


Status:	RESOLVED WONTFIX

Product:	at-spi
Classification:	Platform
Component:	registry
Version:	unspecified
Hardware:	Other All

Importance:	Normal critical
Target Milestone:	---
Assigned To:	Li Yuan
QA Contact:	Li Yuan

URL:
Whiteboard:

Depends on:
Blocks:

Reported:	2008-09-15 10:09 UTC by Alejandro Piñeiro Iglesias (IRC: infapi00)
Modified:	2011-06-02 18:25 UTC

See Also:
GNOME target:	---
GNOME version:	---

Attachments
A testing dogtail script (266 bytes, text/x-python) 2008-09-15 10:12 UTC, Alejandro Piñeiro Iglesias (IRC: infapi00)
Fixes the bug (1.77 KB, patch) 2008-09-15 10:19 UTC, Alejandro Piñeiro Iglesias (IRC: infapi00)
log from valgrind (225.92 KB, text/plain) 2009-01-12 03:42 UTC, lavi
Update previous patch (2.35 KB, patch) 2009-12-11 15:09 UTC, Alejandro Piñeiro Iglesias (IRC: infapi00)

Description Alejandro Piñeiro Iglesias (IRC: infapi00) 2008-09-15 10:09:09 UTC

Please describe the problem:
In some specific environments, if you close an application and immediately try to get all the active applications (which could be common in an automated testing script), desktop_get_child_at_index can cause a crash.

To make it easy to reproduce, I will attach a dogtail test script.

Note that I was only able to reproduce this bug on an N810 device, while running tests for automated testing. Configuring it requires several packages that are not included by default. A general installation description:
http://hildon-test-aut.garage.maemo.org/installation.html

(Note: in any case, this is not an easy configuration.)

Steps to reproduce:
1. Configure a testing environment for an N810 device.
2. Run the script crash-atspi.py: "$ python crash-atspi.py"
3. Start a gtk-gnome application
4. Close the application


Actual results:
A crash on at-spi

Expected results:
at-spi running properly

Does this happen every time?
It crashes with most gtk applications, but I found some exceptions, like the Mah-Jong game.

Other information:
I will attach a testing script

I will attach a provisional patch

Comment 1 Alejandro Piñeiro Iglesias (IRC: infapi00) 2008-09-15 10:12:00 UTC

Created attachment 118730 [details]
A testing dogtail script

This script basically asks at-spi for all the applications on desktop 0; that is, it continuously calls desktop.getChildAtIndex.

Comment 2 Alejandro Piñeiro Iglesias (IRC: infapi00) 2008-09-15 10:19:15 UTC

Created attachment 118731 [details] [review]
Fixes the bug

This patch solves the problem in my specific environment.

IMHO this is a reentrancy problem, although I could be wrong, as I don't have much experience with this kind of problem.

Basically, when you close an application, the atexit callback is called. Among other things, it calls remove_application in desktop.c.

Meanwhile, the script is continuously calling get_child_at_index.

Looking at the behaviour, it seems that both are in progress at once, so at-spi tries to get the child while it is removing that child.

I solved that with an approach similar to one I found in other at-spi code, such as the registry: a queue. I added a safeguard in get_child_at_index so that, if it is set, remove_application postpones the removal and adds the application to a list. After getting the child, get_child_at_index checks whether any application needs to be removed; if the list has any elements, it flushes them.
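The queueing idea described above can be illustrated with a small Python toy model. This is only a sketch under stated assumptions, not the attached patch: the real code is C in desktop.c, and everything here except the names remove_application, get_child_at_index and is_queueing_remove (the class, the pending list, and the `during_lookup` hook that stands in for a re-entrant deregistration) is hypothetical.

```python
class Desktop:
    """Toy model of the desktop's application list (not the real at-spi code)."""

    def __init__(self, applications):
        self.applications = list(applications)
        self.is_queueing_remove = False  # safeguard: set while a lookup is in progress
        self.pending_removals = []       # applications whose removal was postponed

    def remove_application(self, app):
        if self.is_queueing_remove:
            # A lookup is in progress: postpone instead of touching the list.
            self.pending_removals.append(app)
            return
        if app in self.applications:
            self.applications.remove(app)

    def get_child_at_index(self, index, during_lookup=None):
        self.is_queueing_remove = True
        try:
            in_range = 0 <= index < len(self.applications)
            child = self.applications[index] if in_range else None
            if during_lookup is not None:
                # Hypothetical hook simulating a deregistration that arrives
                # while the lookup is still being served.
                during_lookup(self)
        finally:
            self.is_queueing_remove = False
        self._flush_pending_removals()
        return child

    def _flush_pending_removals(self):
        # Detach the whole queue before iterating so the loop always terminates.
        pending, self.pending_removals = self.pending_removals, []
        for app in pending:
            self.remove_application(app)


desktop = Desktop(["editor", "terminal"])
child = desktop.get_child_at_index(
    0, during_lookup=lambda d: d.remove_application("editor"))
print(child, desktop.applications)  # prints: editor ['terminal']
```

The lookup still returns the child it found, and the removal only takes effect once the lookup has finished.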

Last note: I could upload a backtrace too, but it is very uninformative, as it only indicates that the crash happens when you call getChildAtIndex. I can upload it if anyone is interested.

Comment 3 Li Yuan 2008-11-14 01:51:11 UTC

I am curious why the crash is in get_child_at_index. Could you build at-spi with debug info (with CFLAGS="-g -O0") and paste the trace? Maybe just adding a check here and returning NULL would be OK.

Comment 4 Alejandro Piñeiro Iglesias (IRC: infapi00) 2008-11-14 09:27:38 UTC

Gdb backtrace of the crash:

    (gdb) bt
      #0  0x41050e74 in raise () from /lib/libc.so.6
      #1  0x41052450 in abort () from /lib/libc.so.6
      #2  0x40156a20 in IA__g_logv (log_domain=0x0, log_level=1091791928, format=0x400aadf4 "Attempted to marshal a bogus / dead object %p type", args1=0xbedbe17c) at gmessages.c:502
      #3  0x40156a60 in IA__g_log (log_domain=0x0, log_level=1767, format=0x400aadf4 "Attempted to marshal a bogus / dead object %p type") at gmessages.c:522
      #4  0x4008a6ec in ORBit_marshal_object (buf=0x3bfe0, obj=0x40b08) at corba-object.c:564
      #5  0x4008fc48 in ORBit_marshal_value (buf=0x3bfe0, val=0xbedbe1ec, tc=0x40052604) at corba-any.c:152
      #6  0x4008fda8 in ORBit_marshal_arg (buf=0x0, val=0xbedbe230, tc=0x400aadf4) at corba-any.c:364
      #7  0x4008843c in ORBit_small_invoke_adaptor (adaptor_obj=0x38c30, recv_buffer=0x3f910, m_data=0x40058860, data=0xbedbe2e0, ev=0xbedbe378) at orbit-small.c:907
      #8  0x40096e48 in ORBit_POAObject_handle_request (pobj=0x38c30, opname=0x42b3c "getChildAtIndex", ret=0x0, args=0x0, ctx=0x0, recv_buffer=0x3f910, v=0xbedbe378) at poa.c:1354
      #9  0x40097394 in ORBit_POAObject_invoke_incoming_request (pobj=0x38c30, recv_buffer=0x3f910, opt_ev=0xbedbe378) at poa.c:1422
      #10 0x40097624 in ORBit_POA_handle_request (poa=0x20e70, recv_buffer=0x3f910, objkey=0x38c30) at poa.c:1644
      #11 0x4009bd44 in ORBit_handle_request (orb=0x20dd0, recv_buffer=0x3f910) at orbit-adaptor.c:296
      #12 0x40084eb0 in giop_connection_handle_input (lcnx=0x0) at giop-recv-buffer.c:1282
      #13 0x400a2c9c in link_connection_io_handler (gioc=0x0, condition=G_IO_IN, data=0x400aadf4) at linc-connection.c:1367
      #14 0x400a4aa4 in link_source_dispatch (source=0x410a0, callback=0x400a2bb4 <link_connection_io_handler>, user_data=0x32d98) at linc-source.c:159
      #15 0x4014defc in IA__g_main_context_dispatch (context=0x1fdb8) at gmain.c:2045
      #16 0x4014fda8 in g_main_context_iterate (context=0x1fdb8, block=1, dispatch=1, self=0x40b08) at gmain.c:2677
      #17 0x4015016c in IA__g_main_loop_run (loop=0x3bf30) at gmain.c:2881
      #18 0x429ee77c in bonobo_main () from /usr/lib/libbonobo-2.so.0
      #19 0x0000efb8 in main (argc=1, argv=0x6e7) at registry-main.c:83

Comment 5 Alejandro Piñeiro Iglesias (IRC: infapi00) 2008-11-14 09:44:17 UTC

As I said in comment 2, I have no problem pasting a backtrace, but as far as I can see it is really uninformative: it only points to where the crash appears and gives no further relevant information.

I also think it is really hard to add a check only in get_child_at_index. My first thought was to check whether the index that impl_desktop_get_child_at_index receives is valid, but the list still contains the item at that moment, so "in theory" the index is correct.

Looking at how the three programs interact (at-spi, the gtk-program, and the at-program, i.e. the test), I think the interaction is as explained in comment 2, but in more detail:
  1. call to _get_child_at_index (from the at-program)
  2. call to _remove_application (from the gtk-application, when it closes)
    2.1 The item is freed in _remove_application
  3. _get_child_at_index tries to interact with the deleted item
    3.1 Crash
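The sequence above can be modelled with a short Python sketch. It only illustrates the suspected interleaving (which is a hypothesis in this report, not a confirmed diagnosis) and is not at-spi code: the `App` class, the `alive` flag standing in for freed memory, the `marshal` helper standing in for ORBit_marshal_object, and the `reentrant_call` hook are all assumptions made for the illustration.

```python
class App:
    def __init__(self, name):
        self.name = name
        self.alive = True  # stands in for "the memory is still valid"


def marshal(obj):
    # Stand-in for ORBit_marshal_object, which aborts on a dead object.
    if not obj.alive:
        raise RuntimeError("Attempted to marshal a bogus / dead object")
    return obj.name


def remove_application(apps, app):
    apps.remove(app)
    app.alive = False  # step 2.1: the item is freed


def get_child_at_index(apps, index, reentrant_call=None):
    child = apps[index]      # step 1: the lookup finds the item, index is valid
    if reentrant_call is not None:
        reentrant_call()     # step 2: the removal is handled before the reply
    return marshal(child)    # step 3: the reply is built from the freed item


apps = [App("gtk-app")]
victim = apps[0]
try:
    get_child_at_index(apps, 0, lambda: remove_application(apps, victim))
    crashed = False
except RuntimeError:
    crashed = True           # step 3.1: the toy equivalent of the crash
print(crashed)
```

Note that an index check at the start of the lookup would not help in this model either, because the item is still in the list when the index is validated.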

Anyway, as I said previously, I have little experience with reentrancy problems, so I could be wrong.

Comment 6 Li Yuan 2008-11-21 03:52:14 UTC

Setting a breakpoint at spi_desktop_remove_application to see what get_child_at_index is doing may be useful. We can probably add a check in impl_desktop_get_child_at_index to avoid returning a dead object.

Sorry to ask for the trace again; I cannot reproduce the bug.

Comment 7 Alejandro Piñeiro Iglesias (IRC: infapi00) 2008-11-24 10:39:57 UTC

(In reply to comment #6)
> Set breakpoint at spi_desktop_remove_application, to see what is
> get_child_at_index doing may be useful. Probably we can add a check in
> impl_desktop_get_child_at_index to avoid return a dead object.

Here you are: new backtraces from reproducing the bug again today.

Breakpoint at spi_desktop_remove_application:

(gdb) bt

Trace 210177

    #0 spi_desktop_remove_application at desktop.c line 324
    #1 _ORBIT_skel_small_Accessibility_Registry_deregisterApplication at Accessibility-common.c line 960
    #2 ORBit_POAObject_invoke at poa.c line 1145
    #3 ORBit_OAObject_invoke at orbit-adaptor.c line 336
    #4 ORBit_small_invoke_adaptor at orbit-small.c line 835
    #5 ORBit_POAObject_handle_request at poa.c line 1354
    #6 ORBit_POAObject_invoke_incoming_request at poa.c line 1422
    #7 ORBit_POA_handle_request at poa.c line 1644
    #8 ORBit_handle_request at orbit-adaptor.c line 296
    #9 giop_connection_handle_input at giop-recv-buffer.c line 1282
    #10 link_connection_io_handler at linc-connection.c line 1367
    #11 link_source_dispatch at linc-source.c line 159
    #12 g_main_context_dispatch from /usr/lib/libglib-2.0.so.0
    #13 ?? from /usr/lib/libglib-2.0.so.0

    #0 raise from /lib/libc.so.6
    #1 abort from /lib/libc.so.6
    #2 g_logv from /usr/lib/libglib-2.0.so.0
    #3 g_log from /usr/lib/libglib-2.0.so.0
    #4 ORBit_marshal_object at corba-object.c line 564
    #5 ORBit_marshal_value at corba-any.c line 152
    #6 ORBit_marshal_arg at corba-any.c line 364
    #7 ORBit_small_invoke_adaptor at orbit-small.c line 907
    #8 ORBit_POAObject_handle_request at poa.c line 1354
    #9 ORBit_POAObject_invoke_incoming_request at poa.c line 1422
    #10 ORBit_POA_handle_request at poa.c line 1644
    #11 ORBit_handle_request at orbit-adaptor.c line 296
    #12 giop_connection_handle_input at giop-recv-buffer.c line 1282
    #13 link_connection_io_handler at linc-connection.c line 1367
    #14 link_source_dispatch at linc-source.c line 159
    #15 g_main_context_dispatch from /usr/lib/libglib-2.0.so.0
    #16 ?? from /usr/lib/libglib-2.0.so.0



> Sorry to ask the trace again because I cannot reproduce the bug.

No problem, the environment is really specific. I tried to reproduce it myself on the desktop but was not able to. I filed the bug because this could be a problem everywhere, although right now it only happens in the specific environment I pointed out in the description.

Comment 8 lavi 2009-01-12 03:42:58 UTC

Created attachment 126247 [details]
log from valgrind

When launching and exiting apps more quickly, at-spi crashed with:
do_unref: assertion failed: (robj->refs < ORBIT_REFCOUNT_MAX && robj->refs > 0)

Comment 9 lavi 2009-01-12 03:43:57 UTC

The patch above does not fix this.

Comment 10 lavi 2009-01-12 09:15:54 UTC

I have put a workaround there: a flag to avoid re-entering impl_registry_notify_event.

Comment 11 Alejandro Piñeiro Iglesias (IRC: infapi00) 2009-12-11 15:09:00 UTC

Created attachment 149590 [details] [review]
Update previous patch

The previous patch had a silly but important error: if the queue has two or more elements, at-spi enters an infinite loop, so the a11y support becomes inoperative.

This update also adds a g_list_free (to avoid memory leaks).

Comment 12 Alejandro Piñeiro Iglesias (IRC: infapi00) 2009-12-11 15:13:10 UTC

Just a quick comment to complement the original description. The original description said that this bug was only detected using an N810 device.

Update: it has also been detected on a Nokia N900.

Comment 13 Li Yuan 2009-12-14 02:13:08 UTC

Thanks for the patch. There could be a problem if an AT only calls get_child_at_index when there are signals like APPLICATION_REMOVED. The AT could never know the application has been removed.

Comment 14 Alejandro Piñeiro Iglesias (IRC: infapi00) 2009-12-14 11:45:54 UTC

(In reply to comment #13)
> Thanks for the patch. There could be a problem if an AT only calls
> get_child_at_index when there are signals like APPLICATION_REMOVED. The AT
> could never know the application has been removed.

Yes, you are right, I didn't take that into account. This patch only works if get_child_at_index is called continuously, so that the queue can be flushed. It works in the environment where I detected the bug because that is the case there.

So get_child_at_index is in fact a bad trigger to flush the applications.

Another option could be to add an idle callback when the queue is created, check the is_queueing_remove safeguard there, and call the flush.

In this case, removing the application would just be postponed until idle, instead of being postponed until the next get_child_at_index call.

If you think that this would be a good option, I could work on a new patch.
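As a rough illustration of the idle-based option, here is a Python toy model. It is a sketch under assumptions, not a proposed patch: the real implementation would be C using a GLib idle source, while here a plain deque stands in for the main loop's idle queue, and the class and method names other than remove_application and is_queueing_remove are hypothetical. The callback returns True to stay scheduled and False to be dropped, mirroring the GLib idle callback convention.

```python
from collections import deque


class DesktopWithIdleFlush:
    """Toy model: postponed removals are flushed from an idle callback."""

    def __init__(self, applications):
        self.applications = list(applications)
        self.is_queueing_remove = False  # safeguard set during a lookup
        self.pending = []                # postponed removals
        self.idle_callbacks = deque()    # stand-in for the main loop's idle sources

    def remove_application(self, app):
        if self.is_queueing_remove:
            if not self.pending:
                # The queue is being created: schedule one idle flush.
                self.idle_callbacks.append(self.flush_when_idle)
            self.pending.append(app)
            return
        if app in self.applications:
            self.applications.remove(app)

    def flush_when_idle(self):
        if self.is_queueing_remove:
            return True   # a lookup is still running: try again later
        pending, self.pending = self.pending, []
        for app in pending:
            if app in self.applications:
                self.applications.remove(app)
        return False      # done: drop this idle callback

    def run_idle(self):
        # Run each currently scheduled idle callback once.
        for _ in range(len(self.idle_callbacks)):
            callback = self.idle_callbacks.popleft()
            if callback():
                self.idle_callbacks.append(callback)


desktop = DesktopWithIdleFlush(["editor", "terminal"])
desktop.is_queueing_remove = True      # a lookup is in progress
desktop.remove_application("editor")   # postponed, idle flush scheduled
desktop.is_queueing_remove = False     # the lookup has finished
desktop.run_idle()                     # the removal happens here
print(desktop.applications)            # prints: ['terminal']
```

In this model the removal no longer depends on another get_child_at_index call arriving; it happens the next time the idle queue runs with the safeguard cleared.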

Comment 15 Li Yuan 2009-12-15 03:47:22 UTC

We could hit the same crash if we do it in an idle function. The idle function can also be interrupted by the continuous get_child_at_index calls. But I am not very sure.

Comment 16 Alejandro Piñeiro Iglesias (IRC: infapi00) 2011-06-02 18:25:27 UTC

These days most people are not using at-spi with the n800 or n900 at all, and two years have passed since then.

I don't think it is worth keeping this bug open. I will close it as WONTFIX. We can reopen it in the future if it is detected on the desktop, although honestly, the desktop should be using at-spi2 these days.