GNOME Bugzilla – Bug 415061
regression test results should be repeatable
Last modified: 2008-07-22 19:32:18 UTC
The problem is that multiple runs of test/harness/runall.sh yield different results even when there are no changes to orca. This makes automated regression testing impossible. The goal is for runall.sh to yield identical results when nothing changes in orca or the test environment (other than the date and time).
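To illustrate what "identical results" means here, a hypothetical check might diff two runs after masking out dates and times. This is only a sketch: the file names, the timestamp pattern, and the comparison helper are all assumptions, not part of the actual harness.

```python
import difflib
import re

# Hypothetical timestamp pattern; real .orca output may format times differently.
TIMESTAMP = re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}")

def normalize(lines):
    """Blank out dates/times so they don't count as differences."""
    return [TIMESTAMP.sub("<TIME>", line) for line in lines]

def compare_runs(run_a, run_b):
    """Return the unified diff between two normalized test runs."""
    return list(difflib.unified_diff(normalize(run_a), normalize(run_b),
                                     fromfile="run1.orca", tofile="run2.orca"))

# Two runs that differ only in their timestamps compare as identical.
run1 = ["SPEECH OUTPUT: 'gedit frame'\n", "run at 2007-03-26 09:00:00\n"]
run2 = ["SPEECH OUTPUT: 'gedit frame'\n", "run at 2007-03-27 10:30:00\n"]
assert compare_runs(run1, run2) == []
```

Under this definition, a non-empty diff between two unchanged runs is exactly the repeatability failure this bug describes.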
Created attachment 84002 [details] [review] snapshot of test harness changes (not to be checked in) Record of current work. Not to be checked in.
Created attachment 84131 [details] columnar diff(1) output from two executions of runall.sh
Created attachment 84132 [details] side-by-side diff(1) output from two executions of runall.sh This is the output from the same two runall.sh execution as the previous attachment. It's just presented in a different format.
Created attachment 84133 [details] [review] made delay between arrow navigation keystrokes configurable
Created attachment 84500 [details] [review] Fixed a merge problem that caused .orca files to not be generated.
Created attachment 85422 [details] [review] See comments

2007-03-26 Lynn Monsanto <lynn.monsanto@sun.com>
* test/harness/runone.sh, test/harness/runall.sh, src/tools/play_keystrokes.py:
  bug #405061 - Adjusted keystroke playback timing for navigation keys.

There are still minor diffs between non-OpenOffice runs of runall.sh. This is probably okay for regression testing, but it does require someone to manually check the diffs every morning to verify there are no significant changes between runs. Note that there are still significant diffs between OpenOffice Writer and Calc runs. These are due to real bugs that need to be fixed.

To Do: I need to modify runone.sh so that the user doesn't need to specify whether code-coverage testing is done (1) or not done (0). Right now, you need to add a 0 or 1 at the end of the runone.sh command arguments. The runall.sh script always specifies a 0 or 1.
Created attachment 85423 [details] [review] See comments

2007-03-27 Lynn Monsanto <lynn.monsanto@sun.com>
* src/tools/play_keystrokes.py:
  bug #405061 - Modified keystroke playback to pause before each key press. This takes keystroke modifier keys and chords into account.
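The "pause before each key press" idea from the patch above can be sketched roughly as follows. This is not the actual play_keystrokes.py code: the function names, the event representation, and the chord-detection rule (pause only before the first press of a chord) are all illustrative assumptions.

```python
import time

# Illustrative delay; the actual delay is configurable in the harness.
KEY_PRESS_DELAY = 0.5  # seconds

def play_keystrokes(events, send_event, delay=KEY_PRESS_DELAY):
    """Replay recorded key events, pausing before each key press.

    Modifier keys and chords are accounted for by pausing only before
    the first press in a chord (e.g. before Control_L in Ctrl+Q), not
    before every press or before releases.
    """
    chord_open = False
    for kind, key in events:            # kind is "press" or "release"
        if kind == "press":
            if not chord_open:
                time.sleep(delay)       # let the application settle first
            chord_open = True
        else:
            chord_open = False
        send_event(kind, key)

# Example: replay Ctrl+Q with no delay, recording what gets sent.
sent = []
play_keystrokes([("press", "Control_L"), ("press", "q"),
                 ("release", "q"), ("release", "Control_L")],
                lambda kind, key: sent.append((kind, key)), delay=0)
```

The point of the pause is to give the application under test time to finish reacting to the previous keystroke, which removes one source of timing-dependent differences between runs.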
I'm going to make another push to (hopefully) reduce the number of differences between regression test runs to zero.
Will,

1. I still cannot get zero differences between runs. I've tried everything I can think of and have not been able to succeed in removing all differences. I suggest we reevaluate whether removing all differences is a realistic goal, and whether meaningful regression testing requires that there be zero differences between runs.

2. There are common differences that can be ignored. The best thing may be to run a sed script that filters out these differences:

===== Common difference #1, which is caused by the test application exiting before Orca quits. The warning messages occur because Orca is attempting to process an event for an AT-SPI object that no longer exists.

Traceback (most recent call last):
  self.accessible = acc._narrow(Accessibility.Accessible)
COMM_FAILURE

  server = SpeechServer.__createServer(s.iid)
  driver = SpeechServer.__activateDriver(iid)
  isInitialized = driver.isInitialized()
COMM_FAILURE
=====

3. Every comparison between two runall.sh runs shows minor, non-repeatable differences. I've concluded that it will always be necessary to visually inspect the differences between runs to determine whether the differences are spurious, or whether they indicate a real regression.
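The sed-script filtering suggested in item 2 could equally be sketched in Python. The pattern list below is illustrative, based only on the common differences described above; a real filter would grow as further spurious differences are identified.

```python
import re

# Patterns for known-spurious output, per the common differences above.
# Illustrative only; not an exhaustive or official list.
SPURIOUS = [
    re.compile(r"^Traceback \(most recent call last\):"),
    re.compile(r'^\s*File "'),                       # traceback frame lines
    re.compile(r"COMM_FAILURE"),
    re.compile(r"Orca Screen Reader / Magnifier"),   # prefs-GUI focus noise
]

def filter_spurious(lines):
    """Drop lines matching any known-spurious pattern before diffing."""
    return [line for line in lines
            if not any(p.search(line) for p in SPURIOUS)]
```

Running both test outputs through such a filter before comparing them would leave only the differences that need human inspection.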
> 1. I still cannot get zero differences between runs. I've tried everything I
> can think of and have not been able to succeed in removing all differences. I
> suggest we reevaluate whether removing all differences is a realistic goal, and
> whether meaningful regression testing requires that there be zero differences
> between runs.

The goal is to get the differences to zero, but the actual bug is to make the tests repeatable. I think we're getting pretty close to the goal, though. Thanks for getting us here.

> Traceback (most recent call last):
> File "/usr/lib/python2.5/site-packages/orca/atspi.py", line 688, in __init__
> self.accessible = acc._narrow(Accessibility.Accessible)
> COMM_FAILURE
>
> atspi.py:Accessible.__init__ NOT GIVEN AN ACCESSIBLE!

These are probably reasonable to ignore. We handle them at the WARNING level, which means Orca recovers well and moves on. Alternatively, we can avoid using the debug support and start using the logging support -- the logging support just logs what Orca is saying and brailling. I'll post something on that in a reply to this comment.

> Common difference #2 sometimes occurs when a test application (e.g., gedit)
> terminates normally (e.g., using ctrl-q) rather than by being killed by a '-9'
> signal from runone.sh. When this output occurs, it is right after the test
> application quits.
>
> BRAILLE LINE: 'orca Application Orca Screen Reader / Magnifier Frame'
> VISIBLE: 'Orca Screen Reader / Magnifier F', cursor=1
> SPEECH OUTPUT: 'Orca Screen Reader / Magnifier frame'
> BRAILLE LINE: 'orca Application Orca Screen Reader / Magnifier Frame
> Preferences Button'
> VISIBLE: 'Preferences Button', cursor=1
> SPEECH OUTPUT: ''
> SPEECH OUTPUT: 'Preferences button'

This means the Orca preferences GUI got focus and Orca output information about it. Makes sense, and we can probably filter that out as well.

> Common difference #3 sometimes occurs for the SayAll tests. The traceback is
> the first entry in the '.orca' file for the test run. The SayAll tests are
> unique in that the 'orca.settings.speechServerFactory = None' statement is
> commented out of the keystroke 'settings' file (e.g.,
> 'keystrokes/gedit/say-all.settings').

We should definitely fix bug 444416. That would let you keep speechServerFactory = None and get rid of these errors. The logging stuff I mention above might also be able to fix this.

> 3. Every comparison between two runall.sh runs shows minor, non-repeatable
> differences. I've concluded that it will always be necessary to visually
> inspect the differences between runs to determine whether the differences are
> spurious, or whether they indicate a real regression.

Do you have an example of those differences?
Created attachment 91518 [details] [review] Patch to update use of logging module

This patch updates the use of the logging module. To use it, you can add something like the following to your ~/.orca/user-settings.py or ~/.orca/orca-customizations.py file:

import logging
handler = logging.FileHandler("log.out")
formatter = logging.Formatter('%(name)s.%(message)s')
handler.setFormatter(formatter)
for logger in ["braille", "speech"]:
    log = logging.getLogger(logger)
    log.addHandler(handler)
    log.setLevel(logging.INFO)

When in use, all speech and braille activity will be logged to log.out (for the above example, anyway). This will happen simultaneously with the debug stuff, but only speech and braille activity will get logged, and it should get logged regardless of the debug level.
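For reference, here is roughly how records reach those handlers: anything that calls logging.getLogger("speech").info(...) is picked up, and the '%(name)s.%(message)s' format prefixes each line with the logger name. This sketch uses an in-memory stream instead of a file so the result can be inspected directly, and the message text is made up for illustration; it is not actual Orca output.

```python
import io
import logging

# Same shape of configuration as the patch above, but writing to an
# in-memory stream rather than log.out so we can look at the result.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter('%(name)s.%(message)s'))

for name in ["braille", "speech"]:
    log = logging.getLogger(name)
    log.addHandler(handler)
    log.setLevel(logging.INFO)

# Hypothetical messages; in Orca the real content would come from the
# speech and braille modules themselves.
logging.getLogger("speech").info("SPEECH OUTPUT: 'gedit frame'")
logging.getLogger("braille").info("BRAILLE LINE: 'gedit frame'")

print(stream.getvalue())
# speech.SPEECH OUTPUT: 'gedit frame'
# braille.BRAILLE LINE: 'gedit frame'
```

Because the handler is attached only to the "braille" and "speech" loggers, debug output routed elsewhere never appears in this log, which is exactly the property that makes it attractive for repeatable test results.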
Created attachment 91520 [details] [review] Patch to potentially integrate logging into the test harness

This is a potential patch to integrate the logging stuff into the test harness. It merely changes the output file from the debug stuff to the logging stuff, allowing us to keep spurious debug output from making it into the test results. I haven't run it yet, though. Now, whether we want to avoid spurious debug stuff is another question to be answered...
The regression test harness has been reworked to use macaroon and Python-based tests. Orca has also been modified to include a synchronous mode to help eliminate non-deterministic timing of event handling. The results look very promising, and the new test harness provides reliable repeatability across approximately 37 tests. This work has been checked into the trunk and will not be part of GNOME 2.20.
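In generic terms, the synchronous-mode idea is to stop handling events at whatever moment they happen to arrive and instead drain them from a queue at a well-defined point, so two identical runs process events in the same order. The sketch below is a general illustration of that technique, not Orca's actual implementation; the class and method names are invented.

```python
import queue

class SynchronousDispatcher:
    """Illustrative event dispatcher: enqueue now, handle deterministically."""

    def __init__(self):
        self._events = queue.Queue()
        self.handled = []

    def post(self, event):
        """Called from event sources; just enqueues, never handles inline."""
        self._events.put(event)

    def pump(self):
        """Drain all pending events in arrival (FIFO) order."""
        while not self._events.empty():
            self.handled.append(self._events.get())

d = SynchronousDispatcher()
for e in ["focus:", "object:text-changed", "window:activate"]:
    d.post(e)
d.pump()
print(d.handled)
# ['focus:', 'object:text-changed', 'window:activate']
```

With handling centralized in pump(), the test harness can drive the application, then pump, then capture output, removing the race between event delivery and result collection.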
I'm closing this bug out as INCOMPLETE. Much work was done in Orca and the harness to enable repeatability. But the individual tests themselves also require attention to repeatability. Thus, this bug is open-ended and never-ending.