GNOME Bugzilla – Bug 415061
regression test results should be repeatable
Last modified: 2008-07-22 19:32:18 UTC
The problem is that multiple runs of test/harness/runall.sh yield different results even when there are no changes to orca. This makes automated regression testing impossible. The goal is for runall.sh to yield identical results when nothing changes in orca or the test environment (other than the date and time).
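To illustrate what "identical results" means here, a hypothetical check might diff two runs after masking out dates and times. This is only a sketch: the file names, the timestamp pattern, and the comparison helper are all assumptions, not part of the actual harness.

```python
import difflib
import re

# Hypothetical timestamp pattern; real .orca output may format times differently.
TIMESTAMP = re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}")

def normalize(lines):
    """Blank out dates/times so they don't count as differences."""
    return [TIMESTAMP.sub("<TIME>", line) for line in lines]

def compare_runs(run_a, run_b):
    """Return the unified diff between two normalized test runs."""
    return list(difflib.unified_diff(normalize(run_a), normalize(run_b),
                                     fromfile="run1.orca", tofile="run2.orca"))

# Two runs that differ only in their timestamps compare as identical.
run1 = ["SPEECH OUTPUT: 'gedit frame'\n", "run at 2007-03-26 09:00:00\n"]
run2 = ["SPEECH OUTPUT: 'gedit frame'\n", "run at 2007-03-27 10:30:00\n"]
assert compare_runs(run1, run2) == []
```

Under this definition, a non-empty diff between two unchanged runs is exactly the repeatability failure this bug describes.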
Created attachment 84002 [details] [review] snapshot of test harness changes (not to be checked in) Record of current work. Not to be checked in.
Created attachment 84131 [details] columnar diff(1) output from two executions of runall.sh
Created attachment 84132 [details] side-by-side diff(1) output from two executions of runall.sh This is the output from the same two runall.sh execution as the previous attachment. It's just presented in a different format.
Created attachment 84133 [details] [review] made delay between arrow navigation keystrokes configurable
Created attachment 84500 [details] [review] Fixed a merge problem that caused .orca files to not be generated.
Created attachment 85422 [details] [review] See comments

2007-03-26 Lynn Monsanto <lynn.monsanto@sun.com>
* test/harness/runone.sh, test/harness/runall.sh, src/tools/play_keystrokes.py:
  bug #405061 - Adjusted keystroke playback timing for navigation keys.

There are still minor diffs between non-OpenOffice runs of runall.sh. This is probably okay for regression testing, but it does require someone to manually check the diffs every morning to verify there are no significant changes between runs. Note that there are still significant diffs between OpenOffice Writer and Calc runs. These are due to real bugs that need to be fixed.

To Do: I need to modify runone.sh so that the user doesn't need to specify whether code-coverage testing is done (1) or not done (0). Right now, you need to add a 0 or 1 at the end of the runone.sh command arguments. The runall.sh script always specifies a 0 or 1.
Created attachment 85423 [details] [review] See comments

2007-03-27 Lynn Monsanto <lynn.monsanto@sun.com>
* src/tools/play_keystrokes.py:
  bug #405061 - Modified keystroke playback to pause before each key press. This takes keystroke modifier keys and chords into account.
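The "pause before each key press" idea from the patch above can be sketched roughly as follows. This is not the actual play_keystrokes.py code: the function names, the event representation, and the chord-detection rule (pause only before the first press of a chord) are all illustrative assumptions.

```python
import time

# Illustrative delay; the actual delay is configurable in the harness.
KEY_PRESS_DELAY = 0.5  # seconds

def play_keystrokes(events, send_event, delay=KEY_PRESS_DELAY):
    """Replay recorded key events, pausing before each key press.

    Modifier keys and chords are accounted for by pausing only before
    the first press in a chord (e.g. before Control_L in Ctrl+Q), not
    before every press or before releases.
    """
    chord_open = False
    for kind, key in events:            # kind is "press" or "release"
        if kind == "press":
            if not chord_open:
                time.sleep(delay)       # let the application settle first
            chord_open = True
        else:
            chord_open = False
        send_event(kind, key)

# Example: replay Ctrl+Q with no delay, recording what gets sent.
sent = []
play_keystrokes([("press", "Control_L"), ("press", "q"),
                 ("release", "q"), ("release", "Control_L")],
                lambda kind, key: sent.append((kind, key)), delay=0)
```

The point of the pause is to give the application under test time to finish reacting to the previous keystroke, which removes one source of timing-dependent differences between runs.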
I'm going to make another push to (hopefully) reduce the number of differences between regression test runs to zero.
Will,

1. I still cannot get zero differences between runs. I've tried everything I can think of and have not been able to succeed in removing all differences. I suggest we reevaluate whether removing all differences is a realistic goal, and whether meaningful regression testing requires that there be zero differences between runs.

2. There are common differences that can be ignored. The best thing may be to run a sed script that filters out these differences:

===== Common difference #1, which is caused by the test application exiting before Orca quits. The warning messages occur because Orca is attempting to process an event for an AT-SPI object that no longer exists.

Traceback (most recent call last):
  self.accessible = acc._narrow(Accessibility.Accessible)
COMM_FAILURE

  server = SpeechServer.__createServer(s.iid)
  driver = SpeechServer.__activateDriver(iid)
  isInitialized = driver.isInitialized()
COMM_FAILURE
=====

3. Every comparison between two runall.sh runs shows minor, non-repeatable differences. I've concluded that it will always be necessary to visually inspect the differences between runs to determine whether the differences are spurious, or whether they indicate a real regression.
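The sed-script filtering suggested in item 2 could equally be sketched in Python. The pattern list below is illustrative, based only on the common differences described above; a real filter would grow as further spurious differences are identified.

```python
import re

# Patterns for known-spurious output, per the common differences above.
# Illustrative only; not an exhaustive or official list.
SPURIOUS = [
    re.compile(r"^Traceback \(most recent call last\):"),
    re.compile(r'^\s*File "'),                       # traceback frame lines
    re.compile(r"COMM_FAILURE"),
    re.compile(r"Orca Screen Reader / Magnifier"),   # prefs-GUI focus noise
]

def filter_spurious(lines):
    """Drop lines matching any known-spurious pattern before diffing."""
    return [line for line in lines
            if not any(p.search(line) for p in SPURIOUS)]
```

Running both test outputs through such a filter before comparing them would leave only the differences that need human inspection.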
> 1. I still cannot get zero differences between runs. I've tried everything I
> can think of and have not been able to succeed in removing all differences. I
> suggest we reevaluate whether removing all differences is a realistic goal, and
> whether meaningful regression testing requires that there be zero differences
> between runs.

The goal is to get the differences to zero, but the actual bug is to make the tests repeatable. I think we're getting pretty close to the goal, though. Thanks for getting us here.

> Traceback (most recent call last):
> File "/usr/lib/python2.5/site-packages/orca/atspi.py", line 688, in __init__
> self.accessible = acc._narrow(Accessibility.Accessible)
> COMM_FAILURE
>
> atspi.py:Accessible.__init__ NOT GIVEN AN ACCESSIBLE!

These are probably reasonable to ignore. We handle them at the WARNING level, which means Orca recovers well and moves on. Alternatively, we can avoid using the debug support and start using the logging support -- the logging support just logs what Orca is saying and brailling. I'll post something on that in a reply to this comment.

> Common difference #2 sometimes occurs when a test application (e.g., gedit)
> terminates normally (e.g., using ctrl-q) rather than by being killed by a '-9'
> signal from runone.sh. When this output occurs, it is right after the test
> application quits.
>
> BRAILLE LINE: 'orca Application Orca Screen Reader / Magnifier Frame'
> VISIBLE: 'Orca Screen Reader / Magnifier F', cursor=1
> SPEECH OUTPUT: 'Orca Screen Reader / Magnifier frame'
> BRAILLE LINE: 'orca Application Orca Screen Reader / Magnifier Frame
> Preferences Button'
> VISIBLE: 'Preferences Button', cursor=1
> SPEECH OUTPUT: ''
> SPEECH OUTPUT: 'Preferences button'

This means the Orca preferences GUI got focus and Orca output information about it. Makes sense, and we can probably filter that out as well.

> Common difference #3 sometimes occurs for the SayAll tests. The traceback is
> the first entry in the '.orca' file for the test run. The SayAll tests are
> unique in that the 'orca.settings.speechServerFactory = None' statement is
> commented out of the keystroke 'settings' file (e.g.,
> 'keystrokes/gedit/say-all.settings').

We should definitely fix bug 444416. That would let you keep speechServerFactory = None and get rid of these errors. The logging stuff I mention above might also be able to fix this.

> 3. Every comparison between two runall.sh runs shows minor, non-repeatable
> differences. I've concluded that it will always be necessary to visually
> inspect the differences between runs to determine whether the differences are
> spurious, or whether they indicate a real regression.

Do you have an example of those differences?
Created attachment 91518 [details] [review] Patch to update use of logging module

This patch updates the use of the logging module. To use it, you can add something like the following to your ~/.orca/user-settings.py or ~/.orca/orca-customizations.py file:

import logging
handler = logging.FileHandler("log.out")
formatter = logging.Formatter('%(name)s.%(message)s')
handler.setFormatter(formatter)
for logger in ["braille", "speech"]:
    log = logging.getLogger(logger)
    log.addHandler(handler)
    log.setLevel(logging.INFO)

When in use, all speech and braille activity will be logged to log.out (for the above example, anyway). This will happen simultaneously with the debug stuff, but only speech and braille activity will get logged, and it should get logged regardless of the debug level.
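For reference, here is roughly how records reach those handlers: anything that calls logging.getLogger("speech").info(...) is picked up, and the '%(name)s.%(message)s' format prefixes each line with the logger name. This sketch uses an in-memory stream instead of a file so the result can be inspected directly, and the message text is made up for illustration; it is not actual Orca output.

```python
import io
import logging

# Same shape of configuration as the patch above, but writing to an
# in-memory stream rather than log.out so we can look at the result.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter('%(name)s.%(message)s'))

for name in ["braille", "speech"]:
    log = logging.getLogger(name)
    log.addHandler(handler)
    log.setLevel(logging.INFO)

# Hypothetical messages; in Orca the real content would come from the
# speech and braille modules themselves.
logging.getLogger("speech").info("SPEECH OUTPUT: 'gedit frame'")
logging.getLogger("braille").info("BRAILLE LINE: 'gedit frame'")

print(stream.getvalue())
# speech.SPEECH OUTPUT: 'gedit frame'
# braille.BRAILLE LINE: 'gedit frame'
```

Because the handler is attached only to the "braille" and "speech" loggers, debug output routed elsewhere never appears in this log, which is exactly the property that makes it attractive for repeatable test results.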
Created attachment 91520 [details] [review] Patch to potentially integrate logging into the test harness

This is a potential patch to integrate the logging stuff into the test harness. It merely changes the output file from the debug stuff to the logging stuff, allowing us to keep spurious debug output from making it into the test results. I haven't run it yet, though. Now, whether we want to avoid spurious debug stuff is another question to be answered...
The regression test harness has been reworked to use macaroon and Python-based tests. Orca has also been modified to include a synchronous mode to help eliminate non-deterministic timing of event handling. The results look very promising, and the new test harness provides reliable repeatability across approximately 37 tests. This work has been checked into the trunk and will not be part of GNOME 2.20.
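In generic terms, the synchronous-mode idea is to stop handling events at whatever moment they happen to arrive and instead drain them from a queue at a well-defined point, so two identical runs process events in the same order. The sketch below is a general illustration of that technique, not Orca's actual implementation; the class and method names are invented.

```python
import queue

class SynchronousDispatcher:
    """Illustrative event dispatcher: enqueue now, handle deterministically."""

    def __init__(self):
        self._events = queue.Queue()
        self.handled = []

    def post(self, event):
        """Called from event sources; just enqueues, never handles inline."""
        self._events.put(event)

    def pump(self):
        """Drain all pending events in arrival (FIFO) order."""
        while not self._events.empty():
            self.handled.append(self._events.get())

d = SynchronousDispatcher()
for e in ["focus:", "object:text-changed", "window:activate"]:
    d.post(e)
d.pump()
print(d.handled)
# ['focus:', 'object:text-changed', 'window:activate']
```

With handling centralized in pump(), the test harness can drive the application, then pump, then capture output, removing the race between event delivery and result collection.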
I'm closing this bug out as INCOMPLETE. Much work was done in Orca and the harness to enable repeatability. But the individual tests themselves also require attention to repeatability. Thus, this bug is open-ended and never-ending.