GNOME Bugzilla – Bug 520656
The regression test harness should be capable of handling alternative expected results
Last modified: 2008-03-24 16:47:01 UTC
One of the difficulties with automated regression tests is that different platforms, and different versions of software within each platform, can each result in different (but equally valid/correct) output from Orca. We should try to minimize these differences by creating tests that are not platform and/or environment dependent, as well as by maintaining test environments that are as close as possible to a specified configuration. Sadly, these measures still won't eliminate all differences. :-( See, for example, http://bugzilla.gnome.org/show_bug.cgi?id=519271#c6 as well as the following comment. If the regression test harness were capable of handling alternative expected results, we might be able to better address these remaining differences.
We could have output along these lines:

Test 1 of 7 FAILED: /home/jd/orca/test/keystrokes/gtk-demo/role_icon.py:Layered pane focus
EXPECTED:
     "BUG? - should something be presented here?",
ACTUAL:
     "",
[FAILURE WAS EXPECTED - LOOK FOR BUG? IN EXPECTED RESULTS]

Test 2 of 7 FAILED: /home/jd/orca/test/keystrokes/gtk-demo/role_icon.py:Layered pane Where Am I
EXPECTED:
     "BUG? - should we present the number of items in the layered pane?",
     "BRAILLE LINE: 'gtk-demo Application GtkIconView demo Frame ScrollPane LayeredPane'",
     " VISIBLE: 'LayeredPane', cursor=1",
     "SPEECH OUTPUT: ''",
     "SPEECH OUTPUT: 'layered pane'",
ACTUAL:
     "BRAILLE LINE: 'gtk-demo Application GtkIconView demo Frame ScrollPane LayeredPane'",
     " VISIBLE: 'LayeredPane', cursor=1",
     "SPEECH OUTPUT: ''",
     "SPEECH OUTPUT: 'layered pane'",
[FAILURE WAS EXPECTED - LOOK FOR BUG? IN EXPECTED RESULTS]

Test 4 of 7 FAILED: /home/jd/orca/test/keystrokes/gtk-demo/role_icon.py:bin icon Where Am I
EXPECTED:
     "BRAILLE LINE: 'gtk-demo Application GtkIconView demo Frame ScrollPane LayeredPane bin Icon'",
     " VISIBLE: 'bin Icon', cursor=1",
     "SPEECH OUTPUT: 'Icon panel'",
     "SPEECH OUTPUT: 'foobar'",
     "SPEECH OUTPUT: '1 of 24 items selected'",
     "SPEECH OUTPUT: 'on item 1 of 24'",
ALTERNATIVELY:
     "BRAILLE LINE: 'gtk-demo Application GtkIconView demo Frame ScrollPane LayeredPane bin Icon'",
     " VISIBLE: 'bin Icon', cursor=1",
     "SPEECH OUTPUT: 'Icon panel'",
     "SPEECH OUTPUT: 'bin'",
     "SPEECH OUTPUT: '1 of 23 items selected'",
     "SPEECH OUTPUT: 'on item 1 of 23'",
ACTUAL:
     "BRAILLE LINE: 'gtk-demo Application GtkIconView demo Frame ScrollPane LayeredPane bin Icon'",
     " VISIBLE: 'bin Icon', cursor=1",
     "SPEECH OUTPUT: 'Icon panel'",
     "SPEECH OUTPUT: 'bin'",
     "SPEECH OUTPUT: '1 of 24 items selected'",
     "SPEECH OUTPUT: 'on item 1 of 24'",
[FAILURE WAS UNEXPECTED]

Test 3 of 7 SUCCEEDED: /home/jd/orca/test/keystrokes/gtk-demo/role_icon.py:bin icon
Test 5 of 7 SUCCEEDED: /home/jd/orca/test/keystrokes/gtk-demo/role_icon.py:boot icon
Test 6 of 7 SUCCEEDED: /home/jd/orca/test/keystrokes/gtk-demo/role_icon.py:icon selection
Test 7 of 7 SUCCEEDED: /home/jd/orca/test/keystrokes/gtk-demo/role_icon.py:icon selection Where Am I

SUMMARY: 4 SUCCEEDED and 3 FAILED (1 UNEXPECTED) of 7 for /home/jd/orca/test/keystrokes/gtk-demo/role_icon.py

:-)
Created attachment 106657 [details] [review]
revision 1 - probably needs work

This patch consists of some minor changes to utils.py to support alternative expected results. The change should work with our existing tests because it takes the expected results and turns them into a list of lists (if it's not already a list of lists). As part of the "proof of concept", I modified the role_icon.py test as follows:

* Test 4 of 7 has two possible alternative results, both of which were designed to fail on my machine. As a result, you get the EXPECTED: "foo" ALTERNATIVELY: "bar" ACTUAL: "oh crap" output seen in my previous comment.
* Test 7 of 7 has two alternatives, the first of which succeeds, hence: Test 7 of 7 SUCCEEDED: /home/jd/orca/test/yadda/yadda/yadda.py
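The "list of lists" idea can be sketched roughly like this. Note this is a minimal illustration, not the actual utils.py code; the function names (normalize_expected, results_match) are hypothetical:

```python
# Sketch of the patch's idea: normalize the expected results so a plain
# list of lines becomes a single alternative, then accept the actual
# output if it matches ANY of the alternatives.

def normalize_expected(expected):
    """Wrap a plain list of lines into a list containing one alternative."""
    if expected and isinstance(expected[0], list):
        return expected          # already a list of alternatives
    return [expected]            # single alternative

def results_match(expected, actual):
    """Return True if the actual lines match any expected alternative."""
    return any(alt == actual for alt in normalize_expected(expected))

# Usage: two acceptable outputs for the same test step.
expected = [
    ["SPEECH OUTPUT: '1 of 24 items selected'"],
    ["SPEECH OUTPUT: '1 of 23 items selected'"],
]
actual = ["SPEECH OUTPUT: '1 of 23 items selected'"]
print(results_match(expected, actual))  # → True
```

Existing tests that pass a flat list of lines keep working because the flat list is wrapped into a one-element list of alternatives.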
So guys, whatchya think?
I think we need something like this. There will always be differences between Solaris and "Linux", and if we want a reliable set of regression tests (instead of just punting with a "KNOWN ISSUE" solution), then we have to devise an approach that handles that. My only other thought was yet another way of doing this: a dictionary where the results are stored under keys derived from running "uname". Something like:

{'solaris': ['Line 1', 'Line 2', 'Line 3'],
 'Linux':   ['Line A', 'Line B', 'Line C']}

That gives us a tighter matchup of what's expected on each platform.
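That dictionary idea might look something like the sketch below. The key names and helper function are purely illustrative (Python's platform.system() reports what uname does, e.g. 'Linux' or 'SunOS' on Solaris); this is not Orca's actual test API:

```python
import platform

# Hypothetical sketch of the uname-keyed dictionary idea: keep one
# expected-results list per platform and select the right one at runtime.
EXPECTED = {
    'SunOS': ['Line 1', 'Line 2', 'Line 3'],   # what uname reports on Solaris
    'Linux': ['Line A', 'Line B', 'Line C'],
}

def expected_for_platform(results, system=None):
    """Pick the expected results for the given (or current) OS."""
    if system is None:
        system = platform.system()   # e.g. 'Linux' or 'SunOS'
    try:
        return results[system]
    except KeyError:
        raise KeyError("no expected results recorded for %r" % system)
```

A downside of this scheme is that every platform-dependent test needs an entry per platform, whereas the alternatives approach lets one list serve all platforms that happen to agree.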
IMO, this is an interesting idea, but it should be used only as a last resort. Ideally, GNOME should behave like GNOME no matter where it runs. If the tests encounter differences that are the result of the underlying platform being exposed, we should try to avoid those issues. In most cases, I think we can. For example, depending upon the contents of "/" remaining constant is just a really bad idea. My apologies for creating it. We need a better/different test, not a crutch. In addition, if there are differences that are just plain unexplainable (e.g., extra spaces in some tests on Solaris), we really should try to dig to the bottom of those issues. They could be symptoms of underlying toolkit/AT-SPI bugs and not something we should hide. Before applying this practice, I think we should try to better evaluate and fix our current issues via other means if we can.
Created attachment 107620 [details] [review]
Patch to treat expected results as regular expressions

Here's a patch to treat the expected results as a list of regular expressions. Also included is a gtk-demo/debug_commands.py modification to show how it might be used. I still think this should only be used as a last resort. That is, we should only use it for cases where we understand why and accept that there will be differences. It should not be used to hide unexplained problems.
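The regular-expression approach can be sketched like this (a hypothetical helper, not the committed utils.py code): each expected line is a pattern that must match the corresponding actual line in full, so platform-dependent fragments can be loosened with patterns while everything else stays exact.

```python
import re

# Hypothetical sketch of the regular-expression approach: each expected
# line is a regex that must fully match the corresponding actual line.

def output_matches(expected_patterns, actual_lines):
    """True if every actual line fully matches its expected pattern."""
    if len(expected_patterns) != len(actual_lines):
        return False
    return all(re.fullmatch(pattern, line)
               for pattern, line in zip(expected_patterns, actual_lines))

# Usage: accept either '23' or '24' items, e.g. across platforms.
expected = [r"SPEECH OUTPUT: '1 of 2[34] items selected'"]
print(output_matches(expected, ["SPEECH OUTPUT: '1 of 23 items selected'"]))  # → True
```

One caveat with this scheme is that regex metacharacters occurring literally in expected output (parentheses, brackets, dots) must be escaped, which is part of why it's best reserved for the cases where the looseness is actually wanted.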
Committed the regular expression patch, with some additional documentation. Closing as FIXED.