Bug 464754 – Pronunciation dictionary checks should be case insensitive.

After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.

Bug 464754 - Pronunciation dictionary checks should be case insensitive.


Summary:	Pronunciation dictionary checks should be case insensitive.


Status:	RESOLVED FIXED

Product:	orca
Classification:	Applications
Component:	general
Version:	2.19.x
Hardware:	Other Linux

Importance:	Normal enhancement
Target Milestone:	2.22.0
Assigned To:	Rich Burridge
QA Contact:	Orca Maintainers

URL:
Whiteboard:

Depends on:
Blocks:

Reported:	2007-08-08 16:30 UTC by Rich Burridge
Modified:	2008-07-22 19:32 UTC

See Also:
GNOME target:	---
GNOME version:	---

Attachments
Patch to hopefully fix this. (2.65 KB, patch) 2007-08-08 19:18 UTC, Rich Burridge	none	Details \| Review
Revised patch. (7.99 KB, patch) 2007-08-13 18:27 UTC, Rich Burridge	none	Details \| Review
Second revised patch. (9.33 KB, patch) 2007-08-13 21:26 UTC, Rich Burridge	none	Details \| Review
Fourth version. (9.28 KB, patch) 2007-08-14 15:01 UTC, Rich Burridge	committed	Details \| Review

Description Rich Burridge 2007-08-08 16:30:42 UTC

We are going to need to make the checks for entries in the pronunciation
diction case insensitive. For example, the "strikethrough" entry
(which gets replaced with "strike through") might occur at the
beginning of a sentence, in which case the first letter in capitalized.
For that case, it's not going to currently match against the "strikethrough"
entry in the pronunciation dictionary.

Comment 1 Rich Burridge 2007-08-08 19:18:50 UTC

Created attachment 93304 [details] [review]
Patch to hopefully fix this.

Not committed yet. Patch is against SVN HEAD, although if it's
what we want, then there is no reason why it couldn't be applied
to the gnome-2-20 as well.

Comment 2 Joanmarie Diggs (IRC: joanie) 2007-08-09 07:33:55 UTC

I'm wondering if, instead of altering the entries written by the user, we want to just be sure that we alter the strings prior to doing the comparison.  Mostly this is minor:  If I typed the actual word in all caps (like I would an acronym) or used an initial cap (like I would a name), I would prefer it remain that way the next time I look at the dictionary.  However, I stumbled across an interesting/more significant case:   

Enter JOSÉ as the actual word, come up with a replacement string, press OK, then quit and restart Orca.  Results:

Actual word becomes: josÉ

In addition, the replacement string is not used for any of the following:

JOSÉ
José
josé
josÉ <-- surprised me

If you enter the actual string as José, the replacement string works for all four of the above variants.

Comment 3 Rich Burridge 2007-08-09 11:17:02 UTC

> I'm wondering if, instead of altering the entries written by the user, we want
> to just be sure that we alter the strings prior to doing the comparison.

I'm not sure I follow you. What do you mean by "just be sure that we alter 
the strings prior to doing the comparison"? The attached patch I believe does
that, but in order to successfully do a comparison with potential pronunciation
dictionary entries, then the keys for that dictionary need to be in lower
case too.

> However, I stumbled across an interesting/more significant case:

Yup. That's a biggie. I've no idea what the best way of solving that problem is.

Comment 4 Willie Walker 2007-08-09 12:24:34 UTC

> I'm wondering if, instead of altering the entries written by the user, we want
> to just be sure that we alter the strings prior to doing the comparison. 
> Mostly this is minor:  If I typed the actual word in all caps (like I would an
> acronym) or used an initial cap (like I would a name), I would prefer it remain
> that way the next time I look at the dictionary. 

This is a good point.  The problem we're facing is that we have the string acting as the key for the dictionary.  I wonder if we should consider something like the following:

...
pronunciation_dict[_("ASAP").lower()]    = _("as soon as possible")
...

and then:

...
        if pronunciations != None:
            return pronunciations[word.lower()]
        else:
            return pronunciation_dict[word.lower()]
...

Comment 5 Rich Burridge 2007-08-09 15:07:27 UTC

Okay, then can be done. Will, how should the second problem in 
comment #3 be fixed?

Comment 6 Rich Burridge 2007-08-09 15:13:07 UTC

Will, if we do you proposal in comment #4, what should be written
out to the ~/.orca/user-settings.py and ~/.orca/app-settings/<APPNAME>.py
file instead of lines like:

pronunciation_dict[_("ASAP")]    = _("as soon as possible")

If it's:

pronunciation_dict[_("ASAP").lower()]    = _("as soon as possible")

then how do we rebuild the pronunciation list in the Orca preferences
dialog correctly with the user's original capitalized Actual words?

Comment 7 Willie Walker 2007-08-09 15:34:23 UTC

(In reply to comment #6)
> Will, if we do you proposal in comment #4, what should be written
> out to the ~/.orca/user-settings.py and ~/.orca/app-settings/<APPNAME>.py
> file instead of lines like:
> 
> pronunciation_dict[_("ASAP")]    = _("as soon as possible")
> 
> If it's:
> 
> pronunciation_dict[_("ASAP").lower()]    = _("as soon as possible")
> 
> then how do we rebuild the pronunciation list in the Orca preferences
> dialog correctly with the user's original capitalized Actual words?

Darn. You're right.  Hmmm...this is getting ugly.  A dirty hack is to loop through the keys.  Something like:

lowerWord = word.lower()
if not pronunciations:
    pronunciations = pronunciation_dict
for key in pronunciations:
    if key.lower() == lowerWord:
        return pronunciations[key]
return word

Another alternative might be to consider reworking the pronunciations API to expose a method such as addPronunciation(word, pronunciation).  Internally, we could then do whatever we want.  For example, we could maintain two parallel dictionaries -- one for the word/pronunciation as spelled by the user and another that uses lower case words for the keys.

Comment 8 Rich Burridge 2007-08-09 15:42:06 UTC

Or maybe just write out lines like:

pronunciation_dict[_("ASAP").lower()]    = [ _("ASAP"), _("as soon as possible") ]

and adjust the list rebuilding code and the getPronunciation()
routine to understand that the values are now a list.

Comment 9 Rich Burridge 2007-08-09 15:43:53 UTC

Will, still need your reply on second problem that Joanie raised in
comment #2 (sorry, not comment #3).

Comment 10 Willie Walker 2007-08-09 16:54:35 UTC

(In reply to comment #8)
> Or maybe just write out lines like:
> 
> pronunciation_dict[_("ASAP").lower()]    = [ _("ASAP"), _("as soon as
> possible") ]
> 
> and adjust the list rebuilding code and the getPronunciation()
> routine to understand that the values are now a list.

That seems like a reasonable alternative.

(In reply to comment #9)
> Will, still need your reply on second problem that Joanie raised in
> comment #2 (sorry, not comment #3).

Well, it looks like the lower() of JOSÉ and José are different:

Python 2.5.1 (r251:54863, May  2 2007, 16:56:35) 
[GCC 4.1.2 (Ubuntu 4.1.2-0ubuntu4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> "José".lower()
'jos\xc3\xa9'
>>> "JOSÉ".lower()
'jos\xc3\x89'

As for how to handle this, I'm not sure.  Maybe we need to store both the original and lower() forms as keys?

Comment 11 Willie Walker 2007-08-09 17:07:27 UTC

(In reply to comment #10)
> As for how to handle this, I'm not sure.  Maybe we need to store both the
> original and lower() forms as keys?

....and upper() form as well, I guess...

I think the route of adding new API (addPronunciation) might the thing that would provide us with the most flexibility here.  We might also consider making the pronunciation thing a full blown class instead of just being a module.

Comment 12 Rich Burridge 2007-08-09 17:11:00 UTC

> As for how to handle this, I'm not sure.  Maybe we need to store both the
> original and lower() forms as keys?

Ugh! ;-)

There are a swag of Unicode characters like this
(see work that Joanie did in chnames.py). Are you suggesting
that for each character in the potential key for a pronunciation
dictionary entry, we are going to have to check to see if it's 
one of these special characters, and if it is, then we are going to
have to generate two dictionary entries? I think that's what you are
saying. Just want to make sure?

Comment 13 Willie Walker 2007-08-09 17:47:03 UTC

> There are a swag of Unicode characters like this
> (see work that Joanie did in chnames.py). Are you suggesting
> that for each character in the potential key for a pronunciation
> dictionary entry, we are going to have to check to see if it's 
> one of these special characters, and if it is, then we are going to
> have to generate two dictionary entries? I think that's what you are
> saying. Just want to make sure?

I'm really not sure of the right solution here.  It seems like the solution should handle the case where the user enters the following word/replacement: "Ééks"/"yikes".  With this case, we should get "yikes" for any of the following words if indeed "É".lower() should be "é" and "é".upper() should be "É":

ééks
Ééks
éÉks
ÉÉks

From the following unicode descriptions, it seems as though the upper/lower definition of "é" and "É" are well defined and are what we would expect (see the "Upper case" and "Lower case" portions on each page):

http://www.fileformat.info/info/unicode/char/00e9/index.htm
http://www.fileformat.info/info/unicode/char/00c9/index.htm

Given this, it seems as though Python might be broken.  I'm not sure we want to incur a whole lot of overhead in a frequently called portion of the code in an attempt to fix an underlying Python bug.  Instead, I propose we attempt the string.lower() solution as a minimum, and I'd also propose that we attempt to hide that detail from the user-settings.py file.  That is, internally we can go with whatever solution we want, but externally (i.e., in user-settings.py), we save/restore the exact strings that the user typed.  With this, we can do our own brute force matching technique if we really need to.

Comment 14 Joanmarie Diggs (IRC: joanie) 2007-08-09 17:48:43 UTC

If you convert the string to unicode first and then call lower(), I believe you get the expected results.

Comment 15 Joanmarie Diggs (IRC: joanie) 2007-08-09 17:56:18 UTC

> >>> "José".lower()
> 'jos\xc3\xa9'
> >>> "JOSÉ".lower()
> 'jos\xc3\x89'
 
>>> "José".lower()
'jos\xc3\xa9'
>>> "JOSÉ".decode("UTF-8").lower().encode("UTF-8")
'jos\xc3\xa9'

Comment 16 Willie Walker 2007-08-09 18:07:31 UTC

Well dang!  That was sitting right in my window and I swore I didn't see it.  I must be looking at too many different problems at once today.

>>> originalWord = "Ééks"
>>> keyWord = originalWord.decode("UTF-8").lower()
>>> a = {}
>>> a[keyWord] = [originalWord, "yikes"]
>>> for word in ["ééks", "Ééks", "éÉks", "ÉÉks"]:
...     lowerWord = word.decode("UTF-8").lower()
...     print a[lowerWord]
... 
['\xc3\x89\xc3\xa9ks', 'yikes']
['\xc3\x89\xc3\xa9ks', 'yikes']
['\xc3\x89\xc3\xa9ks', 'yikes']
['\xc3\x89\xc3\xa9ks', 'yikes']

Rich, hopefully you have enough information at your fingertips to propose a solution.

Comment 17 Rich Burridge 2007-08-09 18:15:05 UTC

> Rich, hopefully you have enough information at your fingertips to propose a
> solution.

Yup. Thankyou both. I'll probably leave this to next week now. I've got
my head wrapped around OOo sbase bugs at the moment.

Comment 18 Rich Burridge 2007-08-13 18:27:42 UTC

Created attachment 93589 [details] [review]
Revised patch.

Patch against SVN HEAD. 

It seems to fixup the recent problems found by Joanie.

Tested with the following pronunciation added to the global settings:

Actual String       Replacement String
-------------------------------------------
 "josé"             hose eh california

Then with the following three lines in gedit:

Can you tell me the way to San José
Can you tell me the way to San josé
Can you tell me the way to San JOSÉ

This patch also fixes a bug that seems to have gone unnoticed. Namely,
that if you had set an application specific pronunciation, then did a
Insert-Control-Space and tried to change it to something else, the
application specific pronunciation list would have been re-filled with the
global pronunciations instead.

Patch not committed yet.

Please test.

Comment 19 Rich Burridge 2007-08-13 18:31:35 UTC

Oh, I forget to mention. We have a flag day here. The new pronunciation
dictionary entries in ~/.orca/user-settings.py and the application
specific files now look like:

# User customized pronunciation dictionary settings
#
import orca.pronunciation_dict 

orca.pronunciation_dict.pronunciation_dict={}
orca.pronunciation_dict.pronunciation_dict["asap"]=[ "ASAP", "as soon as possible" ]
orca.pronunciation_dict.pronunciation_dict["ghz"]=[ "GHz", "super gigahertz" ]
orca.pronunciation_dict.pronunciation_dict["imap"]=[ "IMAP", "eye map" ]
orca.pronunciation_dict.pronunciation_dict["josé"]=[ "josé", "hose eh california" ]
orca.pronunciation_dict.pronunciation_dict["ldap"]=[ "LDAP", "ell dap" ]
orca.pronunciation_dict.pronunciation_dict["lol"]=[ "LOL", "laughing out loud" ]
orca.pronunciation_dict.pronunciation_dict["mhz"]=[ "MHz", "megahertz" ]
orca.pronunciation_dict.pronunciation_dict["selinux"]=[ "SELinux", "ess ee linux" ]
orca.pronunciation_dict.pronunciation_dict["strikethrough"]=[ "strikethrough", "strike through" ]


and

# User customized application specific pronunciation dictionary settings
#
import orca.pronunciation_dict

def overridePronunciations(script, pronunciations):
    pronunciations["asap"]=[ "ASAP", "as soon as possible" ]
    pronunciations["ghz"]=[ "GHz", "ultra mega super gigahertz" ]
    pronunciations["imap"]=[ "IMAP", "eye map" ]
    pronunciations["ldap"]=[ "LDAP", "ell dap" ]
    pronunciations["lol"]=[ "LOL", "laughing out loud" ]
    pronunciations["mhz"]=[ "MHz", "megahertz" ]
    pronunciations["selinux"]=[ "SELinux", "ess ee linux" ]
    pronunciations["strikethrough"]=[ "strikethrough", "strike through" ]
    return pronunciations

orca.settings.overridePronunciations = overridePronunciations


This means that if you have any of the old-style entries, Orca is going to
chuck a wobblie (um, throw a Traceback). 

Is just informing the users on the Orca mailing list enough here, or should
we be trying to test for such occurances and programming around them? I'd
prefer the former.

Comment 20 Joanmarie Diggs (IRC: joanie) 2007-08-13 19:33:01 UTC

What about informing the users and doing *minimal* programming around them?

I just tried this on a clean machine where I don't have any existing entries other than the old-style ones in user-settings.py.  The resulting wobblie being chucked (I love that) keeps the Orca Preferences dialog from ever appearing.  I assume this will impact any user upgrading to the latest Orca.

It seems that pronunciation_dict.py already wraps its stuff in a try/except so that wouldn't spit up.  So doesn't that just leave orca_gui_prefs.py's _createPronunciationTreeView()?  Perhaps there, we could test to see if we have old-style entries and, if so, set pronDict to pronunciation_dict.pronunciation_dict?  Less work than trying to do a conversion....

Should we go this route, the note to the list would be to inform users that their dictionary's getting blown away -- as opposed to suggest that they need to blow away their user_settings.py (and corresponding app settings files) or manually edit out the old-style entries from each affected file.

Comment 21 Willie Walker 2007-08-13 19:48:02 UTC

I'd like it if we didn't have to inflict our translators with the work of translating the same string twice (e.g., 'asap' and 'ASAP'):

+pronunciation_dict[_("asap")]    = [ _("ASAP"), _("as soon as possible") ]

Going down the route of adding new API (addPronunciation or setPronunciation) might the thing that would provide us with the most flexibility here:

    pronunciation_dict.setPronunciation(_("ASAP"), _("as soon as possible"))

Comment 22 Rich Burridge 2007-08-13 21:26:00 UTC

Created attachment 93604 [details] [review]
Second revised patch.

Further revised patch based on previous comments from Joanie and Will.
- Adds a setPronunciation() routine in pronunciation_dict.py (no new strings).
- Add a try/except clause around the setting on model entries when the
  pronunciation list is created for the Orca preferences, with the except
  clause hopefully handling old-style ~/.orca/user-settings.py.

I didn't see an easy way to use setPronunciation(), for the setting of
pronunciation entries for the application specific files. These look
different. I.e:

# User customized application specific pronunciation dictionary settings
#
import orca.pronunciation_dict

def overridePronunciations(script, pronunciations):
    pronunciations["asap"]=[ "ASAP", "as soon as possible" ]
    pronunciations["ghz"]=[ "GHz", "ultra mega super gigahertz" ]
    pronunciations["imap"]=[ "IMAP", "eye map" ]
    pronunciations["ldap"]=[ "LDAP", "ell dap" ]
    pronunciations["lol"]=[ "LOL", "laughing out loud" ]
    pronunciations["mhz"]=[ "MHz", "megahertz" ]
    pronunciations["selinux"]=[ "SELinux", "ess ee linux" ]
    pronunciations["strikethrough"]=[ "strikethrough", "strike through" ]
    return pronunciations

orca.settings.overridePronunciations = overridePronunciations

Will/Joanie if you have any ideas on how to rewrite this to use
setPronunciation(), please let me know. Otherwise I think we can 
leave it the way it is (compatible with the way that application 
specific key bindings are done).

Thoughts?

Comment 23 Rich Burridge 2007-08-13 22:08:23 UTC

I wrote:
> Will/Joanie if you have any ideas on how to rewrite this to use
> setPronunciation(), please let me know.

Actually, I think I can see how this can be done. I'll work on a 
further revised patch tomorrow morning.

Comment 24 Rich Burridge 2007-08-14 15:01:12 UTC

Created attachment 93649 [details] [review]
Fourth version.

Adjusts _writePronunciation() in app_prefs.py to use
orca.pronunciation_dict.setPronunciation()

Comment 25 Joanmarie Diggs (IRC: joanie) 2007-08-14 22:59:41 UTC

I noticed something that seems to be a (positive) side effect of this change:  Before we were initially populating the app-specific dictionaries with the existing entries; now we're not.  I really like seeing just those things I've added within an app appear in the tree for that app.

So far this patch seems to be working *most of the time*.  Every once in a while, however, it seems that one dictionary is stomping on the other.  But it's rare and I haven't yet worked out why.  My test case has been two entries for my first name:

user-settings.py:
orca.pronunciation_dict.setPronunciation("Joanmarie", "Joan Marie")

app-settings/gedit.py:
orca.pronunciation_dict.setPronunciation("Joanmarie", "JD", pronunciations)

I then created a single line document in Gedit and another in Writer, each containing just my first name.  I'm Alt Tabbing between these, arrowing off, and then back on my name.  I'd say 8 or 9 times out of 10, Orca gets it right; the other time or two it doesn't.

There was also one time when I went back into the Gedit preferences and didn't have my entry listed.  That I cannot reproduce.

Comment 26 Rich Burridge 2007-08-14 23:47:06 UTC

Thanks for testing this.

It's loadAppSettings() in focus_tracking_presenter.py that will load
an application specific pronunciation dictionary (see about line 419).

That method is called from _processObjectEvent() in focus_tracking_presenter.py
at about line 619, if it's seen a "window:activate" event or a "focus:"
event on a FRAME object.

I'd guess that the times that it's not working correctly for you, you aren't
getting either of those trigger events.

Comment 27 Rich Burridge 2007-08-15 00:28:52 UTC

The other possibility is that something else is coming along
(say like that network applet) that's running in the
"background", and that's calling loadAppSettings() for itself
and there goes the gedit settings. Dunno. Just thinking out loud.

A line or two of debug just before the call to loadAppSettings()
would confirm or deny that theory.

Comment 28 Joanmarie Diggs (IRC: joanie) 2007-08-15 01:59:11 UTC

It's still pretty iffy in the reproducibility department, but I think that your first theory (comment 26) is the correct one.  I added some debug lines to spit out orca_state.activeScript after it was set and just before loadAppSettings() was called.  When Orca fails to use the correct dictionary, there's no output from these lines. I didn't see any unexpected lines from other app(let)s.

More noteworthy is that I cannot reproduce this issue to save my life when compiz is disabled.  That's not this patch.... 

I suspect that we're going to start seeing these sorts of issues when Gutsy is released.  (Another issue:  We don't speak the selected item during the Alt+Tab when compiz is enabled).  It would seem that if compiz is enable-able, it now is by default in Gutsy.  Guess it's time to start paying attention to it....

Comment 29 Willie Walker 2007-08-16 00:20:18 UTC

I tested this, modifying both the global user-settings.py and app-specific settings for gnome-terminal.py.  Seems to work like a charm. I'm not sure I beat on it long enough, but I was unable to reproduce the problem Joanie saw.  Then again, I don't use Compiz.  Given that the problem Joanie saw seems like it might be related to Compiz use, and since Compiz seems to have its own bugs, I think it might be safe to commit this patch.  I'll let Joanie poke more, though, if she wants to.

BTW, at some point, I think we need to consider turning settings.py into a class and giving each script instance its own settings class instance.  I'm not sure if that will be better/worse than what we're doing now, but it might be worth a thought experiment at some time for GNOME 2.22.

Comment 30 Rich Burridge 2007-08-16 01:05:49 UTC

I've only committed it to SVN HEAD. As I tried to patch it in to the
gnome-2020 branch, I realized that I'd totally refactored orca_prefs.py 
in SVN HEAD, and this patch isn't going to easily fit into the old code.

Will, if you think this patch is essential for Orca for gnome-2-20,
let me know and I'll try to work out the equivalent for the old code.

Comment 31 Rich Burridge 2007-08-16 16:59:11 UTC

Will and I chatted this morning about this, and I said it should be
possible to rework the patch to fit the gnome-2-20 code and we probably
should do this because Halim found the problem (i.e. a RealWorld(TM) user).

Having now looked into it in more detail, it's going to be quite disruptive
to the existing code in order to wedge this into the old (non-refactored)
way of doing things.

So I take it back. I suggest we don't try to fix this for Orca 2.19.91,
but instead such fix it for the next major release.

Thoughts? ...

Comment 32 Mike Pedersen 2007-08-16 17:07:11 UTC

I think this is OK as most of our users seem to build from trunk any way.  This problem doesn't really impact the core stability of orca.

Comment 33 Willie Walker 2007-08-16 17:17:04 UTC

> So I take it back. I suggest we don't try to fix this for Orca 2.19.91,
> but instead such fix it for the next major release.

Given your analysis, I agree.  Thanks!

Comment 34 Rich Burridge 2007-08-17 15:09:58 UTC

Putting the bug into a "[pending]" state.

Comment 35 Rich Burridge 2007-08-21 20:13:25 UTC

Closing as FIXED.