GNOME Bugzilla – Bug 321216
gnome-speech driver for festival does not handle UTF-8 text
Last modified: 2006-07-01 16:25:26 UTC
Please describe the problem:
The festival synthesis driver in gnome-speech-0.3.8 explicitly sets the channel encoding to ISO-8859-1. This causes a problem when the driver is used with UTF-8 encoded text. I am trying to use the driver for Indian-language text, which is UTF-8 encoded, and am running into this problem.

Steps to reproduce:
I am using my own build of festival TTS, which can speak Telugu text represented in UTF-8. When I tried to use this festival with the gnopernicus screen reader and opened an application with the locale set to Telugu, gnopernicus could not read out the tool tips etc. This turned out to be caused by the festival synthesis driver, which was not passing the text messages correctly to the festival speech synthesis server.

Actual results:
Telugu UTF-8 text is clipped off.

Expected results:
Text in encodings other than ISO-8859-1 should be passed to the festival server correctly.

Does this happen every time?
Yes

Other information:
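The clipping described above is consistent with a text channel that can only carry ISO-8859-1: Telugu characters have no representation in that encoding, so conversion fails at the first one. The following is a minimal Python sketch of that effect, not the driver's actual code (which this report does not show). The helper names `can_encode` and `encodable_prefix` are made up for illustration.

```python
def can_encode(text: str, encoding: str) -> bool:
    """Return True if every character of text is representable in encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


def encodable_prefix(text: str, encoding: str) -> str:
    """Return the longest leading part of text that encoding can represent.

    This mimics a converter that stops at the first character it cannot
    convert, which is one plausible way for text to end up clipped.
    """
    for index, char in enumerate(text):
        try:
            char.encode(encoding)
        except UnicodeEncodeError:
            return text[:index]
    return text


telugu = "తెలుగు"            # Telugu script, outside ISO-8859-1
mixed = "File " + telugu     # ASCII prefix followed by Telugu text

print(can_encode(telugu, "utf-8"))        # UTF-8 can carry the text
print(can_encode(telugu, "iso-8859-1"))   # ISO-8859-1 cannot
print(repr(encodable_prefix(mixed, "iso-8859-1")))
```

Under these assumptions, only the ASCII part of a mixed string survives an ISO-8859-1 channel, while the same string passes through a UTF-8 channel intact.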
Created attachment 54628
Patch created against CVS that fixes the bug

The patch sets the channel encoding to UTF-8.
Fixed in the development version. The fix will be available in the next major release. Thank you for your bug report.
This patch seems to have problems: the channel was being set explicitly to ISO-8859-1 because most non-English Festival voices seem to use that encoding. The new patch regresses support for those voices.
Now, what about Telugu (festival-te.sf.net) and other Indian languages? UTF-8, not ISO-8859-1, should be treated as the neutral encoding. Non-English Festival voices that expect ISO-8859-1 should be treated as broken in that respect. If fixing those voices is not an easy task, then perhaps exceptions for the broken non-English voices, keyed on the selected voice, should be added as hacks to gnome-speech. Would a patch along these lines be accepted?
I don't agree that UTF-8 should be treated as "neutral". The TTS engine we are using, festival, was written before UTF-8 was commonplace, and its expectations and voices must determine what we do here. ISO-8859-1 is the most common encoding for festival voices. You may think this is old-fashioned, but it is not something we control, and thus ISO-8859-1 makes a reasonable default. We do of course need to provide some ability to use the appropriate encoding for a given voice. I don't think that festival allows us to determine that, so we will have to provide some configuration table. Bill
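The configuration table suggested above could look roughly like the following. This is a hedged Python sketch of the idea only, not gnome-speech code. The voice name `example_telugu_voice` and the function names are hypothetical placeholders, and the only facts taken from this discussion are that ISO-8859-1 is the default and that some voices need UTF-8.

```python
# Default taken from the discussion: most festival voices expect ISO-8859-1.
DEFAULT_ENCODING = "iso-8859-1"

# Hypothetical per-voice overrides. Real voice names would have to come
# from the installed festival voices; this entry is a placeholder.
VOICE_ENCODINGS = {
    "example_telugu_voice": "utf-8",
}


def encoding_for_voice(voice_name: str) -> str:
    """Look up the encoding for a voice, falling back to the default."""
    return VOICE_ENCODINGS.get(voice_name, DEFAULT_ENCODING)


def encode_for_voice(text: str, voice_name: str) -> bytes:
    """Encode text the way the selected voice is assumed to expect it."""
    return text.encode(encoding_for_voice(voice_name))


print(encoding_for_voice("some_unlisted_voice"))
print(encode_for_voice("తెలుగు", "example_telugu_voice"))
```

Keeping ISO-8859-1 as the fallback preserves behaviour for existing voices, while a single table entry opts a UTF-8 voice in. Whether such a table belongs in code or in a user-editable file is left open here.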
Earlier discussion related to this bug took place in bug 141516, which was really two separate bugs. It does, however, cover encoding issues as well as autodetection of voices. So... I'm going to mark this bug as a duplicate of 141516, reopen 141516, and move the discussion there. *** This bug has been marked as a duplicate of 141516 ***