GNOME Bugzilla – Bug 660813
Translation Memory give wrong characters
Last modified: 2011-10-12 14:32:40 UTC
Saluton, I use gtranslator (version 1.9.13) for translating into Esperanto. The problem is that the Translation Memory changes the special Esperanto characters to others. The other characters look like the right ones, but they aren't. Here are the right characters: ĉĝĥĵŝŭ ĈĜĤĴŜŬ and this is what the Translation Memory gives: ĉĝĥĵŝŭ ĈĜĤĴŜŬ
Created attachment 198141: program to normalise to composed form. Here's a quick program I wrote that will take a UTF-8 text file and normalise it to character-composed form. If I replace the G_NORMALIZE_ALL_COMPOSE with G_NORMALIZE_ALL then it has the opposite effect: all of the special characters are decomposed into letter + combining accent. Perhaps something like this is going on inside gtranslator...
Indeed, in src/translation-memory/gtr-gda.c we see: norm_translation = g_utf8_normalize (translation, -1, G_NORMALIZE_DEFAULT); in a couple of places...
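For readers following along, here is a minimal sketch of the effect being described. It uses Python's standard `unicodedata` module rather than GLib, on the assumption that only the Unicode forms matter here: GLib defines G_NORMALIZE_DEFAULT as NFD and G_NORMALIZE_DEFAULT_COMPOSE as NFC. The variable names are illustrative and are not taken from gtranslator.

```python
import unicodedata

# The six lowercase Esperanto letters, written as precomposed code points
# (c, g, h, j, s with circumflex and u with breve).
composed = "\u0109\u011d\u0125\u0135\u015d\u016d"

# NFD is the form GLib calls G_NORMALIZE_DEFAULT: each accented letter
# becomes a base letter followed by a combining accent.
decomposed = unicodedata.normalize("NFD", composed)

# NFC (G_NORMALIZE_DEFAULT_COMPOSE in GLib) recombines them.
recomposed = unicodedata.normalize("NFC", decomposed)

print(len(composed), len(decomposed))   # code point counts differ
print(composed == decomposed)           # the strings are no longer equal
print(recomposed == composed)           # composing restores the original
```

The two strings are meant to render identically, yet they differ code point by code point, which matches the symptom in the original report.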
Created attachment 198143: screenshot of problem caused by the bug. See attached screenshot for how these incorrect letters are rendered.
Feel free to provide a patch for it ;) If not, I can have a look at it next week.
*** Bug 637850 has been marked as a duplicate of this bug. ***
Created attachment 198664: trans memory. Please test if this fixes your problem. You should check if it is stored correctly and if you can remove entries from the translation memory.
I wonder if it would be better not to do the normalization at all. Ignacio, could you perhaps explain why it does this? If the translations are coming from already existing .po files, then it doesn't seem right to alter the string at all.
To be honest, that's one of the few parts of gtranslator that I didn't write myself, so I can't really answer that.
Ok. If we do go with that patch, it should probably be changed to G_NORMALIZE_DEFAULT_COMPOSE instead of G_NORMALIZE_ALL_COMPOSE because the latter also converts ellipses to dots and superscript digits to normal digits etc.
I'd have to look at the code more closely to see if we want to normalize at all, but assuming we do, G_NORMALIZE_DEFAULT_COMPOSE is what we want. NFKC and NFKD forms shouldn't be stored in this sort of instance.
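A small sketch of the difference between the two compose modes discussed above, again using Python's `unicodedata` as a stand-in for GLib (G_NORMALIZE_DEFAULT_COMPOSE corresponds to NFC, G_NORMALIZE_ALL_COMPOSE to NFKC). The sample string is made up for illustration.

```python
import unicodedata

# Hypothetical translated string containing an ellipsis character (U+2026)
# and a superscript two (U+00B2).
sample = "Wait\u2026 x\u00b2"

nfc = unicodedata.normalize("NFC", sample)    # canonical composition only
nfkc = unicodedata.normalize("NFKC", sample)  # also applies compatibility mappings

print(nfc == sample)   # NFC leaves both characters alone
print(nfkc)            # NFKC rewrites them as plain ASCII
```

NFKC replaces the ellipsis with three full stops and the superscript with an ordinary digit, so it changes what the translator actually typed; that is the reason given above for not storing NFKC or NFKD forms.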
OK, so from the docs, and I guess this is the reason it was normalized in the first place: "You should generally call g_utf8_normalize() before comparing two Unicode strings". If anybody feels like making this change, go ahead, although I'd like some testing first.
Where is it doing string comparisons on the translated value? The only case I can see is where it uses a query matching both the translation and the msgid to remove the translation. However in that case the string is coming from the GtkTreeView which should already have the exact value copied from the database so normalization should not be needed there. It seems to me that if it turns out we really do need normalization for string comparison then it should store both the original and normalized values, or try to calculate it just before doing the comparison because modifying the string at all seems wrong.
OK, you convinced me. Maybe we can try removing the normalization and check that we don't run into any problems?
For what it's worth, I think the fact that we can see the problem at all is because of this bug in the font Cantarell: https://bugzilla.gnome.org/show_bug.cgi?id=637066 The other fonts seem to display the decomposed characters correctly.
Hi all, I am not familiar with the gtranslator implementation, so I am not sure how to fix it. But I don't think comparing normalized strings is a problem in itself. The real problem is that the string the TM (translation memory) returns differs from the original string that was stored in it. gtranslator should output the same string as the original. It may be best to use the normalization function only when comparing. Thanks,
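The idea of normalizing only at comparison time while storing strings untouched can be sketched as follows. This is not gtranslator's code: `TinyMemory` and `equal_when_normalized` are hypothetical names, the storage is an in-memory list rather than a database, and Python's `unicodedata` stands in for g_utf8_normalize().

```python
import unicodedata


def equal_when_normalized(a: str, b: str) -> bool:
    """Compare two strings by their NFC forms without modifying either."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


class TinyMemory:
    """Hypothetical translation memory that stores strings verbatim."""

    def __init__(self):
        self._entries = []  # list of (msgid, translation) pairs, unmodified

    def add(self, msgid: str, translation: str) -> None:
        self._entries.append((msgid, translation))

    def lookup(self, msgid: str) -> list:
        # Normalization is applied only inside the comparison.
        return [t for m, t in self._entries if equal_when_normalized(m, msgid)]

    def remove(self, msgid: str, translation: str) -> int:
        # Returns the number of entries removed.
        before = len(self._entries)
        self._entries = [
            (m, t) for m, t in self._entries
            if not (equal_when_normalized(m, msgid)
                    and equal_when_normalized(t, translation))
        ]
        return before - len(self._entries)


tm = TinyMemory()
stored = "\u0109u"                       # precomposed c-circumflex, then "u"
tm.add("whether", stored)
found = tm.lookup("whether")             # comes back exactly as stored
removed = tm.remove("whether", "c\u0302u")  # decomposed form still matches
```

With this arrangement the lookup returns the translation byte for byte as it was entered, and removal still works when the caller happens to pass a differently normalized copy.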
(In reply to comment #14)
> For what it's worth, I think the fact that we can see the problem at all is
> because of this bug in the font Cantarell:
>
> https://bugzilla.gnome.org/show_bug.cgi?id=637066
>
> The other fonts seem to display the decomposed characters correctly.
So is this actually a gtranslator problem or not?
It's a gtranslator problem which is highlighted by a problem in Cantarell.
Created attachment 198841: Don't normalize translations stored in the translation memory. Yeah, I think it's still worth fixing in gtranslator, if only because storing the decomposed characters is less efficient :) It looks like the problem is different and worse for Japanese according to bug 637850, because those two strings look different to me even though the font isn't Cantarell. Here is a patch if that helps.
Review of attachment 198841: Looks good.
Ok, thanks. I've pushed it as c1c77f1