Bug 660813 – Translation Memory give wrong characters

Summary:	Translation Memory give wrong characters


Status:	RESOLVED FIXED

Product:	gtranslator
Classification:	Other
Component:	Autotranslation
Version:	1.9.x
Hardware:	Other Linux

Importance:	Normal normal
Target Milestone:	2.0
Assigned To:	gtranslator-maint
QA Contact:	gtranslator-maint

URL:
Whiteboard:

Duplicates:	637850
Depends on:
Blocks:

Reported:	2011-10-03 18:56 UTC by Kristjan SCHMIDT
Modified:	2011-10-12 14:32 UTC

See Also:
GNOME target:	---
GNOME version:	---

Attachments
program to normalise to composed form (1.09 KB, text/plain) 2011-10-03 19:07 UTC, Allison Karlitskaya (desrt)
screenshot of problem caused by the bug (134.54 KB, image/png) 2011-10-03 19:19 UTC, Tiffany Antopolski
trans memory (1.19 KB, patch) 2011-10-09 18:53 UTC, Ignacio Casal Quinteiro (nacho)	none
Don't normalize translations stored in the translation memory (3.91 KB, patch) 2011-10-12 10:36 UTC, Neil Roberts	accepted-commit_now

Description Kristjan SCHMIDT 2011-10-03 18:56:10 UTC

saluton,

I use gtranslator (version 1.9.13) to translate into Esperanto.

The problem is that the Translation Memory changes the special Esperanto characters into others. The replacement characters look like the right ones, but they aren't.

So here are the right characters:
ĉĝĥĵŝŭ ĈĜĤĴŜŬ

And this is what the Translation Memory gives:
ĉĝĥĵŝŭ ĈĜĤĴŜŬ

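A minimal sketch of why the two lines above can look identical and still differ, assuming the Translation Memory returns the decomposed form (base letter plus combining accent), as the comments below suggest. The byte values are the standard UTF-8 encodings of U+0109 and of U+0063 followed by U+0302; the names `composed_c`, `decomposed_c`, `same_bytes`, and `dump_bytes` are illustrative only and do not come from gtranslator.

```c
#include <stdio.h>
#include <string.h>

/* "c with circumflex" as one precomposed code point, U+0109, in UTF-8. */
static const char composed_c[] = "\xC4\x89";

/* The same letter as "c" (U+0063) followed by
 * COMBINING CIRCUMFLEX ACCENT (U+0302), in UTF-8. */
static const char decomposed_c[] = "c\xCC\x82";

/* Returns 1 when the two strings are byte-for-byte identical, else 0. */
int same_bytes(const char *a, const char *b)
{
    return strcmp(a, b) == 0;
}

/* Prints the bytes of a UTF-8 string in hexadecimal, then its length. */
void dump_bytes(const char *label, const char *s)
{
    printf("%s:", label);
    for (const unsigned char *p = (const unsigned char *)s; *p != '\0'; p++)
        printf(" %02X", *p);
    printf(" (%zu bytes)\n", strlen(s));
}
```

Calling `dump_bytes("composed", composed_c)` prints `composed: C4 89 (2 bytes)`, while the decomposed string dumps as `63 CC 82 (3 bytes)`, and `same_bytes(composed_c, decomposed_c)` returns 0. A byte-wise comparison therefore treats the two spellings as different strings even when a font renders them alike.
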
Comment 1 Allison Karlitskaya (desrt) 2011-10-03 19:07:13 UTC

Created attachment 198141
program to normalise to composed form

Here's a quick program I wrote that will take a utf8 text file and normalise it to character-composed form.

If I replace the G_NORMALIZE_ALL_COMPOSE with G_NORMALIZE_ALL then it has the opposite effect: all of the special characters are decomposed into letter + combining accent.

Perhaps something like this is going on inside gtranslator...

Comment 2 Allison Karlitskaya (desrt) 2011-10-03 19:10:33 UTC

Indeed, in src/translation-memory/gtr-gda.c we see:

  norm_translation = g_utf8_normalize (translation, -1,
                                       G_NORMALIZE_DEFAULT);

in a couple of places...

Comment 3 Tiffany Antopolski 2011-10-03 19:19:10 UTC

Created attachment 198143
screenshot of problem caused by the bug

See attached screenshot for how these incorrect letters are rendered.

Comment 4 Ignacio Casal Quinteiro (nacho) 2011-10-03 19:20:20 UTC

Feel free to provide a patch for it ;) If not, I can have a look at it next week.

Comment 5 Ignacio Casal Quinteiro (nacho) 2011-10-09 18:48:40 UTC

*** Bug 637850 has been marked as a duplicate of this bug. ***

Comment 6 Ignacio Casal Quinteiro (nacho) 2011-10-09 18:53:07 UTC

Created attachment 198664
trans memory

Please test if this fixes your problem. You should check if it is stored correctly and if you can remove entries from the translation memory.

Comment 7 Neil Roberts 2011-10-09 20:06:41 UTC

I wonder if it would be better not to do the normalization at all. Ignacio, could you perhaps explain why it does this? If the translations are coming from already existing .po files, then it doesn't seem right to alter the string at all.

Comment 8 Ignacio Casal Quinteiro (nacho) 2011-10-09 20:19:26 UTC

To be honest, that's one of the few parts of gtranslator that I didn't write myself, so I can't really answer that.

Comment 9 Neil Roberts 2011-10-09 22:36:47 UTC

Ok. If we do go with that patch, it should probably be changed to G_NORMALIZE_DEFAULT_COMPOSE instead of G_NORMALIZE_ALL_COMPOSE because the latter also converts ellipses to dots and superscript digits to normal digits etc.

Comment 10 Seán de Búrca 2011-10-10 06:41:28 UTC

I'd have to look at the code more closely to see if we want to normalize at all, but assuming we do, G_NORMALIZE_DEFAULT_COMPOSE is what we want. NFKC and NFKD forms shouldn't be stored in this sort of instance.

Comment 11 Ignacio Casal Quinteiro (nacho) 2011-10-10 07:51:52 UTC

Ok, so from the docs, and I guess this is the reason it was normalized in the first place: "You should generally call g_utf8_normalize() before comparing two Unicode strings". If anybody feels like making this change, go ahead, although I'd like some testing first.

Comment 12 Neil Roberts 2011-10-10 08:41:09 UTC

Where is it doing string comparisons on the translated value? The only case I can see is where it uses a query matching both the translation and the msgid to remove the translation. However in that case the string is coming from the GtkTreeView which should already have the exact value copied from the database so normalization should not be needed there.

It seems to me that if it turns out we really do need normalization for string comparison then it should store both the original and normalized values, or try to calculate it just before doing the comparison because modifying the string at all seems wrong.
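
The suggestion above (keep the stored string untouched and normalize only at comparison time) can be sketched as follows. This is not gtranslator's code: `toy_compose` is a hypothetical stand-in for a real normalizer such as `g_utf8_normalize()` with `G_NORMALIZE_DEFAULT_COMPOSE`, and it only handles the twelve Esperanto letters from the report so that the example builds with the C standard library alone. The names `compose_rule`, `rules`, and `equal_when_composed` are likewise assumptions made for illustration.

```c
#include <stdlib.h>
#include <string.h>

/* One row per Esperanto letter: the ASCII base letter, the second UTF-8
 * byte of the combining mark (0x82 for U+0302 circumflex, 0x86 for U+0306
 * breve; the first byte is 0xCC for both), and the precomposed letter. */
struct compose_rule {
    char base;
    unsigned char mark;
    const char *composed;
};

static const struct compose_rule rules[] = {
    {'c', 0x82, "\xC4\x89"}, {'C', 0x82, "\xC4\x88"},
    {'g', 0x82, "\xC4\x9D"}, {'G', 0x82, "\xC4\x9C"},
    {'h', 0x82, "\xC4\xA5"}, {'H', 0x82, "\xC4\xA4"},
    {'j', 0x82, "\xC4\xB5"}, {'J', 0x82, "\xC4\xB4"},
    {'s', 0x82, "\xC5\x9D"}, {'S', 0x82, "\xC5\x9C"},
    {'u', 0x86, "\xC5\xAD"}, {'U', 0x86, "\xC5\xAC"},
};

/* Toy normalizer: returns a newly allocated copy of `s` in which the twelve
 * decomposed Esperanto letters are replaced by their precomposed forms.
 * Returns NULL if allocation fails. The caller frees the result. */
char *toy_compose(const char *s)
{
    size_t len = strlen(s);
    char *out = malloc(len + 1);   /* composing never lengthens the text */
    size_t i = 0, o = 0;

    if (out == NULL)
        return NULL;

    while (i < len) {
        const struct compose_rule *hit = NULL;

        /* Look for "base letter, 0xCC, mark byte" starting at position i. */
        if (len - i >= 3 && (unsigned char)s[i + 1] == 0xCC) {
            for (size_t r = 0; r < sizeof rules / sizeof rules[0]; r++) {
                if (s[i] == rules[r].base &&
                    (unsigned char)s[i + 2] == rules[r].mark) {
                    hit = &rules[r];
                    break;
                }
            }
        }
        if (hit != NULL) {
            out[o++] = hit->composed[0];
            out[o++] = hit->composed[1];
            i += 3;
        } else {
            out[o++] = s[i++];
        }
    }
    out[o] = '\0';
    return out;
}

/* Compares two strings in composed form without modifying either input.
 * Returns 1 if equal, 0 if different, -1 on allocation failure. */
int equal_when_composed(const char *a, const char *b)
{
    char *na = toy_compose(a);
    char *nb = toy_compose(b);
    int result = -1;

    if (na != NULL && nb != NULL)
        result = strcmp(na, nb) == 0;
    free(na);
    free(nb);
    return result;
}
```

With this shape, `equal_when_composed("\xC4\x89", "c\xCC\x82")` returns 1 while both inputs keep their original bytes, so whatever was read from the .po file is also what gets stored and returned.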

Comment 13 Ignacio Casal Quinteiro (nacho) 2011-10-10 08:44:25 UTC

Ok, you convinced me. Maybe we can try removing the normalization and check that we do not get any problems?

Comment 14 Neil Roberts 2011-10-10 13:41:50 UTC

For what it's worth, I think the fact that we can see the problem at all is because of this bug in the font Cantarell:

https://bugzilla.gnome.org/show_bug.cgi?id=637066

The other fonts seem to display the decomposed characters correctly.

Comment 15 Jiro Matsuzawa 2011-10-10 14:02:54 UTC

Hi all,

I am not familiar with the Gtranslator implementation, so I am not sure how to fix the implementation.
But I think comparing normalized strings is not a problem in itself.
The real problem is that the string a TM (translation memory) returns differs from the original string that was stored in it.

Gtranslator should output the same string as the original.
It may be good to use the normalization function only when comparing.

Thanks,

Comment 16 Ignacio Casal Quinteiro (nacho) 2011-10-11 21:24:29 UTC

(In reply to comment #14)
> For what it's worth, I think the fact that we can see the problem at all is
> because of this bug in the font Cantarell:
> 
> https://bugzilla.gnome.org/show_bug.cgi?id=637066
> 
> The other fonts seem to display the decomposed characters correctly.

So is this actually a gtranslator problem or not?

Comment 17 Seán de Búrca 2011-10-11 22:51:46 UTC

It's a gtranslator problem which is highlighted by a problem in Cantarell.

Comment 18 Neil Roberts 2011-10-12 10:36:54 UTC

Created attachment 198841
Don't normalize translations stored in the translation memory

Yeah, I think it's still worth fixing in gtranslator, if only because
storing the decomposed characters is less efficient :) It looks like
the problem is different and worse for Japanese according to bug
637850, because those two strings look different to me even though the
font isn't Cantarell. Here is a patch if that helps.

Comment 19 Ignacio Casal Quinteiro (nacho) 2011-10-12 12:02:05 UTC

Review of attachment 198841:

Looks good.

Comment 20 Neil Roberts 2011-10-12 14:32:40 UTC

Ok, thanks. I've pushed it as c1c77f1