Bug 524077 - ASCII characters > ord(127) are not extracted correctly from jpeg EXIF ImageDescription and UserComment
Status: RESOLVED INCOMPLETE
Product: beagle
Classification: Other
Component: General
Version: 0.2.18
OS: Other Linux
Priority: Normal
Severity: normal
Target Milestone: ---
Assigned To: Beagle Bugs
QA Contact: Beagle Bugs
Depends on: 524214
Blocks:
 
 
Reported: 2008-03-24 07:58 UTC by Karsten Rasmussen
Modified: 2009-02-05 19:34 UTC
See Also:
GNOME target: ---
GNOME version: ---


Attachments
Test image with danish letters (æøåÆØÅ) in description and user comment (3.14 KB, image/jpeg)
2008-03-25 18:31 UTC, Karsten Rasmussen
Description of actual (and wanted) output from beagle-extract-contents of test.jpg (59.67 KB, application/pdf)
2008-03-25 18:35 UTC, Karsten Rasmussen
Use correct encoding for ImageDescription and UserComment and don't break other tags (2.83 KB, patch)
2008-03-26 16:25 UTC, Debajyoti Bera

Description Karsten Rasmussen 2008-03-24 07:58:05 UTC
beagle-extract-content does not extract ASCII characters > ord(127) correctly from jpeg files. E.g. danish letters (æøåÆØÅ) are removed from the output.

According to http://www.exif.org/Exif2-1.PDF:

IFD0.ImageDescription is always ASCII.
EXIF.UserComment's charset can be ASCII/JIS/Unicode/Undefined.
I assume the ASCII codepage should be determined based on ENV settings.

EXIF.UserComment contains information about the charset used.
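Per the Exif spec referenced above, UserComment starts with an 8-byte character code naming the charset of the payload that follows. A minimal Python sketch (not Beagle's actual C# extractor; the charset-to-codec mapping and the latin-1 fallback are assumptions) of how an extractor could honor that header:

```python
# Map the 8-byte Exif UserComment character codes to Python codecs.
EXIF_CHARSETS = {
    b"ASCII\x00\x00\x00": "ascii",
    b"JIS\x00\x00\x00\x00\x00": "shift_jis",  # assumption: JIS decoded as Shift-JIS
    b"UNICODE\x00": "utf-16",                 # byte order follows the TIFF header in practice
    b"\x00" * 8: None,                        # undefined: caller picks a fallback
}

def decode_user_comment(raw: bytes, fallback: str = "latin-1") -> str:
    """Split off the 8-byte charset marker and decode the payload."""
    header, payload = raw[:8], raw[8:]
    charset = EXIF_CHARSETS.get(header)
    if charset is None:
        charset = fallback  # undefined or unrecognized marker
    if charset == "ascii":
        # "ASCII" is often abused for 8-bit national charsets (this very bug),
        # so decode leniently instead of dropping bytes > 127.
        try:
            return payload.decode("ascii")
        except UnicodeDecodeError:
            return payload.decode(fallback)
    return payload.decode(charset, errors="replace")

comment = b"ASCII\x00\x00\x00" + "æøåÆØÅ".encode("latin-1")
print(decode_user_comment(comment))  # æøåÆØÅ
```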
Comment 1 Debajyoti Bera 2008-03-24 21:49:50 UTC
Can you attach a sample image, expected and actual output that I can test against? Thanks.
Comment 2 Karsten Rasmussen 2008-03-25 18:31:55 UTC
Created attachment 108010 [details]
Test image with danish letters (æøåÆØÅ) in description and user comment
Comment 3 Karsten Rasmussen 2008-03-25 18:35:45 UTC
Created attachment 108011 [details]
Description of actual (and wanted) output from beagle-extract-contents of test.jpg
Comment 4 Debajyoti Bera 2008-03-26 16:25:17 UTC
Created attachment 108070 [details] [review]
Use correct encoding for ImageDescription and UserComment and don't break other tags

Can you test if this patch works correctly?
Comment 5 Karsten Rasmussen 2008-03-30 11:09:37 UTC
This patch did not work on my installation.

I can see the code uses System.Text.Encoding.Default.
But where does Mono get this setting from?

To quote myself:
>>I assume the ASCII codepage should be determined based on ENV settings
Maybe it is not that easy. My env says:
   LANG=en_DK.UTF-8

But does Fedora have a setting for the ASCII codepage, now that ASCII is no longer the default?

Comment 6 Debajyoti Bera 2008-03-30 12:52:45 UTC
No. In fact, you can use any encoding for a filename, its contents, or any allowed metadata. There are only a few specific ways to know what encoding was used:
- the system default encoding (en_DK.UTF-8 in your case)
- where the encoding is specified in the metadata spec or in the metadata itself

But there will always be errors. E.g. if you received a file from someone with data in a different encoding, it is not possible to find out what encoding it is. Some metadata could also be in a different encoding than the rest, so there is really no way to deal with them all unless everything is in UTF-8 or in your system encoding.

You can try
$ LANG=en_DK.ISO-8859-1 beagle-extract-content /...
to see if that makes any difference.
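Why switching LANG matters: the same 8-bit bytes decode differently, or not at all, depending on the assumed encoding. A quick Python illustration (the document's own code is C#; this is just a sketch of the underlying problem):

```python
# The Danish letters æøå as ISO-8859-1 bytes: 0xE6 0xF8 0xE5.
raw = "æøå".encode("iso-8859-1")

# Under a Latin-1 locale the bytes round-trip correctly.
print(raw.decode("iso-8859-1"))  # æøå

# Under a UTF-8 locale the same bytes are simply invalid,
# which is why the extractor dropped them instead of emitting them.
try:
    raw.decode("utf-8")
except UnicodeDecodeError as e:
    print("utf-8 decode failed:", e.reason)
```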
Comment 7 Karsten Rasmussen 2008-04-01 05:13:14 UTC
The following makes the patch work for me:
   export LANG=en_DK.ISO-8859-1

PS: Windows XP has a system setting "how should I handle non-unicode programs" where it is possible to assign an ASCII codepage. This works in 99.9 % of cases if your organisation is domestic. It allows a smooth transition from ASCII to unicode, with no need for converting (tampering with) old data files to unicode.

A similar setting would be nice.

(The EXIF.UserComment in the test jpeg file is marked as ASCII, but it is not possible to assign a codepage.)
Comment 8 Debajyoti Bera 2008-04-01 12:45:46 UTC
> PS: Windows XP has a system setting "how should I handle non-unicode programs"
> where it is possible to assign an ASCII codepage. This works in 99.9 % of
> cases if your organisation is domestic. It allows a smooth transition from
> ASCII to unicode, with no need for converting (tampering with) old data files
> to unicode.
> 
> A similar setting would be nice.

I don't know why Linux does not have such things. Maybe it was historically not needed.

We could add an environment variable BEAGLE_LANG_ASCII_CODEPAGE to specify the default codepage for ASCII (ANSI, if unspecified). But such things will always break something else. E.g. other apps will not be able to show the right information even though we extract it correctly. And there will always be files in a different encoding that are completely misread if the default encoding is used.

I want to post this question on the mailing list and see what other suggestions people have. Hope you don't mind.
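The proposed BEAGLE_LANG_ASCII_CODEPAGE variable was never implemented in this form; a Python sketch of the idea (the variable name comes from the comment above, but the helper names and the latin-1 default are assumptions):

```python
import os

def ascii_codepage() -> str:
    """Codepage to use for nominally-ASCII EXIF fields.

    Falls back to latin-1 when the variable is unset, standing in for
    the "ANSI, if unspecified" default suggested in the comment.
    """
    return os.environ.get("BEAGLE_LANG_ASCII_CODEPAGE", "latin-1")

def decode_ascii_field(raw: bytes) -> str:
    # errors="replace" keeps indexing alive even for a wrong codepage guess.
    return raw.decode(ascii_codepage(), errors="replace")

os.environ["BEAGLE_LANG_ASCII_CODEPAGE"] = "iso8859-1"
print(decode_ascii_field(b"\xe6\xf8\xe5"))  # æøå
```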
Comment 9 Debajyoti Bera 2008-07-22 15:18:52 UTC
I forgot to update this bug. The last several releases (since 0.3.6, I believe) ship updated f-spot image importers which handle non-UTF-8 encodings better than the previous approach. Unspecified encodings still fall back to the system default encoding, but user comments and image descriptions with a different encoding are now handled correctly.

That should fix your original problem. Can you check and report back?
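Beagle's actual fix lives in its C# f-spot importers; a common shape for this kind of handling (an assumption sketched in Python, not the real implementation) is a decode cascade: try strict UTF-8 first, then the system default encoding, then a lossless 8-bit fallback:

```python
import locale

def decode_with_fallback(raw: bytes) -> str:
    """Try strict UTF-8, then the system encoding, then latin-1.

    latin-1 maps every byte to a character, so the cascade never
    drops bytes > 127 the way the original extractor did.
    """
    for enc in ("utf-8", locale.getpreferredencoding(False), "latin-1"):
        try:
            return raw.decode(enc)
        except (UnicodeDecodeError, LookupError):
            continue
    return raw.decode("latin-1", errors="replace")  # unreachable safety net

print(decode_with_fallback("æøå".encode("utf-8")))  # æøå
```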
Comment 10 Tobias Mueller 2009-02-05 19:34:43 UTC
Closing this bug report as no further information has been provided. Please feel free to reopen this bug if you can provide the information asked for.
Thanks!