Bug 669448 – Use universal encoding auto-detection (e.g. universalchardet by mozilla)

After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.

Bug 669448 - Use universal encoding auto-detection (e.g. universalchardet by mozilla)


Summary:	Use universal encoding auto-detection (e.g. universalchardet by mozilla)


Status:	RESOLVED FIXED

Product:	gtksourceview
Classification:	Platform
Component:	File loading and saving
Version:	unspecified
Hardware:	Other Linux

Importance:	Normal enhancement
Target Milestone:	---
Assigned To:	GTK Sourceview maintainers
QA Contact:	GTK Sourceview maintainers

URL:	https://github.com/BYVoid/uchardet/
Whiteboard:

Depends on:
Blocks:

Reported:	2012-02-06 07:58 UTC by Ma Hsiao-chun
Modified:	2016-08-28 19:09 UTC

See Also:
GNOME target:	---
GNOME version:	---

Attachments
Add uchardet dependency. (4.22 KB, patch) 2015-09-03 19:43 UTC, Jehan	none	Details \| Review
New patch. (5.51 KB, patch) 2015-09-05 14:13 UTC, Jehan	none	Details \| Review
Side patch: fix fallback encoding detection. (1.38 KB, patch) 2015-09-05 14:40 UTC, Jehan	none	Details \| Review
Version 3! (5.40 KB, patch) 2015-09-05 16:06 UTC, Jehan	reviewed	Details \| Review
guess_encoding_fallback() return the last encoding even if it fails. (1.31 KB, patch) 2015-11-17 19:34 UTC, Jehan	needs-work	Details \| Review
File containing invalid byte \C0 (7 bytes, application/octet-stream) 2015-12-05 18:52 UTC, Sébastien Wilmet		Details

Description Ma Hsiao-chun 2012-02-06 07:58:05 UTC

The current mechanism seems to be trying the predefined encodings one by one.

gsettings get org.gnome.gedit.preferences.encodings auto-detected
['UTF-8', 'CURRENT', 'ISO-8859-15', 'UTF-16']

Let's consider Linux Users in China as an example. The problem applies to Hong Kong and Taiwan as well.

The Linux Users in China are likely to use zh_CN.UTF-8 locale.
However, Simplified Chinese version of Windows, the Windows version for China, use GBK or GB18030.

So, if a Windows user distribute a plain text file contains Chinese characters, a Linux User using gedit cannot get correct result by default.

Test cases for the above statement:
https://github.com/maxiaojun/gedit3-encoding-helper/blob/master/test_gbk.txt
https://bugzilla.redhat.com/attachment.cgi?id=401937

Manual encoding conversion by other tools is not a desirable solution, right?

Some people manually add GBK or GB18030 between 'CURRENT' and 'ISO-8859-15', this works. However, I wonder whether we'd do this by default and provide a visible disable option, because zh_CN already gives us a clue. zh_CN implies GBK or GB18030 is probable. zh_HK implies Big5hkscs is probable. zh_TW implies big5 is probable. Other languages may use the same principle as well. I believe that China users meet more files in their own language rather than in non-English western European languages. So when gedit choose 'ISO-8859-15', it is unlikely it is a success. This approach may be short-term fix.

As some people tell me, the above solution is not elegant enough. That's it. But we can do better. Mozilla people already devised a universal encoding detection mechanism 10 years ago.
http://mxr.mozilla.org/mozilla/source/extensions/universalchardet/
http://www.mozilla.org/projects/intl/UniversalCharsetDetection.html (dead)
http://code.google.com/p/universal-encoding-detector/
http://code.google.com/p/ude/
...

If we can employ such mechanism, user are unlikely to see broken documents. Indded, not every one has a clear understanding of encoding.

Comment 1 Ignacio Casal Quinteiro (nacho) 2012-02-06 08:06:46 UTC

gedit provides this list of encodings to the translators. If you don't get the right list of encodings running gedit in Chinese then there is a high probability that there is a bug in the translation. Please let us know if you run gedit in Chinese or if you run it in another language.

Comment 2 Ma Hsiao-chun 2012-02-06 08:18:12 UTC

My Fedora 16 (yum updated the other day) use locale en_US.utf8

https://github.com/maxiaojun/gedit3-encoding-helper/blob/master/test_gbk.txt
denoted f1
https://bugzilla.redhat.com/attachment.cgi?id=401937
denoted f2

#PASS
LANG=C gedit --encoding gbk file.txt 

#I expect PASS here because 'CURRENT' is correct encoding already.
#The result is, however, f1 PASS while f2 FAIL.
LANG=zh_CN.gbk gedit file.txt
LANG=zh_CN.gb18030 gedit file.txt

#FAIL
#I'd like to fix this case.
#I'd point out that zh_CN.utf8 is the default locale for zh_CN for all the distributions I know.
LANG=zh_CN.utf8 gedit file.txt

In the second and third invocation, the gedit is already in Chinese UI.

Comment 3 Ignacio Casal Quinteiro (nacho) 2012-02-06 08:22:26 UTC

Is gb18030 a supported encoding by iconv?

Comment 4 Ma Hsiao-chun 2012-02-06 08:28:39 UTC

Yes.
http://www.gnu.org/software/libiconv/

Comment 5 Ignacio Casal Quinteiro (nacho) 2012-02-06 08:31:11 UTC

Would be great if you could have a look at our unit test http://git.gnome.org/browse/gedit/tree/tests/document-output-stream.c and provide one unit test that fails with this case

Comment 6 Ma Hsiao-chun 2012-02-06 10:16:50 UTC

The following is my intuition of your test framework, please correct me.

test_consecutive_write() tests newline handling and so. It uses UTF-8 exclusively.
do_test() can test handling of a particular encoding or test encoding guessing based on a given list.

I'm still seeking for `hard` test cases. f2 failed even when the candidate list contains gbk/gb18030, but, you know, it is quite long.

What I hope is, however, employing more sophisticated mechanism mentioned in my origin post. List-based approach has apparent limitations. Adding 'GBK' or 'GB18030' before 'ISO-8859-15' can fix GB-encoded files. But this action may block real 'ISO-8859-15' files.

Comment 7 Ma Hsiao-chun 2012-02-07 05:14:48 UTC

I filed a new bug regarding GB family encoding guessing issue.
To my experience, such issues only occur in gedit3. Gedit2 is fine in terms of static list-based method. 
https://bugzilla.gnome.org/show_bug.cgi?id=669525

This bug is actually, a feature request. My hope, as the title suggest, is that gedit can add universal encoding detection mechanism found in Mozilla Firefox or Google Chrome. Static list-based method is limited. We have to trade one encoding for another.

Comment 8 Ma Hsiao-chun 2012-02-08 15:02:10 UTC

FYI, Leafpad, http://tarot.freeshell.org/leafpad/, a seemingly boring text editor, has some built-in encoding detection mechanism. The mechanism respect locale settings.

Maybe you can be inspired by Leafpad.

Comment 9 Ma Hsiao-chun 2012-02-15 04:22:54 UTC

For locale-based method, as used in leafpad, I filed a new bug.
https://bugzilla.gnome.org/show_bug.cgi?id=670117

All my bug reports related to gedit is still in UNCONFIRMED status.

When people realize the existence of the problem...you can choose between locale-based and universal...I will just wait~

Comment 10 Ignacio Casal Quinteiro (nacho) 2012-02-15 07:11:44 UTC

For the record the gedit team does not make a difference between unconfirmed and new bugs. For the record we know the encodings support is not perfect but also any of the gedit maintainers have any problem with encodings as we are all of us using UTF-8. If any of you would like to help on this would be great if not, I think there will not any chance soon for working on this.
We are mainly working on gedit as a hobby and to be honest after having invested so much time last year on encodings I think it will not happen again for quite a long time. At least from me :)
Anyway as said feel free to jump in the #gedit channel to ask questions about gedit internals and we can help you to implement it correctly.

Comment 11 André Klapper 2012-02-15 10:02:31 UTC

(In reply to comment #9)
> All my bug reports related to gedit is still in UNCONFIRMED status.

There is no difference between UNCONFIRMED and NEW, see bug 658470.

Comment 12 Ma Hsiao-chun 2012-04-26 03:38:47 UTC


*** This bug has been marked as a duplicate of bug 647935 ***

Comment 13 Ma Hsiao-chun 2013-11-10 03:57:34 UTC

bug 647935 went into another direction, no longer a dup.

Comment 14 Jehan 2015-09-03 19:43:44 UTC

Created attachment 310619 [details] [review]
Add uchardet dependency.

The attached patch adds an optional dependency to uchardet which is a C binding to universal charset detector by Mozilla.

Gedit with this added dependency will first test a file through uchardet and only fallback to the current algorithm if uchardet was failing.

Comment 15 Jehan 2015-09-03 19:44:47 UTC

By the way, tested with gedit with various Korean files.
I can provide Korean or Japanese files if needed for tests.

Comment 16 Jehan 2015-09-03 19:46:27 UTC

Oh and for reference:
uchardet: https://github.com/BYVoid/uchardet/

It has been implemented in mpv too. And that works far better than the current gtksourceview code, or enca or libguess.

Comment 17 Sébastien Wilmet 2015-09-05 11:59:16 UTC

Review of attachment 310619 [details] [review]:

Thanks for the patch! I didn't know the existence of uchardet. It looks like a small dependency (in terms of number of symbols exported). So for apps not using the file loading of GtkSourceView, this would not be a big problem to have one more dependency.

Instead of adding code to gtksourcebufferoutputstream.c (which is already quite complex, as the comment "welcome to a really big headache" suggests), I think it's better to have a utility function (by creating gtksourcefileutils.c, for instance), or keep the BufferOutputStream only for the GtkSourceView algorithm.

Also, with the GtkSourceFileLoader API, we can provide the list of candidate encodings with gtk_source_file_loader_set_candidate_encodings(). If uchardet is used, this bypasses the provided list. The documentation should at least be updated to say that it's a fallback list used if the universal character detection fails (or if the latter is disabled).

In gedit, to configure the encodings there is a dialog window to choose the candidate encodings by priority (it has changed during the 3.17 cycle). So if uchardet is used, the UI doesn't reflect the reality. So again, maybe the UI needs to be adapted and the documentation updated, but it's for the gedit side (it's better to open a new bug for gedit, once this bug is fixed).

::: configure.ac
@@ +142,3 @@
+	have_uchardet="no (disabled)"
+else
+	PKG_CHECK_MODULES(UCHARDET, uchardet >= 0.0.1,

For dependencies, there are variables at the top of configure.ac with the version numbers, so it's better to add uchardet_req.

And now AX_PKG_CHECK_MODULES needs to be used instead (with uchardet as a private dependency).

Comment 18 Sébastien Wilmet 2015-09-05 12:48:18 UTC

Review of attachment 310619 [details] [review]:

Actually the added code is not too complicated. Only the guess_encoding() function is modified, so the change is not invasive.
Maybe you could split guess_encoding() with two sub-functions, one that uses uchardet, and the other that contains the fallback, so guess_encoding() would be clearer.
The two sub-functions can be named guess_encoding_with_uchardet() and guess_encoding_fallback().

What do you think?

Comment 19 Sébastien Wilmet 2015-09-05 12:52:04 UTC

(In reply to Sébastien Wilmet from comment #17)
> It looks like
> a small dependency (in terms of number of symbols exported). So for apps not
> using the file loading of GtkSourceView, this would not be a big problem to
> have one more dependency.

Of course it's at least _two_ more dependencies, since there is also the C++ library. Do you know if it needs to be linked differently, since a C++ library is used?

Comment 20 Jehan 2015-09-05 14:13:14 UTC

Created attachment 310700 [details] [review]
New patch.

> so it's better to add uchardet_req.

Done.

> And now AX_PKG_CHECK_MODULES needs to be used instead (with uchardet as a private dependency).

Even after reading gnu.org docs for this macro, I did not understand what are the difference between a public and private module. What is the difference?
In any case, I have also made this change.

> The two sub-functions can be named guess_encoding_with_uchardet() and guess_encoding_fallback().

> What do you think?

I think it makes sense. Not sure if making a new file just for this was really worthwhile (unless you were planning to move or create more stuff in there soon).
I have done the change.

> Of course it's at least _two_ more dependencies, since there is also the C++ library.

Actually the uchardet package contains both the C++ and the C lib. The C++ library was part of the bigger Mozilla project and apparently didn't have its own independent building system, so the author of uchardet extracted it from Mozilla repo and added the C binding. So that's a single dependency all by itself.

> Do you know if it needs to be linked differently, since a C++ library is used?

As far as I know, it has to be considered as any other C library from gtksourceview point of view since the uchardet project took care of the binding.
There was a warning with the uchardet.h include, but I am taking care of it with a bug report upstream: https://github.com/BYVoid/uchardet/pull/9

Comment 21 Jehan 2015-09-05 14:40:23 UTC

Created attachment 310701 [details] [review]
Side patch: fix fallback encoding detection.

While I am at it, I discovered the original guess_encoding() (now guess_encoding_fallback()) had a bug: it loops through a list of encoding (I imagine, the list of candidates you were talking about) and test them one by one. The problem is that if all candidates were failing, the function was still returning the last tested one (even when the test failed, so it cannot be this one!).

You don't want this. You want NULL to be returned to let the user know that we are unable to detect the encoding, rather than just returning some random encoding that we know for sure will fail and display garbage data.

I could have proposed a separate bug report, but well… it's in the same portion of code and straightforward enough for me to propose this side patch here too. Sorry if that's a problem.

Comment 22 Sébastien Wilmet 2015-09-05 15:34:33 UTC

(In reply to Jehan from comment #20)
> Even after reading gnu.org docs for this macro, I did not understand what
> are the difference between a public and private module. What is the
> difference?

It should be documented for pkg-config. A public dependency is when some data types of that library are present in the headers. A private dependency is used only internally, so is not needed to compile an application.

It was explained here:
https://tecnocode.co.uk/2014/12/09/a-checklist-for-writing-pkg-config-files/

> Actually the uchardet package contains both the C++ and the C lib. The C++
> library was part of the bigger Mozilla project and apparently didn't have
> its own independent building system, so the author of uchardet extracted it
> from Mozilla repo and added the C binding. So that's a single dependency all
> by itself.

Alright.

> > Do you know if it needs to be linked differently, since a C++ library is used?
> 
> As far as I know, it has to be considered as any other C library from
> gtksourceview point of view since the uchardet project took care of the
> binding.
> There was a warning with the uchardet.h include, but I am taking care of it
> with a bug report upstream: https://github.com/BYVoid/uchardet/pull/9

Ok thanks.

Comment 23 Sébastien Wilmet 2015-09-05 15:42:01 UTC

Review of attachment 310700 [details] [review]:

Some comments below.

::: configure.ac
@@ +42,3 @@
 libxml_req=2.6.0
 gladeui_req=3.9
+uchardet_req=0.01

Is it 0.0.1 or 0.1?

::: gtksourceview/gtksourcebufferoutputstream.c
@@ +360,3 @@
 static GCharsetConverter *
+guess_encoding_with_uchardet (G_GNUC_UNUSED
+			      GtkSourceBufferOutputStream *stream,

It's better to remove the parameter altogether, so we know in guess_encoding() that guess_encoding_with_uchardet() doesn't use the self variable.

@@ +486,3 @@
+
+	if (conv == NULL)
+#endif /* HAVE_UCHARDET */

I would put the #endif before the if. conv is anyway initialized to NULL.

@@ +489,3 @@
+	  {
+	    conv = guess_encoding_fallback (stream, inbuf, inbuf_size);
+	  }

indentation problem (but keep the curly braces, that's fine).

Comment 24 Jehan 2015-09-05 16:06:16 UTC

Created attachment 310704 [details] [review]
Version 3!

> Is it 0.0.1 or 0.1?

Oups! It's 0.0.1! Well at least this typo made me realize of another bug when I wondered "why did the pkg-config succeed then?!", the answer being that I forgot the dollar before uchardet_req! Fixed.

> It's better to remove the parameter altogether, so we know in guess_encoding() that guess_encoding_with_uchardet() doesn't use the self variable.

Ok. That's what I would have done for my own code, but I was unsure if you didn't want to always have the class object as first argument if possible.

> I would put the #endif before the if. conv is anyway initialized to NULL.

Ok. I wanted to avoid the unnecessary condition test when building without uchardet. But well… you are right it's clearer this way.

> indentation problem (but keep the curly braces, that's fine).

Fixed. As are all the previous comments.

Comment 25 Sébastien Wilmet 2015-09-05 17:20:43 UTC

(In reply to Jehan from comment #24)
> Ok. I wanted to avoid the unnecessary condition test when building without
> uchardet. But well… you are right it's clearer this way.

It's clearer, and the compiler will anyway optimize the code when the #ifdef section isn't present. Micro-optimizations should be done only if it's really necessary. The data structures used, the algorithms' complexity and the general code architecture are the primary factors for code execution time. One more O(1) operation not often executed makes no differences at all.

Comment 26 Sébastien Wilmet 2015-09-05 17:25:42 UTC

Review of attachment 310704 [details] [review]:

Looks good, but I haven't tested yet.
The GtkSourceFileLoader documentation still needs to be updated, but this can be done for another patch.
The README file should also be updated to explain that there is an optional dependency.
There are maybe a few other places where the documentation needs to be updated. Maybe in GtkSourceEncoding (for getting the default candidates) and GtkSourceFile.

Comment 27 Jehan 2015-09-05 17:43:13 UTC

I'm not really aware of what the "accepted-commit_after_freeze" means. Does it mean I can push my patch right now? Or that will be after a version freeze only (when is it?)? Or will you do it?

> The README file should also be updated to explain that there is an optional dependency.

I can do it, no problem. Should I?
Or you will do it?

> There are maybe a few other places where the documentation needs to be updated. 

Looks in particular as gtk_source_file_loader_set_candidate_encodings() docs could be updated. I can also do it if needed.

Comment 28 Sébastien Wilmet 2015-09-06 09:23:25 UTC

Yes we're in freeze, GNOME 3.18 will be released soon:
https://wiki.gnome.org/ThreePointSeventeen

So the patch can be pushed on master only when the gnome-3-18 branch is created.

For updating the documentation, yes it's better if you propose a patch.

Comment 29 Sébastien Wilmet 2015-09-06 12:26:06 UTC

The only thing that hampers using uchardet is that it seems that there is no releases, no tarballs, no tags in git. I've filed a bug at:

https://github.com/BYVoid/uchardet/issues/10

If uchardet is not well maintained, I don't know if it's a good idea to use it. But well, it can anyway be disabled in GtkSourceView at compile time.

Comment 30 Jehan 2015-09-06 13:35:58 UTC

I agree. The problem is that there are no alternative that I know of. Enca and libguess are kind of crappy (even the few encodings they were supposed to be able to detect, they often failed miserably in my tests!).

Just "testing" encoding until one succeeds (as was doing gtksourceview) is often giving wrong results as well because many encoding use similar coding; especially for single byte encoding, they nearly all basically have the same range of possible encoding. Which is why (I guess) libguess API has mandatory language hinting, and Enca is not able to recognize any single-byte encoding without language hinting (in other words: it is also mandatory for full detection).
So gtksourceview current method would work only by mere chance and because gedit (or other software) would ask users to choose a list of valid encoding. But even so, it will often fail. Let's say I often receive files in ISO-8859-1 and ISO-8859-15 encoding (both very likely for French speaking for instance!), I would add both in my list. Well they have exactly the same range of characters, so any file from one encoding will be recognized by a test in the other. In other word: whatever is first in the list will always be returned first (making it useless to add them both as candidates, and likely any other ISO-8859, or even most single byte encodings). And paf! You get garbled text.

Now I was also annoyed by the low development of uchardet. Actually it seems that Mozilla have basically abandoned themselves the universalchardet in Firefox, even though they still keep the lib around. That's what I discovered in a bug report: https://bugzilla.mozilla.org/show_bug.cgi?id=1105839#c2
They apparently now use detectors per-language, which may make more sense on the web anyway where web pages have metadata for lang and charset, national domain names, etc. Much less for a desktop where you can just receive files from anyone in any encoding with few metadata to help with hinting.

Yet that's what works the best for now (in my research to find a good replacement! I'd be happy to be proven wrong) and that's quite a stable algorithm which should work well until we find better. Here for the paper describing the algorithm: http://www-archive.mozilla.org/projects/intl/UniversalCharsetDetection.html
Basically it mixes the "testing" method ("coding scheme method" in the paper) to statistics methods (character distribution and 2-character distribution). With the right character distribution data, you therefore can differentiate even single-byte encoding (and detect correctly between all ISO-8859 for instance!).

All this to say that I agree but I believe that we have no better. And this is really not invasive code (which means really easy to remove and replace the day we want to), so until we do have better + more maintained/active, I'd be in favor to keep such a dependency to at least have something working.

Also Debian, Ubuntu and derivative already have a uchardet package. Fedora has none, but I asked an active packager (Christopher Meng) if he could package uchardet; he said he would do it. mpv already added the feature. If gedit (through gtksourceview) does too, maybe others will follow.

My main point is that it has a sufficiently low print and the feature it brings is sufficiently huge that it's worth it, in my opinion.
For the background, I have lived in Korea, and in Japan. And dealing with files on Linux from inhabitants of these countries is just terrible. It never works out of the box, you *always* have to specify encoding manually. Just terrible.
This is why I decided to step in and provide patchs here and there. :-)

Comment 31 Sébastien Wilmet 2015-09-06 15:56:21 UTC

Ok, thanks for your research. It is indeed not an invasive change, so worst case we can remove the code and come back to the state before.

Maybe some application authors will not be happy with the extra dependency (I'm thinking about gnome-calculator for example, which uses GtkSourceView for the completion (and maybe for the undo/redo too)). If it's really problematic, we can dlopen uchardet.

Comment 32 Sébastien Wilmet 2015-10-01 15:56:36 UTC

Review of attachment 310701 [details] [review]:

Sorry for the late review of that side patch.

Are you sure the function should return NULL? Because the BufferOutputStream also handles invalid characters by replacing them by their equivalent hexadecimal value as a string.

Some other minor comments below.

::: gtksourceview/gtksourcebufferoutputstream.c
@@ +457,2 @@
 		/* Try to convert */
+		if ((success = try_convert (conv, inbuf, inbuf_size)))

In GtkSourceView we generally don't put several instructions on the same line.
It's clearer to have success = TRUE inside the block.

@@ +466,3 @@
 		g_converter_reset (G_CONVERTER (conv));
 	}
+	if (success == FALSE)

For booleans you should not compare to TRUE or FALSE. For FALSE you can use the not operator !.

Comment 33 Sébastien Wilmet 2015-10-11 11:48:37 UTC

Review of attachment 310704 [details] [review]:

The documentation still needs to be updated. Maybe when writing the doc we will see that the API is awkward, since it doesn't do what the API says (wrt candidate encodings).
Also, what happens if a user selects only one encoding? The list of candidate encodings has only one element, and the user expects the text editor to try only that one. E.g. the automatic detection fails, so the user selects a specific encoding in the infobar. BUT for the automatic detection, the list can also contain only one element, if the user removes everything except UTF-8 (see the user interface in gedit). So maybe GtkSourceFileLoader needs a new API to enable "smart" autodetection, i.e. not based on the list provided.

So I think the patch is not really ready to be merged as-is. As a first step, the patch is good though, it's better to write _other_ patches to keep the changes small in each commit.

Comment 34 Jehan 2015-10-11 11:57:17 UTC

Hi,

Thanks for both reviews. Sorry for not answering earlier to comment 32 but I was on some other project and did not have gtksourceview repo anymore on my laptop to test again and answer for sure your question.

I have some idea to improve the issue on comment 33, so I'll have to test first. I'll try to handle both reviews soon. :-)

Comment 35 Jehan 2015-11-17 19:34:20 UTC

Created attachment 315787 [details] [review]
guess_encoding_fallback() return the last encoding even if it fails.

Sorry for the late reply!

About the Review of attachment 310701 [details] [review]:

> Are you sure the function should return NULL? Because the BufferOutputStream also handles invalid characters by replacing them by their equivalent hexadecimal value as a string.

Yes, I am sure.
If you follow the code, you see that when guess_encoding_fallback() returns NULL, then guess_encoding() returns NULL, which in turns is verified in gtk_source_buffer_output_stream_write(), line 955 of gtksourceview/gtksourcebufferoutputstream.c.

> ostream->priv->charset_conv = guess_encoding (ostream, buffer, count);
>
> /* If we still have the previous case is that we didn't guess
>    anything */
> if (ostream->priv->charset_conv == NULL &&
>    !ostream->priv->is_utf8)
> {
>    g_set_error_literal (error, GTK_SOURCE_FILE_LOADER_ERROR,
>    GTK_SOURCE_FILE_LOADER_ERROR_ENCODING_AUTO_DETECTION_FAILED,
>    "It is not possible to detect the encoding automatically");
>
>    return -1;
> }

The problem with current code is that this error message will never be outputted because NULL is never returned, so that's dead code. It is a much user-friendlier behavior if a program clearly states it was not able to detect an encoding, rather than using an encoding that we know is wrong (which is the case when guess_encoding_fallback() fails all conversion tests), because when users start to work in the wrong encoding, it may end up a whole waste of time hours later when you discover it.

About the minor comments, they are fixed in the updated patch.

Comment 36 Sébastien Wilmet 2015-11-17 19:45:07 UTC

Review of attachment 315787 [details] [review]:

Looks good.

I see that you've become a new maintainer of uchardet, congrats!
So when there is a release of uchardet, I think we can merge the two patches here and see how it goes.

Comment 37 Jehan 2015-11-19 15:07:22 UTC

> I see that you've become a new maintainer of uchardet, congrats!

Yep. :-)

> So when there is a release of uchardet

I actually already made one a few days ago, but I discovered it had a regression from one of the contributed and recently merged patch.

So anyway I fixed it, improved the build a little (with tests, ability to not build the static lib…), and especially made sure that all the returned charset are iconv-compatible (they already mostly were but 2), which is probably one of the most interesting change for software.
Sore here is the release version 0.0.3.

https://github.com/BYVoid/uchardet/releases/tag/v0.0.3

Comment 38 Jehan 2015-11-19 15:16:53 UTC

Sébastien > so should I just push the 2 patches into your repository? Or will you do it yourself? Thanks.

Comment 39 Sébastien Wilmet 2015-11-19 15:24:54 UTC

I'll push the patches myself, after testing them.

Comment 40 Sébastien Wilmet 2015-11-26 16:38:07 UTC

Review of attachment 315787 [details] [review]:

I've modified the two commits and pushed them on:
https://git.gnome.org/browse/gtksourceview/log/?h=wip/uchardet

By default, it fails if uchardet is not installed, so all packagers will see the change. I prefer that than silently ignoring the dependency.

But the second commit breaks the unit tests when uchardet support is disabled! So either the unit tests must be adapted, or the change is wrong.

Comment 41 Jehan 2015-11-28 18:41:14 UTC

Wow, that was a little crazy to diagnose but I finally did it!

Since this is a feature branch, not master, I directly pushed my commits there. Hope that's ok.

[1] In my previous commit, I missed the fact that current code had a shortcut to directly return the first encoding if the list had only a single encoding (without even testing it!). This shortcut and my changes conflicted and produced an unexpected NULL return.

Before my commits (without uchardet), the algo was:

> if there is only a single encoding in the list:
>     do not even test it and return it (so it can be wrong, we don't care).
> else:
>     test the text against every encoding in the list.
>     return the first encoding which can be used without an error.
>     if no encoding matches (they are all wrong!):
>         return the first encoding of the list (even though we know for sure it is the wrong answer).

So the question is: do we really feel this is the right algorithm? I personally say this is absolutely not. The naive "testing encoding schemes" may return wrong answers a lot more than uchardet, but it has at least one good side: when it discards an algorithm, we are 100% sure it is *NOT* the right answer.
So returning the first encoding of the list, just out of spite, looks to me like the worse behavior.
This is up to a using application to react appropriately in case of detection failure (showing a proper GUI to the user asking to choose explicitly a charset, or fallbacking to the first of the list if they really want to, whatever). gtksourceview should not do this and hiding the detection failure.

I know we already discussed this point, and as far as I can see you agreed, but since I see explicit code about this now (before it looked only like an error), I reiterate my rational. :-)

Commit c886bd6 removes this logic of "use the first charset of the list out of spite".

[2] Now even with this fixed, the test was still failing at some other places!
In test_utf16_utf8() in tests/test-buffer-output-stream.c, we can see the following test:

> aux = do_test (text, "UTF-16", NULL, aux_len, 1, NULL);

We are trying to write by chunk of 1 byte! Since our test uses UTF-16 with 2 bytes from the start, it fails (even without uchardet). The wrong error was being used (G_CONVERT_ERROR_PARTIAL_INPUT instead of G_IO_ERROR_PARTIAL_INPUT).
This bug was overlooked until now only because no encoding test was ever done since UTF-16 was the only option in the list (cf. [1]).
Fixed with commit 199bbd7.

These 2 commits fixes the unit tests using --without-uchardet.

[3] The fact that gtk_source_buffer_output_stream_write() would try to guess the encoding at the first call is very problematic.
Unit tests with chunks of 1 byte (cf. [2]) is quite a good example actually. uchardet is not capable of detecting reliably a Multi-byte encoding with a single byte.

To be clear, the fallback system is not either. The only reason why this works in the unit test is because we give it no other choice (encoding list with 1 charset only) and because the current code has a lot of approximations.
Seriously the point [2] is quite symptomatic: with a single partial byte of the first character, we say "yeah ok, let's say it's UTF-16".

So test_big_char() fails with uchardet, because we are feeding it 2 bytes at a time (using two 3-byte characters. I tested, uchardet detect perfectly the UTF-8 character even with 3 bytes, but of course not with the incomplete bytes).
Fixed with commit a5250e8.

[4] Now with test_utf8_utf8(), we do an explicit "small chunk" test (line 317).
By giving only 2 characters "fo" to uchardet, we should not be surprised it returns "ASCII".
Now I have no idea how to deal with it. Uchardet is right: the file is ASCII when you feed it just "fo". Now I can see 2 ways to deal with it:
- Either we wait for more data before initializing the output stream and setting the encoding.
- Or we make complicated assumptions of what encoding is a superset of what. Hence if a file is ASCII, we just switch to UTF-8 which is a valid ASCII superset.

Anyway for now, I disable this part of the test (only the one with small chunks) with commit 5e95c6a.

[5] Same with commit 11f9085 for test_utf16_utf8().

[6] Finally I fixed gtk_source_buffer_output_stream_get_guessed() when uchardet is built with commit a2ae71d.

With all these commits, the wip/uchardet branch now runs `make test` without fault both --without-uchardet and with.

Comment 42 Sébastien Wilmet 2015-12-04 16:30:01 UTC

(In reply to Jehan from comment #41)
> Wow, that was a little crazy to diagnose but I finally did it!

Thanks! As the comment at the top warns, it can be a big headache, unfortunately. I myself didn't touch a lot the code when I moved it from gedit to GSV.

> Since this is a feature branch, not master, I directly pushed my commits
> there. Hope that's ok.

Sure, it's easier with a branch. Git is also smarter when doing rebases than re-applying an old patch provided in bugzilla.

I've pushed wip/uchardet-2 with some minor changes. See the diff between the two branches. I've also renamed some of the commits' titles, since we usually don't end them with a dot, and it should be less than 72 characters.

For the code changes, it looks good. I've tried to load an invalid file with gedit, and the invalid character (\0) is still converted to its hexadecimal value, so I guess your changes are fine wrt to that feature.

But I just tested the branch without uchardet. I need to actually install uchardet. If the branch is merged, uchardet will need to be added to a jhbuild moduleset, since it's enabled by default in GtkSourceView.

Comment 43 Jehan 2015-12-04 17:00:02 UTC

> I've pushed wip/uchardet-2 with some minor changes. See the diff between the two branches.

Looks cool.

> I've also renamed some of the commits' titles, since we usually don't end them with a dot,

Noted.

> I need to actually install uchardet.

Try to install later versions if possible (I've just released one yesterday).
At first I only wanted to maintain it, but in the end, by taking maintainership, I improved the algorithms quite a lot, and am even in the process of adding support for new charsets.
I have written scripts to generate statistics data from Wikipedia for all sort of languages and that really works well.

> If the branch is merged, uchardet will need to be added to a jhbuild moduleset, since it's enabled by default in GtkSourceView.

Would be cool.

Comment 44 Sébastien Wilmet 2015-12-05 18:48:21 UTC

I'm testing the branch with uchardet, in gedit.

With the file attached at bug #658836, I get the following error:
> Unexpected error: Invalid byte sequence in conversion input

With a file containing "foobar\0" (ending with the invalid byte 0, I can attach the file), I get the following content:
> foobarÀ

and when I save as the file it proposes me UTF-8 (but I know the initial file was not valid UTF-8). See bug #728181.

In the "File loading and saving" component of GtkSourceView bugzilla, there is also bug #693020.

Comment 45 Sébastien Wilmet 2015-12-05 18:52:56 UTC

Created attachment 316813 [details]
File containing invalid byte \C0

Here is the file, but the invalid byte is actually \C0, not \0.

Comment 46 Sébastien Wilmet 2015-12-05 18:57:34 UTC

I think it is worth adding those samples to the unit tests, if the bugs are fixed with uchardet.

Comment 47 Jehan 2015-12-05 20:34:43 UTC

Hi,

1) The invalid char in the file from bug #658836 is not 0xC0 but 0xA0 (non breaking space NBSP).

You can check with:
> $ hexdump  bare_jrnl_compsoc.tex |cut -d " " -f 2- |grep "\<c0\|c0\>"
> (no result)
> $ hexdump  bare_jrnl_compsoc.tex |cut -d " " -f 2- |grep "\<a0\|a0\>"
> (a few results)

uchardet currently returns ASCII for this file, which was on purpose in original Mozilla code. There is a comment which says that a lot of ASCII files had the NBSP as only character exception and therefore they were considering it an exception. Of course this is wrong, so I changed the logics. See my latest commit a few minutes ago.

Now uchardet returns "ISO-8859-1" for such a text. Therefore it should work well

2) In bug #728181, there still is no 0xC0, but there are indeed 0x00 characters (try again the above command).
uchardet returns "ASCII" on this one, which is actually right:

> $ uchardet gherkin-2.11.6.gemspec 
> ASCII

This is indeed an ASCII text file with NULL bytes (which is weird, but not an encoding error. You an make an `iconv -f ASCII -t UTF-8 gherkin-2.11.6.gemspec` and you'll see there is no error).

So if there is a bug, this is on gedit. Uchardet works great on this one. And building gedit with uchardet will actually fix bug #728181 (same for #658836).

3) Bug #693020 is not about encoding detection. It is the fact about skipping the ZWNBSP character when it is the first in a text (in this position, it is not a "zero-width non-breaking space", but a BOM).

uchardet can do nothing about this. Its sole role is to detect the encoding, not to edit the input text in any way. In other words, uchardet is read-only and should stay that way.

4) As for your file "invalid-text", it indeed has a 0xC0, and uchardet detects it as ISO-8859-3, thus making its contents be: "foobarÀ".
Now it could actually be nearly any of the single byte charsets. Working on such small texts with no real language associated has not much of a value. uchardet uses language statistics to process texts and can't do much with meaningless 1-word texts.

As for the fact you see 0xC0 everywhere on the bug report files, are you sure you have not made an automatic (wrong) conversion with some editor?
Or did I miss some other file?

Comment 48 Jehan 2015-12-05 20:54:25 UTC

Ok the previous comment was uchardet side, now for gtksourceview/gedit size.

1) As for bug #658836: uchardet detects ISO-8859-1, which is a good answer, but in gedit, it detects it as ASCII. I have not checked exactly the code, but I am pretty sure I know why.
I have already warned about this issue in my comment 41 (my point [3]). The problem is that you would break a text in chunks of smaller text, and gtksourceview try to detect an encoding from the first chunk only. This is wrong design.

The file from bug #658836 is the living example of it, since the first non-ASCII byte is at position 36058 (according to iconv), and I imagine you gave gtksourceview a smaller first chunk. Hence it feeds only a perfectly ASCII piece to uchardet.

This design will have to be reviewed if you want to be able to fix this.

2) The file from bug #728181 is correctly detected as ASCII by uchardet, but gedit is indeed warning me about invalid characters. Well I don't know if the error is in gtksourceview or gedit, but NULL byte is a perfectly valid ASCII character (control character, hence non-printable, but valid). Now I can understand that gedit is a text editor, so one might wonder what to do with non-printable characters.

Maybe the right move would be simply to warn the user of the presence of non-printable characters and ask what to do (removing or keeping them), but it should not consider the file as necessarily corrupted. It is not.

I can have a look at reviewing the code design for the first bug, if you wish.
The second one, well I think that's more UI design and we should decide first what should happen on coming by control characters.

Comment 49 Jehan 2015-12-05 21:36:57 UTC

Ok I have a first code for a proposed redesign of the way gtksourceview detects encoding (to leave software optional capacity to detect with fuller text, hence fixing bug #658836).

I stashed it and will review it in details, but maybe not before a few days since I have some other priorities these days to come. Then I will commit and push it on the feature branch.

For the BOM issue (Bug #693020), do you want this to be dealt in GtkSourceBufferOutputStream? In other words: are they not any application using GtkSourceView which wants the text with the BOM?

Comment 50 Sébastien Wilmet 2015-12-06 12:32:54 UTC

My invalid-text file was actually generated with a C program. It is not valid UTF-8, but yes indeed it is valid in other one-byte encodings.

See bug #598818 for the too-small-chunk problem.

For dealing with non-printable characters, GtkSourceView should escape them, the same way as it escapes invalid bytes. As I understand it, the invalid bytes are escaped when iconv reports an error, right? For valid but non-printable characters, maybe only \0 is a problem. I think for other non-printable characters GTK+ shows a rectangle with the hexadecimal value in it, no?

Anyway, those problems can be fixed separately. They are not regressions. I just hoped that using uchardet would fix all reported bugs for the file loading. This bug has already quite a lot of comments, so let's try to close it. Further improvements can be done in the other bugs, it'll be clearer.

There are some problems though in gedit when using uchardet: when I open uchardet/test/fr/iso-8859-1.txt, it correctly opens the file and converts it to UTF-8, but:
1. In the tab tooltip, I see "Encoding: ISO-8859-15" (I have only UTF-8 and ISO-8859-15 in my candidate encodings).
2. When I Save As the file, the proposed encoding is UTF-8. It should be the initial one instead (ISO-8859-1).

For the gedit problems, another bug can be opened.

Comment 51 Sébastien Wilmet 2016-01-02 12:26:19 UTC

Maybe to solve the too-small-chunk problem, a good solution is to add a D-Bus service to uchardet. So GtkSourceView can ask to the D-Bus service "what is the encoding of this file?".

Comment 52 Jehan 2016-01-02 12:46:15 UTC

That's interesting. I have never implemented a dbus service.
I'll have a look. :-)

Comment 53 Jehan 2016-01-02 12:47:37 UTC

P.S.: though I don't understand how it would solve the too-small-chunk issue. If there is not enough input, there is not enough input, whether as a lib or as a dbus service.

Comment 54 Sébastien Wilmet 2016-01-03 15:03:33 UTC

I was thinking about giving as a parameter the file name, not the data. But of course this wouldn't work when loading from a GInputStream.

Anyway another solution is to launch a subprocess to call the uchardet utility (with GSubprocess).

It's just a random idea, it's maybe better to gather more data in GSV and use the uchardet API.

Comment 55 Jehan 2016-01-03 21:33:52 UTC

I think that an API with string parameter is the most versatile choice. It works for any kind of input and I don't think that uchardet should have to care if the source is a file, a stream, or whatever.

As for running the uchardet utility rather than going through uchardet lib, well, you are the boss, but I'd think that it is a lot better to use the lib when you have the choice. :-)

Comment 56 Sébastien Wilmet 2016-01-04 19:14:39 UTC

My understanding was that uchardet is better placed to know how much data it needs to be confident enough about the encoding. But it's maybe simply "the more the better". When you call:

$ uchardet <file>

Does it look at all the data?

Comment 57 Jehan 2016-01-04 21:04:46 UTC

Yes, "the more, the better".

> Does it look at all the data?

uchardet command line tool does indeed process all the file contents until eof. But it's true that there is an internal shortcut where uchardet API would actually stop processing incoming data when it is already certain of the outcome. Such a shortcut would not happen for all encoding though. In particular, most of the single byte encodings are statistics-based, for many of them using the same codepoints, and therefore *really* the more data the better.

Bug 658836 is living example of the problem. The file in the report is 858 lines, full ASCII but for line 800 which exhibits a few NBSP characters. So feeding even a few hundreds of lines, which seems more than enough, uchardet will return ASCII and it will be right up to there. But it will be wrong in the whole and conversion will fail in the end because of this one single line in the whole file.

So yes, I really feel we should feed the whole data to uchardet. It's not like it's such a long process. It can still be considered instant even when processing files with millions of characters (and in some cases, it may even shortcut the processing itself).
I think the only reason why one would want to do some optimization would be if continuously processing billions of files and encoding detection approximation was acceptable.

I believe that this is not the use case for GtkSourceView, which is more oriented to users checking a few files at once at most and accuracy of the detection is more important than any kind of shortcut which would not make any sensible difference anyway.

Comment 58 Sébastien Wilmet 2016-01-07 09:14:19 UTC

Alright. So for the loading progress information, it should be split in two: the first half for encoding detection, the other half for converting the text and feeding the GtkTextBuffer.

Comment 59 Sébastien Wilmet 2016-06-08 11:32:05 UTC

I'm writing a new FileLoader in the recently-created Gtef library:
https://github.com/swilmet/gtef

The new FileLoader will be based on uchardet.

Comment 60 Jehan 2016-06-08 13:18:31 UTC

Sorry for the lack of feedback lately. I've been regularly thinking about hacking uchardet in gtksourceview but never took the time. :-/

> The new FileLoader will be based on uchardet.

Awesome. Does that mean that the uchardet branch for GtkSourceView that we started will become outdated and useless? Or is it complementary?

Of course I will be a little sad that my work got wasted, but well having proper encoding detection in gedit (and other software using GtkSourceView) would be a good tradeoff. :-)
Also that's my fault for being slow!

Comment 61 Sébastien Wilmet 2016-06-08 15:18:46 UTC

I'm not happy with the GtkSourceFileLoader and BufferOutputStream implementation:
- Encoding detection relies only on the first (small) chunk loaded, which sucks.
- With uchardet, the candidate encodings don't make a lot of sense to have that in the API, and the gedit UI is based on that.

With the new implementation, I want to:
- Limit file size.
- Limit line length, to avoid the performance problem of GtkTextView with very long lines. Ideally propose to split the long line.
- Have better detection of binary files.
- For encoding auto-detection, try first with uchardet, and if uchardet can't detect the encoding, try each encoding one by one, for the whole content, counting the number of invalid bytes in each, and then choose the best one. I think if it happens all in memory it should be fast, provided that there is the limit on the file size.

Comment 62 Jehan 2016-06-12 16:47:34 UTC

> try each encoding one by one, for the whole content, counting the number of invalid bytes in each, and then choose the best one.

I understand this algorithm is coming from current gedit's, but this is not good, IMO. There are dozens of encoding, and the chances that your text will be valid in several encodings are huge. Especially if the actual encoding is single-byte, your file is most likely fully valid with a dozen encodings out there (all single-byte encodings use nearly the same range of codepoints).

Also you should just ban the idea of making encoding selection visible to the users (like currently in gedit). I am a developer, I understand the logics behind encoding, but if I download a file on the web, I can't recognize its encoding! Even if I display the file in an hexadecimal editor, then I look the character maps of common encodings up, it would still take me several minutes of thinking to decide "yep that's this encoding" (and even doing so, I may likely be wrong). So asking any user to select in a list of encoding names? This is just wrong.

What does a user know? Usually a user know the language one thinks the file has text for. So here is my ideal GUI in a software which needs to read user-provided files (for instance gedit):

1/ By default, the software must make absolutely no assumption on a new file (well it may gives some weight to the OS-set lang, but this should not be a hard one because a user can still read files from other langs). From there, try to guess the encoding (uchardet).

2/ If it fails, try to add more weight to some "languages" (and not encodings!). The list of languages could ever have been provided by the user in preference of languages, or even better the software could teach oneself (if the user usually read Korean, French and English files for instance, which would likely be my case lately, then the software could give weights to encoding for these 3 languages). Then try again encoding detection with this soft-hinting (language weighting).

3/ If all this were to fail, the software could pop-up a question to the user, similar to:
"We are not able to detect the encoding for your file. Do you know which language it is supposed to be in?"
With this information, if the user knows, make a hard hint (i.e. consider *only* encodings usable for the given language), simply loop through the given list of encodings possible for this lang, and select the best match, just as you proposed.

3.alt/ If the user actually don't know at all what is in the file, hence the language, then and only then, we pop-up an error with a choice of encodings:
"We are sorry. It seems we are unable to correctly detect the encoding for you. Here is the list of supported encodings which you can try out on your file."
Then the user can click all encodings one by one, and the result would be displayed on-the-fly so that the user can just do the whole list without clicking OK all the time. This is the only time where you may actually resort to even name the word "encoding" because that's just developer shit that users don't want to know about (they just know "language").

So here is the ideal algorithm that I have been thinking a lot about these last years when I saw all the failures that various software out there are doing. 

Summed-up in:
* Users (even advanced one) don't know what encodings their files are in. This should not be anywhere in the normal UI, but only available as a last last resort when everything else fails.
* Users usually know about languages.
* Testing encoding validity sucks. Don't ever rely on just looping through encoding because all it proves is that your file is valid in too many encodings for it to be useful information.

Note: uchardet will also become a language detector in a not-so-distant future, which will allow for a program to "learn" preferred languages of a given user by keeping history of languages for all files read until now.

Note 2: I have been thinking of adding concepts of soft and hard language hinting to uchardet at some point too.

Comment 63 Sébastien Wilmet 2016-06-13 15:02:58 UTC

Thanks for the suggestion. It makes sense.

One thing that complicates the matter is dealing with invalid bytes (a corrupted file, or a binary file). Currently GtkSourceFileLoader supports that, it escapes the invalid bytes. But for a new FileLoader implementation, I'll not add invalid bytes handling to the mix directly.

Comment 64 Jehan 2016-06-13 15:19:49 UTC

Invalid byte (or trying to read binary file, to grab the few strings in it, I guess?) is not supposed to happen, which means it is rare enough that when it happens, we will simply reach the case 3.alt/ above, leaving the user play with the encoding list.

So such a file will still be readable, simply the user will have to face the language question, and then choose an encoding.

But in the first steps, invalid bytes should just be forbidden. Because if we start allowing the possibility that a file could be a given encoding even with some invalid bytes, then we just break encoding detection. As said, many encodings are just too close and differences are subtle. If we now get rid of the bigger differences (invalid bytes), there is no point in giving credits to the smaller differences.

Comment 65 Sébastien Wilmet 2016-06-14 18:13:05 UTC

I agree, but in case of a corrupted or binary file, we can fallback to that algo:

> try each encoding one by one, for the whole content, counting the number of
> invalid bytes in each, and then choose the best one.

with still an infobar to be able to choose another language/encoding.

Comment 66 Sébastien Wilmet 2016-08-28 19:09:04 UTC

Basic implementation merged:
https://github.com/swilmet/gtef/blob/master/gtef/gtef-file-loader.c

The interesting function is determine_encoding(). uchardet works smoothly, thanks a lot Jehan for all your work on uchardet!

Other interesting bits:
- there is a limit on the file size, since all the contents is loaded in memory and since GtkTextView uses much more memory than the size of the raw contents.
- I wrote a higher-level API for encoding conversion (for streaming), based on iconv:
https://github.com/swilmet/gtef/blob/master/gtef/gtef-encoding-converter.c

The new file loader still needs a lot of work:
- escape invalid chars (if we force a specific encoding);
- limit on line length, since GtkTextView doesn't support very long lines;
- support again loading from a GInputStream (e.g. stdin);
- report progress information, to have a progressbar;
- if a limit is reached, propose a better solution than just reporting an error: if max-size is reached, propose to truncate. If the max-line-length is reached, propose to split the line, or whatever. When saving, a warning infobar must appear of course.

But now at least the bases are there. And in Gtef it is not a problem if it is not perfect or lack a few features.

The new file loader will be added to GtkSourceView when it's more mature. For now, we can close the bug.