After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.
Bug 502947 - Doesn't open invalid UTF-8 text
Doesn't open invalid UTF-8 text
Status: RESOLVED FIXED
Product: gedit
Classification: Applications
Component: general
unspecified
Other Linux
: Normal normal
: ---
Assigned To: Gedit maintainers
Gedit maintainers
: 574535 590966 632377 (view as bug list)
Depends on:
Blocks: 660633
 
 
Reported: 2007-12-10 23:22 UTC by Behdad Esfahbod
Modified: 2011-10-01 14:06 UTC
See Also:
GNOME target: ---
GNOME version: ---


Attachments
invalid char v1 (11.74 KB, patch)
2010-12-14 14:06 UTC, Ignacio Casal Quinteiro (nacho)
none Details | Review
error in chunks v1 (2.99 KB, patch)
2010-12-14 18:01 UTC, Ignacio Casal Quinteiro (nacho)
none Details | Review
invalid char v2 (12.52 KB, patch)
2010-12-14 18:59 UTC, Ignacio Casal Quinteiro (nacho)
none Details | Review
invalid char v3 (17.15 KB, patch)
2010-12-15 13:58 UTC, Ignacio Casal Quinteiro (nacho)
none Details | Review
Add invalid char support. (33.17 KB, patch)
2011-01-27 15:37 UTC, Ignacio Casal Quinteiro (nacho)
none Details | Review
Add invalid char support. (30.01 KB, patch)
2011-01-29 17:43 UTC, Ignacio Casal Quinteiro (nacho)
committed Details | Review

Description Behdad Esfahbod 2007-12-10 23:22:12 UTC
So if a file has a single invalid byte I can't edit it or even open to fix it.

I've made pango render a box with a cross inside for every invalid input byte.  As long as you don't reuse what pango_layout_get_text() returns, it should be safe to pass invalid text to pango_layout_set_text().
Comment 1 Ignacio Casal Quinteiro (nacho) 2010-04-05 12:38:10 UTC
*** Bug 590966 has been marked as a duplicate of this bug. ***
Comment 2 Ignacio Casal Quinteiro (nacho) 2010-10-17 18:03:15 UTC
*** Bug 632377 has been marked as a duplicate of this bug. ***
Comment 3 Ignacio Casal Quinteiro (nacho) 2010-12-14 14:06:03 UTC
Created attachment 176404 [details] [review]
invalid char v1

Here is the first part. Next step will be to just apply the tag per chunk.
Comment 4 Ignacio Casal Quinteiro (nacho) 2010-12-14 18:01:19 UTC
Created attachment 176421 [details] [review]
error in chunks v1

Here it is to apply error tags by chunks. Next no so important step is to manage the conversion errors.
Comment 5 Ignacio Casal Quinteiro (nacho) 2010-12-14 18:59:37 UTC
Created attachment 176425 [details] [review]
invalid char v2

Squashed both patches. For easier review
Comment 6 Paolo Borelli 2010-12-14 20:39:53 UTC
Review of attachment 176425 [details] [review]:

Patch looks good, except for some annotation below.

As you say, it is missing the part where it actually notices that something went wrong and notifies the user about it.

::: gedit/gedit-document-loader.c
@@ +507,3 @@
+		                     "needed to use a fallback char");
+	}
+

I am probably missing context, but can you explain why you are moving this hunk?

::: gedit/gedit-document-output-stream.c
@@ +251,3 @@
+	stream->priv->current_encoding = stream->priv->encodings;
+
+	return (const GeditEncoding *)stream->priv->current_encoding->data;

maybe I am missing the context, but I do not understand how this hunk relates to the rest

@@ +582,2 @@
 		{
+			GtkTextIter start;

We also must tag the invalid text when reaching the end of the text, if the last characters were invalid
Comment 7 Behdad Esfahbod 2010-12-14 21:06:01 UTC
So, saving the file leaves the invalid bytes unaffected, right?
Comment 8 Ignacio Casal Quinteiro (nacho) 2010-12-14 21:07:21 UTC
(In reply to comment #6)
> Review of attachment 176425 [details] [review]:
> 
> Patch looks good, except for some annotation below.
> 
> As you say, it is missing the part where it actually notices that something
> went wrong and notifies the user about it.
That part is already added. It was added in the previous cycle when we did all the stream thingie.

> 
> ::: gedit/gedit-document-loader.c
> @@ +507,3 @@
> +                             "needed to use a fallback char");
> +    }
> +
> 
> I am probably missing context, but can you explain why you are moving this
> hunk?
We need to check this after we close the stream, so we are sure that the stream was flushed.

> 
> ::: gedit/gedit-document-output-stream.c
> @@ +251,3 @@
> +    stream->priv->current_encoding = stream->priv->encodings;
> +
> +    return (const GeditEncoding *)stream->priv->current_encoding->data;
> 
> maybe I am missing the context, but I do not understand how this hunk relates
> to the rest
> 
> @@ +582,2 @@
>          {
> +            GtkTextIter start;
> 
> We also must tag the invalid text when reaching the end of the text, if the
> last characters were invalid
Right, I'll fix it asap.
Comment 9 Ignacio Casal Quinteiro (nacho) 2010-12-14 21:08:55 UTC
(In reply to comment #7)
> So, saving the file leaves the invalid bytes unaffected, right?
Nop, We still have to deal with it, not sure how though. The problem is that if we escape them, then we have to unescape them also when we are saving. Would be easier if we could just insert the invalid chars in GtkTextBuffer and this one would just do the escaping.
Comment 10 Ignacio Casal Quinteiro (nacho) 2010-12-14 21:11:49 UTC
Oh! about this:

::: gedit/gedit-document-output-stream.c
@@ +251,3 @@
+    stream->priv->current_encoding = stream->priv->encodings;
+
+    return (const GeditEncoding *)stream->priv->current_encoding->data;

maybe I am missing the context, but I do not understand how this hunk relates
to the rest

We need this to return the first encoding of the list when we are guessing. So if all the guessed encodings failed, we return the first one of the list, and we do the escaping with it.
Comment 11 Behdad Esfahbod 2010-12-14 21:12:46 UTC
(In reply to comment #9)
> (In reply to comment #7)
> > So, saving the file leaves the invalid bytes unaffected, right?
> Nop, We still have to deal with it, not sure how though. The problem is that if
> we escape them, then we have to unescape them also when we are saving. Would be
> easier if we could just insert the invalid chars in GtkTextBuffer and this one
> would just do the escaping.

Yes, lets add support to GtkTextBuffer/View directly.  So you just insert it, and it displays it appropriately.
Comment 12 Ignacio Casal Quinteiro (nacho) 2010-12-14 21:14:01 UTC
So does pango properly supports either insert and get invalid chars?
Comment 13 Paolo Borelli 2010-12-14 21:17:16 UTC
(In reply to comment #7)
> So, saving the file leaves the invalid bytes unaffected, right?

The UI I had in mind shows you the escaped chars (highlighted as errors) and shows you an info bar telling you that "encoding detection failed blah blah  [edit anyaway]". Unless you do it explicitely, the file is readonly so you will not corrupt the original.


Fixing textview to also accept non utf8 text is a huge task.
Comment 14 Behdad Esfahbod 2010-12-14 21:19:24 UTC
(In reply to comment #12)
> So does pango properly supports either insert and get invalid chars?

Pango just draws a box, and getting the text back gives (guchar)-1.  So, drawing a proper hexbox with a red backeground, etc, and saving it intact, needs to be done in Gtk.
Comment 15 Ignacio Casal Quinteiro (nacho) 2010-12-14 21:22:35 UTC
(In reply to comment #14)
> (In reply to comment #12)
> > So does pango properly supports either insert and get invalid chars?
> 
> Pango just draws a box, and getting the text back gives (guchar)-1.  So,
> drawing a proper hexbox with a red backeground, etc, and saving it intact,
> needs to be done in Gtk.

Not sure how we can do this also without breaking the api, see that we need to know from gtk_text_buffer_insert, if we inserted something wrong in the buffer to show the doom message.
Comment 16 Behdad Esfahbod 2010-12-14 21:32:56 UTC
(In reply to comment #15)
> (In reply to comment #14)
> > (In reply to comment #12)
> > > So does pango properly supports either insert and get invalid chars?
> > 
> > Pango just draws a box, and getting the text back gives (guchar)-1.  So,
> > drawing a proper hexbox with a red backeground, etc, and saving it intact,
> > needs to be done in Gtk.
> 
> Not sure how we can do this also without breaking the api, see that we need to
> know from gtk_text_buffer_insert, if we inserted something wrong in the buffer
> to show the doom message.

Much like pango_layout_get_unknown_glyphs_count() does for PangoLayout, you need a GtkTextBuffer equivalent.
Comment 17 Ignacio Casal Quinteiro (nacho) 2010-12-14 21:39:16 UTC
Makes sense, it is something similar to what we have right now in the output-stream. Now my concern as pbor says looks like a huge task, and my knowledge of the textview internals is quite few. I can try though, but I fear I'll not get it on time for 3.0 and then the patch will be there till 3.2 or 4.0...
Comment 18 Behdad Esfahbod 2010-12-14 21:41:18 UTC
Lets file a bug against gtk+ and proceed with fixing it in gedit for now.
Comment 19 Ignacio Casal Quinteiro (nacho) 2010-12-14 21:44:50 UTC
How do you suggest to fix gedit? Using my patch or some workaround to access the pangolayout in the textbuffer?
Comment 20 Behdad Esfahbod 2010-12-14 23:04:27 UTC
(In reply to comment #19)
> How do you suggest to fix gedit? Using my patch or some workaround to access
> the pangolayout in the textbuffer?

Fix it using your patch but with pbor's suggestion of opening readonly and warning the user that saving the text will corrupt the original.modifying
Comment 21 Behdad Esfahbod 2010-12-14 23:06:55 UTC
Filed bug 637270 for gtk enhancement.
Comment 22 Ignacio Casal Quinteiro (nacho) 2010-12-15 13:58:01 UTC
Created attachment 176460 [details] [review]
invalid char v3

This fixes the issues pointed and makes the document readonly, see that I had to add a hack as in each focus of the tab we check if it was externally modified.
Comment 23 Paolo Borelli 2010-12-15 14:05:27 UTC
Review of attachment 176460 [details] [review]:

I think we should even be more careful: when there are still error tags (which maybe should have a more specific name), we should make save insenitive
Comment 24 Ignacio Casal Quinteiro (nacho) 2010-12-15 14:07:19 UTC
Afaik the save is insensitive while it is readonly. Do you think that it is really neccessary if the a red tag exists?
Comment 25 Paolo Borelli 2010-12-15 14:09:37 UTC
it depends on the message we put in the message area, but my idea was that the user should not be allowed to save the escaped text: to save he must edit the document to replace the escaped characters
Comment 26 Ignacio Casal Quinteiro (nacho) 2010-12-15 14:11:27 UTC
How do you recommend to track the tagged chars?
Comment 27 Paolo Borelli 2010-12-15 14:17:26 UTC
we can use gtk_text_iter_forward_to_tag_toggle to see of there are no tags of that kind
Comment 28 Ignacio Casal Quinteiro (nacho) 2010-12-15 14:18:55 UTC
but how do we track it? using the insert-text and delete... signals?
Comment 29 Ignacio Casal Quinteiro (nacho) 2010-12-15 14:27:07 UTC
For the tag name, maybe it is better: "invalid-char-style" ? Less common and more representative
Comment 30 Ignacio Casal Quinteiro (nacho) 2011-01-27 15:37:35 UTC
Created attachment 179439 [details] [review]
Add invalid char support.
Comment 31 Ignacio Casal Quinteiro (nacho) 2011-01-29 17:43:28 UTC
Created attachment 179605 [details] [review]
Add invalid char support.
Comment 32 Ignacio Casal Quinteiro (nacho) 2011-01-29 17:57:14 UTC
Comment on attachment 179605 [details] [review]
Add invalid char support.

Pushed with some modifications, leaving the bug opened in case some day we decide to convert the escaped chars to invalid chars when saving.
Comment 33 Ignacio Casal Quinteiro (nacho) 2011-01-31 21:52:02 UTC
*** Bug 574535 has been marked as a duplicate of this bug. ***
Comment 34 Ignacio Casal Quinteiro (nacho) 2011-10-01 14:06:25 UTC
Closing as this was fixed already. I cloned this bug to track the unescape when saving.