GNOME Bugzilla – Bug 502947
Doesn't open invalid UTF-8 text
Last modified: 2011-10-01 14:06:25 UTC
So if a file has a single invalid byte I can't edit it or even open to fix it. I've made pango render a box with a cross inside for every invalid input byte. As long as you don't reuse what pango_layout_get_text() returns, it should be safe to pass invalid text to pango_layout_set_text().
*** Bug 590966 has been marked as a duplicate of this bug. ***
*** Bug 632377 has been marked as a duplicate of this bug. ***
Created attachment 176404 [details] [review] invalid char v1 Here is the first part. Next step will be to just apply the tag per chunk.
Created attachment 176421 [details] [review] error in chunks v1 Here it is to apply error tags by chunks. Next no so important step is to manage the conversion errors.
Created attachment 176425 [details] [review] invalid char v2 Squashed both patches. For easier review
Review of attachment 176425 [details] [review]: Patch looks good, except for some annotation below. As you say, it is missing the part where it actually notices that something went wrong and notifies the user about it. ::: gedit/gedit-document-loader.c @@ +507,3 @@ + "needed to use a fallback char"); + } + I am probably missing context, but can you explain why you are moving this hunk? ::: gedit/gedit-document-output-stream.c @@ +251,3 @@ + stream->priv->current_encoding = stream->priv->encodings; + + return (const GeditEncoding *)stream->priv->current_encoding->data; maybe I am missing the context, but I do not understand how this hunk relates to the rest @@ +582,2 @@ { + GtkTextIter start; We also must tag the invalid text when reaching the end of the text, if the last characters were invalid
So, saving the file leaves the invalid bytes unaffected, right?
(In reply to comment #6) > Review of attachment 176425 [details] [review]: > > Patch looks good, except for some annotation below. > > As you say, it is missing the part where it actually notices that something > went wrong and notifies the user about it. That part is already added. It was added in the previous cycle when we did all the stream thingie. > > ::: gedit/gedit-document-loader.c > @@ +507,3 @@ > + "needed to use a fallback char"); > + } > + > > I am probably missing context, but can you explain why you are moving this > hunk? We need to check this after we close the stream, so we are sure that the stream was flushed. > > ::: gedit/gedit-document-output-stream.c > @@ +251,3 @@ > + stream->priv->current_encoding = stream->priv->encodings; > + > + return (const GeditEncoding *)stream->priv->current_encoding->data; > > maybe I am missing the context, but I do not understand how this hunk relates > to the rest > > @@ +582,2 @@ > { > + GtkTextIter start; > > We also must tag the invalid text when reaching the end of the text, if the > last characters were invalid Right, I'll fix it asap.
(In reply to comment #7) > So, saving the file leaves the invalid bytes unaffected, right? Nop, We still have to deal with it, not sure how though. The problem is that if we escape them, then we have to unescape them also when we are saving. Would be easier if we could just insert the invalid chars in GtkTextBuffer and this one would just do the escaping.
Oh! about this: ::: gedit/gedit-document-output-stream.c @@ +251,3 @@ + stream->priv->current_encoding = stream->priv->encodings; + + return (const GeditEncoding *)stream->priv->current_encoding->data; maybe I am missing the context, but I do not understand how this hunk relates to the rest We need this to return the first encoding of the list when we are guessing. So if all the guessed encodings failed, we return the first one of the list, and we do the escaping with it.
(In reply to comment #9) > (In reply to comment #7) > > So, saving the file leaves the invalid bytes unaffected, right? > Nop, We still have to deal with it, not sure how though. The problem is that if > we escape them, then we have to unescape them also when we are saving. Would be > easier if we could just insert the invalid chars in GtkTextBuffer and this one > would just do the escaping. Yes, lets add support to GtkTextBuffer/View directly. So you just insert it, and it displays it appropriately.
So does pango properly supports either insert and get invalid chars?
(In reply to comment #7) > So, saving the file leaves the invalid bytes unaffected, right? The UI I had in mind shows you the escaped chars (highlighted as errors) and shows you an info bar telling you that "encoding detection failed blah blah [edit anyaway]". Unless you do it explicitely, the file is readonly so you will not corrupt the original. Fixing textview to also accept non utf8 text is a huge task.
(In reply to comment #12) > So does pango properly supports either insert and get invalid chars? Pango just draws a box, and getting the text back gives (guchar)-1. So, drawing a proper hexbox with a red backeground, etc, and saving it intact, needs to be done in Gtk.
(In reply to comment #14) > (In reply to comment #12) > > So does pango properly supports either insert and get invalid chars? > > Pango just draws a box, and getting the text back gives (guchar)-1. So, > drawing a proper hexbox with a red backeground, etc, and saving it intact, > needs to be done in Gtk. Not sure how we can do this also without breaking the api, see that we need to know from gtk_text_buffer_insert, if we inserted something wrong in the buffer to show the doom message.
(In reply to comment #15) > (In reply to comment #14) > > (In reply to comment #12) > > > So does pango properly supports either insert and get invalid chars? > > > > Pango just draws a box, and getting the text back gives (guchar)-1. So, > > drawing a proper hexbox with a red backeground, etc, and saving it intact, > > needs to be done in Gtk. > > Not sure how we can do this also without breaking the api, see that we need to > know from gtk_text_buffer_insert, if we inserted something wrong in the buffer > to show the doom message. Much like pango_layout_get_unknown_glyphs_count() does for PangoLayout, you need a GtkTextBuffer equivalent.
Makes sense, it is something similar to what we have right now in the output-stream. Now my concern as pbor says looks like a huge task, and my knowledge of the textview internals is quite few. I can try though, but I fear I'll not get it on time for 3.0 and then the patch will be there till 3.2 or 4.0...
Lets file a bug against gtk+ and proceed with fixing it in gedit for now.
How do you suggest to fix gedit? Using my patch or some workaround to access the pangolayout in the textbuffer?
(In reply to comment #19) > How do you suggest to fix gedit? Using my patch or some workaround to access > the pangolayout in the textbuffer? Fix it using your patch but with pbor's suggestion of opening readonly and warning the user that saving the text will corrupt the original.modifying
Filed bug 637270 for gtk enhancement.
Created attachment 176460 [details] [review] invalid char v3 This fixes the issues pointed and makes the document readonly, see that I had to add a hack as in each focus of the tab we check if it was externally modified.
Review of attachment 176460 [details] [review]: I think we should even be more careful: when there are still error tags (which maybe should have a more specific name), we should make save insenitive
Afaik the save is insensitive while it is readonly. Do you think that it is really neccessary if the a red tag exists?
it depends on the message we put in the message area, but my idea was that the user should not be allowed to save the escaped text: to save he must edit the document to replace the escaped characters
How do you recommend to track the tagged chars?
we can use gtk_text_iter_forward_to_tag_toggle to see of there are no tags of that kind
but how do we track it? using the insert-text and delete... signals?
For the tag name, maybe it is better: "invalid-char-style" ? Less common and more representative
Created attachment 179439 [details] [review] Add invalid char support.
Created attachment 179605 [details] [review] Add invalid char support.
Comment on attachment 179605 [details] [review] Add invalid char support. Pushed with some modifications, leaving the bug opened in case some day we decide to convert the escaped chars to invalid chars when saving.
*** Bug 574535 has been marked as a duplicate of this bug. ***
Closing as this was fixed already. I cloned this bug to track the unescape when saving.