GNOME Bugzilla – Bug 662537
Ext4 unicode labels not shown correctly
Last modified: 2011-12-13 17:36:52 UTC
If the label of an ext4 partition contains any unicode characters, it is not displayed correctly in gprated. For example, the label "あかさ" is displayed as: M-cM-^AM-^BM-cM-^AM-^KM-cM-^AM Also, the text entry for setting a partition label doesn't let me type unicode through ibus. If I write it somewhere else, then copy paste it in and set the label, it is set correctly, albeit being displayed as garbage (example above).
Would you be able to provide the output from the following two commands (with root privilege)? e2label /path-to-your-labeled-partition Where /path-to-your-labeled-partition is something like /dev/sda6 blkid /path-to-your-labeled-partition Where /path-to-your-labeled-partition is something like /dev/sda6
e2label: あかさ blkid: /dev/sda6: LABEL="M-cM-^AM-^BM-cM-^AM-^KM-cM-^AM-^U" UUID="2550ae15-4193-4d66-9a83-bf4f17960294" TYPE="ext4" (Does this mean libparted is not at fault?)
Created attachment 199941 [details] [review] Diff file with update to change volume label tool read order Thanks for the output. The problem is not with libparted. It appears that the blkid command is not displaying the volume label in the way we would expect. I have attached a diff file that reverses the order of which tool to use to first try to read the volume label. To apply the diff to a git repository, use: git apply bug662537-unicode-labels-incorrect.diff If you are unfamiliar with git, you can learn the basics at: http://gparted.org/git.php Would you be able to test this on your computer?
Thank you for your work! I recently learned how to use git so this wasn't a problem for me. With your patch, the label is now (almost) displayed correctly. Another (minor) weirdness that appeared now is that the line of /dev/sda6 is wider than the others. "あかさ" would be "E3 81 82 E3 81 8B E3 81 95" in hex, but when I had e2label write its output to a file and opened that with a hex editor, what appeared was "E3 81 82 E3 81 8B E3 81 95 0A". So, it seems that the "0A" at the end which doesn't belong to the characters is wrongly interpreted as an ascii "new line" I think? It's just weird, because when I set the label anew to "あかさ", the "0A" isn't there yet because the partitions line is as wide as the others. But as soon as I apply the changes (and the label is actually set), the "0A" appears.
Created attachment 199953 [details] [review] Diff file with update to change volume label tool read order attempt 2 Thank you for testing. :-) Attached is a new diff file that should address the extra 0x0A at the end of the label. You will need to undo the previous patch before applying this patch. You can get rid of all changes to git with the following command: git reset --hard Please let me know if this fixes the problem, and also if you experience any other problems with this diff file.
I did a "git checkout HEAD~1", it should work the same right? Anyway, sadly nothing changed.. the label shows fine but the "new line" is still there.. I double checked the source and the new regexps are all in place =/
Created attachment 199961 [details] [review] Diff file with update to change volume label tool read order attempt 3 I think that "git checkout HEAD~1" goes back to the parent of HEAD and is not the same as "get reset --hard" which will remove all changes and revert back to the head. Attached is a third attempt, though I am at a loss as to why the second diff file did not address the problem. Please let me know if this fixes the problem, and also if you experience any other problems with this diff file.
(In reply to comment #4) > Another (minor) weirdness that appeared now is that the line of /dev/sda6 is > wider than the others. > "あかさ" would be "E3 81 82 E3 81 8B E3 81 95" in hex, > but when I had e2label write its output to a file and opened that with a hex > editor, > what appeared was "E3 81 82 E3 81 8B E3 81 95 0A". The newline from the e2label command is appended to the output of the label, but is not part of the label. Otherwise the output would be something like: user@ubuntu:~$ sudo e2label /dev/sda2 myvolumelabeluser@ubuntu:~$ Instead of what does currently occur: user@ubuntu:~$ sudo e2label /dev/sda2 myvolumelabel user@ubuntu:~$ Are you seeing the newline in the GParted display of the label?
Created attachment 199966 [details] screenshot Sadly, the problem presists =/ I have cut together 2 screenshots to somehow demonstrate how it looks
Created attachment 199986 [details] [review] Diff file with update to change volume label tool read order attempt 4 Thank you for the screen shot. This certainly displays the problem quite clearly. With this next diff file I have changed the execute_command to not use the "C" locale since this might be part of the problem. Please try this new patch to see if this addresses the problem.
I'm sorry, this didn't fix it either. I experimented around a little bit, and the regexp_label function sure does behave a little bit weirdly. For example, I changed the expression to "^([^\n])". Normally, this should match the first character when it isn't a LF, right? Well, with non-unicode labels, it does exactly that, so "Fedora15" becomes "F", but with unicode characters, it treats three of them as one character! "あかさたなはま" becomes "あかさ". Also, I tried this with and without C-locale enabled, and the results are the same. Maybe the regexp function wasn't even written with Unicode/UTF-8 in mind?
Great investigative work Jonas! It would appear that we definitely have a problem with unicode characters and the current regexp method. After looking at the screen shot in comment #10, I noticed that the upper label takes up less space and perhaps does not have the extra newline. Did you set this label from the command line? Thinking of methods that do not properly support unicode, I believe the Utils::trim method is one of these. I mention this because Utils::trim is invoked on the user typed in label before it is assigned. Perhaps this method might be the cause of the extra newline. I am investigating how to make the regexp and trim methods into unicode aware methods. In the meantime would you be able to test removing the Utils::trim method from the Dialog_Partition_Label::get_new_label() method in src/Dialog_Partition_Label.cc?
Created attachment 200072 [details] screenshot Thanks for still sticking with me (= About the previous screenshot: First, I normally set the label to the unicode string (not applied yet), so it would show up as a preview, which is visibly not flawed. I also opened the "set label" dialog again to verify there are no strange characters there (yet), and made this into the first part of the screenshot. Then I applied the changes, and after GParted rescanned the devices, LF at the end shows up, visible (1) through the wider row you mentioned (that's what I previously meant with "wider line", sorry), and (2) through the "set label" dialog I opened up again, where you can actually "see" the LF through the placeholder character, and that's what in the second part of the shot. About the trim function, I'm not completely sure, but I think it actually works fine with unicode. Why? Because, when I open up the "set label" dialog and actually input the label + a LF at the end, when I click "OK", the LF is already removed from the label in the preview. Also, the LF always first appears after applying, and is clearly not a part of the label as you yourself said (it'd mess up my thunar view else), so the error clearly only happens when it is read out, which basically makes this bug just a cosmetic annoyance more than a real problem.. But it is still pretty annoying to not have a program work 100% the way you intend it to, right? So I quickly taught myself some basic C++, and added this dirty hack to ext4.cc: if ( partition.label[ partition.label.length()-1 ] == 0x0A ) { size_t sz = partition.label.length(); partition.label.resize( sz-1 ); } I guess I'll happily be using this until the real cause of all this is investigated.. BUT! While testing this with every possible combination, I found yet another weirdness. It only happens in special cases: The end of the label gets replaced by some weird placeholder which I've never seen before (described further below), and the console actually keeps spewing out: (gpartedbin:28813): Pango-WARNING **: Invalid UTF-8 string passed to pango_layout_set_text() After some experimenting, I have (somehow) determined the pattern of it: Every sequence of roughly 3 ascii characters is treated as one unicode character; if the amount of unicode characters exceeds the number of 5, the rest of the label will be, after applying, cut off and replaced with a placeholder character which indicates some sort of error (see new screenshot). If however, the 5th (or 6th) character is a sequence of ascii characters, no pango errors occur, but this last sequence is still halfway cut off in a (to me still) random fashion. Also, this error works "best" if you use three byte unicode symbols (e.g. '€'), with two bytes like the '°' the pattern is a little bit more random. I switched back to the master branch to make sure my changes didn't cause any of this, and there it doesn't happen because the label doesn't get read correctly in the first place. So I then applied your very first patch which only made the unicode visible, and here, this problem already happens. This time the effects reach outside of GParted, as the label is also displayed in the cut off pattern as described above everywhere else (filemanager etc.), although I think the pango errors only happen withing GParted. Check the new screenshot for this because it is really hard to explain.. PS. I checked this without the Util::trim function in the new label dialog, and it didn't seem to affect this particular problem.
Oh, and by the way, I never set the label from the command line as I do everything through GParted
Created attachment 200130 [details] [review] Diff file with update to change volume label tool read order attempt 5 Based upon your research it would appear that the Utils::trim method works properly. As such I have created another patch that wraps the Utils::trim method around the Utils::regexp_label method. This is likely a temporary solution until I can figure out how to get regular expressions working correctly for unicode characters. Currently I am looking at the following glibmm library: http://developer.gnome.org/glibmm/unstable/classGlib_1_1Regex.html Would you be able to test this new patch?
Created attachment 200131 [details] [review] Diff file with update to change volume label tool read order attempt 6 Silly me. ;-) In this code I don't even need the regular expression. Hence this new diff file replaces the Utils::regexp_label with Utils::trim method. Please test and let me know how this new diff file works.
Thanks! It seems like trim does the job well, no more LF at the end (= The invalid ustrings however have me worrying a lot; right now I am trying everything to get C++ to print out the bytes of the (malformed?) strings, so I can compare them and locate what kind of manipulation/damage has been done to them in the process. What I tried is to insert something along the lines of Glib::ustring::size_type sz = partition_temp.label.size(); std::cout << "**************\nRead partition: " << partition_temp.label << "\nValid? " << partition_temp.label.validate() << "\nLabel length: " << sz << "\nIn hex:"; Glib::ustring::size_type i; for (i=0;i<sz;i++) {std::cout << " 0x" << std::hex << partition_temp.label.at(i) << std::dec;} But for unicode strings, it just keeps saying their length is 0! Clearly the regexp_label function doesn't use the Glib regexp, does it? Because I couldn't find the sruct "regex_t" anywhere in the documentation of said Glib class, I'm assuming so. But isn't this flawing the whole program to begin with? I mean, Glib::ustring is being used everywhere, but I don't think there are any classes outside Glibmm that can deal properly with the multibyte ustrings in Glib::ustring. (or maybe I have no idea what I'm talking about? sorry, I have had absolutely zero experience with C++/Glibmm til this bug report..)
Created attachment 200222 [details] [review] Diff file with update to change volume label tool read order attempt 7 (In reply to comment #17) > Thanks! It seems like trim does the job well, > no more LF at the end (= Glad to hear it. :) Does this also work for unicode strings that are not just 3 characters long? For example, 5 characters? > Clearly the regexp_label function doesn't use the Glib regexp, does it? Yes, you are correct. The existing code did not use the Glib::Regex class. The attached diff file has a replacement that does use Glib::Regex, and also includes some fixes for regexp_label calls in ntfs and btrfs file systems too. The Glib::Regex class is supposed to handle unicode characters. Would you be able to test this new diff file and provide your comments?
Hey, good news! I finally found the source of all the malformed strings. (I wish I had known that GParted is just an interface to external progs for almost everything earlier..) Anyway, so when a label is to be written, GParted calls e2label, right? Here's an excerpt of e2label.c: #define VOLNAMSZ 16 [...] strncpy(sb.s_volume_name, label, VOLNAMSZ); if (strlen(label) > VOLNAMSZ) fprintf(stderr, _("Warning: label too long, truncating.\n")); So labels have a max size of 16 bytes, eh? Didn't know that :P This would normally work for ascii (and UTF-8 containing it), but with multibyte chars the possibility is there that they get chopped in half, leaving garbage bytes. As I see it the partition label dialog is fs agnostic, so the option to pop up a MsgDialog along the lines of "Label too long!" or "Label was too long and has been trimmed" would be rather cumbersome I guess. I added this to ext4.cc to at least avoid bad strings: bool ext4::write_label([...]) { // trim label to 16 bytes Glib::ustring temp_label(partition.label); while ( temp_label.bytes() > 16) { temp_label.resize( temp_label.length()-1 ); } [then use temp_label instead of partition.label] } which seems to work ok. Anyway, there's the remaining problem that users aren't notified of this trim and could be confused to after the write. Maybe add a message to the error log shown after each operation? It's up to you GParted devs I guess (; Oh, and good job on the regex change, I think that was a good thing to do nonetheless ^^ Btw: Do you know where I should ask about ibus not working in the label dialog? I tested the entry widget with both 2.4 and 3.0, and they both worked
Thanks again Jonas for all your investigative work and testing. :-) You've been a great help with addressing the problem(s) in this bug report. (In reply to comment #19) > (I wish I had known that GParted is just an interface to external progs for > almost everything earlier..) Yes, as you discovered GParted is graphical front end. GParted uses the libparted library to edit partition tables, and other free software file system tools to manipulate the file systems. > Anyway, so when a label is to be written, GParted calls e2label, right? Yes, for ext2, ext3, and ext4 file systems. > As I see it the partition label dialog is fs agnostic, > so the option to pop up a MsgDialog along the lines of "Label too long!" > or "Label was too long and has been trimmed" would be rather cumbersome I > guess. Yes, it could be cumbersome. I'll have to think on this a while to see if I can come up with an elegant solution. > I added this to ext4.cc to at least avoid bad strings: > > bool ext4::write_label([...]) > { > // trim label to 16 bytes > Glib::ustring temp_label(partition.label); > while ( temp_label.bytes() > 16) > { temp_label.resize( temp_label.length()-1 ); } > [then use temp_label instead of partition.label] > } > > which seems to work ok. > Anyway, there's the remaining problem that users aren't notified of this trim > and could be confused to after the write. > Maybe add a message to the error log shown after each operation? > It's up to you GParted devs I guess (; Good suggestion. GParted contains similar code for FAT16 and FAT32 file systems as can be seen in the following method: http://git.gnome.org/browse/gparted/tree/src/Utils.cc#n349 I would prefer a solution that kept the file system logic contained within the specific file system source code files. The current solution does not expand well to new file systems. ;-( > > Oh, and good job on the regex change, I think that was a good thing to do > nonetheless ^^ Agreed. :-) > Btw: Do you know where I should ask about ibus not working in the label dialog? > I tested the entry widget with both 2.4 and 3.0, and they both worked That's a good question. I knew nothing about ibus until you mentioned it. I did a quick search for "linux ibus" and this is the first result: http://code.google.com/p/ibus/ Is this the ibus code you are referring too?
(In reply to comment #20) > Thanks again Jonas for all your investigative work and testing. :-) Well thank you for sticking through with me ^^ I just wanted to fix this asap as it directly affects me (I use lots of unicode labels) > > Anyway, so when a label is to be written, GParted calls e2label, right? > > Yes, for ext2, ext3, and ext4 file systems. Yeah, that's what I meant :P Sorry, I usually mean "ext* label" when I say "label" > GParted contains similar code for FAT16 and FAT32 file systems as can be seen > in the following method: > http://git.gnome.org/browse/gparted/tree/src/Utils.cc#n349 > > I would prefer a solution that kept the file system logic contained within the > specific file system source code files. The current solution does not expand > well to new file systems. ;-( If I understand you correctly, then there would be a necessity for an abstract "label_maxsize" or something value in Filesystem (which I think is the parent class of ext*, btrfs etc.?), and then the call to a "check_label_length" function, which in turn could use a "Utils::trim_to_n_bytes", would have to be made mandatory. (As it seems NTFS does have a bytecap of 32 as well, which isn't reflected in the code I think) This should make the integration with the 'set label' dialog pretty easy, if you choose to do so. > Is this the ibus code you are referring too? Yep. I think I'll just start at Fedora Forums and ask if anybody experiences the same problem, then most likely I'll be led on the right path.
While thinking about this problem, I thought it would also be worth following this up from the perspective of the util-linux blkid command. Some Internet searching brought me to the following email thread: Re: Problem with blkid http://www.spinics.net/lists/util-linux-ng/msg04242.html From reading this email thread it appears that newer versions of blkid will have a "-d" option to indicate that the label output should not be encoded. According to the util-linux ChangeLog, this new "-d" option should be available with util-linux version 2.20. util-linux v2.20 http://lwn.net/Articles/456688/
An enhancement to permit GParted to properly display ext2/3/4 unicode volume labels has been committed to the Gnome git repository for inclusion in the next release of GParted. The relevant git commit can be viewed at the following link: http://git.gnome.org/browse/gparted/commit/?id=cf8293cf78efe2e2e8b9b74279083c1f986d236e
>Some Internet searching brought me to the following email thread Interesting. So the label has never been read incorrectly to begin with, but displayed in such a way as to not break the shell when unprintable chars appear (like "echo あかさ | cat -v" does). You never stop learning.. Now that you filed the patch, is it appropriate to mark this bug as "RESOLVED"?
Yes, we could mark this bug as "RESOLVED". Lately I prefer to keep the bug report open until a new GParted release is created that contains the associated changes. That way if someone runs into the same problem they are less likely to create another new bug report. If you wish to close the report now, then I am okay with that. Otherwise I plan to close it when the next version of GParted is released. :-)
(In reply to comment #25) > If you wish to close the report now, then I am okay with that. > > Otherwise I plan to close it when the next version of GParted is released. :-) Sounds like the best solution. I wouldn't mess with the bug status myself anyway as I'm not a developer :P
The enhancement to address this bug report has been included in GParted 0.11.0 released on Dec. 13, 2011.