GNOME Bugzilla – Bug 306403
filenames in non-UTF-8 encodings are not handled correctly
Last modified: 2020-11-11 19:15:09 UTC
Please describe the problem:
The ZIP format (http://www.info-zip.org/pub/infozip/doc/) does not specify the
encoding of the filenames of the compressed files.
Therefore, ZIP files created on old systems may contain filenames in a
non-Latin 8-bit encoding (for example, Cyrillic, Greek, etc.).
Fileroller has trouble dealing with these files as it cannot "autoconvert" the
filename from the source encoding to UTF-8.
An informal survey has been carried out on this:
a. WinZIP (Windows) manages to autodetect/convert
Steps to reproduce:
(it contains a single .doc file; the name is in Greek, in CP737 encoding (iconv
2. Open with file-roller.
3. Observe the filename - try to extract file.
An incorrect filename appears. The file can be neither extracted nor renamed
from within the ZIP archive.
File-roller should attempt to detect the encoding of the filename and do the
appropriate conversion (iconv-style) to UTF-8. If it cannot do a conversion, it
should nevertheless make the file accessible. For example, unconverted
characters could be changed to 0xFFFD (Unicode Replacement Character,
Does this happen every time?
Have a look at the thread at http://mail.nl.linux.org/linux-utf8/2005-06/#00000
and specifically walk through the SUMMARY mail.
*** Bug 152236 has been marked as a duplicate of this bug. ***
Summary URL: http://mail.nl.linux.org/linux-utf8/2005-06/#00010
The encoding detection of modern browsers works quite well for me, e.g.
Filename encodings can be changed with convmv:
*** Bug 320467 has been marked as a duplicate of this bug. ***
*** Bug 333225 has been marked as a duplicate of this bug. ***
Confirmed with File Roller 2.14.0 on Ubuntu Dapper.
I have the same problem with cp949-encoded files. Although it is an annoying workaround, the zip command helps me correct this.
$ zip -FF cp949.zip (-F or -FF)
$ unzip cp949.zip
...... <- filename is broken.
$ convmv -f cp949 -t utf8 cp949.zip
..... <- If the converted name is shown correctly, do the real rename.
$ convmv -f cp949 -t utf8 cp949.zip --notest
Without the 'zip -FF' step, convmv complains about a bad encoding.
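For readers without convmv, the rename step above can be approximated in a few lines of Python. This is only a sketch of the idea (decode each name from the legacy encoding, re-encode it as UTF-8), not a reimplementation of convmv. The function names convert_name and rename_tree are made up for illustration, and like convmv's test mode it changes nothing unless dry_run is turned off.

```python
import os


def convert_name(raw: bytes, source_encoding: str) -> bytes:
    """Re-encode one filename from a legacy encoding to UTF-8.

    Names that are already valid UTF-8 are returned unchanged, so running
    the conversion a second time does not damage them.
    """
    try:
        raw.decode("utf-8")
        return raw
    except UnicodeDecodeError:
        # Raises UnicodeDecodeError if the name is not valid in source_encoding.
        return raw.decode(source_encoding).encode("utf-8")


def rename_tree(root: str, source_encoding: str, dry_run: bool = True) -> list:
    """Walk root bottom-up and collect (old, new) byte paths to rename.

    With dry_run=True (the default) nothing on disk is touched.
    """
    changes = []
    # Passing bytes to os.walk makes it yield raw, undecoded names.
    for dirpath, dirnames, filenames in os.walk(os.fsencode(root), topdown=False):
        for name in filenames + dirnames:
            new_name = convert_name(name, source_encoding)
            if new_name != name:
                old_path = os.path.join(dirpath, name)
                new_path = os.path.join(dirpath, new_name)
                changes.append((old_path, new_path))
                if not dry_run:
                    os.rename(old_path, new_path)
    return changes
```

Calling rename_tree("extracted", "cp949") would only report what it intends to do; passing dry_run=False performs the renames. The directory name "extracted" is just a placeholder.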
My system profile,
ubuntu@ubuntu:~$ zip -help | grep '\-F'
-F fix zipfile (-FF try harder) -D do not add directory entries
ubuntu@ubuntu:~$ unzip -v
UnZip 5.52 of 28 February 2005, by Ubuntu. Original by Info-ZIP.
Latest sources and executables are at ftp://ftp.info-zip.org/pub/infozip/ ;
see ftp://ftp.info-zip.org/pub/infozip/UnZip.html for other sites.
Compiled with gcc 4.0.3 (Ubuntu 4.0.3-1ubuntu3) for Unix (Linux ELF) on Mar 23 2006.
UnZip special compilation options:
COPYRIGHT_CLEAN (PKZIP 0.9x unreducing method not supported)
USE_UNSHRINK (PKZIP/Zip 1.x unshrinking method supported)
USE_DEFLATE64 (PKZIP 4.x Deflate64(tm) supported)
[decryption, version 2.9 of 05 May 2000]
UnZip and ZipInfo environment options:
ubuntu@ubuntu:~$ zip -v
Copyright (C) 1990-2005 Info-ZIP
Type 'zip "-L"' for software license.
This is Zip 2.31 (March 8th 2005), by Info-ZIP.
Currently maintained by Onno van der Linden. Please send bug reports to
the authors using http://www.info-zip.org/zip-bug.html; see README for details.
Latest sources and executables are at ftp://ftp.info-zip.org/pub/infozip, as of
above date; see http://www.info-zip.org for other sites.
Compiled with gcc 4.0.1 20050522 (prerelease) (Debian 4.0.0-7ubuntu7) for Unix (Linux ELF) on May 26 2005.
Zip special compilation options:
[encryption, version 2.9 of 05 May 2000]
The encryption code of this program is not copyrighted and is
put in the public domain. It was originally written in Europe
and, to the best of our knowledge, can be freely distributed
in both source and object forms from any country, including
the USA under License Exception TSU of the U.S. Export
Administration Regulations (section 740.13(e)) of 6 June 2002.
Zip environment options:
Now, this appears to be a generic problem that manifests itself in several places.
I feel it is desirable to have some sort of generic library that takes into account the system locale settings and automagically determines the most appropriate source encoding before it converts to UTF-8.
Apart from the ZIP file format, which does not specify the encoding of the filenames, ID3v1/ID3v2 tags in MP3 files do not specify the encoding either. Therefore, even services such as www.mugshot.org (which shares the details of the currently playing song) stumble on the issue.
The "algorithm" would be something like:
1. Try to check if the string is UTF-8 encoded. If yes, use as is, else Step 2.
2. Depending on the system locale, try its legacy encodings in order. For example,
   el_GR: try iconv -f iso-8859-7 -t utf-8. If it succeeds, accept.
          Otherwise try iconv -f cp737 -t utf-8. If it succeeds, accept.
I do not know if the association between locale values and legacy encodings is available in a comprehensive list. If so, it would be trivial to complete without individual input from each locale's users.
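The steps above can be sketched in Python as follows. Treat this as a rough illustration only: the table LEGACY_ENCODINGS is hypothetical and holds just the el_GR entry from this comment, and the function name to_utf8_text is made up. One thing goes beyond the proposal: a decoding that produces control characters is rejected, because otherwise ISO-8859-7 would "succeed" on almost any byte string and CP737 would never be tried. The last resort substitutes U+FFFD, as suggested in the original report.

```python
import unicodedata

# Hypothetical lookup table; only the el_GR entry comes from the discussion.
LEGACY_ENCODINGS = {
    "el_GR": ["iso-8859-7", "cp737"],
}


def looks_like_text(text: str) -> bool:
    """Reject decodings containing control characters (an added heuristic)."""
    return not any(unicodedata.category(ch) == "Cc" for ch in text)


def to_utf8_text(raw: bytes, locale_name: str) -> str:
    # Step 1: if the bytes are already valid UTF-8, use them as they are.
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        pass
    # Step 2: try the legacy encodings associated with the locale, in order.
    for encoding in LEGACY_ENCODINGS.get(locale_name, []):
        try:
            candidate = raw.decode(encoding)
        except UnicodeDecodeError:
            continue
        if looks_like_text(candidate):
            return candidate
    # Last resort: keep the name usable, with U+FFFD for undecodable bytes.
    return raw.decode("utf-8", errors="replace")
```

The ordering in the table matters: the first encoding that yields a plausible result wins, so the guess can still be wrong for short names.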
Another bug report that deals with the same issue,
*** Bug 346018 has been marked as a duplicate of this bug. ***
Adding links on this
a. Dmitry Butskoy volunteered to write a patch for "unzip".
b. Ubuntu blueprint that captures the current info of this issue,
There is another report from Ubuntu 7.10
Encoding autodetection, as proposed by Simos, should be implemented and it'll cover many cases. But it's not enough, there'll always be cases where it will fail, because there is much overlapping in the #128 - #255 area in many code pages.
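To make the overlap concrete, here is a small Python sketch that decodes the same three bytes under several DOS code pages. The helper name plausible_decodings and the sample bytes are chosen only for illustration. Every decoding succeeds without error, so "iconv did not fail" cannot tell the code pages apart.

```python
def plausible_decodings(raw: bytes, encodings):
    """Return {encoding: text} for every encoding that decodes raw without error."""
    results = {}
    for encoding in encodings:
        try:
            results[encoding] = raw.decode(encoding)
        except UnicodeDecodeError:
            pass
    return results


# Three bytes from the upper half of the 8-bit range.
sample = bytes([0x80, 0x81, 0x82])
for encoding, text in plausible_decodings(sample, ["cp737", "cp850", "cp866"]).items():
    print(encoding, text)
```

Each code page yields a different but perfectly valid string, which is why a manual override is needed in addition to any autodetection.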
So, a "manually select filename encoding" option is necessary, and as such, maybe it should be the first one to be implemented, even as just a command line option with no GUI equivalent.
E.g. from unzip --help:
-O CHARSET specify a character encoding for DOS, Windows and OS/2 archives
-I CHARSET specify a character encoding for UNIX and other archives
As a sidenote,
* As of September 2007, the zip file format supports utf-8 encoding:
* After that, many zip products support utf-8, like winzip, 7zip etc.
Please support that too, it would solve much of our encoding problems, at least for newly created .zips.
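As a rough illustration of how a reader can tell the two kinds of entries apart: the format marks UTF-8 names with bit 11 (0x800) of the general purpose flags, and Python's zipfile module exposes those flags as ZipInfo.flag_bits. Names without the flag are decoded by zipfile as CP437, so they can be turned back into the original bytes and decoded again with a user-chosen encoding. The names recover_name and list_names are made up for this sketch, and the legacy encoding still has to be supplied by the caller.

```python
import zipfile

UTF8_FLAG = 0x800  # general purpose bit 11: the name is stored as UTF-8


def recover_name(name: str, flag_bits: int, legacy_encoding: str) -> str:
    """Return the intended name for one archive entry.

    zipfile decodes unflagged names as CP437, so encoding back to CP437
    restores the stored bytes, which are then decoded as legacy_encoding.
    """
    if flag_bits & UTF8_FLAG:
        return name
    try:
        return name.encode("cp437").decode(legacy_encoding)
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Keep the entry accessible even if the guess was wrong.
        return name


def list_names(archive, legacy_encoding: str) -> list:
    """List entry names of a ZIP file (path or file object)."""
    with zipfile.ZipFile(archive) as zf:
        return [
            recover_name(info.filename, info.flag_bits, legacy_encoding)
            for info in zf.infolist()
        ]
```

For example, list_names("archive.zip", "cp737") would show Greek names from an old DOS-style archive, while entries flagged as UTF-8 are left alone. The file name "archive.zip" is a placeholder.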
Simos, thanks for the blueprint, at least we now have a workaround.
export UNZIP="-O cp737"
export ZIPINFO="-O cp737"
(to be used in system configuration files).
*** Bug 547312 has been marked as a duplicate of this bug. ***
Created attachment 117397 [details] [review]
Patch for src/fr-command.c
Regarding bug 547312, I think the previous file-roller did not have this bug.
The previous file-roller could show the different filenames with the garbled characters.
The attached patch can fix bug 547312 at the moment.
(In reply to comment #16)
> Created an attachment (id=117397) 
> Patch for src/fr-command.c
> Regarding to bug 547312, I think previous file-roller doesn't have this bug.
> Previous file-roller can show the different filenames with the garbaged chars.
> The attached patch can fix bug 547312 at the moment.
actually, bug 547312 is supposed to be already fixed in version 2.23.6
> actually, bug 547312 is supposed to be already fixed in version 2.23.6
Thanks for your reply.
I confirmed "LC_ALL=C" is changed to "LC_MESSAGES=C" in 2.23.6.
There may be another complicating issue -- the convention is to assume the filenames are in CP850 format and convert them automatically to ISO-8859-1 format on unzip, i.e. it is not just that there is a lack of conversion:
AFAIK in Ubuntu file-roller handles .rar files too. These files have the same issue with encoding. If I right-click the file in Nautilus and choose "Extract here" from the context menu, files are extracted with wrong names. However, if I invoke 'unrar x file.rar' from the command line, files are extracted with correct names.
I understand that description of that bug in tracker applies to .zip files only. However the title is generic and doesn't mention zip. If my rar problem should be reported as a separate bug, please tell me, I'll file it as separate.
The problem with .rar files does not appear when only unrar is installed on your system. See comment #58 in https://bugs.launchpad.net/ubuntu/+source/file-roller/+bug/177929
Furthermore, Ark is not affected by whether the rar package is installed.
Looking at the file-roller and Ark source,
the problem with the unrar command implementation
lies in the /file-roller-22.214.171.124/src/fr-command-rar.c file.
I am not a developer, but in the aforementioned file,
changing the order of these two lines as follows
and recompiling solves the problem in file-roller:
if (have_rar ())
	fr_process_begin_command (comm->process, "unrar");
else
	fr_process_begin_command (comm->process, "rar");
This is one of the major problems when manipulating ZIP files from the Windows world. By default, WinZip and many other programs still encode filenames in legacy encodings, leading to incorrect filenames when they are extracted on Linux.
We definitely need to address this one way or another. And fixing winzip is unfortunately not an option.
7-Zip for windows has a way of detecting the encoding correctly. It's open-source so could there be a way to see how they do it and implement it?
Could someone please provide a link to a new ZIP file to test this. Thanks.
A link to a ZIP file with Hebrew filenames not displayed correctly in file roller:
A zip archive that file roller fails to unzip is freely downloadable here  (sorry, it's quite big) and the file with accent is:
"/Gibilterra Land/04 Vincenzo Costantino Cinaski - Niente è grande come le piccole cose.mp3"
The name is read by file roller as "04 Vincenzo Costantino Cinaski - Niente e?? grande come le piccole cose.mp3" thus both renaming and extracting fail with the following msg:
"caution: filename not matched: Gibilterra Land/04 Vincenzo Costantino Cinaski \- Niente e\?\? grande come le piccole cose.mp3"
If needed I can try to create a smaller test-case.
* File Roller v. 3.4.1 on Ubuntu 12.04
(In reply to comment #26)
> A zip archive that file roller fails to unzip is freely downloadable here 
> (sorry, it's quite big) and the file with accent is:
> "/Gibilterra Land/04 Vincenzo Costantino Cinaski - Niente è grande come le
> piccole cose.mp3"
> The name is read by file roller as "04 Vincenzo Costantino Cinaski - Niente e??
> grande come le piccole cose.mp3" thus both renaming and extracting fail with
> the following msg:
> "caution: filename not matched: Gibilterra Land/04 Vincenzo Costantino Cinaski
> \- Niente e\?\? grande come le piccole cose.mp3"
I cannot reproduce the problem, try to execute the following command to see if the output is correct:
7z l -slt -bd -y -- /home/paolo/Scrivania/GibilterraLand.zip
if 7z is not installed on your system, try this one instead:
unzip -ZTs -- GibilterraLand.zip
these are the two commands used by file-roller to list the content of a zip archive, the priority is given to 7z, if it is not available unzip is used.
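The selection described above can be modelled in Python like this. It is only a sketch of the stated priority (7z first, unzip as fallback) using the two command lines quoted in this comment; it is not file-roller's actual code, and the function name listing_command is made up.

```python
import shutil


def listing_command(archive: str, which=shutil.which):
    """Return the argv used to list a ZIP archive, or None if no tool is found.

    The `which` parameter exists so the lookup can be replaced in tests.
    """
    if which("7z"):
        return ["7z", "l", "-slt", "-bd", "-y", "--", archive]
    if which("unzip"):
        return ["unzip", "-ZTs", "--", archive]
    return None
```

Running the returned command by hand (for example with subprocess.run) is a quick way to check whether a wrong name comes from the backend tool rather than from file-roller itself.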
It can be useful to know the command versions as well, for me:
"7z --help" prints "7-Zip  9.20"
"unzip" prints "UnZip 6.00 of 20 April 2009"
I don't have 7z, so I use unzip.
The relevant part is:
-rw-r--r-- 2.1 unx 2786539 bX defN 20120522.144145 Gibilterra Land/04 Vincenzo Costantino Cinaski - Niente e?? grande come le piccole cose.mp3
That's strange since my unzip version matches yours:
UnZip 6.00 of 20 April 2009, by Debian. Original by Info-ZIP.
After installing 7z the problem is different: I can rename and extract the file, but it appears to be named "04 Vincenzo Costantino Cinaski - Niente eÌ grande come le piccole cose.mp3"
I don't know if it matters, but my locale is IT_it
(In reply to comment #28)
> I haven't 7z, so I use unzip.
> The relevant part is:
> -rw-r--r-- 2.1 unx 2786539 bX defN 20120522.144145 Gibilterra Land/04
> Vincenzo Costantino Cinaski - Niente e?? grande come le piccole cose.mp3
> ..so wrong.
> That's strange since my unzip version matches yours:
> UnZip 6.00 of 20 April 2009, by Debian. Original by Info-ZIP.
> After installing 7z the problem is different: I can rename and extract the
> file, but it appears to ben named "04 Vincenzo Costantino Cinaski - Niente eÌ
> grande come le piccole cose.mp3"
> Don't know if matters but my locale is IT_it
maybe this is the problem, mine is it_IT.utf8
Sorry: I wrote without checking (I thought it could be a problem with Italian's locale, I didn't notice you were Italian too!), mine it's
I don't know what else I could check for...
I forgot to include 7z's output and version.
Output for the file is:
Path = Gibilterra Land/04 Vincenzo Costantino Cinaski - Niente eÌ grande come
le piccole cose.mp3
Folder = -
Size = 2786539
Packed Size = 2738046
Modified = 2012-05-22 14:41:46
Attributes = .....
Encrypted = -
CRC = 3B41EBEE
Method = Deflate
Host OS = Unix
Version = 20
While version is:
7-Zip  9.20 Copyright (c) 1999-2010 Igor Pavlov 2010-11-18
p7zip Version 9.20 (locale=it_IT.UTF-8,Utf16=on,HugeFiles=on,8 CPUs)
I attach also the "zipinfo -v" output as reference:
Central directory entry #11:
There are an extra 16 bytes preceding this file.
Gibilterra Land/04 Vincenzo Costantino Cinaski - Niente e?? grande come le
offset of local header from start of archive: 25514261
file system or operating system of origin: Unix
version of encoding software: 2.1
minimum file system compatibility required: MS-DOS, OS/2 or NT FAT
minimum software version required to extract: 2.0
compression method: deflated
compression sub-type (deflation): normal
file security status: not encrypted
extended local header: yes
file last modified on (DOS date/time): 2012 May 22 14:41:46
file last modified on (UT extra field modtime): 2012 May 22 14:41:45 local
file last modified on (UT extra field modtime): 2012 May 22 12:41:45 UTC
32-bit CRC value (hex): 3b41ebee
compressed size: 2738046 bytes
uncompressed size: 2786539 bytes
length of filename: 91 characters
length of extra field: 12 bytes
length of file comment: 0 characters
disk number on which file begins: disk 1
apparent file type: binary
Unix file attributes (100644 octal): -rw-r--r--
MS-DOS file attributes (00 hex): none
The central-directory extra field contains:
- A subfield with ID 0x5855 (old Info-ZIP Unix/OS2/NT) and 8 data bytes:
11 f1 c5 4f 89 89 bb 4f.
There is no file comment.
Duplicate of bug 581496?
We have Unicode-enabled and non-Unicode-enabled ZIP archives.
We also have three kinds of tools: Info-ZIP, p7zip, and The Unarchiver.
UnZip in Info-ZIP always lists non-ASCII characters in file names as '?'. It can correctly extract Unicode-enabled archives, though.
UnZip is generally included in the default installation and File Roller supports it.
p7zip can correctly list non-ASCII characters in file names for Unicode-enabled archives. It has no luck with non-Unicode-enabled ones.
File Roller supports 7z and prefers 7z to unzip when 7z is installed. That's why p7zip can be a workaround for some people.
The Unarchiver  or lsar/unar supports automatic encoding detection and manual encoding selection natively. Please check its man page for inspiration.
File Roller currently has limited support for it and does not use unar for ZIP archives.
(In reply to comment #33)
> The Unarchiver  or lsar/unar supports auto encoding detect and manual
> encoding selection natively. Please check its man page for inspiration.
> File Roller has limited support for it currently. File Roller don't use unar
> for ZIP archives currently.
> 1. http://code.google.com/p/theunarchiver/
This comment doesn't seem to have received much attention, but it was the first time I'd heard about The Unarchiver, so I tried using it. I found that "unar" on the command line correctly handled a zip file compressed on Windows with Shift-JIS filenames, which File Roller (using unzip) had problems with. In other words, at least in some cases, using unar instead of unzip fixes this bug!
To give some more details about the support for unar in File Roller: The source for File Roller contains the add-in file /src/fr-command-unarchiver.c (and the .h header file). However, unar is given a low priority, because it is at the bottom of all of the register_archive calls in /src/fr-init.c (on line 371), and also because it cannot write zip files, only read them, as it says in the nearby comment (line 342):
/* The order here is important. Commands registered earlier have higher
* priority. However commands that can read and write a file format
* have higher priority over commands that can only read the same
* format, regardless of the registration order. */
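The rule in that comment can be modelled with a toy example in Python. This is not file-roller's code: the Command class, the pick_reader function and the registry contents are all hypothetical, and only the two ordering rules come from the quoted comment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Command:
    name: str
    readable: frozenset  # formats the command can read
    writable: frozenset  # formats the command can write


def pick_reader(registry, fmt):
    """Pick the command used to open `fmt` from a list in registration order."""
    candidates = [cmd for cmd in registry if fmt in cmd.readable]
    if not candidates:
        return None
    # sorted() is stable: read+write commands (key False) come first, and
    # registration order is preserved inside each group.
    return sorted(candidates, key=lambda cmd: fmt not in cmd.writable)[0]


# Hypothetical registry: unar is registered first but can only read zip.
registry = [
    Command("unar", frozenset({"zip", "rar"}), frozenset()),
    Command("zip", frozenset({"zip"}), frozenset({"zip"})),
]
print(pick_reader(registry, "zip").name)
```

Under this rule, a read-only command loses to a read-write one for the same format even when it is registered earlier, so reordering the registrations alone would not be enough.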
This suggests the following possible workarounds and solutions for this bug:
(Please note: these are only suggestions, and are not necessarily all desirable or feasible solutions.)
1) Use unar on the command line instead of File Roller. (This is just a workaround.)
2) Comment out the following line (line 367) in /src/fr-init.c and recompile. This disables the use of zip/unzip, so that unar is used instead (I think. I haven't actually checked.) The disadvantage is that zip cannot be used to create archives, so this is also not a very good solution.
3) Alter /src/fr-init.c to change the priorities of archive commands so that unar is used instead of unzip. This is a more permanent solution but requires a substantial change to the program.
4) Alter File Roller so that the priority of archive commands is set using the GUI or a configuration file, instead of in the source code. Again, this is a substantial change.
5) Alter the source of unzip so that it uses the same method as unar for detecting encodings. This is a substantial change to zip/unzip, which is upstream of File Roller. Also, unar is written in Objective C and unzip is in C.
It also raises the following questions:
Q1) Does unar solve this problem for all non-UTF-8 filenames? To put it another way, are there any cases where unar fails to handle filenames?
Q2) Are there any reasons to prefer unzip to unar (for example, does unar have bugs or additional dependencies, or does unzip have extra capabilities)?
Q3) Are there any other problems with any of the above solutions?
I'm new to commenting on bugs so please tell me if I made any breaches of etiquette or if anything else is wrong with my comment.
So, this bug is almost 9 years old and still no changes? Unzip with patches using libnatspec detects filename encoding correctly, but file-roller doesn't. Just tried with file-roller-3.10. Will anyone fix this?
Bugs get fixed quicker if somebody provides a patch. Age is not a criterion.
Sounds like someone missed some orientation before using GNOME:
I know that Xarchiver deals correctly with encoding; maybe port code from that project? https://sourceforge.net/projects/xarchiver/
AAA, got it! Please provide an option to use unzip instead of 7zip. I found the root of the problem on the MATE GitHub: https://github.com/mate-desktop/engrampa/issues/5
I made a patch to get File-roller to always use unzip instead of p7zip for zip files.
Here is the bug link.
This can be done as an option in a config file. But this way it is good for me too.
(In reply to Pilot6 from comment #40)
> I made a patch to ger File-roller always use unzip instead of p7zip for zip
> Here is the bug link.
> This can be done as an option in a config file. But this way it is good for
> me too.
Hmm, preferring unzip doesn't seem to work for me on Fedora 23 with unzip 6.0. I read some comments elsewhere saying that proper handling of non-ASCII filenames was only added after the 6.0 release, possibly in the current beta of unzip 6.10.
Given that there is no firm release date for unzip 6.10, would it be possible to let file-roller use unar first, if it is present on the system, when dealing with zip files?
I was downloading some music files I had bought, including files with French and German characters, which were all added into one zip file before it was sent down the line. file-roller did not display the French and German characters correctly.
file-roller is only a graphical front-end for a command line application, in my case unzip.
Using the unzip -l command on the zip file revealed exactly the same output as in file-roller (funny symbols where French or German characters should be).
The comment made here (https://bugs.launchpad.net/debian/+source/unzip/+bug/10979/comments/25) explains what needs to be done for unzip to display characters correctly. In my case using the option "-O UTF-8" would bring up the correct characters:
unzip -l -O UTF-8 somezipfilewithUTF-8characters.zip
According to: http://manpages.ubuntu.com/manpages/xenial/en/man1/unzip.1.html#contenttoc6 environment variables can be set in order to have unzip always use a certain character set.
I added the following to /etc/environment
then I needed to log out and log in again (no reboot) for the setting to take effect.
Using the command unzip -v revealed that my settings were in effect:
UnZip and ZipInfo environment options:
UNZIP: -O UTF-8
Now executing unzip -l somezipfilewithUTF-8characters.zip was enough to display the file names correctly in the terminal.
But file-roller wouldn't do that.
The comment by the developer (https://bugzilla.gnome.org/show_bug.cgi?id=306403#c27) helps to understand what file-roller is doing to list the content of a zip file: unzip -ZTs
The "Z" stands for ZipInfo mode (source: unzip --help), which I assumed calls the command zipinfo. This command offers again the same -O option to define a particular character set (source: zipinfo --help).
The unzip -v command indicated that the following environment options are active:
UnZip and ZipInfo environment options:
UNZIP: -O UTF-8
The ZIPINFO variable is empty and I concluded that I would need to add the same line as before in the /etc/environment file for UNZIP now again for ZIPINFO.
I added the following line to /etc/environment
logged out and in again and then ran the command unzip -v to show:
UnZip and ZipInfo environment options:
UNZIP: -O UTF-8
ZIPINFO: -O UTF-8
Now running file-roller on the zip file showed all characters correctly. :D
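If editing /etc/environment and logging out is too heavy, the same two variables can be set for a single launch. The Python sketch below only builds the environment; the helper name unzip_charset_env is made up, and it assumes an unzip build that accepts the -O option, like the one described in this comment.

```python
import os


def unzip_charset_env(charset: str, base=None) -> dict:
    """Return a copy of the environment with UNZIP and ZIPINFO set to -O charset.

    `base` defaults to the current process environment and is never modified.
    """
    env = dict(os.environ if base is None else base)
    option = f"-O {charset}"
    env["UNZIP"] = option    # read by unzip
    env["ZIPINFO"] = option  # read by zipinfo, i.e. unzip -Z
    return env
```

Something like subprocess.run(["file-roller", "archive.zip"], env=unzip_charset_env("UTF-8")) would then start file-roller with the setting applied to that one process only. The archive name is a placeholder.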
It works also for the file mentioned in https://bugzilla.gnome.org/show_bug.cgi?id=306403#c26 (http://multimedia.kataweb.it/xl/XL-VIDEODROME/mp3/GibilterraLand.zip)
For the Hebrew file (https://bugs.launchpad.net/ubuntu/+source/unzip/+bug/580961/+attachment/1803463/+files/%D7%90%D7%A7%D7%95%D7%9C%D7%95%D7%92%D7%99%D7%94%20%D7%9C%D7%9E%D7%94%D7%A0%D7%93%D7%A1%D7%99%D7%9D.zip) in comment 25 (https://bugzilla.gnome.org/show_bug.cgi?id=306403#c25) I needed to exchange UTF-8 with 862 in both lines in /etc/environment. Afterwards (logging out and in again) it showed Hebrew characters in file-roller. I got the character set number from: https://bugs.launchpad.net/debian/+source/unzip/+bug/10979/comments/17
My system details:
UnZip 6.00 of 20 April 2009, by Debian. Original by Info-ZIP.
file-roller version: 3.6.3
Kernel: 3.19.0-32-generic x86_64 (64 bit) Desktop: Cinnamon 2.8.8 Distro: Linux Mint 17.3 Rosa
I hope this long write-up will be helpful.
It would be nice if file-roller could, if not auto-detect the character set, at least make it selectable. From this point of view it's not a bug but maybe a badly documented feature.
I recently wrote patches to p7zip and unzip for OEM charset detection based on the system locale. That is exactly what the Windows built-in zip encoder does.
To get correct file names in file-roller you just need to install the patched p7zip and set your system locale correctly. Or do something like
alias 7z='LC_ALL=el_GR.UTF-8 7z'
if you prefer opening archives using a locale different from the system one.
Alkis Georgopoulos is planning to package patched p7zip to .deb's and upload to ppa: https://github.com/mate-desktop/engrampa/issues/5#issuecomment-648410042
bugzilla.gnome.org is being replaced by gitlab.gnome.org. We are closing all old bug reports and feature requests in GNOME Bugzilla which have not seen updates for a long time.
If you still use file-roller and if you still see this bug / want this feature in a currently supported version of GNOME (currently that would be 3.38), then please feel free to report it at https://gitlab.gnome.org/GNOME/file-roller/-/issues/
Thank you for creating this report and we are sorry it could not be implemented (volunteer workforce and time is limited).