GNOME Bugzilla – Bug 156371
The filename should be converted between remote charset and locale charset
Last modified: 2018-07-01 08:40:38 UTC
When getting and putting files, the filename should be converted between the remote charset and the locale charset, because they are not always the same. This matters especially for CJK users: there are so many charsets for these languages that we can't assume both sides use the same one. I found that gftp converts the remote charset to UTF-8 when reading the listing, but not when getting or putting a file. As a result, if you download files from a server whose charset differs from the locale's, the filename is saved as garbled characters (on a Linux filesystem), or the save fails (on a mounted Windows partition, where it complains "open file, invalid arguments").

To reproduce:
1) Launch gftp and connect to a server whose charset is different from the locale charset.
2) Download some files whose names contain non-English characters to a Linux partition. You'll find that the filenames are garbled.
3) Repeat step 2, replacing the local path with a mounted win32 partition (with the mount parameter "iocharset=utf8"). It will refuse to work, complaining: open file "...", invalid arguments. I guess this is because the filename is not a valid UTF-8 string.

Solution: convert the filename encoding between the remote charset and the locale charset before download/upload.
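A minimal sketch of the conversion requested above, using POSIX iconv. The helper name convert_filename is hypothetical and is not taken from gftp or from any of the attached patches; the output buffer size is an assumption that covers the multibyte charsets discussed in this report.

```c
#include <assert.h>
#include <iconv.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not gftp code): convert a NUL-terminated filename
 * from charset `from` to charset `to`.
 * Returns a malloc'ed string, or NULL if the charset pair is unsupported,
 * the input is not valid in `from`, or memory runs out. */
char *convert_filename(const char *name, const char *from, const char *to)
{
    assert(name != NULL && from != NULL && to != NULL);

    iconv_t cd = iconv_open(to, from);
    if (cd == (iconv_t)-1)
        return NULL;

    size_t in_left = strlen(name);
    size_t out_left = in_left * 4;      /* assumed upper bound on growth */
    char *out = malloc(out_left + 1);
    if (out == NULL) {
        iconv_close(cd);
        return NULL;
    }

    char *in_ptr = (char *)name;        /* iconv() wants char **, input is not modified */
    char *out_ptr = out;

    if (iconv(cd, &in_ptr, &in_left, &out_ptr, &out_left) == (size_t)-1) {
        free(out);                      /* invalid or incomplete byte sequence */
        iconv_close(cd);
        return NULL;
    }
    iconv(cd, NULL, NULL, &out_ptr, &out_left);  /* flush any shift state */
    *out_ptr = '\0';
    iconv_close(cd);
    return out;
}
```

For a download, the remote name would be converted from the server charset (for example "GBK") to the locale charset before the local file is opened; for an upload, the conversion runs in the opposite direction. A NULL result should be treated as "keep the original bytes" rather than as a reason to skip the file.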
voting for this issue.
voting too. I tried to download a sjis file with ja_JP.UTF8 locale. Files were saved with garbage text as the file name.
Created attachment 37770
a patch for this bug

I'm not the original author of this patch. Besides fixing this bug, the patch also changes the appearance of gftp to the author's taste; that part should be ignored.
you can contact the original author via ajing99@mails.tsinghua.edu.cn
Several people have already confirmed this bug, so why is its state still unconfirmed? It's a very severe problem for us CJK users. I hope the patch above can be reviewed or applied, thanks!
Created attachment 51658
fixed patch
*** Bug 339618 has been marked as a duplicate of this bug. ***
*** Bug 313420 has been marked as a duplicate of this bug. ***
*** Bug 338530 has been marked as a duplicate of this bug. ***
*** Bug 153954 has been marked as a duplicate of this bug. ***
Although my bug report #339618 has been marked as a duplicate, I think my patch is more comprehensive and absolutely worth reviewing. Please also review my patch. We fix the same bug with different methods, and there may be parts that can be integrated. There are more things that need to be considered at the same time when fixing this bug. I hope the patches provided in this bug report can be integrated with mine, which could yield a better result. My patch: http://bugzilla.gnome.org/attachment.cgi?id=75225
Created attachment 75226
New patch for this bug (by Hong Jen Yee)
Hi, This is fixed in CVS. I created a tarball at http://www.gftp.org/gftp-test.tar.bz2 that has the latest code from CVS. Please let me know either way if this works. It is fixed for the local, FTP, FTPS and SSH protocols. It is not fixed in the HTTP, HTTPS and FSP protocols. I will fix those whenever I get confirmation that the other protocols are working properly. Sorry this took so long to do. Brian
(In reply to comment #13)
> Hi,
> This is fixed in CVS. I created a tarball at
> http://www.gftp.org/gftp-test.tar.bz2 that has the latest code from CVS. Please
> let me know either way if this works. It is fixed for the local, FTP, FTPS and
> SSH protocols. It is not fixed in the HTTP, HTTPS and FSP protocols. I will fix
> those whenever I get confirmation that the other protocols are working
> properly.
>
> Sorry this took so long to do.
>
> Brian

No, this tarball doesn't work, at least for me. It even performs worse than the version I reported against, as far as encoding handling is concerned. I used it to open an FTP server whose charset is GBK, and I set remote_charsets = GBK,UTF-8. It could hardly display the Chinese file names. In my case it displayed them correctly only once, and I couldn't reproduce that. Anyway, thanks for your effort in digging into this issue.
For years we have lacked a single user-friendly FTP client on Linux. This is a very important problem, and I am writing to bring it to the developers' attention. Most of the people really worried about this issue are Asian people who are probably not directly involved in the development (for sure, we are always thankful for all the work you have done for free).

Our local campus LUG recently held an event to get new non-technical users involved in Linux. While organizing it we listed all the potential problems a normal user would face on Linux, and the lack of an FTP client with working charset conversion came out as the single most important missing feature, because so many resources are distributed on campus. In total there are more than 50TB of resources that users rely on every day. 100% of them are served from Windows servers running the Serv-U FTP server, offering FTP access in the GB18030 charset, with every folder name and file name in Chinese. Linux users need to access this data. I think the situation is much the same in a lot of places in Asia.

If anyone knows of a user-friendly FTP client that does charset conversion correctly, please recommend it and we will recommend it to other people in turn. All other features (multi-thread, proxy ftp, FTPS or SFTP, HTTPS, send raw command...) are not important at all compared to this one.
To Zhang Weiwu: Please consider applying my patch. (See my previous patch) I'm from Taiwan, and I use traditional Chinese (Big5 encoding). My patch can fix most of the problems. I'm sure it works since this has been proved by some Asian users. Besides, you can try filezilla3, which is a feature-rich ftp client supporting encoding conversion. Though it's still under development, and not complete, most of the important parts are useable. However, I personally hope this encoding issue can be fixed in gftp since I like it.
I got confirmation that the latest code in CVS works properly with different character sets. I have a test release at: http://www.gftp.org/gftp-test.tar.bz2 Please let me know if you run into any problems with this.
It's partially fixed, AFAICT.

1. If I set remote_charsets = "UTF-8,GB18030,GBK,GB2312,BIG5,Latin-1", the Chinese filenames are displayed as garbage text, and they can be downloaded under the garbage names.
2. If I set remote_charsets = "GBK", everything works fine.
3. If I launch gftp with an empty remote_charsets and then set it to "GBK" in the GUI, it works fine.
4. Following step 3, I reset remote_charsets = "UTF-8,GB18030,GBK,GB2312,Latin-1" and reconnect to the server. The Chinese filenames are then displayed correctly, but they can't be downloaded because the server can't find the file. (I think this is because the filename is not correctly iconv'ed.)
5. Following step 4, I set remote_charsets = "UTF-8,GB18030,GBK,GB2312"; everything works fine.
6. I relaunch gftp with remote_charsets = "UTF-8,GB18030,GBK,GB2312" and connect to the server; it works fine. Then I append "BIG5" to remote_charsets and download a file with a Chinese name. It crashes with:

   Error converting string '/水枪冲击墙面力的估算.doc' from character set UTF-8 to character set BIG5: Invalid byte sequence in conversion input
   Segmentation fault

   Note that I didn't refresh/reconnect in step 6 after appending "BIG5".

My conclusion: 1) Latin-1 can't be at the end of remote_charsets; 2) you must reconnect/refresh after remote_charsets changes.
Brian's test release works for me. Local charset UTF-8, remote charset Shift-JIS. Uploading files with Japanese characters in their names works perfectly. Downloading does too. I'm so glad to see this is finally fixed!
Basically, the latest test release works. However, I noticed some bugs.

I set remote_charset in the preferences dialog to the empty string "". Then I added an FTP site that uses "big5" encoding to the bookmarks and set its remote_charset to "big5". When I connected from the bookmark, "big5" encoding was used and the file names on that site were displayed correctly. If I enter the IP of that site manually in the address bar, the encoding of the remote filenames should theoretically be "", according to the global settings. However, when I connected to that site again by entering the IP manually, the remote files were still displayed using "big5", not the global setting "". So I closed gftp and restarted it. After gftp started up, I entered the same IP in the address bar again and hit Enter. This time, all remote file names were displayed as invalid strings, which means it used the remote_charset "" set in the global options. In short, when you enter the IP manually, gftp sometimes uses the global options and sometimes uses the settings of the previous connection. This kind of random behavior is definitely a bug.

Besides, I've reviewed your latest source code, and the way you handle filenames is not correct. As far as I can tell from the source, you now store the UTF-8 encoded filename in the gftp_file struct. This will cause serious problems. When a file has a name that can't be converted to UTF-8 (this really happens sometimes), its fle->file will be NULL, which means the file is skipped and not displayed in the directory listing. That is not correct, since the file really exists but is not shown. The filename in fle->file should always be in its native encoding, not UTF-8, or you won't be able to access files whose names can't be converted to valid UTF-8.

Every time you need to display the filename on screen, such as when adding it to the list view or showing it in a message dialog, you can use g_filename_display_name() to get a UTF-8 encoded filename for display. This way you display the filename in UTF-8 but store it in its non-UTF-8 native encoding. This is the correct way to deal with encoding problems; otherwise files with names that can't be converted to UTF-8 will be lost.

Thank you for the bug fix. Although there are still many problems, we're almost there.
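The "store native, convert only for display" idea can be sketched without GLib as follows. Everything here is illustrative: struct remote_file is a hypothetical stand-in for gftp_file, and the '?' substitution is only a crude replacement for what g_filename_display_name() does in a real GLib program.

```c
#include <assert.h>
#include <iconv.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for gftp_file. */
struct remote_file {
    char *raw_name;      /* bytes exactly as the server sent them */
    char *display_name;  /* UI-only copy; never sent back to the server */
};

/* Decode `raw` from `charset` into UTF-8; returns NULL on any failure. */
static char *to_utf8(const char *raw, const char *charset)
{
    iconv_t cd = iconv_open("UTF-8", charset);
    if (cd == (iconv_t)-1)
        return NULL;
    size_t in_left = strlen(raw);
    size_t out_left = in_left * 4;          /* assumed upper bound on growth */
    char *out = malloc(out_left + 1);
    if (out == NULL) {
        iconv_close(cd);
        return NULL;
    }
    char *in_ptr = (char *)raw, *out_ptr = out;
    size_t rc = iconv(cd, &in_ptr, &in_left, &out_ptr, &out_left);
    iconv_close(cd);
    if (rc == (size_t)-1) {
        free(out);
        return NULL;
    }
    *out_ptr = '\0';
    return out;
}

/* Lossy fallback: keep ASCII bytes, replace every other byte with '?'. */
static char *ascii_fallback(const char *raw)
{
    size_t n = strlen(raw);
    char *out = malloc(n + 1);
    if (out == NULL)
        return NULL;
    for (size_t i = 0; i < n; i++)
        out[i] = ((unsigned char)raw[i] < 0x80) ? raw[i] : '?';
    out[n] = '\0';
    return out;
}

/* Fill in an entry so that it can always be listed.
 * Returns 0 on success, -1 if memory allocation fails. */
int remote_file_init(struct remote_file *f, const char *raw, const char *charset)
{
    assert(f != NULL && raw != NULL && charset != NULL);
    f->raw_name = malloc(strlen(raw) + 1);
    if (f->raw_name == NULL)
        return -1;
    strcpy(f->raw_name, raw);

    f->display_name = to_utf8(raw, charset);
    if (f->display_name == NULL)
        f->display_name = ascii_fallback(raw);   /* never hide the file */
    if (f->display_name == NULL) {
        free(f->raw_name);
        f->raw_name = NULL;
        return -1;
    }
    return 0;
}
```

With this layout, a failed conversion only degrades what the user sees. Transfers always use raw_name, so the file stays reachable even when its name cannot be decoded.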
I fixed the bug with the remote_charsets option in SVN. I am still storing the filename as UTF8. I added a boolean flag to the gftp_file structure called filename_utf8_encoded. The files will always be shown to the user. If the file can't be converted to UTF8, the filename will be shown as blank. I'm open to other suggestions. The user can select that blank entry and the file should be transferred properly.
*** Bug 384238 has been marked as a duplicate of this bug. ***
This bug is still not correctly fixed.

1) _gftp_get_next_charset in lib/protocols.c returns each charset with an extra trailing comma ',', except for the last (or only) charset. Namely, if remote_charsets="UTF8, GBK", it returns "UTF8," and "GBK". This can explain why it fails for a setting ending with the Latin-1 charset, as stated in comment #18.
2) _do_convert_string in lib/protocols.c: two continue statements are missing.

I will attach a patch for these two bugs. Unfortunately, even with my patch the problem can't be solved correctly, and trying to fix it has convinced me that the correct method is the one proposed by Hong Jen Yee in comment #20: using two separate variables to store the remote filename and the filename shown to the user.

In the current implementation, gftp tries to guess the remote charset from the remote_charsets setting by iconv'ing the filenames. When iconv succeeds, the remote charset is known and is used to display the remote filenames. Unfortunately, this charset is not stored, so when the user wants to download a file, the filename needs to be iconv'ed back to the remote charset. Since the remote charset was not stored, gftp has to guess it once again by iconv'ing the filename from "UTF-8" to each charset in remote_charsets, which is not logical: remote_charsets may contain "UTF-8", in which case iconv will definitely succeed and the resulting remote charset will be "UTF-8", which is wrong.

Another possible solution would be to store the server charset once it has been figured out.
Created attachment 87180
patch for _do_convert_string
Created attachment 87181
patch for _gftp_get_next_charset.
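To illustrate the trailing-comma problem from point 1), here is a small sketch of a list tokenizer that returns each entry without the separator or surrounding whitespace. The function next_charset is hypothetical; it is neither the gftp function nor the attached patch.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical tokenizer (not the gftp function).
 * Returns the next entry of a comma-separated charset list as a malloc'ed
 * string, with the separator and surrounding whitespace stripped, and
 * advances *cursor past that entry.
 * Returns NULL when the list is exhausted (or if malloc fails). */
char *next_charset(const char **cursor)
{
    assert(cursor != NULL && *cursor != NULL);
    const char *p = *cursor;

    /* Skip separators, leading whitespace and empty entries. */
    while (*p == ',' || isspace((unsigned char)*p))
        p++;
    if (*p == '\0') {
        *cursor = p;
        return NULL;
    }

    size_t len = strcspn(p, ",");       /* entry runs up to ',' or end of string */
    const char *next = p + len;

    /* Trim trailing whitespace inside the entry. */
    while (len > 0 && isspace((unsigned char)p[len - 1]))
        len--;

    char *name = malloc(len + 1);
    if (name == NULL)
        return NULL;
    memcpy(name, p, len);
    name[len] = '\0';

    *cursor = next;
    return name;
}
```

Called repeatedly on "UTF8, GBK", this yields "UTF8", then "GBK", then NULL. The caller frees each returned string.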
I investigated this problem further today, and I found that it can't be solved by simply remembering the server charset as proposed in my previous comment #23. So I propose a solution that stores the filename in the remote charset and in UTF-8 separately, and also remembers the server charset. The server charset can be figured out by analyzing the filenames on the server. The pseudo code for guessing the server encoding may look like:

    if (server_charset is not set) {
        while (cur_charset = _get_next_charset()) {
            if (convert_to(filename, cur_charset) == SUCCESS) {
                server_charset = cur_charset;
                return converted_filename;
            }
        }
    } else {
        /* try the preselected charset first */
        if (convert_to(filename, server_charset) == SUCCESS)
            return converted_filename;
        else {
            while (cur_charset = _get_next_charset()) {
                if (convert_to(filename, cur_charset) == SUCCESS) {
                    server_charset = cur_charset;
                    return converted_filename;
                }
            }
        }
    }

With this algorithm it's not necessary to iterate over remote_charsets for every string, which is much more efficient than the current algorithm in gftp, and the correct encoding will be figured out eventually, provided it's contained in "remote_charsets". By keeping the remote filename intact, you don't need to do any encoding conversion, which may be error-prone, to operate on the remote files. In the current implementation, the remote filenames are processed incorrectly if remote_charsets begins with "UTF-8".
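The detection scheme in this comment could be sketched in C roughly as below. The struct connection, the decode helper and display_name_for are all hypothetical names chosen for illustration, and the sketch assumes charset names contain no spaces and are shorter than 64 bytes.

```c
#include <assert.h>
#include <iconv.h>
#include <stdlib.h>
#include <string.h>

#define MAX_CHARSET 64

/* Hypothetical per-connection state. */
struct connection {
    const char *remote_charsets;        /* user setting, e.g. "GBK,UTF-8" */
    char server_charset[MAX_CHARSET];   /* "" until a charset has been detected */
};

/* Decode `raw` from `charset` into UTF-8; returns NULL on any failure. */
static char *decode(const char *raw, const char *charset)
{
    iconv_t cd = iconv_open("UTF-8", charset);
    if (cd == (iconv_t)-1)
        return NULL;
    size_t in_left = strlen(raw);
    size_t out_left = in_left * 4;      /* assumed upper bound on growth */
    char *out = malloc(out_left + 1);
    if (out == NULL) {
        iconv_close(cd);
        return NULL;
    }
    char *in_ptr = (char *)raw, *out_ptr = out;
    size_t rc = iconv(cd, &in_ptr, &in_left, &out_ptr, &out_left);
    iconv_close(cd);
    if (rc == (size_t)-1) {
        free(out);
        return NULL;
    }
    *out_ptr = '\0';
    return out;
}

/* Returns a malloc'ed UTF-8 display name for `raw`, or NULL if no candidate
 * charset decodes it. The raw bytes themselves are never modified. */
char *display_name_for(struct connection *c, const char *raw)
{
    assert(c != NULL && c->remote_charsets != NULL && raw != NULL);

    /* Try the previously detected charset first. */
    if (c->server_charset[0] != '\0') {
        char *s = decode(raw, c->server_charset);
        if (s != NULL)
            return s;
    }

    /* Otherwise walk the candidate list and remember the first match. */
    const char *p = c->remote_charsets;
    for (;;) {
        p += strspn(p, ", ");           /* skip separators and spaces */
        if (*p == '\0')
            break;
        size_t len = strcspn(p, ", ");
        if (len < MAX_CHARSET) {
            char name[MAX_CHARSET];
            memcpy(name, p, len);
            name[len] = '\0';
            char *s = decode(raw, name);
            if (s != NULL) {
                strcpy(c->server_charset, name);
                return s;
            }
        }
        p += len;
    }
    return NULL;
}
```

Because the raw name is kept as received, a download never needs the reverse conversion from UTF-8, which is where the "UTF-8 always succeeds" ambiguity arises. One caveat of this sketch: a pure-ASCII name matches the first candidate in the list, so the cached charset may be corrected later when a non-ASCII name fails to decode with it.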
If my understanding is correct, current implementation remembers the charset for every request instead of the server, which is not proper I think.
Don't use gftp anymore, use filezilla instead, which is much better......
gftp is not under active development anymore and has not seen code changes for many years. Its codebase has been archived: https://gitlab.gnome.org/Archive/gftp/commits/master The maintainer states that "I would like to hand this project off to someone compotent" on https://www.gftp.org/ Closing this report as WONTFIX as part of Bugzilla Housekeeping to reflect reality. Please feel free to reopen this ticket (or rather transfer the project to GNOME Gitlab, as GNOME Bugzilla is deprecated) if anyone takes the responsibility for active development again.