GNOME Bugzilla – Bug 79071
yelp can't properly display localized man pages.
Last modified: 2010-04-29 22:38:32 UTC
This looks like http://bugzilla.gnome.org/show_bug.cgi?id=47548: yelp shows localized man pages with broken characters. I think most localized man pages are written in a local charset, not UTF-8.
hmm .. ok, I can take a look at this and see if I can solve it. If the problem is in the gnome2-man2html (which I fear) I might not have a clue what to do (that code is hairy). Thanks,
This is related to gnome2-man2html, so reassigning to libgnome.
Created attachment 8648: Korean man page example (ls)
I tested gnome2-man2html with ls.1.gz (the Korean man page sample) like this:
$ zcat /usr/share/man/ko/man1/ls.1.gz | gnome2-man2html > test.html
and rendered the result with Galeon, which shows this: http://tkp.ulsan.ac.kr/~ganadist/broken.png
Then I tested converting to UTF-8 first:
$ zcat /usr/share/man/ko/man1/ls.1.gz | iconv -t utf-8 -f euc-kr | gnome2-man2html > test.html
and rendered that with Galeon: http://tkp.ulsan.ac.kr/~ganadist/broken1.png
It seems gnome2-man2html converts each non-ASCII byte into an escape sequence (like &#236;), but the HTML rendering engine treats each escape sequence as a single character, so multibyte characters come out broken.
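The failure described above can be reproduced in a few lines. This is a minimal sketch in Python, not the C code of gnome2-man2html; `escape_per_byte` is a hypothetical stand-in that assumes the old converter turned every byte with the high bit set into its own numeric character reference.

```python
import html


def escape_per_byte(data: bytes) -> str:
    """Hypothetical model of the old behaviour: each byte >= 0x80
    becomes a separate numeric character reference."""
    out = []
    for b in data:
        if b >= 0x80:
            out.append(f"&#{b};")
        else:
            out.append(chr(b))
    return "".join(out)


korean = "가"                       # a single Hangul syllable
euc_kr = korean.encode("euc_kr")    # two bytes: 0xB0 0xA1
utf8 = korean.encode("utf-8")       # three bytes: 0xEA 0xB0 0x80

escaped_euc_kr = escape_per_byte(euc_kr)
escaped_utf8 = escape_per_byte(utf8)

# An HTML parser resolves each reference to one character on its own,
# so the two EUC-KR bytes become two unrelated Latin-1 symbols.
rendered = html.unescape(escaped_euc_kr)
```

Either way the renderer never sees the original byte sequence, which is why converting the page to UTF-8 beforehand did not help.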
Pretty serious i18n problem, right, sander?
Agreed, this is a serious i18n problem.
Assigning to myself.
Possibly fixed on CVS; you need a new libgnome. Could someone please test this and reopen the bug if it doesn't work?
Hmm, I am really not sure if it is fixed. I changed gnome2-man2html not to escape bytes with the most significant bit set, e.g. it will output a byte 255 instead of "&#255;". With this, if I take your man page and do
$ zcat ls.1.gz | gnome2-man2html > foo.html
and then view foo.html with Galeon, I can ask Galeon to use EUC-KR encoding and it displays fine (with Korean glyphs). However, if I set
$ export LANG=ko_KR.eucKR
and then run
$ gnome-help man:///home/federico/ls.1.gz
(i.e. your original file), it displays the first Roman characters of the man page and stops as soon as it finds the first Korean character. I'm not sure what's going on. What I'm pretty sure about is that gnome2-man2html is not munging characters now, so the bug should not be in it but rather in yelp.
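A rough illustration of the change described in this comment, again as a Python sketch rather than the actual C patch: `escape_markup_only` is a hypothetical helper that escapes only the HTML-significant ASCII characters and copies every byte with the high bit set through untouched.

```python
def escape_markup_only(data: bytes) -> bytes:
    """Escape &, < and > only; bytes >= 0x80 pass through unchanged."""
    replacements = {
        ord("&"): b"&amp;",
        ord("<"): b"&lt;",
        ord(">"): b"&gt;",
    }
    return b"".join(replacements.get(b, bytes([b])) for b in data)


# Sample input invented for this sketch, encoded as EUC-KR.
source = "ls <파일>".encode("euc_kr")
html_bytes = escape_markup_only(source)

# The output is still valid EUC-KR, so a viewer told to use EUC-KR
# can decode it back to the original Korean text.
round_trip = html_bytes.decode("euc_kr")
```

This matches the observation that Galeon displays the page correctly once EUC-KR is selected by hand: the bytes are intact, but nothing in the output says which charset they are in.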
Hmm, gnome2-man2html works properly now, but its output has no encoding information, so the HTML rendering engine (libgtkhtml) shows broken characters. (The Gecko engine has encoding auto-detection, so it works properly.) So we have two options:
1. emit an encoding meta tag from gnome2-man2html
2. add encoding auto-detection to libgtkhtml
This is very borderline 2.0.0, no? Sander, thoughts?
I don't think we can realistically take it as a 2.0.0 bug without knowingly causing a slip with very high probability, or just punting it to 2.0.1 anyway.
Is it terribly difficult to include charset detection code in gtkhtml?
Created attachment 8799: Patch with incomplete fix :(
Is it even realistic in the 2.0.1 timeframe?
The patch I just attached has an incomplete fix. It makes gnome2-man2html output a META tag with language and charset information. However, it appears that what HTML would expect is not the same thing that you would put in your LANG or LC_MESSAGES variables. With this patch yelp makes the man page show up as a bunch of nonsense 8-bit characters, rather than proper multibyte Korean characters. I'm at a loss here.
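One plausible reading of the mismatch mentioned here is that a locale codeset such as `eucKR` is spelled differently from the charset label HTML expects (`EUC-KR`); the thread does not establish that this was the cause. The Python sketch below shows one way to bridge the two. The `HTML_CHARSETS` table and `html_charset_from_locale` are hypothetical and cover only a few illustrative codesets.

```python
from typing import Optional

# Illustrative mapping from normalized locale codesets to charset labels.
# This table is an assumption for the sketch, not taken from any patch.
HTML_CHARSETS = {
    "euckr": "EUC-KR",
    "eucjp": "EUC-JP",
    "utf8": "UTF-8",
    "iso88591": "ISO-8859-1",
}


def html_charset_from_locale(locale_name: str) -> Optional[str]:
    """Extract the codeset from language[_territory][.codeset][@modifier]
    and map it to an HTML charset label, or return None if unknown."""
    _, dot, rest = locale_name.partition(".")
    if not dot:
        return None  # the locale names no codeset at all
    codeset = rest.partition("@")[0]
    key = "".join(ch for ch in codeset.lower() if ch.isalnum())
    return HTML_CHARSETS.get(key)
```

For example, `html_charset_from_locale("ko_KR.eucKR")` yields `"EUC-KR"`, while a bare `"ko_KR"` yields `None` because the locale carries no codeset to map.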
What about always outputting UTF-8 and adding this to the header of the generated HTML? <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
The problem is that to convert to UTF-8 I must first know the character set in which the man page is written. The program has no such information, and it would involve charset autodetection code.
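To make the point concrete: the same bytes are valid text in more than one charset, so the converter cannot pick the right decoding from the bytes alone. A small Python sketch follows; the two-byte sample and the `to_utf8` helper are chosen for illustration only.

```python
raw = b"\xb0\xa1"  # two bytes as they might appear in a man page

as_euc_kr = raw.decode("euc_kr")    # one Hangul syllable
as_latin1 = raw.decode("latin-1")   # two Latin-1 symbols

# The same bytes are not valid UTF-8 at all.
try:
    raw.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False


def to_utf8(data: bytes, source_charset: str) -> bytes:
    """Convert to UTF-8; only correct if source_charset is the real one."""
    return data.decode(source_charset).encode("utf-8")
```

Calling `to_utf8(raw, "latin-1")` succeeds just as readily as `to_utf8(raw, "euc_kr")`, but produces different text, so a wrong guess fails silently.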
*ouch* that's crap. Is this a weakness of man?
Yes. Man pages are an oooold format and they do not contain any information about what language or charset they are written in.
Reassigning back to the default maintainers; this would be nice to do, still, but Sun feels that it is not important for them and so federico has other things to do with his time :)
Is this even fixable at all? Any point to keeping it open?
Just pointing out that the reporter's mail is bouncing.
I have no idea how to fix this. Perhaps someone who knows more about how to figure out which character set a page is written in (is this possible at all?) has a better idea.
How about letting the user change the charset in the preferences?
Created attachment 11314: man2html patch against libgnome 2.0.5
I found that Federico's patch was missing HTTP-EQUIV="Content-Type". After applying this patch, yelp shows the man page properly.
See http://ffii.org/archive/mails/groff/2002/Sep/0187.html; there have been some attempts to put encoding information in man pages.
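The attached patch itself is not reproduced in this thread. As a hedged sketch of the shape of tag being discussed, the Python below prepends a Content-Type META tag to pass-through output; `content_type_meta` and `wrap_page` are hypothetical names, and the charset is supplied by the caller rather than detected.

```python
def content_type_meta(charset: str) -> str:
    """Build a META tag that declares the document's charset."""
    return (
        '<META HTTP-EQUIV="Content-Type" '
        f'CONTENT="text/html; charset={charset}">'
    )


def wrap_page(body: bytes, charset: str) -> bytes:
    """Wrap already-encoded body bytes in a page that names their charset.
    The header is pure ASCII, so it is safe in ASCII-compatible charsets."""
    head = f"<HTML><HEAD>{content_type_meta(charset)}</HEAD><BODY>\n"
    return head.encode("ascii") + body + b"\n</BODY></HTML>\n"


# Sample body invented for this sketch, encoded as EUC-KR.
page = wrap_page("파일".encode("euc_kr"), "EUC-KR")
```

With the declaration in place, a renderer that honours the tag can decode the body as EUC-KR without auto-detection.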
Ooh. Working patch. yay. Can we get this in, Anders?
I'll commit this if nobody tells me not to in three days
I lied. Committed to both branches.