Bug 477788 - reinvents the man wheel, and does it badly
Status: RESOLVED OBSOLETE
Product: yelp
Classification: Applications
Component: Man Pages
Version: unspecified
OS: Other Linux
Priority: Normal  Severity: normal
Target Milestone: ---
Assigned To: Yelp maintainers
QA Contact: Yelp maintainers
Depends on:
Blocks:
 
 
Reported: 2007-09-17 14:46 UTC by Colin Watson
Modified: 2018-05-22 12:45 UTC
See Also:
GNOME target: ---
GNOME version: ---


Attachments
Initial work on troff-based man parser (67.22 KB, patch)
2007-09-28 20:03 UTC, Don Scorgie

Description Colin Watson 2007-09-17 14:46:10 UTC
Hi, I'm the upstream maintainer of man-db (one of the two major implementations of /usr/bin/man et al on Linux) and the Debian maintainer of man-db and groff. Occasionally I get questions about why Yelp renders such-and-such a manual page badly. Rather than using groff or even just man to do the job, Yelp implements a complete manual page parser itself.

This is a fundamental design error. *roff is a full typesetting language and manual pages are fully entitled to use just about every bit of it if they so choose. I'm sure Yelp's parser works to some reasonable extent, but you are doomed to forever having to tack extra bits and pieces onto it every time somebody uses something new (bug 349677 was the case I came across recently). Not using groff (or troff if groff isn't available) is a mistake. I realise you want to have formatting appropriate to your frontend, but there are better ways to do that; pinfo and w3mman both do this by parsing the output (w3mman even manages to implement cross-references to other manual pages!), and as a result they do a much better job than Yelp. Admittedly they're text-based, but the same approach should work just as well in a graphical frontend.

Aside from the details of rendering the pages, Yelp (and librarian, but I have to file the bug somewhere) compounds its errors by reinventing man too. man is not as simple as it looks; I've been maintaining man-db for six years so I know what I'm talking about here. Different systems have different weird and wonderful compression schemes (not all of which you successfully handle). The encoding of manual pages is a nasty swamp that is handled differently on different systems; I guarantee that the current code will break as soon as Debian starts supporting UTF-8 manual pages properly, which is going to happen soon (http://www.chiark.greenend.org.uk/ucgi/~cjwatson/blosxom/2007-09-17-man-db-encodings.html). I'm told that Red Hat has already moved over to UTF-8 manual pages, I think in a somewhat different way, so Yelp's big list of encodings is probably already broken there (bug 473040 confirms my suspicion).

All this would be avoided if you just asked man to render pages for you and postprocessed the output. Yes, I suspect you'd have to do a bit of work to cope with the idiosyncrasies of different man implementations, but this pales in comparison to the horribleness of trying to reinvent the whole stack.
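A minimal sketch of this "ask man, then post-process" approach in Python; it assumes a man(1) binary in PATH, and the helper names (strip_overstrike, render_man_page) are illustrative, not part of any man implementation:

```python
import re
import subprocess


def strip_overstrike(text):
    r"""Remove nroff-style overstriking (c\bc for bold, _\bc for
    underline) from rendered output, leaving plain text."""
    # Each "character, backspace" pair is dropped; the final
    # character of each overstrike sequence survives.
    return re.sub(".\b", "", text)


def render_man_page(name):
    """Ask the system's man(1) to format a page, then post-process
    the rendered text instead of parsing the *roff source ourselves."""
    out = subprocess.run(["man", name], capture_output=True,
                         text=True, check=True).stdout
    return strip_overstrike(out)
```

pinfo and w3mman work along these lines: let man pick the right troff pipeline, compression handling, and encoding, then post-process only the rendered text.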

Thanks for your consideration.
Comment 1 Don Scorgie 2007-09-18 18:54:32 UTC
Hi.

There are a few issues here that I'll go through individually.

First: Rarian.  There is some code in Rarian to deal with man-db.  I never got the chance to finish it (I ran out of time and couldn't find a system on which it would actually work properly).  Now that I know who you are, expect some questions about man-db (which I'll take off-bug-report).

Rarian is used to generate the TOC in Yelp and to find pages within its "database" as requested.  Right now, we manually trawl through all man pages in {$DATADIR,$PREFIX,...}/man/man{1,2,...} and generate names for them from that.  One thing that drew me to man-db was that all the names and descriptions are already available.  The other is that it makes i18n man pages easier.  I did encounter some issues, though: (please correct me if I'm wrong) most distros / systems don't currently seem to use man-db.  E.g. do Solaris, *BSD, Red Hat, etc. use it, or is it a Debian thing?
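The manual trawl described above might look roughly like the following sketch (the directory layout, suffix handling, and function name are illustrative only, not Rarian's actual code):

```python
import os
import re


def scan_man_hierarchy(root):
    """Walk a man hierarchy (root/man1, root/man2, ...) and map each
    page name to its section, the way a manual trawl might.  In this
    toy version, a name found in several sections keeps only the
    last section seen."""
    pages = {}
    for section_dir in sorted(os.listdir(root)):
        m = re.fullmatch(r"man([0-9a-z]+)", section_dir)
        if not m:
            continue
        for fname in os.listdir(os.path.join(root, section_dir)):
            # Strip a compression suffix, then the section suffix:
            # "ls.1.gz" -> "ls.1" -> "ls"
            name = re.sub(r"\.(gz|bz2)$", "", fname)
            name = re.sub(r"\.[0-9a-z]+$", "", name)
            pages[name] = m.group(1)
    return pages
```

The point of man-db's database is that it already holds the names *and* the whatis descriptions, which a filename scan like this cannot recover.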

The second issue is compression.  For now, I'm going to assume you mean .gz, .bz2, etc.  These are the two we support (in addition to uncompressed pages).  As far as I can see, using groff wouldn't solve this problem, as the pages need to be uncompressed before groff can deal with them.
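A sketch of that decompression step, under the assumption that the filename suffix reliably indicates the scheme (which, as the report notes, other systems' "weird and wonderful compression schemes" may violate); read_man_source is a hypothetical helper:

```python
import bz2
import gzip


def read_man_source(path):
    """Return the uncompressed *roff source of a page, picking a
    decompressor from the filename suffix (gzip, bzip2, or none)."""
    if path.endswith(".gz"):
        opener = gzip.open
    elif path.endswith(".bz2"):
        opener = bz2.open
    else:
        opener = open
    with opener(path, "rb") as f:
        return f.read()
```

Going through man instead sidesteps this entirely, since every man implementation already knows its own system's compression conventions.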

The third issue is using groff to generate the html we display for man pages.  In principle, I'm all for this as it makes our lives easier and gives us the i18n stuff for free.  Once we receive the generated html back, we can then post-process it to add our own stuff (if I read the request properly).  For this, I've tried experimenting somewhat with groff output.  Some observations:
1. The HTML output is horrific.  Everything is done using tables to work around the indentation issue.  Not a big problem in itself, it just makes the HTML very unpleasant to work with.
2. HTML (as I'm often reminded) may not be valid XML.  This has bitten me several times while working on Yelp's search.
3. The HTML output still doesn't exactly match man's output.  It's (probably) better than what we do currently, but is still confusing in places.

The above means I wouldn't be comfortable trying to post-process the generated html.

There is another possible option.  Groff has a utf8 backend which does produce some nice formatting.  In addition, all bold text is marked by overstriking, with ^H (backspace) between the characters (or near enough).  This would allow us to distinguish sections nicely.  However, even this has pitfalls (mistaking other bold text for section headings, say, and it would require quite a bit of work for larger man pages).
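Decoding that overstriking could be sketched like this (decode_overstrikes is a hypothetical helper; the conventions are those of grotty's TTY output, where bold is c^Hc and underline is _^Hc):

```python
def decode_overstrikes(line):
    r"""Split one line of grotty-style output into (text, style) runs,
    where style is 'bold' (c\bc), 'underline' (_\bc), or 'plain'."""
    runs = []
    i = 0
    while i < len(line):
        if i + 2 < len(line) and line[i + 1] == "\b":
            ch = line[i + 2]
            style = "underline" if line[i] == "_" else "bold"
            i += 3
            # Skip extra overstrikes of the same character
            # (some output strikes bold characters more than twice).
            while i + 1 < len(line) and line[i] == "\b":
                i += 2
        else:
            ch, style = line[i], "plain"
            i += 1
        # Merge consecutive characters with the same style into one run.
        if runs and runs[-1][1] == style:
            runs[-1] = (runs[-1][0] + ch, style)
        else:
            runs.append((ch, style))
    return runs
```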

The final option is to skip the post-processor entirely and process troff's intermediate output ourselves.  TBH, this option now intrigues me immensely.  It would allow us to generate our own XML to do with as we please (meaning it would be easier to post-process ourselves).  However, this (as with everything) is going to involve a fair chunk of work.  I'd also be happier if there were a library to access troff with.  I am tempted to run an experiment to see how this would work.

Shaun, care to weigh in?
Comment 2 Shaun McCance 2007-09-18 19:23:04 UTC
One thing I'd contemplated in the past was writing our own tmac file that would turn man page markup into something we can actually parse reliably.  So we would have to implement the base macros, but if a man page made its own macro definitions (as some do), those would be resolved by troff.

I could be completely off my rocker here, but it seems feasible.
Comment 3 Colin Watson 2007-09-18 20:11:15 UTC
I'm certainly happy to answer questions about man-db by e-mail.

man-db is used by Debian (and derivatives e.g. Ubuntu) and SuSE to my knowledge. All systems have something that's pretty similar in the most important ways, though; Red Hat's man diverged a long time ago from the same code base and now looks almost entirely different inside but has many of the same options. Other man implementations typically have something like man-db's database because they need it to implement the standard whatis and apropos programs, but yes, if you make use of this then you'll need code specific to the man implementation. Rather than being vulnerable to internal implementation details of database formats, it's probably better to find a way to access it using tools provided by the implementation (e.g. man-db provides accessdb, the other man's database is textual IIRC; or maybe you could even just use something like "apropos ''").
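Using the implementation's own tools might look like the following sketch, which parses apropos-style output; the exact column layout varies between man implementations, so the pattern here is only an approximation:

```python
import re


def parse_apropos_line(line):
    """Parse one line of `apropos ''`-style output, e.g.
    'ls (1)               - list directory contents'.
    Returns (name, section, description), or None if the line
    doesn't match the expected shape."""
    m = re.match(r"\s*(\S+)\s*\(([^)]+)\)\s*-\s*(.*)", line)
    if m:
        return m.group(1), m.group(2), m.group(3).strip()
    return None
```

The advantage over reading the database files directly is that the textual output of apropos/whatis is a de facto interface, while the on-disk database format is an internal implementation detail.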

You're correct that using groff directly won't help with compression. Using man, however, will. Most man implementations provide enough options to be able to get at different kinds of groff output.

I agree that groff's html backend is far from ideal. It has been getting better, but it's still not great. Not the best intermediate format.

The utf8 backend (and its relatives) are what man typically uses, and are what pinfo and w3mman rely on. Conventional manual page output is rigid enough that to be honest you can probably just go for spotting sections by looking for bold text without leading whitespace.
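That heuristic could be sketched as follows (is_section_heading is illustrative; real pages have corner cases, such as indented bold text or output rendered without overstriking at all):

```python
def is_section_heading(line):
    r"""Heuristic from the discussion: a section heading in rendered
    man output is bold text starting in column 0 (no leading
    whitespace).  Bold appears as c\bc overstrikes in grotty-style
    output."""
    if not line or line[0].isspace():
        return False
    # Require at least one overstrike, and require every overstrike
    # on the line to be bold-style (same character repeated).
    return "\b" in line and all(
        line[i] == line[i + 2]
        for i in range(len(line) - 2)
        if line[i + 1] == "\b"
    )
```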

Using ditroff (troff's intermediate output) is a possibility; groff has an internal library for this, though it's not exported anywhere; 'man -Z' will produce this so you don't have to run groff directly. It's even fairly well-documented in groff_out(5). I'm not sure it's much better than just using utf8 output though, since it's basically a physical description and has little semantic/logical content. The things it would buy you (a bit less parsing to figure out where bold and underline fonts go) are probably not worth the significant code increase and difficulty.
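A toy illustration of reading ditroff output, keeping only 't' (text run) and 'C' (special character) commands; this assumes one command per line and maps only a few special-character names, so it is nowhere near a real groff_out(5) parser:

```python
# Map a few ditroff special-character names to text; real output
# uses many more names (see groff_out(5) for the full format).
SPECIALS = {"hy": "-", "oq": "\u2018", "cq": "\u2019"}


def extract_text(ditroff):
    """Pull plain text out of ditroff (`troff -Z` / `man -Z`) output:
    't' commands carry text runs, 'C' commands carry named special
    characters, and 'n'/'p' commands end lines/pages.  All motion,
    font, and device commands are ignored in this toy version."""
    words = []
    for line in ditroff.splitlines():
        if line.startswith("t"):
            words.append(line[1:])
        elif line.startswith("C"):
            words.append(SPECIALS.get(line[1:].strip(), "?"))
        elif line[:1] in ("n", "p"):
            words.append("\n")
    return "".join(words)
```

As the comment notes, the intermediate format is a physical description (positions, fonts, motions), so even a full parser would still have to infer logical structure such as sections and indentation.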

Thanks for your open response to my admittedly somewhat acerbic report. :-)
Comment 4 Don Scorgie 2007-09-28 20:03:36 UTC
Created attachment 96344 [details] [review]
Initial work on troff-based man parser

On and off for a week or so, I've been looking at this.  This is a (very, very) crude first attempt at a troff-based man parser (~12 hours work in total).

A couple of warnings:
It contains a fair amount of voodoo and will probably break other things.
It also spews an insane amount of debug to stdout.
The code quality is pretty low at the moment (I'm just trying things out).

Anyway, it does a fairly decent job of parsing (I think).  There's a bundle missing.  But, it's a start.

Just posting here so people can have a look (if they're that interested).  Worth continuing?
Comment 5 Colin Watson 2008-01-31 20:25:00 UTC
Using troff is definitely an improvement, though I haven't looked into the patch in much detail except to note that it should be passing -mandoc to troff rather than -man.

I would like to make another attempt if possible to persuade you that invoking man rather than troff is the right thing to do, though. Debian/Ubuntu's man-db is now transitioning over to UTF-8 manual pages, as I noted in my original post, and the effect of this is that manual pages in any given hierarchy may be either UTF-8 or the legacy encoding. man-db has special code to deal with this which I'd hate to reimplement in yelp. Calling 'man --recode UTF-8' will give you a UTF-8-encoded version of the source, but it's difficult to detect whether that's available, although we could patch that in locally as I suggested in https://bugs.launchpad.net/ubuntu/+source/yelp/+bug/154829. man knows how to invoke troff (it varies from system to system). I believe that Solaris' man knows about the sgml2roff stuff they do. This is all the sort of thing that ideally yelp shouldn't have to care about.

I think most versions of man support something similar to 'man -Thtml', so you could get basically the same output as you're getting now with hopefully a relatively minor amount of portability goop, and reuse the HTML parsing effort you've been doing.
Comment 6 Rupert Swarbrick 2010-12-14 18:58:57 UTC
There's a newer version of a troff-based parser (using the intermediate format), which I've just posted to Gnome doc devel: http://article.gmane.org/gmane.comp.gnome.documentation.devel/491
Comment 7 Rupert Swarbrick 2011-01-14 23:57:40 UTC
The new man parser has been in the git tree for a couple of weeks now and seems to work pretty well.

Colin: Does this address some/all of your concerns?

There *is* currently a hack for understanding "special characters" (the C lines in the intermediate format). If anyone has any ideas how to do this properly, I'd be thrilled.
Comment 8 GNOME Infrastructure Team 2018-05-22 12:45:21 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to GNOME's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.gnome.org/GNOME/yelp/issues/32.