GNOME Bugzilla – Bug 147659
Hyphenation
Last modified: 2018-05-22 12:06:06 UTC
Pango breaks lines, so it should do hyphenation when requested. And it should do it right, i.e., not simply fill up one line at a time, but look at the whole paragraph. There is an i18n issue here: languages that form compound words by juxtaposition have long words, which makes good hyphenation more important. TeX is 20+ years old in its current form. We should not aim lower than what TeX can do. One extension to what TeX does would be to consider that some words have many valid hyphenation points, but some are more desirable than others. Compound words, for example, are better hyphenated at the compounding boundaries than within components. Don't forget the fun stuff, such as words that change spelling when hyphenated!
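To illustrate the difference between filling one line at a time and looking at the whole paragraph, here is a minimal sketch in C. It is not TeX's actual algorithm (which uses glue, badness and penalties); it assumes fixed word widths and a fixed space width, and it scores a layout by the sum of squared leftover space on every line except the last. The function names and the MAX_WORDS limit are made up for this example.

```c
#include <limits.h>
#include <stddef.h>

#define MAX_WORDS 256   /* arbitrary limit for this sketch */

/* Cost of putting words i..j (inclusive) on one line, or -1 if they do not
   fit. The last line of a paragraph is allowed to be short at no cost. */
static long line_cost(const int *w, size_t i, size_t j,
                      int line_width, int space, int is_last)
{
    long used = (long)space * (long)(j - i);
    for (size_t k = i; k <= j; k++)
        used += w[k];
    if (used > line_width)
        return -1;
    if (is_last)
        return 0;
    long slack = line_width - used;
    return slack * slack;
}

/* Greedy: put as many words as fit on each line. Stores the index of the
   first word of every line in line_starts and returns the line count. */
size_t greedy_breaks(const int *w, size_t n, int line_width, int space,
                     size_t *line_starts)
{
    size_t lines = 0;
    long used = 0;
    for (size_t i = 0; i < n; i++) {
        if (lines == 0 || used + space + w[i] > line_width) {
            line_starts[lines++] = i;   /* word i starts a new line */
            used = w[i];
        } else {
            used += space + w[i];
        }
    }
    return lines;
}

/* Whole-paragraph: dynamic programming over all feasible break positions,
   minimising the total cost. Returns 0 if no feasible layout exists. */
size_t total_fit_breaks(const int *w, size_t n, int line_width, int space,
                        size_t *line_starts)
{
    long best[MAX_WORDS + 1];    /* best[i]: minimal cost of laying out words i..n-1 */
    size_t next[MAX_WORDS + 1];  /* next[i]: first word of the following line */
    if (n == 0 || n > MAX_WORDS)
        return 0;
    best[n] = 0;
    for (size_t i = n; i-- > 0;) {
        best[i] = LONG_MAX;
        next[i] = n;
        for (size_t j = i; j < n; j++) {
            long c = line_cost(w, i, j, line_width, space, j == n - 1);
            if (c < 0)
                break;                      /* more words only make the line wider */
            if (best[j + 1] == LONG_MAX)
                continue;                   /* the rest cannot be laid out */
            if (c + best[j + 1] < best[i]) {
                best[i] = c + best[j + 1];
                next[i] = j + 1;
            }
        }
    }
    if (best[0] == LONG_MAX)
        return 0;
    size_t lines = 0;
    for (size_t i = 0; i < n; i = next[i])
        line_starts[lines++] = i;
    return lines;
}
```

With word widths 3, 2, 2, 5, a line width of 6 and a space width of 1, the greedy version starts lines at words 0, 2 and 3 (leaving one line with a single short word), while the whole-paragraph version starts them at words 0, 1 and 3, which spreads the leftover space more evenly.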
Heh, well, Pango does something that TeX doesn't do ... handle basically all the world's languages. This makes certain things considerably trickier. Hyphenation is definitely on the long-term TODO list for Pango, and in fact Damon Chaplin has made a fairly large start here, see, e.g.: http://mail.gnome.org/archives/gtk-i18n-list/2003-April/msg00052.html
I did some more work on it after that so that isn't the latest code. I keep meaning to update it for the latest Pango and do another release. I think it is pretty much ready now. We just need to sort out an API. But it isn't much use until we do justification as well. My code just uses the TeX algorithm, and the TeX pattern files. Better support for choosing nicer hyphenation points might be nice, but coming up with an algorithm and pattern files for all the languages may be a bit of a challenge.
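For readers unfamiliar with the TeX pattern approach mentioned above, the following is a rough sketch of pattern-based hyphenation in C. It is not the code from the patch: the function names are invented, it handles lowercase ASCII only, it scans a linear pattern list instead of a trie, and the handful of patterns is a small illustrative set in the style of TeX pattern files rather than a usable language file. Digits in a pattern are weights for the gap they sit in; the highest weight wins for each gap, and an odd result marks a permitted break.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

#define MAX_WORD 64   /* arbitrary limit for this sketch */

/* Illustrative patterns only; a real hyphenator loads thousands per language. */
static const char *const demo_patterns[] = {
    "hy3ph", "he2n", "hena4", "hen5at", "1na", "n2at", "1tio", "2io", "o2n"
};

/* Split a pattern such as "hen5at" into its letters ("henat") and the
   weight in front of each letter (weights[3] == 5). Returns the letter count. */
static size_t parse_pattern(const char *pat, char *letters, int *weights)
{
    size_t n = 0;
    weights[0] = 0;
    for (; *pat; pat++) {
        if (isdigit((unsigned char)*pat)) {
            weights[n] = *pat - '0';
        } else {
            letters[n++] = *pat;
            weights[n] = 0;
        }
    }
    letters[n] = '\0';
    return n;
}

/* Writes the permitted break offsets (number of characters before the break)
   into offsets and returns how many were found. left_min and right_min are
   the minimum number of characters kept before and after a break. */
size_t hyphenate(const char *word, size_t left_min, size_t right_min,
                 size_t *offsets)
{
    char dotted[MAX_WORD + 3];
    int value[MAX_WORD + 4] = {0};   /* value[k]: weight of the gap before dotted[k] */
    size_t len = strlen(word), count = 0;

    if (len == 0 || len > MAX_WORD)
        return 0;

    /* Surround the word with '.' so patterns can anchor to word boundaries. */
    dotted[0] = '.';
    memcpy(dotted + 1, word, len);
    dotted[len + 1] = '.';
    dotted[len + 2] = '\0';

    for (size_t p = 0; p < sizeof demo_patterns / sizeof demo_patterns[0]; p++) {
        char letters[16];
        int weights[17];
        size_t plen = parse_pattern(demo_patterns[p], letters, weights);
        for (size_t start = 0; start + plen <= len + 2; start++) {
            if (strncmp(dotted + start, letters, plen) != 0)
                continue;
            for (size_t k = 0; k <= plen; k++)
                if (weights[k] > value[start + k])
                    value[start + k] = weights[k];
        }
    }

    /* An odd weight allows a break; an even weight forbids one. */
    for (size_t off = left_min; off + right_min <= len; off++)
        if (value[off + 1] % 2 == 1)
            offsets[count++] = off;
    return count;
}
```

Calling hyphenate("hyphenation", 2, 3, offsets) with this pattern set yields the offsets 2 and 6, i.e. hy-phen-ation. Choosing "nicer" points, as discussed above, would amount to turning the odd weights into graded penalties instead of a yes/no answer.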
Things that need to be figured out:

- Can/should hyphenation points be made part of PangoLogAttr; is that going to be too slow? Is there enough information; what about alternate text at hyphenation points?

- Does publicly exporting a layout-engine mechanism make sense? We almost certainly need a greedy layout algorithm in Pango for speed and because TeX style algorithms are disconcerting for active editing. Having an optional TeX style algorithm in the Pango distribution as well wouldn't be a problem. But maybe two aren't sufficient: people may want special types of shaped paragraphs, etc. My concern about a publicly exported mechanism is first API stability and second the complexity of the layout process. The number of attributes and layout options keeps on growing and most of them have some effect on the layout implementation.

- How does the interaction between the shape engine and justification work? Insertion of Kashidas for Arabic almost certainly requires engine intervention, so we'd need to add a method to PangoEngineShape (or modify script_shape() so that it could be called again with a different desired width). But there is also the question of how "stretchable/shrinkable" a run is, which needs to be fed up to higher levels. If you want to see the full complexity of what a shape engine could do, read: http://www.microsoft.com/OpenType/OTSpec/jstf.htm. You can do things like break apart a ligature to increase the set width of a line.

Of course, coming up with a solution that handles everything is impossible, so we basically just need to figure out what we can do that improves the current situation and will be extensible in the future.
I don't quite share your fear that TeX's algorithm is too slow. After all, it's done paragraph-by-paragraph, not on the whole text at once. Even on moderately recent hardware, TeX handles text quite fast. My PhD thesis of 207 pages typesets in 1.2 CPU seconds on a 3 GHz Pentium and it features tons of insane math stuff. (And opens 257 files in the process, presumably reading or writing to them.) For interactive use, the current API for just setting the text is a bummer: delete a character in your 100-page document and you have to redo the entire layout process.
Regarding Owen's points:

(1) A boolean 'hyphenation_point' flag in PangoLogAttr isn't enough. In my latest code, I have a struct containing data for each hyphenation point:

    struct _PangoxtHyphenationPoint
    {
      /* The character offset of the hyphenation point in the text, i.e. the
         place to break just before. */
      gint offset;

      /* The text to insert before the break. */
      const gchar *pre_break_text;

      /* The text to insert after the break. */
      const gchar *post_break_text;

      /* The number of characters to remove from the original text, after the
         hyphenation point, if we do break here.
         FIXME: This is in chars when stored in the hyphenation exceptions
         data, but is in bytes when hyphenation points are returned. */
      gint16 chars_to_remove;

      /* The penalty to use if we break here. This is always 50 in the basic
         hyphenator, though more advanced hyphenators may adjust this, so some
         hyphenation points are preferred over others. */
      gint16 penalty;
    };

That can handle any necessary spelling changes and splitting of ligature characters (like 'fi', 'ffl'). Note that TeX tries to do the layout without hyphenation first, then tries again with hyphenation if that isn't good enough. I think calculating hyphenation points all the time may slow it down quite a bit.

(2) Regarding a public layout-engine API - I think high-end DTP apps will want to do their own layout, so either we have an API or they have to use their own widgets and rewrite PangoLayout for themselves. So I think we should probably have an API eventually, but not for quite a while.

(3) Kashidas - I think the shapers simply need to return an additional array containing information about any kashida-insertion points, e.g. the priority of the insertion point, and glyphs that could be inserted plus their widths. (I'm not sure if you are allowed to mix & match kashida glyphs to fill a particular width. I guess that is OK since they are basically horizontal lines.) The layout engine can then do the layout calculations and then any necessary insertions of glyphs, in a similar way to how hyphenation points are handled.

Personally I think we should forget about kashidas and the TeX justification algorithm for now, and just get basic greedy justification working. I have been meaning to have a go at this for a while.
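As an illustration of how the fields of such a hyphenation point could be applied, here is a small sketch. It is not taken from the patch: the HyphenationPoint type below only mirrors the struct above with plain C types, apply_hyphenation_point() is a made-up helper, and offsets and removal counts are assumed to be in bytes of ASCII text. The two sample points use pre-1996 German spellings, where "Zucker" breaks as "Zuk-ker" and "Schiffahrt" as "Schiff-fahrt".

```c
#include <stdio.h>
#include <string.h>

/* Plain-C mirror of the struct discussed above (hypothetical, for
   illustration; all units are bytes here). */
typedef struct {
    int offset;                   /* break just before this byte offset */
    const char *pre_break_text;   /* inserted before the break, e.g. "-" */
    const char *post_break_text;  /* inserted after the break */
    short chars_to_remove;        /* bytes dropped from the text after the offset */
    short penalty;                /* not used by this helper */
} HyphenationPoint;

/* Builds the text that ends the first line (head) and the text that starts
   the next line (tail) if we break at hp. Returns 0 on success, -1 if the
   point is out of range or an output buffer is too small. */
int apply_hyphenation_point(const char *text, const HyphenationPoint *hp,
                            char *head, size_t head_size,
                            char *tail, size_t tail_size)
{
    size_t len = strlen(text);
    if (hp->offset < 0 || hp->chars_to_remove < 0)
        return -1;
    size_t off = (size_t)hp->offset;
    size_t rem = (size_t)hp->chars_to_remove;
    if (off > len || rem > len - off)
        return -1;

    int h = snprintf(head, head_size, "%.*s%s",
                     (int)off, text, hp->pre_break_text);
    int t = snprintf(tail, tail_size, "%s%s",
                     hp->post_break_text, text + off + rem);
    if (h < 0 || (size_t)h >= head_size || t < 0 || (size_t)t >= tail_size)
        return -1;
    return 0;
}
```

For "Zucker" the point { 2, "k-", "", 1, 50 } produces "Zuk-" and "ker"; for "Schiffahrt" the point { 6, "-", "f", 0, 50 } produces "Schiff-" and "fahrt". An ordinary hyphen is simply pre_break_text "-" with nothing removed.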
I've updated my hyphenation code and finished it off. Get it here: http://www.dachaplin.dsl.pipex.com/pangoxt/ It could do with more testing, and there are a few other pattern files we can add, but other than that I think it is ready. It hyphenates English at about 900,000 words/sec, which should be enough! I'll have a go at simple greedy justification now.
What about looking at the patch available at http://www.dachaplin.dsl.pipex.com/pangoxt/ (last updated 16 May 2006)?
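For context, one simple reading of "greedy justification" is: once a line's words have been chosen, spread the leftover width over the gaps between them. The sketch below assumes exactly that; justify_line() is a hypothetical helper, not code from the patch, and it works on plain integer widths rather than Pango glyph runs.

```c
#include <stddef.h>

/* Distributes the leftover width of a line evenly over its inter-word gaps.
   gaps[i] receives the width of the gap after word i (i in 0..n-2); any
   remainder goes to the leftmost gaps, one unit each.
   Returns 0 on success, -1 if there is no gap to stretch or the words do not
   fit in line_width. */
int justify_line(const int *word_widths, size_t n, int line_width,
                 int space_width, int *gaps)
{
    if (n < 2)
        return -1;

    long natural = (long)space_width * (long)(n - 1);
    for (size_t i = 0; i < n; i++)
        natural += word_widths[i];

    long extra = line_width - natural;
    if (extra < 0)
        return -1;

    long share = extra / (long)(n - 1);
    long leftover = extra % (long)(n - 1);
    for (size_t i = 0; i + 1 < n; i++)
        gaps[i] = space_width + (int)share + ((long)i < leftover ? 1 : 0);
    return 0;
}
```

For three words of widths 30, 20 and 40 with a space width of 3 and a line width of 101, the two gaps come out as 6 and 5. The last line of a paragraph would normally be left unjustified.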
Do you mean anything nonobvious, or just making some noise to draw attention to this bug?!
Unfortunately the new patch only supports justification. But after that goes into Pango I'll try to get the hyphenation working with it.
*** Bug 774762 has been marked as a duplicate of this bug. ***
Soft hyphens are something to take into consideration too. E.g. in the control center with French we have:

    Tablette Wa com

instead of:

    Tablette Wacom

because soft hyphens have been introduced in the English version and requested in translations (and they do make sense, but they should have a lower priority).
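As a sketch of the expected behaviour: a soft hyphen (U+00AD, encoded in UTF-8 as the bytes 0xC2 0xAD) should only mark a place where a break is allowed and should not take up any space when no break happens there. The helper below is hypothetical and not Pango code; it strips soft hyphens from a UTF-8 string and records where they were.

```c
#include <stddef.h>

/* Copies src to dst without soft hyphens (U+00AD, UTF-8 bytes 0xC2 0xAD) and
   records, as byte offsets into dst, the positions where a break is allowed.
   dst must have room for strlen(src) + 1 bytes. Returns the number of break
   positions stored (at most max_breaks). */
size_t strip_soft_hyphens(const char *src, char *dst,
                          size_t *breaks, size_t max_breaks)
{
    size_t out = 0, count = 0;
    for (size_t i = 0; src[i] != '\0';) {
        if ((unsigned char)src[i] == 0xC2 && (unsigned char)src[i + 1] == 0xAD) {
            if (count < max_breaks)
                breaks[count++] = out;   /* a break is allowed before dst[out] */
            i += 2;                      /* skip both bytes of U+00AD */
        } else {
            dst[out++] = src[i++];
        }
    }
    dst[out] = '\0';
    return count;
}
```

Given "Wa" followed by a soft hyphen and "com", dst becomes "Wacom" with a single optional break at byte offset 2. A layout engine would draw a visible hyphen only if it actually breaks the line there, and could give such points a worse penalty than ordinary word boundaries.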
… and that’s #580275. Maybe that’s close enough that we can mark it as duplicate?
-- GitLab Migration Automatic Message -- This bug has been migrated to GNOME's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.gnome.org/GNOME/pango/issues/17.