Bug 147659 – Hyphenation



Summary:	Hyphenation


Status:	RESOLVED OBSOLETE

Product:	pango
Classification:	Platform
Component:	general
Version:	unspecified
Hardware:	Other All

Importance:	High enhancement
Target Milestone:	Big feature
Assigned To:	Behdad Esfahbod
QA Contact:	pango-maint

URL:
Whiteboard:

Duplicates:	774762
Depends on:	64538
Blocks:	Persian

Reported:	2004-07-15 16:10 UTC by Morten Welinder
Modified:	2018-05-22 12:06 UTC

See Also:
GNOME target:	---
GNOME version:	---

Description Morten Welinder 2004-07-15 16:10:36 UTC

Pango breaks lines so it should do hyphenation when requested.  And it should
do it right, i.e., not simply fill up one line at a time, but look at the whole
paragraph.  There is an i18n issue here: languages that form compound words by
juxtaposition have long words making good hyphenation more important.
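
The difference between filling one line at a time and looking at the whole
paragraph can be illustrated with a small sketch. It is not Pango or TeX code:
the function names, the fixed word limit and the cost function (squared
leftover space on every line except the last) are assumptions chosen for
illustration, and each word is assumed to fit on a line by itself.

```c
#include <limits.h>
#include <stddef.h>

#define MAX_WORDS 64   /* arbitrary cap for this sketch */

/* Width of words i..j (inclusive) on one line, separated by single spaces. */
static int line_width(const int *w, size_t i, size_t j)
{
    int total = 0;
    for (size_t k = i; k <= j; k++)
        total += w[k];
    return total + (int)(j - i);
}

/* Squared leftover space; the last line of the paragraph costs nothing. */
static long line_cost(const int *w, size_t n, size_t i, size_t j, int max)
{
    long slack = max - line_width(w, i, j);
    return (j == n - 1) ? 0 : slack * slack;
}

/* Greedy: put as many words as fit on each line, one line at a time. */
long greedy_cost(const int *w, size_t n, int max)
{
    long cost = 0;
    for (size_t i = 0; i < n; ) {
        size_t j = i;
        while (j + 1 < n && line_width(w, i, j + 1) <= max)
            j++;
        cost += line_cost(w, n, i, j, max);
        i = j + 1;
    }
    return cost;
}

/* Whole paragraph: dynamic programming over all feasible break positions.
   Returns -1 if the input is too long or some word does not fit. */
long total_fit_cost(const int *w, size_t n, int max)
{
    long best[MAX_WORDS + 1];   /* best[i] = minimum cost of words i..n-1 */
    if (n > MAX_WORDS)
        return -1;
    best[n] = 0;
    for (size_t i = n; i-- > 0; ) {
        best[i] = LONG_MAX;
        for (size_t j = i; j < n && line_width(w, i, j) <= max; j++) {
            if (best[j + 1] == LONG_MAX)
                continue;
            long c = line_cost(w, n, i, j, max) + best[j + 1];
            if (c < best[i])
                best[i] = c;
        }
    }
    return best[0] == LONG_MAX ? -1 : best[0];
}
```

For word widths 3, 2, 2, 5 and a line width of 6, the greedy version leaves a
nearly empty second line, while the whole-paragraph version breaks after the
first word and gets more even lines at a lower total cost.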

TeX is 20+ years old in its current form.  We should not aim lower than what
TeX can do.  One extension to what TeX does would be to consider that some
words have many valid hyphenation points, but some are more desirable than
others.  Compound words, for example, are better hyphenated at the compounding
boundaries than within components.

Don't forget the fun stuff, such as words that change spelling when hyphenated!

Comment 1 Owen Taylor 2004-07-15 17:01:10 UTC

Heh, well, Pango does something that TeX doesn't do ... handle basically
all the world's languages. This makes certain things considerably trickier.

Hyphenation is definitely on the long-term TODO list for Pango, and
in fact Damon Chaplin has made a fairly large start here, see,
e.g.:

 http://mail.gnome.org/archives/gtk-i18n-list/2003-April/msg00052.html

Comment 2 Damon Chaplin 2004-07-16 11:13:03 UTC

I did some more work on it after that so that isn't the latest code.
I keep meaning to update it for the latest Pango and do another release.
I think it is pretty much ready now. We just need to sort out an API.
But it isn't much use until we do justification as well.

My code just uses the TeX algorithm, and the TeX pattern files.
Better support for choosing nicer hyphenation points might be nice, but
coming up with an algorithm and pattern files for all the languages may
be a bit of a challenge.
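
For readers unfamiliar with the TeX approach, here is a minimal sketch of
pattern-based hyphenation in the style TeX uses. It is not the pangoxt code.
The function names and buffer limits are assumptions, the input is assumed to
be lowercase ASCII, and the pattern list is a tiny illustrative set that only
covers one word, where a real hyphenator would load a full pattern file per
language.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

#define MAX_WORD  62   /* arbitrary cap for this sketch */
#define LEFT_MIN   2   /* minimum letters before the first break */
#define RIGHT_MIN  3   /* minimum letters after the last break */

/* Illustrative patterns only; digits give the level between letters. */
static const char *const demo_patterns[] = {
    "hy3ph", "he2n", "hena4", "hen5at", "1na", "n2at", "1tio", "2io", "o2n"
};

/* Split a pattern into its letters and the level before/after each letter.
   Patterns are assumed to have at most 15 letters. */
static size_t parse_pattern(const char *pat, char *letters, int *levels)
{
    size_t n = 0;
    levels[0] = 0;
    for (; *pat; pat++) {
        if (isdigit((unsigned char)*pat)) {
            levels[n] = *pat - '0';
        } else {
            letters[n++] = *pat;
            levels[n] = 0;
        }
    }
    letters[n] = '\0';
    return n;
}

/* Write a copy of `word` with '-' at each permitted break into `out`.
   Returns the number of breaks, or -1 if a buffer is too small. */
int hyphenate(const char *word, char *out, size_t out_size)
{
    size_t n = strlen(word);
    char s[MAX_WORD + 3];
    int levels[MAX_WORD + 4] = {0};   /* levels[i] = level just before s[i] */

    if (n > MAX_WORD || out_size < 2 * n + 1)
        return -1;

    /* Surround the word with '.' so patterns can anchor at word edges. */
    s[0] = '.';
    memcpy(s + 1, word, n);
    s[n + 1] = '.';
    s[n + 2] = '\0';

    for (size_t i = 0; i < sizeof demo_patterns / sizeof demo_patterns[0]; i++) {
        char letters[16];
        int plev[17];
        size_t plen = parse_pattern(demo_patterns[i], letters, plev);
        for (size_t p = 0; p + plen <= n + 2; p++) {
            if (memcmp(s + p, letters, plen) != 0)
                continue;
            /* Where patterns overlap, the highest level wins. */
            for (size_t j = 0; j <= plen; j++)
                if (plev[j] > levels[p + j])
                    levels[p + j] = plev[j];
        }
    }

    size_t o = 0;
    int breaks = 0;
    for (size_t k = 0; k < n; k++) {
        /* An odd level before word[k] permits a break there. */
        if (k >= LEFT_MIN && n - k >= RIGHT_MIN && (levels[k + 1] & 1)) {
            out[o++] = '-';
            breaks++;
        }
        out[o++] = word[k];
    }
    out[o] = '\0';
    return breaks;
}
```

With these patterns, "hyphenation" comes out as "hy-phen-ation". Every
permitted break is treated alike here, which is why choosing nicer points
would need extra information beyond the patterns.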

Comment 3 Owen Taylor 2004-07-16 14:02:49 UTC

Things that need to be figured out:

 - Can/should hyphenation points be made part of PangoLogAttr; is
   that going to be too slow? Is there enough information; what
   about alternate text at hyphenation points? 

 - Does publicly exporting a layout-engine mechanism make sense?
   We almost certainly need a greedy layout algorithm in Pango
   for speed and because TeX style algorithms are disconcerting
   for active editing. Having an optional TeX style algorithm
   in the Pango distribution as well wouldn't be a problem.
   But maybe two aren't sufficient - people may want special
   types of shaped paragraphs, etc.

   My concern about a publicly exported mechanism is first
   API stability and second the complexity of the layout
   process - the number of attributes and layout options
   keeps on growing and most of them have some effect on the
   layout implementation.

 - How does the interaction between the shape engine and   
   justification work ... insertion of Kashidas for Arabic
   almost certainly requires engine intervention, so
   we'd need to add a method to PangoEngineShape. (or 
   modify script_shape() so that it could be called again
   with a different desired width)

   But there is also the question of how "stretchable/shrinkable"
   a run is that needs to be fed up to higher levels.

   If you want to see the full complexity of what a shape
   engine could do, read:

    http://www.microsoft.com/OpenType/OTSpec/jstf.htm

   You can do things like break apart a ligature to increase the
   set width of a line.

Of course, coming up with a solution that handles everything is
impossible, so we basically just need to figure out what we can do
that improves the current situation and will be extensible in the future.

Comment 4 Morten Welinder 2004-07-16 14:21:20 UTC

I don't quite share your fear that TeX's algorithm is too slow.  After all, it's
done paragraph-by-paragraph, not on the whole text at once.  On even
moderately recent hardware, TeX handles text quite fast.  My PhD thesis of
207 pages typesets in 1.2 CPU seconds on a 3 GHz Pentium and it features tons of
insane math stuff.  (And opens 257 files in the process, presumably reading or
writing to them.)

For interactive use the current API for just setting the text is a bummer: delete
a character in your 100 page document and you have to redo the entire layout
process.

Comment 5 Damon Chaplin 2004-07-16 16:59:32 UTC

Regarding Owen's points,

(1) a boolean 'hyphenation_point' flag in PangoLogAttr isn't enough.
In my latest code, I have a struct containing data for each hyphenation point:

struct _PangoxtHyphenationPoint
{
  /* The character offset of the hyphenation point in the text,
     i.e. the place to break just before. */
  gint offset;

  /* The text to insert before the break. */
  const gchar *pre_break_text;

  /* The text to insert after the break. */
  const gchar *post_break_text;

  /* The number of characters to remove from the original text,
     after the hyphenation point, if we do break here.
     FIXME: This is in chars when stored in the hyphenation exceptions data,
     but is in bytes when hyphenation points are returned. */
  gint16 chars_to_remove;

  /* The penalty to use if we break here. This is always 50 in the basic
     hyphenator, though more advanced hyphenators may adjust this, so
     some hyphenation points are preferred over others. */
  gint16 penalty;
};

That can handle any necessary spelling changes and splitting of ligature
characters (like 'fi', 'ffl').
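
As a rough sketch of how such a record could be applied, the following splits
a word at one hyphenation point. It is not the pangoxt implementation:
HyphenationPoint mirrors the struct above with plain C types, apply_break()
is a hypothetical helper, and both offset and chars_to_remove are treated as
byte counts, which only matches character counts for ASCII text.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Plain-C mirror of the struct above (glib types replaced for the sketch). */
typedef struct {
    int offset;                    /* break just before this position */
    const char *pre_break_text;    /* inserted at the end of the first part */
    const char *post_break_text;   /* inserted at the start of the second part */
    short chars_to_remove;         /* removed from the text after the point */
    short penalty;
} HyphenationPoint;

/* Hypothetical helper: build the two halves of `text` broken at `hp`.
   Returns 0 on success, -1 on bad input or if a buffer is too small. */
int apply_break(const char *text, const HyphenationPoint *hp,
                char *first, size_t first_size,
                char *second, size_t second_size)
{
    size_t len = strlen(text);

    if (hp->offset < 0 || hp->chars_to_remove < 0 ||
        (size_t)hp->offset + (size_t)hp->chars_to_remove > len)
        return -1;

    /* First part: text up to the point, then the pre-break text. */
    int a = snprintf(first, first_size, "%.*s%s",
                     hp->offset, text, hp->pre_break_text);
    /* Second part: post-break text, then the rest minus removed chars. */
    int b = snprintf(second, second_size, "%s%s",
                     hp->post_break_text,
                     text + hp->offset + hp->chars_to_remove);

    if (a < 0 || (size_t)a >= first_size ||
        b < 0 || (size_t)b >= second_size)
        return -1;
    return 0;
}
```

Two spelling changes from traditional German orthography serve as examples:
"Zucker" with offset 2, pre-break text "k-" and one character removed gives
"Zuk-" and "ker"; "Schiffahrt" with offset 6, pre-break text "-" and
post-break text "f" gives "Schiff-" and "fahrt".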

Note that TeX tries to do the layout without hyphenation first, then tries
again with hyphenation if that isn't good enough. I think calculating
hyphenation points all the time may slow it down quite a bit.


(2) Regarding a public layout-engine API - I think high-end DTP apps will
want to do their own layout, so either we have an API or they have to use
their own widgets and rewrite PangoLayout for themselves. So I think we
should probably have an API eventually, but not for quite a while.


(3) Kashidas - I think the shapers simply need to return an additional array
containing information about any kashida-insertion points. e.g. the priority of
the insertion point, and glyphs that could be inserted plus their widths.
(I'm not sure if you are allowed to mix & match kashida glyphs to fill a
particular width. I guess that is OK since they are basically horizontal lines)

The layout engine can then do the layout calculations and then any
necessary insertions of glyphs, in a similar way to how hyphenation points
are handled.

Personally I think we should forget about kashidas and the TeX justification
algorithm for now, and just get basic greedy justification working.
I have been meaning to have a go at this for a while.

Comment 6 Damon Chaplin 2004-08-04 14:07:11 UTC

I've updated my hyphenation code and finished it off. Get it here:
   http://www.dachaplin.dsl.pipex.com/pangoxt/

It could do with more testing, and there are a few other pattern files we can
add, but other than that I think it is ready.

It hyphenates English at about 900,000 words/sec, which should be enough!

I'll have a go at simple greedy justification now.

Comment 7 Thierry Vignaud 2006-05-18 22:26:26 UTC

What about looking at the patch available at http://www.dachaplin.dsl.pipex.com/pangoxt/ (last updated 16 May 2006)?

Comment 8 Behdad Esfahbod 2006-05-19 00:57:31 UTC

Do you mean anything nonobvious, or just making some noise to draw attention to this bug?!

Comment 9 Damon Chaplin 2006-05-19 08:57:09 UTC

Unfortunately the new patch only supports justification.

But after that goes into Pango I'll try to get the hyphenation working with it.

Comment 10 Carlos Soriano 2016-11-22 09:59:06 UTC

*** Bug 774762 has been marked as a duplicate of this bug. ***

Comment 11 Alexandre Franke 2017-06-29 19:51:30 UTC

Soft hyphens are something to take into consideration too. E.g. in the control center with French we have:

Tablette Wa
com

instead of:

Tablette
Wacom

because soft hyphens have been introduced in the English version and requested in translations (and they do make sense, but they should have a lower priority).
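
One way to express "lower priority" is to use a soft hyphen only when no
ordinary break fits on the line. The sketch below is an assumption about how
that rule could look, not Pango's behaviour; BreakCandidate and
choose_break() are hypothetical names, and widths are in arbitrary units.

```c
#include <stddef.h>

typedef enum { BREAK_SPACE, BREAK_SOFT_HYPHEN } BreakKind;

typedef struct {
    int width;        /* width of the first line if we break here,
                         including the hyphen glyph for soft hyphens */
    BreakKind kind;
} BreakCandidate;

/* Return the index of the chosen break, or -1 if none fits.
   The widest fitting space break wins; soft hyphens are a fallback. */
int choose_break(const BreakCandidate *c, size_t n, int max_width)
{
    int best_space = -1;
    int best_soft = -1;

    for (size_t i = 0; i < n; i++) {
        if (c[i].width > max_width)
            continue;
        int *best = (c[i].kind == BREAK_SPACE) ? &best_space : &best_soft;
        if (*best < 0 || c[i].width > c[*best].width)
            *best = (int)i;
    }
    return best_space >= 0 ? best_space : best_soft;
}
```

For "Tablette Wacom" with a soft hyphen after "Wa", the candidates are the
space (width 8) and the soft hyphen (width 12, counting the hyphen). With a
line width of 12 both fit, and the space is chosen, which gives "Tablette" /
"Wacom". A strict fallback like this is the simplest policy; a penalty-based
scheme would allow finer trade-offs.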

Comment 12 Alexandre Franke 2017-06-29 20:12:06 UTC

… and that’s #580275. Maybe that’s close enough that we can mark it as duplicate?

Comment 13 GNOME Infrastructure Team 2018-05-22 12:06:06 UTC

-- GitLab Migration Automatic Message --

This bug has been migrated to GNOME's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.gnome.org/GNOME/pango/issues/17.