GNOME Bugzilla – Bug 97545
Make pango_default_break follow Unicode TR #29
Last modified: 2017-08-31 22:50:13 UTC
gtk_text_iter_starts_word, gtk_text_iter_forward_word_end, etc. get confused when dealing with words that have apostrophes: "don't" is recognized as the (non-)word "don" followed by "t". It looks like this is a Pango bug in how words are broken. I'm interested in fixing this myself, but I'm not sure where to begin. Are apostrophes English-specific enough to require an English-specific word-breaking module? Will the potential use of single quotes confuse apostrophe handling? (Also, I feel like I've submitted this before. Sorry if this is a repeat bug, but I can't remember any resolution if there was one. This is an important bug for spell-checking purposes, because I use an internationalized dictionary library and have so far managed not to rely on any language-specific features by letting Pango handle word breaking.)
Created attachment 12007 [details] gtk2 test program exhibiting behavior -- jam on the "forward" and "backward" buttons to see where the words are breaking
Not necessarily a practical answer, but the problem here is that U+0027 is an ambiguous character doing triple duty as:

U+02BC modifier letter apostrophe (what you want)
U+2018 left single quotation mark
U+2019 right single quotation mark

Youʼll note that U+0027 isn't even the right glyph for an apostrophe, though weʼve gotten pretty used to it. You'll find that if you use U+02BC, it wonʼt cause a word break; if you wanted to treat U+0027 as a U+02BC for break determination, I think you'd basically need a dictionary and an English-specific shape engine. I'm not sure of the range of use of the apostrophe. None of the languages I'm familiar with other than English use it much.
Hmm, actually, looking at Unicode TR 29 (http://www.unicode.org/reports/tr29/tr29-1.html), it's just a bug in the Pango break implementation. U+0027 should have the class "MidLetter", though the report does mention that some language-based tailoring may be useful to make, for instance, the French l'objectif two words. [ Not trying to get fancy here after seeing what bugzilla+mozilla did to my last comment ]
TR29 seems to be an update of section 5.15; the current code is based on 5.15 only (TR29 came out a few months after break.c was written). We probably need to do a comprehensive rework of the code in light of TR29, though perhaps this should wait until TR29 leaves draft status.
Retitling to dup various associated bugs here.
*** Bug 61726 has been marked as a duplicate of this bug. ***
*** Bug 97861 has been marked as a duplicate of this bug. ***
*** Bug 63398 has been marked as a duplicate of this bug. ***
> We probably need to do a comprehensive rework of the code in light of
> TR29, though perhaps this should wait until TR29 leaves draft status.

This has happened; it's now known as UAX #29. http://www.unicode.org/reports/tr29/
My primary emotion here is "fear". The new TR does seem to have test data, though, so we could build a test suite based on that, which would help.
Created attachment 19215 [details] [review] new test program using tr29 test data
I'm not sure which of is_word_start or is_word_end corresponds to a word boundary in TR29, so I tentatively made them both count. (Unicode 3.0 5.15 also talks about word boundaries, not start and end, so I'm sure we don't have to add a new field or anything.)
*** Bug 57375 has been marked as a duplicate of this bug. ***
Making them both count looks right when we look at the examples in section 4 of TR #29. Though there may also be boundaries in Unicode that won't be Pango boundaries at all. I think a "word" in Pango is "a word that contains a letter" in TR #29.
*** Bug 138180 has been marked as a duplicate of this bug. ***
*** Bug 144670 has been marked as a duplicate of this bug. ***
Created attachment 30280 [details] [review] Patch against HEAD

Moving discussion from bug 118347 to here. TR29 removes all Format characters (General Category = Cf) before determining word boundaries. One-line patch attached. I agree that a rewrite of break.c is much needed.
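For readers not familiar with break.c: the effect of the patch is roughly the check sketched below (an illustration only, not the actual one-line patch; the helper name is made up):

#include <glib.h>

/* Sketch only: UAX #29 says Format (General Category = Cf) characters
 * are ignored when determining word boundaries, so the word-break pass
 * should treat them as transparent rather than as word separators.
 * This helper name is hypothetical; the real patch adjusts the existing
 * logic inside pango_default_break() instead. */
static gboolean
is_ignorable_for_word_breaks (gunichar ch)
{
  return g_unichar_type (ch) == G_UNICODE_FORMAT;
}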
Should I apply the patch in comment #17?
Owen, now that you are applying patches, would you please apply the minimal patch in comment #17?
Thu Dec 2 15:31:33 2004  Owen Taylor  <otaylor@redhat.com>

	* pango/break.c (pango_default_break): Ignore formatting
	characters when determining word boundaries.
	(Part of #97545, Behdad Esfahbod)
A related bug: Bug 313907: Update break.c to handle new line-breaking types in Unicode 4.1
I'm actually working on this.
Behdad: any news on this bug?
I have a local tree that almost does this. Pretty high priority in the next devel cycle.
Automatic ping service ;) Behdad: any news on this bug? This will be very important if we are going to add spell-checking support to the gtk+ stack (see bug #131576 for why solving this bug is important).
I'm not sure I understood how this patch would work. With this change, "don't" and other strings containing a "'" will be considered single words, is that right?

Paolo, if I am right, it could have negative effects on spell checking for some languages, such as Italian or French. In English "don't" is a single word, but in Italian something like "un'arancia" ("an orange") is two words. AFAIK the dictionaries for these languages do not contain strings formed by an article, an apostrophe and a word; adding them to the dictionaries would mean adding every combination of an article and a word starting with a vowel.
(In reply to comment #26)
> I'm not sure I understood how this patch would work. With this change,
> "don't" and other strings containing a "'" will be considered single
> words, is that right?
>
> Paolo, if I am right, it could have negative effects on spell checking
> for some languages, such as Italian or French. In English "don't" is a
> single word, but in Italian something like "un'arancia" ("an orange")
> is two words. AFAIK the dictionaries for these languages do not contain
> strings formed by an article, an apostrophe and a word; adding them to
> the dictionaries would mean adding every combination of an article and
> a word starting with a vowel.

Now that is a very good reason to write a Latin language engine for Pango. This cycle we've added three lang engines (Arabic, Indic, Thai), and Latin can be next. In a language engine you can tailor all logical attributes based on language.
This blocks a problem labelled "High" priority that was scheduled for GNOME 2.6 and still hasn't happened, although it continues to attract negative commentary. So I'm raising this to High; let's see if it gets any further attention. "Couldn't" is just one word in English, and it shouldn't be impossible to get that right in software as complicated as Pango.
Some may have noticed that I've been working on this since last night. Grapheme boundaries are updated to UAX #29. Working on the rest.

2008-04-24  Behdad Esfahbod  <behdad@gnome.org>

	Part of Bug 97545 – Make pango_default_break follow Unicode TR #29
	Patch from Noah Levitt

	* tests/Makefile.am:
	* tests/runtests.sh.in:
	* tests/testboundaries_ucd.c (count_attrs), (parse_line),
	(attrs_equal), (make_test_string), (do_test), (main):
	Add test driver for UAX#14 and UAX#29 test data from the Unicode
	Character Database.  Just drop the following four files in
	pango/tests for it to use them:

		GraphemeBreakTest.txt
		LineBreakTest.txt
		SentenceBreakTest.txt
		WordBreakTest.txt

2008-04-24  Behdad Esfahbod  <behdad@gnome.org>

	Part of Bug 97545 – Make pango_default_break follow Unicode TR #29

	* pango/break.c (pango_default_break): Make Grapheme Boundary
	code exactly follow UAX#29 of Unicode 5.1.0
Word Boundaries implemented too. Now I need to adjust is_word_start/end to not cross word boundaries.

2008-04-24  Behdad Esfahbod  <behdad@gnome.org>

	Part of Bug 97545 – Make pango_default_break follow Unicode TR #29

	* docs/tmpl/main.sgml:
	* pango/break.c (pango_default_break):
	* pango/pango-break.h:
	* tests/testboundaries_ucd.c (main):
	Add new PangoLogAttr member is_word_boundary that implements
	UAX#29's Word Boundaries semantics.  Test fully passes for it.
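For application code, the new attribute can be queried through the usual log-attrs API. A minimal sketch (error handling omitted; it assumes the is_word_boundary field from the commit above is available):

#include <string.h>
#include <pango/pango.h>

/* Sketch: compute logical attributes for a UTF-8 string and print the
 * character offsets that UAX #29 considers word boundaries. */
static void
print_word_boundaries (const char *text)
{
  glong n_chars = g_utf8_strlen (text, -1);
  PangoLogAttr *attrs = g_new0 (PangoLogAttr, n_chars + 1);

  pango_get_log_attrs (text, strlen (text), -1,
                       pango_language_get_default (),
                       attrs, n_chars + 1);

  for (glong i = 0; i <= n_chars; i++)
    if (attrs[i].is_word_boundary)
      g_print ("word boundary before character %ld\n", i);

  g_free (attrs);
}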
*** Bug 539655 has been marked as a duplicate of this bug. ***
*** Bug 399829 has been marked as a duplicate of this bug. ***
Has there been any progress on this bug recently (i.e. in the last two years)? Fixing it is necessary in order to fix bug 131576. If progress has stopped, can it be re-assigned? Would anyone like to discuss the blocking issue?
*** Bug 633097 has been marked as a duplicate of this bug. ***
Created attachment 242980 [details] [review] Patch: Don't break words between adjacent numbers and letters

If I read UAX #29, version 6.2, correctly, we are not supposed to break "123foo" into two words anymore. This patch fixes that.

I didn't update the WordType enum because I didn't know what to do. Should we have only two types now? Can we do that without breaking anything?
That's probably fine. I think you should update the documentation a bit too. I wouldn't mind if you pushed it to master.
Review of attachment 242980 [details] [review]:

Committed, as approved by Behdad.
Hi, can a developer take care of dependent bug 700103?

Currently pango-break marks word starts/ends incorrectly. For instance, take the string "abc·def", where · is U+00B7.

Word boundaries: "abc·def" is a single word according to word boundaries (marked here with |): "|abc·def|". That's fine, following UAX #29. :)

Word starts: Pango marks 2 word starts: "|abc·|def". That's wrong.

Word ends: Pango marks 2 word ends: "|abc·def|". That's wrong.

Expected word starts: "|abc·def"
Expected word ends: "abc·def|"

This bug is really annoying, because word selection with the mouse, cursor movement and spell-checking fail for Catalan. Thanks for your help.
Hi there, I have made a patch that makes word starts and ends consistent with Unicode UAX #29. The algorithm is pretty simple: if it finds a letter or a number between a pair of word boundaries, then it takes those boundaries as the beginning and end of a word. With this, "shouldn't" and "can't" appear as a single word. It also fixes bug #700103.

Based on some tests, it looks like it works OK, at least as far as English is concerned. However, it might have ruined word breaking for languages that I haven't tested. I also attach a program that breaks a text into words, which can be used for testing purposes.

echo "Irregular forms: \"ain't\", \"don't\", \"won't\", \"shan't\". \"n't\" can only be attached to an auxiliary verb which is itself not contracted." | ./wordbreak
> Irregular <
> forms <
> ain't <
> don't <
> won't <
> shan't <
> n't <
> can <
> only <
> be <
> attached <
> to <
> an <
> auxiliary <
> verb <
> which <
> is <
> itself <
> not <
> contracted <
Created attachment 283412 [details] [review] Patch: calculate word start and end based on word boundaries
Created attachment 283413 [details] Test program to test word starts and ends
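In very rough terms, the idea of the patch in attachment 283412 is the following (a simplified sketch, not the actual patch; it assumes the is_word_boundary attributes have already been computed for the text):

#include <pango/pango.h>

/* Sketch of the approach: walk the segments delimited by word
 * boundaries and mark a segment as a word iff it contains at least one
 * letter or digit, setting is_word_start at its first character and
 * is_word_end just past its last.  The real patch does this inside
 * pango_default_break(). */
static void
mark_words_from_boundaries (const gunichar *chars, glong n_chars,
                            PangoLogAttr   *attrs /* n_chars + 1 entries */)
{
  glong seg_start = 0;

  for (glong i = 1; i <= n_chars; i++)
    {
      if (!attrs[i].is_word_boundary)
        continue;

      gboolean contains_word_char = FALSE;
      for (glong j = seg_start; j < i; j++)
        if (g_unichar_isalpha (chars[j]) || g_unichar_isdigit (chars[j]))
          contains_word_char = TRUE;

      if (contains_word_char)
        {
          attrs[seg_start].is_word_start = TRUE;
          attrs[i].is_word_end = TRUE;
        }

      seg_start = i;
    }
}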
(In reply to comment #42)
> Created an attachment (id=283412) [details] [review]
> Patch: calculate word start and end based on word boundaries

I'm no pango developer, but (sadly) I do have some little experience with the word-breaking rules from the Unicode standard.

In the original code in the repo, the logic is looking for a "word end" to see where to break, which is totally opposite to what Unicode suggests, which is to break at every position "except for when you don't have to break". The rules that specify where there isn't a word break are not many (~13 IIRC), so the logic should try to apply those rules, and if no such rule applies, consider the position a word break. This is what e.g. libunistring or libicu do, and this was actually one of the reasons that kind of forced us in Tracker to switch to a non-Pango-based word breaker.

The Unicode rules will make the word breaking work for any kind of script, including when the text mixes them (e.g. Latin mixed with Katakana). E.g. a raw implementation of the WB algorithm I wrote a while ago:
http://bazaar.launchpad.net/~gnu-pdf-team/gnupdf/trunk/view/head:/src/base/pdf-text-ucd-wordbreak.c#L1017

The new approach in the suggested patch actually tries to follow some of the rules in the standard, but not all of them. It's probably fixing some cases, but likely breaking others (e.g. non-Latin scripts).

The following link shows some of the unit tests we have in Tracker that actually check the word-breaking algorithm used (libunistring- or libicu-based, but with some additional random rules we added...):
https://git.gnome.org/browse/tracker/tree/tests/libtracker-fts/tracker-parser-test.c#n303
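To make the suggested structure concrete, here is a deliberately tiny illustration of the "break everywhere unless a no-break rule matches" control flow. Only two of the real rules are modelled, g_unichar_isalpha() stands in for the ALetter class, and is_mid_letter() lumps together a few MidLetter/MidNumLet characters, so this is not a conforming implementation, just the shape of one:

#include <glib.h>

/* Crude approximation of the MidLetter/MidNumLet classes used below. */
static gboolean
is_mid_letter (gunichar ch)
{
  /* U+0027 APOSTROPHE, U+00B7 MIDDLE DOT, U+2019 RIGHT SINGLE QUOTATION MARK */
  return ch == 0x0027 || ch == 0x00B7 || ch == 0x2019;
}

/* Default answer is "boundary"; the no-break rules veto it. */
static gboolean
word_boundary_before (const gunichar *s, glong len, glong i)
{
  if (i <= 0 || i >= len)
    return TRUE;                                   /* WB1/WB2: sot, eot */

  if (g_unichar_isalpha (s[i - 1]) && g_unichar_isalpha (s[i]))
    return FALSE;                                  /* WB5: letter x letter */

  if (i + 1 < len &&
      g_unichar_isalpha (s[i - 1]) && is_mid_letter (s[i]) &&
      g_unichar_isalpha (s[i + 1]))
    return FALSE;                                  /* WB6 */

  if (i >= 2 &&
      g_unichar_isalpha (s[i - 2]) && is_mid_letter (s[i - 1]) &&
      g_unichar_isalpha (s[i]))
    return FALSE;                                  /* WB7 */

  return TRUE;                                     /* otherwise, break */
}

A conforming implementation handles the full class set (Katakana, Numeric, ExtendNumLet, etc.) the same way, which is why the per-position logic stays simple even when all the rules are present.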
Thanks for the background information, that's useful. I've recently started to add tests to pango; see tests/markup-parse.c and tests/test-layout.c. Having a test in the same style for word boundaries and breakpoints would be awesome to get this moving. There's an older test for boundaries as well; maybe that can serve as a starting point.
(In reply to comment #45)
> Thanks for the background information, that's useful.

Should probably have read the code a bit more before commenting, actually... The current logic in Pango does try to follow some of the WB rules from the Unicode TR, so it's likely that it just needs a review to check why not every rule is being applied properly (I guess). The last suggested patch doesn't help; it actually removes some of the mandatory logic in the WB algorithm.

> I've recently started to add tests to pango; see tests/markup-parse.c and
> tests/test-layout.c. Having a test in the same style for word boundaries
> and breakpoints would be awesome to get this moving. There's an older
> test for boundaries as well; maybe that can serve as a starting point.

That's actually nice to have. I'd suggest using libunistring or libicu outputs to compare with the WB rules in Pango.
(In reply to comment #46)
> The last suggested patch doesn't help; it actually removes some of the
> mandatory logic in the WB algorithm.

The patch I posted in comment #42 sets the word_start and word_end attributes based on the word-boundary computations, which in theory follow the UAX #29 guidelines. The current situation is that these attributes are set according to different rules and are inconsistent with the word boundaries (see comment #40).

UAX #29 suggests the following method for determining what is a word once you have a bunch of word boundaries:

"Proximity tests in searching determines whether, for example, “quick” is within three words of “fox”. That is done with the above boundaries by ignoring any words that do not contain a letter, as in Figure 2. Thus, for proximity, “fox” is within three words of “quick”. This same technique can be used for “get next/previous word” commands or keyboard arrow keys. Letters are not the only characters that can be used to determine the “significant” words; different implementations may include other types of characters such as digits or perform other analysis of the characters."

This is roughly what my patch does. If there is a letter or a digit within a pair of word boundaries, it is a word and we set the word_start and word_end attributes; otherwise it isn't. What other approach do you suggest for setting these attributes?
Created attachment 288560 [details] [review] Draft patch from 2008

Attaching my patch from 2008. Not sure if it's of any use. I don't think I'll get to finish it any time soon.
*** Bug 751125 has been marked as a duplicate of this bug. ***
We have worked on this bug and got to the point where we found that it is due to the implementation of the pango_default_break function in break.c, which does not handle the single quote and treats it as a word separator. Could you shed some light on this problem?
I think this is for the most part done now, in other bugs. Closing, as keeping it open doesn't help anything.