Bug 694534 – change to unichar_index() fixes assertion failure with U+1D173

After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.

Bug 694534 - change to unichar_index() fixes assertion failure with U+1D173


Summary:	change to unichar_index() fixes assertion failure with U+1D173


Status:	RESOLVED DUPLICATE of bug 668154

Product:	pango
Classification:	Platform
Component:	win32
Version:	unspecified
Hardware:	Other Windows

Importance:	Normal normal
Target Milestone:	---
Assigned To:	gtk-win32 maintainers
QA Contact:	pango-maint

URL:
Whiteboard:

Depends on:
Blocks:

Reported:	2013-02-23 15:32 UTC by Matthew Flatt
Modified:	2013-02-24 23:43 UTC

See Also:
GNOME target:	---
GNOME version:	---

Description Matthew Flatt 2013-02-23 15:32:04 UTC

The punchline is that I think unichar_index() in "basic-win32.c" should
increment its counter on a low surrogate, and not on a high surrogate:

  for (i = 0; i < ix; i++)
    /* Ignore the high surrogate */
    if (!(wtext[i] >= 0xD800 && wtext[i] < 0xDBFF))
      index++;

This change should only matter when the supplied index
`ix' is in the middle of a surrogate pair. It makes the
function return the index of the character represented
by the surrogate pair instead of the character after.

The change is specifically intended to address a crash when using
the character U+1D173. We ran into the problem in the context of
Racket, but pasting a U+1D173 into GIMP crashes with the same
assertion failure: at g_assert (glyphs->log_clusters[glyphix] < n_chars)
in convert_log_clusters_to_byte_offsets().

The problem is that U+1D173 is complex in the sense of
ScriptIsComplex(), and so uniscribe_shape() is used,
which calls set_up_pango_log_clusters(). The `itemlen'
for that call is 2, since the character is represented by
two UTF-16 code units (i.e., a surrogate pair), and so
the log-cluster array has two elements. Here's the weird
part: each half of the surrogate has been mapped to itself
by the ScriptShape() function, instead of both mapping to
the beginning of the surrogate pair; that does not
make sense to me, and I thin kit may be a problem with
ScriptShape(). In any case, when unichar_index() is called
with offset 1 for the second of the pair, it reports a 1 back,
which would mean a second character --- but there is only
one character overall. 

Assuming that the middle-of-surrogate problem isn't
due to a misuse of ScriptShape() somehow, then having
unichar_index() effectively be more defensive seems like
a good general approach.


Related: I think the third argument to the
call to ScriptItemize() in itemize_shape_and_place()
should be G_N_ELEMENTS (items)-1 --- one less
than the size of `items'.

Comment 1 Matthew Flatt 2013-02-24 00:00:29 UTC

The "< 0xDBFF" in the suggested fix for unichar_index()
should be "<= 0xDBFF".

Comment 2 Daniel Atallah 2013-02-24 15:15:45 UTC

This is a duplicate of Bug 668154.

The patch here is a somewhat different approach than the patch attached to that ticket, although I think that both will work.

This ends up allocating less memory, but I'm not familiar enough with the intent of the current code to know which is a better solution.

Comment 3 Matthew Flatt 2013-02-24 15:23:38 UTC

I'm not certain, but I don't think the patch with Bug 668154 is right. I'll comment there.

Comment 4 Behdad Esfahbod 2013-02-24 23:43:01 UTC

Ok, lets mark dup and discuss in the other bug.

*** This bug has been marked as a duplicate of bug 668154 ***