After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.
Bug 320666 - Avoid a bsearch() for gunichar->script and paired characters
Avoid a bsearch() for gunichar->script and paired characters
Status: RESOLVED FIXED
Product: pango
Classification: Platform
Component: general
1.10.x
Other All
: Normal normal
: ---
Assigned To: pango-maint
pango-maint
Depends on:
Blocks:
 
 
Reported: 2005-11-04 01:31 UTC by Federico Mena Quintero
Modified: 2005-11-14 23:34 UTC
See Also:
GNOME target: ---
GNOME version: 2.11/2.12


Attachments
fmq-easy-script-and-paired-chars-tables.diff (210.30 KB, patch)
2005-11-04 01:32 UTC, Federico Mena Quintero
none Details | Review
my version of fixing the issue (304.63 KB, patch)
2005-11-04 02:12 UTC, Behdad Esfahbod
none Details | Review
my version of federicos patch (221.42 KB, patch)
2005-11-05 05:24 UTC, Matthias Clasen
none Details | Review
small improvements of my version of federicos patch (17.87 KB, patch)
2005-11-07 16:35 UTC, Matthias Clasen
none Details | Review
pango-script-table.png (25.75 KB, image/png)
2005-11-08 03:36 UTC, Federico Mena Quintero
  Details
my patch (419.34 KB, patch)
2005-11-10 00:04 UTC, Behdad Esfahbod
none Details | Review
correct patch (418.07 KB, patch)
2005-11-10 00:53 UTC, Behdad Esfahbod
committed Details | Review

Description Federico Mena Quintero 2005-11-04 01:31:40 UTC
Pango uses a table to map a gunichar to a script (pango_script_table).  It also
uses another table to map a character to its paired char (paired_chars).

We can use a static table with 8192 entries for so many of the first code points
in Unicode.  With this table we can know both the script for each character
below U+2000, and also whether it has a paired character.

This change is described in detail here:

http://primates.ximian.com/~federico/news-2005-10.html#gtkfilechooser-profile-6

With such a change, the time spent in pango_script_iter_new() goes from 15% down
to 2.9%.

Also, this patch replaces one use of g_object_set/get_data() by the
corresponding GQuark functions.  This was also showing up high in the profile.
Comment 1 Federico Mena Quintero 2005-11-04 01:32:49 UTC
Created attachment 54300 [details] [review]
fmq-easy-script-and-paired-chars-tables.diff
Comment 2 Behdad Esfahbod 2005-11-04 02:12:14 UTC
Created attachment 54301 [details] [review]
my version of fixing the issue

I'm attaching a new pango-script-table.h, generated using packtab.c.  This
takes 10kb memory and handles all script lookups using three array lookups. 
This should be applied no matter we want to use federico's patch.  I would like
to apply this, and the quark part of federico's.  Will help to see some profile
results.

Federico, Matthias, review please.
Comment 3 Matthias Clasen 2005-11-04 21:44:38 UTC
Looks fine to me. 
Don't forget to add the gen-scripts-tab sources to the tools directory.
Comment 4 Matthias Clasen 2005-11-05 05:23:22 UTC
I had some ideas how to improve federico's approach further:

- make pango_script_for_unichar a simple wrapper around
  pango_script_for_unichar_with_last_index()

- remove ranges below 0x2000 from pango-script-table.h

- make better use of the information contained in last_index
  by using failed tests to reduce the range for binary search

- use the fact that all paired characters are Common to avoid
  more get_pair_index() calls

Note that I haven't timed this, and not verified that it works
beyond compiling it.
Comment 5 Matthias Clasen 2005-11-05 05:24:18 UTC
Created attachment 54346 [details] [review]
my version of federicos patch
Comment 6 Behdad Esfahbod 2005-11-05 07:49:24 UTC
Thanks Matthias.  Federico and I talked about the first two items.  I like 
your other two points.

My only objection is with this line:

+
+  *last_index = -1;
   return PANGO_SCRIPT_COMMON;
 }

I think leaving last_index to whatever it is the way to go.  This way common 
characters do not trash the cache.
Comment 7 Matthias Clasen 2005-11-05 12:01:12 UTC
oh, right.
Comment 8 Matthias Clasen 2005-11-05 14:12:57 UTC
It occurred to me that last_index is really just the initial value for the
midpoint of the bsearch, and there is no need to have the -1 special case. 
The resulting function looks nicer, imo. Of course, places where we used
-1 for last_index must be changed to use PANGO_SCRIPT_TABLE_MIDPOINT,
then.

#define PANGO_SCRIPT_TABLE_MIDPOINT (G_N_ELEMENTS (pango_script_table) / 2)

static PangoScript
pango_script_for_unichar_with_last_index (gunichar ch, int *last_index)
{
  int lower = 0;
  int upper = G_N_ELEMENTS (pango_script_table) - 1;
  int mid = *last_index;

  if (ch < PANGO_EASY_SCRIPTS_RANGE)
    return pango_easy_scripts_table[ch] & 0x7f;
  
  while (lower <= upper)
    {
      if (ch < pango_script_table[mid].start)
	upper = mid - 1;
      else if (ch >= pango_script_table[mid].start + pango_script_table[mid].chars)
	lower = mid + 1;
      else
	{
	  *last_index = mid;
	  return pango_script_table[mid].script;
	}
      mid = (lower + upper) / 2;
    }

  return PANGO_SCRIPT_COMMON;
}
Comment 9 Matthias Clasen 2005-11-06 06:57:55 UTC
The comparing this version against federicos yields almost identical results,
with minimal differences.
Comment 10 Matthias Clasen 2005-11-07 16:35:27 UTC
Created attachment 54431 [details] [review]
small improvements of my version of federicos patch
Comment 11 Federico Mena Quintero 2005-11-08 03:36:52 UTC
Created attachment 54451 [details]
pango-script-table.png

This shows the timings between Matthias's modified version of my patch and
Behdad's table.  Looks like Behdad buys the beer this time :)

I'll resurrect my gen-easy-scripts program, and apply my patch with Matthias's
modifications.
Comment 12 Federico Mena Quintero 2005-11-08 04:38:09 UTC
Committed to HEAD.

2005-11-07  Federico Mena Quintero  <federico@ximian.com>

	Fixes bug #320666:

	Instead of doing a bsearch() for every gunichar to map it to a
	PangoScript, use a precomputed table for the first 8192 code
	points.  Also, remember the last script that we computed on each
	invocation; this will also help CJK and the other scripts above
	U+2000.

	This table also holds information on whether the characters in it
	are paired characters.  We can use this to avoid doing the
	expensive get_pair_index() call most of the time.

	Many thanks to Matthias Clasen for his suggestions for this patch.

	* tools/gen-easy-scripts-table.c: New program to generate
	pango_easy_scripts_table.

	* tools/Makefile.am: Build gen-easy-scripts-table.

	* pango/pango-easy-scripts-table.h: New file with a mapping of the
	first 8192 Unicode characters to their corresponding scripts.  The
	table also says whether each character has a paired char or not.

	* pango/Makefile.am (libpango_1_0_la_SOURCES): Add pango-easy-scripts-table.h.

	* pango/pango-script-table.h: Remove everything below U+2000, and
	add a note to that effect.

	* pango/pango-script.c (pango_script_for_unichar_with_last_index):
	New function.  This is the old pango_script_for_unichar(), but it
	lets the caller keep around the computed index in
	pango_script_table.  This works under the assumption that a
	character is likely to be in the same script block as the
	preceding character in a string.
	(pango_script_for_unichar): First, do a quick check against the
	pango_easy_scripts_table.  Then, do the expensive check with
	pango_script_for_unichar_with_last_index().
	(pango_script_iter_next): If the character is within the easy
	script range, find out if it is a paired character by using
	PANGO_PAIRED_CHAR_FLAG.
	(struct _PangoScriptIter): Add a last_index_for_script_lookup
	field.  We use this to maintain the last-lookup index from
	pango_script_for_unichar_with_last_index().
	(pango_script_iter_next): If the character is not within the easy
	script range, use pango_script_for_unichar_with_last_index(), and
	store the index in the last_index_for_script_lookup field of the
	PangoScriptIter.
Comment 13 Behdad Esfahbod 2005-11-09 23:59:51 UTC
I suggest removing the pair bit from easy table.  Matthias's improvement for not
bsearching the pair table for nonCOMMON characters should be good enough to make
that almost useless.

Attaching a patch that removes the gen_easy_scripts_table.c and modifies
gen_script_table.pl to generate easy table too.
Comment 14 Behdad Esfahbod 2005-11-10 00:04:00 UTC
Created attachment 54550 [details] [review]
my patch

Review?
Comment 15 Behdad Esfahbod 2005-11-10 00:53:46 UTC
Created attachment 54551 [details] [review]
correct patch

Oops.  My generator was wrong.	Please review the generator to make sure it's
generating the correct output.
Comment 16 Matthias Clasen 2005-11-10 14:38:21 UTC
Reopening
Comment 17 Matthias Clasen 2005-11-11 19:13:53 UTC
Behdad, why is your patch removing the last_index handling ?
Comment 18 Behdad Esfahbod 2005-11-11 19:31:04 UTC
Instead of having a per-iter last_index, I just made int mid static which should
have the same effect.
Comment 19 Federico Mena Quintero 2005-11-11 22:36:00 UTC
Did you take timings of these changes?

I'm especially interested in the removal of the paired chars flag.
Comment 20 Behdad Esfahbod 2005-11-12 03:23:11 UTC
Unfortunately my system is really too unreliable to run benchmarks, due to bug
318168.  Can you maybe run that?
Comment 21 Behdad Esfahbod 2005-11-14 23:34:20 UTC
In HEAD now.  The main reason I had to apply this last patch was that Federico's
gen-easy-scripts-table.c was using the generated script table to create the easy
table.  As for removing last_index and making "int mid" static, since we are not
aiming for reentrancy, the per ScriptIter last_index is not justified.