GNOME Bugzilla – Bug 420850
The format of the .tags file should be changed to reduce the size of translated files
Last modified: 2019-03-23 20:34:07 UTC
gedit ships a 6.5MB HTML.tags. That's just absurd.

1) There is no need to repeat the gedit: namespace prefix a gazillion times. Just set a default namespace.
2) The file does a reverse mapping from Unicode characters to ISO character entities, written down in verbose XML. There _has_ to be a better way to do this.
$ sed -i 's^<gedit:^<^g' HTML.tags.2
$ sed -i 's^</gedit:^</^g' HTML.tags.2
$ ll HTML.tags*
-rw-r--r-- 1 sf 6.2M 2007-03-21 11:56 HTML.tags
-rw-r--r-- 1 sf 4.9M 2007-03-21 11:57 HTML.tags.2

Indeed, there is room for improvement in there... Ah, those old crappy plugins...
Created attachment 85034 [details] [review]
[PATCH] Drastic diet for the data files of the Tag List plugin

Set the gedit XML namespace in the XML tag files as the default namespace, and rework indentation. This doesn't change anything for the (libxml-based) Tag List plugin, but lowers the total size of the translated XML files from about 9.2M to about 6.7M (i.e. ~27%).

---
 ChangeLog                         |   12 +
 plugins/taglist/HTML.tags.xml.in  | 5289 ++++++++++++++++++-------------------
 plugins/taglist/Latex.tags.xml.in |  689 +++---
 plugins/taglist/XSLT.tags.xml.in  |  669 +++---
 plugins/taglist/XUL.tags.xml.in   | 1075 ++++----
 5 files changed, 3865 insertions(+), 3869 deletions(-)
Here is what ls says about the files:

        old     new
HTML    6.2M    4.5M
Latex   844K    617K
XSLT    860K    608K
XUL     1.3M    975K
total   9.2M    6.7M

About 2), I don't think it has much influence, since those chars are not translated (so it's negligible compared to the large amount of data the translated entries represent). But for the translated entries, the whole entry is duplicated for each language. Maybe that's a bit much, since only the name attribute changes. Otherwise maybe it should use gettext...

Also, what about storing gzipped tag files? From my test, it makes the HTML.tags file go down to 466K, due to its high redundancy.
Created attachment 85036 [details] [review]
[PATCH] Compress the tag files to reduce their size

The tags files of the Tag List plugin are now gzipped. This reduces the total size of the tags files from 6.7M to 560K (~92%), thanks to their high redundancy.

---
 ChangeLog                                     |  9 +++++++++
 plugins/taglist/Makefile.am                   | 11 ++++++++---
 plugins/taglist/gedit-taglist-plugin-parser.c |  3 ++-
 3 files changed, 19 insertions(+), 4 deletions(-)
Created attachment 85041 [details] [review]
[PATCH] Compress the tag files to reduce their size

The tags files of the Tag List plugin are now gzipped. This reduces the total size of the tags files from 6.7M to 560K (~92%), thanks to their high redundancy.

---
 ChangeLog                                     | 11 +++++++++++
 configure.ac                                  |  1 +
 plugins/taglist/Makefile.am                   |  9 ++++++---
 plugins/taglist/gedit-taglist-plugin-parser.c |  3 ++-
 4 files changed, 20 insertions(+), 4 deletions(-)
This problem has been fixed in our software repository. The fix will go into the next software release. Thank you for your bug report.
I was already aware of this problem. The applied patch is only a "patch"; the real solution consists in changing the format of the .tags file (and then compressing it too). As mclasen suggested, we can also remove the gedit namespace.

Note also that the applied patch is broken:

> - if (strncmp (e->d_name + strlen (e->d_name) - 5, ".tags", 5) == 0)
> + if (strncmp (e->d_name + strlen (e->d_name) - 5, ".tags", 5) == 0 ||
> +     strncmp (e->d_name + strlen (e->d_name) - 8, ".tags.gz", 5) == 0)

The length argument of the second strncmp (5) is wrong. It should be:

strncmp (e->d_name + strlen (e->d_name) - 8, ".tags.gz", 8) == 0

Are we sure libxml is always able to read .gz files, or do we need to check that it is compiled with some specific option?

While we are at it, maybe we can use "-9" to get better compression:

HTML.tags uncompressed            -> 6478973 bytes
HTML.tags.gz with default options ->  643734 bytes
HTML.tags.gz with -9 option       ->  629480 bytes

I'm wondering whether reading a smaller file now gives us some performance gain.

Reopening and changing the summary.
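For illustration only, the suffix test discussed above can be written as a small helper that takes the comparison length from the suffix itself, so the string and the length cannot drift apart as they did with ".tags.gz" and 5. The names has_suffix and is_tags_file are hypothetical and are not taken from the gedit sources. The sketch also guards against names shorter than the suffix (for example the "." and ".." entries returned by readdir), where the pointer arithmetic in the quoted check would point before the start of the string.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of the gedit sources:
 * returns 1 if `name` ends with `suffix`, 0 otherwise. */
static int
has_suffix (const char *name, const char *suffix)
{
	size_t name_len, suffix_len;

	assert (name != NULL && suffix != NULL);

	name_len = strlen (name);
	suffix_len = strlen (suffix);

	/* A name shorter than the suffix cannot match; checking this first
	 * avoids computing a pointer before the start of `name`. */
	if (name_len < suffix_len)
		return 0;

	/* The length compared is always the full length of the suffix. */
	return memcmp (name + name_len - suffix_len, suffix, suffix_len) == 0;
}

/* Hypothetical wrapper for the two extensions mentioned in this bug. */
static int
is_tags_file (const char *name)
{
	return has_suffix (name, ".tags") || has_suffix (name, ".tags.gz");
}
```

GLib's g_str_has_suffix() performs the same kind of check and could be used in the plugin instead of a hand-written helper.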
I have fixed the bug reported in comment #7.

I have also added "--best -f" arguments to gzip. The first one is needed to obtain a better compression. The second one is needed to overwrite the existing .tags.gz file in the case the corresponding .tags.in file is modified.

Without --best:

paolo@elilix:/gnome/gnome-218/svn/gedit/plugins/taglist$ du -h -c *.gz
468K    HTML.tags.gz
64K     Latex.tags.gz
16K     XSLT.tags.gz
12K     XUL.tags.gz
560K    total

paolo@elilix:/gnome/gnome-218/svn/gedit/plugins/taglist$ gzip -l *.gz
compressed  uncompressed  ratio  uncompressed_name
    472010       4697969  90.0%  HTML.tags
     58365        631386  90.8%  Latex.tags
     15917        621801  97.4%  XSLT.tags
     11835        998019  98.8%  XUL.tags
    558127       6949175  92.0%  (totals)

With --best:

paolo@elilix:/gnome/gnome-218/svn/gedit/plugins/taglist$ du -h -c *.gz
452K    HTML.tags.gz
60K     Latex.tags.gz
16K     XSLT.tags.gz
12K     XUL.tags.gz
540K    total

paolo@elilix:/gnome/gnome-218/svn/gedit/plugins/taglist$ gzip -l *.gz
compressed  uncompressed  ratio  uncompressed_name
    458544       4697969  90.2%  HTML.tags
     55283        631386  91.2%  Latex.tags
     15372        621801  97.5%  XSLT.tags
      9796        998019  99.0%  XUL.tags
    538995       6949175  92.2%  (totals)

-----------------

I have a question: how can we manage the case in which gzip is not installed on the machine of the user compiling gedit?
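As a rough cross-check of the totals in the listings above, the space saved can be recomputed as 1 - compressed/uncompressed. The sketch below only approximates what gzip -l reports (gzip may account for header bytes slightly differently), and the function name saved_percent is hypothetical.

```c
#include <assert.h>

/* Hypothetical helper: percentage of space saved by compression,
 * computed as 100 * (1 - compressed / uncompressed). */
static double
saved_percent (unsigned long compressed, unsigned long uncompressed)
{
	assert (uncompressed > 0);

	return 100.0 * (1.0 - (double) compressed / (double) uncompressed);
}
```

With the totals listed above, saved_percent (558127, 6949175) comes out at about 92.0 and saved_percent (538995, 6949175) at about 92.2, so --best saves a further 19132 bytes across the four files.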
-----------------

Committed patch:

Index: plugins/taglist/gedit-taglist-plugin-parser.c
===================================================================
--- plugins/taglist/gedit-taglist-plugin-parser.c (revision 5582)
+++ plugins/taglist/gedit-taglist-plugin-parser.c (working copy)
@@ -579,7 +579,7 @@ parse_taglist_dir (const gchar *dir)
     while ((e = readdir (d)) != NULL)
     {
         if (strncmp (e->d_name + strlen (e->d_name) - 5, ".tags", 5) == 0 ||
-            strncmp (e->d_name + strlen (e->d_name) - 8, ".tags.gz", 5) == 0)
+            strncmp (e->d_name + strlen (e->d_name) - 8, ".tags.gz", 8) == 0)
         {
             gchar *tags_file = g_strconcat (dir, e->d_name, NULL);
             parse_taglist_file (tags_file);
Index: plugins/taglist/Makefile.am
===================================================================
--- plugins/taglist/Makefile.am (revision 5582)
+++ plugins/taglist/Makefile.am (working copy)
@@ -41,7 +41,7 @@ plugin_in_files = taglist.gedit-plugin.d
 %.tags.gz: %.tags.xml.in $(INTLTOOL_MERGE) $(wildcard $(top_srcdir)/po/*.po)
     LC_ALL=C $(INTLTOOL_MERGE) $(top_srcdir)/po $< $(@:.gz=) -x -u -c $(top_builddir)/po/.intltool-merge-cache
-    $(GZIP) $(@:.gz=)
+    $(GZIP) --best -f $(@:.gz=)

 plugin_DATA = $(plugin_in_files:.gedit-plugin.desktop.in=.gedit-plugin)
It seems the only thing still open here is the question about gzip not being installed. If so, I don't think we have had any problems with that, so what about closing this?
$:andre\> pwd
/opt/git-gnome/gedit-plugins/plugins/taglist
$:andre\> ls -l
total 152
-rw-rw-r--. 1 andre andre 51539 Jul 30 19:17 HTML.tags.xml.in
-rw-rw-r--. 1 andre andre  6576 Jul 30 19:17 Latex.tags.xml.in
-rw-rw-r--. 1 andre andre   302 Jul 30 19:17 taglist.plugin.desktop.in.in
-rw-rw-r--. 1 andre andre  7603 Jul 30 19:17 XSLT.tags.xml.in
-rw-rw-r--. 1 andre andre 12417 Jul 30 19:17 XUL.tags.xml.in

Looks acceptable to me, can we close this as FIXED / OBSOLETE?
Yeah, let's close it. Please reopen if there is an issue left regarding the tags.