Bug 616403 – Improve & fix reading msoffice/powerpoint files




Summary:	Improve & fix reading msoffice/powerpoint files


Status:	RESOLVED FIXED

Product:	tracker
Classification:	Core
Component:	Extractor
Version:	0.9.x
Hardware:	Other Linux

Importance:	Normal normal
Target Milestone:	---
Assigned To:	tracker-extractor
QA Contact:	Jamie McCracken

URL:
Whiteboard:

Depends on:
Blocks:

Reported:	2010-04-21 14:24 UTC by Aleksander Morgado
Modified:	2010-04-21 15:13 UTC

See Also:
GNOME target:	---
GNOME version:	---

Attachments
Fixes all reported issues (18.62 KB, patch) 2010-04-21 14:30 UTC, Aleksander Morgado

Description Aleksander Morgado 2010-04-21 14:24:42 UTC

Following the changes described in bug #615765, I found some other things to fix for msoffice/powerpoint (ppt) files as well.

As background, PPT files internally store strings in two types of records: "CharsAtom"s and "BytesAtom"s:
 * If the record is a "CharsAtom", the string comes in pure UTF-16. (http://msdn.microsoft.com/en-us/library/dd772921.aspx)
 * If the record is a "BytesAtom", the string contains only the low byte of each UTF-16 code unit, and the high byte is to be taken as 0x00.
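
The difference between the two record types can be sketched in plain C as follows. This is only an illustration, not the code from the patch: the names PptTextKind, put_utf8 and ppt_text_to_utf8 are hypothetical, the UTF-16 data is assumed to be little-endian, and the conversion is hand-rolled here only so that the sketch has no dependencies.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical tag for the two string record kinds. */
typedef enum { PPT_TEXT_CHARS, PPT_TEXT_BYTES } PptTextKind;

/* Write one code point as UTF-8; returns the number of bytes written. */
static size_t put_utf8(uint32_t cp, char *out)
{
    if (cp < 0x80) {
        out[0] = (char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (char)(0xC0 | (cp >> 6));
        out[1] = (char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (char)(0xE0 | (cp >> 12));
        out[1] = (char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (char)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (char)(0xF0 | (cp >> 18));
    out[1] = (char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (char)(0x80 | (cp & 0x3F));
    return 4;
}

/* Decode a record payload into a newly allocated, NUL-terminated UTF-8
 * string. The caller frees the result. Returns NULL if allocation fails. */
char *ppt_text_to_utf8(PptTextKind kind, const uint8_t *data, size_t len)
{
    size_t n = 0, i = 0;
    char *out;

    assert(data != NULL || len == 0);

    /* Each input byte expands to at most 2 UTF-8 bytes in either mode. */
    out = malloc(2 * len + 1);
    if (out == NULL)
        return NULL;

    if (kind == PPT_TEXT_BYTES) {
        /* Low byte only: the high byte of every code unit is 0x00. */
        for (i = 0; i < len; i++)
            n += put_utf8(data[i], out + n);
    } else {
        /* Full UTF-16 code units, assumed little-endian. */
        while (i + 1 < len) {
            uint32_t cp = data[i] | ((uint32_t)data[i + 1] << 8);
            i += 2;
            if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < len) {
                uint32_t lo = data[i] | ((uint32_t)data[i + 1] << 8);
                if (lo >= 0xDC00 && lo <= 0xDFFF) {
                    /* Combine a surrogate pair into one code point. */
                    cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
                    i += 2;
                }
            }
            if (cp >= 0xD800 && cp <= 0xDFFF)
                cp = 0xFFFD; /* unpaired surrogate: replacement character */
            n += put_utf8(cp, out + n);
        }
    }

    out[n] = '\0';
    return out;
}
```

The same four payload bytes give different text depending on the record type, which is why swapping the two types garbles the extracted content.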

Some bugs which should be fixed:
 1) "CharsAtom"s are actually read as "BytesAtom"s, and vice versa, because the numerical IDs are swapped in the source code: "CharsAtom" should be identified by 0x0FA0, and "BytesAtom" by 0x0FA8.
 2) The strings read are not NUL-terminated, so when they are normalized, reading goes beyond the real end of the string, generating Invalid Reads in Valgrind logs.
 3) Strings are never converted to UTF-8 before normalizing them.

In fact, because bugs 1) and 3) were present at the same time, some contents were more or less extracted, but definitely not all of them. Bug 2) can be solved just by fixing bug 3) with g_convert(), which always returns NUL-terminated strings.

In addition to the previous fixes, these improvements can also be made:
 * Re-use the same buffer when reading the string records, to avoid new allocations over and over.
 * Stop reading once the maximum number of bytes has been read.
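
A minimal sketch of these two improvements, in plain C. It is not the code from the patch: ScratchBuffer, scratch_reserve and read_records are hypothetical names, and the records are modelled as in-memory arrays instead of a real PPT stream. The buffer grows only when a record is larger than any seen before, and the loop stops as soon as the byte budget is used up.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical scratch buffer shared by all string records. */
typedef struct {
    uint8_t *data;     /* storage reused for every record */
    size_t   capacity; /* bytes currently allocated */
} ScratchBuffer;

/* Grow the buffer only if this record needs more room than it has. */
static int scratch_reserve(ScratchBuffer *buf, size_t needed)
{
    if (needed > buf->capacity) {
        uint8_t *grown = realloc(buf->data, needed);
        if (grown == NULL)
            return -1;
        buf->data = grown;
        buf->capacity = needed;
    }
    return 0;
}

/* Read record payloads one by one into the shared buffer and stop once
 * max_bytes have been read. Returns the total number of bytes read. */
size_t read_records(const uint8_t *const *payloads, const size_t *lengths,
                    size_t count, size_t max_bytes, ScratchBuffer *buf)
{
    size_t total = 0;

    assert(buf != NULL);

    for (size_t i = 0; i < count && total < max_bytes; i++) {
        size_t remaining = max_bytes - total;
        size_t take = lengths[i] < remaining ? lengths[i] : remaining;

        if (take == 0)
            continue;
        if (scratch_reserve(buf, take) != 0)
            break; /* out of memory: keep what was read so far */

        memcpy(buf->data, payloads[i], take);
        /* A real extractor would convert and normalize buf->data here. */
        total += take;
    }
    return total;
}
```

The caller owns the buffer: it starts as { NULL, 0 } and is released with a single free(buf.data) after the last record.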

Comment 1 Aleksander Morgado 2010-04-21 14:30:05 UTC

Created attachment 159257
Fixes all reported issues

Comment 2 Aleksander Morgado 2010-04-21 15:13:21 UTC

Pushed to git master after Carlos' review.