Bug 616403 - Improve & fix reading msoffice/powerpoint files
Status: RESOLVED FIXED
Product: tracker
Classification: Core
Component: Extractor
Version: 0.9.x
Hardware: Other Linux
Importance: Normal normal
Target Milestone: ---
Assigned To: tracker-extractor
QA Contact: Jamie McCracken
Depends on:
Blocks:
 
 
Reported: 2010-04-21 14:24 UTC by Aleksander Morgado
Modified: 2010-04-21 15:13 UTC
See Also:
GNOME target: ---
GNOME version: ---


Attachments
Fixes all reported issues (18.62 KB, patch)
2010-04-21 14:30 UTC, Aleksander Morgado

Description Aleksander Morgado 2010-04-21 14:24:42 UTC
Following the changes described in bug #615765, I found some other things to fix for msoffice/powerpoint (PPT) files as well.

As background, PPT files internally store strings in two types of records: "CharsAtom"s and "BytesAtom"s:
 * If the record is a "CharsAtom", the string comes in plain UTF-16. (http://msdn.microsoft.com/en-us/library/dd772921.aspx)
 * If the record is a "BytesAtom", the string only contains the low byte of each UTF-16 code unit, and the high byte is to be taken as 0x00.
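
The two encodings above can be illustrated with a minimal Python sketch (not the actual Tracker C code). The function name `decode_atom_text` and the `is_chars_atom` flag are hypothetical, and little-endian UTF-16 is assumed. Since a "BytesAtom" byte is a code unit whose high byte is 0x00, it maps to U+0000..U+00FF, which is what Python's "latin-1" codec decodes.

```python
def decode_atom_text(payload: bytes, is_chars_atom: bool) -> str:
    """Decode the payload of a PPT text record (hypothetical helper)."""
    if is_chars_atom:
        # "CharsAtom": two bytes per code unit, assumed little-endian UTF-16.
        return payload.decode("utf-16-le")
    # "BytesAtom": one byte per character, high byte implied to be 0x00,
    # i.e. code points U+0000..U+00FF, which is exactly Latin-1.
    return payload.decode("latin-1")


def atom_text_to_utf8(payload: bytes, is_chars_atom: bool) -> bytes:
    """Convert the record payload to UTF-8 before any normalization."""
    return decode_atom_text(payload, is_chars_atom).encode("utf-8")
```

For instance, the same word "Año" takes 6 bytes in a "CharsAtom" and 3 bytes in a "BytesAtom", and both should give the same UTF-8 output.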

Some bugs which should be fixed:
 1) "CharsAtom"s are really read as "BytesAtom"s, and vice versa, because the numerical IDs are swapped in the source code: "CharsAtom" should be identified by 0x0FA0, and "BytesAtom" by 0x0FA8.
 2) The strings read are not NUL-terminated, so when they are normalized, reading goes beyond the real end of the string, producing "Invalid read" errors in Valgrind logs.
 3) Strings are never converted to UTF-8 before normalizing them.

In fact, because bugs 1) and 3) were present at the same time, some contents were more or less extracted, but definitely not all of them. Bug 2) can be solved just by fixing bug 3) with g_convert(), which always returns NUL-terminated strings.

In addition to the previous fixes, these improvements can also be made:
 * Re-use the same buffer when reading the string records, to avoid new allocations over and over.
 * Stop reading once the maximum number of bytes has been read.
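
The two improvements above could look roughly like the following Python sketch. It is not the attached patch. It assumes a flat run of records, each with an 8-byte little-endian header (version/instance, type, length). The name `extract_text` and the parameters `chars_type`, `bytes_type` and `max_bytes` are hypothetical, and the caller supplies the record type IDs.

```python
import io
import struct

# Assumed record header: recVer/recInstance (2 bytes), recType (2), recLen (4).
HEADER = struct.Struct("<HHI")


def extract_text(stream, chars_type, bytes_type, max_bytes):
    """Collect text from text records, reading at most max_bytes of payload."""
    buf = bytearray(64)      # single buffer, reused for every record
    remaining = max_bytes
    parts = []
    while remaining > 0:
        header = stream.read(HEADER.size)
        if len(header) < HEADER.size:
            break            # end of stream
        _, rec_type, rec_len = HEADER.unpack(header)
        if rec_type not in (chars_type, bytes_type):
            stream.seek(rec_len, io.SEEK_CUR)   # not a text record, skip it
            continue
        take = min(rec_len, remaining)
        if rec_type == chars_type:
            take -= take % 2                    # do not split a UTF-16 code unit
        if take > len(buf):
            buf.extend(bytes(take - len(buf)))  # grow only when a record does not fit
        encoding = "utf-16-le" if rec_type == chars_type else "latin-1"
        with memoryview(buf)[:take] as view:
            got = stream.readinto(view)         # fill the reused buffer in place
            parts.append(str(view[:got], encoding, "replace"))
        stream.seek(rec_len - got, io.SEEK_CUR)  # skip any unread remainder
        remaining -= got
    return "".join(parts)
```

The buffer only grows when a record is larger than anything seen so far, and the loop ends as soon as the byte budget is used up, even if more text records follow.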
Comment 1 Aleksander Morgado 2010-04-21 14:30:05 UTC
Created attachment 159257
Fixes all reported issues
Comment 2 Aleksander Morgado 2010-04-21 15:13:21 UTC
Pushed to git master after Carlos' review.