GNOME Bugzilla – Bug 616403
Improve & fix reading msoffice/powerpoint files
Last modified: 2010-04-21 15:13:21 UTC
Following the changes described in bug #615765 for msoffice/powerpoint (PPT) files, I found some other things to fix as well.

As background, PPT files store strings internally in two types of records, "CharsAtom"s and "BytesAtom"s:
 * If the record is a "CharsAtom", the string is stored as plain UTF-16. (http://msdn.microsoft.com/en-us/library/dd772921.aspx)
 * If the record is a "BytesAtom", the string contains only the low byte of each UTF-16 code unit, and the high byte is to be taken as 0x00.

Some bugs which should be fixed:
 1) "CharsAtom"s are actually read as "BytesAtom"s, and vice versa. This is because the numerical IDs are swapped in the source code: "CharsAtom" should be identified by 0x0FA8, and "BytesAtom" should be identified by 0x0FA0.
 2) The strings read are not NUL-terminated, so when they are normalized, reading goes beyond the real end of the string, generating Invalid Reads in Valgrind logs.
 3) Strings are never converted to UTF-8 before being normalized.

In practice, because bugs 1) and 3) occurred at the same time, some content was more or less extracted, but definitely not all of it. Bug 2) can be solved just by fixing bug 3) with g_convert(), which always returns NUL-terminated strings.

In addition to the previous fixes, these improvements can also be made:
 * Re-use the same buffer when reading the string records, to avoid new allocations over and over.
 * Stop reading once the maximum number of bytes has been read.
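As a rough illustration of the two string encodings and of why converting to UTF-8 also gives a terminated string, here is a minimal sketch in plain C. It is not the attached patch (which, per the description, relies on g_convert()); the function names bytes_atom_to_utf8(), chars_atom_to_utf8() and put_utf8() are hypothetical, the conversion is hand-rolled only to keep the example free of dependencies, and UTF-16 payloads are assumed to be little-endian.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: write code point cp as UTF-8 into dst.
 * Returns the number of bytes written (1 to 4). */
static size_t put_utf8(uint32_t cp, char *dst)
{
    if (cp < 0x80) {
        dst[0] = (char)cp;
        return 1;
    }
    if (cp < 0x800) {
        dst[0] = (char)(0xC0 | (cp >> 6));
        dst[1] = (char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        dst[0] = (char)(0xE0 | (cp >> 12));
        dst[1] = (char)(0x80 | ((cp >> 6) & 0x3F));
        dst[2] = (char)(0x80 | (cp & 0x3F));
        return 3;
    }
    dst[0] = (char)(0xF0 | (cp >> 18));
    dst[1] = (char)(0x80 | ((cp >> 12) & 0x3F));
    dst[2] = (char)(0x80 | ((cp >> 6) & 0x3F));
    dst[3] = (char)(0x80 | (cp & 0x3F));
    return 4;
}

/* "BytesAtom" payload: every byte is the low byte of a UTF-16 code unit
 * whose high byte is 0, so each byte is a code point in 0x00..0xFF.
 * Returns a newly allocated, NUL-terminated UTF-8 string (caller frees). */
char *bytes_atom_to_utf8(const uint8_t *src, size_t len)
{
    assert(src != NULL || len == 0);
    char *out = malloc(len * 2 + 1); /* at most 2 UTF-8 bytes per input byte */
    size_t n = 0;

    if (out == NULL)
        return NULL;
    for (size_t i = 0; i < len; i++)
        n += put_utf8(src[i], out + n);
    out[n] = '\0'; /* explicit terminator, so later normalization cannot overrun */
    return out;
}

/* "CharsAtom" payload: UTF-16 code units (little-endian is an assumption).
 * Returns a newly allocated, NUL-terminated UTF-8 string (caller frees). */
char *chars_atom_to_utf8(const uint8_t *src, size_t len)
{
    assert(src != NULL || len == 0);
    size_t units = len / 2; /* a trailing odd byte is ignored */
    char *out = malloc(units * 3 + 1); /* at most 3 UTF-8 bytes per code unit */
    size_t n = 0;

    if (out == NULL)
        return NULL;
    for (size_t i = 0; i < units; i++) {
        uint32_t cp = (uint32_t)src[2 * i] | ((uint32_t)src[2 * i + 1] << 8);

        /* Combine a high surrogate with the low surrogate that follows it. */
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < units) {
            uint32_t lo = (uint32_t)src[2 * i + 2] | ((uint32_t)src[2 * i + 3] << 8);
            if (lo >= 0xDC00 && lo <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
                i++;
            }
        }
        /* Any surrogate still left here is unpaired: emit U+FFFD instead. */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            cp = 0xFFFD;
        n += put_utf8(cp, out + n);
    }
    out[n] = '\0';
    return out;
}
```

Both functions allocate a fresh string on each call for simplicity. The buffer re-use and the byte limit suggested above concern the raw record buffer that is read from the file, which this sketch does not model.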
Created attachment 159257: Fixes all reported issues
Pushed to git master after Carlos' review.