After an evaluation, GNOME has moved from Bugzilla to GitLab. Learn more about GitLab.
No new issues can be reported in GNOME Bugzilla anymore.
To report an issue in a GNOME project, go to GNOME GitLab.
Do not go to GNOME Gitlab for: Bluefish, Doxygen, GnuCash, GStreamer, java-gnome, LDTP, NetworkManager, Tomboy.
Bug 616329 - Improve & fix reading msoffice/excel files
Improve & fix reading msoffice/excel files
Status: RESOLVED FIXED
Product: tracker
Classification: Core
Component: Extractor
0.9.x
Other Linux
: Normal normal
: ---
Assigned To: tracker-extractor
Jamie McCracken
Depends on:
Blocks:
 
 
Reported: 2010-04-20 19:24 UTC by Aleksander Morgado
Modified: 2010-04-21 12:11 UTC
See Also:
GNOME target: ---
GNOME version: ---


Attachments
Implemented all improvements and errors reported (27.36 KB, patch)
2010-04-20 19:32 UTC, Aleksander Morgado
none Details | Review
Small update to the original patch (27.13 KB, patch)
2010-04-20 19:42 UTC, Aleksander Morgado
none Details | Review
Re-use the string flag reader method (27.26 KB, patch)
2010-04-20 20:27 UTC, Aleksander Morgado
needs-work Details | Review

Description Aleksander Morgado 2010-04-20 19:24:58 UTC
Following changes stated in bug #615765, for msoffice/excel (xls) files.

Currently, the behavior is the following:
 * Using libgsf, iterate through all strings in all records, loading all the contents in a single GString in memory
 * Once the GString contains all concatenated strings, normalize it and limit it to max words.

Also, found several important bugs to be fixed:
 * Strings which are split into two ExcelExtendedStringRecord are not properly read. The change of record mechanism when this happens in the middle of a string is not currently implemented.
 * String contents are added as they come to the GString, without previously converting them to UTF-8. This actually causes that the string to be normalized may have CP1252-specific bytes or even worse, fully UTF-16 encoded chunks.
 * When string contents are added to the GString they are added either byte per byte or two-bytes per two-bytes. This is definitely not efficient, and it's just easier to read the whole string in one operation (or in two if it's split in two records as explained before).


Additionally, the reading can be improved in the following way:
 * Limit the max number of bytes to be read from the stream, to some safe
limit like 3*max_words*max_word_size.
 * Don't load the whole contents in a single string in heap: use a buffer to read the contents, convert to UTF-8, perform normalization and word count (chunk by chunk). The same method as per the msoffice/word documents can be used.
 * Stop reading the contents when max bytes reached.
 * Stop reading the contents when max number of words reached.
Comment 1 Aleksander Morgado 2010-04-20 19:32:14 UTC
Created attachment 159189 [details] [review]
Implemented all improvements and errors reported
Comment 2 Aleksander Morgado 2010-04-20 19:42:34 UTC
Created attachment 159191 [details] [review]
Small update to the original patch
Comment 3 Aleksander Morgado 2010-04-20 20:27:26 UTC
Created attachment 159193 [details] [review]
Re-use the string flag reader method
Comment 4 Martyn Russell 2010-04-21 10:55:06 UTC
Comment on attachment 159193 [details] [review]
Re-use the string flag reader method

> /* ExtendendString Record offset in stream and length */
> typedef struct {
>-	guint32 offset;
>-	guint32	length;
>+	gsf_off_t offset; /* 64 bits!! */
>+	gsize     length;

Oh eeek. Were we assuming offset was 32 bits when it wasn't? Great catch!

> } ExcelExtendedStringRecord;
> 
> typedef enum {
>@@ -367,7 +367,7 @@ read_8bit (const guint8 *buffer)
>  * @param buffer data to read integer from
>  * @return 16 bit unsigned integer
>  */
>-static gint
>+static guint
> read_16bit (const guint8 *buffer)
> {
> 	return buffer[0] + (buffer[1] << 8);
>@@ -378,7 +378,7 @@ read_16bit (const guint8 *buffer)
>  * @param buffer data to read integer from
>  * @return 32 bit unsigned integer
>  */

Hmm, I was wondering about this, it looks like the maximum size here is a guint16. Shouldn't we be returning a guint16? The issue with guint is that it depends on the platform of course (for me it is a guint64, it could be a guint32 too and both are big enough, but guint16 sounds more correct here)

>-static gint
>+static guint
> read_32bit (const guint8 *buffer)
> {
> 	return buffer[0] + (buffer[1] << 8) + (buffer[2] << 16) + (buffer[3] << 24);
>@@ -784,6 +784,92 @@ open_uri (const gchar *uri)
> 	return infile;
> }

Again, shouldn't we be more specific than guint here and use a guint32 instead.
 
>+	g_return_if_fail (buffer != NULL &&
>+	                  chunk_size > 0 &&
>+	                  p_words_remaining != NULL &&
>+	                  p_bytes_remaining != NULL &&
>+	                  p_content != NULL);

Please avoid using one statement like the above for checking. The problem is, if any one of these fails, in the logs, how do we know which one it was?
If we have one call per logical statement, it is easier to determine what the real problem was if it ever happens.

>+	}
>+	else {

Should be } else {

>+		g_warning ("Couldn't convert %d bytes from %s to UTF-8: %s",
>+		           chunk_size,
>+		           is_ansi ? "CP1252" : "UTF-16",
>+		           error ? error->message : NULL);

Usually, we use "no error given" in the case of NULL above so it is clear that there was no specific error.

>+	}
>+
>+	/* Note that error may be set even if some converted text is
>+	 * available, due to G_CONVERT_ERROR_ILLEGAL_SEQUENCE for example */
>+	g_clear_error (&error);

As I understand it then, we read n bytes of chunk where n is valid or can be converted and we repeat this until we reach the limit according to the config right? Just wondering if that's right, if we get invalid sequences, can we really continue reading the next chunks like that? Wouldn't that normally be garbage?

>+/* Reads and interprets the flags of a given string. May be
>+ * used just to skip the fields, as when this bitmask-byte
>+ * comes as the first byte of a new record. */
>+static void
>+read_excel_string_flags (GsfInput *stream,
>+                         gboolean *p_is_high_byte,
>+                         guint16  *p_c_run,
>+                         guint16  *p_cb_ext_rst)

We should use more meaningful parameters here. What is p_c_run (a pointer to something?), same for p_cb_ext_rst, looks like a pointer to a callback for something?

>+	/* Reading  1 byte for mask */

Small typo, extra space before "1"

>+	gsf_input_read (stream, 1, tmp_buffer);
>+	bit_mask = read_8bit (tmp_buffer);
>+
>+	/* Get flags */
>+	if (p_is_high_byte) {
>+		*p_is_high_byte = (bit_mask & 0x01) == 0x01;
>+	}
>+	is_ext_string = (bit_mask & 0x04) == 0x04;

I presume this is "is_external_string" ?

>+	is_rich_string = (bit_mask & 0x08) == 0x08;

What is a rich_string?

>+
>+	/* If the c_run value is required as output, read it */
>+	if (p_c_run) {
>+		if (is_rich_string) {
>+			/* Reading 2 Bytes */
>+			gsf_input_read (stream, 2, tmp_buffer);

Extra space here please :)

>+			/* Reading cRun */
>+			*p_c_run = read_16bit (tmp_buffer);
>+		}
>+		else {

} else {

>+			*p_c_run = 0;
>+		}
>+	}
>+	/* If not required, just skip those bytes */
>+	else if (is_rich_string) {

} else if () {

Usually I put comments under the condition if it pertains to the condition.

>+		gsf_input_seek (stream, 2, G_SEEK_CUR);
>+	}
>+
>+	/* If the cb_ext_rst value is required as output, read it */
>+	if (p_cb_ext_rst) {
>+		if (is_ext_string) {
>+			/* Reading 4 Bytes */
>+			gsf_input_read (stream, 4, tmp_buffer);

Space please :)

>+			/* Reading cRun */
>+			*p_cb_ext_rst = read_16bit (tmp_buffer);
>+		}
>+		else {

} else {

>+			*p_cb_ext_rst = 0;
>+		}
>+	}
>+	/* If not required, just skip those bytes */
>+	else if (is_ext_string) {

Same as above.

>+/* Returns TRUE if record was changed. BUT, the value of the
>+ *  current_record should be checked by the caller to know
>+ *  if there are no more records */
>+static gboolean
>+change_excel_record_if_needed (GsfInput *stream,
>+                               GArray   *record_array,
>+                               guint    *p_current_record)
>+{
>+	ExcelExtendedStringRecord *record;
>+
>+	/* Get current record */
>+	record = &g_array_index (record_array,
>+	                         ExcelExtendedStringRecord,
>+	                         *p_current_record);
>+
>+	/* We may already have surpased the record, so adjust if so */

Typo, should be "surpassed".

>+	if (gsf_input_tell (stream) >= (record->offset + record->length)) {
>+		/* Switch records and read from the second one... */
>+		(*p_current_record)++;

No need for brackets there.

>+		if (*p_current_record < record_array->len) {
>+			record = &g_array_index (record_array,
>+			                         ExcelExtendedStringRecord,
>+			                         *p_current_record);
>+
>+			gsf_input_seek (stream, record->offset, G_SEEK_SET);
>+		}

Space here please.

>+		return TRUE;
>+	}

Space here please.

>+	return FALSE;
>+}

Only one lines between functions (below there are 3 it seems):

>+
>+
>+
>+/* Returns TRUE if correctly read


>+static gboolean
>+read_excel_string (GsfInput *stream,
>+                   guint8   *buffer,
>+                   gsize     chunk_size,
>+                   GArray   *record_array,
>+                   guint    *p_current_record)
>+{
>+	ExcelExtendedStringRecord *record;
>+	gsf_off_t current_position;
>+	gsf_off_t current_record_end;

Unnecessary extra line below:

>+
>+
>+	/* Record may have changed when we want to read the string contents


>+	 *  This is a pretty special case, where the new CONTINUE record
>+	 * shouldn't start with a bitmask */
>+	if (change_excel_record_if_needed (stream, record_array, p_current_record) &&
>+	    *p_current_record >= record_array->len) {
>+		/* When reached max number of records, just return */
>+		return FALSE;
>+	}

Extra line here too:

>+
>+
>+	/* Get current record */


>+	record = &g_array_index (record_array,
>+	                         ExcelExtendedStringRecord,
>+	                         *p_current_record);
>+
>+	/* Compute current position in the stream and end of current record*/
>+	current_position = gsf_input_tell (stream);
>+	current_record_end = record->offset + record->length;
>+
>+	/* The best case is when the whole number of bytes to read are in the
>+	 * current record, as no record switching is therefore needed */
>+	if (current_position + chunk_size <= current_record_end) {
>+		return gsf_input_read (stream, chunk_size, buffer) != NULL ? TRUE : FALSE;
>+	}

} else if {

>+	/* Safety check, actually pretty important */
>+	else if (current_record_end < current_position) {
>+		return FALSE;
>+	}
>+	/* Read the string in two chunks */
>+	else {

} else {

>+			(*p_current_record)++;

No need for brackets here

>+
>+
>+
> /**

Extra lines above should be avoided

>+	/* Note: The first record is ALWAYS the SST, so coming with cst_total and
>+	 * cst_unique values.
>+	 * Some extra background: Records with data longer than 8,224 bytes MUST be
>+	 * split into several records, so in this case, if the SST record is big
>+	 * enough, it will have one or more CONTINUE records */

Might want to explain what SST is... I would have to look this up.

> 
> 	/* Reading cst total */
> 	gsf_input_read (stream, 4, tmp_buffer);
>@@ -1125,118 +1360,107 @@ xls_get_extended_record_string (GsfInput *stream,
> 	gsf_input_read (stream, 4, tmp_buffer);
> 	cst_unique = read_32bit (tmp_buffer);

We do quite a number of gsf_input_read() calls with read_{16|32}bit() calls, I wonder if we should have one function to do this?

>+	i=0;

Should be i = 0

>+		/* If High Byte, chunk size *2 as stream is in UTF-16 */
>+		if (is_high_byte) {
>+			chunk_size *=2;

Should be *= 2 (space needed)

>-			gsf_input_seek (stream, c_run, G_SEEK_CUR);
>+			gsf_input_seek (stream, 4*c_run, G_SEEK_CUR);

Should be 4 * c_run

>-		if (is_ext_string) {
>+		if (cb_ext_rst > 0) {

Again, I am wondering what this variable is (name needs to be clearer)

> 			while (header2.id == RECORD_TYPE_CONTINUE) {
>+

Hmm, why did we add a line here?

--

The patch is quite good though, thanks Aleksander, please commit once the above issues are resolved.
Comment 5 Martyn Russell 2010-04-21 11:14:30 UTC
> >+	if (gsf_input_tell (stream) >= (record->offset + record->length)) {
> >+		/* Switch records and read from the second one... */
> >+		(*p_current_record)++;
> 
> No need for brackets there.

> >+			(*p_current_record)++;
> 
> No need for brackets here

These two are incorrect. As discussed outside of this bug report. Sorry for the mix up Aleksander
Comment 6 Aleksander Morgado 2010-04-21 11:59:10 UTC
> > typedef enum {
> >@@ -367,7 +367,7 @@ read_8bit (const guint8 *buffer)
> >  * @param buffer data to read integer from
> >  * @return 16 bit unsigned integer
> >  */
> >-static gint
> >+static guint
> > read_16bit (const guint8 *buffer)
> > {
> > 	return buffer[0] + (buffer[1] << 8);
> >@@ -378,7 +378,7 @@ read_16bit (const guint8 *buffer)
> >  * @param buffer data to read integer from
> >  * @return 32 bit unsigned integer
> >  */
> 
> Hmm, I was wondering about this, it looks like the maximum size here is a
> guint16. Shouldn't we be returning a guint16? The issue with guint is that it
> depends on the platform of course (for me it is a guint64, it could be a
> guint32 too and both are big enough, but guint16 sounds more correct here)
> 

Yes, definitely sounds more correct. Changing it..

> >-static gint
> >+static guint
> > read_32bit (const guint8 *buffer)
> > {
> > 	return buffer[0] + (buffer[1] << 8) + (buffer[2] << 16) + (buffer[3] << 24);
> >@@ -784,6 +784,92 @@ open_uri (const gchar *uri)
> > 	return infile;
> > }
> 
> Again, shouldn't we be more specific than guint here and use a guint32 instead.

Also changing it.

> 
> >+	g_return_if_fail (buffer != NULL &&
> >+	                  chunk_size > 0 &&
> >+	                  p_words_remaining != NULL &&
> >+	                  p_bytes_remaining != NULL &&
> >+	                  p_content != NULL);
> 
> Please avoid using one statement like the above for checking. The problem is,
> if any one of these fails, in the logs, how do we know which one it was?
> If we have one call per logical statement, it is easier to determine what the
> real problem was if it ever happens.
> 

Aha, ok.

> >+	}
> >+	else {
> 
> Should be } else {
> 
> >+		g_warning ("Couldn't convert %d bytes from %s to UTF-8: %s",
> >+		           chunk_size,
> >+		           is_ansi ? "CP1252" : "UTF-16",
> >+		           error ? error->message : NULL);
> 
> Usually, we use "no error given" in the case of NULL above so it is clear that
> there was no specific error.

I believe I need to change this also then in some other patches I sent.

> 
> >+	}
> >+
> >+	/* Note that error may be set even if some converted text is
> >+	 * available, due to G_CONVERT_ERROR_ILLEGAL_SEQUENCE for example */
> >+	g_clear_error (&error);
> 
> As I understand it then, we read n bytes of chunk where n is valid or can be
> converted and we repeat this until we reach the limit according to the config
> right? Just wondering if that's right, if we get invalid sequences, can we
> really continue reading the next chunks like that? Wouldn't that normally be
> garbage?
> 

Not really. Usually chunks in these document types are self-contained, where a chunk doesn't have to do anything with the next one, so even if this chunk just processed is not fully ok, the next one could still be perfect, and so on.

> >+/* Reads and interprets the flags of a given string. May be
> >+ * used just to skip the fields, as when this bitmask-byte
> >+ * comes as the first byte of a new record. */
> >+static void
> >+read_excel_string_flags (GsfInput *stream,
> >+                         gboolean *p_is_high_byte,
> >+                         guint16  *p_c_run,
> >+                         guint16  *p_cb_ext_rst)
> 
> We should use more meaningful parameters here. What is p_c_run (a pointer to
> something?), same for p_cb_ext_rst, looks like a pointer to a callback for
> something?

In this case, c_run and cb_ext_rst are actually the real names of the fields as per the MSDN documentation. Will add a comment with a link to MSDN, which btw is the following:
http://msdn.microsoft.com/en-us/library/dd943830.aspx

> 
> >+	/* Reading  1 byte for mask */
> 
> Small typo, extra space before "1"
> 

Fixed.

> >+	gsf_input_read (stream, 1, tmp_buffer);
> >+	bit_mask = read_8bit (tmp_buffer);
> >+
> >+	/* Get flags */
> >+	if (p_is_high_byte) {
> >+		*p_is_high_byte = (bit_mask & 0x01) == 0x01;
> >+	}
> >+	is_ext_string = (bit_mask & 0x04) == 0x04;
> 
> I presume this is "is_external_string" ?

Ha, no idea. The fExtSt flag says if the string contains "phonetic string data". The link to the MSDN reference should help.

> 
> >+	is_rich_string = (bit_mask & 0x08) == 0x08;
> 
> What is a rich_string?
> 

Again, no idea :-). A Rich string seems to come with font-related info, which we don't use in Tracker. Again, the link to the MSDN reference should help.

> >+
> >+	/* If the c_run value is required as output, read it */
> >+	if (p_c_run) {
> >+		if (is_rich_string) {
> >+			/* Reading 2 Bytes */
> >+			gsf_input_read (stream, 2, tmp_buffer);
> 
> Extra space here please :)

Fixed.

> 
> >+			/* Reading cRun */
> >+			*p_c_run = read_16bit (tmp_buffer);
> >+		}
> >+		else {
> 
> } else {

Fixed.

> 
> >+			*p_c_run = 0;
> >+		}
> >+	}
> >+	/* If not required, just skip those bytes */
> >+	else if (is_rich_string) {
> 
> } else if () {
> 
> Usually I put comments under the condition if it pertains to the condition.

Fixed.

> 
> >+		gsf_input_seek (stream, 2, G_SEEK_CUR);
> >+	}
> >+
> >+	/* If the cb_ext_rst value is required as output, read it */
> >+	if (p_cb_ext_rst) {
> >+		if (is_ext_string) {
> >+			/* Reading 4 Bytes */
> >+			gsf_input_read (stream, 4, tmp_buffer);
> 
> Space please :)

Fixed.

> 
> >+			/* Reading cRun */
> >+			*p_cb_ext_rst = read_16bit (tmp_buffer);
> >+		}
> >+		else {
> 
> } else {

Fixed.

> 
> >+			*p_cb_ext_rst = 0;
> >+		}
> >+	}
> >+	/* If not required, just skip those bytes */
> >+	else if (is_ext_string) {
> 
> Same as above.

Fixed.

> 
> >+/* Returns TRUE if record was changed. BUT, the value of the
> >+ *  current_record should be checked by the caller to know
> >+ *  if there are no more records */
> >+static gboolean
> >+change_excel_record_if_needed (GsfInput *stream,
> >+                               GArray   *record_array,
> >+                               guint    *p_current_record)
> >+{
> >+	ExcelExtendedStringRecord *record;
> >+
> >+	/* Get current record */
> >+	record = &g_array_index (record_array,
> >+	                         ExcelExtendedStringRecord,
> >+	                         *p_current_record);
> >+
> >+	/* We may already have surpased the record, so adjust if so */
> 
> Typo, should be "surpassed".

Fixed.

> 
> >+	if (gsf_input_tell (stream) >= (record->offset + record->length)) {
> >+		/* Switch records and read from the second one... */
> >+		(*p_current_record)++;
> 
> No need for brackets there.

Already discussed that they're needed, so leaving them.

> 
> >+		if (*p_current_record < record_array->len) {
> >+			record = &g_array_index (record_array,
> >+			                         ExcelExtendedStringRecord,
> >+			                         *p_current_record);
> >+
> >+			gsf_input_seek (stream, record->offset, G_SEEK_SET);
> >+		}
> 
> Space here please.

Fixed.

> 
> >+		return TRUE;
> >+	}
> 
> Space here please.

Fixed.

> 
> >+	return FALSE;
> >+}
> 
> Only one lines between functions (below there are 3 it seems):
> 

Fixed.


> >+static gboolean
> >+read_excel_string (GsfInput *stream,
> >+                   guint8   *buffer,
> >+                   gsize     chunk_size,
> >+                   GArray   *record_array,
> >+                   guint    *p_current_record)
> >+{
> >+	ExcelExtendedStringRecord *record;
> >+	gsf_off_t current_position;
> >+	gsf_off_t current_record_end;
> 
> Unnecessary extra line below:
> 

Fixed.

> >+
> >+
> >+	/* Record may have changed when we want to read the string contents
> 
> 
> >+	 *  This is a pretty special case, where the new CONTINUE record
> >+	 * shouldn't start with a bitmask */
> >+	if (change_excel_record_if_needed (stream, record_array, p_current_record) &&
> >+	    *p_current_record >= record_array->len) {
> >+		/* When reached max number of records, just return */
> >+		return FALSE;
> >+	}
> 
> Extra line here too:
> 

Fixed.

> >+
> >+
> >+	/* Get current record */
> 
> 
> >+	record = &g_array_index (record_array,
> >+	                         ExcelExtendedStringRecord,
> >+	                         *p_current_record);
> >+
> >+	/* Compute current position in the stream and end of current record*/
> >+	current_position = gsf_input_tell (stream);
> >+	current_record_end = record->offset + record->length;
> >+
> >+	/* The best case is when the whole number of bytes to read are in the
> >+	 * current record, as no record switching is therefore needed */
> >+	if (current_position + chunk_size <= current_record_end) {
> >+		return gsf_input_read (stream, chunk_size, buffer) != NULL ? TRUE : FALSE;
> >+	}
> 
> } else if {

Fixed.

> 
> >+	/* Safety check, actually pretty important */
> >+	else if (current_record_end < current_position) {
> >+		return FALSE;
> >+	}
> >+	/* Read the string in two chunks */
> >+	else {
> 
> } else {

Fixed.
> 
> >+			(*p_current_record)++;
> 
> No need for brackets here

Same as above.

> 
> >+
> >+
> >+
> > /**
> 
> Extra lines above should be avoided
> 

Fixed.

> >+	/* Note: The first record is ALWAYS the SST, so coming with cst_total and
> >+	 * cst_unique values.
> >+	 * Some extra background: Records with data longer than 8,224 bytes MUST be
> >+	 * split into several records, so in this case, if the SST record is big
> >+	 * enough, it will have one or more CONTINUE records */
> 
> Might want to explain what SST is... I would have to look this up.

Same as before, will put the link there. The links to be added are:
 * SST: http://msdn.microsoft.com/en-us/library/dd773037%28v=office.12%29.aspx
 * CONTINUE: http://msdn.microsoft.com/en-us/library/dd949081%28v=office.12%29.aspx

> 
> > 
> > 	/* Reading cst total */
> > 	gsf_input_read (stream, 4, tmp_buffer);
> >@@ -1125,118 +1360,107 @@ xls_get_extended_record_string (GsfInput *stream,
> > 	gsf_input_read (stream, 4, tmp_buffer);
> > 	cst_unique = read_32bit (tmp_buffer);
> 
> We do quite a number of gsf_input_read() calls with read_{16|32}bit() calls, I
> wonder if we should have one function to do this?

I really tried to avoid as much as possible making too many reads already. In this specific case, cst_unique and cst_total are actually read only once, so should be ok to leave them like that.

> 
> >+	i=0;
> 
> Should be i = 0

Fixed.

> 
> >+		/* If High Byte, chunk size *2 as stream is in UTF-16 */
> >+		if (is_high_byte) {
> >+			chunk_size *=2;
> 
> Should be *= 2 (space needed)

Fixed.

> 
> >-			gsf_input_seek (stream, c_run, G_SEEK_CUR);
> >+			gsf_input_seek (stream, 4*c_run, G_SEEK_CUR);
> 
> Should be 4 * c_run

Fixed.

> 
> >-		if (is_ext_string) {
> >+		if (cb_ext_rst > 0) {
> 
> Again, I am wondering what this variable is (name needs to be clearer)

The previous links to MSDN should help (name is actually same as field name in the documentation).

> 
> > 			while (header2.id == RECORD_TYPE_CONTINUE) {
> >+
> 
> Hmm, why did we add a line here?

:-) removed

> 
> --
> 
> The patch is quite good though, thanks Aleksander, please commit once the above
> issues are resolved.

Ok then.
Comment 7 Aleksander Morgado 2010-04-21 12:11:19 UTC
Pushed to git master