Bug 616158 - Improve reading msoffice files
Status: RESOLVED FIXED
Product: tracker
Classification: Core
Component: Extractor
Version: 0.9.x
Hardware/OS: Other Linux
Priority: Normal  Severity: normal
Target Milestone: ---
Assigned To: tracker-extractor
QA Contact: Jamie McCracken
Depends on:
Blocks:
 
 
Reported: 2010-04-19 10:52 UTC by Aleksander Morgado
Modified: 2010-04-20 11:11 UTC
See Also:
GNOME target: ---
GNOME version: ---


Attachments
Improved reading doc files (9.09 KB, patch)
2010-04-19 10:55 UTC, Aleksander Morgado
Flags: needs-work

Description Aleksander Morgado 2010-04-19 10:52:04 UTC
As per bug #615765, the contents of the msoffice (doc, xls, ppt) files are currently read in the following way:
 * Using libgsf, load all the contents in a single stream in memory
 * Convert the whole string to UTF-8
 * Normalize the string and limit it to max words

Currently, this can be improved in the following way:
 * Limit the max number of bytes to be read from the stream, to some safe limit like 3*max_words*max_word_size.
 * Don't load the whole doc into the heap: use a buffer to read the contents, convert to UTF-8, and perform normalization and word counting (chunk by chunk).
 * Stop reading the contents when max bytes reached.
 * Stop reading the contents when max number of words reached.
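
The loop described above can be sketched roughly as follows. This is a minimal standalone illustration, not the actual patch: it walks an in-memory string instead of a libgsf stream, skips the UTF-8 conversion and normalization steps, and the limits are made-up values (in tracker they come from the FTS config):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical limits; in tracker these come from the FTS config. */
#define MAX_WORDS     10
#define MAX_WORD_SIZE 30
/* Safe byte cap: ~3 bytes per UTF-8 code point covers almost all text. */
#define MAX_BYTES     ((size_t) 3 * MAX_WORDS * MAX_WORD_SIZE)
#define CHUNK_SIZE    64

/* Count whitespace-separated words in a chunk, stopping at the limit. */
static unsigned int
count_words (const char *chunk, size_t len, unsigned int limit)
{
	unsigned int n = 0;
	int in_word = 0;
	size_t i;

	for (i = 0; i < len && n < limit; i++) {
		if (chunk[i] == ' ' || chunk[i] == '\n') {
			in_word = 0;
		} else if (!in_word) {
			in_word = 1;
			n++;
		}
	}
	return n;
}

/* Walk the document chunk by chunk, stopping as soon as either the
 * word limit or the byte limit is reached. In the real extractor each
 * chunk would additionally be converted to UTF-8 and normalized, and a
 * word split across a chunk boundary would be handled properly (here,
 * for simplicity, it would be counted twice). */
static unsigned int
extract_words_limited (const char *doc, size_t doc_len)
{
	unsigned int words_remaining = MAX_WORDS;
	size_t bytes_remaining = MAX_BYTES;
	size_t offset = 0;

	while (words_remaining > 0 && bytes_remaining > 0 && offset < doc_len) {
		size_t chunk = CHUNK_SIZE;

		if (chunk > doc_len - offset)
			chunk = doc_len - offset;
		if (chunk > bytes_remaining)
			chunk = bytes_remaining;

		words_remaining -= count_words (doc + offset, chunk,
		                                words_remaining);
		bytes_remaining -= chunk;
		offset += chunk;
	}
	return MAX_WORDS - words_remaining;
}
```

The point of the two limits is that reading stops early no matter which one is exhausted first, so a huge single-piece document never gets loaded whole.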
Comment 1 Aleksander Morgado 2010-04-19 10:55:45 UTC
Created attachment 159059 [details] [review]
Improved reading doc files

This improves the reading of msoffice Word files. If the approach is considered OK, I will also modify the Excel and PowerPoint file reading accordingly.
Comment 2 Martyn Russell 2010-04-20 10:31:28 UTC
Comment on attachment 159059 [details] [review]
Improved reading doc files

>+static gint
>+fts_max_word_length (void)
>+{
>+	TrackerFTSConfig *fts_config = tracker_main_get_fts_config ();

Prefer spacing between variable definition and code blocks.

>+	return tracker_fts_config_get_max_word_length (fts_config);
>+}
>+	while ((n_words_remaining > 0) &&
>+	       (n_bytes_remaining > 0) &&
>+	       (i < piece_count)) {

No need to put each condition here in brackets.

>+
>+		/* NOTE: Very very long pieces may appear. In fact, a single
>+		 *  piece document seems to be quite normal. Thus, we limit
>+		 *  here the number of bytes to read from the stream, based
>+		 *  on the maximum number of bytes in UTF-8. Assuming, then
>+		 *  that a safe limit is 2*n_bytes_remaining for UTF-16 input,
>+		 *  and just n_bytes_remaining for CP1252 input */
>+		piece_size = MIN (piece_size, n_bytes_remaining);
>+
>+		/* UTF-16 uses twice as many bytes as CP1252
>+		 *  NOTE: Not quite sure about this. Some unicode points will be
>+		 *  encoded using 4 bytes in UTF-16 */
> 		if (!is_ansi) {
> 			piece_size *= 2;
> 		}
> 
>-		if (piece_size < 1) {
>-			continue;
>-		}
>+		/* Avoid empty pieces */
>+		if (piece_size >= 1) {
>+			GError *error = NULL;
>+			gsize n_bytes_utf8;
>+			guint n_words_normalized;
>+
>+			/* Re-allocate buffer to make it bigger if needed.
>+			 *  This text buffer is re-used over and over in each
>+			 *  iteration.  */
>+			if (piece_size > text_buffer_size) {
>+				text_buffer = g_realloc (text_buffer, piece_size);
>+				text_buffer_size = piece_size;
>+			}

Hmm, it might make more sense to use a GString here to do the reallocations?

>-		/* read single text piece from document_stream */
>-		text_buffer = g_malloc (piece_size);
>-		gsf_input_seek (document_stream, fc, G_SEEK_SET);
>-		gsf_input_read (document_stream, piece_size, text_buffer);
>-
>-		/* pieces can have different encoding */
>-		converted_text = g_convert (text_buffer,
>-		                            piece_size,
>-		                            "UTF-8",
>-		                            is_ansi ? "CP1252" : "UTF-16",
>-		                            NULL,
>-		                            NULL,
>-		                            NULL);
>-
>-		if (converted_text) {
>-			if (!content) {
>-				content = g_string_new (converted_text);
>-			} else {
>-				g_string_append (content, converted_text);
>+			/* read single text piece from document_stream */
>+			gsf_input_seek (document_stream, fc, G_SEEK_SET);
>+			gsf_input_read (document_stream, piece_size, text_buffer);
>+
>+			/* pieces can have different encoding
>+			 *  TODO: Using g_iconv, this extra heap allocation could be
>+			 *   avoided, re-using over and over again the same output buffer
>+			 *   or the UTF-8 encoded string */
>+			converted_text = g_convert (text_buffer,
>+			                            piece_size,
>+			                            "UTF-8",
>+			                            is_ansi ? "CP1252" : "UTF-16",
>+			                            NULL,
>+			                            &n_bytes_utf8,
>+			                            &error);
>+
>+			if (converted_text) {
>+				gchar *normalized_chunk;

Again, space needed between code block/variables.

>+				/* Get normalized chunk */
>+				normalized_chunk = tracker_text_normalize (converted_text,
>+				                                           n_words_remaining,
>+				                                           &n_words_normalized);

Prefer spaces between comments/previous code too.

>+				/* Update number of words remaining.
>+				 * Note that n_words_normalized should always be less than
>+				 * or equal to n_words_remaining */
>+				n_words_remaining = (n_words_normalized <= n_words_remaining ?
>+				                     n_words_remaining - n_words_normalized : 0);
>+
>+				/* Update accumulated UTF-8 bytes read */
>+				n_bytes_remaining = (n_bytes_utf8 <= n_bytes_remaining ?
>+				                     n_bytes_remaining - n_bytes_utf8 : 0);

No need for brackets here... but then Philip and Carlos disagree with me here... so :)

>+
>+				g_debug ("(%s) Piece '%u'; Words normalized: %u (remaining: %u); "
>+				         "Bytes read (UTF-8): %" G_GSIZE_FORMAT " bytes "
>+				         "(remaining: %" G_GSIZE_FORMAT ")", __FUNCTION__, i,

Prefer to have parameters on a separate line especially if there are many of them like here.

>+			else {
>+				g_warning ("Couldn't convert '%d' bytes from '%s' to UTF-8: %s",
>+				           piece_size,

We don't put quotes around %d; we only use them for %s, in case the string is empty and therefore unclear in the logs.

>-	return normalized;
>+	return (content ? g_string_free (content, FALSE) : NULL);

No need for brackets :)

> }
> 
> 
>@@ -1410,6 +1489,8 @@ extract_msoffice (const gchar          *uri,
> 	GsfInfile *infile = NULL;
> 	gchar *content = NULL;
> 	gboolean is_encrypted = FALSE;
>+	gint max_words;
>+	gsize max_bytes;
> 
> 	file = g_file_new_for_uri (uri);
> 
>@@ -1444,15 +1525,24 @@ extract_msoffice (const gchar          *uri,
> 
> 	mime_used = g_file_info_get_content_type (file_info);
> 
>+	/* Set max words to read from content */
>+	max_words = fts_max_words ();

Need space here.

>+	/* Set max bytes to read from content.
>+	 * Assuming 3 bytes per unicode point in UTF-8, as 4-byte UTF-8 unicode
>+	 *  points are really pretty rare */
>+	max_bytes = 3 * max_words * fts_max_word_length();

Space before () needed.

--

The patch looks great apart from that. Really great comments there to help people understand the code too.
Have you been able to test with the encoding types to make sure it works properly?

If you can make the small alterations listed above, you can go ahead and commit this, thanks Aleksander!
Comment 3 Aleksander Morgado 2010-04-20 11:01:40 UTC
> >+			/* Re-allocate buffer to make it bigger if needed.
> >+			 *  This text buffer is re-used over and over in each
> >+			 *  iteration.  */
> >+			if (piece_size > text_buffer_size) {
> >+				text_buffer = g_realloc (text_buffer, piece_size);
> >+				text_buffer_size = piece_size;
> >+			}
> 
> Hmm, it might make more sense to use a GString here to do the reallocations?
> 

Not really. The buffer is not resized to append more data at the end; it is reused in each iteration, and in a new iteration the number of bytes to read may be bigger than in the previous one, which is why the resize is done.
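
That reuse pattern can be illustrated like this (a plain-C sketch with realloc standing in for g_realloc, error checking omitted, and made-up piece sizes; not the patch itself). The buffer is grow-only and its contents are overwritten each iteration, never appended to, which is why a GString's append semantics would not fit:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Grow-only scratch buffer, reused across iterations: reallocated only
 * when a piece is larger than anything seen so far. Returns the number
 * of reallocations performed. */
static unsigned int
run_pieces (const size_t *piece_sizes, size_t n_pieces)
{
	char *text_buffer = NULL;
	size_t text_buffer_size = 0;
	unsigned int reallocs = 0;
	size_t i;

	for (i = 0; i < n_pieces; i++) {
		size_t piece_size = piece_sizes[i];

		/* Re-allocate only when this piece is bigger than any
		 * previous one; otherwise reuse the existing buffer. */
		if (piece_size > text_buffer_size) {
			text_buffer = realloc (text_buffer, piece_size);
			text_buffer_size = piece_size;
			reallocs++;
		}
		/* Overwrite (not append) the buffer with this piece. */
		memset (text_buffer, 'x', piece_size);
	}

	free (text_buffer);
	return reallocs;
}
```

For piece sizes 16, 8, 64, 32 this reallocates only twice (for 16 and for 64); the smaller pieces reuse the existing buffer.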

> >+				 * Note that n_words_normalized should always be less than
> >+				 * or equal to n_words_remaining */
> >+				n_words_remaining = (n_words_normalized <= n_words_remaining ?
> >+				                     n_words_remaining - n_words_normalized : 0);
> >+
> >+				/* Update accumulated UTF-8 bytes read */
> >+				n_bytes_remaining = (n_bytes_utf8 <= n_bytes_remaining ?
> >+				                     n_bytes_remaining - n_bytes_utf8 : 0);
> 
> No need for brackets here... but then Philip and Carlos disagree with me
> here... so :)

Well, if the decision on this is not fixed, I will keep the parentheses in this specific case, where the ternary condition is split across several lines, so that the editor automatically aligns the second line properly ;-)
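
For reference, the clamped ternary update under discussion is just a saturating subtraction; a minimal standalone sketch of the pattern (not the patch itself):

```c
#include <assert.h>
#include <stddef.h>

/* Subtract without underflow: if the amount consumed exceeds what is
 * left, the remaining count clamps to zero instead of wrapping around
 * (which would happen with a plain unsigned subtraction). */
static size_t
saturating_sub (size_t remaining, size_t consumed)
{
	return consumed <= remaining ? remaining - consumed : 0;
}
```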

> Have you been able to test with the encoding types to make sure it works
> properly?
> 

Tested with msoffice word documents. Will apply the same approach to PPT and XLS docs.

> If you can make the small alterations listed above, you can go ahead and commit
> this, thanks Aleksander!

Ok
Comment 4 Aleksander Morgado 2010-04-20 11:11:30 UTC
Pushed to git master