Bug 509487 - beagle-extract-content fails on doc files when LANG != C (+ patch)
Status: RESOLVED WONTFIX
Product: beagle
Classification: Other
Component: General
Version: 0.2.18
Hardware: Other All
Importance: Normal critical
Target Milestone: ---
Assigned To: Beagle Bugs
Beagle Bugs
gnome[unmaintained]
Depends on:
Blocks:
 
 
Reported: 2008-01-14 21:32 UTC by drago01
Modified: 2018-07-03 09:55 UTC
See Also:
GNOME target: ---
GNOME version: ---


Attachments
increase memory limit to 200M (560 bytes, patch)
2008-01-14 21:33 UTC, drago01
Patch status: none
Patch to use LANG=C for filters (2.38 KB, patch)
2008-02-10 18:35 UTC, Debajyoti Bera
Patch status: none
include glue-changes in the patch (2.87 KB, patch)
2008-02-10 20:46 UTC, Debajyoti Bera
Patch status: committed

Description drago01 2008-01-14 21:32:29 UTC
Steps to reproduce:
1. Build beagle with --enable-wv1
2. Make sure LANG is set to something other than C
3. Run beagle-extract-content somefile.doc
4. It crashes


Stack trace:
 beagle-extract-content htb.doc
Filename: file:///home/linux/Desktop/htb.doc
Debug: Loaded 53 filters from /usr/lib64/beagle/Filters/Filters.dll
Filter: Beagle.Filters.FilterDOC (determined in ,25s)
MimeType: application/msword

Properties:
  Timestamp = 2008-01-14 18:24:05 (Utc)

Content:
Warn: doc extractor [file:///home/linux/Desktop/htb.doc]: 
Warn: doc extractor [file:///home/linux/Desktop/htb.doc]: (/usr/lib64/beagle/DocExtractor.exe:9528): GLib-WARNING **: getpwuid_r(): failed due to unknown user id (500)
Warn: doc extractor [file:///home/linux/Desktop/htb.doc]: 
Warn: doc extractor [file:///home/linux/Desktop/htb.doc]: ** (/usr/lib64/beagle/DocExtractor.exe:9528): CRITICAL **: _wapi_shm_attach: mmap error: Nicht genügend Hauptspeicher verfügbar
Warn: doc extractor [file:///home/linux/Desktop/htb.doc]: 
Warn: doc extractor [file:///home/linux/Desktop/htb.doc]: ** ERROR **: file shared.c: line 337 (shm_semaphores_init): assertion failed: (tmp_shared != NULL)
Warn: doc extractor [file:///home/linux/Desktop/htb.doc]: aborting...
Warn: doc extractor [file:///home/linux/Desktop/htb.doc]: Stacktrace:
Warn: doc extractor [file:///home/linux/Desktop/htb.doc]: 
Warn: doc extractor [file:///home/linux/Desktop/htb.doc]: 
Warn: doc extractor [file:///home/linux/Desktop/htb.doc]: Native stacktrace:
Warn: doc extractor [file:///home/linux/Desktop/htb.doc]: 
Warn: doc extractor [file:///home/linux/Desktop/htb.doc]:       mono [0x5203a9]
Warn: doc extractor [file:///home/linux/Desktop/htb.doc]:       /lib64/libpthread.so.0 [0x338da0e540]
Warn: doc extractor [file:///home/linux/Desktop/htb.doc]:       /lib64/libc.so.6(gsignal+0x35) [0x338ce30ec5]
Warn: doc extractor [file:///home/linux/Desktop/htb.doc]:       /lib64/libc.so.6(abort+0x110) [0x338ce32970]
Warn: doc extractor [file:///home/linux/Desktop/htb.doc]:       /lib64/libglib-2.0.so.0(g_logv+0x3b5) [0x3f55a374a5]
Warn: doc extractor [file:///home/linux/Desktop/htb.doc]:       /lib64/libglib-2.0.so.0(g_log+0x83) [0x3f55a37543]
Warn: doc extractor [file:///home/linux/Desktop/htb.doc]:       /lib64/libglib-2.0.so.0(g_assert_warning+0x76) [0x3f55a375c6]
Warn: doc extractor [file:///home/linux/Desktop/htb.doc]:       mono [0x4ca79c]
Warn: doc extractor [file:///home/linux/Desktop/htb.doc]:       mono [0x4d018c]
Warn: doc extractor [file:///home/linux/Desktop/htb.doc]:       mono(mono_once+0x44) [0x4cd324]
Warn: doc extractor [file:///home/linux/Desktop/htb.doc]:       mono [0x4d02c3]
Warn: doc extractor [file:///home/linux/Desktop/htb.doc]:       mono [0x4ccd38]
Warn: doc extractor [file:///home/linux/Desktop/htb.doc]:       mono [0x49e317]
Warn: doc extractor [file:///home/linux/Desktop/htb.doc]:       mono(mono_runtime_init+0x1d) [0x4c2d2d]
Warn: doc extractor [file:///home/linux/Desktop/htb.doc]:       mono [0x4ec4b7]
Warn: doc extractor [file:///home/linux/Desktop/htb.doc]:       mono(mono_main+0x335) [0x413e55]
Warn: doc extractor [file:///home/linux/Desktop/htb.doc]:       /lib64/libc.so.6(__libc_start_main+0xf4) [0x338ce1e074]
Warn: doc extractor [file:///home/linux/Desktop/htb.doc]:       mono(realloc+0x341) [0x413579]
Warn: doc extractor [file:///home/linux/Desktop/htb.doc]: 
Warn: doc extractor [file:///home/linux/Desktop/htb.doc]: Debug info from gdb:
Warn: doc extractor [file:///home/linux/Desktop/htb.doc]: 
Warn: doc extractor [file:///home/linux/Desktop/htb.doc]: (no debugging symbols found)
Warn: doc extractor [file:///home/linux/Desktop/htb.doc]: Using host libthread_db library "/lib64/libthread_db.so.1".
Warn: doc extractor [file:///home/linux/Desktop/htb.doc]: (no debugging symbols found)
Warn: doc extractor [file:///home/linux/Desktop/htb.doc]: (no debugging symbols found)
Warn: doc extractor [file:///home/linux/Desktop/htb.doc]: (no debugging symbols found)
Warn: doc extractor [file:///home/linux/Desktop/htb.doc]: (no debugging symbols found)
Warn: doc extractor [file:///home/linux/Desktop/htb.doc]: (no debugging symbols found)
Warn: doc extractor [file:///home/linux/Desktop/htb.doc]: [Thread debugging using libthread_db enabled]
Warn: doc extractor [file:///home/linux/Desktop/htb.doc]: [New Thread 46912496316128 (LWP 9528)]
Warn: doc extractor [file:///home/linux/Desktop/htb.doc]: (no debugging symbols found)
Warn: doc extractor [file:///home/linux/Desktop/htb.doc]: (no debugging symbols found)
Warn: doc extractor [file:///home/linux/Desktop/htb.doc]: ../../gdb/utils.c:931: internal-error: virtual memory exhausted: can't allocate 33072 bytes.
Warn: doc extractor [file:///home/linux/Desktop/htb.doc]: A problem internal to GDB has been detected,
Warn: doc extractor [file:///home/linux/Desktop/htb.doc]: further debugging may prove unreliable.
Warn: doc extractor [file:///home/linux/Desktop/htb.doc]: Quit this debugging session? (y or n) [answered Y; input not from terminal]
Warn: doc extractor [file:///home/linux/Desktop/htb.doc]: ../../gdb/utils.c:931: internal-error: virtual memory exhausted: can't allocate 33072 bytes.
Warn: doc extractor [file:///home/linux/Desktop/htb.doc]: A problem internal to GDB has been detected,
Warn: doc extractor [file:///home/linux/Desktop/htb.doc]: further debugging may prove unreliable.
Warn: doc extractor [file:///home/linux/Desktop/htb.doc]: Create a core file of GDB? (y or n) [answered Y; input not from terminal]
Warn: doc extractor [file:///home/linux/Desktop/htb.doc]: 
Warn: doc extractor [file:///home/linux/Desktop/htb.doc]: 
Warn: doc extractor [file:///home/linux/Desktop/htb.doc]: =================================================================
Warn: doc extractor [file:///home/linux/Desktop/htb.doc]: Got a SIGABRT while executing native code. This usually indicates
Warn: doc extractor [file:///home/linux/Desktop/htb.doc]: a fatal error in the mono runtime or one of the native libraries 
Warn: doc extractor [file:///home/linux/Desktop/htb.doc]: used by your application.
Warn: doc extractor [file:///home/linux/Desktop/htb.doc]: =================================================================
Warn: doc extractor [file:///home/linux/Desktop/htb.doc]: 
(no content)
HotContent:
(no hot content)

Text extracted in ,05s


Other information:
It does not seem to depend on the file; it seems to happen every time.
When I set LANG to C first, it works fine; that's why the error message is still partly German. "Nicht genügend Hauptspeicher verfügbar" means "out of memory".
While debugging, I found that beagle limits the memory for the doc extractor to 100M. That does not seem to be enough for it to work at all (at least on x86_64; I have not tested on i386). Increasing the limit to 200M seems to solve the problem for me. I will attach the patch after submitting this bug.
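
For readers unfamiliar with per-process memory caps, here is a minimal Python sketch of the general mechanism. It is not Beagle's code: it assumes the cap is an address-space limit (RLIMIT_AS) applied to the child before exec, and `run_with_memory_limit` is a hypothetical helper named for this example only.

```python
import resource
import subprocess
import sys

MIB = 1024 * 1024


def run_with_memory_limit(argv, limit_bytes):
    """Run argv with a cap on its virtual address space (assumed: RLIMIT_AS)."""

    def apply_limit():
        # Runs in the child between fork and exec. From here on, any mmap or
        # malloc that would exceed the cap fails with ENOMEM.
        resource.setrlimit(resource.RLIMIT_AS, (limit_bytes, limit_bytes))

    return subprocess.run(argv, preexec_fn=apply_limit,
                          capture_output=True, text=True)


if __name__ == "__main__":
    # Hypothetical child that asks for 2 GiB, far above the 512 MiB cap.
    child = [sys.executable, "-c", "bytearray(2 * 1024 ** 3)"]
    result = run_with_memory_limit(child, 512 * MIB)
    print("exit status:", result.returncode)
```

Under such a cap the child fails as soon as a request no longer fits, whether that request comes from the program itself or from its runtime during startup, which would match the mmap error in the log above.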
Comment 1 drago01 2008-01-14 21:33:07 UTC
Created attachment 102858
increase memory limit to 200M
Comment 2 Debajyoti Bera 2008-01-15 17:15:40 UTC
Beagle imposes a limit of 100MB while extracting data from word doc and that's why you are seeing this crash. That much I understand; but what has LANG!=C got to do with it? Do you mean if you set LANG=C, then docextractor works?
Comment 3 drago01 2008-01-15 17:37:42 UTC
(In reply to comment #2)
> Beagle imposes a limit of 100MB while extracting data from word doc and that's
> why you are seeing this crash. That much I understand; but what has LANG!=C got
> to do with it? Do you mean if you set LANG=C, then docextractor works?
> 
Exactly. It seems to need more than 100MB of memory when LANG is not C.
When LANG is set to C it works fine.

Comment 4 Joe Shaw 2008-01-15 18:41:57 UTC
We should probably set LANG=C anyway for most child processes we spawn.

It seems pretty ridiculous that more than 100 megs of memory would be needed to extract text from a word document.  I thought 100 was a pretty high value when I set it in the first place.
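
As an illustration of forcing LANG=C for a spawned child only, here is a short Python sketch. It is not the patch attached to this bug; `spawn_with_c_locale` is a hypothetical helper, and dropping the LC_* variables is an assumption made here because they take precedence over LANG.

```python
import os
import subprocess
import sys


def spawn_with_c_locale(argv):
    """Run argv with LANG=C, leaving the parent's environment untouched."""
    env = dict(os.environ)
    env["LANG"] = "C"
    # LC_ALL and the other LC_* variables override LANG, so remove them;
    # otherwise setting LANG alone might have no effect.
    for name in list(env):
        if name.startswith("LC_"):
            del env[name]
    return subprocess.run(argv, env=env, capture_output=True, text=True)


if __name__ == "__main__":
    probe = [sys.executable, "-c", "import os; print(os.environ.get('LANG'))"]
    print(spawn_with_c_locale(probe).stdout.strip())  # prints "C"
```

Only the copy of the environment handed to the child is changed, so the user's own LANG setting stays as it was for everything else.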
Comment 5 drago01 2008-01-15 19:02:21 UTC
(In reply to comment #4)
> We should probably set LANG=C anyway for most child processes we spawn.

Yeah, that would work too, and I don't think that it will break anything.

> It seems pretty ridiculous that more than 100 megs of memory would be needed to
> extract text from a word document.  I thought 100 was a pretty high value when
> I set it in the first place.

I couldn't believe it at first, but after increasing the limit it indeed started to work. But I think the correct fix is to find out why we need so much memory at all.

Comment 6 Debajyoti Bera 2008-01-17 21:04:47 UTC
LANG=C won't break anything ? If someone has set LANG to be xxx, does it not mean that he expects all applications to start with that LANG env variable ?

For word docs, beagle runs the beagle-doc-extractor program. I wonder if there is anything we do wrong there ? Which LANG setting were you using - I will try to reproduce this here.
Comment 7 drago01 2008-01-17 21:23:09 UTC
(In reply to comment #6)
> LANG=C won't break anything ? If someone has set LANG to be xxx, does it not
> mean that he expects all applications to start with that LANG env variable ?

Sure, but this app runs in the background and the user should never see its output, so its language _should_ not matter.

> For word docs, beagle runs the beagle-doc-extractor program. I wonder if there
> is anything we do wrong there ? Which LANG setting were you using - I will try
> to reproduce this here.

de_DE.UTF-8

I also tried plain "de" and "fr"; both caused the crash.

(I am on x86_64; this might be a factor too)
Comment 8 Debajyoti Bera 2008-02-10 18:35:56 UTC
Created attachment 104859
Patch to use LANG=C for filters

Sorry, this one slipped my mind. I used your LANG settings, but could not reproduce with the single doc file I had. Maybe your doc files use different settings, or it could be the 64-bit factor.

Can you test with the attached patch ? It forces LANG=C always.
Comment 9 drago01 2008-02-10 20:19:05 UTC
(In reply to comment #8)
> Created an attachment (id=104859)
> Patch to use LANG=C for filters
> 
> Sorry, this one slipped my mind. I used your LANG settings, but could not
> reproduce with the single doc file I had. Maybe your doc files use
> different settings, or it could be the 64-bit factor.
> 
> Can you test with the attached patch ? It forces LANG=C always.
> 

It seems to break beagle-extract-content in a weird way:

Unable to filter file:///home/linux/test.doc: An exception was thrown by the type initializer for Beagle.Daemon.FilterFactory

Tried with other file formats too; same problem.

Comment 10 Debajyoti Bera 2008-02-10 20:46:30 UTC
Created attachment 104868
include glue-changes in the patch

Oops ... I forgot to add the changes to the glue code. Ok, try this one.
Comment 11 drago01 2008-02-10 21:09:47 UTC
(In reply to comment #10)
> Created an attachment (id=104868)
> include glue-changes in the patch
> 
> Oops ... I forgot to add the changes to the glue code. Ok, try this one.
> 

This one works; thx!
Comment 12 Debajyoti Bera 2008-02-10 21:17:32 UTC
Great. Checked in the patch (r4471).
Comment 13 Debajyoti Bera 2008-09-20 21:06:56 UTC
Ok. I am about to revert this change. I think using "C" locale is a rather bad idea when programs can output UTF-8. From what I could gather, C locale does not support multi-byte encodings.

Which means, we need to see why lang != "C" was causing the memory blowup. Since it does not happen on my machine, I need some help in narrowing this down. Hope no one minds.
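
The multi-byte concern can be illustrated with a small Python sketch. It assumes the C locale's character set is plain ASCII, which is not verified in this report; `encodable` is a hypothetical helper, and the sample string is the German message from the log in the description.

```python
def encodable(text, charset):
    """Return True if text can be represented in the given charset."""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False


message = "Nicht genügend Hauptspeicher verfügbar"
print(encodable(message, "ascii"))  # False: the umlauts have no ASCII form
print(encodable(message, "utf-8"))  # True
```

If a child running under such a locale converts its output to the locale's charset, text like this could be lost or mangled, which is the risk raised above.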
Comment 14 drago01 2008-09-20 21:37:09 UTC
(In reply to comment #13)
> Ok. I am about to revert this change. I think using "C" locale is a rather bad
> idea when programs can output UTF-8. From what I could gather, C locale does
> not support multi-byte encodings.
> 
> Which means, we need to see why lang != "C" was causing the memory blowup.
> Since it does not happen on my machine, I need some help in narrowing this
> down. Hope no one minds.
> 

This might be the reason:

du -hs /usr/lib/locale/locale-archive
76M	/usr/lib/locale/locale-archive

This file gets mapped, which causes the high memory usage.
When LANG==C, the file does not get mapped and it works.
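
A back-of-the-envelope check of this explanation, as a Python sketch. It assumes that the 100M limit counts mapped address space, that the whole 76M archive is mapped, and that "M" means MiB in both cases; none of these is confirmed in this report. `remaining_headroom` is a hypothetical helper.

```python
MIB = 1024 * 1024


def remaining_headroom(limit_bytes, mapped_bytes):
    """Address space left for everything else once a file is mapped."""
    return limit_bytes - mapped_bytes


archive = 76 * MIB  # size of locale-archive reported by du above

print(remaining_headroom(100 * MIB, archive) // MIB)  # 24
print(remaining_headroom(200 * MIB, archive) // MIB)  # 124
```

Under those assumptions, only about 24M would remain for the Mono runtime and the extractor with the original limit, against about 124M with the 200M limit from the first patch.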
Comment 15 André Klapper 2018-07-03 09:55:35 UTC
Beagle is not under active development anymore and had its last code changes in early 2011. Its codebase has been archived (see bug 796735):
https://gitlab.gnome.org/Archive/beagle/commits/master

"tracker" is an available alternative.

Closing this report as WONTFIX as part of Bugzilla Housekeeping to reflect
reality. Please feel free to reopen this ticket (or rather transfer the project
to GNOME Gitlab, as GNOME Bugzilla is deprecated) if anyone takes the
responsibility for active development again.