GNOME Bugzilla – Bug 509487
beagle-extract-content fails on doc files when LANG != C (+ patch)
Last modified: 2018-07-03 09:55:35 UTC
Steps to reproduce:
1. Build beagle with --enable-wv1
2. Make sure LANG is set to something other than C
3. Run beagle-extract-content somefile.doc
4. It crashes

Stack trace:
beagle-extract-content htb.doc
Filename: file:///home/linux/Desktop/htb.doc
Debug: Loaded 53 filters from /usr/lib64/beagle/Filters/Filters.dll
Filter: Beagle.Filters.FilterDOC (determined in ,25s)
MimeType: application/msword
Properties:
Timestamp = 2008-01-14 18:24:05 (Utc)
Content:
Warn: doc extractor [file:///home/linux/Desktop/htb.doc]:
Warn: doc extractor [file:///home/linux/Desktop/htb.doc]: (/usr/lib64/beagle/DocExtractor.exe:9528): GLib-WARNING **: getpwuid_r(): failed due to unknown user id (500)
Warn: doc extractor [file:///home/linux/Desktop/htb.doc]:
Warn: doc extractor [file:///home/linux/Desktop/htb.doc]: ** (/usr/lib64/beagle/DocExtractor.exe:9528): CRITICAL **: _wapi_shm_attach: mmap error: Nicht genügend Hauptspeicher verfügbar
Warn: doc extractor [file:///home/linux/Desktop/htb.doc]:
Warn: doc extractor [file:///home/linux/Desktop/htb.doc]: ** ERROR **: file shared.c: line 337 (shm_semaphores_init): assertion failed: (tmp_shared != NULL)
Warn: doc extractor [file:///home/linux/Desktop/htb.doc]: aborting...
Warn: doc extractor [file:///home/linux/Desktop/htb.doc]: Stacktrace:
Warn: doc extractor [file:///home/linux/Desktop/htb.doc]:
Warn: doc extractor [file:///home/linux/Desktop/htb.doc]:
Warn: doc extractor [file:///home/linux/Desktop/htb.doc]: Native stacktrace:
Warn: doc extractor [file:///home/linux/Desktop/htb.doc]:
Warn: doc extractor [file:///home/linux/Desktop/htb.doc]: mono [0x5203a9]
Warn: doc extractor [file:///home/linux/Desktop/htb.doc]: /lib64/libpthread.so.0 [0x338da0e540]
Warn: doc extractor [file:///home/linux/Desktop/htb.doc]: /lib64/libc.so.6(gsignal+0x35) [0x338ce30ec5]
Warn: doc extractor [file:///home/linux/Desktop/htb.doc]: /lib64/libc.so.6(abort+0x110) [0x338ce32970]
Warn: doc extractor [file:///home/linux/Desktop/htb.doc]: /lib64/libglib-2.0.so.0(g_logv+0x3b5) [0x3f55a374a5]
Warn: doc extractor [file:///home/linux/Desktop/htb.doc]: /lib64/libglib-2.0.so.0(g_log+0x83) [0x3f55a37543]
Warn: doc extractor [file:///home/linux/Desktop/htb.doc]: /lib64/libglib-2.0.so.0(g_assert_warning+0x76) [0x3f55a375c6]
Warn: doc extractor [file:///home/linux/Desktop/htb.doc]: mono [0x4ca79c]
Warn: doc extractor [file:///home/linux/Desktop/htb.doc]: mono [0x4d018c]
Warn: doc extractor [file:///home/linux/Desktop/htb.doc]: mono(mono_once+0x44) [0x4cd324]
Warn: doc extractor [file:///home/linux/Desktop/htb.doc]: mono [0x4d02c3]
Warn: doc extractor [file:///home/linux/Desktop/htb.doc]: mono [0x4ccd38]
Warn: doc extractor [file:///home/linux/Desktop/htb.doc]: mono [0x49e317]
Warn: doc extractor [file:///home/linux/Desktop/htb.doc]: mono(mono_runtime_init+0x1d) [0x4c2d2d]
Warn: doc extractor [file:///home/linux/Desktop/htb.doc]: mono [0x4ec4b7]
Warn: doc extractor [file:///home/linux/Desktop/htb.doc]: mono(mono_main+0x335) [0x413e55]
Warn: doc extractor [file:///home/linux/Desktop/htb.doc]: /lib64/libc.so.6(__libc_start_main+0xf4) [0x338ce1e074]
Warn: doc extractor [file:///home/linux/Desktop/htb.doc]: mono(realloc+0x341) [0x413579]
Warn: doc extractor [file:///home/linux/Desktop/htb.doc]:
Warn: doc extractor [file:///home/linux/Desktop/htb.doc]: Debug info from gdb:
Warn: doc extractor [file:///home/linux/Desktop/htb.doc]:
Warn: doc extractor [file:///home/linux/Desktop/htb.doc]: (no debugging symbols found)
Warn: doc extractor [file:///home/linux/Desktop/htb.doc]: Using host libthread_db library "/lib64/libthread_db.so.1".
Warn: doc extractor [file:///home/linux/Desktop/htb.doc]: (no debugging symbols found)
Warn: doc extractor [file:///home/linux/Desktop/htb.doc]: (no debugging symbols found)
Warn: doc extractor [file:///home/linux/Desktop/htb.doc]: (no debugging symbols found)
Warn: doc extractor [file:///home/linux/Desktop/htb.doc]: (no debugging symbols found)
Warn: doc extractor [file:///home/linux/Desktop/htb.doc]: (no debugging symbols found)
Warn: doc extractor [file:///home/linux/Desktop/htb.doc]: [Thread debugging using libthread_db enabled]
Warn: doc extractor [file:///home/linux/Desktop/htb.doc]: [New Thread 46912496316128 (LWP 9528)]
Warn: doc extractor [file:///home/linux/Desktop/htb.doc]: (no debugging symbols found)
Warn: doc extractor [file:///home/linux/Desktop/htb.doc]: (no debugging symbols found)
Warn: doc extractor [file:///home/linux/Desktop/htb.doc]: ../../gdb/utils.c:931: internal-error: virtual memory exhausted: can't allocate 33072 bytes.
Warn: doc extractor [file:///home/linux/Desktop/htb.doc]: A problem internal to GDB has been detected,
Warn: doc extractor [file:///home/linux/Desktop/htb.doc]: further debugging may prove unreliable.
Warn: doc extractor [file:///home/linux/Desktop/htb.doc]: Quit this debugging session? (y or n) [answered Y; input not from terminal]
Warn: doc extractor [file:///home/linux/Desktop/htb.doc]: ../../gdb/utils.c:931: internal-error: virtual memory exhausted: can't allocate 33072 bytes.
Warn: doc extractor [file:///home/linux/Desktop/htb.doc]: A problem internal to GDB has been detected,
Warn: doc extractor [file:///home/linux/Desktop/htb.doc]: further debugging may prove unreliable.
Warn: doc extractor [file:///home/linux/Desktop/htb.doc]: Create a core file of GDB? (y or n) [answered Y; input not from terminal]
Warn: doc extractor [file:///home/linux/Desktop/htb.doc]:
Warn: doc extractor [file:///home/linux/Desktop/htb.doc]:
Warn: doc extractor [file:///home/linux/Desktop/htb.doc]: =================================================================
Warn: doc extractor [file:///home/linux/Desktop/htb.doc]: Got a SIGABRT while executing native code. This usually indicates
Warn: doc extractor [file:///home/linux/Desktop/htb.doc]: a fatal error in the mono runtime or one of the native libraries
Warn: doc extractor [file:///home/linux/Desktop/htb.doc]: used by your application.
Warn: doc extractor [file:///home/linux/Desktop/htb.doc]: =================================================================
Warn: doc extractor [file:///home/linux/Desktop/htb.doc]:
(no content)
HotContent:
(no hot content)
Text extracted in ,05s

Other information:
It does not seem to depend on the file; it always happens. When I set LANG to C first, it works fine; that's why the error message is still partly in German. "Nicht genügend Hauptspeicher verfügbar" means "out of memory". While debugging it I found out that beagle limits the memory for the doc extractor to 100M. That does not seem to be enough for it to work at all (at least on x86_64; I have not tested on i386). Increasing the limit to 200M seems to solve the problem for me. I will attach the patch after submitting this bug.
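For readers unfamiliar with per-process memory limits, here is a minimal Python sketch of one common way to cap a helper process on Linux: setting RLIMIT_AS (the virtual address space limit) in the child before it execs. Whether Beagle applies its 100M limit through exactly this mechanism is an assumption; the function name run_with_address_space_cap is hypothetical and is not part of Beagle.

```python
import resource
import subprocess


def run_with_address_space_cap(argv, limit_bytes):
    """Run argv as a child whose virtual address space is capped.

    Hypothetical helper for illustration only (Unix-like systems).
    """

    def apply_limit():
        # Runs in the child, after fork() and before exec().
        _soft, hard = resource.getrlimit(resource.RLIMIT_AS)
        # Never try to raise the soft limit above an existing hard limit.
        if hard != resource.RLIM_INFINITY:
            new_soft = min(limit_bytes, hard)
        else:
            new_soft = limit_bytes
        resource.setrlimit(resource.RLIMIT_AS, (new_soft, hard))

    return subprocess.run(argv, preexec_fn=apply_limit,
                          capture_output=True, text=True)
```

With a cap like this, every mapping counts against the limit, including shared libraries and memory-mapped files, so a child can hit the limit during runtime startup, before it has processed any document data. An mmap failure during startup, as in the trace above, would be consistent with that.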
Created attachment 102858
increase memory limit to 200M
Beagle imposes a limit of 100MB while extracting data from Word docs, and that's why you are seeing this crash. That much I understand, but what has LANG != C got to do with it? Do you mean that if you set LANG=C, the doc extractor works?
(In reply to comment #2)
> Beagle imposes a limit of 100MB while extracting data from word doc and thats
> why you are seeing this crash. That much I understand; but what has LANG!=C got
> to do with it ? Do you mean if you set LANG=C, then docextractor works ?

Exactly. It seems to need more than 100MB of memory when LANG is not C. When LANG is set to C it works fine.
We should probably set LANG=C anyway for most child processes we spawn. It seems pretty ridiculous that more than 100 megs of memory would be needed to extract text from a word document. I thought 100 was a pretty high value when I set it in the first place.
(In reply to comment #4)
> We should probably set LANG=C anyway for most child processes we spawn.

Yeah, that would work too, and I don't think that it will break anything.

> It seems pretty ridiculous that more than 100 megs of memory would be needed to
> extract text from a word document. I thought 100 was a pretty high value when
> I set it in the first place.

I couldn't believe it at first, but after increasing the limit it indeed started to work. But I think the correct fix is to find out why we need so much memory at all.
LANG=C won't break anything ? If someone has set LANG to be xxx, does it not mean that he expects all applications to start with that LANG env variable ? For word docs, beagle runs the beagle-doc-extractor program. I wonder if there is anything we do wrong there ? Which LANG setting were you using - I will try to reproduce this here.
(In reply to comment #6)
> LANG=C won't break anything ? If someone has set LANG to be xxx, does it not
> mean that he expects all applications to start with that LANG env variable ?

Sure, but this app runs in the background and the user should never see its output, so its language _should_ not matter.

> For word docs, beagle runs the beagle-doc-extractor program. I wonder if there
> is anything we do wrong there ? Which LANG setting were you using - I will try
> to reproduce this here.

de_DE.UTF-8. I also tried plain "de" and "fr"; both caused the crash. (I am on x86_64; this might be a factor too.)
Created attachment 104859
Patch to use LANG=C for filters

Sorry, this one slipped my mind. I used your LANG settings, but could not reproduce with the single doc file I had. Maybe your doc files have different settings, or it could be the 64-bit factor.

Can you test with the attached patch? It always forces LANG=C.
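The idea of forcing LANG=C for spawned helpers can be sketched in a few lines of Python. This is not the attached patch (its contents are not shown in this report); the names child_env_with_c_locale and run_with_c_locale are hypothetical and only illustrate overriding one variable in a copy of the parent's environment.

```python
import os
import subprocess


def child_env_with_c_locale(base_env=None):
    """Return a copy of the environment with LANG forced to "C".

    Hypothetical helper. Note that LC_ALL and the individual LC_*
    variables take precedence over LANG; this sketch leaves them alone.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["LANG"] = "C"
    return env


def run_with_c_locale(argv):
    """Spawn argv with LANG=C without touching the parent's environment."""
    return subprocess.run(argv, env=child_env_with_c_locale(),
                          capture_output=True, text=True)
```

Because the override is applied to a copy, the parent process keeps the user's own locale, and only the background helper sees LANG=C.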
(In reply to comment #8)
> Created an attachment (id=104859)
> Patch to use LANG=C for filters
>
> Sorry, this one slipped my mind. I used your LANG settings, but could not
> reproduce with the single doc file I had. Maybe your doc files are with a
> different settings or it could be the 64-bit factor.
>
> Can you test with the attached patch ? It forces LANG=C always.

Seems like it breaks beagle-extract-content in a weird way:

Unable to filter file:///home/linux/test.doc: An exception was thrown by the type initializer for Beagle.Daemon.FilterFactory

Tried with other file formats too; same problem.
Created attachment 104868
include glue-changes in the patch

Oops ... I forgot to add the changes to the glue code. Ok, try this one.
(In reply to comment #10)
> Created an attachment (id=104868)
> include glue-changes in the patch
>
> Oops ... I forgot to add the changes to the glue code. Ok, try this one.

This one works; thx!
Great. Checked in the patch (r4471).
Ok. I am about to revert this change. I think using the "C" locale is a rather bad idea when programs can output UTF-8. From what I could gather, the C locale does not support multi-byte encodings.

Which means we need to see why LANG != "C" was causing the memory blowup. Since it does not happen on my machine, I need some help in narrowing this down. Hope no one minds.
(In reply to comment #13)
> Ok. I am about to revert this change. I think using "C" locale is a rather bad
> idea when programs can output UTF-8. From what I could gather, C locale does
> not support multi-byte encodings.
>
> Which means, we need so see why lang != "C" was causing the memory blowup.
> Since it does not happen on my machine, I need some help in narrowing this
> down. Hope no one minds.

This might be the reason:

du -hs /usr/lib/locale/locale-archive
76M /usr/lib/locale/locale-archive

This file gets mapped, which causes the high memory usage. When LANG is C, the file does not get mapped and it works.
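One way to check this on a given machine is to look at the process's memory map. The Python sketch below sums the size of all mappings in /proc/<pid>/maps-style text whose path ends with a given suffix. The function names are hypothetical, and the sample line in the usage note is made up to match the 76M figure above; it is not taken from a real process.

```python
def mapped_bytes(maps_text, path_suffix):
    """Sum the sizes of mappings whose pathname ends with path_suffix.

    Expects text in the Linux /proc/<pid>/maps format:
    start-end perms offset dev inode pathname
    """
    total = 0
    for line in maps_text.splitlines():
        fields = line.split(None, 5)
        if len(fields) < 6:
            continue  # anonymous mapping, no pathname
        if not fields[5].strip().endswith(path_suffix):
            continue
        start, end = (int(part, 16) for part in fields[0].split("-"))
        total += end - start
    return total


def mapped_bytes_for_pid(pid, path_suffix):
    """Read /proc/<pid>/maps (Linux only) and apply mapped_bytes."""
    with open(f"/proc/{pid}/maps") as maps_file:
        return mapped_bytes(maps_file.read(), path_suffix)
```

For example, a single hypothetical line such as "7f0000000000-7f0004c00000 r--p 00000000 08:01 1234 /usr/lib/locale/locale-archive" counts as 76 MiB, which on its own is roughly three quarters of a 100M limit before the runtime has mapped anything else.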
Beagle is not under active development anymore and had its last code changes in early 2011. Its codebase has been archived (see bug 796735): https://gitlab.gnome.org/Archive/beagle/commits/master "tracker" is an available alternative. Closing this report as WONTFIX as part of Bugzilla Housekeeping to reflect reality. Please feel free to reopen this ticket (or rather transfer the project to GNOME Gitlab, as GNOME Bugzilla is deprecated) if anyone takes the responsibility for active development again.