GNOME Bugzilla – Bug 123538
Text import is slow with long lines
Last modified: 2018-05-06 01:58:16 UTC
Package: Gnumeric
Severity: normal
Version: 1.2.0-bonobo
Synopsis: very slow to split columns on large text file import
Bugzilla-Product: Gnumeric
Bugzilla-Component: Text Import/Export

Description:
Description of Problem:
splitting columns using "fixed field width" import option is really slow - like 5 to 10 seconds per split

Steps to reproduce the problem:
1. import 4.5MB text file as "fixed width columns" format
2. split lines (lines up to 10,000 characters in length)
3. each split (or widen/narrow) takes much too long

Actual Results:
split takes a long time, eg. 10 seconds

Expected Results:
column split should take no more than 2 seconds (_max_!)

How often does this happen?
When importing large text file - 4.5MB - with long line lengths - eg. 10 thousand characters width. Every time.

------- Bug moved to this database by unknown@bugzilla.gnome.org 2003-09-30 02:21 -------

The original reporter (zen@iptaustralia.net) of this bug does not have an account here.
Reassigning to the exporter, unknown@bugzilla.gnome.org.
Reassigning to the default owner of the component, jody@gnome.org.
This appears to be a pango problem -- I see most of the time spent trying to find places to break long lines. For the record, my test case has 102 lines of 10000 x's. I'll upgrade to current cvs pango and see what happens.
Problem persists with current cvs, see bug 123646.
We'll have to blame the treeview for some of this too, see bug 80868.
1) This should be quick, because the font used is (or should be!) a fixed-width font. There is no need for multiple calculations: it is a simple multiplication of character width by the number of columns. That should be basically instant.

2) I tried using MS Excel 2000. It is very quick. You simply left-click to set column break points (it's instant), and left-double-click to unset a column break point. It looks like a custom widget (certainly not a normal table/spreadsheet type view). However, MS Excel has the tiniest preview window within which to make these column selections, which absolutely sucks when you need to scroll a lot, in a large sparse(ish) file, to determine where to put those column breaks. As is so common with MS dialog boxes, when you should be able to change its size, you can't.

zen@iptaustralia.net
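For illustration, the arithmetic in point 1 might look like the sketch below. This is not Gnumeric code: `break_x` and `split_fixed` are hypothetical names, and the sketch assumes every character occupies exactly one fixed-width cell, which is the very assumption questioned later in this bug.

```python
def break_x(column, char_width_px):
    """Pixel x-offset of a break placed before character index `column`.

    Assumes a strictly fixed-width font: one multiplication, no layout pass.
    """
    return column * char_width_px


def split_fixed(line, breaks):
    """Split `line` into fields at the given character indices."""
    edges = [0] + sorted(breaks) + [len(line)]
    return [line[a:b] for a, b in zip(edges, edges[1:])]


if __name__ == "__main__":
    print(break_x(10, 8))
    print(split_fixed("abcdefghij", [3, 7]))
```

Under that assumption, adding or moving a break never requires re-measuring any text.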
We agree that this should be a fast operation. It's just not something to be fixed in gnumeric, but in the support libraries. Fixing bug 80868 should give us a factor of 10 or 20, assuming 10 lines visible out of 100. (An extra factor of two seems possible as something does a measurement twice.) When that is done, we can reevaluate where to attack the problem.
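The estimate above is plain arithmetic; a rough sketch follows. The function name and the `duplicate_passes` parameter are made up for illustration, and the model assumes measurement cost is proportional to the number of lines measured.

```python
def measurement_speedup(total_lines, visible_lines, duplicate_passes=1):
    """Estimated speedup from measuring only visible lines, once each.

    `duplicate_passes` is how many times each line is currently measured
    (2 if something really does measure everything twice).
    """
    return (total_lines / visible_lines) * duplicate_passes


if __name__ == "__main__":
    print(measurement_speedup(100, 10))     # visible lines only
    print(measurement_speedup(100, 10, 2))  # plus removing a duplicate pass
```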
If you have 10000 character strings in a single text renderer there is *nothing* that GtkTreeView can do that is going to make things perform decently. Have you considered the possibility that Gnumeric should keep 10000 character lines from the display? You'd need much more complexity at all levels to make horizontal scrolling fast, since every pixel scrolled sideways is going to require 500000 or so characters to be passed through the Pango layout pipeline. (Either that, or many megabytes of information are going to have to be cached. A PangoLayout is about 17 bytes/character.) Really, if you want 10,000 character wide displays to be fast, you need to write a custom widget for displaying char grids in a fixed width font. Just because the first 9,999 characters are the same width, Pango can't know that the 10,000th isn't some Arabic character that requires a different font with a different width.
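To put the caching remark in numbers, here is a back-of-the-envelope sketch. It combines the approximate 17 bytes/character figure above with the test case mentioned earlier in this bug (102 lines of 10000 characters); the function name is hypothetical, and the result is only as good as that per-character figure.

```python
BYTES_PER_CHAR = 17       # approximate PangoLayout cost quoted above
LINES = 102               # test case described earlier in this bug
CHARS_PER_LINE = 10_000


def layout_cache_bytes(lines, chars_per_line, bytes_per_char=BYTES_PER_CHAR):
    """Rough memory needed to keep a layout cached for every line."""
    return lines * chars_per_line * bytes_per_char


if __name__ == "__main__":
    total = layout_cache_bytes(LINES, CHARS_PER_LINE)
    print(f"{total / 1e6:.1f} MB")
```

For that test case the estimate comes to roughly 17 MB.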
"less" performs pretty much the same function and it is darn fast. (Control characters have different widths, so even that is covered.)

> Have you considered the possibility that Gnumeric should
> keep 10000 character lines from the display?

Yes, indeed I have, but the answer comes out as "no".

1. Doing it right would require that Gnumeric know lots about unicode (zero-width characters, for example) and the font in use (how to make glyphs and how wide glyphs are). How else can I know when to cut off?
2. Those strings aren't 10000 characters long just to offend you. Really. They contain bona fide information that people want to see and get at. Truncating would make Gnumeric a lesser program.

> You'd need much more complexity at all levels to make horizontal
> scrolling fast, since every pixel scrolled sideways is going to
> require 500000 or so characters to be passed through the Pango
> layout pipeline.

1. There is no need to even access invisible characters on lines not in view. That cuts 90%.
2. If you are talking about rendering, I don't quite follow: once you are off the right edge of the window/rectangle/whatever, that's it.

Notice that, beyond treeview measuring 10 times too many strings, the top cpu offender is pango_default_break, which is not needed at all for a string that is set to non-bounded width: there isn't going to be any line breaking.
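To show the kind of unicode knowledge a cut-off would need, here is a rough sketch using Python's standard `unicodedata` module. `cell_width` and `truncate_to_cells` are hypothetical helpers, and the width rules are a crude terminal-style approximation (combining and format characters take no cell, East Asian wide characters take two, control characters are assumed to take two cells in caret notation). It knows nothing about the actual font, which is exactly the difficulty described above.

```python
import unicodedata


def cell_width(ch):
    """Approximate number of fixed-width cells one character occupies."""
    cat = unicodedata.category(ch)
    if cat in ("Mn", "Me", "Cf"):   # combining marks, format characters
        return 0
    if cat == "Cc":                 # control characters, shown like ^A
        return 2
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2
    return 1


def truncate_to_cells(text, max_cells):
    """Longest prefix of `text` that fits in `max_cells` cells."""
    used = 0
    for i, ch in enumerate(text):
        w = cell_width(ch)
        if used + w > max_cells:
            return text[:i]
        used += w
    return text
```

Even this simplified version has to inspect every character up to the cut-off point, and it still cannot account for font fallback.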
It looks like *something* has changed in the past 15 years for the better. And I don't think it's just that computers have gotten faster. Tentatively calling it fixed.