Bug 123538 – Text import is slow with long lines




Summary:	Text import is slow with long lines


Status:	RESOLVED FIXED

Product:	Gnumeric
Classification:	Applications
Component:	import/export Text
Version:	1.2.x
Hardware:	Other All

Importance:	Normal normal
Target Milestone:	---
Assigned To:	Morten Welinder
QA Contact:	Jody Goldberg

URL:
Whiteboard:

Depends on:	80868 123646
Blocks:

Reported:	2003-09-30 06:21 UTC by zen
Modified:	2018-05-06 01:58 UTC

See Also:
GNOME target:	---
GNOME version:	---

Description zen 2003-09-30 06:21:29 UTC

Package: Gnumeric
Severity: normal
Version: 1.2.0-bonobo
Synopsis: very slow to split columns on large text file import
Bugzilla-Product: Gnumeric
Bugzilla-Component: Text Import/Export

Description:
Description of Problem:
Splitting columns using the "fixed field width" import option is really
slow: about 5 to 10 seconds per split.

Steps to reproduce the problem:
1. Import a 4.5MB text file using the "fixed width columns" format.
2. Split lines (lines up to 10,000 characters in length).
3. Each split (or widen/narrow) takes much too long.

Actual Results:
A split takes a long time, e.g. 10 seconds.

Expected Results:
column split should take no more than 2 seconds (_max_!)

How often does this happen?
Every time a large text file (4.5MB) with long lines (e.g. 10 thousand
characters wide) is imported.




------- Bug moved to this database by unknown@bugzilla.gnome.org 2003-09-30 02:21 -------

The original reporter (zen@iptaustralia.net) of this bug does not have an account here.
Reassigning to the exporter, unknown@bugzilla.gnome.org.
Reassigning to the default owner of the component, jody@gnome.org.

Comment 1 Morten Welinder 2003-10-01 17:00:17 UTC

This appears to be a pango problem -- I see most of the time spent
trying to find places to break long lines.

For the record, my test case has 102 lines of 10000 x's.

I'll upgrade to current cvs pango and see what happens.

Comment 2 Morten Welinder 2003-10-01 17:34:16 UTC

Problem persists with current cvs, see bug 123646.

Comment 3 Morten Welinder 2003-10-02 13:34:09 UTC

We'll have to blame the treeview for some of this too, see bug 80868.

Comment 4 zen 2003-10-02 20:44:30 UTC

1)
It should be possible to do this quickly, because the font used is (or
should be!) a fixed-width font. There should not be multiple
calculations really; it's a simple multiplication of character width
times the number of columns. That should be basically instant.

2)
I tried using MS Excel 2000. It is very quick. And you simply
left-click to set column break points (it's instant), and
left-double-click to unset a column break point.

It looks like a custom widget (certainly not a normal
table/spreadsheet type view).

However MS Excel has the tiniest preview window within which to make
these column selections - which absolutely sucks when you need to
scroll a lot, in a large sparse(ish) file, to determine where to put
those column breaks. As is so common with MS dialog boxes, when you
should be able to change its size, you can't.

zen@iptaustralia.net
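
As a rough illustration of point 1 in the comment above, here is a minimal Python sketch of the fixed-width arithmetic. It assumes every character occupies exactly one cell of `char_width_px` pixels, which is precisely the assumption debated later in this bug. The function names and the pixel widths are hypothetical and are not taken from Gnumeric.

```python
from __future__ import annotations


def column_to_x(column: int, char_width_px: int) -> int:
    """Pixel x offset of the left edge of a character column."""
    return column * char_width_px


def x_to_column(x_px: int, char_width_px: int) -> int:
    """Character column under a pixel x offset (e.g. a mouse click)."""
    return x_px // char_width_px


def split_fixed_width(line: str, breaks: list[int]) -> list[str]:
    """Split one line at the given character-column break points."""
    edges = [0] + sorted(breaks) + [len(line)]
    return [line[a:b] for a, b in zip(edges, edges[1:])]
```

With an 8-pixel cell, `column_to_x(10000, 8)` is a single multiplication, and `split_fixed_width("abcdefghij", [3, 7])` yields three fields. Nothing here measures text, so the cost per split depends only on the number of lines, not on glyph layout.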

Comment 5 Morten Welinder 2003-10-03 17:54:32 UTC

We agree that this should be a fast operation.  It's just not something
to be fixed in gnumeric, but in the support libraries.

Fixing bug 80868 should give us a factor of 10 or 20, assuming 10 lines
visible out of 100.  (An extra factor of two seems possible as something
does a measurement twice.)

When that is done, we can reevaluate where to attack the problem.
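
The estimate in the comment above can be written out as a small calculation. This is only a sketch of the stated reasoning: it assumes cost is proportional to the number of lines measured, and the function name is made up for illustration.

```python
def estimated_speedup(total_lines: int, visible_lines: int,
                      measurements_per_line: int = 1) -> float:
    """Speedup from measuring only visible lines, once each.

    Assumes the cost is proportional to (lines measured) x
    (measurements per line), as the comment's estimate implies.
    """
    return (total_lines / visible_lines) * measurements_per_line


# 10 of 100 lines visible: a factor of 10.
factor_visible_only = estimated_speedup(100, 10)

# Also removing a duplicated measurement: a factor of 20.
factor_with_dedup = estimated_speedup(100, 10, measurements_per_line=2)
```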

Comment 6 Owen Taylor 2003-10-03 18:49:25 UTC

If you have 10000 character strings in a single text renderer
there is *nothing* that GtkTreeView can do that is going
to make things perform decently.

Have you considered the possibility that Gnumeric should
keep 10000 character lines from the display? 

You'd need much more complexity at all levels to make horizontal 
scrolling fast, since every pixel scrolled sideways is going to
require 500000 or so characters to be passed through the Pango 
layout pipeline.

(Either that, or many megabytes of information are going to
have to be cached. A PangoLayout is about 17 bytes/character.)
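
To put a number on "many megabytes", here is a back-of-the-envelope sketch using the figures quoted in this comment (about 17 bytes per character, about 500000 characters per scroll step). The whole-file figure additionally assumes roughly one byte per character in the reporter's 4.5MB file, which is a guess, and the names are illustrative only.

```python
BYTES_PER_CHAR = 17  # approximate PangoLayout cost quoted in this comment


def layout_cache_bytes(n_chars: int,
                       bytes_per_char: int = BYTES_PER_CHAR) -> int:
    """Rough memory needed to keep layout data for n_chars characters."""
    return n_chars * bytes_per_char


# The ~500000 characters mentioned for one scroll step.
per_scroll_mb = layout_cache_bytes(500_000) / 1e6      # 8.5

# The whole 4.5MB file, assuming about one byte per character.
whole_file_mb = layout_cache_bytes(4_500_000) / 1e6    # 76.5
```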

Really, if you want 10,000 character wide displays to be fast,
you need to write a custom widget for displaying char grids
in a fixed width font. Just because the first 9,999 characters
are the same width, Pango can't know that the 10,000th isn't
some Arabic character that requires a different font with a 
different width.

Comment 7 Morten Welinder 2003-10-03 19:21:03 UTC

"less" performs pretty much the same function and it is darn fast.
(Control characters have different widths, so even that is covered.)

> Have you considered the possibility that Gnumeric should
> keep 10000 character lines from the display? 

Yes, indeed I have, but the answer comes out as "no".

1. Doing it right would require that Gnumeric know lots about
   unicode (zero-width characters, for example) and the font in use
   (how to make glyphs and how wide glyphs are).  How else can I know
   when to cut off?

2. Those strings aren't 10000 characters long just to offend you.
   Really.  They contain bona fide information that people want to see
   and get at.  Truncating would make Gnumeric a lesser program.

> You'd need much more complexity at all levels to make horizontal 
> scrolling fast, since every pixel scrolled sideways is going to
> require 500000 or so characters to be passed through the Pango 
> layout pipeline.

1. There is no need to even access invisible characters on lines not
   in view.  That cuts 90%.

2. If you are talking about rendering, I don't quite follow: once you
   are off the right edge of the window/rectangle/whatever, that's it.

Notice that, beyond treeview measuring 10 times too many strings,
the top CPU offender is pango_default_break, which is not needed at
all for a string that is set to non-bounded width -- there isn't going
to be any line breaking.
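
The "only touch what is in view" argument above can be sketched as a viewport slice. This is an illustration under a strong assumption, not Gnumeric or Pango code: it treats each Python string character as one fixed-width cell and ignores zero-width and wide characters, which is the hard part raised in comments 6 and 7. The function name and viewport size are hypothetical.

```python
def visible_window(lines, first_row, n_rows, first_col, n_cols):
    """Return only the characters that fall inside the viewport.

    Rows outside [first_row, first_row + n_rows) are never read, and
    each visible row is cut to [first_col, first_col + n_cols).
    """
    rows = lines[first_row:first_row + n_rows]
    return [row[first_col:first_col + n_cols] for row in rows]


# The test case from comment 1: 102 lines of 10000 x's.
lines = ["x" * 10_000 for _ in range(102)]

# A hypothetical 10-row by 80-column viewport scrolled to column 5000.
window = visible_window(lines, 0, 10, 5_000, 80)
touched = sum(len(row) for row in window)  # characters actually handled
```

Under that assumption, the work per redraw depends on the viewport size (800 characters here) rather than on the 1,020,000 characters in the file.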

Comment 8 Morten Welinder 2018-05-06 01:58:16 UTC

It looks like *something* has changed in the past 15 years for the better.
And I don't think it's just that computers have gotten faster.

Tentatively calling it fixed.