Bug 125593 – Counting words on the GTP translation status pages




Summary:	Counting words on the GTP translation status pages


Status:	RESOLVED FIXED

Product:	damned-lies
Classification:	Infrastructure
Component:	general
Version:	unspecified
Hardware:	Other All

Importance:	Low enhancement
Target Milestone:	---
Assigned To:	Gil Forcada
QA Contact:	damned-lies Maintainer(s)

URL:
Whiteboard:

Depends on:
Blocks:

Reported:	2003-10-27 12:59 UTC by Christian Rose
Modified:	2011-08-20 07:45 UTC

See Also:
GNOME target:	---
GNOME version:	---

Attachments
Migration script to add the word fields on pofile table (17.60 KB, patch) 2011-08-19 16:29 UTC, Gil Forcada	accepted-commit_now
Call pocount to get word statistics on po files (1.45 KB, patch) 2011-08-19 16:33 UTC, Gil Forcada	accepted-commit_now
Register words related fields and ensure they are us (3.98 KB, patch) 2011-08-19 16:37 UTC, Gil Forcada	accepted-commit_now

Description Christian Rose 2003-10-27 12:59:09 UTC

Most professional translators don't count "messages" but instead measure
the amount of work in words.

Since some teams have the ability to get some translation work funded and
use professional translators, having a feature on the status pages that
displayed the number of words in msgids for each module and the total
number of words would probably be useful.

This should probably not be the primary method of displaying stats, but
having the number of words available when checking could be useful.

Comment 1 Carlos Perelló Marín 2003-10-27 13:09:22 UTC

That's a hard feature.

Will we count %s, %d, etc.. as words?

Should we count .pot words or also translation words?

etc...

I think that we will not have this feature until Bruno finishes my
requests about gettext, so that we can stop parsing the .po files directly.
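
To make the first question concrete, here is a minimal sketch, assuming that printf-style placeholders should not count as words. The PLACEHOLDER pattern and the count_words helper are hypothetical names invented for this illustration, and the regular expression only approximates the format directives that gettext knows about.

```python
import re

# Rough pattern for printf-style placeholders such as %s, %d, %(name)s, %5.2f.
# Illustrative approximation only, not gettext's own definition.
PLACEHOLDER = re.compile(r"%(?:\([^)]*\))?[-+#0]*\d*(?:\.\d+)?[sdiufgxXc%]")


def count_words(msgid, skip_placeholders=True):
    """Count whitespace-separated tokens that contain at least one letter."""
    if skip_placeholders:
        # Replace each placeholder with a space so it cannot form a token.
        msgid = PLACEHOLDER.sub(" ", msgid)
    return sum(1 for token in msgid.split() if any(ch.isalpha() for ch in token))


with_skip = count_words("Copying %d files to %s")
without_skip = count_words("Copying %d files to %s", skip_placeholders=False)
print(with_skip, without_skip)
```

With the placeholders skipped the sample string counts as three words; with them included it counts as five, so the choice changes the totals noticeably for format-heavy modules.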

Comment 2 Christian Rose 2003-10-27 14:00:52 UTC

Yes, perhaps this should wait until msgfmt can report the number of
translated / untranslated words. Will you add a request for that?

Comment 3 Carlos Perelló Marín 2003-10-27 14:02:09 UTC

Hmm, I think we should wait until the urgent requests are finished, and
then we can start asking for more features :-P

Comment 4 Danilo Segan 2006-07-31 20:28:12 UTC

Nice idea, though not in the immediate future. It will blend nicely with the idea of a proper "supportedness" measure and better PO file checking.

Comment 5 Claude Paroz 2007-09-08 10:28:13 UTC

Here is some Python code that could be a start for implementing this.

import re
import string

def countWords(line):
   # count the tokens in line that consist only of ASCII letters
   wordcount = 0
   separators = "[" + re.escape(string.whitespace + string.punctuation) + "]+"
   for word in re.split(separators, line):
      word = word.lower()
      # check to make sure the string is considered a word
      if re.match("^[" + string.ascii_lowercase + "]+$", word):
          wordcount += 1
   return wordcount

Comment 6 Claude Paroz 2008-03-06 21:11:51 UTC

For future reference, added link to pocount (from translate-toolkit): http://translate.sourceforge.net/wiki/toolkit/pocount

Comment 7 Gil Forcada 2009-01-10 11:52:33 UTC

Now that everything is in Python, it should be easier to use pocount, right?

Like:

import os

def po_words(po_file):
  # call_to_pocount is a placeholder for whatever wraps pocount
  if os.access(po_file, os.R_OK):
    return call_to_pocount(po_file)
  else:
    return 0

I've done something similar for my job (translations.openbravo.com) and I just used the command line:

from subprocess import getstatusoutput

# po_file holds the path to the PO file to analyse
command = 'pocount --csv %(file)s | tail -n1' % {'file': po_file}
(status, output) = getstatusoutput(command)

stats = output.split(',')[1:] # discard the name

# now in stats[0] to stats[8] we have the {strings,words}_{translated,fuzzy,untranslated,total}


Hope it helps

Comment 8 Gil Forcada 2011-08-18 20:50:49 UTC

Ok, I'm going for it.

As far as I can see, the statistics are kept in three tables:

- pofile
- statistics
- statistics_archived

And they are generated (at least) in the po_file_stats method of stats/utils.py.

Am I correct?

As the translate toolkit is already used in stats/utils.py, I will use the pocount method (AFAIK a tempfile would need to be created, since pocount expects a file, not a string).

By the way, docs/DataModel.odg seems a bit outdated; I will file a bug and a "patch" to update it.

Comment 9 Claude Paroz 2011-08-19 07:30:33 UTC

Hi Gil, great to hear from you.

The StatisticsArchived and InformationArchived are not used at all currently. Just ignore them. The statistics fields of the Statistics table are obsolete (as indicated in the code), so don't touch them. Only pofile is of interest to you.

My main worry currently is to find a good way to show these word counts on an already cluttered interface. I'm sure you will have good ideas :-)

I don't think you will have to create any temp file. All files are already available somewhere on the file system.

Good luck!

Comment 10 Gil Forcada 2011-08-19 12:13:44 UTC

Ok, you are talking about stats/models.py, right? (I still have to figure out how everything works in a Django-based app.)

As for how to display them, I'm not really concerned for now. First I want to have the statistics generated, and later we will figure out how to show them. Some random options:

- a toggle (or user preference) to display either words or strings
- use a hover on strings statistics to show the words
- remove strings and only use words?
- ask on gnome-i18n ML?

I'm playing with zenity module and I hope that by the end of this week I will have some more questions and doubts :)

Comment 11 Gil Forcada 2011-08-19 16:29:35 UTC

Created attachment 194239
Migration script to add the word fields on pofile table

Here I go, first patch:

Adds the translated_words, fuzzy_words and untranslated_words fields to the pofile table.
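
For illustration only, the effect of such a migration can be sketched with the standard sqlite3 module. The three column names come from the comment above; the trimmed-down pofile table, the add_word_fields helper, the INTEGER type and the default of 0 are assumptions, and this is not the attached patch.

```python
import sqlite3

# Column names taken from the comment; type and default are assumptions.
WORD_FIELDS = ("translated_words", "fuzzy_words", "untranslated_words")


def add_word_fields(conn, table="pofile"):
    """Add the word-count columns that the table does not have yet."""
    existing = {row[1] for row in conn.execute("PRAGMA table_info(%s)" % table)}
    for field in WORD_FIELDS:
        if field not in existing:
            conn.execute(
                "ALTER TABLE %s ADD COLUMN %s INTEGER NOT NULL DEFAULT 0"
                % (table, field)
            )
    conn.commit()


conn = sqlite3.connect(":memory:")
# Hypothetical, trimmed-down stand-in for the real pofile table.
conn.execute(
    "CREATE TABLE pofile (id INTEGER PRIMARY KEY, translated INTEGER, "
    "fuzzy INTEGER, untranslated INTEGER)"
)
add_word_fields(conn)
columns = [row[1] for row in conn.execute("PRAGMA table_info(pofile)")]
print(columns)
```

Checking the existing columns first keeps the helper safe to run twice, which is handy while testing a migration by hand.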

Comment 12 Gil Forcada 2011-08-19 16:33:22 UTC

Created attachment 194240
Call pocount to get word statistics on po files

Second patch:

Adds the call to pocount to get the word statistics and returns them with the other statistics.
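
To show the kind of numbers involved, here is a rough standard-library stand-in for the word statistics. It is not the attached patch and it does not call translate-toolkit; po_word_stats and SAMPLE are hypothetical, and the parser assumes single-line msgid/msgstr entries separated by blank lines, without plurals or obsolete entries.

```python
def po_word_stats(po_text):
    """Return source-word counts per state for a very simple PO document."""
    stats = {"translated_words": 0, "fuzzy_words": 0, "untranslated_words": 0}
    for block in po_text.strip().split("\n\n"):
        fuzzy, msgid, msgstr = False, "", ""
        for line in block.splitlines():
            if line.startswith("#,") and "fuzzy" in line:
                fuzzy = True
            elif line.startswith("msgid "):
                msgid = line[len("msgid "):].strip().strip('"')
            elif line.startswith("msgstr "):
                msgstr = line[len("msgstr "):].strip().strip('"')
        if not msgid:
            continue  # the header entry has an empty msgid
        words = len(msgid.split())
        if fuzzy:
            stats["fuzzy_words"] += words
        elif msgstr:
            stats["translated_words"] += words
        else:
            stats["untranslated_words"] += words
    return stats


# Made-up sample data, only for this illustration.
SAMPLE = '''msgid ""
msgstr "Project-Id-Version: demo"

msgid "Open file"
msgstr "Abrir archivo"

#, fuzzy
msgid "Save the current document"
msgstr "Guardar documento"

msgid "Quit"
msgstr ""
'''

result = po_word_stats(SAMPLE)
print(result)
```

Words are counted on the msgid side only, so the totals stay comparable across languages.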

Comment 13 Gil Forcada 2011-08-19 16:37:29 UTC

Created attachment 194241
Register words related fields and ensure they are us

This patch does two things:

- registers the fields on the PoFile class
- ensures that they are used when saving statistics

Comment 14 Claude Paroz 2011-08-19 20:27:12 UTC

Comment on attachment 194240
Call pocount to get word statistics on po files

Might import those on the same line:
from translate.tools import pogrep, pocount

Comment 15 Claude Paroz 2011-08-19 20:28:13 UTC

Comment on attachment 194241
Register words related fields and ensure they are us

Thanks for your work Gil

Comment 16 Gil Forcada 2011-08-19 22:01:24 UTC

(In reply to comment #14)
> (From update of attachment 194240)
> Might import those on the same line:
> from translate.tools import pogrep, pocount

Done!

Thanks for reviewing, I already sent the commits.

Now to think about how and when to show them :)

Comment 17 Daniel Mustieles 2011-08-20 07:45:23 UTC

1 vote for the hover option proposed by Gil. I think it is more useful for a translator to know how many strings are fuzzy/untranslated, rather than how many words.

I think that stats based on words, instead of strings, may be more confusing for translators; note that a string with five words in English can result in (for example) a string with 7 words in Spanish, or a string with 1 word in German, so the stats may not tell the truth at all (damned lies!!)