GNOME Bugzilla – Bug 125593
Counting words on the GTP translation status pages
Last modified: 2011-08-20 07:45:23 UTC
Most professional translators don't count "messages"; they measure the amount of work in words. Since some teams are able to get translation work funded and use professional translators, it would probably be useful for the status pages to display the number of words in msgids for each module, along with the total number of words. This should probably not be the primary way of displaying stats, but having the word count available when checking could be useful.
That's a hard feature. Will we count %s, %d, etc. as words? Should we count only .pot words, or translation words as well? And so on. I think we will not have this feature until Bruno finishes my requests about gettext, so that we can stop parsing the .po files directly.
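One possible answer to the %s / %d question, sketched in Python: strip printf-style placeholders before counting. The names `PLACEHOLDER`, `WORD` and `count_words` are made up for this illustration, and the placeholder pattern is a simplification rather than a full printf grammar. Whether placeholders should count at all is still the open question from the comment above; the `skip_placeholders` switch only makes both choices easy to compare.

```python
import re

# Assumption: printf-style placeholders such as %s, %d, %(name)s and %5.2f
# are not words. This is a simplified pattern, not a full printf grammar.
PLACEHOLDER = re.compile(r"%(?:\([^)]*\))?[-+#0]*\d*(?:\.\d+)?[a-zA-Z]")

# A word is a run of letters, optionally joined by apostrophes or hyphens
# (so "don't" and "well-known" each count once).
WORD = re.compile(r"[^\W\d_]+(?:['-][^\W\d_]+)*")


def count_words(msgid, skip_placeholders=True):
    """Return the number of words in a message string."""
    if skip_placeholders:
        msgid = PLACEHOLDER.sub(" ", msgid)
    return len(WORD.findall(msgid))
```

With placeholders skipped, "Copied %d files to %s" counts as three words; with `skip_placeholders=False` the stray "d" and "s" are counted too, which inflates the total.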
Yes, perhaps this should wait until msgfmt can report the number of translated / untranslated words. Will you add a request for that?
Hmm, I think we should wait until the urgent requests are finished before we start asking for more features :-P
Nice idea, though not in the immediate future. Will blend nicely with the idea for proper "supportedness" measure and better PO file checking.
Here is some Python code that could be a start for implementing this.

    import re
    import string

    def countWords(line):
        wordcount = 0
        # split on runs of whitespace and punctuation
        separators = "[" + re.escape(string.whitespace + string.punctuation) + "]+"
        for word in re.split(separators, line):
            word = word.lower()
            # check to make sure the string is considered a word
            if re.match("^[" + string.ascii_lowercase + "]+$", word):
                wordcount += 1
        return wordcount
For future reference, added link to pocount (from translate-toolkit): http://translate.sourceforge.net/wiki/toolkit/pocount
Now that everything is in Python, it should be easier to use pocount, right? Something like:

    import os

    def po_words(po_file):
        # call_to_pocount is a placeholder for the actual pocount call
        if os.access(po_file, os.R_OK):
            return call_to_pocount(po_file)
        else:
            return 0

I've done something similar for my job (translations.openbravo.com), where I just used the command-line option:

    from commands import getstatusoutput  # Python 2 module

    command = 'pocount --csv %(file)s | tail -n1' % {'file': po_file}
    (status, output) = getstatusoutput(command)
    stats = output.split(',')[1:]  # discard the name
    # now in stats[0] to stats[8] we have the
    # {strings,words}_{translated,fuzzy,untranslated,total}

Hope it helps
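As a hedged variation on the command-line approach above, the CSV text can be turned into a dictionary keyed by column name instead of relying on positions in a list. This sketch assumes the first line of the output is a header row and the last line holds the figures of interest (what `tail -n1` selects); `parse_pocount_csv` is a hypothetical helper, and the real column names should be checked against the installed pocount version.

```python
import csv
import io


def parse_pocount_csv(output):
    """Map each column name of the header row to the value in the last row.

    Assumes `output` starts with a header line and that the last line
    carries the statistics we want. Values are returned as strings.
    """
    rows = list(csv.reader(io.StringIO(output.strip())))
    if len(rows) < 2:
        # No header plus data row: nothing to report.
        return {}
    header = [name.strip() for name in rows[0]]
    last = [value.strip() for value in rows[-1]]
    return dict(zip(header, last))
```

Looking values up by column name avoids guessing which index holds which figure if the column order differs between versions.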
Ok, I'm going for it. As far as I can see, the statistics are kept in three tables:
- pofile
- statistics
- statistics_archived
They are generated (at least) in the po_file_stats method in stats/utils.py. Am I correct? Since the translate toolkit is already used in stats/utils.py, I will use the pocount method (AFAIK a tempfile would need to be created, since pocount expects a file, not a string). By the way, docs/DataModel.odg seems a bit outdated; I will file a bug and a "patch" to update it.
Hi Gil, great to hear from you. StatisticsArchived and InformationArchived are not used at all currently. Just ignore them. The statistics fields of the Statistics table are obsolete (as indicated in the code), so don't touch them. Only pofile is of interest to you. My main worry currently is finding a good way to show these word counts on an already cluttered interface. I'm sure you will have good ideas :-) I don't think you will have to create any temp file. All files are already available somewhere on the file system. Good luck!
Ok, you are talking about stats/models.py, right? (I still have to figure out how everything works in a Django-based app.) As for how to display them, I'm not really concerned for now: first I want to have the statistics generated, and later we will figure out how to show them. Some random options:
- a toggle (or user preference) to display either words or strings
- use a hover on the string statistics to show the words
- remove strings and only use words?
- ask on the gnome-i18n ML?
I'm playing with the zenity module and I hope that by the end of this week I will have some more questions and doubts :)
Created attachment 194239
Migration script to add the word fields on pofile table

Here I go, first patch: adds the translated_words, fuzzy_words and untranslated_words fields to the pofile table.
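The attachment itself is not reproduced in this thread. Purely as an illustration of what such a migration does, here is a sketch that adds the three columns named above to a pofile table. It uses SQLite from the Python standard library as a stand-in for the real database, and the column type (integer, defaulting to 0) and the helper name `add_word_fields` are assumptions, not taken from the patch.

```python
import sqlite3

# Field names as given in the comment above.
WORD_FIELDS = ("translated_words", "fuzzy_words", "untranslated_words")


def add_word_fields(conn, table="pofile"):
    """Add the three word-count columns to `table` if they are missing."""
    # PRAGMA table_info yields one row per column; index 1 is the column name.
    existing = {row[1] for row in conn.execute("PRAGMA table_info(%s)" % table)}
    for field in WORD_FIELDS:
        if field not in existing:
            # Assumption: integer columns defaulting to 0 for existing rows.
            conn.execute(
                "ALTER TABLE %s ADD COLUMN %s INTEGER NOT NULL DEFAULT 0"
                % (table, field)
            )
    conn.commit()
```

Checking for existing columns first makes the script safe to run more than once, which is convenient for a hand-run migration.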
Created attachment 194240
Call pocount to get word statistics on po files

Second patch: adds the call to pocount to get the word statistics and returns them with the other statistics.
Created attachment 194241
Register words related fields and ensure they are us

This patch does two things:
- registers the fields on the PoFile class
- ensures that they are used while saving statistics
Comment on attachment 194240
Call pocount to get word statistics on po files

Might import those on the same line:
from translate.tools import pogrep, pocount
Comment on attachment 194241
Register words related fields and ensure they are us

Thanks for your work, Gil
(In reply to comment #14)
> (From update of attachment 194240)
> Might import those on the same line:
> from translate.tools import pogrep, pocount

Done! Thanks for reviewing, I already sent the commits. Now to think about how and when to show them :)
1 vote for the hover option proposed by Gil. I think it is more useful for a translator to know how many strings are fuzzy/untranslated, rather than how many words. I think that stats based on words, instead of strings, may be more confusing for translators; note that a string with five words in English can result in (for example) a string with 7 words in Spanish, or a string with 1 word in German, so the stats may not tell the truth at all (damned lies!!)