GNOME Bugzilla – Bug 527592
Rework a bit Locations.xml
Last modified: 2008-08-31 10:19:12 UTC
I have no idea how Locations.xml was generated in the first place, and maybe we should keep in mind that it should be easy to regenerate it from the same data... However, right now, Locations.xml is really too much about weather stations and not enough about real places in the world. For example, we don't have Tokyo, but we have a "Tokyo Heliport" and "Tokyo International Airport", etc. We should consolidate everything in cities and have good names instead of names of weather stations. It's a big task, though...
(In reply to comment #0)
> I have no idea how Locations.xml was generated in the first place, and maybe we
> should keep in mind that it should be easy to regenerate it from the same
> data...

The raw data seems to have come from http://weather.noaa.gov/data/nsd_cccc.txt, described here: http://weather.noaa.gov/tg/site.shtml. (If the links break in the future, that's NOAA's "Meteorological Station Location Information", "Keyed by Location Indicator".) A comment at the bottom implies it was generated in 2004.

The data file uses country names rather than ISO codes, so if we change the names of any countries in Locations.xml.in, we may want to include another tag/attribute with the original form of the name for easy syncing later.

The stations are only divided into sub-national units within the US, so we would need to find some other way to correctly assign the Mexican, Canadian, Belgian, German, British, Australian, Chinese, and Brazilian locations to the appropriate states/provinces/territories. (There are already a bunch of un-figured-out Canadian locations commented out at the bottom of the file.) We could use the data from geonames.org to do this, except that their free data is CC-BY and our free data is GPL, so we can't combine them. Yay freedom.
(In reply to comment #1)
> (In reply to comment #0)
> > I have no idea how Locations.xml was generated in the first place, and maybe we
> > should keep in mind that it should be easy to regenerate it from the same
> > data...
>
> the raw data seems to have come from http://weather.noaa.gov/data/nsd_cccc.txt,
> described here: http://weather.noaa.gov/tg/site.shtml.

Great! You're obviously way better than me at finding a document on the web ;-) I couldn't find it even after quite some time searching...

> (If the links break in the future, that's NOAA's "Meteorological Station
> Location Information" "Keyed by Location Indicator")
>
> A comment at the bottom implies it was generated in 2004.
>
> The data file uses country names rather than ISO codes, so if we change the
> names of any countries in Locations.xml.in, we may want to include another
> tag/attribute with the original form of the name for easy syncing later.
>
> The stations are only divided into sub-national units within the US, so we
> would need to find some other way to correctly assign the Mexican, Canadian,
> Belgian, German, British, Australian, Chinese, and Brazilian locations to the
> appropriate states/provinces/territories. (There are already a bunch of
> un-figured-out Canadian locations commented out at the bottom of the file.) We
> could use the data from geonames.org to do this, except that their free data is
> CC-BY and our free data is GPL, so we can't combine them. Yay freedom.

I thought that the 3.0 version of the CC licenses solved most issues, and the geonames.org data is under CC-BY 3.0. Hrm, I'll ask a few people what we can do here.
(In reply to comment #2)
> > The stations are only divided into sub-national units within the US, so we
> > would need to find some other way to correctly assign the Mexican, Canadian,
> > Belgian, German, British, Australian, Chinese, and Brazilian locations to the
> > appropriate states/provinces/territories. (There are already a bunch of
> > un-figured-out Canadian locations commented out at the bottom of the file.) We
> > could use the data from geonames.org to do this, except that their free data is
> > CC-BY and our free data is GPL, so we can't combine them. Yay freedom.
>
> I thought that the 3.0 version of the CC licenses solved most issues, and the
> geonames.org data is under CC-BY 3.0. Hrm, I'll ask a few people what we can do
> here.

The 3.0 licenses may solve some problems, but CC-BY is an "advertising clause"-type license and therefore non-GPL-compatible regardless. But it turns out this doesn't matter; we can just bypass geonames.org and go right to the public domain sources of most of their data: http://geonames.usgs.gov/domestic/download_data.htm has voluminous data on US locations, and http://earth-info.nga.mil/gns/html/namefiles.htm has even more voluminous data on non-US locations, including division into "first-order administrative divisions", aka <state>s. The USGS data also gives the county for each US city, which may allow us to more easily fix up timezones in Indiana and other timezone-spanning states.

Merging all of this together is going to be tough though... we'll probably need to write a program that takes the various NOAA/USGS/NGA database dumps and the existing Locations.xml.in file, loads them all into sqlite or something, and then regenerates an updated Locations.xml.in from that.

Re: bug 530178, it would also be nice if we could add entries for any major cities that don't have their own METAR stations as well, so that if people try to choose them as a location, we can DTRT. The NGA database has a field indicating the relative size/importance of each city, but unfortunately it seems to be blank for most countries. We may be able to get this information from elsewhere?

(We also don't have a good way of generating <zone> and <radar> tags for any newly-added locations.)
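For what it's worth, here is a minimal sketch of the "load everything into sqlite" idea, just to make the shape of the merge concrete. It is not the actual tool: the table layout, the column names, and the nearest-city rule for assigning a station to a <state> are all assumptions made for illustration, and the distance is a crude squared-degree comparison rather than a real geographic distance.

```python
import sqlite3


def build_db(stations, cities):
    """Load station and city rows into an in-memory SQLite database.

    stations: iterable of (code, name, country, lat, lon)
    cities:   iterable of (name, country, state, lat, lon)
    Both layouts are hypothetical, not the NOAA/USGS/NGA file formats.
    """
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE station "
        "(code TEXT PRIMARY KEY, name TEXT, country TEXT, lat REAL, lon REAL)"
    )
    db.execute(
        "CREATE TABLE city "
        "(name TEXT, country TEXT, state TEXT, lat REAL, lon REAL)"
    )
    db.executemany("INSERT INTO station VALUES (?, ?, ?, ?, ?)", stations)
    db.executemany("INSERT INTO city VALUES (?, ?, ?, ?, ?)", cities)
    return db


def assign_states(db):
    """Map each station code to the state of the nearest same-country city.

    Stations whose country has no city rows map to None.
    """
    result = {}
    rows = db.execute("SELECT code, country, lat, lon FROM station").fetchall()
    for code, country, lat, lon in rows:
        nearest = db.execute(
            "SELECT state FROM city WHERE country = ? "
            "ORDER BY (lat - ?) * (lat - ?) + (lon - ?) * (lon - ?) LIMIT 1",
            (country, lat, lat, lon, lon),
        ).fetchone()
        result[code] = nearest[0] if nearest else None
    return result
```

The point of going through a database rather than ad hoc dictionaries is that each source dump can be loaded once and then joined in several ways (by country, by coordinates, by name) while the output format is still being worked out.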
(In reply to comment #3)
> Re: bug 530178, it would also be nice if we could add entries for any major
> cities that don't have their own METAR stations as well, so that if people try
> to choose them as a location, we can DTRT. The NGA database has a field
> indicating the relative size/importance of each city, but unfortunately it
> seems to be blank for most countries. We may be able to get this information
> from elsewhere?

http://world-gazetteer.com/ has a downloadable world population database containing data from many different sources. The license on the database would not allow us to redistribute it, but we could use it as a filter when generating Locations.xml.in from the other databases.

It turns out to be hard to define "major city" though... Including every city with population > 100,000 would give us a database about the same size as we have now, but would eliminate 43 countries and 5 US states... (maybe in some cases those countries are being eliminated just because not enough population data is available). Presumably merging the two lists (weather stations and big cities) together would result in a list more useful/correct than the current one, but not actually twice as large.

> (We also don't have a good way of generating <zone> and <radar> tags for any
> newly-added locations.)

Bug 533787 and bug 533788 propose ways of getting rid of these tags, at least for US <zone>s. We'd still need a solution for the UK and Australia, although they're both much smaller (in terms of number of locations) than the US, so it's not as big a deal there.
(In reply to comment #4)
> http://world-gazetteer.com/ has a downloadable world population database
> containing data from many different sources.

On further investigation, this database turns out to be not very usable because of inconsistencies in the way they merged the data from those many different sources. (It is difficult to reliably determine if an entry corresponds to a city or to some larger or smaller division, which may have the same name as a city...) However, the NGA database has a special flag for cities which are the capitals of major administrative regions, and using that as our definition of "major city" would pick up both Manizales, Colombia (bug 530178) and Lodz, Poland (bug 534047), so that's probably sufficient.

Going back to comment #0:

> However, right now, Locations.xml is really too much about weather
> stations and not enough about real places in the world. For example,
> we don't have Tokyo, but we have a "Tokyo Heliport" and "Tokyo
> International Airport", etc. We should consolidate everything in
> cities and have good names instead of names of weather stations.

Yup. I think what we want to do is rename <location> to <station> (to make it clear that it represents a weather station specifically), and make <city> mandatory (ie, you can't have a <station> that isn't contained in a <city>). And then organize the UI around <city> nodes rather than <station> nodes.

To deal with major cities that don't have any weather stations of their own, we'd just allow weather stations to appear under multiple <city> nodes. (So a <station> node for SKPE would appear under the <city> nodes for both Pereira and Manizales. Or else there'd be a "symlink" station node or something.)

The <station> nodes would mostly be an invisible implementation detail; if there's only one station in a given city, it really doesn't matter that much exactly where it is / who it belongs to, and so there's not much reason to ever show that information. The only time you'd really care to see the station names is if there is more than one available for a city, in which case the user *might* possibly want to choose between them to get the closest one. (Though as per bug 527593, there should be a default; the city databases above have longitude and latitude info for each city (presumably the city center), so we can just pick the closest station to that for the default.)

So I think we should *not* mark weather station names for translation (or even include them in Locations.xml), except when there is more than one for a city.
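Picking the default station as "the one closest to the city center" could look roughly like the sketch below. It uses the standard haversine great-circle formula; the function names and the (name, lat, lon) tuple layout are made up for this example and are not part of libgweather.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (
        sin((lat2 - lat1) / 2) ** 2
        + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def default_station(city_lat, city_lon, stations):
    """Return the name of the station nearest to the city coordinates.

    stations: non-empty iterable of (name, lat, lon) tuples (hypothetical layout).
    """
    nearest = min(
        stations,
        key=lambda s: haversine_km(city_lat, city_lon, s[1], s[2]),
    )
    return nearest[0]
```

For example, with rough, illustrative coordinates for the Tokyo case from comment #0, `default_station(35.68, 139.77, [("Tokyo Heliport", 35.64, 139.84), ("Tokyo International Airport", 35.55, 139.78)])` would select whichever of the two is nearer to the given city center.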
Created attachment 111971 [details] [review]
work-in-progress Locations.xml.in updater

This is a work in progress, but basically this takes the existing Locations.xml.in, the weather stations source file, and two geographic names files (see the README.sources file in the patch), and creates a new Locations.xml.in in which every <location> appears inside a <city>, whose name is actually correct.

Usage: download and uncompress the source files, put them into sources/, then:

    ./build-locationdb.pl
    # wait 15 minutes or so for it to finish
    ./update-locations.py > Locations.xml.in.new
    # wait 15 minutes or so for it to finish

As I said, it's a work in progress. There are various bugs, notably:

1. It drops <radar>, <zone>, and <tz-hint> tags from <location>s, and all translator comments. (It puts in some translator comments of its own, but I'm not convinced that that feature will stay.)

2. It generates invalid XML (empty <state>s sometimes), though in some cases this points out other bugs. (E.g., there is no longer a city/location for Washington, DC.)

3. If you run ./update-locations.py twice on consecutive days, you'll get a slightly different set of stations the second time, because the output only includes stations that have reported in the last two days, and so stations that report irregularly might or might not get included on any given run. Of course, what we want is to include "stations that report 'regularly'", but since NOAA doesn't seem to keep old reports around, that would require checking several days in a row to get it right...

(There's a question of how regularly we want to require the station to report. Clearly it's better to get today's weather from a station 20 km away than it is to get last week's weather from a station 1 km away. But where do you draw the line? Or should we just include any station that ever reports, but provide some indication in the UI when you're getting a stale report?)

I'm going to poke at some other bugs now, but I'll be returning to this later.
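To make the "reported in the last two days" versus "reports regularly" distinction in bug 3 concrete, here is a small sketch. It does not reflect how update-locations.py actually does it; the function names, the input dictionaries, and the 80% threshold are all assumptions chosen for illustration.

```python
from datetime import datetime, timedelta


def reported_recently(last_report, now, max_age=timedelta(days=2)):
    """Stations whose most recent report is no older than max_age.

    last_report: {station code: datetime of latest report} (hypothetical input).
    This is the run-to-run unstable test: a station that reports every
    third day drops in and out depending on when the script is run.
    """
    return {code for code, ts in last_report.items() if now - ts <= max_age}


def reports_regularly(seen_days, total_days, min_fraction=0.8):
    """Stations seen on at least min_fraction of the sampled days.

    seen_days: {station code: set of day indices on which a report was seen},
    collected by polling on total_days separate days, since old reports
    are not available after the fact. min_fraction is an arbitrary cutoff.
    """
    return {
        code
        for code, days in seen_days.items()
        if len(days) / total_days >= min_fraction
    }
```

The second function only moves the "where do you draw the line" question into `min_fraction`, but it does make the station set stable from one run to the next once enough days have been sampled.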
I was going to attach a copy of the current output Locations.xml.in, but Bugzilla says it's too large for a non-patch, so you can find it at http://www.gnome.org/~danw/Locations.xml.in
*** Bug 161882 has been marked as a duplicate of this bug. ***
*** Bug 416909 has been marked as a duplicate of this bug. ***
Blue-sky idea: maybe we could ship smaller data files and make it possible to get more detailed data (more cities, eg) in a transparent way via a web API?
(In reply to comment #10)
> Blue-sky idea: maybe we could ship smaller data files and make it possible to
> get more detailed data (more cities, eg) in a transparent way via a web API?

Updating the data files, yes. But shipping only parts of the data is useless. It still has to come from somewhere, so whether it's "online" or on a CD, it still has to be distributed.
(In reply to comment #10)
> Blue-sky idea: maybe we could ship smaller data files and make it possible to
> get more detailed data (more cities, eg) in a transparent way via a web API?

There's definitely something to be said for having a larger-than-normal db available online. This could also include other search keys, like postal codes.

I'm not sure shipping a smaller-than-normal db in the package makes sense though; if you assume the user will have access to an online db whenever they need to pick a city, then there's no reason to ship *any* db at all. But if you assume the user might need to pick a city while not online, then you want to ship a reasonably-sized db. Although if you're doing a country-specific distro/spin, you might not care about including locations in other countries.
Some ideas:

To make Locations.xml smaller, you could split it into multiple files based on the user language, like aspell-en, aspell-el etc, to be automatically installed when the user installs language-pack-gnome-XX. So e.g. I'd only want to install the English & Greek translations of the city names, not the whole bunch.

You could also split it by country, like Locations-gr.xml, Locations-us.xml, and the user would have to select which countries he/she wants detailed (apart from the country selected upon OS installation/time zone settings, of course).

But really, no matter how many pieces Locations.xml is split into, it'll never be detailed enough to have all the locations everyone will need. E.g. I live in a city of > 100,000 people, and Locations.xml doesn't have it, even with a size of 17Mb. It'll never have my parents' village with < 100 people.

So, a web service is needed for searching location data and ideally also for storing new ones. I don't know where/how it should be hosted, but it would be nice if it was easy for users to input new locations. Not only town names, translations and coordinates, but e.g. also the URL/regex/whatever for libgweather to fetch the forecasts from. I'm sure that if you had a "Submit a new location" button in the interface, which would direct the users e.g. to open an account somewhere and submit the needed data, you'd multiply the size of Locations.xml a thousandfold in no time.

And of course e.g. libgweather would then just ask this server (just once, storing the answer in a config file) to provide all the needed data (= town code, weather server, ...).

Kind regards,
Alkis Georgopoulos
Created attachment 115375 [details] [review]
new Locations.xml.in-rebuilding patch

Updated patch. This is still not perfect, and in fact, the code is a mess, but the code itself isn't really that important; it's the output that we care about (and as before, that's at http://gnome.org/~danw/Locations.xml.in). We can continue to clean up/speed up/simplify/etc the code later, because it's very easy to test that changes to it are correct (just diff the newly-output Locations.xml.in against the old).

It's difficult to compare the entire Locations.xml.in against the old one, but it's easy enough to compare small pieces; just pick your favorite country/state/province, and compare the old set of cities/locations to the new one. In general, the greatest improvement is to be found in the countries with the fewest GNOME hackers/users, because those are the countries where we haven't already manually fixed the old Locations.xml.in.

The new set of city/location names creates 2000 new translatable strings (and removes 1500 old ones). There are likely to be additional changes on a smaller scale as people GNOME-Love-ify the new data file. Most of the new strings are the names of smaller/less-historic cities that aren't going to have translations into other languages anyway (although the non-Latin-alphabet translators will have a lot of transliterating to do).

At any rate, if we want to get this into 2.24, we should commit it soon.
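The "pick a country and compare the old set of cities to the new one" check could be scripted along these lines. This is only a sketch: it assumes a simplified layout in which <country> and <city> elements each have a direct child holding the name, and the tag names (including the `name_tag` default) are guesses that would need adjusting to match the real Locations.xml.in.

```python
import xml.etree.ElementTree as ET


def city_names(xml_text, country, name_tag="name"):
    """Return the set of city names found anywhere under the given country.

    Assumes <country> and <city> elements with a direct <name_tag> child;
    this layout is an assumption, not the actual file schema.
    """
    root = ET.fromstring(xml_text)
    for node in root.iter("country"):
        if node.findtext(name_tag) == country:
            return {city.findtext(name_tag) for city in node.iter("city")}
    return set()


def compare_country(old_xml, new_xml, country, name_tag="name"):
    """Return (added, removed) city names for one country, each sorted."""
    old = city_names(old_xml, country, name_tag)
    new = city_names(new_xml, country, name_tag)
    return sorted(new - old), sorted(old - new)
```

Reading both files into strings and calling `compare_country(old, new, "Japan")` would then list the names that appear only in the new file and those that appear only in the old one, which is much easier to review than a full-file diff.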
Committed!
*** Bug 171791 has been marked as a duplicate of this bug. ***
*** Bug 441818 has been marked as a duplicate of this bug. ***
*** Bug 528191 has been marked as a duplicate of this bug. ***
Does this mean when I want to ensure my favourite city is included in Locations.xml I should add it to major-cities.txt (which update-locations.py uses to identify additional important cities) and regenerate Locations.xml.in, or edit Locations.xml after it is generated? It certainly seems easier to do the former, anyway.