User talk:The Anome/Gminas for geocoding


Gminas problem


Hi, maybe we should continue this discussion here, to avoid overloading the WP:Poland page. Can you let me know what format you have the GNS data in? Do the place names include the Polish diacritics? Are they tagged with at least the province name? It seems we will have to find an effective way of mapping between your data (i.e. names + coords + not much(?) location info) and the data we have on WP now (i.e. names + full location info + only some coords). --Kotniski (talk) 18:13, 18 March 2009 (UTC)

The data files are at ; a description of the format is at It's UTF-8 encoded, so there's no problem with accents.

Here's an example record, with empty fields omitted:

2 -494241 -705199 50.316667 17.366667 501900 172200 33UXR6848776519 NM33-06 P PPL PL 48 D BODZANOW Bodzanów Bodzanow 1993-12-28

where the key fields are:

 50.316667 17.366667 is the WGS84 location lat/long in signed decimal degrees
 501900 172200 is the same thing in signed degrees/minutes/seconds
 P is "populated place type feature"
 PPL is a generic "populated place"
 PL = FIPS 10-4 country code for Poland
 48 = in FIPS subregion 48 within Poland
 D = Not verified or daggered name
 BODZANOW = canonical sort string
 Bodzanów = full name including diacritics
 Bodzanow = full name without diacritics

Most of the other fields are unique identifiers of various sorts.
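As a rough illustration of the fields listed above, the example record can be pulled apart as sketched below. This is only a sketch: the positions are counted in the example as printed (with empty fields omitted), so they will not line up with the column order of the real files, and the names `GnsPlace`, `parse_example_record` and `dms_to_decimal` are made up for this illustration. It also assumes the degrees/minutes/seconds value is packed as DDMMSS, which is consistent with 501900 and 50.316667 in the example.

```python
from typing import NamedTuple


class GnsPlace(NamedTuple):
    """Key fields of the example record (field names are illustrative only)."""
    lat: float
    lon: float
    dms_lat: int
    dms_lon: int
    feature_class: str
    feature_code: str
    country: str
    subregion: str
    name_type: str
    sort_name: str
    full_name: str
    ascii_name: str


def parse_example_record(line: str) -> GnsPlace:
    # Whitespace split only works for this example, where empty fields
    # are omitted and the name contains no spaces.
    f = line.split()
    return GnsPlace(
        lat=float(f[3]), lon=float(f[4]),
        dms_lat=int(f[5]), dms_lon=int(f[6]),
        feature_class=f[9], feature_code=f[10],
        country=f[11], subregion=f[12], name_type=f[13],
        sort_name=f[14], full_name=f[15], ascii_name=f[16],
    )


def dms_to_decimal(packed: int) -> float:
    """Convert a signed, packed DDMMSS value to signed decimal degrees."""
    sign = -1 if packed < 0 else 1
    degrees, rest = divmod(abs(packed), 10000)
    minutes, seconds = divmod(rest, 100)
    return sign * (degrees + minutes / 60 + seconds / 3600)


EXAMPLE = ("2 -494241 -705199 50.316667 17.366667 501900 172200 "
           "33UXR6848776519 NM33-06 P PPL PL 48 D BODZANOW Bodzanów "
           "Bodzanow 1993-12-28")

place = parse_example_record(EXAMPLE)
# The two coordinate encodings should agree to within rounding.
lat_ok = abs(dms_to_decimal(place.dms_lat) - place.lat) < 1e-5
lon_ok = abs(dms_to_decimal(place.dms_lon) - place.lon) < 1e-5
print(place.full_name, place.subregion, lat_ok, lon_ok)
```

Comparing the two coordinate encodings against each other is a cheap consistency check on a record before it is used for matching.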

The data is far from perfect or fully comprehensive, and frequently has one-to-many and many-to-one problems, so I generally try to sanity check it against other data, such as heuristics based on article naming, content, template fields and category tree data. The combination of all these checks generally produces 99%+ reliable matches. Unfortunately, the high level of name reuse in Poland was defeating my checks, and letting far too many bad matches through.

GNS is generally rather bad at determining the difference between a populated place and the administrative district of the same name, frequently coding them as a single entry. It's also often quite bad at coding subregions: subnational region fields are often missing (coded as "00") or obsolete region data can be present. There are all sorts of other small glitches: for example GNS has "Góra Święty Małgorzaty" where Wikipedia has Góra Świętej Małgorzaty: is this a typo, or possibly a grammatical declension issue? If it's the latter, this would require a knowledge of the Polish language to perform the appropriate matchups. -- The Anome (talk) 21:01, 18 March 2009 (UTC)
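One way to surface near-misses like the "Święty"/"Świętej" case for human review, rather than trying to resolve them automatically, is sketched below. It is not the matching engine described above: `fold_name` and `near_miss` are hypothetical helpers, and the 0.9 threshold is an arbitrary assumption. Note that "ł" has no Unicode decomposition, so it needs an explicit mapping in addition to stripping combining marks.

```python
import difflib
import unicodedata

# "ł"/"Ł" do not decompose under NFKD, so map them explicitly.
_NO_DECOMPOSITION = str.maketrans({"ł": "l", "Ł": "L"})


def fold_name(name: str) -> str:
    """Strip diacritics and case so that names can be compared loosely."""
    decomposed = unicodedata.normalize("NFKD", name.translate(_NO_DECOMPOSITION))
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.upper()


def near_miss(gns_name: str, wp_title: str, threshold: float = 0.9) -> bool:
    """True if the names differ after folding but are similar enough
    to be worth showing to someone who knows Polish."""
    a, b = fold_name(gns_name), fold_name(wp_title)
    if a == b:
        return False  # exact match after folding, not a near-miss
    return difflib.SequenceMatcher(None, a, b).ratio() >= threshold


print(fold_name("Bodzanów"))
print(near_miss("Góra Święty Małgorzaty", "Góra Świętej Małgorzaty"))
```

A list of flagged pairs produced this way would still need checking by hand, since string similarity alone cannot tell a typo from a declension difference or from two genuinely different places.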

Oh, as regards the list I said I could generate, it won't be a problem, but could you give me a week or two to finish off sorting out the last few sets of villages? Then I'll have a relatively complete list of villages for each gmina, and will be able to produce data about them in pretty much whatever form is going to be useful. --Kotniski (talk) 18:30, 18 March 2009 (UTC)
That's fine by me. I should also be in a better position to run my existing matching engine when you have finished populating the placename articles, which should at least catch those places in Poland with one-of-a-kind names. Can you let me know via my talk page when you are ready? -- The Anome (talk) 21:01, 18 March 2009 (UTC)
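The "one-of-a-kind names" idea can be sketched as follows: accept a match only when a name occurs exactly once on the GNS side and exactly once on the Wikipedia side, and leave everything else for later disambiguation. This is a minimal sketch under that assumption, not the existing matching engine; `unique_name_matches` and the sample lists are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def _index_by_name(names: Iterable[str]) -> Dict[str, List[str]]:
    """Group names by a case-insensitive key."""
    by_key: Dict[str, List[str]] = defaultdict(list)
    for name in names:
        by_key[name.casefold()].append(name)
    return by_key


def unique_name_matches(gns_names: Iterable[str],
                        article_titles: Iterable[str]) -> Dict[str, str]:
    """Pair up only those names that are unique in both lists."""
    gns = _index_by_name(gns_names)
    wp = _index_by_name(article_titles)
    return {
        entries[0]: wp[key][0]
        for key, entries in gns.items()
        if len(entries) == 1 and len(wp.get(key, [])) == 1
    }


# Illustrative sample data only, not taken from either dataset.
gns_sample = ["Bodzanów", "Nowa Wieś", "Nowa Wieś"]
wp_sample = ["Bodzanów", "Nowa Wieś"]
print(unique_name_matches(gns_sample, wp_sample))
```

Names that appear more than once on either side are deliberately dropped here; those are the cases where extra location information, such as a list of villages per gmina, would be needed.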

Thanks for the details; I'll have a look later. When I finish populating the articles (probably in about two weeks) I'll let you know. --Kotniski (talk) 13:37, 19 March 2009 (UTC)