Geocoding


From text to coordinates

A major goal of the Toponimika project is to reconcile historical place names with modern locations using code instead of human labor. Reconciling historical place names with places identified by GeoNames allows for the creation of a historical gazetteer, in which GeoNames identifiers directly link historical documents to one another and to the present. It also allows for new visualizations of historical information, putting historical texts and the information they contain onto modern-day interactive maps. Finally, this linkage integrates historical information into the wider linked data ecosystem, allowing for broad and creative re-use of historical documents.

Our geocoding process combined place names and administrative divisions in a query to the GeoNames.org search API to reconcile the entries in our text with the entities present in GeoNames. For approximately 35% of our entries, the query returned exactly one result, indicating that the combination of name and administrative division was sufficient for automatic reconciliation. For the remaining entries, manual intervention was needed to determine which of the places returned by our query was the place described by the text.
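The single-result rule can be sketched in Python. This is a minimal illustration rather than the project's actual script: the function name `classify_response` and the status labels are hypothetical, and the input is assumed to be the parsed JSON from the GeoNames search endpoint, in which the `geonames` list holds the candidate places.

```python
def classify_response(entry_id, response):
    """Decide whether a text entry can be reconciled automatically.

    `response` is assumed to be the parsed JSON of a GeoNames search,
    i.e. a dict whose "geonames" key holds a list of candidate places.
    """
    hits = response.get("geonames", [])
    if len(hits) == 1:
        # Exactly one candidate: name plus division was unambiguous.
        return {
            "entry_id": entry_id,
            "status": "auto",
            "geoname_id": hits[0]["geonameId"],
        }
    # Zero or several candidates: leave the decision to a human.
    return {
        "entry_id": entry_id,
        "status": "manual",
        "candidates": [hit["geonameId"] for hit in hits],
    }
```

Entries classified as "manual" keep their list of candidate identifiers, so a reviewer can start from the places the query did return.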

Administrative Divisions

We used administrative divisions as a primary means of narrowing our search because top-level administrative divisions are explicit in ~95% of entries. Each division was represented as a geographic bounding box that limited the area searched. To create the bounding boxes, we either took the administrative seat of the unit as a center and searched in an 8° by 8° box centered on that city, or we used contemporaneous maps to draw more accurate bounding boxes by hand. Creating the bounding boxes was relatively labor-intensive but worthwhile: it allowed for the automatic geocoding of ~250 entries, and the resulting boundaries can be reused in other projects working with texts from the same time period.
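The default 8° by 8° box can be sketched as follows. The helper names `bounding_box` and `search_params` are hypothetical. The parameter names `name`, `north`, `south`, `east`, `west`, `maxRows`, and `username` are those accepted by the GeoNames search web service. Querying by `name` rather than the broader `q` parameter is an assumption, as is the row limit.

```python
def bounding_box(lat, lng, size_deg=8.0):
    """Return a box of size_deg x size_deg degrees centered on (lat, lng).

    Latitude is clamped to the valid range. Longitude is not wrapped,
    so a box crossing the 180th meridian would need extra handling.
    """
    half = size_deg / 2.0
    return {
        "north": min(lat + half, 90.0),
        "south": max(lat - half, -90.0),
        "east": lng + half,
        "west": lng - half,
    }


def search_params(name, box, username, max_rows=10):
    """Combine a place name and a bounding box into search parameters."""
    params = {"name": name, "maxRows": max_rows, "username": username}
    params.update(box)
    return params
```

A hand-drawn box taken from a contemporaneous map can be passed to `search_params` in the same dictionary form, so both kinds of boundary go through the same query code.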

Name Variation

Many of the places described by our source text appear under names that differ from the names used for those places today. This is due either to language change over time or to the fact that some cities and towns were part of the Russian Empire in 1788 but now belong to independent countries that speak different languages. To deal with this, we used GeoNames' "fuzzy search" option, which extends the search to locations whose names have a certain percentage of letters in common with the search term. In our script, the fuzziness value was gradually decreased, admitting names with less and less similarity, until a search returned results. This way, every search produced at least a starting point for identification, and at best an immediately usable result.
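The decreasing-fuzziness loop might look like the sketch below. The function name, the step size, and the lower bound are assumptions and are not taken from the project's script. The `search` argument is a hypothetical callable that takes a name and a fuzziness value between 0 and 1 (the range of the GeoNames `fuzzy` parameter, where 1 means an exact match) and returns a list of candidate places.

```python
def search_with_decreasing_fuzziness(name, search, start=1.0, stop=0.5, step=0.1):
    """Relax the match threshold until the search returns something.

    Returns a (fuzzy, hits) pair for the first non-empty result, or
    (None, []) if no value between start and stop finds a match.
    """
    steps = int(round((start - stop) / step))
    for i in range(steps + 1):
        # Round to avoid float drift such as 0.7000000000000001.
        fuzzy = round(start - i * step, 2)
        hits = search(name, fuzzy)
        if hits:
            return fuzzy, hits
    return None, []
```

Returning the fuzziness value alongside the hits records how loose the match was, which can help decide how much scrutiny a candidate needs during manual review.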

Manual Reconciliation

To manually establish links between our text entries and GeoNames entities, we had to research which modern location corresponds to the place described by each entry. To facilitate this process, responses from GeoNames were transformed into GeoJSON files and saved, then pushed to GitHub along with other changes from the geocoding process. Since GitHub automatically renders GeoJSON files as maps, this allowed us to view the results of our searches on a map and more easily identify correct matches. When a confident match was found, the IDs for the text entry and the GeoNames entity could be passed to a script that added this geographic information to our dataset. These connections were made on the basis of information from a contemporaneous atlas (both a version from Harvard and one from Etomesto.ru), Russian Wikipedia, the Great Soviet Encyclopedia, the Brockhaus and Efron Encyclopedia, and other sources detailed on our Resources Used page.
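The conversion of search results to GeoJSON can be sketched as follows. The function names are hypothetical, and the choice of properties to keep is an assumption. The input is assumed to be the `geonames` list from a GeoNames JSON response, in which each place carries `geonameId`, `name`, `lat`, and `lng` fields, with the coordinates given as strings. Note that GeoJSON orders coordinates as longitude, then latitude.

```python
import json


def to_feature_collection(hits):
    """Turn GeoNames search hits into a GeoJSON FeatureCollection."""
    features = []
    for hit in hits:
        features.append({
            "type": "Feature",
            "geometry": {
                "type": "Point",
                # GeoJSON uses [longitude, latitude] order.
                "coordinates": [float(hit["lng"]), float(hit["lat"])],
            },
            "properties": {
                "geonameId": hit["geonameId"],
                "name": hit["name"],
            },
        })
    return {"type": "FeatureCollection", "features": features}


def dump_geojson(hits):
    """Serialize the hits as GeoJSON text, ready to write to a file."""
    return json.dumps(to_feature_collection(hits), ensure_ascii=False, indent=2)
```

Keeping the GeoNames identifier in each feature's properties means that a point picked out on the rendered map can be traced back to the ID needed for the final link.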