About


A short project summary

About the Project

The Historical Gazetteer of Russia project is a pilot project with two main goals:

  • To test the possibilities of extracting, structuring, and manipulating textual information from historical Russian-language texts with as little human intervention as possible.
  • To develop and refine workflows for this extraction and for the automatic geocoding of historical place names for use in an online historical gazetteer.

For both aims this means that we rely on Python scripts, not students, to extract information from our chosen text.

The Text

Our initial source text was Словарь учрежденных в России ярмарок и торгов, or Dictionary of Fairs and Trades Established in Russia. It contains an annotated list of roughly 800 market fairs held throughout Russia in the late 1700s, organized alphabetically by town name. Each entry describes which administrative unit the town is part of, when market fairs were held there, and what goods were traded at the fairs, all in varying levels of detail.

The text is a series of entries that can be parsed into discrete entities. It has a clear topical context, so there is less room for ambiguity in the usage of certain words and phrases. The text also contains relatively consistent descriptions of both the towns and the fairs that they hold, making it easier to extract specific pieces of information. It is also not the largest contemporaneous work of its kind: several other geographic encyclopedias and dictionaries printed around the same time have more content. We can use the tools developed for this project to tackle those texts, and continue to build more generalized processes as we work through them.

The Information

There were two basic datasets that we wanted to derive from the text. The first is a set of associations between the places described by the text and accurate longitude and latitude coordinates for each place. The second is a structured set of the descriptions of market activity from the text, from which we could discern what was possible about aggregate trading patterns.

Geographic Information

Reconciling the historical places mentioned in the text with their present-day locations and coordinates is a primary aim of the pilot project. As Peter Bol has observed, the absence of an adequate “world-historical gazetteer” is one of “the major cyberinfrastructural challenges” facing historians wishing to incorporate geospatial analysis into their work. Toponimika works with this global scope in mind, associating each entry in our historical text with an identifier from GeoNames, linking our work to one of the largest web-based toponymies in existence. It is our hope that as this project and others like it grow, an adequate world-historical gazetteer can begin to emerge.

To map the towns described in the text to their modern equivalents, we primarily used GeoNames.org as both the system by which towns are uniquely identified and the service through which we reconciled our data to the modern world. We did this by querying the GeoNames API with the names of our towns, restricted to the areas of their namiestnichestva (or “viceroyalties,” state-, province-, or oblast-level administrative divisions which existed in Russia from 1775 to 1796). Through this process, we were able to automatically geocode ~35% of the dataset; the remainder required human intervention to determine which of the possible matches was likely to be the one described by the text. Although the purely automated geocoding wasn’t initially quite as successful as we had hoped, starting from a list of possible candidates instead of from scratch for each entry also proved quite useful for manually reconciling additional towns.

The process is described in more detail on the Geocoding page.
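
As a rough illustration of the kind of lookup described above (not the project's actual code), the following Python sketch builds a GeoNames search request limited to a bounding box and sorts the returned candidates into automatic matches and cases that need review. The bounding box standing in for a namiestnichestvo, the function names, and the rule that a lone candidate is accepted automatically are all assumptions made for the example. The query parameters are ones offered by the GeoNames search web service.

```python
from urllib.parse import urlencode

# Public GeoNames search endpoint (JSON output).
GEONAMES_SEARCH_URL = "http://api.geonames.org/searchJSON"


def build_search_url(name, bbox, username, max_rows=10):
    """Return a GeoNames search URL for populated places named `name`.

    `bbox` is (west, south, east, north) in decimal degrees. Here it stands
    in for the approximate extent of a namiestnichestvo (an assumption).
    """
    west, south, east, north = bbox
    params = {
        "name_equals": name,   # exact name match
        "featureClass": "P",   # populated places only
        "west": west,
        "south": south,
        "east": east,
        "north": north,
        "maxRows": max_rows,
        "username": username,  # GeoNames requires a registered username
    }
    return GEONAMES_SEARCH_URL + "?" + urlencode(params)


def classify_candidates(candidates):
    """Sort a candidate list into 'auto', 'review', or 'unmatched'.

    `candidates` is the list found under the "geonames" key of a search
    response. Accepting a lone candidate automatically is a hypothetical
    rule chosen for this sketch.
    """
    if len(candidates) == 1:
        return "auto", candidates[0]["geonameId"]
    if candidates:
        return "review", [c["geonameId"] for c in candidates]
    return "unmatched", None
```

In use, the URL would be fetched with any HTTP client and the "geonames" list from the JSON response passed to `classify_candidates`. Entries marked "review" would then go to a person with the candidate list already attached, which is the starting point the paragraph above describes as useful for manual reconciliation.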

Trading Information

Another avenue for adding value to digital historical texts is using systematic approaches for extracting specific, structured information from the texts. This allows for quantitative analysis of the historical objects described by the text. In our case, we were able to extract administrative divisions, goods being sold at market fairs, and when these market fairs were held. Not only can these data be aggregated, they can also be cross-referenced, making it easier to ask and answer questions about the relationship of space, time, and trade during this time period. Again, linking the towns described by the text to modern equivalents opens up cross-text analyses that would not be possible otherwise.

To extract information about trading patterns from the text, we took the same general approach for text describing when market fairs were held and text describing what goods were exchanged at them. Using a combination of regular expressions and Python’s Natural Language Toolkit (NLTK), we examined the text for useful patterns. From those patterns, we derived lookup strategies to extract and standardize textual data. For example, trade goods are frequently described in the text in the instrumental case, and in sentences with verbs related to trade. To find trade goods, we found words in the instrumental case in sentences with those trade-related verbs, removed the ones that weren’t actually describing trade goods, and used the rest as a dictionary to look up those trade items in any form. This general approach of finding a common pattern, extracting meaning from it, and using the results to find all cases of that pattern proved quite useful in automating a relatively comprehensive text extraction process.

This process is described more fully on the Text Processing page.
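
The trade-goods example above can be sketched in a few lines of Python. This is a simplified illustration rather than the project's code: it uses only regular expressions, approximates the instrumental case by word endings instead of real morphological analysis, and assumes modernized spelling. The verb stems, the ending list, the sample sentences, and all function names are hypothetical.

```python
import re

# Hypothetical stems of trade-related verbs (e.g. "торгуют", "продают").
TRADE_VERB_STEMS = ("торг", "прода")

# Common instrumental-case endings, longest first. This is a crude
# stand-in for real morphological analysis.
INSTRUMENTAL_ENDINGS = ("ами", "ями", "ою", "ею", "ом", "ем", "ой", "ей", "ью")

WORD = re.compile(r"[^\W\d_]+")  # runs of letters, Cyrillic included


def candidate_goods(sentence):
    """Return instrumental-looking words from a sentence with a trade verb."""
    words = WORD.findall(sentence.lower())
    if not any(w.startswith(TRADE_VERB_STEMS) for w in words):
        return []
    return [
        w for w in words
        if w.endswith(INSTRUMENTAL_ENDINGS)
        and not w.startswith(TRADE_VERB_STEMS)
    ]


def strip_ending(word):
    """Remove the first matching instrumental ending to get a rough stem."""
    for ending in INSTRUMENTAL_ENDINGS:
        if word.endswith(ending) and len(word) > len(ending) + 1:
            return word[: -len(ending)]
    return word


def find_forms(text, stems):
    """Find every word in `text` that begins with one of the known stems."""
    return [w for w in WORD.findall(text.lower()) if w.startswith(tuple(stems))]


# Step 1: collect candidates from a sentence containing a trade verb.
candidates = candidate_goods("Торгуют хлебом, рыбою и шелковыми товарами.")

# Step 2: drop words a reviewer judged not to be specific goods
# (this exclusion list is made up for the example).
NOT_GOODS = {"товарами"}
stems = {strip_ending(w) for w in candidates if w not in NOT_GOODS}

# Step 3: use the stems as a dictionary to find the goods in other forms.
matches = find_forms("Привозят хлеб и рыбу.", stems)
```

Prefix matching on stems is deliberately loose and will produce false positives (a stem such as "рыб" also matches unrelated derived words), so a real pipeline would still need the kind of review step the paragraph above describes.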

Team

This project is an outgrowth of the Imperiia project, a larger project led by Kelly O’Neill, Associate Professor of History at Harvard University, to promote the study of Russia’s spatial history. It is managed by Hugh Truslow, Librarian for the Fung Library and Davis Center for Russian and Eurasian Studies Collection at Harvard’s Fung Library. Project code was written by Jeremy Guillette, Research Assistant at Fung Library. Invaluable advice and guidance were provided by Lex Berman, Project Manager of the China Historical GIS project, whose expertise in the creation of historical gazetteers cannot be overstated. One way we are making the project data available is through Lex’s Temporal Gazetteer API project, as well as by download from this site. We have also benefitted in ways large and small from our close proximity to Harvard’s Center for Geographic Analysis, where Lex works. Finally, thanks are due to the Davis Center for Russian and Eurasian Studies for their support of this project, especially Alexandra Vacroux, Executive Director, and Maria C. Altamore, Director of Administration and Finance.