Text Processing


Aggregating the qualitative

To supplement the geographic information that we derived from the text, we engaged in text processing to extract additional information. This enabled us to compare descriptions of market frequency and goods traded at markets on a large scale with relative ease. For other sources with their own topical focuses, we can use the same techniques to extract different information. Our approach consists of finding a common pattern of description in the text, extracting information using that pattern, and then using that information for a comprehensive document search.

The advantage of extracting this information with a degree of automation is not just in the time saved for researchers, but also in the re-usability of the resources we’ve generated as part of this process. Having a machine-readable list of holidays and the dates on which they occur is useful beyond the scope of this single project, for instance. Subsequent projects can use parts of this project to enhance their own workflows.

Trade Goods

The descriptions of yearly market fairs in our text often include descriptions of the goods that were exchanged at those fairs. These range from the general (e.g. “мелочными товарами” or “petty goods”) to the specific, such as bread, wine, or fabric. To extract them, we looked for words in the instrumental case in sentences that contained verbs related to trade. This is a common, but not universal, construction for describing trade goods in this document. We cleaned up the resulting list of words, sorted them into descriptions of general kinds of goods and descriptions of specific goods, and used that list as the basis of a second search for all instances of those goods, regardless of case. This general search was possible thanks to the Russian Snowball stemmer included in Python’s NLTK package, which removes word endings to arrive at a root form shared by the same word in its different cases. The results of this second, cleaner search are what we included in the dataset.
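The second, case-independent search can be sketched roughly as follows. This is a minimal illustration rather than the project's actual code: the goods list, the sample sentence, and the function name find_goods are hypothetical, and the small suffix-stripping fallback is a crude stand-in (not Snowball) that only exists so the sketch runs when NLTK is not installed.

```python
import re

try:
    # The stemmer named in the text; requires the third-party nltk package.
    from nltk.stem.snowball import SnowballStemmer
    stem = SnowballStemmer("russian").stem
except ImportError:
    # Crude stand-in (an assumption, NOT the Snowball algorithm):
    # strip one common noun ending so the sketch still runs without nltk.
    _ENDINGS = ("ами", "ями", "ом", "ем", "ой", "ей", "ах", "ях", "ов",
                "ев", "ам", "ям", "а", "я", "о", "е", "у", "ю", "ы", "и", "ь")

    def stem(word):
        for ending in _ENDINGS:
            if word.endswith(ending) and len(word) - len(ending) >= 2:
                return word[: -len(ending)]
        return word


def find_goods(text, goods):
    """Return (good, matched word form) pairs for each inflected form found."""
    # Map each good's stem back to the form used in the cleaned list.
    stems = {stem(g): g for g in goods}
    hits = []
    for match in re.finditer(r"[а-яё]+", text.lower()):
        good = stems.get(stem(match.group()))
        if good is not None:
            hits.append((good, match.group()))
    return hits


# Hypothetical cleaned list and sample sentence, not taken from the source.
goods = ["хлеб", "вино"]
sample = "Торгуют хлебом и вином; привоз хлеба бывает осенью."
print(find_goods(sample, goods))
```

Comparing stems rather than surface forms is what lets a single list entry catch the instrumental, genitive, and other case forms of the same noun.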

Market Fair Dates

Market fair dates were a bit more difficult to extract. Many of the dates were given as calendar dates, and only a few different forms were used, so regular expressions were sufficient to extract them. However, other fair dates were described in terms of local or religious holidays. To attach calendar dates to these, we first needed a collection of holiday names. Fortunately, many fair dates in the text are given in lists that follow a simple form (In the town there are 5 fairs a year, the first on… the second on… etc.). We used this pattern to generate a list of the holidays described in the text. For each holiday on that list, we found a date to assign to it and created regular expressions to match the various names of the same holiday. Some of the dates had to be determined in reference to Easter, whose date moves from year to year because it is tied to the lunar cycle. For these, we calculated the dates as they would have occurred in 1788, the year the text was published. This was done using a method for calculating the date of Easter developed by The Astronomical Society of South Australia. The regular expressions created to match the holidays were then used in a second, targeted search in the text to find instances of those holidays outside of the most common pattern.
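As a rough sketch of how holiday names might be tied to dates for 1788, the example below pairs a regular expression per holiday with either a fixed date or an offset from Easter. Several things here are assumptions and not the project's own code: the three holidays and their patterns are hypothetical examples, the dates are assumed to follow the Julian calendar in use in Russia at the time, and Easter is computed with the widely published Meeus Julian algorithm as a stand-in for the Astronomical Society of South Australia method mentioned above.

```python
import re
from datetime import date, timedelta


def julian_easter(year):
    """Easter as (month, day) in the Julian calendar (Meeus's Julian algorithm)."""
    a, b, c = year % 4, year % 7, year % 19
    d = (19 * c + 15) % 30
    e = (2 * a + 4 * b - d + 34) % 7
    month, day = divmod(d + e + 114, 31)
    return month, day + 1


# Hypothetical holiday table: name, pattern for spelling variants, and either
# a fixed (month, day) or a number of days after Easter.
HOLIDAYS = [
    ("Trinity", re.compile(r"троиц\w*", re.I), ("easter", 49)),
    ("Ascension", re.compile(r"вознесен[иь]\w*", re.I), ("easter", 39)),
    ("Peter and Paul", re.compile(r"петров\w*\s+де?н\w*", re.I), ("fixed", (6, 29))),
]


def holiday_dates(text, year=1788):
    """Return (holiday name, month, day) for every holiday mention in text."""
    month, day = julian_easter(year)
    # date() serves only as a month/day container for Julian dates; this is
    # safe here because the offsets after Easter never cross February.
    easter = date(year, month, day)
    found = []
    for name, pattern, (kind, value) in HOLIDAYS:
        for _ in pattern.finditer(text):
            if kind == "easter":
                when = easter + timedelta(days=value)
            else:
                when = date(year, *value)
            found.append((name, when.month, when.day))
    return found


# Hypothetical sample in the "first on… second on…" list form.
sample = "В городе две ярмарки: первая в Троицын день, вторая в Петров день."
print(julian_easter(1788))
print(holiday_dates(sample))
```

Keeping the patterns and date rules in a single table means the same entries can serve both the initial list-based pass and the later targeted search across the whole text.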