Early Canadiana Online is one of the largest repositories of digitized Canadian periodicals. Its website boasts 3,500,000 pages of word-searchable content, which users can access for a subscription of $10 per month or $100 per year. I cannot deny that this collection places a lot of material at one's fingertips, but I have always found it difficult to sort through the website's search results. The results display the name of each document in which the search term can be found and identify the pages on which it appears, but assessing the relevance of a document requires the reader to open it, turn to the right page, and skim through to find the search term in its context. Unlike many databases that offer word-searchable versions of digitized documents, Early Canadiana Online does not highlight the search terms in the results, making it just a little more time-consuming to determine whether a document is relevant to my own research.
A search for documents relating to Montenegro or Montenegrin between the years 1914 and 1918, for instance, produces thirty-one results. It would not be too tedious to click on each of those documents and locate the keyword to find out how each publication referred to Montenegro, but a more common search term, like 'empire', or a broader date range could produce many more results than is practical to examine one by one.
My solution to this predicament was to write a program that presents the keyword in context. This is the end-goal of the original lessons of the Programming Historian, but those lessons are written for Python and it has been a few years since I have used Python. I am much more comfortable using bash scripts in the Linux command line after Bill Turkel's class on Digital Research Methods, which I was lucky to audit last fall. The idea of building a program to search Early Canadiana Online came to me after reading Ian Milligan's blog post 'Historians love JSON, or one quick example of why it rocks', which explained how to use JSON files to access full-text versions of content from the Canadiana collection. As the blog explains, each scanned document has a corresponding JSON file that contains the OCR'd text and all the metadata associated with the document. Combining the skills learned in Bill's class with this new information about JSON files, I started to code (slowly).
Using the Canadiana API
I was able to piece most of this together from Ian Milligan's blog post, which explained that every item in the Early Canadiana Online collection has a unique identifier that can be parsed out of a JSON file produced by a keyword search. The JSON file produced by the search for 'montenegr*' looks like this:
The JSON file lists some metadata about each search result, including the document's language, its author, place of publication, and – most importantly – the document's identifier. By scrolling through the above link, we can see that the identifier for the first search result is 'oocihm.8_06774_29'. To view the corresponding scanned document, we can add the identifier to the url 'eco.canadiana.ca/view/' to get http://eco.canadiana.ca/view/oocihm.8_06774_29. The url for the JSON file that contains the document's full text is http://eco.canadiana.ca/view/oocihm.8_06774_29/1?r=0&s=1&fmt=json&api_text=1. Substituting the identifier from a different document, such as oocihm.8_04240_71, produces similar results.
To download the full text of every document relating to Montenegro, then, I needed to collect all of the identifiers from all the pages of JSON search results, plug each one into the url eco.canadiana.ca/view/[identifier]/1?r=0&s=1&fmt=json&api_text=1, and download the result as a text file. Sounds simple enough…
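That pattern is easy to sketch in bash. A minimal example, assuming the identifier from the first search result (the wget call is commented out, since the point here is just how the urls are assembled):

```shell
#!/bin/bash
# Build the viewer url and the full-text JSON url for one identifier.
id="oocihm.8_06774_29"

view_url="http://eco.canadiana.ca/view/${id}"
json_url="http://eco.canadiana.ca/view/${id}/1?r=0&s=1&fmt=json&api_text=1"

echo "$view_url"
echo "$json_url"

# The actual download would be something like:
# wget -O "${id}.txt" "$json_url"
```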
Step 1: Getting All the Identifiers
The first twist is that the JSON file that lists the search results only shows one page of results. The above example was only the first of four pages. By scrolling to the very bottom of the JSON file, we find "pages" : 4, indicating that there are four pages of results.
The url for each subsequent page of results is 'eco.canadiana.ca/search/[page number]?df=1914&dt=1918&q=montenegr*&fmt=json'. Collecting all the relevant identifiers meant writing a script that parsed the number of pages of search results – four in this case – and then inserted each page number into the above url to download the JSON file for each page of results. Here's what I wrote:
The first line uses the program jq to parse out the number of pages from the JSON file. The second line tells the computer that $pages is the total number of pages of search results. The third line begins a 'for loop' that inserts each integer between 1 and $pages into the url, then parses out 'key', which is the identifier for each document, and prints those out into a series of files titled monteneg[page number].txt. As I write this, I realize I could have just used two '>' characters ('>>') and had the program keep appending subsequent identifiers onto the same file. Oh well, it still works. (A huge thank-you to Bill, who figured out how to get the program to recognize $pages in the 'for loop'.)
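Since the original script is not reproduced here, the following is a minimal sketch of the same logic. A small made-up sample stands in for the downloaded search-results JSON ('pages' and 'key' are the field names from the API, but the surrounding structure of the sample is my own invention), and it appends all identifiers to one file with '>>':

```shell
#!/bin/bash
# Sketch of Step 1. A made-up sample stands in for the first page of
# search results so the logic can be shown without hitting the network.
cat > page1.json <<'EOF'
{ "resultset" : [ { "key" : "oocihm.8_06774_29" },
                  { "key" : "oocihm.8_04240_71" } ],
  "pages" : 4 }
EOF

# Parse the total number of result pages out of the JSON with jq.
pages=$(jq '.pages' page1.json)

# The 'for loop': insert each page number into the search url.
# (In the real script, each of these urls would be downloaded with wget.)
for i in $(seq 1 "$pages"); do
    echo "http://eco.canadiana.ca/search/${i}?df=1914&dt=1918&q=montenegr*&fmt=json"
done > urls.txt

# Pull the "key" field (the identifier) out of each result, appending
# everything to one file with >> rather than writing separate files.
grep -o '"key" : "[^"]*"' page1.json | sed 's/"key" : "//;s/"$//' >> identifiers.txt
```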
After a bit of tidying up with 'sed' to remove curly brackets and other clutter, I was left with a single document that compiled the thirty-one unique identifiers, one for each document that came up in the search results:
Step 2: Downloading all the JSON files
This is actually pretty easy. Using Bill's lesson on Building a Simple Web Spider as a template, I wrote the next part of the program to take the identifiers from the document and insert them one by one into the url 'eco.canadiana.ca/view/[identifier]/1?r=0&s=1&fmt=json&api_text=1' to then download each JSON file as a .txt file.
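In outline, the spider loop looks something like this (a couple of identifiers stand in for the full list, and the actual wget download is commented out):

```shell
#!/bin/bash
# Sketch of Step 2: read the identifiers one by one and build the url
# for each document's full-text JSON file.
printf 'oocihm.8_06774_29\noocihm.8_04240_71\n' > identifiers.txt

while read -r id; do
    echo "http://eco.canadiana.ca/view/${id}/1?r=0&s=1&fmt=json&api_text=1"
    # The real spider would download each file here:
    # wget -O "${id}.txt" "http://eco.canadiana.ca/view/${id}/1?r=0&s=1&fmt=json&api_text=1"
done < identifiers.txt > fulltext-urls.txt
```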
Step 3: Displaying the Keywords in Context
This is where I had the most trouble. Recalling another of Bill's lessons, on Pattern Matching and Permuted Term Indexing, I set out to write a program that would use the command 'ptx' to create a concordance for each JSON file, then use 'grep' to find all the entries in each concordance that matched my search term 'montenegr' and show the keyword in context. After a number of setbacks that can only be attributed to my inexperience, a few dozen Google searches for 'how do i _ in linux', and one totally unnecessary email to Bill (which he very kindly answered, as always), my program was yielding the results I wanted.
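For anyone who has not met 'ptx': it builds a permuted index, giving every word of the input its own line with that word set off from its surrounding context. A tiny demonstration:

```shell
#!/bin/bash
# ptx prints one line per word of the input, with the word separated
# from the words around it -- a rough-and-ready concordance.
printf 'Montenegro declared war on Austria\n' > tiny.txt
ptx tiny.txt
```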
One bug that took me a while to spot was that I had typed a command incorrectly. One of the steps in turning a text document into a functional concordance is the removal of the spurious carriage return characters at the end of each line. The command to do this with 'tr' is "tr -d '\r'", but I had typed "tr -d '/r'", which deletes the characters '/' and 'r' individually and so removed all lower-case 'r's from the documents. The accidental removal of all the 'r's made it impossible for 'grep' to find matches for 'montenegr'. Another difficulty was the realization that 'grep' does not match characters with accents. Adding the argument '-i' tells grep not to distinguish between capital and lower-case letters, but this does not extend to ignoring the difference between 'e' and 'é'. Because many of Canadiana's documents are in French, grep was not matching 'montenegr' with Monténégro or Monténégrin. This was easily fixed by searching for 'mont(e|é)n(e|é)gr', which tells the computer to accept either character in each position.
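Both fixes are easy to demonstrate on a couple of made-up sample lines:

```shell
#!/bin/bash
# Fix 1: strip carriage returns with tr -- the pattern is '\r', not '/r',
# which would instead delete every '/' and every lower-case 'r'.
printf 'Montenegro and the war\r\n' | tr -d '\r' > clean.txt

# Fix 2: an alternation for each vowel that might carry an accent lets
# grep match the French spellings as well as the English one.
printf 'Montenegro\nMonténégro\nMonténégrin\nMontreal\n' \
    | grep -Ei 'mont(e|é)n(e|é)gr' > matches.txt
```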
The biggest challenge was that the program ‘ptx’ was having difficulty creating a complete concordance for each JSON file. My program was meant to print an identifier into a text file, then use ‘grep’ to find and append matches from the concordance below the corresponding identifier. In theory, that should show me each instance of the words Montenegro or Montenegrin, in context. The results fell a little short of that.
We can see here that there are no matches displayed for the identifiers oocihm.73604, oocihm.74543, oocihm.76024, and so on. A complete concordance should list each word of a document in the middle of a line, with three blank spaces to its left. The command I was using, 'egrep -i [[:alpha:]] "mont(e|é)n(e|é)gr"', should only show the lines of a concordance with the word 'montenegr' in the centre. Some additional grepping revealed that the files that produced no results in the final product did indeed contain the keyword, but some complication prevented ptx from producing a complete concordance for each file. A word that could be matched with 'mont(e|é)n(e|é)gr' was there, but because it did not appear in the centre of a line with three blank spaces next to it, the command did not pick it up.
An easy fix was to use the -C argument for grep. Rather than build a concordance, the argument -C 2 shows the full line on which the keyword appears and the two lines of text above and below it. Making use of this function meant running each JSON file through the command ‘fmt’ to cut down the lines of text into manageable chunks. After that, results were displayed for all of the identifiers.
Using ‘egrep -C 2’ rather than a concordance means that the keyword does not stand out as easily, but the keyword always appears in the middle line and five lines of text is not only easy to skim through, it provides more context. More importantly, this second method does not omit files because of faulty concordances. The first set of results did not display any matches for oocihm.73604 or oocihm.74543, but these can be found with the second method.
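A condensed sketch of this final approach, run on a made-up snippet of text in place of a real JSON file:

```shell
#!/bin/bash
# fmt reflows the text into short lines; grep -C 2 then prints each
# matching line together with two lines of context above and below it.
cat > sample.txt <<'EOF'
This made-up passage stands in for a page of OCR'd text. The papers followed the war closely, and the Monténégrin contingent mobilized in Canada turns up now and then among the other news from the front.
EOF

fmt -w 40 sample.txt | grep -Ei -C 2 'mont(e|é)n(e|é)gr' > excerpt.txt
```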
Firstly, I’m pretty happy I was able to build this program even if it was just the application of basic skills. It has been almost nine months since the last class of Digital Research Methods, and I am glad that I can still put those skills to use. Regular meetings with my department’s digital history workshop definitely helped keep those skills sharp.
Secondly, this is a pretty useful search engine. I have not found many documents in the archives relating to the Montenegrin contingent that Canada mobilized during the First World War, so I was hoping to find one or two mentions of it in Early Canadiana Online. The .txt file that my program produced gave me thirty-one five-line excerpts that I can look through relatively quickly. If any of them are of interest, the identifier is clearly displayed above each passage. I can use the identifier to call up the full text version of the JSON file, which my program has already downloaded, or complete the url eco.canadiana.ca/view/[identifier] to look at the scanned version online. It will be very easy to edit the program to look for different search terms the next time I need to find something.
And after skimming the excerpts from the thirty-one search results, I was able to find one match that mentioned the Montenegrin Contingent.
This passage is from page 39 of Le prix courant of November 1918: http://eco.canadiana.ca/view/oocihm.8_06619_154/24?r=0&s=1&fmt=json&api_text=1
The scanned image is behind a paywall, so I have to access it through my institution’s server. Mission accomplished.
Want to know more about programming?
Of course, I have to recommend Bill Turkel's lessons for teaching Linux bash scripts using a virtual machine. Learning to write scripts on the command line makes for a steep learning curve, but after overcoming that curve there was something about the simplicity and universality of Linux that appealed to me. The use of virtual machines is another method I appreciate, because it puts a barrier between my clumsy scripts and my computer's hard drive.
The Programming Historian offers a growing number of lessons and tutorials to teach programming skills that are applicable to historical research. Most of the programming modules are written for Python. Python is a very powerful programming language, and the lessons in the Programming Historian are certainly very useful, but I am not yet ready to become a multi-lingual programmer. I will stick to Linux for now.