This post combines one of my earlier posts on Google Books’ Ngram Viewer and my bash script that built a search engine for the Early Canadiana Online database. Google Books probably has the best-known Ngram Viewer, but Tim Sherratt has produced a similar app called QueryPic to search the Trove and Papers Past newspaper databases and graph the number of articles containing a keyword. The advantage of an app such as QueryPic is the ability to search a smaller corpus, which produces results that reflect a particular context. As I tinkered with the JSON files that make up the scaffolding of the Canadiana database, one of the projects that came to mind was to write a script that generates an Ngram for search terms in the database.
Writing a script that would chart word frequencies was relatively simple. I figured it would be easy to modify the script from my earlier search engine to display frequencies by year, so the crux of it would be finding a command-line tool to turn a CSV file into a line graph. After googling around, gnuplot seemed like the most straightforward option.
The first step was to modify my earlier script to show word frequencies by year, rather than keywords in context. When read in JSON, each Early Canadiana Online record includes a “published” field, which details the place, publisher, and year of publication of each item. We can see that an issue of “The Listening Post,” for example, was published in France by the British Expeditionary Force in 1917:
I kept the part of the script that used jq to parse out each record’s identifier and print it into a document, then used jq again to parse out the “published” field. Because all I wanted was the year of publication, I used the sed command to remove all the letters and punctuation. This output was printed into a text file, followed by a comma and a space.
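The year-stripping step can be sketched roughly as follows. This is a minimal illustration rather than the original script: the helper name extract_year and the sample “published” string are assumptions made up for the example, and the sed expression simply deletes every character that is not a digit.

```shell
# Hypothetical helper: keep only the digits of a "published" string.
extract_year() {
  printf '%s\n' "$1" | sed 's/[^0-9]//g'
}

# Assumed example value; the real field text may be formatted differently.
extract_year 'France : British Expeditionary Force, 1917'
```

One caveat with this approach: any other digits in the field (a battalion number, say) would be glued onto the year, so the output is worth spot-checking.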
The next part of the script was a modification of the original. Rather than using ‘grep’ to present a keyword in its context, I used ‘grep -c’ to count the frequency of a keyword in each record, then printed this output into the text file after the date of publication and added a line feed.
These two steps were done for each record that came up in a search, leaving me with a CSV file with identifiers, years, and frequencies:
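Here is a small sketch of the counting step, assuming each record’s text has already been saved to a local file. The helper name csv_line, the placeholder identifier, and the -i (case-insensitive) flag are my assumptions for illustration, not details from the original script.

```shell
# Hypothetical helper: print "identifier, year, count" for one record.
# $1 = identifier, $2 = year, $3 = keyword pattern, $4 = path to the record's text
csv_line() {
  # grep exits non-zero when nothing matches, so "|| true" keeps the script going.
  _count=$(grep -c -i -- "$3" "$4" || true)
  printf '%s, %s, %s\n' "$1" "$2" "$_count"
}

# Demo on a throwaway file standing in for a record's text.
sample=$(mktemp)
printf 'Japan and Japanese trade\nNo match here\nNews from japan\n' > "$sample"
csv_line 'demo.record' 1917 'japan' "$sample"
rm -f "$sample"
```

Note that grep -c counts matching lines rather than individual occurrences, so a line that mentions the keyword twice still adds only one to the total.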
Once I was able to produce CSV files, gnuplot did all the work graphing the frequencies by year. It took a bit of googling to find a tutorial suited to what I wanted to do, but it was pretty easy to figure out after that. Because I was working on a section on Japanese-Canadian enlistments during the First World War when I wrote this script, my first attempt was to track the frequency of the word “Japan*” between 1900 and 1920. The result was a bit surprising:
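For anyone who wants to reproduce the plotting step, something along these lines should work. It is a sketch under assumptions, not the script I actually ran: the helper name plot_script, the file names, and the image size are invented for the example, and it assumes the CSV columns are identifier, year, and count in that order. The “smooth frequency” option asks gnuplot to add up the counts of all records that share a year.

```shell
# Hypothetical helper: print a gnuplot script for a CSV of "identifier, year, count".
# $1 = CSV path, $2 = output PNG path, $3 = legend title
plot_script() {
  cat <<EOF
set datafile separator ","
set terminal png size 800,480
set output "$2"
set xlabel "Year"
set ylabel "Frequency"
set xrange [1900:1920]
plot "$1" using 2:3 smooth frequency with lines title "$3"
EOF
}

# Print the script; pipe it into gnuplot to actually draw the graph.
plot_script counts.csv ngram.png 'japan*'
```

Running `plot_script counts.csv ngram.png 'japan*' | gnuplot` would then write the graph to ngram.png, provided gnuplot is installed with PNG support.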
The frequency spikes in 1901, which seems like an odd year for so many publications to be discussing Japan. To see if my script might be off, I ran a search for “japan*” for 1900 and 1901 and, sure enough, there are 65 pages of results in 1900 but only 24 in 1901. In comparing the two sets of results, it seems that the database has a wider variety of publications in 1900 than 1901. Running the search again for 1899 returned 69 pages of results while 1898 returned 71 pages, which suggests that the database is much better stocked for publications before 1900 than after. I guess they call themselves EARLY Canadiana for a reason. Barring that, there’s a long spike in 1906 and another one in 1917. The 1906 spike might be explained by debates surrounding the passage of the Immigration Act that year, while the 1917 spike seems to have been fed by the appearance of wartime publications that mention Japan’s participation in the war. It’s strange that Vancouver’s race riots of 1907 did not seem to register a spike.
For a second attempt, I thought I would try a comparative search. A good way to see how frequency changes over time seemed to be searching for some of Canada’s Prime Ministers. I ran the script to search for ‘Laurier,’ ‘Borden,’ and ‘Mackenzie King’ and here is the result:
Because the whole point of the exercise was to see if it mattered what corpus an Ngram was drawn from, I ran the same query in Google Ngrams:
The two definitely gave different results. Laurier certainly comes up more than Borden before 1914, but this could be a result of using ‘Borden’ as a query in Canadiana and ‘Robert Borden’ in Google Ngrams. Because Google Ngrams searches all English texts, I thought I should search for first and last names to make sure the results were indeed mentions of the Canadian Prime Ministers, but I did not take the same precaution with the Canadian texts, because I figured that with a smaller corpus, looking for both first and last names would diminish the results. Robert Borden’s cousin Frederick is probably responsible for about a quarter of the hits graphed. It is interesting that in Canadiana, spikes in the frequency of ‘Borden’ and ‘Laurier’ mirror each other in 1903, 1905, and 1908. While there’s no apparent spike in 1911, when Borden and Laurier ran against each other in a federal election, the two seem on par with one another from 1915 to 1918. I imagine this wartime parity has something to do with debates over conscription, with the two championing opposite sides of the issue.
Most interesting is the near absence of Mackenzie King in the Canadiana ngram. King receives about 40 mentions in 1908 but is otherwise almost invisible, even after 1921 when he was elected as Prime Minister. After checking in Canadiana, the 1908 spike seems to be caused by Mackenzie King’s report on the 1907 riots in Vancouver. King’s absence later on, however, is a reflection of the corpus. Just as there are many more documents available for the years before 1900, the corpus starts to thin out some more after 1920. There are 1,393 results for Mackenzie King between 1901 and 1920, but only 29 between 1921 and 1930. The fact that all three Prime Ministers’ names taper off later on in the graph seems to confirm that fewer documents are available in later years.
My Ngrams may not have brought any huge revelations about the particular terms that I’ve searched, but I think they reveal a lot about the spread and quantity of sources available in the Early Canadiana database. Unlike Google Ngrams, which graphs the frequency of a query as a percentage of all the words in the Google corpus for a given year, the frequencies in my ngrams are raw counts, so they rise and fall with the size of the Canadiana corpus in each year. Early Canadiana Online is probably a great place to get partisan rhetoric surrounding the conscription debates of the First World War, but it’s not going to be much help if you want to find out more about Mackenzie King’s time as Prime Minister.