I’ve been struggling to find a suitable topic for a second blog post, but a recent Twitter exchange with @JennyMacleod and @DavidUnderdown9 gave me the idea to write something up about Google Books’ Ngram Viewer. For those not familiar with the Ngram Viewer, it is a tool offered by Google Books that allows you to view the frequency with which a word appears in Google Books’ enormous corpus of digitized texts. This tool provides an easy way for historians to examine word use over time and it is becoming more popular among academics, but the Ngram Viewer offers a number of advanced functions that many are still not familiar with. The recent Twitter exchange gave a perfect example to highlight how some of these functions can sharpen the analytical edge of the Ngram.
This started when @JennyMacleod posted a tweet about experimenting with Google Books’ Ngram Viewer and finding that the word ‘Anzac’ showed relatively few appearances after 1950. This was especially surprising for the years after 1990, when historians noted a surge in popular interest in Anzac Day commemorations.
Surprised by these results, I suggested running the same search through a different corpus of books. Recognizing that the tool should work in more than one language, the designers of the Ngram Viewer have divided their digitized books into smaller corpora, such as French or German. The English Corpus, however, is divided into British English and American English. Knowing this, I suggested running the same word through the British English corpus, on the hypothesis that most books that use the word ‘Anzac’ are published in Australia, and Google Books probably lumped them into the British corpus. Switching corpus made a small difference:
There are definitely more fluctuations in the frequency of the word, which roughly match up to expectations. There is a bump in the mid-1960s, around the 50th anniversary of the Gallipoli landings, and, most notably, a slight increase in the mid-1990s, when interest began to pick up again. These fluctuations were more noticeable in the British English corpus because there probably are not as many books published for American markets that feature the word ‘Anzac.’ Excluding books written in American English increases the proportion of books that are likely to mention the word ‘Anzac,’ so the fluctuations in the frequency are more apparent on the Ngram.
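To make the arithmetic behind that last point concrete, here is a minimal sketch in Python. The counts are invented purely for illustration and are not drawn from Google Books; the only assumption is that an Ngram plots the share of all words in the chosen corpus that match the search term.

```python
# Hypothetical counts for a single year, invented for illustration.
# They are NOT taken from Google Books.
british = {"anzac_hits": 300, "total_words": 1_000_000}
american = {"anzac_hits": 20, "total_words": 3_000_000}


def relative_frequency(hits, total):
    """Share of all words in a corpus that match the search term."""
    return hits / total


# Pooling both corpora adds many words but very few extra hits.
combined_hits = british["anzac_hits"] + american["anzac_hits"]
combined_total = british["total_words"] + american["total_words"]

british_only = relative_frequency(british["anzac_hits"], british["total_words"])
combined = relative_frequency(combined_hits, combined_total)

print(f"British English only: {british_only:.6%}")
print(f"Combined English:     {combined:.6%}")
```

With these made-up numbers, the British-only share comes out several times larger than the combined share, because the American books add a great deal to the denominator and very little to the numerator. The same dilution would flatten any peaks and troughs on the combined Ngram.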
@DavidUnderdown9 brought up a good point when he mentioned the ‘case-insensitive’ function:
And this, too, made a difference:
With the case-insensitive option switched on, the Ngram Viewer shows the different forms of the word according to their capitalization. In this case, it can be seen that ‘Anzac’ appears at a much higher frequency than ‘ANZAC.’ This is an interesting distinction, because the capitalization of the word tells us a lot about its usage. The word ‘Anzac’ started as an acronym, but is overwhelmingly treated as a proper noun.
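For readers who like to see the idea spelled out, a case-insensitive search amounts to grouping spellings that differ only in capitalization. The following is a toy sketch in Python, not a description of how Google implements it; the token list and the `case_variants` helper are made up for the example.

```python
from collections import Counter

# A made-up list of tokens standing in for a corpus; not real data.
tokens = ["Anzac", "ANZAC", "Anzac", "anzac", "Anzac", "Day", "ANZAC"]


def case_variants(tokens, term):
    """Count each spelling of `term` that differs only by capitalization."""
    return Counter(t for t in tokens if t.lower() == term.lower())


variants = case_variants(tokens, "anzac")

# One line per spelling, most frequent first, then the combined total.
for spelling, count in variants.most_common():
    print(spelling, count)
print("all variants combined:", sum(variants.values()))
```

Keeping the per-spelling counts alongside the combined total is what makes the comparison between ‘Anzac’ and ‘ANZAC’ possible in the first place.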
My last contribution to the conversation was to suggest running ‘Anzac *’ through the Ngram Viewer. Using an asterisk as a wildcard is one of the many additional functions that work in the Ngram Viewer, which are explained in the ‘About NGram Viewer’ link that is at the bottom of every Ngram. The ‘*’ wildcard lets the Ngram Viewer substitute any possible word for the ‘*’ and displays the ten most common substitutions that appear in the corpus. A search for ‘Anzac *’ reveals that the most common word that appears after ‘Anzac’ is ‘Day’:
For historians, this function reveals a lot about the context in which the word ‘Anzac’ is being used. This Ngram shows a clear spike in the frequency of ‘Anzac Day’ in the corpus around 1990 and beyond, which certainly matches expectations more than the original Ngram that presented almost no fluctuations in the frequency of the word. More significantly, it reveals that the frequency of ‘Anzac Day’ rises disproportionately to any other pairing after the mid-1980s, suggesting that the literature is increasingly discussing commemorations of Anzac rather than operational considerations of Anzac Cove or the Anzac Corps in battle.
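The logic of the wildcard can be sketched in a few lines of Python: collect every word that directly follows the search term and rank the results. This is only an illustration under my own assumptions; the sample text is invented and `top_followers` is a hypothetical helper, not part of any Google tool.

```python
from collections import Counter


def top_followers(text, term, limit=10):
    """Rank the words that directly follow `term` (case-sensitive match)."""
    words = text.split()
    followers = Counter(
        nxt for cur, nxt in zip(words, words[1:]) if cur == term
    )
    return followers.most_common(limit)


# Invented sample text, for illustration only.
sample = (
    "Anzac Day services grew while Anzac Cove drew visitors and "
    "Anzac Day marches returned as Anzac Corps histories appeared "
    "before another Anzac Day"
)

for word, count in top_followers(sample, "Anzac"):
    print(f"Anzac {word}: {count}")
```

The `limit=10` default mirrors the ten pairings the Ngram Viewer displays. Running the same ranking separately for each year is, in spirit, what lets the Ngram show ‘Anzac Day’ pulling away from the other pairings over time.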
In this case, the Ngrams generated may not have led to any new discoveries, but they can serve as a useful tool to confirm academic hunches, to visually demonstrate trends in popular interest, or even to make inferences about the historiography. Some familiarity with the corpora and the advanced functions of the Ngram Viewer can add some depth to the analysis and point researchers toward other queries. For instance, the last Ngram shows that ‘Anzac and’ was the fourth most common pairing for the wildcard. A query for ‘Anzac and *’ can give a better idea of what that pairing refers to.
In some cases, the Ngrams open a new avenue of inquiry by revealing a surprising or unexpected pattern. A query for the keywords ‘Australian Imperial Force’ shows that, outside of the two world wars, there is a sudden and noticeable increase in the frequency of this phrase after the mid-1990s. I wonder what that means…