At the beginning of the year, I resolved to experiment more with Gephi to explore the possibilities of network analysis in historical research. Prompted largely by the need to organize some classroom activities for my course on the First World War, I delved back into the Canadian Letters and Images Project (CLIP) to scale up my efforts.
In my first attempt, I explored the letters of the Rooke brothers and used the Stanford NER tagger to recognize and tag the names of the people mentioned in each letter. I found this approach quite finicky: many of the names it tagged were places rather than people, and many of the people it did tag were politicians or statesmen, such as Lord Kitchener or Jan Smuts, who were unacquainted with the Rooke brothers or anyone in their family. Seeing that the brothers mentioned Kitchener or Smuts in their letters didn’t tell me much about their personal networks. Instead of relying on the tagger, I decided to try again using the structure of the CLIP website.
To the right of *most* letters in the database, the website indicates who a letter was written to and from:
Studying the html code of these pages, I found that the names of the letter writer and addressee were bracketed by tags:
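As a rough illustration of this kind of extraction, here is a minimal Python sketch using only the standard library. The tag and class names (`letter-to`, `letter-from`) and the sample page are hypothetical stand-ins, not the actual CLIP markup, so they would need to be swapped for whatever the real pages use.

```python
from html.parser import HTMLParser

# Hypothetical markup: the real CLIP tag and class names may differ.
SAMPLE_PAGE = """
<div class="letter-meta">
  <span class="letter-to">Mother</span>
  <span class="letter-from">Wilbert Gilroy</span>
</div>
"""


class LetterMetaParser(HTMLParser):
    """Collect the text inside the (assumed) writer and addressee tags."""

    # Map the assumed CSS classes to the field names we want to keep.
    FIELDS = {"letter-to": "to", "letter-from": "from"}

    def __init__(self):
        super().__init__()
        self.current = None  # field currently being read, if any
        self.meta = {"to": "", "from": ""}

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class", "")
        self.current = self.FIELDS.get(css_class)

    def handle_endtag(self, tag):
        self.current = None

    def handle_data(self, data):
        if self.current:
            self.meta[self.current] += data.strip()


def extract_names(html):
    """Return {'to': ..., 'from': ...}; a field stays blank if its tag is missing."""
    parser = LetterMetaParser()
    parser.feed(html)
    return parser.meta


print(extract_names(SAMPLE_PAGE))
```

Leaving a missing field blank rather than raising an error keeps one row per letter, which matters once the same function is run over thousands of pages.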
I started with a single collection of letters, those written by Wilbert Gilroy, to see if I could extract the names of each letter writer and recipient. It worked pretty well, and I turned the csv file into a network graph on Google Fusion Tables:
In fact, it was so easy to extract these values that I wrote a script to crawl through all 4,521 of the First World War letters in the CLIP website and extract these names into a large csv file:
You can see on line 7 that this was not without problems. Not every letter has the writer and addressee identified on the website, while some letters do not have a date attached (I collected the dates of letters as well, just for good measure). Even though this was not the cleanest set of data, I entered this csv file into Fusion Tables to see how it turned out:
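A quick way to gauge how many rows are affected is to scan the csv for blank fields. The sketch below is an assumption about how that could be done, not the script I actually ran; the column names (`from`, `to`, `date`) and the placeholder records are invented for illustration.

```python
import csv
import io

FIELDNAMES = ["from", "to", "date"]  # assumed column layout

# Placeholder records standing in for scraped letters; blanks mark missing fields.
records = [
    {"from": "Writer A", "to": "Recipient X", "date": "placeholder date"},
    {"from": "", "to": "Recipient Y", "date": ""},
]


def write_letters_csv(records, handle):
    """Write one row per letter, storing missing values as empty strings."""
    writer = csv.DictWriter(handle, fieldnames=FIELDNAMES)
    writer.writeheader()
    for rec in records:
        writer.writerow({k: (rec.get(k) or "").strip() for k in FIELDNAMES})


def incomplete_rows(handle):
    """Return file line numbers of rows with a blank field (the header is line 1)."""
    reader = csv.DictReader(handle)
    return [reader.line_num for row in reader if not all(row.values())]


buffer = io.StringIO(newline="")  # in-memory stand-in for the csv file
write_letters_csv(records, buffer)
buffer.seek(0)
print(incomplete_rows(buffer))
```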
The network graph is not perfect by any means, but it shows who wrote letters (in gold) and who received letters (in blue), with the size of each dot approximating the number of letters sent or received. On account of the blank spaces in the csv file, many of the nodes do not have names attached to them. But the immediate finding we get from this graph is that most letters in the database are addressed to “Mother.” This is a pretty interesting, but not unexpected, find, as it stands to reason that most soldiers wrote to their mothers and that most mothers probably saved the letters they received, thus ensuring they would be stored in an archive and digitized on CLIP. The above graph only shows 161 of the possible 1062 nodes (or values) in the original csv file. We can get a more detailed look at the networks by raising the number of nodes all the way up to 1062. This shows a much more complex network graph:
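The dot sizes come down to simple tallies of letters sent and received, which can be checked outside of Fusion Tables. This is a minimal sketch with placeholder (writer, addressee) pairs rather than the real data, and it skips blank names instead of lumping them into an unnamed node.

```python
from collections import Counter

# Placeholder (writer, addressee) pairs; a blank string marks a missing name.
letters = [
    ("Writer A", "Mother"),
    ("Writer B", "Mother"),
    ("Writer A", "Father"),
    ("Writer C", ""),
]


def letter_counts(letters):
    """Tally letters sent per writer and received per addressee, skipping blanks."""
    sent, received = Counter(), Counter()
    for writer, addressee in letters:
        if writer:
            sent[writer] += 1
        if addressee:
            received[addressee] += 1
    return sent, received


sent, received = letter_counts(letters)
print(received.most_common(1))  # the most frequent addressee and its count
```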
Moving around the graph, we can see that the big blue dot representing all the letters received by a mother is actually unrepresentative. We can also find blue dots for letters received by “Mamma,” “Ma,” or “Mom.” We can also find the same discrepancy for letters written to “Father,” with dots also appearing for “Dad,” and we can assume that the blue dot for “Sister” is equally under-representative, as many letters written to someone’s sister are identified in the database by the sister’s name.
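One simple way to merge these variants is a hand-written lookup table applied to each addressee before graphing. The sketch below is only that, a small lookup under my own assumptions about which spellings to merge; it is not a substitute for a proper clustering tool, and it deliberately leaves personal names untouched.

```python
# Hand-picked variants taken from the graph; extend as new spellings turn up.
CANONICAL = {
    "mother": "Mother",
    "mamma": "Mother",
    "momma": "Mother",
    "ma": "Mother",
    "mom": "Mother",
    "father": "Father",
    "dad": "Father",
}


def normalize_name(name):
    """Map known kinship variants to one label; return other names unchanged."""
    cleaned = name.strip()
    return CANONICAL.get(cleaned.lower(), cleaned)


print([normalize_name(n) for n in ["Mamma", " Ma ", "Dad", "Helen Davis"]])
```

A lookup like this cannot fix the “Sister” problem, since a sister addressed by her own name looks like any other named recipient; that would take information from the collection biographies.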
Data straight from a scrape is always a little dirty, and I could use OpenRefine to go through the 4,521 lines of the csv file and cluster all the variants (Mother, Mom, Momma) into one single value. But so as not to throw out the baby with the bathwater, we can still look around the network graph to discover relationships that might be hidden if we relied on the structure of the website. Each collection of letters is organized around the soldier (or nurse) who wrote them. But looking at the larger network graph, we can discover mutual friends. For instance, we can see that Helen Davis received letters from a few people in the database:
Helen Davis’ correspondence with the four men who wrote to her is acknowledged in each of the four soldiers’ biographies, but their mutual acquaintance really pops out in the network graph. As does the correspondence of Ivy Redman:
We can see that Ivy Redman received letters from four of her male relatives, yet this is not reflected in the biography posted on the landing page for George Redman’s letters. In other cases, we find examples like Joy Smith, who received letters from John Oxborough and J.K. Moffat:
The letter from Moffat informed Smith of Oxborough’s death, offering a grim description of the details. The instance of two people writing to the same person might not shed light on any personal networks, but this provides a pattern that might reveal other letters of condolence.
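The same pattern can be searched for directly in the csv by listing every named recipient who received letters from more than one writer. This is a sketch under assumptions: the Joy Smith pairs come from the example above, the other rows are placeholders, and the list of generic labels to exclude is my own guess at what would otherwise swamp the results.

```python
from collections import defaultdict

# Generic labels conflate many different people, so they are excluded here.
GENERIC = {"Mother", "Father", "Sister", ""}

letters = [
    ("John Oxborough", "Joy Smith"),
    ("J.K. Moffat", "Joy Smith"),
    ("Writer A", "Mother"),       # placeholder rows
    ("Writer B", "Mother"),
    ("Writer A", "Recipient X"),
]


def shared_recipients(letters, min_writers=2):
    """Return named recipients who got letters from at least min_writers writers."""
    writers_by_recipient = defaultdict(set)
    for writer, addressee in letters:
        if writer and addressee not in GENERIC:
            writers_by_recipient[addressee].add(writer)
    return {
        recipient: sorted(writers)
        for recipient, writers in writers_by_recipient.items()
        if len(writers) >= min_writers
    }


print(shared_recipients(letters))
```

Each recipient it returns is only a lead: the letters themselves still have to be read to tell a condolence letter from a shared friendship.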
By finding out what her connection was to each soldier, we can get a deeper insight into the human relationships that developed over the course of the war. Personal correspondence becomes less of a two-way conversation and opens up into a larger network that sometimes reveals sorrow or tragedy.