
Trawling Trove

As some of my earlier posts suggest, I often supplement my archival research by drawing on repositories of digitized newspapers. The National Library of Australia’s Trove database is easily one of the largest, most innovative, and best-curated public repositories for digitized newspapers. I have finally gotten around to using the Trove API to scrape a large corpus of newspaper articles.

The API (Application Programming Interface) allows users to query the Trove database and receive their results as structured data such as XML or JSON. While a normal search on the Trove website would produce this:

[Screenshot: results of a normal search on the Trove website]

A query using the API returns this:

[Screenshot: the same query run through the API, returned as structured XML]
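
The default response is XML. If I am reading the API documentation right, adding an encoding=json parameter to the same request returns JSON instead, so a minimal query from the command line looks something like this (with a real key in place of the placeholder):

# ask for JSON instead of the default XML (the encoding parameter, as far as I can tell from the docs)
curl "http://api.trove.nla.gov.au/result?key=[MYAPIKEY]&zone=newspaper&q=patrioti*+appeal&encoding=json"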

Running queries through the API allows users to parse out specific pieces of information from the search results, such as the name of the publication, the publication date, or the format of the article. This would be useful, for instance, if someone wanted to run a search for the word ’empire’ to see how often the word appeared in a newspaper article compared to the number of times it appeared in an advertisement.
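
As a rough sketch of how that comparison might work (this isn't part of my scripts below, and it assumes each article record carries a category element, as the results I have seen do), you could tally the categories for an 'empire' search with xmlstarlet, sort, and uniq (the n parameter simply asks for a larger page of results):

# count how many results fall into each category (Article, Advertising, etc.) for one page of results
curl "http://api.trove.nla.gov.au/result?key=[MYAPIKEY]&zone=newspaper&q=empire&n=100" | xmlstarlet sel -T -t -m "/response/zone/records/article" -v "category" -n | sort | uniq -c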

Because I wanted to scrape newspaper articles that discussed the voluntary war effort, I was interested in parsing out the article urls so that I could compile these as a list of items to download. I ran a query for the search terms ‘patrioti*’ and ‘appeal’:

http://api.trove.nla.gov.au/result?key=[MYAPIKEY]&zone=newspaper&q=patrioti*+appeal&anyWords=&notWords=&requestHandler=&dateFrom=1914-09-01&dateTo=1918-09-01&s=

The API query returned 13,288 articles. Even though I specified a date range between September 1914 and December 1918, the results obviously included articles from outside those dates. That threw a spanner in the works because I really didn't want to download that many articles, but there's an easy fix. The first step was to build a for loop that would produce a csv file with each article's title, date of publication, and url. I used curl to bring up each page of results and xmlstarlet to parse out the desired values and print them into a text file:

for i in {0..13260..20}
do
     curl "http://api.trove.nla.gov.au/result?key=[MYAPI]=newspaper&q=patrioti*+appeal&anyWords=&notWords=&requestHandler=&dateFrom=1914-09-01&dateTo=1918-09-01&s="$i | xmlstarlet sel -T -t -m "/response/zone/records/article" -v "heading" -o "; "" -v "date" -o "; " -v "troveUrl" -o """ -n >> patrioti_appeal_csv2.txt

done

The resulting csv file gave me a good record of all the titles I would be downloading, but I needed to delete all the results that were earlier than September 1914 or later than December 1918. I probably should have spent the time brushing up on regular expressions to find a way to delete the undesired lines with sed, but I just opened the csv file in Excel, sorted the rows by date, and deleted everything outside of the desired date range. While I was in Excel, I also created a new csv file that only contained the article urls, because this was the only information I needed to download all of the corresponding articles. I pared these down from full urls to the numerical identifiers and printed them in a new file called patrioti_appeal_ids.txt.
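
In hindsight, the same filtering could probably have been done from the command line rather than in Excel: the dates in the csv come back in YYYY-MM-DD form, so a plain string comparison on the second field in awk should do it. Something like this (the output filename is just for illustration):

# keep only the rows whose date (the second field) falls within the desired range
awk -F'; ' '$2 >= "1914-09-01" && $2 <= "1918-12-31"' patrioti_appeal_csv2.txt > patrioti_appeal_filtered.txt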

The text version of each article in the Trove database can be called up with a url like:

http://trove.nla.gov.au/ndp/del/text/47493801

That's pretty easy to call up if you have a list of the identifiers you want. I created a variable called $url to take an identifier from the top of the list in patrioti_appeal_ids.txt and used wget to download the html code for each article's page:

url=$(head -n1 patrioti_appeal_ids.txt)
wget "http://trove.nla.gov.au/ndp/del/text/"$url -O $url.html

To make things a little fancy, I thought I should append the citation at the bottom of every article. The url for the citation page of this particular article looks like this:

http://trove.nla.gov.au/ndp/del/cite/5282800/47493805

Each citation url has both the article identifier (47493805, in this case) and an identifier for the page it appears on (5282800, in this case). Because I didn't have a list of page identifiers, I used curl and grep to pull this out of each article's html code, saved the value as a variable called $urlending, and then used that to retrieve the citation page and append it to the html with the article's text:

urlending=$(curl "http://trove.nla.gov.au/ndp/del/article/"$url | grep -o "ndp/del/cite/.*/$url")
curl "http://trove.nla.gov.au/"$urlending >> $url.html

That would give me an html file for each article, with its citation at the bottom. I used html-to-text to convert that into a text file and sed to tidy up what was left.

The complete scripts are:

#! /bin/bash
#There were 13288 results, and the API returns them in pages of up to 20 results each. This brace notation counts up in steps of 20 so that each page of results is requested exactly once
for i in {0..13260..20}
do
# parse out article heading, date, and troveurl for each result using xmlstarlet
curl "http://api.trove.nla.gov.au/result?key=[MYAPI]&zone=newspaper&q=patrioti*+appeal&anyWords=&notWords=&requestHandler=&dateFrom=1914-09-01&dateTo=1918-09-01&s="$i | xmlstarlet sel -T -t -m "/response/zone/records/article" -v "heading" -o "; "" -v "date" -o "; " -v "troveUrl" -o """ -n >> patrioti_appeal_csv2.txt

done
#just a bit of tidying up - removing quotation marks
tr -d '"' < patrioti_appeal_csv2.txt > patrioti_appeal_csv.txt

This first bit produced a csv file, from which I deleted results that did not fall within the September 1914 to December 1918 range. Then I grabbed the column with the urls and saved it as a new .txt file.
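
The column-grabbing step could also be swapped for a one-liner, since the urls sit in the third field of the csv. Again just a sketch, using the hypothetical filtered file from the awk example above:

# print the troveUrl column on its own, ready for the next script
awk -F'; ' '{print $3}' patrioti_appeal_filtered.txt > patrioti_appeal_csv3.txt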

#! /bin/bash
#starting with a bit more cleaning and extracting the article identifiers from each url
tr -d ' *' < patrioti_appeal_csv3.txt | sed 's/?searchTerm=patrioti+appeal//g' | sed 's,http://trove.nla.gov.au/ndp/del/article/,,g' > patrioti_appeal_ids2.txt
#removing carriage returns because the csv file was made with MS Excel
tr -d '\r' < patrioti_appeal_ids2.txt > patrioti_appeal_ids.txt
#remove blank lines from patrioti_appeal_ids.txt - they would otherwise derail the loop below
sed -i '/^$/d' patrioti_appeal_ids.txt

#after deleting the results outside the 1914-1918 range, there were 6871 articles left.
for i in {1..6871}
do
    if [ -s patrioti_appeal_ids.txt ]
         then
         #take first identifier from top of patrioti_appeal_ids.txt
         url=$(head -n1 patrioti_appeal_ids.txt)
         # remove id from patrioti_appeal_ids.txt
         sed -i '1d' patrioti_appeal_ids.txt
         # append id to patrioti_appeals_ids-done.txt
         echo $url >> patrioti_appeals_ids-done.txt
         # download article as html
         wget "http://trove.nla.gov.au/ndp/del/text/"$url -O $url.html
         #add blank line at bottom of html
         echo '' >> $url.html
         #extract page identifier, to grab article citation
         urlending=$(curl "http://trove.nla.gov.au/ndp/del/article/"$url | grep -o "ndp/del/cite/.*/$url")
         #grab article citation, append to bottom of article html
         curl "http://trove.nla.gov.au/"$urlending >> $url.html
         #convert article html into a text file
         cat $url.html | html-to-text > $url-long.txt
         #chop off the html code in the first 6 lines
         sed '1,6d' $url-long.txt | fmt -70 > $url.txt
         #delete unnecessary files
         rm *long.txt
         rm *html
     fi
#pause for politeness
sleep 3

done

It's a pretty messy script, but it works. While searching through these articles, I realized that grep read a good number of them as binary files because of null characters. After a bit of googling, I found that running tr -d '\000' over the files effectively solved that problem without tampering with the content.
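
Applied across the downloaded articles, that amounts to something like the sketch below; tr cannot edit a file in place, so each one goes through a temporary file:

# strip null characters so grep treats the article files as text rather than binary
for f in *.txt
do
    tr -d '\000' < "$f" > "$f.tmp" && mv "$f.tmp" "$f"
done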

With that last hiccup solved, I can start sifting through 6871 newspaper articles…
