Data sets from McGill’s digital collections now available for text-mining and data-intensive research

By Alexandra Kohn, Head, Copyright Office

Collections as Data is a movement in the cultural heritage sector to make our collections machine-actionable and available at scale for data-intensive research. Inspired by the Institute of Museum and Library Services-funded project Always Already Computational: Collections as Data and its successor, Part to Whole, McGill Library has entered the fray with a pilot project that gives researchers access to data sets extracted from a wide variety of our digitized collections.

Our digital collections provide access to digital surrogates of some of our fantastic rare and archival materials, allowing users to peruse and search items online. This project expands on that access by making these collections available for download in formats suited to computationally driven research and teaching. The material ranges from full-text transcriptions of historical texts on gynaecology in traditional Chinese medicine, to the entire run of a late 19th-century architectural trade publication, to the entire run of two of the earliest McGill student publications. The research questions that can be explored are fascinating and plentiful!

Here is a preview of the data that is now available: 

Canadian Architect and Builder

This data set consists of plain text files containing the full text of the publication Canadian Architect and Builder (1888–1908), which was digitized by McGill University Libraries in the late 1990s and is accessible at Canadian Architect and Builder Online. The Canadian Architect and Builder (CAB) was the only professional architectural journal published in Canada before World War I. Because the text files include both advertisements and articles, CAB provides a wealth of information on the state of architecture and building in Canada during the late 19th and early 20th centuries.
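As a starting point for working with plain text files like these, here is a minimal Python sketch that counts word frequencies across a folder of .txt files. The folder layout, the UTF-8 encoding, and the function names are assumptions made for illustration; they are not taken from the data set documentation.

```python
from collections import Counter
from pathlib import Path
import re

# Simple tokenizer: runs of letters, optionally with one apostrophe part (e.g. "builder's").
TOKEN = re.compile(r"[a-z]+(?:'[a-z]+)?")


def count_words(text: str) -> Counter:
    """Count lowercase word tokens in a string."""
    return Counter(TOKEN.findall(text.lower()))


def count_corpus(folder: str) -> Counter:
    """Sum word counts over every .txt file directly inside `folder`.

    The folder layout is hypothetical; point it at wherever the files were unpacked.
    """
    totals = Counter()
    for path in sorted(Path(folder).glob("*.txt")):
        # errors="replace" tolerates stray bytes, which are common in older OCR output.
        totals += count_words(path.read_text(encoding="utf-8", errors="replace"))
    return totals


# Invented sentence, used only to show the call.
sample = "The architect and the builder met the builder's agent."
print(count_words(sample).most_common(3))
```

Calling `count_corpus("cab_text")` on a local folder (the name `cab_text` is a placeholder) would give totals for the whole run. Because advertisements and articles share the same files, frequent terms will reflect both.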


Gynaecology in Traditional Chinese Medicine 

This data set consists of XML files from the digitization of a small collection of Chinese gynaecological works held by McGill University Library Rare Books and Special Collections. One of these texts is unique; the others are well-known works that exercised considerable influence on the practice of gynaecology in late imperial China and were reprinted many times. The original digital collection project was carried out in the early 2000s and is accessible at Gynaecology in Traditional Chinese Medicine: Selected Texts.


McGill Library Electronic Thesis and Dissertation (ETD) collection (1881–2018) 

This data set consists of metadata and (in some cases) full text of the McGill Thesis and Dissertation collections from 1881–2018. McGill holds theses and dissertations written by McGill students from 1881 to the present day. The historical print collection is housed in the McGill University Library’s Rare Books and Special Collections. Since 2009, theses have been submitted electronically and made available in our institutional repository. In 2016, a massive retrospective digitization project was completed, and as a result the full text of the historical theses was also made available online in the institutional repository. All the digitized and born-digital theses are now publicly available. Find more information about the collection on its website, Highlights from McGill theses and dissertations.


Map Depicting the Principal Trading Stations of the North West Company in 1817. Rare Books and Special Collections, McGill University Library.

The Fur Trade in Canada and the North West Company 

The Fur Trade in Canada and the North West Company data set provides access to the full-text XML files of 38 manuscripts collectively known as the Masson Papers, held in McGill University Library Rare Books and Special Collections. The Masson Papers comprise letters, diaries, travel narratives, and other textual documents relating to the North West Company and the colonial-era fur trade more generally. The papers represent a settler perspective on North American places and peoples. The source site, In Pursuit of Adventure: The Fur Trade in Canada and the North West Company, was created in the late 1990s. More information about the manuscripts and the transcription standards is available on the website.


Carleton County (Ontario Map Ref #39). Illustrated historical atlas of the county of Carleton (including city of Ottawa), Ont. Toronto : H. Belden & Co., 1879. Rare Books & Special Collections, elf G1148.C3H3 1879.

McGill County Atlas Project People Index 

In Search of Your Canadian Past: The Canadian County Atlas Digital Project, created by McGill University Library in the late 1990s, provides access to 43 Ontario county atlases that were produced between 1874 and 1881 and are housed in McGill’s Rare Books and Special Collections. Of interest to genealogists, the atlases contain indexes of persons residing in each county; these have been digitized and are searchable on the project website. This data set is an extract of the people index used by the website, along with URLs for each record. The CSV contains 172,927 records with the following fields: title (e.g. Mr., Mrs., Prof.), first name, last name, township name, town name, county name, atlas date, URL.
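To give a sense of how a CSV with these fields might be queried, the Python sketch below counts records per county for a given surname. The column header names (`last_name`, `county_name`, and so on) and the three sample rows are invented for illustration; the real file's header line should be checked before adapting this.

```python
import csv
import io
from collections import Counter

# Hypothetical header names and made-up rows, not taken from the actual people index.
SAMPLE_CSV = """title,first_name,last_name,township_name,town_name,county_name,atlas_date,url
Mr.,John,Smith,Township A,,County X,1879,https://example.org/record/1
Mrs.,Mary,Jones,Township B,,County X,1879,https://example.org/record/2
,Peter,Smith,Township C,Town D,County Y,1877,https://example.org/record/3
"""


def surnames_by_county(csv_file, surname):
    """Count records per county whose last name matches `surname` (case-insensitive)."""
    counts = Counter()
    for row in csv.DictReader(csv_file):
        if row["last_name"].strip().lower() == surname.lower():
            counts[row["county_name"]] += 1
    return counts


result = surnames_by_county(io.StringIO(SAMPLE_CSV), "smith")
print(dict(result))
```

For the downloaded file, the same function could be given an open file handle, for example `open(path, newline="", encoding="utf-8")`, in place of the in-memory sample; the encoding is also an assumption.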


McGill Library Chapbook Collection TEI files 

This data set is the result of a Text Encoding Initiative (TEI) project built around the McGill Library’s Chapbook Digital Collection. Rare Books and Special Collections created a TEI XML file for most of the chapbooks on this site using TEI P5: Guidelines for Electronic Text Encoding and Interchange by the TEI Consortium. Level 4 encoding from Best Practices for TEI in Libraries was used to guide the work. Note that the woodcuts in each chapbook were assigned a classification code from the Iconclass thesaurus to describe the subject of the image. The McGill Library’s Chapbook Collection was created from chapbooks from three special collections in the Rare Books and Special Collections Library. The majority of the imprints (955 titles) are from the 19th century, published in England and the Northeastern United States. There are 74 Scottish and 19 Irish chapbooks in the collection. Most of the collection’s 18th-century titles were published in London, England.
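For readers new to TEI, the following Python sketch shows one way to pull a title and the running text out of a TEI P5 file with the standard library. It assumes the files declare the standard TEI namespace (`http://www.tei-c.org/ns/1.0`) and keep the title under `titleStmt`; the sample document is invented and is not drawn from the collection.

```python
import xml.etree.ElementTree as ET

# Standard TEI P5 namespace; assumed (not confirmed) to be what the chapbook files declare.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Invented miniature TEI document, for illustration only.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title>An Invented Chapbook Title</title></titleStmt>
      <publicationStmt><p>Illustrative sample only.</p></publicationStmt>
      <sourceDesc><p>Not taken from the collection.</p></sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <p>Once upon a time,</p>
      <p>there was   a sample.</p>
    </body>
  </text>
</TEI>"""


def title_and_text(xml_string):
    """Return (title, body text) from a TEI string, with whitespace collapsed."""
    root = ET.fromstring(xml_string)
    title = root.findtext(".//tei:titleStmt/tei:title", default="", namespaces=TEI_NS)
    body = root.find(".//tei:text/tei:body", TEI_NS)
    # itertext() walks every descendant, so nested markup is flattened into plain text.
    raw = "".join(body.itertext()) if body is not None else ""
    return title.strip(), " ".join(raw.split())


title, text = title_and_text(SAMPLE_TEI)
print(title)
print(text)
```

Flattening the body this way discards the encoding itself, so it suits quick text-mining passes rather than work that depends on the markup.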


Front page of the first issue of The McGill Fortnightly Review, from 1925.

McGill Student Publications 

The current data set includes plain text files of the OCR’d text from the full run of the student-run publications McGill University Gazette (1874–1890) and McGill Fortnightly (1892–1896). Digital copies of the full corpus of the McGill Student Publications are available in the McGill Student Publications Collection in the Internet Archive. We plan to add .txt files for the remainder of the publications in the collection to this repository in due course. The physical collection is housed in the McGill University Archives.


Please visit our Digital Collection Data webpage for descriptions of the data sets that are currently available and links to where they can be downloaded with full documentation.

Many thanks to the rest of the project team for making this project a reality: Gagandeep Dhillon, Greg Houston, Awais Mehmood Khalid, Svetlana Kochkina, Jenn Riley, Ana Rogers-Butterworth, and Elizabeth Thomson.
