By Alexandra Kohn, Head, Copyright Office
Collections as Data is a movement in the cultural heritage sector to make our collections machine-actionable and available at scale for data-intensive research. Inspired by the Institute of Museum and Library Services-funded project Always Already Computational: Collections as Data and its successor, Part to Whole, McGill Library has entered the fray with a pilot project to provide researchers access to datasets extracted from a wide variety of our collections of digitized materials.
Our digital collections provide access to the digital surrogates of some of our fantastic rare and archival materials, allowing users to peruse and search items online. This project expands upon this access by making these collections available to be downloaded and used in a format suitable for computationally driven research and teaching. The material ranges from full-text transcriptions of historical texts on gynaecology in traditional Chinese medicine to the text of the entire run of a late 19th century architectural trade publication, and includes the full text of the entire run of two of the earliest McGill student publications. The research questions that can be explored are fascinating and plentiful!
Here is a preview of the data that is now available:
This data set consists of plain text files containing the full text of the publication Canadian Architect and Builder (1888–1908), which was digitized by McGill University Libraries in the late 1990s and is accessible at Canadian Architect and Builder Online. The Canadian Architect and Builder (CAB) was the only professional architectural journal published in Canada before World War I. With both advertisements and articles appearing in the text files, CAB provides a wealth of information on the state of architecture and building in Canada during the late 19th and early 20th centuries.
This data set consists of XML files from the digitization of a small collection of Chinese gynaecological works held by McGill University Library Rare Books and Special Collections. One of these texts is unique; the others are well-known works that exercised considerable influence on the practice of gynaecology in late imperial China and were reprinted many times. The original digital collection project was carried out in the early 2000s and is accessible at Gynaecology in Traditional Chinese Medicine: Selected Texts.
This data set consists of metadata and (in some cases) full text of the McGill Thesis and Dissertation collections from 1881–2018. McGill holds theses and dissertations written by McGill students from 1881 to the present day. The historical print collection is housed in the McGill University Library’s Rare Books and Special Collections. Since 2009, theses have been submitted electronically and are made available in our institutional repository. In 2016, a massive retrospective digitization project was completed, as a result of which the full text of the historical theses was also made available online in the institutional repository. All the digitized and born-digital theses are now publicly available. Find more information about the collection on its website, Highlights from McGill theses and dissertations.
The Fur Trade in Canada and the North West Company data set provides access to the full-text XML files of 38 manuscripts collectively known as the Masson Papers, held in McGill University Library Rare Books and Special Collections. The Masson Papers comprise letters, diaries, travel narratives, and other textual documents relating to the North West Company and the colonial-era fur trade more generally. The papers represent a settler perspective of North American places and peoples. The source site, In Pursuit of Adventure: The Fur Trade in Canada and the North West Company, was created in the late 1990s. More information about the manuscripts and the transcription standards is available on the website.
In Search of Your Canadian Past: The Canadian County Atlas Digital Project, created by McGill University Library in the late 1990s, provides access to 43 Ontario county atlases that were produced between 1874 and 1881 and are housed in McGill’s Rare Books and Special Collections. Of interest to genealogists, the atlases contain indexes of persons residing in each county; these have been digitized and are searchable on the project website. This data set is an extract of the people index used by the website, along with URLs for each record. The CSV contains 172,927 records with the following fields: title (e.g. Mr., Mrs., Prof.), first name, last name, township name, town name, county name, atlas date, URL.
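As a rough illustration of how a people index like this might be explored, the Python sketch below counts records per county using only the standard library. The header name `county name` and the sample rows are assumptions made for the example (the rows are invented, not taken from the atlases); consult the data set's documentation for the actual column headers.

```python
import csv
import io
from collections import Counter


def count_by_county(csv_file, county_field="county name"):
    """Count records per county in a people-index CSV.

    county_field is an assumed header name; adjust it to match the real file.
    """
    reader = csv.DictReader(csv_file)
    return Counter(row[county_field] for row in reader)


# Invented sample rows standing in for the real CSV (hypothetical data).
sample = io.StringIO(
    "title,first name,last name,township name,town name,county name,atlas date,URL\n"
    "Mr.,First,Person,Township One,,County A,1878,http://example.org/1\n"
    "Mrs.,Second,Person,Township Two,,County A,1878,http://example.org/2\n"
    "Prof.,Third,Person,Township Three,,County B,1880,http://example.org/3\n"
)

counts = count_by_county(sample)
for county, total in counts.most_common():
    print(county, total)
```

With the real file, the `io.StringIO` object would be replaced by something like `open(path, newline="", encoding="utf-8")`, where the path and encoding depend on how the download is packaged.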
This data set is the result of a Text Encoding Initiative (TEI) project built around the McGill Library’s Chapbook Digital Collection. Rare Books and Special Collections created a TEI XML file for most of the chapbooks on this site using TEI P5: Guidelines for Electronic Text Encoding and Interchange by the TEI Consortium. Level 4 coding from Best Practices for TEI in Libraries was used to guide the encoding. Note that the woodcuts in each chapbook were assigned a classification code from the Iconclass thesaurus to describe the subject of the image. The McGill Library’s Chapbook Collection was created from chapbooks from three special collections in the Rare Books and Special Collections Library. The majority of the imprints (955 titles) are from the 19th century, published in England and the Northeastern United States. There are 74 Scottish and 19 Irish chapbooks in the collection. Most of the collection’s 18th century titles were published in London, England.
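For readers new to TEI, here is a minimal Python sketch of pulling a title and paragraph text out of a TEI P5 file with the standard library. It assumes the files use the standard TEI namespace and the common `titleStmt/title` and `text/body/p` structure; the sample document is invented, and the actual chapbook files may be organized differently (for example, this sketch does not attempt to read the Iconclass codes, whose encoding is not described here).

```python
import xml.etree.ElementTree as ET

# Standard TEI P5 namespace.
TEI_URI = "http://www.tei-c.org/ns/1.0"
TEI_NS = {"tei": TEI_URI}


def read_tei(xml_string):
    """Return (title, list of paragraph strings) from a TEI document."""
    root = ET.fromstring(xml_string)
    title = root.findtext(
        ".//tei:titleStmt/tei:title", default="", namespaces=TEI_NS
    )
    paragraphs = []
    body = root.find(".//tei:text/tei:body", TEI_NS)
    if body is not None:
        for p in body.iter(f"{{{TEI_URI}}}p"):
            # Join nested text (e.g. inside <hi>) and normalize whitespace.
            paragraphs.append(" ".join("".join(p.itertext()).split()))
    return title.strip(), paragraphs


# Invented sample document (hypothetical, not from the collection).
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc><titleStmt><title>An Invented Chapbook</title></titleStmt></fileDesc>
  </teiHeader>
  <text><body>
    <p>Once upon a <hi>time</hi>.</p>
    <p>The end.</p>
  </body></text>
</TEI>"""

title, paragraphs = read_tei(SAMPLE)
print(title)
print(paragraphs)
```

For files on disk, `ET.parse(path).getroot()` could replace `ET.fromstring`, with the rest of the function unchanged.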
The current data set includes full-text files of the OCR’d text from the full run of the student-run publications McGill University Gazette (1874–1890) and McGill Fortnightly (1892–1896). Digital copies of the full corpus of the McGill Student Publications are available in the McGill Student Publications Collection on the Internet Archive. We plan to add .txt files for the remainder of the publications in the collection to this repository in due course. The physical collection is housed in the McGill University Archives.
Please visit our Digital Collection Data webpage for descriptions of the data sets that are currently available and links to where they can be downloaded with full documentation.
Many thanks to the rest of the project team for making this project a reality: Gagandeep Dhillon, Greg Houston, Awais Mehmood Khalid, Svetlana Kochkina, Jenn Riley, Ana Rogers-Butterworth, and Elizabeth Thomson.