Humanities Data in R

A textbook and digital resource for exploring networks, geospatial data, images, and text using the popular open-source programming language R.

The Book

A physical copy of the book Humanities Data in R is available through retailers such as Amazon or directly from the publisher, Springer. The Springer site also offers digital editions, and participating institutions have free digital access through SpringerLink. The text is intended for a one- or two-semester introductory course on digital methods in the humanities and social sciences, or as an intermediate-level self-study guide. While focused on humanities applications, the material is also a useful reference for anyone looking to apply exploratory data analysis methods to network, geospatial, image, and text data. A bulk download of the supplementary code and data is provided in the link below.

About the Authors

Taylor Arnold is Assistant Professor of Statistics at the University of Richmond. A recipient of grants from the NEH and ACLS, Arnold's research focuses on computational statistics, text analysis, image processing, and applications within the humanities. A particular interest is the study of data containing linked text and images, such as newspapers with embedded figures or television shows with associated closed captions.

Lauren Tilton is Assistant Professor of Digital Humanities and a member of the University of Richmond's Digital Scholarship Lab. Her current book project focuses on participatory media in the 1960s and 1970s. She is the Co-PI of the project Participatory Media, which interactively engages with and presents participatory community media from the 1960s and 1970s. She is also a director of Photogrammar, a web-based platform for organizing, searching and visualizing the 170,000 photographs from 1935 to 1945 created by the United States Farm Security Administration and Office of War Information (FSA-OWI).


1-2: Introduction to R The R programming language is a powerful and popular tool for interactive data analysis; it is open-source software and is available for a number of operating systems (see the R Project website for download options). These chapters guide users through the installation of R and introduce the core syntax needed for the remainder of the text.

3-5: Exploratory Data Analysis These chapters introduce techniques for exploratory data analysis, a general approach to examining data through visualizations and broad summary statistics. Topics are introduced through example data, including the American Community Survey and results from the French presidential elections. The focus shifts away from the language itself, which is ultimately just a tool, and towards the conceptual methods used for exploring data.

6: Networks Networks, also known as graphs, are abstract representations of relationships between a set of objects. Common applications in the humanities and social sciences include family trees, citation networks, and social networks. This chapter introduces techniques for working with such network data. Examples of family trees and citation networks from US Supreme Court cases are used to illustrate the basic concepts and programming syntax.

7: Geospatial Data Geospatial data can take many forms: digitized maps, georeferenced points, and shapefiles containing spatial polygons. All of these have prominent applications in the humanities, with some techniques being employed by hand decades before modern computing. This chapter shows how to work with a variety of spatial datasets, with a particular focus on how to merge disparate sources to enrich analyses.

8: Images A large amount of humanities data consists of digitized image data, and there is an active push to digitize even more. In this chapter we present methods for visualizing an entire corpus of images through dimension reduction algorithms. As an example, we show how these methods can be applied to a collection of outdoor photographs and the degree to which they successfully separate those taken during the day from those taken at night.

9: Natural Language Processing The analysis of text corpora is a popular application of digital methods in the humanities. This chapter presents a survey of topics in Natural Language Processing (NLP), with a focus on linking modern techniques from the field with humanities scholarship. Tasks such as tokenization, lemmatization, part-of-speech tagging, and coreference detection are described in relation to text analysis. The methods are applied to a corpus of short stories by Sir Arthur Conan Doyle.

10: Text Analysis Building on the previous chapter, here we show how the raw data constructed from NLP can be used to study patterns and extract meaning from text corpora. Topics include information retrieval, topic modeling, and stylometrics. Applications include the text of Wikipedia articles and 26 novels by Mark Twain, Charles Dickens, Nathaniel Hawthorne, and Sir Arthur Conan Doyle.