Humanities Data in R

A textbook and digital resource for exploring networks, geospatial data, images, and text using the popular open-source programming language R.

The Book

A physical copy of the book Humanities Data in R is available through retailers such as Amazon or directly from the publisher, Springer. The Springer site also offers digital editions, along with free digital access for participating institutions through SpringerLink. The text is intended for a one- or two-semester introductory course on digital methods in the humanities and social sciences, or as an intermediate-level self-study guide. While focused on humanities applications, the material is also a useful reference for anyone looking to apply exploratory data analysis methods to network, geospatial, image, and text data. A bulk download of the supplementary code and data is provided at the link below.

Taylor Arnold

website: http://www.stat.yale.edu/~tba3/

Taylor Arnold is currently a lecturer in the Department of Statistics at Yale and a senior scientist at AT&T Labs. His research focuses on the analysis of large, complex datasets and the resulting computational challenges. A particular area of focus is the sparse representation of highly structured objects such as text corpora and digital images. He is the technical co-director of the NEH-funded project Photogrammar.

Lauren Tilton

website: http://www.laurentilton.com

Lauren Tilton is a Ph.D. candidate at Yale University. Her interests include participatory media, twentieth-century history, and visual culture. She is the co-director of the NEH-funded project Photogrammar.

Chapters 1-2: Introduction to R

The R programming language is a powerful and popular tool for interactive data analysis; it is open source software and available for a number of operating systems (see www.r-project.org for download options). These chapters guide users through the installation of R and introduce the core syntax needed for the remainder of the text.

Chapters 3-5: Exploratory Data Analysis

These chapters introduce techniques for exploratory data analysis, a general approach to examining data through visualizations and broad summary statistics. Topics are introduced through example data, including the American Community Survey and results from the French presidential elections. The focus shifts away from the language itself, which is ultimately just a tool, and towards the conceptual methods used for exploring data.
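As a rough illustration of what "broad summary statistics" can look like in practice, here is a minimal base R sketch. The data frame `survey` and its columns are hypothetical toy values, not the book's American Community Survey or election data, and the functions chosen are just one reasonable starting point.

```r
# Hypothetical toy data; not drawn from the book's datasets
survey <- data.frame(
  region = c("North", "North", "South", "South", "West"),
  income = c(42000, 55000, 38000, 61000, 47000),
  stringsAsFactors = FALSE
)

# Broad numeric summary (min, quartiles, mean, max) of one variable
income_summary <- summary(survey$income)
print(income_summary)

# Group-wise mean income by region
mean_by_region <- tapply(survey$income, survey$region, mean)
print(mean_by_region)

# Text-based stem-and-leaf display as a quick look at the distribution
stem(survey$income)
```

In an interactive session, a call such as `hist(survey$income)` would give a graphical view of the same distribution.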

Chapter 6: Networks

Networks, also known as graphs, are abstract representations of relationships between a set of objects. Common applications in the humanities and social sciences include family trees, citation networks, and social networks. This chapter introduces techniques for working with such network data. Examples of family trees and citation networks from US Supreme Court cases are used to illustrate the basic concepts and programming syntax.
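To make the idea of a network as "relationships between a set of objects" concrete, the sketch below stores a small citation network as an edge list and counts citations per case using only base R. The case names are hypothetical placeholders, not actual Supreme Court cases, and the book itself may rely on a dedicated network package instead.

```r
# Hypothetical citation edge list: each row means `from` cites `to`
edges <- data.frame(
  from = c("CaseA", "CaseB", "CaseC", "CaseC", "CaseD"),
  to   = c("CaseB", "CaseD", "CaseA", "CaseB", "CaseA"),
  stringsAsFactors = FALSE
)

# All distinct nodes appearing on either side of an edge
nodes <- sort(unique(c(edges$from, edges$to)))

# Adjacency matrix: entry [i, j] is 1 when node i cites node j
adj <- matrix(0, nrow = length(nodes), ncol = length(nodes),
              dimnames = list(nodes, nodes))
adj[cbind(edges$from, edges$to)] <- 1

# In-degree: how often a case is cited; out-degree: how many cases it cites
in_degree  <- colSums(adj)
out_degree <- rowSums(adj)

print(in_degree)
print(out_degree)
```

An edge list keeps the data compact and easy to read from a spreadsheet, while the adjacency matrix makes degree counts a matter of row and column sums.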

Chapter 7: Geospatial Data

Geospatial data can take many forms: digitized maps, georeferenced points, and shapefiles containing spatial polygons. All of these have prominent applications in the humanities, with some techniques being employed by hand decades before modern computing. This chapter shows how to work with a variety of spatial datasets, with a particular focus on how to merge disparate sources to enrich analyses.
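Merging disparate sources often comes down to joining two tables on a shared key. The following minimal sketch, using base R's `merge`, attaches region-level attributes to georeferenced points. All names, coordinates, and region labels are illustrative assumptions; real projects would typically also involve shapefiles and a spatial package, which are not shown here.

```r
# Hypothetical georeferenced points (approximate, illustrative coordinates)
places <- data.frame(
  name  = c("Site1", "Site2", "Site3"),
  lon   = c(-72.93, -73.99, -71.06),
  lat   = c(41.31, 40.73, 42.36),
  state = c("CT", "NY", "MA"),
  stringsAsFactors = FALSE
)

# A second, independent source keyed by the same state code
state_info <- data.frame(
  state  = c("CT", "MA", "NY"),
  region = c("New England", "New England", "Mid-Atlantic"),
  stringsAsFactors = FALSE
)

# Left join: keep every point, add region attributes where the key matches
enriched <- merge(places, state_info, by = "state", all.x = TRUE)
print(enriched)
```

Setting `all.x = TRUE` keeps points whose key has no match in the second table, with `NA` in the added columns, which makes gaps between sources easy to spot.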

Chapter 8: Images

Much humanities data consists of digitized images, and there is an active push to digitize even more. In this chapter we present methods for visualizing an entire corpus of images through dimension reduction algorithms. As an example, we show how these methods can be applied to a collection of outdoor photographs and the degree to which they successfully separate those taken during the day from those taken at night.
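The sketch below illustrates the general idea with principal component analysis (PCA), one common dimension reduction technique; the text above does not say which algorithms the chapter uses, so PCA is an assumption here. The "images" are synthetic 8 x 8 grayscale arrays generated at random, with brighter pixels standing in for daytime photographs, rather than real photographs.

```r
set.seed(1)

# Each hypothetical image is 8 x 8 grayscale, flattened to 64 pixel values
n_pixels <- 64
day   <- matrix(runif(20 * n_pixels, min = 0.6, max = 1.0), nrow = 20)
night <- matrix(runif(20 * n_pixels, min = 0.0, max = 0.4), nrow = 20)

# One row per image, one column per pixel
images <- rbind(day, night)
label  <- rep(c("day", "night"), each = 20)

# Project the 64-dimensional images onto their principal components
pca <- prcomp(images, center = TRUE, scale. = FALSE)

# Keep the first two components as 2-D coordinates for each image
coords <- pca$x[, 1:2]

# Compare the two groups along the first component
print(tapply(coords[, 1], label, range))
```

In an interactive session, `plot(coords, col = ifelse(label == "day", "orange", "navy"))` would draw the two groups. With real photographs the separation is rarely this clean, since brightness is only one of many sources of variation.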

Chapter 9: Natural Language Processing

The analysis of text corpora is a popular application of digital methods in the humanities. This chapter presents a survey of topics in Natural Language Processing (NLP), with a focus on linking modern techniques from the field with humanities scholarship. Tasks such as tokenization, lemmatization, part-of-speech tagging, and coreference detection are described in relation to text analysis. The methods are applied to a corpus of short stories by Sir Arthur Conan Doyle.
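Of the tasks listed, tokenization is the simplest to sketch. The example below is a deliberately crude base R tokenizer that lowercases the text and splits on anything that is not a letter. The sample sentence is invented for illustration and is not quoted from Conan Doyle; lemmatization, part-of-speech tagging, and coreference detection need dedicated NLP tools and are not attempted here.

```r
# Invented sample text, not a quotation from the corpus
text <- "Holmes looked at the letter. Watson looked at Holmes."

# Lowercase, then split on runs of non-letter characters
tokens <- unlist(strsplit(tolower(text), "[^a-z]+"))

# Drop any empty strings produced by the split
tokens <- tokens[tokens != ""]
print(tokens)

# Count how often each token occurs, most frequent first
token_counts <- sort(table(tokens), decreasing = TRUE)
print(token_counts)
```

A rule this simple mishandles contractions, hyphenated words, and accented characters, which is part of why full NLP pipelines exist.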

Chapter 10: Text Analysis

Building on the previous chapter, here we show how the raw data constructed from NLP can be used to study patterns and extract meaning from text corpora. Topics include information retrieval, topic modeling, and stylometrics. Applications include the text of Wikipedia articles and 26 novels by Mark Twain, Charles Dickens, Nathaniel Hawthorne, and Sir Arthur Conan Doyle.
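As one small example from information retrieval, the sketch below computes tf-idf weights, a standard way to score how distinctive a term is for a document within a corpus. Whether the chapter uses this exact weighting is an assumption; the three "documents" are invented strings, and the formula shown (raw term count times the log of the inverse document frequency) is only one of several common variants.

```r
# Invented toy corpus; not text from the book's datasets
docs <- c(
  doc1 = "the river boat on the river",
  doc2 = "the detective and the letter",
  doc3 = "the boat and the detective"
)

# Split each document on single spaces and build the vocabulary
tokens <- strsplit(docs, " ", fixed = TRUE)
vocab  <- sort(unique(unlist(tokens)))

# Term frequency matrix: one row per document, one column per term
count_terms <- function(tk) as.integer(table(factor(tk, levels = vocab)))
tf <- t(vapply(tokens, count_terms, integer(length(vocab))))
dimnames(tf) <- list(names(docs), vocab)

# Inverse document frequency: terms found in every document get weight 0
doc_freq <- colSums(tf > 0)
idf <- log(length(docs) / doc_freq)

# Scale each term's column by its idf weight
tfidf <- sweep(tf, 2, idf, `*`)
print(round(tfidf, 3))

# Highest-weighted term in the first document
top_term_doc1 <- vocab[which.max(tfidf["doc1", ])]
print(top_term_doc1)
```

Because "the" appears in every document its weight drops to zero, while a term concentrated in a single document rises to the top.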