Humanities Data in R

Second Edition

The second edition is a significant revision, with almost every aspect of the text rewritten in some way. The biggest difference is the incorporation of the set of R packages commonly known as the tidyverse, which has at its core the packages ggplot2 and dplyr. These packages have grown significantly in stability and popularity over the past decade. They provide the kinds of functionality that we wanted to highlight in the first edition of the book, but do so with less code while being backed by theoretical models of how data processing should work. These features make them well suited to an introduction to R for working with humanities data.

Topics

As before, Part I introduces the R programming language and key concepts for working with data. Exploratory data analysis (EDA) remains a key concept and philosophy. EDA is an approach to analyzing and summarizing data in order to identify patterns (and outliers). It is also a way of knowing that is amenable to the kinds of questions and heuristics that animate how humanistic fields study the human experience. Years of teaching have shown us how important understanding data collection is to data analysis, yet how few resources address it. We have therefore added Chapter 5: Collecting Data and Chapter 12: Data Formats to cover what is perhaps the most time-consuming part of the work: collecting and organizing data.

Part II of the text is still organized around data types, but we have reordered the chapters to reflect our approach to data. In this edition, we wanted to show how one can layer types of analysis using the same data set. Rather than each chapter introducing a new data set, we build our analysis of Wikipedia data from Chapter 6 through Chapter 8 as we move from text to networks to temporal data. Chapter 8: Temporal Data is a new chapter, added because of the importance of time information, particularly for studying change over time. Chapter 9: Spatial Data returns to the data used in Part I to show how we can layer that information with additional data. Chapter 10: Image Data introduces a new data set of 1940s photographs to which we apply computer vision. While we are always wary of hype about technological change, particularly given the current (generative) AI boosterism, a significant methodological shift in the last ten years has been the advance of computer vision, particularly the ascent of deep learning. We now focus on several of the most popular tasks, such as object detection, and on how we can layer them with additional methods such as networks. The reorganization, additional chapters, and new data sets are all part of demonstrating how layering methods can add context and nuance to our analysis.

About the Authors

Taylor Arnold is Professor of Data Science and Statistics at the University of Richmond. His research focuses on corpus-based techniques to study how messages are communicated through visual and multimodal forms. Lauren Tilton is the E. Claiborne Robins Professor of Liberal Arts and Professor of Digital Humanities at the University of Richmond. Her research focuses on analyzing, developing, and applying digital and computational methods to the study of 20th- and 21st-century documentary expression and visual culture. Arnold and Tilton co-direct the Distant Viewing Lab and have collaborated on a number of projects, including Layered Lives (Stanford University Press, 2022) and Distant Viewing (MIT Press, 2024).

Exploring Networks, Geospatial Data, Images, and Text

Taylor Arnold & Lauren Tilton

Springer

August 23, 2024 (2e); October 1, 2015 (1e)