--- title: '"Unstacking" Bars' author: "lindbrook" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{"Unstacking" Bars} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, echo = FALSE, message = FALSE} knitr::opts_chunk$set(collapse = TRUE, comment = ">") library("cholera") library("HistData") library("ggplot2") library("KernSmooth") bandwidth <- 0.5 top <- c(1:12, 14) right <- c(37, 62, 74, 142, 147, 205, 240, 248, 280, 360, 405, 419, 465) bottom <- c(483, seq(487, 495, 2), 498, 500, seq(503, 519, 2)) left <- c(31, 79, 114, 285, 348, 397, 469) border <- sort(c(bottom, left, top, right)) map.border <- Snow.streets[Snow.streets$street %in% border == TRUE, ] border.list <- split(map.border[, c("x", "y")], map.border$street) ## Roads ## roads.list <- split(roads[, c("x", "y")], roads$street) road.segments <- lapply(unique(roads$street), function(i) { dat <- roads[roads$street == i, ] names(dat)[names(dat) %in% c("x", "y")] <- c("x1", "y1") seg.data <- dat[-1, c("x1", "y1")] names(seg.data) <- c("x2", "y2") dat <- cbind(dat[-nrow(dat), ], seg.data) dat$id <- paste0(dat$street, "-", seq_len(nrow(dat))) dat }) road.segments <- do.call(rbind, road.segments) ``` ## Overview In his map of the 1854 cholera outbreak in London, John Snow uses stacks of bars to represent the number of fatalities at a given address. A location with one fatality is represented by a single, horizontal bar that lies parallel to road where the fatality occurred. A location with five fatalities is represented by five horizontally stacked bars:^[The map was originally published in Snow's 1855 book, "On The Mode Of Communication Of Cholera", and was reprinted as John Snow et. al., 1936. _Snow on Cholera: Being a Reprint of Two Papers_. New York: The Common Wealth Fund. You can also find the map online (a high resolution version is available on the Internet Archive's Wayback Machine, https://web.archive.org/web/20230124072836/https://www.ph.ucla.edu/epi/snow/highressnowmap.html (the original site, which no longer seems available, was www.ph.ucla.edu/epi/snow/highressnowmap.html) and in many books, including Edward Tufte's 1997 "Visual Explanations: Images and Quantities, Evidence and Narrative".] ![](msu-snows-mapB.jpg) In 1992, Rusty Dodson and Waldo Tobler digitized the map. Their data and software are preserved in [Internet Archive's Wayback Machine](https://web.archive.org/web/20100703153945/https://ncgia.ucsb.edu/Publications/Software/cholera/streets).^[The original URL, www.ncgia.ucsb.edu/pubs/snow/snow.html, no longer works.] Their data are also available in Michael Friendly's ['HistData'](https://cran.r-project.org/package=HistData) R package, which is the starting point for the ['cholera'](https://cran.r-project.org/package=cholera) package. These data are plotted below: Each bar and pump is assigned a unique x-y coordinate. Each road is translated into a series of straight line segments, defined by the segment's endpoints. These data are plotted below: ```{r, fig.width = 5, fig.height = 5, fig.align = "left", echo = FALSE} roads.list <- split(roads[, c("x", "y")], roads$street) plot(fatalities[, c("x", "y")], xlim = range(roads$x), ylim = range(roads$y), pch = 15, cex = 0.5, col = "gray", asp = 1) invisible(lapply(roads.list, lines, col = "gray")) points(HistData::Snow.pumps[, c("x", "y")], pch = 17, cex = 1, col = "blue") ``` Despite its appeal, I would argue that stacked bars are visually and computationally problematic. The reason, simply put, is that not all bars are created equal. Even though they are identical in terms of their appearance and the only thing that appears to distinguish them is their location, bars can actually play different roles. Sometimes a bar represents the location of a fatality, sometimes it doesn't. Standalone bars, a stack with a single bar (i.e., an addresses with one fatality), or the bar at the base of a stack represent a location and a count. Bars above the base case do not. They exist only to create the stacking effect to visually represent the number of fatalities at the address. This duality is problematic. Because a map is a visual device that illustrates spatial relationships, it's natural to assume that the position of each element (e.g., each bar) reflects an actual, physical location. When we violate this assumption, we undermine the visual integrity of the map. This can handicap our analysis. This is particularly true given that 44% (257/578) of the bars in Snow's map fall into this second, geographically uninformative category. To address these problems, I "unstack" Dodson and Tobler's data. I do so in two ways. In the first, I give all all cases in a stack (i.e., at the same "address") the same x-y coordinate. These data are available in `fatalities.unstacked`. In the second, I make the address rather than the the case the unit of observation: each address is a single observation with a single x-y coordinate, and the number of cases observed at that location is an attribute of that address. These data are available in `fatalities.address`. ## Data details To illustrate the differences between these two data sets, consider how they handle the largest outlier on Snow's map: the eighteen cases at 38 Broad Street. With `fatalities`, all members of the stack have different coordinates: ```{r} ## The 18 cases at 38 Broad Street ## broad38 <- c(239, 12, 310, 398, 562, 397, 421, 190, 290, 61, 174, 547, 523, 521, 138, 59, 340, 508) fatalities[fatalities$case %in% broad38, ] ``` With `fatalities.unstacked`, all members of the stack have the same coordinate: ```{r} fatalities.unstacked[fatalities.unstacked$case %in% broad38, ] ``` With `fatalities.address`, the 18 cases are represented by a single observation, case 239, which serves as the "address": ```{r} fatalities.address[136:140, ] ``` ## Virtues of "unstacking" To illustrate the virtues of "unstacked" data, consider the following. ### Kernel density contours The graphs below plot the bivariate kernel density contours, of varying bandwidths, on the "stacked" and "unstacked" data. The contours help illustrate the spatial distribution or topography of fatalities, and provide an estimate of the epicenter of the outbreak. With the "stacked" data, `fatalities`, the contours are looser (reflecting lower proximity) and the epicenter is further south than we might expect given that the Broad Street pump (blue triangle)^[The blue triangle is the "correct" location of the pump as included in the amended, second version of the map that appears in the Vestry report. The empty green triangle is the pump's "wrong" location from the original map.] is the accepted source of the outbreak. The problem is that the "vertical" stack of 18 cases (west of the pump at 38 Broad Street) and the "horizontal" stack of 5 cases (south of the pump at 10 Cambridge Street) pull the fit downward in a southerly direction. ```{r, fig.width = 5, fig.height = 5, fig.align = "left", echo = FALSE} roads.list <- split(roads[, c("x", "y")], roads$street) ## Graph parameters ## bw <- 1:4 facets <- paste("Bandwidth =", bw) x.range <- c(11.5, 13.5) y.range <- c(10.5, 12.5) ## Data ## Snow.deathsB <- lapply(rep("fatalities", max(bw)), get) Snow.pumpsB <- lapply(rep("Snow.pumps", max(bw)), get) for (i in seq_along(Snow.deathsB)) { Snow.deathsB[[i]]$facet <- facets[i] } Snow.deathsB2 <- Snow.deathsB Snow.deathsB <- do.call(rbind, Snow.deathsB) # Cambridge Street # street.name <- "Cambridge Street" cambridge.data <- roads[roads$name == street.name, ] cambridge.data <- cambridge.data[order(cambridge.data$x), ] d1 <- cambridge.data[-nrow(cambridge.data), c("x", "y")] d2 <- cambridge.data[-1, c("x", "y")] intercept.slope <-lapply(seq_len(nrow(cambridge.data) - 1), function(i) { coef(lm(y ~ x, data = rbind(d1[i, ], d2[i, ]))) }) sel <- 3 cambridge.angle <- atan(intercept.slope[[sel]][2]) * 180L / pi cambridge.x <- mean(cambridge.data[sel:(sel + 1), "x"]) cambridge.y <- intercept.slope[[sel]][1] + intercept.slope[[sel]][2] * cambridge.x cambridge.df <- data.frame(x = cambridge.x, y = cambridge.y) # Broad Street # street.name <- "Broad Street" broad.data <- roads[roads$name == street.name, ] broad.list <- roads.list[paste(unique(broad.data$street))] broad.list <- lapply(broad.list, function(df) { df[order(df$x, decreasing = TRUE), ] }) broad.pts.data <- do.call(rbind, broad.list) broad.pts.data <- broad.pts.data[seq_len(nrow(broad.pts.data)) %% 2 != 0, ] segment.ols <- lapply(broad.list, function(x) { coef(lm(y ~ x, data = x)) }) sel <- "193" seg.id <- do.call(rbind, strsplit(rownames(broad.pts.data), "[.]"))[, 1] i <- which(seg.id == sel) broad.angle <- atan(segment.ols[[sel]]["x"]) * 180 / pi broad.x <- median(broad.pts.data[i:(i + 1), "x"]) broad.y <- segment.ols[[sel]][1] + segment.ols[[sel]][2] * broad.x broad.df <- data.frame(x = broad.x, y = broad.y) ## Graph ## p <- ggplot(data = Snow.deathsB, aes(x = x, y = y)) + geom_point(color = "gray") + geom_point(data = pumps.vestry, aes(x = x, y = y), color = "blue", pch = 2, size = 2.5, stroke = 0.75) + geom_point(data = pumps, aes(x = x, y = y), color = "#009E73", pch = 2, size = 2.5, stroke = 0.75) + coord_fixed(xlim = x.range, ylim = y.range) + theme_bw() + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(), plot.title = element_text(hjust = 0.5)) + facet_wrap(~ facet, nrow = 2) + ggtitle('"Stacked" Fatalities') for (i in seq_along(roads.list)) { p <- p + geom_path(data = roads.list[[i]], aes(x = x, y = y), color = "lightgray") } for (i in seq_along(bw)) { p <- p + geom_density_2d(data = Snow.deathsB2[[i]], aes(x = x, y = y), color = "red", linewidth = 1/3, h = bw[i]) } p + geom_text(data = broad.df, aes(x = x, y = y), label = "Broad St", angle = broad.angle) + geom_text(data = cambridge.df, aes(x = x, y = y), label = "Cambridge St", angle = cambridge.angle) ``` With `fatalities.unstacked`, the contours are "tighter" (reflecting greater proximity) and the epicenter is located further north, nearer to the pump and to Broad Street: ```{r, fig.width = 5, fig.height = 5, fig.align = "left", echo = FALSE} ## Data ## fatalities.addressB <- lapply(rep("fatalities.address", max(bw)), get) fatalities.unstackedB <- lapply(rep("fatalities.unstacked", max(bw)), get) for (i in seq_along(fatalities.addressB)) { fatalities.addressB[[i]]$facet <- facets[i] } for (i in seq_along(fatalities.unstackedB)) { fatalities.unstackedB[[i]]$facet <- facets[i] } fatalities.addressB <- do.call(rbind, fatalities.addressB) ## Graph ## p <- ggplot(data = fatalities.addressB, aes(x = x, y = y)) + geom_point(color = "gray") + geom_point(data = pumps.vestry, aes(x = x, y = y), color = "blue", pch = 2, size = 2.5, stroke = 0.75) + geom_point(data = pumps, aes(x = x, y = y), color = "#009E73", pch = 2, size = 2.5, stroke = 0.75) + coord_fixed(xlim = x.range, ylim = y.range) + theme_bw() + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(), plot.title = element_text(hjust = 0.5)) + facet_wrap(~ facet, nrow = 2) + ggtitle('"Unstacked" Fatalities') for (i in seq_along(roads.list)) { p <- p + geom_path(data = roads.list[[i]], aes(x = x, y = y), color = "lightgray") } for (i in seq_along(bw)) { p <- p + geom_density_2d(data = fatalities.unstackedB[[i]], aes(x = x, y = y), color = "red", linewidth = 1/3, h = bw[i]) } p + geom_text(data = broad.df, aes(x = x, y = y), label = "Broad St", angle = broad.angle) + geom_text(data = cambridge.df, aes(x = x, y = y), label = "Cambridge St", angle = cambridge.angle) ``` ## "Unstacking" via classification The main roadblock to "unstacking" is that there is no notion of an "address" in the data: bars are merely points and the streets are merely line segments.^[In Friendly's 'HistData' package, these data are called `Snow.deaths` and `Snow.streets`.] Nothing links a point to a segment. And nothing connects one bar in a stack to another bar in the same stack. All elements exist independently of one another. The only reason why the map "works" is that the fatalities and roads data have proximate x-y coordinates. To "unstack" the data, we need to match each bar to a specific road (segment) and to a specific stack. To accomplish these tasks, I use two types of classification. For those interested, the details are found in "computing street addresses", which is available [online](https://github.com/lindbrook/cholera/blob/master/docs/notes/unstacking.bars.notes.md) in this package's GitHub repository.