---
title: "Duplicate and Missing Cases"
author: "lindbrook"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Duplicate and Missing Cases}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, echo = FALSE, message = FALSE}
knitr::opts_chunk$set(collapse = TRUE, comment = ">")
library(cholera)
library(HistData)
```

John Snow's map of the 1854 cholera outbreak in London is a canonical example of data visualization:^[The map was originally published in Snow's 1855 book, "On The Mode Of Communication Of Cholera", and was reprinted as John Snow et al., 1936. _Snow on Cholera: Being a Reprint of Two Papers_. New York: The Commonwealth Fund. You can also find the map online (a high resolution version is available on the Internet Archive's Wayback Machine, https://web.archive.org/web/20230124072836/https://www.ph.ucla.edu/epi/snow/highressnowmap.html; the original site, www.ph.ucla.edu/epi/snow/highressnowmap.html, no longer seems to be available) and in many books, including Edward Tufte's 1997 "Visual Explanations: Images and Quantities, Evidence and Narrative".]

![](msu-snows-mapB.jpg)

In 1992, Rusty Dodson and Waldo Tobler digitized the map. Their data and software are preserved in the [Internet Archive's Wayback Machine](https://web.archive.org/web/20100703153945/http://ncgia.ucsb.edu/Publications/Software/cholera/streets).^[The original URL, www.ncgia.ucsb.edu/pubs/snow/snow.html, no longer works.] Their data are also available in Michael Friendly's ['HistData'](https://cran.r-project.org/package=HistData) R package, which is the starting point for the ['cholera'](https://cran.r-project.org/package=cholera) package.
These data are plotted below:

```{r, fig.width = 5, fig.height = 5, fig.align = "center", echo = FALSE}
# Group the street segment endpoints by street so each can be drawn as a line.
street.list <- split(Snow.streets[, c("x", "y")], Snow.streets$street)

plot(Snow.deaths$x, Snow.deaths$y, pch = 20, cex = 0.5,
  xlim = range(Snow.streets$x), ylim = range(Snow.streets$y),
  xlab = "x", ylab = "y", asp = 1)
invisible(lapply(street.list, lines, lwd = 0.75))
points(Snow.pumps$x, Snow.pumps$y, pch = 17, col = "blue")
```

### Data problems

However, I would argue that there are two apparent coding errors in these data that stem from three misplaced cases. While the data record 578 bars, they occupy only 575 distinct x-y coordinates.^[There is a lack of consensus about the actual number of cases represented in Snow's map. For what it's worth, I manually recounted the bars on Snow's map and my count matches Dodson and Tobler's.] Three pairs of cases have identical coordinates: 1) 93 and 214; 2) 91 and 241; and 3) 209 and 429.

Within the scheme of stacking bars to represent the number of fatalities at a given "address", this should not occur. Each bar should have its own unique x-y coordinate. For this reason, I believe that any duplicate coordinates are likely to be coding errors.

```{r}
# Rows whose x-y coordinate repeats that of an earlier row.
duplicates <- HistData::Snow.deaths[(duplicated(HistData::Snow.deaths[, c("x", "y")])), ]

# For each duplicate, find every case at that coordinate (match on both x and y).
duplicates.id <- lapply(seq_len(nrow(duplicates)), function(i) {
  sel <- HistData::Snow.deaths$x == duplicates$x[i] &
         HistData::Snow.deaths$y == duplicates$y[i]
  HistData::Snow.deaths[sel, "case"]
})

# Case numbers coincide with row numbers in these data.
HistData::Snow.deaths[unlist(duplicates.id), ]
```

Fortunately, a careful comparison of Snow's map and the map generated by Dodson and Tobler's data reveals that there are also three "missing" bars in the latter.
An expedient "fix" would be to simply use the duplicates to fill in for the "missing" bars:

```{r}
fatalities <- HistData::Snow.deaths

# Coordinates of the three "missing" bars, assigned to cases 91, 93 and 209 in that order.
fix <- data.frame(x = c(12.56974, 12.53617, 12.33145),
                  y = c(11.51226, 11.58107, 14.80316))

fatalities[c(91, 93, 209), c("x", "y")] <- fix
```

This fixed data set is available as `fatalities` in this package and as `Snow.deaths2` in 'HistData' (>= ver. 0.7-8). For those interested, details about how I arrived at these values can be found in `fixFatalities()` and in the "note on duplicate and missing cases", available [online](https://github.com/lindbrook/cholera/blob/master/docs/notes/duplicate.missing.cases.notes.md) in this package's GitHub repository.
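
One way to sanity-check a fix like the one above is to count how many rows share an x-y coordinate with an earlier row, before and after the change. The sketch below is only an illustration: it uses a small made-up data frame rather than the real data, and `countDuplicateCoordinates()` is a hypothetical helper written for this example, not a function in 'cholera' or 'HistData'.

```r
# Hypothetical helper (not part of 'cholera' or 'HistData'): count the rows
# whose x-y coordinate repeats that of an earlier row.
countDuplicateCoordinates <- function(dat) {
  sum(duplicated(dat[, c("x", "y")]))
}

# Made-up toy data: cases 2 and 4 share a coordinate.
toy <- data.frame(case = 1:4,
                  x = c(1.0, 2.0, 3.0, 2.0),
                  y = c(1.0, 2.5, 3.0, 2.5))

countDuplicateCoordinates(toy)

# Move case 2 to a made-up "missing" location, mirroring the fix above.
toy.fixed <- toy
toy.fixed[2, c("x", "y")] <- data.frame(x = 2.2, y = 2.7)

countDuplicateCoordinates(toy.fixed)
```

Applied to the real data, one would expect the same call to report three duplicates for `HistData::Snow.deaths` (578 bars at 575 distinct coordinates) and none for the fixed `fatalities`, provided the new coordinates do not coincide with any existing bar.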