Today I learned about exploring multivariate data using tours of projections into lower dimensions. The tourr R package makes it easy to experiment with different tours. Let’s go on a grand tour!
The authors introduce data tours as a way to interactively visualize high-dimensional data. They also highlight the rich history of this field, including the PRIM-9 system created at Stanford in the early 1970s.
The tourr R package provides user-friendly functions to run a tour.
A gorilla hiding in plain sight
In 2020, Itai Yanai and Martin Lercher asked whether “focus on a specific hypothesis prevents the exploration of other aspects of the data”. 1 They simulated a dataset with two variables, bmi and steps, for both male and female subjects. Let’s start with a similar dataset 2.
gorilla <- read.csv(paste0(
  "https://gist.githubusercontent.com/tomsing1/",
  "d29496382e8b8f4163c34df46b00686f/raw/",
  "40c0b7b5d25fff188a7365df59aa8634fef9adb9/gorilla.csv"
))
with(gorilla, plot(steps, bmi, col = ifelse(group == "M", "navy", "firebrick")))
Here, we want to examine only the numerical measurements (e.g. bmi and steps), so let’s set the categorical group variable aside and add two noise variables, giving us a dataset with four numerical variables.
for (dimension in paste0("noise", 1:2)) {
  gorilla[[dimension]] <- rnorm(n = nrow(gorilla))
}
numeric_cols <- setdiff(colnames(gorilla), "group")
head(gorilla)
       bmi      steps group     noise1     noise2
1 29.96000   145.6311     F  0.2262401 -0.2269717
2 29.89818 10048.5437     M -1.0365464 -0.9440436
3 23.46909  3859.2233     M  0.5676465 -0.6378377
4 26.03455  7718.4466     M -2.3049200 -0.5047930
5 19.51273 10776.6990     M  0.4935926 -0.5288485
6 29.65091  3932.0388     M  1.0021255 -1.1318410
Plotting all pairwise combinations of the four numerical variables quickly reveals the gorilla hidden in the bmi ~ steps relationship:
pairs(gorilla[, numeric_cols], pch = ".")
Taking tours
library(tourr)
library(gifski) # to create animated gifs
The little tour cycles through all axis-parallel projections, reproducing all of the static plots we obtained with the pairs() call above (corresponding to 90-degree angles between the axes) as well as the intermediate projections in between.
As expected, the gorilla cartoon reveals itself whenever the steps and bmi variables are projected into the x and y coordinates.
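A little tour like this can be rendered to an animated gif roughly as follows. This is a minimal sketch rather than the exact call used for the animation: it assumes tourr's `render_gif()`, `little_tour()` and `display_xy()`, and the output path, the frame count and the simulated stand-in data (used only if `gorilla` and `numeric_cols` are not already in the session) are my own choices.

```r
# Stand-in data so the sketch runs on its own; skipped when `gorilla` and
# `numeric_cols` from the steps above already exist in the session.
if (!exists("gorilla") || !exists("numeric_cols")) {
  set.seed(1)
  gorilla <- data.frame(
    bmi = runif(200, 15, 32),
    steps = runif(200, 0, 15000),
    noise1 = rnorm(200),
    noise2 = rnorm(200)
  )
  numeric_cols <- colnames(gorilla)
}

# Standardise the columns first: bmi and steps are on very different scales,
# and steps would otherwise dominate every projection.
tour_data <- scale(as.matrix(gorilla[, numeric_cols]))

gif_file <- file.path(tempdir(), "little_tour.gif") # hypothetical output path
rendered <- FALSE
if (requireNamespace("tourr", quietly = TRUE) &&
    requireNamespace("gifski", quietly = TRUE)) {
  tourr::render_gif(
    tour_data,
    tour_path = tourr::little_tour(), # cycle through axis-parallel projections
    display = tourr::display_xy(),    # draw each frame as a 2-D scatter plot
    gif_file = gif_file,
    frames = 100                      # arbitrary; more frames give a longer gif
  )
  rendered <- file.exists(gif_file)
}
```

Scaling by hand with `scale()` keeps the sketch independent of whatever default rescaling the installed tourr version applies.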
The grand tour repeatedly picks a new projection at random and smoothly interpolates from the current projection to the new one, so that over time it comes arbitrarily close to every possible projection of the data into the selected number of dimensions (here: 2). With a very high-dimensional dataset, traversing all possibilities can take quite a while.
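To make "picking a projection at random" a bit more concrete, here is a base R sketch of a single random 2-D projection of four-dimensional data. It illustrates the idea only and is not how tourr implements the grand tour internally: the QR-based construction and the simulated matrix `X` (a stand-in for the scaled gorilla data) are my assumptions.

```r
set.seed(42)
p <- 4 # number of numerical variables (bmi, steps, noise1, noise2)

# Draw a random p x 2 matrix and orthonormalise its columns via QR.
basis <- qr.Q(qr(matrix(rnorm(p * 2), nrow = p, ncol = 2)))

# The two columns are orthogonal unit vectors: t(basis) %*% basis is I_2.
stopifnot(isTRUE(all.equal(crossprod(basis), diag(2))))

# Stand-in for the scaled data: 50 observations of p variables.
X <- scale(matrix(rnorm(50 * p), ncol = p))

# Projecting the n x p data gives the n x 2 coordinates shown in one frame.
frame_xy <- X %*% basis
plot(frame_xy, xlab = "projection 1", ylab = "projection 2", pch = 20)
```

A tour strings many such bases together and interpolates between consecutive ones. With tourr attached, the animated version should be along the lines of `animate_xy(scale(gorilla[, numeric_cols]), tour_path = grand_tour())`.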