# Introduction

The results of statistical analyses often depend on analysts’ (sometimes arbitrary) decisions, such as which covariates to model or what subsets of data to analyse. Multiverse, or specification curve, analysis is a method whereby the analysts don’t only conduct and report the results from one model, but instead conduct all the relevant and plausible analyses and report all the results (Simonsohn, Simmons, and Nelson 2020; Steegen et al. 2016).

For example, Orben and Przybylski (2019) showed, through analyzing the same datasets in thousands of different ways, that conclusions regarding the association between the psychological well-being of adolescents and their digital technology use critically depend on (mostly) arbitrary decisions in how and which data are analysed (Figure 1).

This blog entry is about the technical aspects of conducting multiverse analyses in R. Specifically, I want to find out easy and flexible methods of specifying and conducting multiverse analyses *in parallel*. I have briefly examined the landscape of R packages that facilitate multiverse analyses, and found that none suited my needs perfectly. In this entry, I therefore try to outline a general and flexible tidyverse-centric (Wickham et al. 2019) multiverse analysis pipeline. I eschew using external packages to maximize flexibility and speed (parallel processing).

Currently, I am aware of three R packages for conducting multiverse analyses. The multiverse package (Sarma et al. 2021) provides extensive functionality for conducting and reporting multiverse analyses, including a “domain specific language” for analyses and reporting. However, while powerful, the package seems somewhat complicated (for the use cases that I have in mind). Frankly, after briefly reviewing the documentation, I don’t know how to use it (but it seems very cool!) mverse aims to make the multiverse package easier to use (Moon et al. 2022). I haven’t explored it much but it only seems to offer `lm()`

and `glm()`

models. specr (maybe most relevant for my use cases in psychology) provides a much simpler set of functions (with less flexibility, however (Masur and Scharkow 2020)).

Another downside of these packages is that they, with multiverse being an exception, don’t provide options for parallel computations. Parallelization is quite important because multiverse analyses can include (tens, hundreds) of thousands of analyses and can therefore take a long time to complete. I started a pull request that aimed to add that functionality to specr, but along the way found that it wasn’t so easy to implement with the current specr syntax and codebase, and my limited R skills.

While thinking about how to best contribute to specr, I realized that multiverse analyses don’t necessarily need extra functions, but can be easily implemented in familiar data analysis pipelines (dplyr and `%>%`

(Wickham et al. 2022); depending on how familiar you are with the tidyverse). This entry is part of my journey of trying to figure out how to flexibly conduct multiverse analyses in parallel in R, and demonstrates a flexible syntax for parallelizing multiverse analyses with `%>%`

lines.

I am not an expert in parallel processing by any means, so would love to know if you have any feedback on how I’ve implemented it below! Let me know in the comments 😄

# Example multiverse analysis with specr

Let’s start with a simple toy example with two outcomes, two predictors, two covariates, and four subgroups, and no prior reason to choose between specifications. That is, we think that `y1`

and `y2`

are equally likely to represent our outcome construct of interest, `x1`

and `x2`

are equally likely to represent the predictor construct, and we can’t choose if or how to include the covariates `c1`

and `c2`

in the model. We might also consider the subgroups defined by `group`

separately (and are not willing to do hierarchical models.) Let’s load the required libraries and show the example data (Table 1):

## Code

```
# Packages
library(kableExtra)
library(scales)
library(ggthemes)
library(tictoc)
library(tidyverse)
# Pretty plots
theme_set(
theme_few(
base_family = "Comic Sans MS",
base_size = 12
)
)
# Pretty tables
<- function(x, full_width = FALSE) {
k2 %>%
x kbl(digits = 2) %>%
kable_classic_2(
html_font = "Arial",
lightable_options = "striped",
full_width = full_width
)
}
# Data generation
<- function(seed = NA, n = 1e5) {
generate_data if (!is.na(seed)) set.seed(seed)
<- tibble(
dat x1 = rnorm(n),
x2 = rnorm(n),
y1 = rnorm(n) + x1*.1,
y2 = rnorm(n) + x1*.2,
c1 = rnorm(n) + x1*.3,
c2 = rnorm(n),
group = sample(c("a", "b", "c", "d"), n, replace = TRUE)
)
}<- generate_data(9) dat
```

x1 | x2 | y1 | y2 | c1 | c2 | group |
---|---|---|---|---|---|---|

-0.77 | 1.10 | -0.36 | -1.23 | 0.78 | -0.77 | d |

-0.82 | -1.68 | -0.50 | -0.79 | -1.08 | -0.81 | b |

-0.14 | -1.89 | -0.67 | 0.71 | -0.89 | -0.66 | b |

-0.28 | -0.98 | 0.68 | -1.40 | 1.24 | -0.25 | c |

0.44 | -0.10 | 0.83 | 0.11 | -0.78 | -0.45 | d |

-1.19 | -0.54 | -0.38 | 2.10 | 1.42 | 1.11 | b |

We can specify a fully crossed multiverse analysis over outcomes, predictors, and covariates, easily with specr. Also, to make the example a bit more interesting (slower!) for later examples, I’ll estimate the model using two functions (`lm()`

and `glm()`

which in this case give the same results, but the latter is much slower). I time the multiverse analysis using tictoc. Table 2 shows the first few rows of the results.

## Code

```
library(specr)
tic()
<- run_specs(
results_specr df = dat,
y = c("y1", "y2"),
x = c("x1", "x2"),
model = c("lm", "glm"),
controls = c("c1", "c2"),
subsets = list(group = unique(dat$group))
)toc()
```

`28.683 sec elapsed`

## Code

```
%>%
results_specr head() %>%
1:10] %>%
.[,kbl(
digits = 2
%>%
) kable_classic_2(html_font = "Arial", full_width = FALSE)
```

x | y | model | controls | estimate | std.error | statistic | p.value | conf.low | conf.high |
---|---|---|---|---|---|---|---|---|---|

x1 | y1 | lm | c1 + c2 | 0.09 | 0.01 | 13.96 | 0.00 | 0.08 | 0.11 |

x2 | y1 | lm | c1 + c2 | 0.00 | 0.01 | 0.48 | 0.63 | -0.01 | 0.02 |

x1 | y2 | lm | c1 + c2 | 0.20 | 0.01 | 29.79 | 0.00 | 0.18 | 0.21 |

x2 | y2 | lm | c1 + c2 | 0.01 | 0.01 | 0.79 | 0.43 | -0.01 | 0.02 |

x1 | y1 | glm | c1 + c2 | 0.09 | 0.01 | 13.96 | 0.00 | 0.08 | 0.11 |

x2 | y1 | glm | c1 + c2 | 0.00 | 0.01 | 0.48 | 0.63 | -0.01 | 0.02 |

Another great thing about this package is that you can easily draw specification curve figures (Figure 2)

## Code

```
plot_specs(
results_specr, choices = c("x", "y", "model", "controls", "subsets")
)
```

However, even with this modest data set and 160 specifications, this took a while.

I first decided to take a stab at parallelizing `run_specs()`

. This turned out to be a bit of a dead end because I couldn’t make the parallelization fit in with how `run_specs()`

works in the back-end.^{1} So instead of shoehorning a parallel back-end to specr, I decided to implement the parallelization in a tidy pipeline. This pipeline, with no additional dependencies (apart from the tidyverse 😉), works pretty well. It of course does not provide specr’s one-liners, but I believe the flexibility of this approach pays back for it.

# Tidymultiverse

## Specification table

The first step in a multiverse analysis is defining the grid of specifications.

The one difficulty here is that the dataset can also be part of the specifications (e.g. different outlier removal thresholds, or more generally any subsets or transformations of the data). If you include the dataset in the table of specifications, you would easily run out of memory (I learned this the hard way). So we will still iterate over the specs table, and pull relevant subsets of the data inside the function that iterates over the specs.

A flexible and easy way to declare the specifications is `expand_grid()`

. This allows creating tables that cross all the variables declared therein. (There are related functions such as `expand()`

, `crossing()`

, and `nesting()`

that allow for more flexibility.)

## Code

```
<- expand_grid(
specs x = c("x1", "x2"),
y = c("y1", "y2"),
covariate = c("x1", "x2"),
model = c("lm", "glm")
)
```

x | y | covariate | model |
---|---|---|---|

x1 | y1 | x1 | lm |

x1 | y1 | x1 | glm |

x1 | y1 | x2 | lm |

x1 | y1 | x2 | glm |

x1 | y2 | x1 | lm |

x1 | y2 | x1 | glm |

But we could also just as well create a grid of formulas. Depending on your analysis, this might be a viable option

## Code

```
expand_grid(
formula = c("y1 ~ x1", "y1 ~ x2", "y1 ~ x1 + c1"), # And so on
model = c("lm", "glm")
)
```

We will stick with specifying variables instead, for this example. We can include subgroups as well:

## Code

```
<- expand_grid(
specs x = c("x1", "x2"),
y = c("y1", "y2"),
covariate = c("x1", "x2"),
model = c("lm", "glm"),
# Cross with all the unique values of `group` in the data
distinct(dat, group)
)
```

x | y | covariate | model | group |
---|---|---|---|---|

x1 | y1 | x1 | lm | d |

x1 | y1 | x1 | lm | b |

x1 | y1 | x1 | lm | c |

x1 | y1 | x1 | lm | a |

x1 | y1 | x1 | glm | d |

x1 | y1 | x1 | glm | b |

Now each row in the table specifies the modelling function (e.g. `lm()`

), the subgroup, and the left-hand and right-hand side variables of the formula to put in the modelling function. Next, we need a function to also expand the covariates to all their combinations (I lifted much of this from the specr source, I found it surprisingly hard to write):

## Code

```
#' Expand a vector of covariate names to all their combinations
#'
#' For example expand_covariate(c("age", "sex")) returns
#' c("1", "age", "sex", "age + sex")
#'
#' @param covariate vector of covariate(s) e.g. c("age", "sex")
#'
#' @return a character vector of all predictor combinations
<- function(covariate) {
expand_covariate list(
"1",
do.call(
"c",
map(
seq_along(covariate),
~combn(covariate, .x, FUN = list))
%>%
) map(~paste(.x, collapse = " + "))
%>%
)
unlist }
```

Do let me know if you come up with something easier!

### The specification table

Putting all this together, and creating the formulas from `y`

, `x`

, and `c`

with `str_glue()`

, we have completed the first part of our pipeline, creating the specifications:

## Code

```
<- expand_grid(
specs x = c("x1", "x2"),
y = c("y1", "y2"),
covariate = expand_covariate(c("c1", "c2")),
model = c("lm", "glm"),
distinct(dat, group)
%>%
) mutate(formula = str_glue("{y} ~ {x} + {covariate}"))
```

x | y | covariate | model | group | formula |
---|---|---|---|---|---|

x1 | y1 | 1 | lm | d | y1 ~ x1 + 1 |

x1 | y1 | 1 | lm | b | y1 ~ x1 + 1 |

x1 | y1 | 1 | lm | c | y1 ~ x1 + 1 |

x1 | y1 | 1 | lm | a | y1 ~ x1 + 1 |

x1 | y1 | 1 | glm | d | y1 ~ x1 + 1 |

x1 | y1 | 1 | glm | b | y1 ~ x1 + 1 |

## Estimating the specifications

Having set up the specifications, all that is left to do is to iterate over them, while at the same time using the correct subsets of data. But before we do so, let’s first think about what we want the output to look like.

### Outputs and errors

Currently, the output of `lm()`

or `glm()`

on each row will be a (g)lm object, from which we need to pull the information we need. In addition, the object will include the data used to estimate the model, and so the output might grow very large very quickly.

So it is best to just get the parameter(s) of interest when iterating over specs. To do that, we create functions to replace the model fitting functions with ones that estimate the model and then only return a table of parameters, and a count of observations in the model.

## Code

```
<- function(formula, data) {
lm2 <- lm(formula = formula, data = data)
fit <- tidy(fit, conf.int = TRUE) # Tidy table of parameters
out <- slice(out, 2) # Second row (slope parameter)
out bind_cols(out, n = nobs(fit))
}lm2(y1 ~ x1, data = dat)
```

term | estimate | std.error | statistic | p.value | conf.low | conf.high | n |
---|---|---|---|---|---|---|---|

x1 | 0.1 | 0 | 31.23 | 0 | 0.09 | 0.1 | 1e+05 |

We now have a neat function (`lm2()`

) that fits the model and extracts the key parameter (Table 6).

In addition, for a general solution, we should be able to handle errors. For example, some specifications might return 0 rows of data, which would break the iteration. To do so, we replace `lm2()`

with a version that returns the output, or a tibble that says that zero observations were found (Table 7).

## Code

```
<- possibly(lm2, otherwise = tibble(n = 0))
lm2 # See what it return when it gets bad input
lm2(group ~ x1, data = dat)
```

n |
---|

0 |

We also do this for `glm()`

.

## Code

```
<- function(formula, data) {
glm2 <- glm(formula = formula, data = data)
fit <- tidy(fit, conf.int = TRUE)
out <- slice(out, 2)
out bind_cols(out, n = nobs(fit))
}<- possibly(glm2, otherwise = tibble(n = 0)) glm2
```

Generally, I would have done this before creating the specs table, but I was trying to start easy 😄. For now, I just replace the model names in specs:

## Code

`<- mutate(specs, model = paste0(model, "2")) specs `

### Iterating over specs with `pmap()`

We are now ready to iterate over specs, and apply `model`

therein to the data and formula specified on each row. To do so, we pipe specs into `pmap()`

(inside `mutate()`

, which means that we are operating inside the specs data frame). `pmap()`

takes a list of arguments, and passes them to a function, `pmap(list(a, b, c), ~some_function())`

. But since we need to pull our function from a string within the list of arguments, our function is in fact the `do.call()`

function caller. We can then pass all our arguments to the function called by `do.call()`

. Freaky.

We will pass `list(model, formula, group)`

to `do.call()`

, that then uses the shorthand `..1`

, `..2`

, etc to take the first, second, etc, argument from the list. Critically, we can also put in another function (`filter()`

) inside the `do.call()`

argument list that will help us subset the data, based on the original arguments.

## Code

```
tic()
<- specs %>%
results_dplyr mutate(
out = pmap(
list(model, formula, group),
~do.call(
1,
..list(
formula = ..2,
data = filter(dat, group == ..3)
)
)
)
)toc()
```

`18.178 sec elapsed`

This then returns a copy of the specs table (`results_dplyr`

) with an additional column `out`

. But `out`

is a data frame column, so to show the values next to our original specs, we can call `unnest()`

(Table 8).

## Code

```
<- results_dplyr %>%
results_dplyr unnest(out)
```

x | y | covariate | model | group | formula | term | estimate | std.error | statistic | p.value | conf.low | conf.high | n |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|

x1 | y1 | 1 | lm2 | d | y1 ~ x1 + 1 | x1 | 0.09 | 0.01 | 14.78 | 0 | 0.08 | 0.11 | 24800 |

x1 | y1 | 1 | lm2 | b | y1 ~ x1 + 1 | x1 | 0.11 | 0.01 | 16.98 | 0 | 0.09 | 0.12 | 25257 |

x1 | y1 | 1 | lm2 | c | y1 ~ x1 + 1 | x1 | 0.10 | 0.01 | 15.61 | 0 | 0.09 | 0.11 | 25045 |

x1 | y1 | 1 | lm2 | a | y1 ~ x1 + 1 | x1 | 0.10 | 0.01 | 15.10 | 0 | 0.08 | 0.11 | 24898 |

x1 | y1 | 1 | glm2 | d | y1 ~ x1 + 1 | x1 | 0.09 | 0.01 | 14.78 | 0 | 0.08 | 0.11 | 24800 |

x1 | y1 | 1 | glm2 | b | y1 ~ x1 + 1 | x1 | 0.11 | 0.01 | 16.98 | 0 | 0.09 | 0.12 | 25257 |

If you noticed above, we already saw an improvement in the run-time of this pipeline over `run_specs()`

, but note that my implementation does not estimate models for the complete data (`subsets`

= `all`

in specr), so it is not a fair comparison.

Nevertheless, now that we have the basic building blocks of the tidy multiverse pipeline collected, let’s focus on what matters; *speed*.

# Parallelizing the tidymultiverse

## multidplyr

To start, we load multidplyr, create a new cluster, and send the required libraries and variables to it.

## Code

```
library(multidplyr)
# Create a new cluster with eight nodes
<- new_cluster(8)
cluster
# Load libraries in and send variables to nodes in the cluster
cluster_library(cluster, c("purrr", "broom", "tidyr", "dplyr"))
cluster_copy(cluster, c("dat", "lm2", "glm2"))
```

Multidplyr integrates seamlessly into `%>%`

lines by sending groups in the passed data to nodes in the cluster. It is therefore important to think a bit about how to group your data. For us, we want to equally divide the `lm()`

and `glm()`

calls across nodes, because `glm()`

is considerably slower. If one node got all the `glm()`

calls, we would have to wait for that one node even after the others had completed.

Here, it makes sense for us to group the data by `formula`

and `group`

. After grouping the data, we `partition()`

it across the nodes in the cluster, run our computations, and then `collect()`

the results back to our main R process. Notice that the `pmap()`

call is identical to above.

## Code

```
tic()
<- specs %>%
results_multidplyr group_by(formula, group) %>%
partition(cluster) %>%
mutate(
out = pmap(
list(model, formula, group),
~do.call(
1,
..list(
formula = ..2,
data = filter(dat, group == ..3)
)
)
)%>%
) collect() %>%
ungroup() %>%
unnest(out)
toc()
```

`3.561 sec elapsed`

This particular parallelization scheme (8 cores working on subsets defined by `formula`

and `group`

in `dat`

) sped up our computations about 8 times compared to the original implementation, and about 4 times compared to the non-parallelized equivalent. Good stuff.

## furrr

I like multidplyr a lot because I can manually specify how the data and computations are assigned across the cluster. I also like that you need to explicitly tell what packages and objects to send to the cluster. As a consequence the syntax grows a bit verbose, however.

As an alternative, the furrr package promises drop-in replacements to purrr’s `map()`

functions that parallelize the computations (Vaughan and Dancho 2022). To use furrr’s functions, we first need to specify the parallelization scheme with `plan()`

. We can then replace `pmap()`

above with `future_pmap()`

. Also, we need to pass objects from the global environment and packages using `furrr_options()`

as shown below. Otherwise we can keep our `%>%`

line exactly the same.

## Code

```
library(furrr)
plan(multisession, workers = 8)
# Pass these global objects to `future_pmap()`
<- furrr_options(
opts globals = list(dat = dat, lm2 = lm2, glm2 = glm2),
packages = c("dplyr", "broom")
)
tic()
<- specs %>%
results_furrr mutate(
out = future_pmap(
list(model, formula, group),
~do.call(
what = ..1,
args = list(
formula = ..2,
data = filter(dat, group == ..3)
)
), .options = opts
)%>%
) unnest(out)
toc()
```

`4.879 sec elapsed`

This worked great. While we don’t have to partition our data, and collect the computations afterwards, furrr does require passing stuff using the `.options`

argument. But this is still a bit less verbose than multidplyr, and perhaps therefore preferred. I like it!

## Checking results

I also spot check that the results are consistent across the methods. I am a bit paranoid with what comes to parallel computation. Table 9 shows that everything is as it should be.

Method | estimate | std.error | conf.low | conf.high | group |
---|---|---|---|---|---|

specr | 0.09 | 0.01 | 0.08 | 0.11 | a |

tidymultiverse | 0.09 | 0.01 | 0.08 | 0.11 | a |

tidymultiverse multidplyr | 0.09 | 0.01 | 0.08 | 0.11 | a |

tidymultiverse furrr | 0.09 | 0.01 | 0.08 | 0.11 | a |

# A visualization

Finally, like any analysis, multiverse analyses need to be visualized for understanding and communicating. Here, we use some ggplot2 magic to create a standard specification curve analysis figure (Figure 3).

## Code

```
library(patchwork)
<- arrange(results_furrr, estimate) %>% mutate(spec = 1:n())
results <- results %>%
p_dash select(spec, p.value, x:group) %>%
pivot_longer(-c(spec, p.value), values_transform = as.character) %>%
ggplot(aes(spec, value, col = p.value < 0.05)) +
scale_color_brewer(palette = "Set1") +
scale_x_continuous(
"Specification"
+
) geom_point(size = 0.5) +
facet_grid(rows = vars(name), scales = "free_y", space = "free_y") +
theme(axis.title.y = element_blank())
<- results %>%
p_curve ggplot(aes(spec, estimate, col = p.value < 0.05)) +
scale_color_brewer(palette = "Set1") +
geom_pointrange(aes(ymin = conf.low, ymax = conf.high), size = .2) +
theme(axis.title.x = element_blank())
/ p_dash) &
(p_curve theme(legend.position = "none")
```

# What else?

Using this method, we can pass whatever modelling functions (e.g. `lmer()`

, `brm()`

) and arguments to them (e.g. append the formula with `(1 | participant)`

for `lmer()`

hierarchical models) and parallelize the iterations quite easily. We can also imagine more complex data subsetting scenarios. For example, we could expand the specs table to include various conditions for filtering data (e.g. outliers). We could then pre-compute those (or do it in `do.call()`

) to dynamically subset data differently in each row of specs.

I hope you found this helpful. If you’ve any feedback, comments are open below and I’d appreciate your thoughts!

## References

*Mverse: Tidy Multiverse Analysis Made Simple*. https://CRAN.R-project.org/package=mverse.

*Nature Human Behaviour*3 (2, 2): 173–82. https://doi.org/10.1038/s41562-018-0506-1.

*OSF Preprints*. https://github.com/MUCollective/multiverse.

*Nature Human Behaviour*, July, 1–7. https://doi.org/10.1038/s41562-020-0912-z.

*Perspectives on Psychological Science*, September. https://doi.org/10.1177/1745691616658637.

*Furrr: Apply Mapping Functions in Parallel Using Futures*. https://CRAN.R-project.org/package=furrr.

*Journal of Open Source Software*4 (43): 1686. https://doi.org/10.21105/joss.01686.

*Dplyr: A Grammar of Data Manipulation*. https://CRAN.R-project.org/package=dplyr.

## Footnotes

It first creates a data frame with the specs, then the requested subsets, and then either applies

`run_spec()`

to all the datasets and specs using`map()`

, or if no subsets were requested, runs the`run_spec()`

on the specs only. So it wasn’t straightforward to parallelize over both data subsets and specs. Parallelizing over specs was simple.↩︎

## Reuse

## Citation

```
@online{vuorre2022,
author = {Matti Vuorre},
title = {Tidymultiverse},
date = {2022-12-07},
url = {https://vuorre.netlify.app/posts/parallel-multiverse},
langid = {en}
}
```