Lots of researchers use code

…but few researchers share code…


Of 352 studies, ~80% shared data but only ~20% shared code

…and only some code works correctly


The chance that a script reproduces its results is usually lower than a coin toss

Code reproducibility is hard


  • Self-trained coders
  • Packages constantly update to newer versions
  • The environment and organisms we study have lots of variation
  • Academic institutions & funding schemes incentivise speed & “novel” results over reproducibility & longevity
  • Researchers have lots to do

…but maybe it’s just hard to change…


Code availability requirements

  • Journals require code availability statements
  • Peer reviewers can require that code be provided

Reproducibility is fundamental to science

Supporting reproducibility at the ALA

Let’s make a code snippet reproducible

Our code snippet

library(tidyverse)
library(galah)
setwd("C:/Users/KEL329/OneDrive - CSIRO/Documents/ALA/Talks/ESA2024")

galah_config(email = "dax.kellie@csiro.au")
alaData <- galah_call()|>identify("perameles")|>filter(year == 2003) |>
  select(group="basic",cl22)|>
  atlas_occurrences()|>
  select(recordID,scientificName,decimalLongitude,decimalLatitude,eventDate,cl22) |>
  janitor::clean_names()|>rename(state=cl22) |>
  mutate(event_date=lubridate::ymd(event_date)) |>
  group_by(state) |>count() |>drop_na()
mmap <- ozmaps::ozmap_states |>
  sf::st_transform(crs=4326) |>
  left_join(alaData, join_by(NAME==state)) |>
  replace_na(list(n=0))

ggplot() + geom_sf(data=mmap,aes(fill=n),colour="grey60")+
  viridis::scale_fill_viridis(option="F",begin=0.2,direction=-1)+theme_void()+theme(legend.position="right")

Our code snippet

# Title: Map - number of bandicoot observations
# Author: Dax Kellie
# Date: 2024-11-16

setwd("C:/Users/KEL329/OneDrive - CSIRO/Documents/ALA/Talks/ESA2024")

# packages
library(tidyverse)
library(galah)
library(janitor)
library(sf)
library(ozmaps)

galah_config(email = "dax.kellie@csiro.au")

# download map of Australia
aus <- ozmap_states |>
  st_transform(crs = 4326) # fix projection

# download bandicoot records
bandicoots <- galah_call() |>
  identify("perameles") |>
  filter(year == 2003) |>
  select(group = "basic", cl22) |>
  atlas_occurrences() 

# filter data, rename column, fix date class
bandicoots_cleaned <- bandicoots |>
  select(recordID, scientificName, decimalLongitude, 
         decimalLatitude, eventDate, cl22) |>
  janitor::clean_names() |> 
  rename(state = cl22) |>
  mutate(
    event_date = lubridate::ymd(event_date)
    ) 

# counts by state/territory
state_counts <- 
  bandicoots_cleaned |>
  group_by(state) |> 
  count() |> 
  drop_na()

# join map with counts
aus_counts <- 
  aus |>
  left_join(state_counts, join_by(NAME == state)) |>
  replace_na(list(n = 0))

# Map
ggplot() + 
  geom_sf(data = aus_counts,
          aes(fill = n),
          colour = "grey60") +
  viridis::scale_fill_viridis(option = "F", 
                              begin = 0.2, 
                              direction = -1) + 
  theme_void() + 
  theme(legend.position = "right")

This code reads better, but it’s no more reproducible!
Code reproducibility depends on a reproducible work environment

It’s all about the setup
aka a reproducible project environment

6 simple steps
to make your R code run again

1. Create an R Project

  • R projects use .Rproj files to tell R where your project’s top folder directory is
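
As a minimal sketch of why this matters: once R knows the project's top folder, files can be referred to with short relative paths instead of a computer-specific setwd() call. The folder and file names below are hypothetical, chosen only for illustration.

```r
# Hypothetical file locations, written relative to the project's top folder
data_path <- file.path("data", "2024-11-16_bandicoots.csv")
plot_path <- file.path("plots", "map_counts-by-state.png")

data_path
plot_path

# With the project opened through its .Rproj file, the working directory
# is the top folder, so no setwd() call is needed before reading or saving:
# bandicoots <- read.csv(data_path)
# ggplot2::ggsave(plot_path)
```

Because neither path starts with a drive letter or a user name, the same script can run unchanged on a collaborator's computer.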

2. Use GitHub

An online platform for storing project repositories

GitHub is useful for reasons other than collaborative code writing, too!

  • Front-facing README files improve project documentation
  • Easy to share projects with others
  • Find old code. Copy/Paste it back in

To get set up:

Tip: Set up GitHub with {usethis}

use_git() and use_github() initialise Git in a local directory and link it to a GitHub repository, and it’s fast

3. Organised folder structure

  • Straightforward folder structure
  • This might require ongoing maintenance

Example folder structure
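
One possible layout, sketched in base R. The folder names here are assumptions rather than a prescribed standard, and the project is created in a temporary directory so the example can be run safely anywhere.

```r
# Hypothetical folder names; adjust to suit the project
project_root <- file.path(tempdir(), "bandicoot-map")  # temporary location for this demo
folders <- c("data/raw", "data/processed", "scripts", "plots")

# Create each folder (and any parent folders it needs)
for (folder in folders) {
  dir.create(file.path(project_root, folder), recursive = TRUE, showWarnings = FALSE)
}

# List what was created
list.dirs(project_root, full.names = FALSE, recursive = TRUE)
```

Separating raw data from processed data is one common choice: it makes it harder to overwrite the original download by accident.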

4. Readable file names


Jenny Bryan’s file name holy trinity

  • machine readable
  • human readable
  • plays well with default ordering

Bad:

  • dat2024_bsrFinalDK-new.csv
  • script.R

Good:

  • 2024-11-16_bandicoots.csv
  • map_counts-by-state.R
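
A small sketch of what "machine readable" and "plays well with default ordering" can mean in practice, assuming dates are written year-month-day at the start of the name. The file names are hypothetical.

```r
# Hypothetical file names with a year-month-day prefix
files <- c(
  "2024-11-16_bandicoots.csv",
  "2023-12-30_bandicoots.csv",
  "2024-02-03_bandicoots.csv"
)

# Plays well with default ordering: alphabetical sort is also chronological
sorted_files <- sort(files, method = "radix")

# Machine readable: the date can be pulled out by position and parsed
file_dates <- as.Date(substr(sorted_files, 1, 10))

# Machine readable: "_" separates the date from the description
descriptions <- sub("\\.csv$", "", sub("^[^_]*_", "", sorted_files))

sorted_files
file_dates
descriptions
```

Using "_" between parts of the name and "-" within a part is one convention that keeps names easy to split with code and easy to read by eye.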

5. Record packages & versions

From most comprehensive to easiest to use:

  • {renv}

    • init(), snapshot(), restore()
  • {groundhog}

    • groundhog.library(pkg-name, date)
  • sessionInfo()

    • sessionInfo() |> report::report()
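
A minimal sketch of the simplest option, using only base R. The output file name is hypothetical and is written to a temporary directory for the demo. The {renv} and {groundhog} calls are shown as comments because they install packages or modify the project folder.

```r
# Simplest option: write the session details to a text file kept with the project
# ("session-info.txt" is a hypothetical file name; tempdir() is used for the demo)
info_file <- file.path(tempdir(), "session-info.txt")
writeLines(capture.output(sessionInfo()), info_file)

# Versions of R and of individual packages can also be checked directly
r_version <- R.version.string
stats_version <- as.character(packageVersion("stats"))

# The more comprehensive options, left commented out:
# renv::init()      # set up a project-specific package library
# renv::snapshot()  # record package versions in a lockfile (renv.lock)
# renv::restore()   # reinstall the recorded versions later
# groundhog::groundhog.library("galah", "2024-11-16")  # load galah as available on this date
```

A text file of session details only documents versions; it does not reinstall them, which is the trade-off against {renv} and {groundhog}.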

6. Back up your data

Locally & online

  • Zenodo, Open Science Framework


  • Generate a DOI for your data

    • DOIs persist even if URLs change


library(galah)
galah_config(email = "dax.kellie@csiro.au", verbose = FALSE) # ALA email

# download data
bandicoots <- galah_call() |>
  identify("perameles") |>
  filter(year == 2004) |>
  atlas_occurrences(mint_doi = TRUE) # add a data DOI

attributes(bandicoots)$doi # see DOI
[1] "https://doi.org/10.26197/ala.78e21acd-7516-4fb2-91fe-9b86f5fcd83b"


# retrieve data
galah_call() |>
  filter(doi == "https://doi.org/10.26197/ala.974f0355-5ad8-439e-91c1-709a49df1086") |>
  atlas_occurrences()
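
Alongside a DOI, a local copy can be kept with the project. The sketch below uses a small stand-in data frame with made-up values in place of a real download, and a hypothetical file name saved to a temporary directory.

```r
# Stand-in for a downloaded dataset (made-up values, for illustration only)
bandicoots <- data.frame(
  record_id = c("a1", "b2", "c3"),
  state = c("Victoria", "Tasmania", "Victoria")
)

# Hypothetical location; in a real project this might be a data/ folder
backup_path <- file.path(tempdir(), "2024-11-16_bandicoots.csv")

# Save a local copy so the analysis does not depend on a fresh download
write.csv(bandicoots, backup_path, row.names = FALSE)

# Later (or on another computer), read the saved copy back in
bandicoots_saved <- read.csv(backup_path)
```

Reading from the saved file means the analysis starts from exactly the same records each time, even if the online database has since been updated.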

Let’s return to our code snippet

Our code snippet

# Title: Map - number of bandicoot observations
# Author: Dax Kellie
# Date: 2024-11-16

setwd("C:/Users/KEL329/OneDrive - CSIRO/Documents/ALA/Talks/ESA2024")

# packages
library(tidyverse)
library(galah)
library(janitor)
library(sf)
library(ozmaps)

galah_config(email = "dax.kellie@csiro.au")

# download map of Australia
aus <- ozmap_states |>
  st_transform(crs = 4326) # fix projection

# download bandicoot records
bandicoots <- galah_call() |>
  identify("perameles") |>
  filter(year == 2003) |>
  select(group = "basic", cl22) |>
  atlas_occurrences() 

# filter data, rename column, fix date class
bandicoots_cleaned <- bandicoots |>
  select(recordID, scientificName, decimalLongitude, 
         decimalLatitude, eventDate, cl22) |>
  janitor::clean_names() |> 
  rename(state = cl22) |>
  mutate(
    event_date = lubridate::ymd(event_date)
    ) 

# counts by state/territory
state_counts <- 
  bandicoots_cleaned |>
  group_by(state) |> 
  count() |> 
  drop_na()

# join map with counts
aus_counts <- 
  aus |>
  left_join(state_counts, join_by(NAME == state)) |>
  replace_na(list(n = 0))

# Map
ggplot() + 
  geom_sf(data = aus_counts,
          aes(fill = n),
          colour = "grey60") +
  viridis::scale_fill_viridis(option = "F", 
                              begin = 0.2, 
                              direction = -1) + 
  theme_void() + 
  theme(legend.position = "right")

Our code snippet

# Title: Map - number of bandicoot observations
# Author: Dax Kellie
# Date: 2024-11-16

# packages
library(tidyverse)
library(galah)
library(janitor)
library(sf)
library(ozmaps)

galah_config(email = "dax.kellie@csiro.au")

# download map of Australia
aus <- ozmap_states |>
  st_transform(crs = 4326) # fix projection

# download bandicoot records
bandicoots <- galah_call() |>
  identify("perameles") |>
  filter(year == 2003) |>
  select(group = "basic", cl22) |>
  atlas_occurrences() 

# filter data, rename column, fix date class
bandicoots_cleaned <- bandicoots |>
  select(recordID, scientificName, decimalLongitude, 
         decimalLatitude, eventDate, cl22) |>
  janitor::clean_names() |> 
  rename(state = cl22) |>
  mutate(
    event_date = lubridate::ymd(event_date)
    ) 

# counts by state/territory
state_counts <- 
  bandicoots_cleaned |>
  group_by(state) |> 
  count() |> 
  drop_na()

# join map with counts
aus_counts <- 
  aus |>
  left_join(state_counts, join_by(NAME == state)) |>
  replace_na(list(n = 0))

# Map
ggplot() + 
  geom_sf(data = aus_counts,
          aes(fill = n),
          colour = "grey60") +
  viridis::scale_fill_viridis(option = "F", 
                              begin = 0.2, 
                              direction = -1) + 
  theme_void() + 
  theme(legend.position = "right")

Summary

Your code will run again!

Code reproducibility depends on a reproducible working environment

- R projects
- GitHub
- Organised folder
- Well-named files
- Document package versions
- Data stored locally & online

On reproducibility

When 174 analyst teams were asked to use 2 datasets to answer 2 ecology/evolution questions, results were all over the grid.

Every decision a researcher makes affects the result and its interpretation.
To interpret scientific evidence, one must be able to reproduce & interrogate the analytic steps that led to a result.

Other useful resources