We all want to write good code

But good code doesn’t just magically happen


Despite many researchers using code, very few are trained in programming.
This increases the chance for messy or broken code in science.

Supporting good coding practices at the ALA

How to write good code in R
for science



Example code

# bad example
library(tidyverse)
library(galah)
library(janitor)
library(sf)
library(ozmaps)

galah_config(email="dax.kellie@csiro.au")
alaData<-galah_call()|>identify("perameles")|>filter(year==2003) |>
  select(group="basic",cl22)|>
  atlas_occurrences()|>
  select(recordID,scientificName,decimalLongitude,decimalLatitude,eventDate,cl22)|>
  janitor::clean_names()|>rename(state=cl22)|>
  mutate(event_date=lubridate::ymd(event_date))|>
  group_by(state)|>count()|>drop_na()
aus<-ozmaps::ozmap_states|>
  sf::st_transform(crs=4326)|>
  left_join(alaData, join_by(NAME==state))|>
  replace_na(list(n=0))

ggplot()+sf::geom_sf(data=aus,aes(fill=n),colour="grey60")+
  viridis::scale_fill_viridis(option="F",begin=0.2,direction=-1)+theme_void()+theme(legend.position="right")

Code style

  • Write code like you write sentences, add spaces
# better example
library(tidyverse)
library(galah)
library(janitor)
library(sf)
library(ozmaps)

galah_config(email = "dax.kellie@csiro.au")
alaData <- galah_call() |> identify("perameles") |> filter(year == 2003) |>
  select(group = "basic", cl22)|>
  atlas_occurrences() |>
  select(recordID, scientificName, decimalLongitude, decimalLatitude, eventDate, cl22) |>
  janitor::clean_names() |> rename(state = cl22) |>
  mutate(event_date = lubridate::ymd(event_date)) |>
  group_by(state) |> count() |> drop_na()
aus <- ozmaps::ozmap_states |>
  sf::st_transform(crs = 4326) |>
  left_join(alaData, join_by(NAME == state)) |>
  replace_na(list(n = 0))

ggplot() + sf::geom_sf(data = aus,aes(fill = n),colour = "grey60") +
  viridis::scale_fill_viridis(option = "F", begin = 0.2, direction = -1) + theme_void() + theme(legend.position = "right")

Code style

  • Break tasks into smaller code chunks
  • Line breaks allow code to breathe
# best example
library(tidyverse)
library(galah)
library(janitor)
library(sf)
library(ozmaps)

galah_config(email = "dax.kellie@csiro.au")

## Data

aus <- ozmap_states |>
  st_transform(crs = 4326) # fix projection

bandicoots <- galah_call() |>
  identify("perameles") |>
  filter(year == 2003) |>
  select(group = "basic", cl22) |>
  atlas_occurrences() 

bandicoots_clean <- bandicoots |>
  select(recordID, scientificName, decimalLongitude, 
         decimalLatitude, eventDate, cl22) |>
  janitor::clean_names() |> 
  rename(state = cl22) |>
  mutate(
    event_date = lubridate::ymd(event_date)
    ) 

state_counts <- bandicoots_clean |>
  group_by(state) |> 
  count() |> 
  drop_na()

aus_counts <- aus |>
  left_join(state_counts, join_by(NAME == state)) |>
  replace_na(list(n = 0))

## Map

ggplot() + 
  geom_sf(data = aus_counts,
          aes(fill = n),
          colour = "grey60") +
  viridis::scale_fill_viridis(option = "F", 
                              begin = 0.2, 
                              direction = -1) + 
  theme_void() + 
  theme(legend.position = "right")

Naming things

Bad

Objects

DT_3_final # unclear
map1       # vague
mapOfMarineInvertsFilteredByYear # too long


Functions

func1() # unclear
filter() # unclear, already used
remove_outlier_if_out_of_range() # too long
Good

Objects

cockatoos_grouped
map_inverts
map_inverts_marine


Functions

get_counts() 
remove_outliers() 
has_outlier()

Naming things

Datset variable names

  • Variable names align with terms in final paper
  • Define variables in a README or in your Quarto/R markdown file

Example:

site_id = Unique site ID
individual_id = Unique ID of each individual
scientific_name = genus and species name
group = Unique group ID (control, high_condition, low_condition)
leaf_length = Length of leaf (mm)
leaf_width = Width of leaf (mm)
leaf_mass_per_area = Leaf dry mass per area, ratio of leaf size to density (g/m2)
temp_max = Maximum temperature, 2023 (C°)
soil_n = Soil nitrogen (mg/kg)

site_id individual_id scientific_name group leaf_length leaf_width leaf_mass_per_area temp_max soil_n
AAC101 B-31 Acacia acuminata control 70.4 3.3 105.66 42.23 25.67
AAC101 B-32 Acacia acuminata high_condition 84.0 2.4 102.26 42.23 25.76
ABB011 L-12 Acacia bidwillii low_condition 8.2 4.2 83.98 43.04 37.34

Code comments

  • Short but informative
  • Too many long comments is a good indicator that code is unclear (our example from earlier needs minimal notes)
# best example
library(tidyverse)
library(galah)
library(janitor)
library(sf)
library(ozmaps)

galah_config(email = "dax.kellie@csiro.au")

## Data

aus <- ozmap_states |>
  st_transform(crs = 4326) # fix projection

bandicoots <- galah_call() |>
  identify("perameles") |>
  filter(year == 2003) |>
  select(group = "basic", cl22) |>
  atlas_occurrences() 

bandicoots_clean <- bandicoots |>
  select(recordID, scientificName, decimalLongitude, 
         decimalLatitude, eventDate, cl22) |>
  janitor::clean_names() |> 
  rename(state = cl22) |>
  mutate(
    event_date = lubridate::ymd(event_date)
    ) 

state_counts <- bandicoots_clean |>
  group_by(state) |> 
  count() |> 
  drop_na()

aus_counts <- aus |>
  left_join(state_counts, join_by(NAME == state)) |>
  replace_na(list(n = 0))

## Map

ggplot() + 
  geom_sf(data = aus_counts,
          aes(fill = n),
          colour = "grey60") +
  viridis::scale_fill_viridis(option = "F", 
                              begin = 0.2, 
                              direction = -1) + 
  theme_void() + 
  theme(legend.position = "right")

Code comments

  • Use long comments to discuss why you did something
    • e.g., Interpret results of a statistical test

Example:

# Variables bio_12 and bio_16 are very highly correlated (94%). 
# To avoid multicollinearity, we’ll include only bio_12 and bio_4

Explaining variables

  • Sometimes you can’t avoid complex code
  • Breaking up steps into smaller variables or functions can make it easier to understand your code
# Without explaining variable

library(ozmaps)
library(sf)
library(dplyr)

grid <- st_make_grid(ozmap_country,
                     what = "polygons",
                     cellsize = 1.0,
                     square = FALSE,
                     flat_topped = TRUE)

# subset to grid cells that are within land
aus_grid <- oz_grid[as.data.frame(st_intersects(oz_grid, ozmap_country))$row.id]
# With explaining variable

library(ozmaps)
library(sf)
library(dplyr)

grid <- st_make_grid(ozmap_country,
                     what = "polygons",
                     cellsize = 1.0,
                     square = FALSE,
                     flat_topped = TRUE)

hex_on_land <- grid |>
  st_intersects(ozmap_country) |>
  as.data.frame() |>
  pull(row.id)

aus_grid <- grid[hex_on_land]

What if you are a busy scientist short on time?

  • Making code understandable is more important than making it brief or fancy

👇 Actual terrible code from my PhD, but I still can tell what it does and could use it again (and that’s the important part)!

Summary

  • Use a clean and readable code style
  • Naming things
    • Name things with simple, understandable names + avoid acronyms
    • Make sure variable names are documented, and align with the final paper
  • Code comments
    • Code comments should be short but clear
    • Use long code comments to document analytic interpretations
  • Explaining variables can simplify complex code


When you’re busy, prioritise readability over brevity!