A guide to writing good code
for the busy scientist

Dax Kellie, Shandiya Balasubramaniam, Martin Westgate

@daxkellie.bsky.social

I acknowledge the Traditional Owners of the lands on which I live and work, the Dharawal, Wandi Wandian, and Gadigal people of the Yuin Nation, and pay my respects to Elders past and present. I recognise the significance of this environment to Traditional Owners, and acknowledge their continuing connection to Country.

We all want to write good code

But good code doesn’t just magically happen

Despite many researchers using code, very few are trained in programming.
This increases the chance for messy or broken code in science.

Research waste

Supporting good coding practices at the ALA

https://galah.ala.org.au/ https://galaxias.ala.org.au/

https://labs.ala.org.au/

https://cleaning-data-r.ala.org.au/

How to write good code in R
for science

✖ Example code

# bad example
library(tidyverse)
library(galah)
library(janitor)
library(sf)
library(ozmaps)

galah_config(email="dax.kellie@csiro.au")
alaData<-galah_call()|>identify("perameles")|>filter(year==2003) |>
  select(group="basic",cl22)|>
  atlas_occurrences()|>
  select(recordID,scientificName,decimalLongitude,decimalLatitude,eventDate,cl22)|>
  janitor::clean_names()|>rename(state=cl22)|>
  mutate(event_date=lubridate::ymd(event_date))|>
  group_by(state)|>count()|>drop_na()
aus<-ozmaps::ozmap_states|>
  sf::st_transform(crs=4326)|>
  left_join(alaData, join_by(NAME==state))|>
  replace_na(list(n=0))

ggplot()+sf::geom_sf(data=aus,aes(fill=n),colour="grey60")+
  viridis::scale_fill_viridis(option="F",begin=0.2,direction=-1)+theme_void()+theme(legend.position="right")

Code style

Write code like you write sentences, add spaces

# better example
library(tidyverse)
library(galah)
library(janitor)
library(sf)
library(ozmaps)

galah_config(email = "dax.kellie@csiro.au")
alaData <- galah_call() |> identify("perameles") |> filter(year == 2003) |>
  select(group = "basic", cl22)|>
  atlas_occurrences() |>
  select(recordID, scientificName, decimalLongitude, decimalLatitude, eventDate, cl22) |>
  janitor::clean_names() |> rename(state = cl22) |>
  mutate(event_date = lubridate::ymd(event_date)) |>
  group_by(state) |> count() |> drop_na()
aus <- ozmaps::ozmap_states |>
  sf::st_transform(crs = 4326) |>
  left_join(alaData, join_by(NAME == state)) |>
  replace_na(list(n = 0))

ggplot() + sf::geom_sf(data = aus,aes(fill = n),colour = "grey60") +
  viridis::scale_fill_viridis(option = "F", begin = 0.2, direction = -1) + theme_void() + theme(legend.position = "right")

Code style

Break tasks into smaller code chunks
Line breaks allow code to breathe

# best example
library(tidyverse)
library(galah)
library(janitor)
library(sf)
library(ozmaps)

galah_config(email = "dax.kellie@csiro.au")

## Data

aus <- ozmap_states |>
  st_transform(crs = 4326) # fix projection

bandicoots <- galah_call() |>
  identify("perameles") |>
  filter(year == 2003) |>
  select(group = "basic", cl22) |>
  atlas_occurrences() 

bandicoots_clean <- bandicoots |>
  select(recordID, scientificName, decimalLongitude, 
         decimalLatitude, eventDate, cl22) |>
  janitor::clean_names() |> 
  rename(state = cl22) |>
  mutate(
    event_date = lubridate::ymd(event_date)
    ) 

state_counts <- bandicoots_clean |>
  group_by(state) |> 
  count() |> 
  drop_na()

aus_counts <- aus |>
  left_join(state_counts, join_by(NAME == state)) |>
  replace_na(list(n = 0))

## Map

ggplot() + 
  geom_sf(data = aus_counts,
          aes(fill = n),
          colour = "grey60") +
  viridis::scale_fill_viridis(option = "F", 
                              begin = 0.2, 
                              direction = -1) + 
  theme_void() + 
  theme(legend.position = "right")

Naming things

Bad

Objects

DT_3_final # unclear
map1       # vague
mapOfMarineInvertsFilteredByYear # too long

Functions

func1() # unclear
filter() # unclear, already used
remove_outlier_if_out_of_range() # too long

Good

Objects

cockatoos_grouped
map_inverts
map_inverts_marine

Functions

get_counts() 
remove_outliers() 
has_outlier()

Naming things

Datset variable names

Variable names align with terms in final paper
Define variables in a README or in your Quarto/R markdown file

Example:

site_id = Unique site ID
individual_id = Unique ID of each individual
scientific_name = genus and species name
group = Unique group ID (control, high_condition, low_condition)
leaf_length = Length of leaf (mm)
leaf_width = Width of leaf (mm)
leaf_mass_per_area = Leaf dry mass per area, ratio of leaf size to density (g/m²)
temp_max = Maximum temperature, 2023 (C°)
soil_n = Soil nitrogen (mg/kg)

site_id	individual_id	scientific_name	group	leaf_length	leaf_width	leaf_mass_per_area	temp_max	soil_n
AAC101	B-31	Acacia acuminata	control	70.4	3.3	105.66	42.23	25.67
AAC101	B-32	Acacia acuminata	high_condition	84.0	2.4	102.26	42.23	25.76
ABB011	L-12	Acacia bidwillii	low_condition	8.2	4.2	83.98	43.04	37.34

Code comments

Short but informative
Too many long comments is a good indicator that code is unclear (our example from earlier needs minimal notes)

# best example
library(tidyverse)
library(galah)
library(janitor)
library(sf)
library(ozmaps)

galah_config(email = "dax.kellie@csiro.au")

## Data

aus <- ozmap_states |>
  st_transform(crs = 4326) # fix projection

bandicoots <- galah_call() |>
  identify("perameles") |>
  filter(year == 2003) |>
  select(group = "basic", cl22) |>
  atlas_occurrences() 

bandicoots_clean <- bandicoots |>
  select(recordID, scientificName, decimalLongitude, 
         decimalLatitude, eventDate, cl22) |>
  janitor::clean_names() |> 
  rename(state = cl22) |>
  mutate(
    event_date = lubridate::ymd(event_date)
    ) 

state_counts <- bandicoots_clean |>
  group_by(state) |> 
  count() |> 
  drop_na()

aus_counts <- aus |>
  left_join(state_counts, join_by(NAME == state)) |>
  replace_na(list(n = 0))

## Map

ggplot() + 
  geom_sf(data = aus_counts,
          aes(fill = n),
          colour = "grey60") +
  viridis::scale_fill_viridis(option = "F", 
                              begin = 0.2, 
                              direction = -1) + 
  theme_void() + 
  theme(legend.position = "right")

Code comments

Use long comments to discuss why you did something
- e.g., Interpret results of a statistical test

Example:

# Variables bio_12 and bio_16 are very highly correlated (94%). 
# To avoid multicollinearity, we’ll include only bio_12 and bio_4

Explaining variables

Sometimes you can’t avoid complex code
Breaking up steps into smaller variables or functions can make it easier to understand your code

# Without explaining variable

library(ozmaps)
library(sf)
library(dplyr)

grid <- st_make_grid(ozmap_country,
                     what = "polygons",
                     cellsize = 1.0,
                     square = FALSE,
                     flat_topped = TRUE)

# subset to grid cells that are within land
aus_grid <- oz_grid[as.data.frame(st_intersects(oz_grid, ozmap_country))$row.id]

# With explaining variable

library(ozmaps)
library(sf)
library(dplyr)

grid <- st_make_grid(ozmap_country,
                     what = "polygons",
                     cellsize = 1.0,
                     square = FALSE,
                     flat_topped = TRUE)

hex_on_land <- grid |>
  st_intersects(ozmap_country) |>
  as.data.frame() |>
  pull(row.id)

aus_grid <- grid[hex_on_land]

What if you are a busy scientist short on time?

Making code understandable is more important than making it brief or fancy

👇 Actual terrible code from my PhD, but I still can tell what it does and could use it again (and that’s the important part)!

Summary

Use a clean and readable code style
Naming things
- Name things with simple, understandable names + avoid acronyms
- Make sure variable names are documented, and align with the final paper
Code comments
- Code comments should be short but clear
- Use long code comments to document analytic interpretations
Explaining variables can simplify complex code

When you’re busy, prioritise readability over brevity!

Thank you

Dax Kellie
Data Analyst & Science Lead
Science & Decision Support | ALA
: dax.kellie@csiro.au
: @daxkellie.bsky.social
: @daxkellie

Science & Decision Support team
Martin Westgate, Shandiya Balasubramaniam
Juliet Seers, Amanda Buyan

These slides were made using Quarto & RStudio

Slides:

A guide to writing good code for the busy scientist

We all want to write good code

But good code doesn’t just magically happen

Supporting good coding practices at the ALA

How to write good code in Rfor science

Code style

Code style

Naming things

Naming things

Code comments

Code comments

Explaining variables

What if you are a busy scientist short on time?

Summary

Thank you

A guide to writing good code
for the busy scientist

How to write good code in R
for science