# Pipeline Definition
```r
library(SWMadagascar)
```

## Preamble
Our lab distributes data primarily via a `targets` pipeline. `targets` is a powerful R package for orchestrating data analysis pipelines. It manages dependencies, ensures reproducibility, and optimizes performance by rerunning only the parts of the pipeline that have changed.
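To make that dependency tracking concrete, here is a minimal sketch of what a generic `_targets.R` file looks like; the target names and steps are illustrative, not this package's actual pipeline.

```r
# A generic _targets.R sketch (illustrative names, not this pipeline's).
library(targets)

list(
  # Each tar_target() defines one step; targets infers the dependency
  # graph from the variables each command references.
  tar_target(raw, read.csv("data/raw.csv")),
  tar_target(clean, raw[!is.na(raw$mpg), ]),
  tar_target(fit, lm(mpg ~ wt, data = clean))
)
```

If `clean`'s command changes, `targets` rebuilds `clean` and `fit` but leaves `raw` untouched.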
In the lab, we separate the technical concerns of data storage, processing, analysis, and reporting into distinct workspaces:
- Storage: Each project's raw data is stored in a dedicated data repository on Google Drive (`Drive / <PROJECT> / 4. Datasets / <DATASET NAME>`). That dataset is mirrored to FASRC via `rclone` for processing. To work on your own project, you must be on FASRC; the pipeline will then symlink the Google Drive data to your local `data/` directory using the `link_inputs()` function defined in the pipeline.
- Processing and Analysis: The `targets` pipeline handles preliminary data processing and analysis. It reads raw data from the `data/` directory, processes it, and generates intermediate datasets and analysis results in your local workspace. To do this (see the command sketch after this list):
  - Install this package in your local project space using `remotes::install_github()`.
  - Load the package using `library(your_package_name)`.
  - Run the `targets` pipeline by executing `tar_make()` in the R console. This will execute all defined targets, processing the data and generating clean datasets.
  - Access the processed data and analysis results stored in the `_targets/` directory using `tar_load()` or other `targets` functions.
- Analysis: Once you've run the `targets` pipeline, you can access the processed data and analysis results stored in the `_targets/` directory. This data is then used for further analysis and reporting.
- Reporting: Your modelling and analysis are conducted on your own in the project space, using the processed data outputs from the `targets` pipeline. You can create reports, visualizations, and other outputs based on the analysis results. I recommend using Quarto for this purpose.
- Sharing: Finally, when you are ready to share your work, use the respective project's `5. Data & Code Elements` Google Drive folder. This folder is intended for sharing intermediate data products, not raw or final data. You should always be able to reproduce any data product shared here by running the `targets` pipeline from raw data, and then rerunning your own analysis code.
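A minimal sketch of those commands, assuming the package is hosted on GitHub (the `your_lab/SWMadagascar` path below is a placeholder) and that the pipeline defines a target named `raw_data`:

```r
# Placeholder repository path; substitute the lab's actual GitHub location.
remotes::install_github("your_lab/SWMadagascar")

library(SWMadagascar)
library(targets)

tar_make()           # build every target defined in the pipeline
tar_load(raw_data)   # load one built target into the global environment
```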
## build_targets_pipeline
This convenience function builds the `targets` pipeline scaffolding in the user's local project. It provides skeleton code for the `_targets.R` file, along with the list of necessary targets in a separate Quarto file, which is converted to `targets` objects via `tarchetypes::tar_tangle()`.
```r
# build_targets_pipeline()
```

## Pipeline Definition
Below we define the actual targets of the pipeline using `tar_tangle()`.
```r
library(dplyr)  # filter() comes from dplyr

data <- mtcars |>
  filter(mpg < 30)

model <- lm(mpg ~ wt + hp, data = data)
```

## Config
A YAML config file is used to define parameters for the pipeline. At present, the only important parameter is the path to the raw data repository on FASRC.
```r
cfg <- system.file("config.yml", package = "SWMadagascar")
```
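If you want to inspect the parameters directly, the packaged file can be read with the yaml package. The `raw_data_path` key below is a placeholder used for illustration; check `config.yml` for the actual parameter names.

```r
# Read the packaged YAML config; "raw_data_path" is a placeholder key,
# the real key names live in config.yml.
cfg_values <- yaml::read_yaml(cfg)
cfg_values$raw_data_path
```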
## Raw Data Linking

We use the `link_inputs()` function to symlink the raw data from the data repository to your local `data/` directory. If successful, it stores the full list of files in the raw data repository.
```r
raw_data <- link_inputs(cfg_path = cfg)
```
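For intuition, a symlinking step like this might conceptually look like the sketch below; this is an illustration of the idea, not the actual implementation of `link_inputs()`.

```r
# Conceptual sketch only, not the real link_inputs() implementation.
link_inputs_sketch <- function(raw_dir, local_dir = "data") {
  dir.create(local_dir, showWarnings = FALSE)
  files <- list.files(raw_dir, full.names = TRUE)
  # file.symlink() is vectorized and returns TRUE for each link created
  linked <- file.symlink(files, file.path(local_dir, basename(files)))
  files[linked]  # the full list of successfully linked raw-data files
}
```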