# Pipeline Definition
```r
library(SWMadagascar)
```

## Preamble
Our lab distributes data primarily via a `targets` pipeline. `targets` is a powerful R package for orchestrating data analysis pipelines. It manages dependencies, ensures reproducibility, and optimizes performance by rerunning only the parts of the pipeline that have changed.
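To make that dependency tracking concrete, here is a minimal sketch of what a generic `_targets.R` file looks like; the target names and steps are illustrative, not this package's actual pipeline.

```r
# A generic _targets.R sketch (illustrative names, not this pipeline's).
library(targets)

list(
  # Each tar_target() defines one step; targets infers the dependency
  # graph from the variables each command references.
  tar_target(raw, read.csv("data/raw.csv")),
  tar_target(clean, raw[!is.na(raw$mpg), ]),
  tar_target(fit, lm(mpg ~ wt, data = clean))
)
```

If `clean`'s command changes, `targets` rebuilds `clean` and `fit` but leaves `raw` untouched.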
In the lab, we separate the technical concerns of data storage, processing, analysis, and reporting into distinct workspaces:
- Storage: Each project's raw data is stored in a dedicated data repository on Google Drive (`Drive / <PROJECT> / 4. Datasets / <DATASET NAME>`). That dataset is mirrored to FASRC via `rclone` for processing. To work on your own project, you must be on FASRC; the pipeline will then symlink the Google Drive data to your local `data/` directory using the `link_inputs()` function defined in the pipeline.
- Processing and Analysis: The `targets` pipeline handles preliminary data processing and analysis. It reads raw data from the `data/` directory, processes it, and generates intermediate datasets and analysis results in your local workspace. To do this (see the command sketch after this list):
  - Install this package in your local project space using `remotes::install_github()`.
  - Load the package using `library(your_package_name)`.
  - Run the `targets` pipeline by executing `tar_make()` in the R console. This will execute all defined targets, processing the data and generating clean datasets.
  - Access the processed data and analysis results stored in the `_targets/` directory using `tar_load()` or other `targets` functions.
- Analysis: Once you've run the `targets` pipeline, you can access the processed data and analysis results stored in the `_targets/` directory. This data is then used for further analysis and reporting.
- Reporting: Your modelling and analysis are conducted on your own in the project space, using the processed data outputs from the `targets` pipeline. You can create reports, visualizations, and other outputs based on the analysis results. I recommend using Quarto for this purpose.
- Sharing: Finally, when you are ready to share your work, use the respective project's `5. Data & Code Elements` Google Drive folder. This folder is intended for sharing intermediate data products, not raw or final data. You should always be able to reproduce any data product shared here by running the `targets` pipeline from raw data, and then rerunning your own analysis code.
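A minimal sketch of those commands, assuming the package is hosted on GitHub (the `your_lab/SWMadagascar` path below is a placeholder) and that the pipeline defines a target named `raw_data`:

```r
# Placeholder repository path; substitute the lab's actual GitHub location.
remotes::install_github("your_lab/SWMadagascar")

library(SWMadagascar)
library(targets)

tar_make()           # build every target defined in the pipeline
tar_load(raw_data)   # load one built target into the global environment
```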
## build_targets_pipeline
This convenience function builds the `targets` pipeline scaffolding in the user's local project. It provides skeleton code for the `_targets.R` file, along with the list of necessary targets in a separate Quarto file, which is converted to `targets` objects via `tarchetypes::tar_tangle()`.
```r
# build_targets_pipeline()
```

## Pipeline Definition
Below we define the actual targets of the pipeline using `tar_tangle()`.
```r
library(dplyr)  # filter() comes from dplyr

data <- mtcars |>
  filter(mpg < 30)

model <- lm(mpg ~ wt + hp, data = data)
```

## Config
A YAML config file is used to define parameters for the pipeline. At present, the only important parameter is the path to the raw data repository on FASRC.
```r
cfg <- system.file("config.yml", package = "SWMadagascar")
```
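If you want to inspect the parameters directly, the packaged file can be read with the yaml package. The `raw_data_path` key below is a placeholder used for illustration; check `config.yml` for the actual parameter names.

```r
# Read the packaged YAML config; "raw_data_path" is a placeholder key,
# the real key names live in config.yml.
cfg_values <- yaml::read_yaml(cfg)
cfg_values$raw_data_path
```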
## Raw Data Linking

We use the `link_inputs()` function to symlink the raw data from the data repository to your local `data/` directory. If successful, it stores the full list of files in the raw data repository.
```r
raw_data <- link_inputs(cfg_path = cfg)
```
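For intuition, a symlinking step like this might conceptually look like the sketch below; this is an illustration of the idea, not the actual implementation of `link_inputs()`.

```r
# Conceptual sketch only, not the real link_inputs() implementation.
link_inputs_sketch <- function(raw_dir, local_dir = "data") {
  dir.create(local_dir, showWarnings = FALSE)
  files <- list.files(raw_dir, full.names = TRUE)
  # file.symlink() is vectorized and returns TRUE for each link created
  linked <- file.symlink(files, file.path(local_dir, basename(files)))
  files[linked]  # the full list of successfully linked raw-data files
}
```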