Start Here – Golden Lab Data Science Handbook

1 Why this documentation exists

Most graduate students do not struggle to manage their research because the science is impossible. They struggle because their workflows are fragile: important files live on personal laptops, results of analyses are copied by hand, versions of scripts are scattered, and the explanation for a key decision is trapped in the memory of someone who has long left the lab.

The goal of this site is to make the robust path the easy path.

2 The working philosophy

I believe¹ that effective computational academic research should adhere to a few core principles:

Always prefer remote and reproducible workflows over local and manual ones;
The code that produces outputs — not the outputs themselves — are the source of scientific truth;
Scientific communication is hard; you can make it easy by narrating your analyses with notebooks;
Invest early in structure and version control so that later work gets easier to manage as your project grows.

R for Data Science makes the same argument in a more general form: for larger projects and collaboration, the lasting record of an analysis should be the scripts, because scripts plus data can recreate the environment, while an IDE session alone cannot.²

3 Principle 1: Remote and Reproducible Workflows 🌐

What is the difference between a fragile workflow and a robust workflow?

Fragile workflow	Robust workflow
Lives on one machine	Lives on supported infrastructure and in version control
Depends on clicking and memory	Depends on scripts, notebooks, and documented steps
Stores only outputs	Stores the method that creates outputs
Breaks when paths, packages, or people change	Survives new machines, new users, and reruns

4 Principle 2: Code is The Source of Truth 💎

What does, “source of truth,” mean in practice?

If I handed you a scientific paper — any paper — and asked you where the real core of the scientific finding was, where would you point me to? The conclusion? No, the conclusion is an interpretation. The figures? Again, no — they are a product of a particular analysis. The reasoning in the introduction? Not really — that’s just a motivating narrative. The real core of the scientific finding is that the authors told you HOW to get from the data to the conclusion. This is found in the methods section, and more specifically the exact procedure that creates the result, repeatably, from the data. It is the repeatability of the method producing the same result that makes it science, in the first place after all. If I cannot repeat your experiment, then I cannot verify your claim, and your conclusion is therefore not scientific.

Important

In the modern data science era, the methods section is the code, and code is the source of truth. The code is the core record of an experiment, which puts it at the heart of what you should be writing, documenting, and sharing as a scientist. It is — without question — your most valuable asset.

Ask yourself: if someone stole your laptop today and tossed it into the Charles River, Google went under and your Scholar ID disappeared, and Dr. Strange erased everyone’s memory of you, would your contribution to science survive? If the answer is “no,” then you have a fragile workflow. If the answer is “yes,” then you have a robust workflow.

To ensure you would survive this totally plausible catastrophe, treat these basic artifacts as invaluable:

A README that clearly describes what the project is and how to run it.
Scripts and/or notebooks that describe how the analysis is conducted.
A lockfile or environment specification such as rv.lock, environment.yml, or requirements.txt that tell us which tools and versions of those tools are needed.
Data provenance that explains where the data came from.
git history that clearly shows what changed and why, each step of the way.

On the other hand, derived products such as:

figures,
rendered reports,
slides,
manuscripts,
and intermediate outputs

Should all be treated as ephemeral details that can be regenerated whenever asked You should never lose sleep over these things because your code should be able to regenerate them at a moment’s notice.

5 Principle 3: Analysis Should Be Narrated 📇

Why should scientists narrate their thinking?

David McCullough is credited with the quote, “Writing is thinking. To write well is to think clearly. That’s why it’s so hard.” If you’ve ever found yourself struggling to articulate an idea on paper that just moments before felt crystal clear in your head, you know exactly what he means. Donald Knuth, the father of algorithm analysis and creator of TeX, felt similarly about programming, and has evangelized literate programming — the practice of writing code that is meant to be read by humans, and that tells a story about the analysis — for decades.³. Today, we’re fortunate to have tools like R Markdown, Jupyter Notebooks, and Quarto that make it easy to write code that is meant to be read by humans, and that tells a story about the analysis. If there is anywhere appropriate for literate programming to improve our work, it’s in science, where the goal is not just to get the right answer, but to explain how we got there and why it matters. By narrating our analysis with notebooks, we afford our brains the opportunity to think through the analysis step by step, making our work more transparent, reproducible, and less prone to error or bias.

6 Principle 4: Robust Habits Compound 🌱

Why invest in engineering as a scientist?

technical debt is a term that software engineers use to describe the cost of maintaining and updating code that was written in a way that is not scalable or maintainable. Because scientists are often less motivated to invest in robust software design patterns, they can end up with a lot of technical debt in their code, which can make it difficult to understand, reproduce, or extend as the science progresses. This can lead to wasted time and effort, or worse, lost insight. By investing early in robust habits such as version control, clean project structure, and clear documentation, you can reduce technical debt and make your work easier to manage as your project grows. These habits compound over time, making it easier to maintain and extend your work in the future.

7 Start-Here Checklist

So now that we know our end goal and working philosophy, what prerequisites do we need to be able to achieve them? Here is a checklist of the most important things to have done:

Complete the Harvard CITI training and share your completion certificate. This is required for you to have access to the human participants data we work with, and to be able to contribute to the research. There are two modules:

Social and Behavioral/Biomedical Human Subject Training
Internet Privacy and Security (IPS) Training

If you’ve never registered for CITI training before, you will need to create an account and affiliate with Harvard University. When you are asked to select a course, select “Social and Behavioral/Biomedical Human Subject Training,” and then select the modules listed above. If you have already completed CITI training for another institution, you can log in with your existing account, affiliate with Harvard University, and then add the two modules listed above to your existing training.

In the section below, you’ll be directed to a lab induction form where you will upload your completion certificate, so make sure to download the certificate from CITI when you finish the training. If you have any questions about the training, please reach out to your RSE or PI.

CITI Retraining

CITI training must be renewed periodically. If you are being asked to retrain, please complete the retraining as soon as possible and upload your new certificate to the lab using this retraining form

Officially be inducted into the Golden Lab by filling out the new member intake form — we use this to grant you access to:

Google Drive, for managing shared documents and non-code, office-oriented work 📁
GitHub, for managing and collaborating on code 🔌
Slack, for work-related communication about projects, sharing information, and questions 💬
WhatsApp, for non-work-related communication and socializing 🎭

…and more!

Set up your FASRC account by following the instructions available here:

Carefully read the instructions; you’ll use the Account Request Tool to request an account using your HarvardKey, and when you are asked to specify the PI, select “Christopher Golden,” and “Nutrition,” as the department. Chris will receive a notification and approve your request as soon as possible, but don’t hesitate to send him an email letting him know you have submitted the request and are excited to get started.

Follow the workflow roadmap guide on the next page to set up your first project and learn how to do reproducible data science.

8 Why Can’t I Just Do This On My Laptop By Myself?

Following this guide will help students:

adhere to security, privacy, and ethical standards for working with human subjects data at Harvard,
set up the optimal compute environment to speed up their work,
avoid making critical mistakes that duplicate, mismanage, expose, or worse, lose sensitive data,
eliminate future technical debt of maintaining your projects by adhering to best practices early,
lower the cost of collaboration across and outside of the lab, and
focus on the science instead of wrestle with technical issues

9 Example of Common Failures You’ll Avoid by Following This Guide

Waiting all day for a piece of code to run on your laptop that could have run in 10 minutes on the cluster,
Losing your work due to inadequate backup strategies,
Writing hardcoded scripts that only work for one person,
Saving important objects only in the interactive environment, only to lose them when the session ends,
Infinitely renaming scripts final_v2_really_final every time you receive feedback from your collaborators,
Accidentally pushing sensitive data to GitHub because the .gitignore was never set up,
Waiting to document the project until after you’ve already forgotten what you did and why,
Struggling to get help because you don’t have a reproducible example of the problem,
Using the wrong version of a package and getting different results than your collaborators…

And much, much more.

10 Next pages

Go through the Workflow Roadmap to practice a full data science project lifecycle.
Review the HPC Infrastructure & Software Guide to get help deciding where work should run and what tools to use.
Go to Projects & Reproducibility Guide to learn about best practices for managing and sharing your work.

11 Further reading

Footnotes

Tinashe, your RSE, who has seen too many projects break because of fragile workflows.↩︎
Hadley Wickham, Mine Cetinkaya-Rundel, and Garrett Grolemund, R for Data Science (2e): Workflow: scripts and projects.↩︎
Donald Knuth, Literate Programming.↩︎