----------------------------------------------------------------------
This is the API documentation for the google_drive_prospector library.
----------------------------------------------------------------------


## Core

Core Functionality


----------------------------------------------------------------------
This is the User Guide documentation for the package.
----------------------------------------------------------------------

# Defining the Google Drive Rclone Prospector


This package is a simple wrapper around the `rclone` command line tool
to help sync the Golden Lab Google Drive to FASRC. It simply calls
`rclone` under the hood, while managing configuration and logs.

The original script that ran the sync was written in bash, but I thought
it would be wise to create a Python wrapper that leaves room for
extension in the future.

This is a pretty straightforward function, so the Python wrapper will be
pretty simple as well. The main goal is to run this script periodically
using a `scrontab` job, and keep logs of the sync process.

This package uses notebook-driven development to create modules from a
notebook (this notebook, to be specific). The code is written in Quarto
notebook cells, and then extracted into a Python module using the Quarto
extensions `sorting-hat` and `ripper`. This keeps the code and
documentation together in a single notebook, so both can be updated in
one place, while still producing a clean Python module for use in our
projects. Documentation is built with `great-docs`, which extracts
documentation from docstrings in the code.

## Imports and Libraries

The functionality is pretty straightforward, using `hydra` and
`omegaconf` for configuration and command line overrides, `subprocess`
for running the `rclone` commands, and `shutil` to check for the
presence of `rclone`. We also use `re` for regular expression matching
to find the dataset directories, `pathlib` to build destination paths,
and `logging` and `time` to log and time each run.

``` python
# Standard library
import logging
import re
import shutil
import subprocess
import sys
import time
from pathlib import Path

# Third party
import hydra
from omegaconf import DictConfig, OmegaConf
```

## Hydra Config

We want this module to be configurable, so the values that would
otherwise be hard-coded live in a `config.yaml` that we read in with Hydra.

To read in the config, we will use Hydra’s `@hydra.main` decorator on
our main function, which will allow us to access the configuration as a
`DictConfig` object. This will make it easy to access our configuration
variables throughout the code.
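
The `config.yaml` itself is not shown in this notebook. As a rough sketch of its shape, the dataclass below lists the fields that `main` reads from `cfg` further down (`run`, `remote`, `destination`, `dataset_pattern`, `dry_run`). The class name `ProspectorConfig` is hypothetical, and the default values are only the examples quoted in the `sync` docstring, not necessarily what the real config contains.

```python
import re
from dataclasses import asdict, dataclass


@dataclass
class ProspectorConfig:
    """Hypothetical schema for the fields `main` reads from the config."""

    run: str = "sync"  # action to perform; "sync" is the only one handled
    remote: str = "csph-googledrive:2. Projects"  # example from the docstring
    destination: str = "/n/holylabs/LABS/cgolden_lab/Lab/data_freeze/golden_googledrive_rclone"
    dataset_pattern: str = r"/[0-9]+\. Datasets/$"  # stored as text, compiled on use
    dry_run: bool = False


example_cfg = ProspectorConfig()
# The pattern is kept as a plain string in the config and compiled when needed.
compiled_pattern = re.compile(example_cfg.dataset_pattern)
print(asdict(example_cfg))
```

Keeping the pattern as a string means it can be changed in the config without touching the code.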

We’ll also be using Hydra’s logging behind the scenes. Now we can move
on to the functionality of the module.

``` python
log = logging.getLogger(__name__)
```

## RClone Functionality

We’ll add a quick checker to make sure `rclone` is available:

``` python
def ensure_rclone() -> None:
    """Exit with status 2 if the `rclone` executable is not on PATH."""
    if shutil.which("rclone") is None:
        log.error("rclone not found in PATH.")
        raise SystemExit(2)
    log.info("rclone is available.")

Each `rclone` operation is run as a subprocess. The first is
`rclone lsf`, which lists the remote’s directories recursively so that
we can pick out the dataset directories:

``` python
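def list_dataset_dirs(remote: str, dataset_pattern: re.Pattern[str]) -> list[str]:
    """Return the remote directories whose path matches `dataset_pattern`."""
    # Recursively list directories only; rclone prints one path per line.
    result = subprocess.run(
        ["rclone", "lsf", "-R", remote, "--dirs-only"],
        text=True,
        capture_output=True,
        check=False,
    )
    if result.returncode != 0:
        # Record rclone's own error output, then exit with its return code.
        log.error("rclone lsf failed: %s", result.stderr.strip())
        raise SystemExit(result.returncode)

    # Match against the raw line (trailing slash included), then strip the slash.
    return [
        line.rstrip("/")
        for line in result.stdout.splitlines()
        if dataset_pattern.search(line)
    ]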
```
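
To make the filtering step concrete, here is a small sketch that applies the same list comprehension to made-up listing output. The directory names are hypothetical, and the pattern is the example given in the `sync` docstring; the real pattern comes from the config.

```python
import re

# Hypothetical `rclone lsf -R --dirs-only` output; directory lines end in "/".
sample_output = """\
Project A/
Project A/1. Datasets/
Project A/1. Datasets/raw/
Project A/2. Analysis/
Project B/
Project B/3. Datasets/
"""

# Example pattern from the `sync` docstring.
dataset_pattern = re.compile(r"/[0-9]+\. Datasets/$")

matches = [
    line.rstrip("/")
    for line in sample_output.splitlines()
    if dataset_pattern.search(line)
]
print(matches)  # ['Project A/1. Datasets', 'Project B/3. Datasets']
```

Because this pattern ends in `$`, subdirectories such as `raw/` are skipped, and because it starts with `/`, a dataset directory at the very top level of the remote would not match.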

Lastly, we set up the sync function:

``` python
def sync(
    remote: str,
    destination: str,
    dataset_pattern: re.Pattern[str],
    dry_run: bool = False,
) -> int:
    """
    Sync dataset directories from a remote location to a local destination.

    Parameters
    ----------
    remote : str
        The remote location to sync from (e.g., "csph-googledrive:2. Projects").
    destination : str
        The local destination to sync to (e.g., "/n/holylabs/LABS/cgolden_lab/Lab/data_freeze/golden_googledrive_rclone").
    dataset_pattern : re.Pattern[str]
        A regular expression pattern to match dataset directories (e.g., r"/[0-9]+\\. Datasets/$").
    dry_run : bool, optional
        If True, perform a dry run without making any changes (default is False).

    Returns
    -------
    int
        Returns 0 if the sync was successful (or no dataset directories were
        found), or 1 if any directory failed to copy.
    """
    log.info("Starting sync: %s -> %s", remote, destination)
    start_time = time.time()

    ensure_rclone()
    dataset_dirs = list_dataset_dirs(remote, dataset_pattern)
    
    if not dataset_dirs:
        log.warning("No dataset directories found.")
        return 0

    failures = 0

    for dataset_path in dataset_dirs:
        dest_path = Path(destination) / dataset_path
        cmd = [
            "rclone",
            "copy",
            f"{remote}/{dataset_path}",
            str(dest_path),
            "--create-empty-src-dirs",
            "--progress",
        ]

        if dry_run:
            cmd.append("--dry-run")

        log.info("Copying: %s/%s -> %s", remote, dataset_path, dest_path)

        result = subprocess.run(cmd, check=False)

        if result.returncode != 0:
            failures += 1
            log.error("Failed: %s", dataset_path)

    elapsed = time.time() - start_time

    log.info("Sync complete in %.2f seconds", elapsed)

    return 1 if failures else 0
```
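
To see what is handed to `rclone` for each dataset directory, the command construction can be pulled out and inspected without running anything. This is only a sketch: `build_copy_command` is a hypothetical helper that mirrors the loop body above, and the remote and destination values are made up.

```python
from pathlib import Path


def build_copy_command(
    remote: str, destination: str, dataset_path: str, dry_run: bool = False
) -> list[str]:
    """Build the `rclone copy` argument list for one dataset directory."""
    cmd = [
        "rclone",
        "copy",
        f"{remote}/{dataset_path}",
        str(Path(destination) / dataset_path),
        "--create-empty-src-dirs",
        "--progress",
    ]
    if dry_run:
        cmd.append("--dry-run")
    return cmd


# Hypothetical inputs, just to show the shape of the command.
cmd = build_copy_command(
    "remote:Projects", "/tmp/mirror", "Project A/1. Datasets", dry_run=True
)
print(cmd)
```

Since the command is passed to `subprocess.run` as a list rather than a shell string, spaces in directory names need no extra quoting.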

Finally, we’ll put in a main function for the CLI:

``` python
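@hydra.main(version_base=None, config_path="conf", config_name="config")
def main(cfg: DictConfig) -> None:
    dataset_pattern = re.compile(cfg.dataset_pattern)
    if cfg.run != "sync":
        raise ValueError(f"Unknown action: {cfg.run}")

    exit_code = sync(
        remote=cfg.remote,
        destination=cfg.destination,
        dataset_pattern=dataset_pattern,
        dry_run=cfg.dry_run,
    )
    # When launched from the command line, the function wrapped by
    # @hydra.main does not hand the task's return value back to the caller,
    # so a failed sync is signalled by raising SystemExit instead.
    if exit_code:
        raise SystemExit(exit_code)

if __name__ == "__main__":
    main()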
```

After running `quarto render`, the module is exported to
`src/google_drive_prospector/cli.py` and can be run from the command
line once it is installed with `uv`.

# Script file

The code for this document can be found here:

- [../src/google_drive_prospector/cli.py](../src/google_drive_prospector/cli.py)


# Sync Logs


The last successful sync was: