Appendix A — Tooling reference

This appendix is a one-stop reference for the tools the book uses, grouped by the job they do. For each, it gives what the tool is for, the command or entry point that gets you started, and the chapter that introduces it. You do not need all of them, and you certainly do not need them all at once — reach for each when the problem it solves actually appears in your work. Where two tools do a similar job, the more modern or widely used default is listed first.

A.1 Version control and collaboration

Tool	What it’s for	Where to start	Ch.
Git	Versioning code and recording the history of decisions	`git init`, `git add`, `git commit`	2
nbdime	Notebook-aware diffs and merges (readable `.ipynb` changes)	`nbdiff notebook.ipynb`	2
Jupytext	Pairing a notebook with a plain-text script that diffs cleanly	`jupytext --set-formats ipynb,py:percent nb.ipynb`	2
nbstripout	Stripping notebook outputs before commit	`nbstripout --install`	2
GitHub (or GitLab)	Hosting repositories, pull requests, and code review	open a pull request	17

A.2 Environments and packaging

Tool	What it’s for	Where to start	Ch.
venv	Isolating a project’s Python environment	`python -m venv .venv`	3
pip-tools	Compiling an abstract spec into a pinned lockfile	`pip-compile requirements.in`	3
uv	A fast, modern alternative for environments and locking	`uv pip compile`, `uv venv`	3
conda / mamba	Environments with heavy binary or GPU dependencies	`conda create -n proj`	3
pyproject.toml	Declaring an installable package and its dependencies	`pip install -e .`	6
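
The `python -m venv .venv` entry point also has a standard-library equivalent, which can be convenient inside a setup script. The following is a minimal sketch, assuming a throwaway temporary directory instead of a real project; the helper name `make_env` is hypothetical and not part of the book's code.

```python
import tempfile
import venv
from pathlib import Path


def make_env(project_dir: Path) -> Path:
    """Create an isolated environment at <project_dir>/.venv and return its path."""
    env_dir = project_dir / ".venv"
    # with_pip=False keeps the sketch fast. The command-line form installs pip
    # by default, so pass with_pip=True to mirror `python -m venv .venv`.
    venv.create(env_dir, with_pip=False)
    return env_dir


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        env_dir = make_env(Path(tmp))
        # Every venv records its base interpreter in pyvenv.cfg.
        print((env_dir / "pyvenv.cfg").read_text())
```

For day-to-day work the command-line form in the table is usually all you need; the programmatic form is mainly useful when environment creation is one step in a larger script.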

A.3 Code quality

Tool	What it’s for	Where to start	Ch.
ruff	Fast linting and formatting in one tool	`ruff check .`, `ruff format .`	5
black	Opinionated code formatting	`black .`	5
mypy	Static type checking from your type hints	`mypy src/`	5
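
Static type checking only pays off once functions carry type hints. As a rough illustration of the kind of code `mypy src/` can check, here is a small annotated function; the name `mean_abs_error` and its arguments are hypothetical examples, not taken from the book.

```python
from statistics import fmean
from typing import Sequence


def mean_abs_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean absolute error between two equal-length sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return fmean(abs(t - p) for t, p in zip(y_true, y_pred))


if __name__ == "__main__":
    print(mean_abs_error([1.0, 2.0], [2.0, 4.0]))
    # A call such as mean_abs_error(1.0, 2.0) fails at runtime with a
    # TypeError; a type checker reports the mismatch before the code runs.
```

The hints cost little to write and double as documentation for the next reader.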

A.4 Testing, debugging, and profiling

Tool	What it’s for	Where to start	Ch.
pytest	Running tests; the standard test runner	`pytest`	7
Hypothesis	Property-based testing (generates inputs for you)	`@given(...)`	7
pdb	Interactive debugging at a breakpoint	`breakpoint()`	8
logging	Levelled, structured diagnostics that scale past `print`	`logging.getLogger(__name__)`	8
cProfile / line_profiler	Finding where the time actually goes	`python -m cProfile -s cumtime script.py`	8
memray / scalene	Finding where the memory goes	`memray run script.py`, `scalene script.py`	8
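
Two of the entry points above come from the standard library and work together well: a module-level logger from `logging.getLogger(__name__)`, and `cProfile` driven from code instead of the command line. The sketch below assumes a deliberately naive function, `slow_sum`, invented here so the profiler has something to measure; `profile_call` is likewise a hypothetical helper.

```python
import cProfile
import io
import logging
import pstats

logger = logging.getLogger(__name__)


def slow_sum(n: int) -> int:
    """Sum of squares below n, written as a plain loop on purpose."""
    total = 0
    for i in range(n):
        total += i * i
    logger.debug("slow_sum(%d) -> %d", n, total)
    return total


def profile_call(n: int) -> str:
    """Profile one call to slow_sum and return the stats report as text."""
    profiler = cProfile.Profile()
    profiler.enable()
    slow_sum(n)
    profiler.disable()
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Same ordering as `python -m cProfile -s cumtime script.py`.
    stats.sort_stats("cumtime").print_stats(5)
    return buffer.getvalue()


if __name__ == "__main__":
    logging.basicConfig(
        level=logging.DEBUG,
        format="%(levelname)s %(name)s: %(message)s",
    )
    print(profile_call(100_000))
```

Profiling from code like this is handy when only one call inside a long script is of interest; for a first look at a whole script, the command-line form in the table is simpler.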

A.5 Project structure and pipelines

Tool	What it’s for	Where to start	Ch.
cookiecutter	Scaffolding a standard project layout	`cookiecutter <template>`	9
make	A simple task runner and dependency graph	a `Makefile` with named targets	4, 10
Snakemake / Prefect / Dagster	Orchestrating complex, scheduled pipelines with retries	declare stages and dependencies	10
pandera / Great Expectations	Validating a DataFrame against a declared schema	`schema.validate(df)`	10

A.6 Configuration, secrets, and APIs

Tool	What it’s for	Where to start	Ch.
pydantic	Typed, validated configuration and data models	`class Config(BaseModel): ...`	11, 12
PyYAML	Reading and writing YAML configuration files	`yaml.safe_load(...)`	11
python-dotenv	Loading secrets from a local, untracked `.env`	`load_dotenv()`	11
Hydra	Composing and sweeping over experiment configurations	`@hydra.main(...)`	11
FastAPI	Serving a model behind a typed HTTP API	`@app.post("/predict")`	12
uvicorn	The server that runs a FastAPI app	`uvicorn main:app`	12
httpx	Making HTTP requests; backs FastAPI’s `TestClient`	`httpx.post(...)`	12
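
The idea behind typed, validated configuration is to declare the fields once and reject bad values when the configuration is loaded, not deep inside a training run. The sketch below shows that idea with a standard-library dataclass as a stand-in; it is not pydantic, which derives comparable type checks from the annotations themselves. The field names `learning_rate` and `n_estimators` and the helper `load_config` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    """Hypothetical training settings; the field names are illustrative only."""

    learning_rate: float = 0.1
    n_estimators: int = 100

    def __post_init__(self) -> None:
        # Dataclasses do not enforce annotations at runtime, so check by hand.
        if not isinstance(self.learning_rate, float):
            raise TypeError("learning_rate must be a float")
        if isinstance(self.n_estimators, bool) or not isinstance(self.n_estimators, int):
            raise TypeError("n_estimators must be an int")
        if self.learning_rate <= 0 or self.n_estimators <= 0:
            raise ValueError("learning_rate and n_estimators must be positive")


def load_config(raw: dict) -> Config:
    """Build a Config from already-parsed data, such as a loaded YAML mapping."""
    # Unknown keys raise TypeError here, so typos in a config file surface early.
    return Config(**raw)


if __name__ == "__main__":
    print(load_config({"learning_rate": 0.05, "n_estimators": 50}))
```

Whichever library does the checking, the design choice is the same: fail at load time with a clear message.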

A.7 Operations: containers, CI/CD, deployment, monitoring

Tool	What it’s for	Where to start	Ch.
Docker	Packaging the whole environment as a portable image	a `Dockerfile`, `docker build`	14
Docker Compose	Running several services together as one stack	`compose.yaml`, `docker compose up`	14
GitHub Actions	Running tests and checks automatically on every change	`.github/workflows/ci.yml`	13
pre-commit	Fast local checks before a commit is recorded	`.pre-commit-config.yaml`, `pre-commit install`	13
cron / Airflow	Scheduling batch jobs (simple to complex)	a `cron` entry, or an Airflow DAG	15
Prometheus / Grafana	Collecting and dashboarding service metrics	scrape `/metrics`, build a dashboard	16
Evidently	Off-the-shelf data and prediction drift reports	compare a reference to live data	16

A.8 Documentation, data, and model versioning

Tool	What it’s for	Where to start	Ch.
MkDocs / Sphinx / Quarto	Generating documentation from your code and prose	`mkdocs serve` / `quarto render`	18
DVC	Versioning large data and models alongside Git	`dvc add data/raw/...`	22
MLflow	Tracking experiments and registering model versions	`mlflow.log_metric(...)`, model aliases (`models:/name@champion`)	23
joblib	Serialising a trained model to a portable artefact	`joblib.dump(model, path)`	15, 21

A closing note in the spirit of the book: this list is deliberately a menu, not a checklist. A throwaway analysis needs almost none of it; a model that real users depend on eventually touches most of it. The skill is choosing the smallest set of tools that makes a given piece of work reliable enough for what it has to do — and reaching for the next one only when the problem it solves is the problem you actually have.

Content: CC BY-NC-SA 4.0 | Code: MIT - see /LICENSE.md