13 Continuous integration

13.1 The change that broke everything

A colleague refactors a shared cleaning function to handle a new data source. Their notebook runs, the numbers look right, they merge. Three days later your model retrains on a schedule and the accuracy has collapsed, because the refactor changed a default that your features quietly depended on. Nobody re-ran your code when theirs changed; nothing was in place to do it. By the time anyone notices, the change is buried under a week of other work and the bad model has been serving predictions the whole time.

The gap here isn’t a missing test — you might both have had tests — it’s that nothing automatically ran the checks when the code changed. Continuous integration closes that gap. It is the practice of running your checks automatically on every change, on a clean machine, before the change is allowed to merge, so that “did this break anything?” is answered in minutes by a machine rather than in days by a production incident.

13.2 What continuous integration is

Mechanically, CI is simple: a service watches your repository, and every time someone pushes, it spins up a fresh machine, installs your locked dependencies, runs your checks — tests, linters, type checks — and reports a single pass or fail status on that change. The whole thing rests on the exit-code contract from Chapter 4: each check returns zero for success or non-zero for failure, and CI reads those codes to decide whether the change is green or red.

import subprocess
import sys
import tempfile
from pathlib import Path

# CI runs your test suite and reads its exit code. Here are the two
# outcomes it acts on, produced by running pytest on two tiny test files.
work = Path(tempfile.mkdtemp())
(work / "test_pass.py").write_text("def test_ok(): assert 1 + 1 == 2\n")
(work / "test_fail.py").write_text("def test_broken(): assert 1 + 1 == 3\n")

for name in ("test_pass.py", "test_fail.py"):
    result = subprocess.run([sys.executable, "-m", "pytest", "-q", str(work / name)],
                            capture_output=True, text=True)
    verdict = "green (merge allowed)" if result.returncode == 0 else "red (merge blocked)"
    print(f"{name:14} exit code {result.returncode} -> {verdict}")

test_pass.py   exit code 0 -> green (merge allowed)
test_fail.py   exit code 1 -> red (merge blocked)

The passing test exits zero and the failing one exits non-zero, and that single bit — green or red — is the entire basis of the gate. Crucially, this runs on a clean machine with freshly installed locked dependencies, which is what catches the “works on my machine” failures from Chapter 3: if a change accidentally relies on something only your laptop has, CI fails where you didn’t.

Data Science Bridge

CI is the reflex of re-running your evaluation suite whenever something changes, automated and made mandatory. You already know never to trust a model after you’ve altered the features without re-checking it on the holdout — the change invalidates the previous verdict, so you re-evaluate. CI applies exactly that reflex to code: a change invalidates the previous “it works”, so the checks re-run before anyone relies on the result. It’s the same instinct — changed it, re-verify it — enforced by a machine so it can’t be forgotten under deadline pressure.

Where the analogy breaks down: re-running an evaluation gives you a graded score that you interpret with judgement (0.82 — good enough?). A CI run gives a binary gate that blocks the merge — there’s no “82% of the tests passed, ship it”. Same trigger, different kind of verdict: a number to weigh versus a door that’s open or shut.

13.3 A workflow, concretely

On GitHub Actions, one of the most widely used CI platforms, a CI pipeline is a YAML file in .github/workflows/. It reads almost like a description of what you’d do by hand: on every push or pull request, check out the code, set up Python, install the locked dependencies, and run the checks.

# .github/workflows/ci.yml
name: CI
on: [push, pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v6
      - uses: actions/setup-python@v6
        with:
          python-version: "3.12"
          cache: pip
      - run: pip install -r requirements.txt
      - run: ruff check .
      - run: pytest

Each run step is a command whose exit code becomes part of the verdict; if ruff or pytest fails, the job goes red and, if you enable branch protection, the merge is blocked. Branch protection is a setting on the hosting platform (GitHub, GitLab) that makes rules about a branch enforceable rather than advisory: no pushing directly to main, no merging until the named CI checks pass, and optionally no merging without a review. Without it, CI is a report; with it, CI is a gate. The cache: pip line reuses pip’s downloaded packages between runs, so the install step skips most of the downloading and the pipeline stays fast. A matrix (not shown) can run the same steps across several Python versions at once, catching a version-specific break before a user does.
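To make the verdict concrete, here is a minimal sketch of how a runner could combine step exit codes. It is not how GitHub Actions is implemented: run_steps and the toy commands are hypothetical, chosen so the example runs anywhere Python does. It assumes the default behaviour described above, where steps run in order and the first non-zero exit fails the job, so later steps are skipped.

```python
import subprocess
import sys


def run_steps(steps):
    """Run each (name, command) step in order; stop at the first non-zero exit.

    Returns (job_passed, results), where results lists (name, exit_code)
    for every step that actually ran.
    """
    results = []
    for name, command in steps:
        completed = subprocess.run(command, capture_output=True, text=True)
        results.append((name, completed.returncode))
        if completed.returncode != 0:
            return False, results  # job is red; remaining steps are skipped
    return True, results


# Toy stand-ins for real checks: each is just a Python process that exits
# with a chosen code, so the sketch needs neither ruff nor pytest installed.
steps = [
    ("lint", [sys.executable, "-c", "raise SystemExit(0)"]),
    ("tests", [sys.executable, "-c", "raise SystemExit(1)"]),
    ("package", [sys.executable, "-c", "raise SystemExit(0)"]),
]

job_passed, results = run_steps(steps)
for name, code in results:
    print(f"{name}: exit code {code}")
print("job:", "green" if job_passed else "red")
```

The third step never runs, which is why a failing lint step in a real workflow can hide a failing test until the lint is fixed.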

13.4 Catching it earlier with pre-commit

CI is the authoritative, shared gate, but waiting for a remote run to tell you about a formatting slip is slow. Pre-commit hooks run fast checks locally, before a commit is even recorded, so the cheap problems never make it as far as CI. A small .pre-commit-config.yaml wires up the formatter, the linter, and a few hygiene checks:

# .pre-commit-config.yaml
repos:
  - repo: https://github.com/astral-sh/ruff-pre-commit
    rev: v0.15.22
    hooks:
      - id: ruff-check    # lint
      - id: ruff-format    # format
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v6.0.0
    hooks:
      - id: check-added-large-files   # stop a stray dataset entering Git
      - id: detect-private-key        # stop a secret being committed

The division of labour is the point: pre-commit is the fast local gate that keeps trivial issues out of the history (including a large dataset or a private key — the mistakes from Chapter 2 and Chapter 11), and CI is the slower, authoritative gate that the whole team’s changes must pass. Together they mean a problem is caught at the cheapest possible moment.
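A hook is nothing more exotic than a small program that receives the staged file names as arguments and exits non-zero to stop the commit. The sketch below is a simplified stand-in for a large-file check, not the code of the real check-added-large-files hook; find_large_files, main, and the size limit are all illustrative assumptions.

```python
import tempfile
from pathlib import Path

MAX_BYTES = 500 * 1024  # illustrative limit, not taken from the real hook


def find_large_files(paths, max_bytes=MAX_BYTES):
    """Return the paths whose size on disk exceeds max_bytes."""
    return [p for p in paths
            if Path(p).is_file() and Path(p).stat().st_size > max_bytes]


def main(argv):
    """Hook entry point: file names in, exit code out."""
    too_large = find_large_files(argv)
    for path in too_large:
        print(f"{path}: larger than {MAX_BYTES} bytes, refusing to commit")
    return 1 if too_large else 0


# Demonstration on two temporary files rather than a real Git index.
work = Path(tempfile.mkdtemp())
small = work / "notes.txt"
large = work / "dataset.csv"
small.write_text("a few lines of notes\n")
large.write_bytes(b"0" * (MAX_BYTES + 1))

exit_code_clean = main([str(small)])
exit_code_blocked = main([str(small), str(large)])
print("small file only ->", exit_code_clean)
print("with large file ->", exit_code_blocked)
```

In a real hook the return value of main would be passed to sys.exit, which is the same exit-code contract CI relies on, applied one step earlier.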

13.5 When the work is too slow to run on every push

Here the standard CI advice collides with the reality of data science work, and it’s the objection most readers will have already formed: the test suite for a web service runs in ninety seconds, but your pipeline trains for six hours on a dataset that doesn’t fit on the runner and isn’t allowed to leave the warehouse. “Run the tests on every push” sounds like advice written by someone who has never trained a model.

The resolution is to stop thinking of CI as “run the pipeline” and start thinking of it as “run the checks whose verdict you need before merging”. Those are very different sets, and almost nothing in the slow one belongs on the fast path.

Separate the logic from the volume. Most of what can break in a training pipeline is not the training — it’s the feature engineering, the joins, the encoding, the split logic, the serialisation. All of that is testable on a few hundred synthetic or sampled rows in seconds, and a test that add_features handles a zero denominator does not become more true for having run on fifty million rows. Commit a small fixture dataset to the repository (small enough to be genuinely committable — this is one of the few dataset exceptions to Chapter 2) and let the fast suite run against it on every push.
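As an illustration of a logic test on a handful of rows, here is a minimal sketch. The text names add_features but does not define it, so the version below is an assumption: it computes a hypothetical spend-per-visit ratio and returns NaN when the denominator is zero. The fixture is inlined to keep the example self-contained; in a repository it would more likely be a small file under tests/.

```python
import math


def add_features(rows):
    """Hypothetical feature step: add a spend-per-visit ratio to each row.

    A zero visit count yields NaN rather than raising ZeroDivisionError.
    """
    enriched = []
    for row in rows:
        visits = row["visits"]
        ratio = row["spend"] / visits if visits else math.nan
        enriched.append({**row, "spend_per_visit": ratio})
    return enriched


# Three rows are enough to cover the ordinary case and the edge case.
FIXTURE = [
    {"customer": "a", "spend": 120.0, "visits": 4},
    {"customer": "b", "spend": 0.0, "visits": 0},   # the edge case
    {"customer": "c", "spend": 45.0, "visits": 3},
]


def test_add_features_handles_zero_denominator():
    result = add_features(FIXTURE)
    assert len(result) == len(FIXTURE)               # no rows dropped
    assert result[0]["spend_per_visit"] == 30.0      # ordinary row
    assert math.isnan(result[1]["spend_per_visit"])  # zero visits -> NaN


test_add_features_handles_zero_denominator()
print("zero-denominator test passed on", len(FIXTURE), "rows")
```

Under pytest the explicit call at the bottom would be dropped; the test function is discovered by name.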

Test that the model trains, not that it trains well. A “smoke test” fits the real pipeline on the tiny fixture for one iteration, or a handful of trees, and asserts only that it completes, produces an artefact of the right shape, and that the artefact survives a save/load round-trip. That catches the overwhelming majority of real breakages — a renamed column, an incompatible sklearn version, a shape mismatch — in under a minute. It deliberately does not check accuracy, for the reason Chapter 7 gave: a metric threshold is not a pass/fail property, and wiring one into CI produces a gate that fails for reasons nobody can act on.
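A smoke test of this kind might look like the sketch below. To stay dependency-free it uses a toy linear model fitted by one pass of gradient descent as a stand-in for the real pipeline, and JSON as a stand-in for whatever serialisation the project uses; train, predict, save_model, and load_model are hypothetical names. The three assertions mirror the three properties above: training completes, the artefact has the right shape, and it survives a save/load round-trip.

```python
import json
import tempfile
from pathlib import Path


def predict(model, x):
    return model["bias"] + sum(w * xi for w, xi in zip(model["weights"], x))


def train(X, y, epochs=1, learning_rate=0.01):
    """Stand-in for the real pipeline: linear regression fitted by SGD."""
    model = {"weights": [0.0] * len(X[0]), "bias": 0.0}
    for _ in range(epochs):
        for x, target in zip(X, y):
            error = predict(model, x) - target
            model["weights"] = [w - learning_rate * error * xi
                                for w, xi in zip(model["weights"], x)]
            model["bias"] -= learning_rate * error
    return model


def save_model(model, path):
    Path(path).write_text(json.dumps(model))


def load_model(path):
    return json.loads(Path(path).read_text())


def test_training_smoke():
    # Tiny fixture: four rows, two features.
    X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
    y = [1.0, 2.0, 3.0, 4.0]

    model = train(X, y, epochs=1)        # 1. training completes
    assert len(model["weights"]) == 2    # 2. artefact has the right shape

    path = Path(tempfile.mkdtemp()) / "model.json"
    save_model(model, path)
    restored = load_model(path)
    # 3. the restored artefact predicts exactly as the original did
    assert [predict(restored, x) for x in X] == [predict(model, x) for x in X]


test_training_smoke()
print("smoke test passed")
```

Nothing here asserts that the predictions are any good, which is deliberate: one epoch on four rows says nothing about quality, only that the pipeline still holds together.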

Tier the rest by cost. Anything genuinely expensive moves off the per-push path and onto a schedule or an explicit trigger:

on:
  push:                          # fast: lint, types, unit tests, smoke train
  schedule:
    - cron: "0 3 * * *"          # nightly: full training on real data, metrics logged
  workflow_dispatch:             # on demand: kick off the expensive run manually

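The nightly job trains properly, records its metrics somewhere you can track over time, and alerts on a genuine regression, but it never blocks a merge, because a six-hour feedback loop is not a gate, it’s a report. Note that the triggers above only decide when a workflow starts (the cron schedule is evaluated in UTC); to keep full training off the per-push path, put it in its own workflow file or guard the job with a condition on github.event_name.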
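What “records its metrics and alerts on a regression” could mean in code is sketched below, under stated assumptions: the history is a JSON Lines file, the metric is accuracy, and a regression is defined as falling more than a tolerance below the median of recent runs. The function names, the window, and the tolerance are all illustrative. The function returns a message for a human to read; it does not gate anything.

```python
import json
import statistics
import tempfile
from pathlib import Path


def record_metric(history_path, run_id, accuracy):
    """Append one run's metric to a JSON Lines history file."""
    with open(history_path, "a", encoding="utf-8") as handle:
        handle.write(json.dumps({"run": run_id, "accuracy": accuracy}) + "\n")


def load_history(history_path):
    with open(history_path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


def regression_report(history, window=5, tolerance=0.03):
    """Compare the latest run with the median of the preceding `window` runs.

    Returns a message if the latest accuracy is more than `tolerance` below
    that baseline, else None. The defaults are illustrative, not recommended.
    """
    if len(history) < 2:
        return None
    *earlier, latest = history
    baseline = statistics.median(r["accuracy"] for r in earlier[-window:])
    if latest["accuracy"] < baseline - tolerance:
        return (f"run {latest['run']}: accuracy {latest['accuracy']:.3f} "
                f"vs baseline {baseline:.3f}")
    return None


history_path = Path(tempfile.mkdtemp()) / "metrics.jsonl"
for night, accuracy in enumerate([0.82, 0.83, 0.81, 0.82], start=1):
    record_metric(history_path, f"night-{night}", accuracy)
quiet = regression_report(load_history(history_path))

record_metric(history_path, "night-5", 0.74)
alert = regression_report(load_history(history_path))
print("after four normal nights:", quiet)
print("after the bad night:", alert)
```

Comparing against a rolling baseline rather than a fixed threshold keeps ordinary run-to-run noise from raising alerts, at the cost of being slow to notice a gradual decline.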

And where the data can’t move, bring CI to the data. If the dataset can’t leave your infrastructure for governance reasons, the answer is a self-hosted runner: the same workflow file, executed on a machine inside your network with access to the warehouse and the GPUs. Nothing about the CI concepts changes; only where the job runs does.

The principle underneath all of this is worth stating plainly, because it generalises past CI: a check belongs on the fast path if its verdict would change your decision to merge, and if you’d actually wait for it. A check that takes six hours fails the second test regardless of how valuable it is — people merge anyway, and a gate everyone routes around is worse than no gate, because it manufactures the appearance of rigour. Put those checks on a schedule where their slowness costs nothing, and keep the merge gate fast enough that nobody resents it.
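The two-part rule can be written down as a small sorting function, sketched here with made-up timings. assign_tier, the ten-minute patience budget, and the eight-hour cut-off between nightly and on-demand are assumptions for illustration; the right numbers depend on your team.

```python
def assign_tier(seconds, affects_merge, patience_seconds=600,
                nightly_seconds=8 * 3600):
    """Place a check using the two-part rule.

    A check gates the merge only if its verdict matters to the merge AND
    it is fast enough that people will actually wait for it.
    """
    if affects_merge and seconds <= patience_seconds:
        return "per-push"
    if seconds <= nightly_seconds:
        return "nightly"
    return "on-demand"


# (name, measured duration in seconds, would a failure change the merge?)
checks = [
    ("ruff check", 5, True),
    ("unit tests on fixture", 40, True),
    ("smoke train", 55, True),
    ("full training", 6 * 3600, True),
    ("hyperparameter sweep", 30 * 3600, False),
]

tiers = {name: assign_tier(seconds, affects_merge)
         for name, seconds, affects_merge in checks}
for name, tier in tiers.items():
    print(f"{name:24} -> {tier}")
```

Full training would change the decision to merge, yet it still lands on the nightly tier, because it fails the second half of the rule.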

Author’s Note

Data science has no “build”. There’s no compile step that fails, no moment built into the workflow where a machine checks your work before it counts — a notebook either runs or it doesn’t, and “runs” is not the same as “correct”. So the very idea of an automatic, mandatory gate on every change is foreign, and it’s easy to experience CI as bureaucracy: ceremony and waiting imposed on work that felt fine without it.

The reframe is that CI is not there to slow you down; it’s the thing that lets you change code without fear. The reason software engineers refactor aggressively — rename things, restructure, delete dead code — is that the test suite runs automatically on every change and tells them within minutes if they broke something. That safety net converts “I’m afraid to touch this notebook in case it breaks something downstream” into “change it; if it breaks, the checks will tell me before it reaches anyone”. For a data science codebase, where so much is interconnected and under-tested, that fearlessness is the difference between a project that keeps improving and one that calcifies because nobody dares to touch it.

13.6 Summary

Continuous integration makes “did this break anything?” a question a machine answers in minutes:

1. CI runs your checks automatically on every change. A fresh machine installs the locked dependencies and runs tests, linters, and type checks, reporting one pass/fail verdict, and the clean environment catches “works on my machine”.
2. It rests on exit codes. Each check returns zero or non-zero; CI reads that to set the gate green or red, the same contract the command line uses.
3. A workflow is a short YAML file. On push, check out, set up Python, install dependencies, run the checks, with caching for speed and a matrix for multiple versions.
4. Pre-commit catches the cheap problems earlier. Fast local hooks keep formatting slips, large files, and secrets out of the history before CI ever runs.
5. Slow work gets tiered, not skipped. Test the pipeline’s logic on a small fixture and smoke-test that training completes; move full training onto a nightly schedule. A check belongs on the merge gate only if you’d actually wait for it.

The next chapter packages the code CI has verified so it runs identically everywhere: containerisation.

13.7 Exercises

1. Add a minimal CI workflow to one of your own repositories: on every push, install the locked dependencies and run your test suite. Then push a change that deliberately breaks a test and confirm the status goes red before you would have noticed the problem any other way.
2. Add a lint and format gate to the workflow (ruff check and ruff format --check). On its first run, what did it flag, and how much was a real defect versus pure style?
3. Set up pre-commit with a few fast hooks: a formatter, and a check for large files or private keys. Try to commit a deliberately mis-formatted file (or a stray large file) and confirm the hook stops the commit.
4. Conceptual: The Data Science Bridge pairs “changed the features, re-run the holdout” with “changed the code, re-run the checks”. But you re-run an evaluation for reasons that have nothing to do with code. Name a change that would invalidate your model’s previous verdict while leaving CI permanently green, and say what you would need beyond CI to notice it.
5. Apply the tiering to a real pipeline of your own. Write down every check you’d want run against it, then time each one and sort them into per-push, nightly, and on-demand. Where did you disagree with the chapter’s placement, and what about your situation justifies the difference? Then build the fixture: extract a small enough slice of your data to commit, and get one smoke test running on it in under a minute.

Content: CC BY-NC-SA 4.0 | Code: MIT - see /LICENSE.md