13  Continuous integration

13.1 The change that broke everything

A colleague refactors a shared cleaning function to handle a new data source. Their notebook runs, the numbers look right, they merge. Three days later your model retrains on a schedule and the accuracy has collapsed, because the refactor changed a default that your features quietly depended on. Nobody re-ran your code when theirs changed, because nothing was set up to re-run it. By the time anyone notices, the change is buried under a week of other work and the bad model has been serving predictions the whole time.

The gap here isn’t a missing test — you might both have had tests — it’s that nothing automatically ran the checks when the code changed. Continuous integration closes that gap. It is the practice of running your checks automatically on every change, on a clean machine, before the change is allowed to merge, so that “did this break anything?” is answered in minutes by a machine rather than in days by a production incident.

13.2 What continuous integration is

Mechanically, CI is simple: a service watches your repository, and every time someone pushes, it spins up a fresh machine, installs your locked dependencies, runs your checks — tests, linters, type checks — and reports a single pass or fail status on that change. The whole thing rests on the exit-code contract from Chapter 4: each check returns zero for success or non-zero for failure, and CI reads those codes to decide whether the change is green or red.

import subprocess
import sys
import tempfile
from pathlib import Path

# CI runs your test suite and reads its exit code. Here are the two
# outcomes it acts on, produced by running pytest on a tiny test file.
work = Path(tempfile.mkdtemp())
(work / "test_pass.py").write_text("def test_ok(): assert 1 + 1 == 2\n")
(work / "test_fail.py").write_text("def test_broken(): assert 1 + 1 == 3\n")

for name in ("test_pass.py", "test_fail.py"):
    result = subprocess.run([sys.executable, "-m", "pytest", "-q", str(work / name)],
                            capture_output=True, text=True)
    verdict = "green (merge allowed)" if result.returncode == 0 else "red (merge blocked)"
    print(f"{name:14} exit code {result.returncode} -> {verdict}")

test_pass.py   exit code 0 -> green (merge allowed)
test_fail.py   exit code 1 -> red (merge blocked)

The passing test exits zero and the failing one exits non-zero, and that single bit — green or red — is the entire basis of the gate. Crucially, this runs on a clean machine with freshly installed locked dependencies, which is what catches the “works on my machine” failures from Chapter 3: if a change accidentally relies on something only your laptop has, CI fails where you didn’t.
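A pipeline usually runs several checks, not one, so it helps to see how their exit codes might combine into a single verdict. The following is a minimal sketch, not how any particular CI service is implemented: the check names and the `run_gate` helper are hypothetical, and each "check" is a stand-in command that simply exits with a chosen code. It assumes the gate stops at the first failing check, which is a common default for the steps of a job.

```python
import subprocess
import sys

# Hypothetical stand-ins for real checks. Each is just a command that exits
# with a chosen code; in a real pipeline these would be a linter, a type
# checker, and a test runner.
CHECKS = {
    "lint": [sys.executable, "-c", "raise SystemExit(0)"],
    "types": [sys.executable, "-c", "raise SystemExit(0)"],
    "tests": [sys.executable, "-c", "raise SystemExit(1)"],
}


def run_gate(checks):
    """Run each check in order and stop at the first non-zero exit code."""
    for name, command in checks.items():
        code = subprocess.run(command).returncode
        print(f"{name:6} exit code {code}")
        if code != 0:
            return code  # red: propagate the failing check's code
    return 0  # green: every check exited zero


verdict = run_gate(CHECKS)
print("green" if verdict == 0 else "red")
```

The gate never inspects what a check printed. It only reads the exit code, which is why any tool that honours the contract can be dropped into a pipeline unchanged.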

Note: Data Science Bridge

CI is the reflex of re-running your evaluation suite whenever something changes, automated and made mandatory. You already know never to trust a model after you’ve altered the features without re-checking it on the holdout — the change invalidates the previous verdict, so you re-evaluate. CI applies exactly that reflex to code: a change invalidates the previous “it works”, so the checks re-run before anyone relies on the result. It’s the same instinct — changed it, re-verify it — enforced by a machine so it can’t be forgotten under deadline pressure.

Where the analogy breaks down: re-running an evaluation gives you a graded score that you interpret with judgement (0.82 — good enough?). A CI run gives a binary gate that blocks the merge — there’s no “82% of the tests passed, ship it”. Same trigger, different kind of verdict: a number to weigh versus a door that’s open or shut.
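The contrast between the two verdicts can be made concrete with a few lines of Python. This is an illustrative sketch only: the `results` list is made-up data standing in for the outcome of a ten-test suite.

```python
# Hypothetical outcome of a ten-test suite: True means the test passed.
results = [True] * 9 + [False]

# An evaluation-style verdict: a graded score that someone has to interpret.
pass_rate = sum(results) / len(results)

# A CI-style verdict: the gate opens only if every single check passed.
gate_open = all(results)

print(f"pass rate {pass_rate:.0%}")
print("green" if gate_open else "red")
```

A 90% pass rate would be a respectable score for a model, but the gate computed from the same results is red.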

13.3 A workflow, concretely

On GitHub Actions, one of the most widely used platforms, a CI pipeline is a YAML file in .github/workflows/. It reads almost like a description of what you’d do by hand: on every push or pull request, check out the code, set up Python, install the locked dependencies, and run the checks.

# .github/workflows/ci.yml
name: CI
on: [push, pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
          cache: pip
      - run: pip install -r requirements.txt
      - run: ruff check .
      - run: pytest

Each run step is a command whose exit code becomes part of the verdict; if ruff or pytest fails, the job goes red and, if the branch is protected with this job as a required status check, the merge is blocked. The cache: pip line saves pip’s download cache between runs, so the install step spends less time fetching packages and the pipeline stays fast. A matrix (not shown) can run the same steps across several Python versions at once, catching a version-specific break before a user does.
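One way to picture a matrix is as a fan-out: the service takes every combination of the values you list and runs the same steps once per combination. The sketch below mimics that expansion in plain Python. It is an illustration of the idea, not the platform's implementation; the `expand` helper is hypothetical, and the second axis (operating system) is an assumption added to show how combinations multiply.

```python
from itertools import product

# Hypothetical matrix: axis names and values are illustrative only.
matrix = {
    "python-version": ["3.10", "3.11", "3.12"],
    "os": ["ubuntu-latest", "windows-latest"],
}


def expand(matrix):
    """Return one job configuration per combination of axis values."""
    axes = list(matrix)
    return [dict(zip(axes, combo)) for combo in product(*matrix.values())]


jobs = expand(matrix)
for job in jobs:
    print(job)
print(f"{len(jobs)} jobs")
```

Three Python versions across two operating systems gives six jobs, each with its own exit code, and the change is green only if all of them are.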

13.4 Catching it earlier with pre-commit

CI is the authoritative, shared gate, but waiting for a remote run to tell you about a formatting slip is slow. Pre-commit hooks run fast checks locally, before a commit is even recorded, so the cheap problems never make it as far as CI. A small .pre-commit-config.yaml wires up the formatter, the linter, and a few hygiene checks:

# .pre-commit-config.yaml
repos:
  - repo: https://github.com/astral-sh/ruff-pre-commit
    rev: v0.6.0
    hooks:
      - id: ruff           # lint
      - id: ruff-format    # format
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.6.0
    hooks:
      - id: check-added-large-files   # stop a stray dataset entering Git
      - id: detect-private-key        # stop a secret being committed

The division of labour is the point: pre-commit is the fast local gate that keeps trivial issues out of the history (including a large dataset or a private key — the mistakes from Chapters 2 and 11), and CI is the slower, authoritative gate that the whole team’s changes must pass. Together they mean a problem is caught at the cheapest possible moment.
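A hook is nothing exotic: it is a small program that receives the files being committed and follows the same exit-code contract as everything else in this chapter. The sketch below shows the idea behind a large-file check. It is not the code of the real check-added-large-files hook; the `oversized` and `main` functions and the 500 KB threshold are assumptions chosen for illustration.

```python
import tempfile
from pathlib import Path

MAX_KB = 500  # illustrative threshold, chosen for this sketch


def oversized(paths, max_kb=MAX_KB):
    """Return the paths whose size on disk exceeds max_kb kilobytes."""
    return [p for p in paths if Path(p).stat().st_size > max_kb * 1024]


def main(paths):
    """Exit-code contract: 0 lets the commit proceed, 1 blocks it."""
    too_big = oversized(paths)
    for p in too_big:
        print(f"{p}: larger than {MAX_KB} KB")
    return 1 if too_big else 0


# Demonstration on two temporary files standing in for staged files.
work = Path(tempfile.mkdtemp())
small = work / "notes.txt"
small.write_text("hello\n")
large = work / "data.csv"
large.write_bytes(b"0" * (600 * 1024))

print("exit code", main([small, large]))
```

Because the check only reads file sizes, it finishes almost instantly, which is what makes it suitable for the local gate.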

Tip: Author’s Note

Data science has no “build”. There’s no compile step that fails, no moment built into the workflow where a machine checks your work before it counts — a notebook either runs or it doesn’t, and “runs” is not the same as “correct”. So the very idea of an automatic, mandatory gate on every change is foreign, and it’s easy to experience CI as bureaucracy: ceremony and waiting imposed on work that felt fine without it.

The reframe is that CI is not there to slow you down; it’s the thing that lets you change code without fear. The reason software engineers refactor aggressively — rename things, restructure, delete dead code — is that the test suite runs automatically on every change and tells them within minutes if they broke something. That safety net converts “I’m afraid to touch this notebook in case it breaks something downstream” into “change it; if it breaks, the checks will tell me before it reaches anyone”. For a data science codebase, where so much is interconnected and under-tested, that fearlessness is the difference between a project that keeps improving and one that calcifies because nobody dares to touch it.

13.5 Summary

Continuous integration makes “did this break anything?” a question a machine answers in minutes:

  1. CI runs your checks automatically on every change. A fresh machine installs the locked dependencies and runs tests, linters, and type checks, reporting one pass/fail verdict — and the clean environment catches “works on my machine”.

  2. It rests on exit codes. Each check returns zero or non-zero; CI reads that to set the gate green or red, the same contract the command line uses.

  3. A workflow is a short YAML file. On push, check out, set up Python, install dependencies, run the checks — with caching for speed and a matrix for multiple versions.

  4. Pre-commit catches the cheap problems earlier. Fast local hooks keep formatting slips, large files, and secrets out of the history before CI ever runs.

The next chapter packages the code CI has verified so it runs identically everywhere: containerisation.

13.6 Exercises

  1. Add a minimal CI workflow to one of your own repositories: on every push, install the locked dependencies and run your test suite. Then push a change that deliberately breaks a test and confirm the status goes red before you would have noticed the problem any other way.

  2. Add a lint and format gate to the workflow (ruff check and ruff format --check). On its first run, what did it flag, and how much was a real defect versus pure style?

  3. Set up pre-commit with a few fast hooks — a formatter, and a check for large files or private keys. Try to commit a deliberately mis-formatted file (or a stray large file) and confirm the hook stops the commit.

  4. Conceptual: The Data Science Bridge compares CI to re-running your evaluation suite on every change. Give one way the analogy holds and one way it breaks down. How does the verdict CI produces differ from the one a model evaluation produces?

  5. Conceptual: You can’t run a six-hour model training on every push. Describe what should run on every change versus what belongs in an occasional or nightly job, and state the principle that decides which checks go where.